
Proceedings of the 2007 International Conference on the World Wide Web

Fullname: Proceedings of the 16th International Conference on World Wide Web
Editors: Peter F. Patel-Schneider; Prashant Shenoy
Location: Banff, Alberta, Canada
Dates: 2007-May-08 to 2007-May-12
Standard No: ISBN 978-1-59593-654-7; 1-59593-654-8
Links: Conference Home Page
  1. Personalization
  2. Smarter browsing
  3. Data mining
  4. Mining textual data
  5. Similarity search
  6. Predictive modeling of web users
  7. Mining in social networks
  8. E-communities
  9. E-commerce and e-content
  10. Industrial practice & experience
  11. Scalable systems for dynamic content
  12. Performance engineering of web applications
  13. Pervasive web and mobility
  14. Search potpourri
  15. Crawlers
  16. Web graphs
  17. Search quality and precision
  18. Advertisements & click estimates
  19. Knowledge discovery
  20. Personalization
  21. Defending against emerging threats
  22. Passwords and phishing
  23. Access control and trust on the web
  24. Ontologies
  25. Applications
  26. Similarity and extraction
  27. Query languages and DBs
  28. Semantic web and web 2.0
  29. Communication in developing regions
  30. Networking issues in the web
  31. Web modeling
  32. End-user perspectives and measurement in web engineering
  33. Orchestration and choreography
  34. SLAs and QoS
  35. Querying & transforming XML
  36. Parsing, normalizing, & storing XML
  37. Developing regions
  38. Search
  39. Semantic web
  40. Services
  41. Social networks
  42. Systems
  43. User interfaces & accessibility
  44. XML

Personalization

Homepage live: automatic block tracing for web personalization BIBAFull-Text 1-10
  Jie Han; Dingyi Han; Chenxi Lin; Hua-Jun Zeng; Zheng Chen; Yong Yu
The emergence of personalized homepage services, e.g. personalized Google Homepage and Microsoft Windows Live, has enabled Web users to select Web contents of interest and to aggregate them in a single Web page. The web contents are often predefined content blocks provided by the service providers. However, defining the content blocks and maintaining the information in them involves intensive manual effort. In this paper, we propose a novel personalized homepage system, called "Homepage Live", that allows end users to use drag-and-drop actions to collect their favorite Web content blocks from existing Web pages and organize them in a single page. Moreover, Homepage Live automatically traces changes to the blocks as their container pages evolve, by measuring the tree edit distance of the selected blocks. By exploiting the immutable elements of Web pages, the tracing algorithm's performance is significantly improved. The experimental results demonstrate the effectiveness and efficiency of our algorithm.
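As a rough illustration of the block-tracing idea, one crude stand-in for tree edit distance is to serialize each candidate block's tag sequence and pick the closest match. This sketch is ours, not the paper's algorithm (which compares DOM subtrees directly and prunes using immutable elements):

```python
import difflib

def best_matching_block(target_tags, candidate_blocks):
    """Pick the candidate block whose tag sequence is most similar
    to the originally selected block. The real system compares DOM
    subtrees by tree edit distance and prunes candidates using
    immutable elements (e.g. stable id attributes)."""
    def similarity(tags):
        return difflib.SequenceMatcher(None, target_tags, tags).ratio()
    return max(candidate_blocks, key=similarity)
```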
Open user profiles for adaptive news systems: help or harm? BIBAFull-Text 11-20
  Jae-wook Ahn; Peter Brusilovsky; Jonathan Grady; Daqing He; Sue Yeon Syn
Over the last five years, a range of projects have focused on progressively more elaborate techniques for adaptive news delivery. However, the adaptation process in these systems has become more complicated and thus less transparent to the users. In this paper, we concentrate on the application of open user models in adding transparency and controllability to adaptive news systems. We present a personalized news system, YourNews, which allows users to view and edit their interest profiles, and report a user study on the system. Our results confirm that users prefer transparency and control in their systems, and place more trust in such systems. However, similar to previous studies, our study demonstrates that this ability to edit user profiles may also harm the system's performance and has to be used with caution.
Investigating behavioral variability in web search BIBAFull-Text 21-30
  Ryen W. White; Steven M. Drucker
Understanding the extent to which people's search behaviors differ in terms of the interaction flow and information targeted is important in designing interfaces to help World Wide Web users search more effectively. In this paper we describe a longitudinal log-based study that investigated variability in people's interaction behavior when engaged in search-related activities on the Web. We analyze the search interactions of more than two thousand volunteer users over a five-month period, with the aim of characterizing differences in their interaction styles. The findings of our study suggest that there are dramatic differences in variability in key aspects of the interaction within and between users, and within and between the search queries they submit. Our findings also suggest two classes of extreme user: navigators and explorers, whose search interaction is highly consistent or highly variable. Lessons learned from these users can inform the design of tools to support effective Web-search interactions for everyone.

Smarter browsing

CSurf: a context-driven non-visual web-browser BIBAFull-Text 31-40
  Jalal U. Mahmud; Yevgen Borodin; I. V. Ramakrishnan
Web sites are designed for a graphical mode of interaction. Sighted users can "cut to the chase" and quickly identify relevant information in Web pages. In contrast, individuals with visual disabilities have to use screen-readers to browse the Web. As screen-readers process pages sequentially and read through everything, Web browsing can become strenuous and time-consuming. Although the use of shortcuts and searching offers some improvement, the problem remains. In this paper, we address the problem of information overload in non-visual Web access using the notion of context. Our prototype system, CSurf, embodying our approach, provides the usual features of a screen-reader.
   However, when a user follows a link, CSurf captures the context of the link using a simple topic-boundary detection technique, and uses it to identify relevant information on the next page with the help of a Support Vector Machine, a statistical machine-learning model. Then, CSurf reads the Web page starting from the most relevant section, identified by the model. We conducted a series of experiments to evaluate the performance of CSurf against the state-of-the-art screen-reader, JAWS. Our results show that the use of context can potentially save browsing time and substantially improve the browsing experience of visually disabled people.
Geotracker: geospatial and temporal RSS navigation BIBAFull-Text 41-50
  Yih-Farn Robin Chen; Giuseppe Di Fabbrizio; David Gibbon; Serban Jora; Bernard Renger; Bin Wei
The Web is rapidly moving towards a platform for mass collaboration in content production and consumption. Fresh content on a variety of topics, people, and places is being created and made available on the Web at breathtaking speed. Navigating the content effectively not only requires techniques such as aggregating various RSS-enabled feeds, but it also demands a new browsing paradigm. In this paper, we present novel geospatial and temporal browsing techniques that provide users with the capability of aggregating and navigating RSS-enabled content in a timely, personalized and automatic manner. In particular, we describe a system called GeoTracker that utilizes both a geospatial representation and a temporal (chronological) presentation to help users spot the most relevant updates quickly. Within the context of this work, we provide a middleware engine that supports intelligent aggregation and dissemination of RSS feeds with personalization to desktops and mobile devices. We study the navigation capabilities of this system on two kinds of data sets, namely, 2006 World Cup soccer data collected over two months and breaking news items that occur every day. We also demonstrate that the application of such technologies to the video search results returned by YouTube and Google greatly enhances a user's ability to locate and browse videos based on his or her geographical interests. Finally, we demonstrate that the location inference performance of GeoTracker compares well against machine learning techniques used in the natural language processing/information retrieval community. Despite its algorithmic simplicity, it achieves high recall.
Learning information intent via observation BIBAFull-Text 51-60
  Anthony Tomasic; Isaac Simmons; John Zimmerman
Users in an organization frequently request help by sending request messages to assistants that express information intent: an intention to update data in an information system. Human assistants spend a significant amount of time and effort processing these requests. For example, human resource assistants process requests to update personnel records, and executive assistants process requests to schedule conference rooms or to make travel reservations. To process the intent of a request, assistants read the request and then locate, complete, and submit a form that corresponds to the expressed intent. Automatically or semi-automatically processing the intent expressed in a request on behalf of an assistant would ease the mundane and repetitive nature of this kind of work.
   For a well-understood domain, a straightforward application of natural language processing techniques can be used to build an intelligent form interface to semi-automatically process information intent request messages. However, high performance parsers are based on machine learning algorithms that require a large corpus of examples that have been labeled by an expert. The generation of a labeled corpus of requests is a major barrier to the construction of a parser. In this paper, we investigate the construction of a natural language processing system and an intelligent form system that observes an assistant processing requests. The intelligent form system then generates a labeled training corpus by interpreting the observations. This paper reports on the measurement of the performance of the machine learning algorithms based on real data. The combination of observations, machine learning and interaction design produces an effective intelligent form interface based on natural language processing.

Data mining

Page-level template detection via isotonic smoothing BIBAFull-Text 61-70
  Deepayan Chakrabarti; Ravi Kumar; Kunal Punera
We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
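For readers unfamiliar with isotonic regression, the classic pool-adjacent-violators algorithm conveys the unregularized flavor of the smoothing step applied to per-node templateness scores. This is an illustrative sketch only; the paper solves a regularized variant over the page structure:

```python
def pav(scores):
    """Pool Adjacent Violators: least-squares fit of a
    non-decreasing sequence to the input scores."""
    # Each block holds [sum, count]; pool blocks that violate monotonicity.
    blocks = []
    for s in scores:
        blocks.append([s, 1])
        # While the last block's mean undercuts the previous one's, merge them.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, n = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += n
    # Expand blocks back into a per-element fitted sequence.
    fitted = []
    for total, n in blocks:
        fitted.extend([total / n] * n)
    return fitted
```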
Towards domain-independent information extraction from web tables BIBAFull-Text 71-80
  Wolfgang Gatterbauer; Paul Bohunsky; Marcus Herzog; Bernhard Krüpl; Bernhard Pollak
Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A multitude of different HTML implementations of web tables makes these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The topological and style information obtained in this way allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."
Web object retrieval BIBAFull-Text 81-90
  Zaiqing Nie; Yunxiao Ma; Shuming Shi; Ji-Rong Wen; Wei-Ying Ma
The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because of the diversity of Web site qualities and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performances. We conclude that the hybrid model is superior because it takes extraction errors into account at varying levels.

Mining textual data

Summarizing email conversations with clue words BIBAFull-Text 91-100
  Giuseppe Carenini; Raymond T. Ng; Xiaodong Zhou
Accessing an ever-increasing number of emails, possibly on small mobile devices, has become a major problem for many users. Email summarization is a promising way to solve this problem. In this paper, we propose a new framework for email summarization. One novelty is to use a fragment quotation graph to try to capture an email conversation. The second novelty is to use clue words to measure the importance of sentences in conversation summarization. Based on clue words and their scores, we propose a method called CWS, which is capable of producing a summary of any length as requested by the user. We provide a comprehensive comparison of CWS with various existing methods on the Enron data set. Preliminary results suggest that CWS provides better summaries than existing methods.
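The clue-word idea can be sketched as scoring each sentence by its word overlap with the neighbouring fragments in the quotation graph. This is a simplification of CWS; the tokenization and function names below are our assumptions:

```python
import re

def clue_word_scores(sentences, neighbor_text):
    """Score each sentence by how many of its words also appear in
    the neighbouring conversation fragments (the 'clue words').
    Higher-scoring sentences are better summary candidates."""
    neighbor_words = set(re.findall(r"[a-z']+", neighbor_text.lower()))
    scores = []
    for sent in sentences:
        words = set(re.findall(r"[a-z']+", sent.lower()))
        scores.append(len(words & neighbor_words))
    return scores
```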
Organizing and searching the world wide web of facts -- step two: harnessing the wisdom of the crowds BIBAFull-Text 101-110
  Marius Pasca
As part of a larger effort to acquire large repositories of facts from unstructured text on the Web, a seed-based framework for textual information extraction allows for weakly supervised extraction of class attributes (e.g., side effects and generic equivalent for drugs) from anonymized query logs. The extraction is guided by a small set of seed attributes, without any need for handcrafted extraction patterns or further domain-specific knowledge. The attributes of classes pertaining to various domains of interest to Web search users have accuracy levels significantly exceeding current state of the art. Inherently noisy search queries are shown to be a highly valuable, albeit unexplored, resource for Web-based information extraction, in particular for the task of class attribute extraction.
Do not crawl in the dust: different URLs with similar text BIBAFull-Text 111-120
  Ziv Bar-Yossef; Idit Keidar; Uri Schonfeld
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
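A discovered DUST rule is essentially a substring substitution; applying one to a URL yields a candidate duplicate. The rule and URLs in this sketch are hypothetical examples, not taken from the paper:

```python
def apply_dust_rule(url, rule):
    """Apply a substring-substitution DUST rule (alpha -> beta):
    if alpha occurs in the URL, replace its first occurrence to get
    a candidate duplicate URL; otherwise the rule does not apply."""
    alpha, beta = rule
    if alpha in url:
        return url.replace(alpha, beta, 1)
    return None
```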

Similarity search

A new suffix tree similarity measure for document clustering BIBAFull-Text 121-130
  Hung Chim; Xiaotie Deng
In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on a suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new clustering algorithm is a very effective document clustering algorithm. Compared with the traditional tf-idf word-term-weight similarity measure in the same GAHC algorithm, NSTC achieved an average improvement of 51% in F-measure score. Furthermore, we apply the new clustering algorithm to analyzing the Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify and search the Web documents in a large forum community.
Scaling up all pairs similarity search BIBAFull-Text 131-140
  Roberto J. Bayardo; Yiming Ma; Ramakrishnan Srikant
Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
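The problem statement can be made concrete with a naive quadratic-scan baseline over sparse vectors (represented here as term-to-weight dicts). This sketch is illustrative only; the paper's contribution is avoiding exactly this scan via inverted-index and threshold-based pruning:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values())) *
            math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def all_pairs(vectors, threshold):
    """Return index pairs whose cosine similarity meets the
    threshold, by brute-force comparison of every pair."""
    return [(i, j) for i, j in combinations(range(len(vectors)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold]
```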
Detecting near-duplicates for web crawling BIBAFull-Text 141-150
  Gurmeet Singh Manku; Arvind Jain; Anish Das Sarma
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
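The pairwise test at the heart of the second contribution is just a Hamming-distance check on fingerprints; a minimal sketch follows (the helper names and the k=3 default are our choices). The hard part the paper actually solves is finding, among billions of stored fingerprints, those within distance k without comparing against each one:

```python
def hamming_distance(f1, f2):
    """Number of differing bit positions between two fingerprints."""
    return bin(f1 ^ f2).count("1")

def is_near_duplicate(f1, f2, k=3):
    """Two pages are treated as near-duplicates if their f-bit
    fingerprints differ in at most k bit positions."""
    return hamming_distance(f1, f2) <= k
```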

Predictive modeling of web users

Demographic prediction based on user's browsing behavior BIBAFull-Text 151-160
  Jian Hu; Hua-Jun Zeng; Hua Li; Cheng Niu; Zheng Chen
Demographic information plays an important role in personalized web applications. However, it is usually not easy to obtain this kind of personal data, such as age and gender. In this paper, we make a first attempt to predict users' gender and age from their Web browsing behaviors, in which Webpage view information is treated as a hidden variable to propagate demographic information between different users. There are three main steps in our approach: First, learning from Webpage click-through data, Webpages are associated with users' (known) age and gender tendency through a discriminative model; Second, users' (unknown) age and gender are predicted from the demographic information of the associated Webpages through a Bayesian framework; Third, based on the fact that Webpages visited by similar users may be associated with similar demographic tendencies, and users with similar demographic information would visit similar Webpages, a smoothing component is employed to overcome the data sparseness of the web click-through log. Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvement on gender prediction and 50.3% on age prediction in terms of macro F1, compared to baseline algorithms.
Why we search: visualizing and predicting user behavior BIBAFull-Text 161-170
  Eytan Adar; Daniel S. Weld; Brian N. Bershad; Steven S. Gribble
The aggregation and comparison of behavioral patterns on the WWW represent a tremendous opportunity for understanding past behaviors and predicting future behaviors. In this paper, we take a first step at achieving this goal. We present a large scale study correlating the behaviors of Internet users on multiple systems ranging in size from 27 million queries to 14 million blog posts to 20,000 news articles. We formalize a model for events in these time-varying datasets and study their correlation. We have created an interface for analyzing the datasets, which includes a novel visual artifact, the DTWRadar, for summarizing differences between time series. Using our tool we identify a number of behavioral properties that allow us to understand the predictive power of patterns of use.
Topic sentiment mixture: modeling facets and opinions in weblogs BIBAFull-Text 171-180
  Qiaozhu Mei; Xu Ling; Matthew Wondra; Hang Su; ChengXiang Zhai
In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topics. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. The TSM model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction.

Mining in social networks

Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography BIBAFull-Text 181-190
  Lars Backstrom; Cynthia Dwork; Jon Kleinberg
In a social network, nodes correspond to people or other social entities, and edges correspond to social links between them. In an effort to preserve privacy, the practice of anonymization replaces names with meaningless unique identifiers. We describe a family of attacks such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes.
Information flow modeling based on diffusion rate for prediction and ranking BIBAFull-Text 191-200
  Xiaodan Song; Yun Chi; Koji Hino; Belle L. Tseng
Information flows in a network where individuals influence each other. The diffusion rate captures how efficiently the information can diffuse among the users in the network. We propose an information flow model that leverages diffusion rates for: (1) prediction: identify where information should flow to, and (2) ranking: identify who will most quickly receive the information. For prediction, we measure how likely information will propagate from a specific sender to a specific receiver during a certain time period. Accordingly a rate-based recommendation algorithm is proposed that predicts who will most likely receive the information during a limited time period. For ranking, we estimate the expected time for information diffusion to reach a specific user in a network. Subsequently, a DiffusionRank algorithm is proposed that ranks users based on how quickly information will flow to them. Experiments on two datasets demonstrate the effectiveness of the proposed algorithms to both improve the recommendation performance and rank users by the efficiency of information flow.
Netprobe: a fast and scalable system for fraud detection in online auction networks BIBAFull-Text 201-210
  Shashank Pandit; Duen Horng Chau; Samuel Wang; Christos Faloutsos
Given a large network of online auction users and their histories of transactions, how can we spot anomalies and auction fraud? This paper describes the design and implementation of NetProbe, a system that we propose for solving this problem. NetProbe models auction users and transactions as a Markov Random Field tuned to detect the suspicious patterns that fraudsters create, and employs a Belief Propagation mechanism to detect likely fraudsters. Our experiments show that NetProbe is both efficient and effective for fraud detection. We report experiments on synthetic graphs with as many as 7,000 nodes and 30,000 edges, where NetProbe was able to spot fraudulent nodes with over 90% precision and recall, within a matter of seconds. We also report experiments on a real dataset crawled from eBay, with nearly 700,000 transactions between more than 66,000 users, where NetProbe was highly effective at unearthing hidden networks of fraudsters, within a realistic response time of about 6 minutes. For scenarios where the underlying data is dynamic in nature, we propose Incremental NetProbe, which is an approximate, but fast, variant of NetProbe. Our experiments show that Incremental NetProbe runs nearly twice as fast as NetProbe, while retaining over 99% of its accuracy.

E-communities

The complex dynamics of collaborative tagging BIBAFull-Text 211-220
  Harry Halpin; Valentin Robu; Hana Shepherd
The debate within the Web community over the optimal means by which to organize information often pits formalized classifications against distributed collaborative tagging systems. A number of questions remain unanswered, however, regarding the nature of collaborative tagging systems including whether coherent categorization schemes can emerge from unsupervised tagging by users. This paper uses data from the social bookmarking site del.icio.us to examine the dynamics of collaborative tagging systems. In particular, we examine whether the distribution of the frequency of use of tags for "popular" sites with a long history (many tags and many users) can be described by a power law distribution, often characteristic of what are considered complex systems. We produce a generative model of collaborative tagging in order to understand the basic dynamics behind tagging, including how a power law distribution of tags could arise. We empirically examine the tagging history of sites in order to determine how this distribution arises over time and to determine the patterns prior to a stable distribution. Lastly, by focusing on the high-frequency tags of a site where the distribution of tags is a stabilized power law, we show how tag co-occurrence networks for a sample domain of tags can be used to analyze the meaning of particular tags given their relationship to other tags.
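A minimal generative sketch of such tagging dynamics is a Yule-Simon-style process: each new tag application either introduces a fresh tag or copies an earlier one in proportion to its past use. This toy model (parameter names and values are our assumptions, not the paper's fitted model) produces heavy-tailed tag frequencies:

```python
import random

def simulate_tagging(vocab_size, steps, p_new=0.1, seed=0):
    """With probability p_new draw a uniformly random tag; otherwise
    copy a tag already in the history (preferential attachment),
    so frequently used tags accumulate further use."""
    rng = random.Random(seed)
    history = []
    counts = {}
    for _ in range(steps):
        if not history or rng.random() < p_new:
            tag = rng.randrange(vocab_size)
        else:
            tag = rng.choice(history)
        history.append(tag)
        counts[tag] = counts.get(tag, 0) + 1
    return counts
```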
Expertise networks in online communities: structure and algorithms BIBAFull-Text 221-230
  Jun Zhang; Mark S. Ackerman; Lada Adamic
Web-based communities have become important places for people to seek and share expertise. We find that networks in these communities typically differ in their topology from other online networks such as the World Wide Web. Systems targeted to augment web-based communities by automatically identifying users with expertise, for example, need to adapt to the underlying interaction dynamics. In this study, we analyze the Java Forum, a large online help-seeking community, using social network analysis methods. We test a set of network-based ranking algorithms, including PageRank and HITS, on this large social network in order to identify users with high expertise. We then use simulations to identify a small number of simple simulation rules governing the question-answer dynamic in the network. These simple rules not only replicate the structural characteristics and algorithm performance on the empirically observed Java Forum, but also allow us to evaluate how other algorithms may perform in communities with different characteristics. We believe this approach will be fruitful for practical algorithm design and implementation for online expertise-sharing communities.
Internet-scale collection of human-reviewed data BIBAFull-Text 231-240
  Qi Su; Dmitry Pavlov; Jyh-Herng Chow; Wendell C. Baker
Enterprise and web data processing and content aggregation systems often require extensive use of human-reviewed data (e.g. for training and monitoring machine learning-based applications). Today these needs are often met by in-house efforts or outsourced offshore contracting. Emerging applications attempt to provide automated collection of human-reviewed data at Internet-scale. We conduct extensive experiments to study the effectiveness of one such application. We also study the feasibility of using Yahoo! Answers, a general question-answering forum, for human-reviewed data collection.

E-commerce and e-content

Detectives: detecting coalition hit inflation attacks in advertising networks streams BIBAFull-Text 241-250
  Ahmed Metwally; Divyakant Agrawal; Amr El Abbadi
Click fraud is jeopardizing the industry of Internet advertising. Internet advertising is crucial for the thriving of the entire Internet, since it allows producers to advertise their products, and hence contributes to the well-being of e-commerce. Moreover, advertising supports the intellectual value of the Internet by covering the running expenses of publishing content. Some content publishers are dishonest, and use automation to generate traffic to defraud the advertisers. Similarly, some advertisers automate clicks on the advertisements of their competitors to deplete their competitors' advertising budgets. This paper describes the advertising network model, and focuses on the most sophisticated type of fraud, which involves coalitions among fraudsters. We build on several published theoretical results to devise the Similarity-Seeker algorithm that discovers coalitions made by pairs of fraudsters. We then generalize the solution to coalitions of arbitrary sizes. Before deploying our system on a real network, we conducted comprehensive experiments on data samples for proof of concept. The results were very accurate. We detected several coalitions, formed using various techniques, and spanning numerous sites. This reveals the generality of our model and approach.
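At its simplest, a pairwise coalition test compares the traffic sources of two publishers; a Jaccard-style overlap score (our simplification, not the Similarity-Seeker algorithm itself) illustrates the kind of evidence involved:

```python
def traffic_similarity(sources_a, sources_b):
    """Jaccard overlap between two publishers' sets of click
    sources (e.g. client IPs). An unusually high overlap between
    unrelated sites is evidence of a coalition of fraudsters
    driving traffic to both."""
    a, b = set(sources_a), set(sources_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```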
Extraction and search of chemical formulae in text documents on the web BIBAFull-Text 251-260
  Bingjun Sun; Qingzhao Tan; Prasenjit Mitra; C. Lee Giles
Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Matching only the exact keyword string results in two problems for this domain: a) if the user searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return all documents where Helium is mentioned as well as documents where the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: 1) extract chemical formulae from text documents, 2) index chemical formulae, and 3) design ranking functions for the chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae and queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM), and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.
A content-driven reputation system for the Wikipedia BIBAFull-Text 261-270
  B. Thomas Adler; Luca de Alfaro
We present a content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order. Thus, author reputation is computed solely on the basis of content evolution; user-to-user comments or ratings are not used. The author reputation we compute could be used to flag new contributions from low-reputation authors, or it could be used to allow only authors with high reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia could also provide an incentive for high-quality contributions. We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691, 551 pages and 5, 587, 523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms.

Industrial practice & experience

Google news personalization: scalable online collaborative filtering BIBAFull-Text 271-280
  Abhinandan S. Das; Mayur Datar; Ashutosh Garg; Shyam Rajaram
Several approaches to collaborative filtering have been studied but seldom have studies been reported for large (several million users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.
Exploring in the weblog space by detecting informative and affective articles BIBAFull-Text 281-290
  Xiaochuan Ni; Gui-Rong Xue; Xiao Ling; Yong Yu; Qiang Yang
Weblogs have become a prevalent source of information for people to express themselves. In general, there are two genres of contents in weblogs. The first kind is about the webloggers' personal feelings, thoughts or emotions. We call this kind of weblogs affective articles. The second kind of weblogs is about technologies and different kinds of informative news. In this paper, we present a machine learning method for classifying informative and affective articles among weblogs. We consider this problem as a binary classification problem. By using machine learning approaches, we achieve about 92% on information retrieval performance measures including precision, recall and F1. We set up three studies on the applications of above classification approach in both research and industrial fields. The above classification approach is used to improve the performance of classification of emotions from weblog articles. We also develop an intent-driven weblog-search engine based on the classification techniques to improve the satisfaction of Web users. Finally, our approach is applied to search for weblogs with a great deal of informative articles.
Spam double-funnel: connecting web spammers with advertisers BIBAFull-Text 291-300
  Yi-Min Wang; Ming Ma; Yuan Niu; Hao Chen
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam -- redirection spam -- where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords. one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

Scalable systems for dynamic content

Globetp: template-based database replication for scalable web applications BIBAFull-Text 301-310
  Tobias Groothuyse; Swaminathan Sivasubramanian; Guillaume Pierre
Generic database replication algorithms do not scale linearly in throughput as all update, deletion and insertion (UDI) queries must be applied to every database replica. The throughput is therefore limited to the point where the number of UDI queries alone is sufficient to overload one server. In such scenarios, partial replication of a database can help, as UDI queries are executed only by a subset of all servers. In this paper we propose GlobeTP, a system that employs partial replication to improve database throughput. GlobeTP exploits the fact that a Web application's query workload is composed of a small set of read and write templates. Using knowledge of these templates and their respective execution costs, GlobeTP provides database table placements that produce significant improvements in database throughput. We demonstrate the efficiency of this technique using two different industry standard benchmarks. In our experiments, GlobeTP increases the throughput by 57% to 150% compared to full replication, while using identical hardware configuration. Furthermore, adding a single query cache improves the throughput by another 30% to 60%.
Consistency-preserving caching of dynamic database content BIBAFull-Text 311-320
  Niraj Tolia; M. Satyanarayanan
With the growing use of dynamic web content generated from relational databases, traditional caching solutions for through put and latency improvements are ineffective. We describe a middleware layer called Ganesh that reduces the volume of data transmitted without semantic interpretation of queries or results. It achieves this reduction through the use of cryptographic hashing to detect similarities with previous results. These benefits do not require any compromise of the strict consistency semantics provided by the back-end database. Further, Ganesh does not require modifications to applications, web servers, or database servers, and works with closed-source applications and databases. Using two bench marks representative of dynamic web sites, measurements of our prototype show that it can increase end-to-end throughput by as much as two fold for non-data intensive applications and by as much as ten fold for data intensive ones.
Optimized query planning of continuous aggregation queries in dynamic data dissemination networks BIBAFull-Text 321-330
  Rajeev Gupta; Krithi Ramamritham
Continuous queries are used to monitor changes to time varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some aggregation function over distributed data items, for example, to know (a) the average of temperatures sensed by a set of sensors (b) the value of index of mid-cap stocks. In these queries a client specifies a coherency requirement as part of the query. In this paper we present a low-cost, scalable technique to answer continuous aggregation queries using a content distribution network of dynamic data items. In such a network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Just as various fragments of a dynamic web-page are served by one or more nodes of a content distribution network, our technique involves decomposing a client query into sub-queries and executing sub-queries on judiciously chosen data aggregators with their individual sub-query incoherency bounds. We provide a technique of getting the optimal query plan (i.e., set of sub-queries and their chosen data aggregators) which satisfies client query's coherency requirement with least cost, measured in terms of the number of refresh messages sent from aggregators to the client. For estimating query execution cost, we build a continuous query cost model which can be used to estimate the number of messages required to satisfy the client specified incoherency bound. Performance results using real-world traces show that our cost based query planning leads to queries being executed using less than one third the number of messages required by existing schemes.

Performance engineering of web applications

A scalable application placement controller for enterprise data centers BIBAFull-Text 331-340
  Chunqiang Tang; Malgorzata Steinder; Michael Spreitzer; Giovanni Pacifici
Given a set of machines and a set of Web applications with dynamically changing demands, an online application placement controller decides how many instances to run for each application and where to put them, while observing all kinds of resource constraints. This NP hard problem has real usage in commercial middleware products. Existing approximation algorithms for this problem can scale to at most a few hundred machines, and may produce placement solutions that are far from optimal when system resources are tight. In this paper, we propose a new algorithm that can produce within 30seconds high-quality solutions for hard placement problems with thousands of machines and thousands of applications. This scalability is crucial for dynamic resource provisioning in large-scale enterprise data centers. Our algorithm allows multiple applications to share a single machine, and strives to maximize the total satisfied application demand, to minimize the number of application starts and stops, and to balance the load across machines. Compared with existing state-of-the-art algorithms, for systems with 100 machines or less, our algorithm is up to 134 times faster, reduces application starts and stops by up to 97%, and produces placement solutions that satisfy up to 25% more application demands. Our algorithm has been implemented and adopted in a leading commercial middleware product for managing the performance of Web applications.
A unified platform for data driven web applications with automatic client-server partitioning BIBAFull-Text 341-350
  Fan Yang; Nitin Gupta; Nicholas Gerner; Xin Qi; Alan Demers; Johannes Gehrke; Jayavel Shanmugasundaram
Data-driven web applications are usually structured in three tiers with different programming models at each tier. This division forces developers to manually partition application functionality across the tiers, resulting in complex logic, suboptimal partitioning, and expensive re-partitioning of applications. In this paper, we introduce a unified platform for automatic partitioning of data-driven web applications. Our approach is based on Hilda[41, 46], a high-level declarative programming language with a unified data and programming model for all the layers of the application. Based on run-time properties of the application, Hilda's run time system automatically partitions the application between the tiers to improve response time while adhering to memory and/or processing constraints at the clients. We evaluate our methodology with traces from a real application and with TPC-W, and our results show that automatic partitioning outperforms manual partitioning without the associated development overhead.
MyXDNS: a request routing DNS server with decoupled server selection BIBAFull-Text 351-360
  Hussein A. Alzoubi; Michael Rabinovich; Oliver Spatscheck
This paper presents the architecture and the preliminary evaluation of a request routing DNS server that decouples server selection from the rest of DNS functionality. Our DNS server, which we refer to as MyXDNS, exposes well-defined APIs for uploading an externally computed server selection policy and for interacting with an external network proximity service. With MyXDNS, researchers can explore their own network proximity metrics and request routing algorithms without having to worry about DNS internals. Furthermore, MyXDNS is based onopen-source MyDNS and is available to public. Stress-testing of MyXDNS indicated that it achieves its flexibility at an acceptable cost: a single MyXDNS running on a low-level server can process 3000 req/sec with sub-millisecond response even in the presence of continuous updates to server selection policy.

Pervasive web and mobility

Robust web page segmentation for mobile terminal using content-distances and page layout information BIBAFull-Text 361-370
  Gen Hattori; Keiichiro Hoashi; Kazunori Matsumoto; Fumiaki Sugaya
The demand of browsing information from general Web pages using a mobile phone is increasing. However, since the majority of Web pages on the Internet are optimized for browsing from PCs, it is difficult for mobile phone users to obtain sufficient information from the Web. Therefore, a method to reconstruct PC-optimized Web pages for mobile phone users is essential. An example approach is to segment the Web page based on its structure, and utilize the hierarchy of the content element to regenerate a page suitable for mobile phone browsing. In our previous work, we have examined a robust automatic Web page segmentation scheme which uses the distance between content elements based on the relative HTML tag hierarchy, i.e., the number and depth of HTML tags in Web pages. However, this scheme has a problem that the content-distance based on the order of HTML tags does not always correspond to the intuitional distance between content elements on the actual layout of a Web page. In this paper, we propose a hybrid segmentation method which segments Web pages based on both the content-distance calculated by the previous scheme, and a novel approach which utilizes Web page layout information. Experiments conducted to evaluate the accuracy of Web page segmentation results prove that the proposed method can segment Web pages more accurately than conventional methods. Furthermore, implementation and evaluation of our system on the mobile phone prove that our method can realize superior usability compared to commercial Web browsers.
PRIVE: anonymous location-based queries in distributed mobile systems BIBAFull-Text 371-380
  Gabriel Ghinita; Panos Kalnis; Spiros Skiadopoulos
Nowadays, mobile users with global positioning devices can access Location Based Services (LBS) and query about points of interest in their proximity. For such applications to succeed, privacy and confidentiality are essential. Encryption alone is not adequate; although it safeguards the system against eavesdroppers, the queries themselves may disclose the location and identity of the user. Recently, there have been proposed centralized architectures based on K-anonymity, which utilize an intermediate anonymizer between the mobile users and the LBS. However, the anonymizer must be updated continuously with the current locations of all users. Moreover, the complete knowledge of the entire system poses a security threat, if the anonymizer is compromised.
   In this paper we address two issues: (i) We show that existing approaches may fail to provide spatial anonymity for some distributions of user locations and describe a novel technique which solves this problem. (ii) We propose Prive, a decentralized architecture for preserving the anonymity of users issuing spatial queries to LBS. Mobile users self-organizein to an overlay network with good fault tolerance and load balancing properties. Prive avoids the bottleneck caused by centralized techniques both in terms of anonymization and location updates. Moreover, the system state is distributed in numerous users, rendering Prive resilient to attacks. Extensive experimental studies suggest that Prive is applicable to real-life scenarios with large populations of mobile users.
A mobile application framework for the geospatial web BIBAFull-Text 381-390
  Rainer Simon; Peter Fröhlich
In this paper we present an application framework that leverages geospatial content on the World Wide Web by enabling innovative modes of interaction and novel types of user interfaces on advanced mobile phones and PDAs. We discuss the current development steps involved in building mobile geospatial Web applications and derive three technological pre-requisites for our framework: spatial query operations based on visibility and field of view, a 2.5D environment model, and a presentation-independent data exchange format for geospatial query results. We propose the Local Visibility Model as a suitable XML-based candidate and present a prototype implementation.

Search potpourri

Navigation-aided retrieval BIBAFull-Text 391-400
  Shashank Pandit; Christopher Olston
Users searching for information in hypermedia environments often perform querying followed by manual navigation. Yet, the conventional text/hypertext retrieval paradigm does not explicity take post-query navigation into account. This paper proposes a new retrieval paradigm, called navigation-aided retrieval (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the conventional probabilistic information retrieval paradigm, which implicitly assumes no navigation takes place. This paper presents a formal model for navigation-aided retrieval, and reports empirical results that point to the real-world applicability of the model. The experiments were performed over a large Web corpus provided by TREC, using human judgments on a new rating scale developed for navigation-aided retrieval. In the case of ambiguous queries, the new retrieval model identifies good starting points for post-query navigation. For less ambiguous queries that need not be paired with navigation, the output closely matches that of a conventional retrieval system.
Efficient search engine measurements BIBAFull-Text 401-410
  Ziv Bar-Yossef; Maxim Gurevich
We address the problem of measuring global quality metrics of search engines, like corpus size, index freshness, and density of duplicates in the corpus. The recently proposed estimators for such metrics [2, 6] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so called "document degrees".
   We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.
   Building on an idea from [6], we discuss Rao Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising accuracy.
Efficient search in large textual collections with redundancy BIBAFull-Text 411-420
  Jiangong Zhang; Torsten Suel
Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges.
   Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.
   In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on search engine query log and a large collection consisting of multiple crawls.


The discoverability of the web BIBAFull-Text 421-430
  Anirban Dasgupta; Arpita Ghosh; Ravi Kumar; Christopher Olston; Sandeep Pandey; Andrew Tomkins
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider.
   We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
Combining classifiers to identify online databases BIBAFull-Text 431-440
  Luciano Barbosa; Juliana Freire
We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Webforms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.
   We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
An adaptive crawler for locating hidden-web entry points BIBAFull-Text 441-450
  Luciano Barbosa; Juliana Freire
In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates -- the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.

Web graphs

Random web crawls BIBAFull-Text 451-460
  Toufik Bennouas; Fabien de Montgolfier
This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph, whose vertices are the pages and whose edges are the hypertextual links. Of course a Web crawl has a very special structure; we recall some known results about it. We then propose a model generating similar structures. Our model simply simulates a crawling, i.e. builds and crawls the graph at the same time. The graphs generated have lot of known properties of Web crawls. Our model is simpler than most random Web graph models, but captures the sames properties. Notice that it models the crawling process instead of the page writing process of Web graph models.
Extraction and classification of dense communities in the web BIBAFull-Text 461-470
  Yon Dourisboure; Filippo Geraci; Marco Pellegrini
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching and classifying communities in the web. Loosely speaking a community is a group of pages related to a common interest. More formally communities have been associated in the computer science literature with the existence of a locally dense sub-graph of the web-graph (where web pages are nodes and hyper-links are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm on web-graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of a thirty nodes or more (even at low density). It is still about 80% even for communities of twenty nodes with density over 50% of the arcs present. At the lower extremes the algorithm catches 35% of dense communities made of ten nodes. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group by representative keywords.
Web projections: learning from contextual subgraphs of the web BIBAFull-Text 471-480
  Jure Leskovec; Susan Dumais; Eric Horvitz
Graphical relationships among Web pages have been exploited in methods for ranking search results. To date, specific graphical properties have been used in these analyses. We introduce a Web Projection methodology that generalizes prior efforts of graphical relationships of the web in several ways. With the approach, we create subgraphs by projecting sets of pages and domains onto the larger web graph, and then use machine learning to construct predictive models that consider graphical properties as evidence. We describe the method and then present experiments that illustrate the construction of predictive models of search result quality and user query reformulation.

Search quality and precision

Supervised rank aggregation BIBAFull-Text 481-490
  Yu-Ting Liu; Tie-Yan Liu; Tao Qin; Zhi-Ming Ma; Hang Li
This paper is concerned with rank aggregation, the task of combining the ranking results of individual rankers at meta-search. Previously, rank aggregation was performed mainly by means of unsupervised learning. To further enhance ranking accuracies, we propose employing supervised learning to perform the task, using labeled data. We refer to the approach as Supervised Rank Aggregation. We set up a general framework for conducting Supervised Rank Aggregation, in which learning is formalized an optimization which minimizes disagreements between ranking results and the labeled data. As case study, we focus on Markov Chain based rank aggregation in this paper. The optimization for Markov Chain based methods is not a convex optimization problem, however, and thus is hard to solve. We prove that we can transform the optimization problem into that of Semidefinite Programming and solve it efficiently. Experimental results on meta-searches show that Supervised Rank Aggregation can significantly outperform existing unsupervised methods.
Navigating the intranet with high precision BIBAFull-Text 491-500
  Huaiyu Zhu; Sriram Raghavan; Shivakumar Vaithyanathan; Alexander Löser
Despite the success of web search engines, search over large enterprise intranets still suffers from poor result quality. Earlier work [6] that compared intranets and the Internet from the view point of keyword search has pointed to several reasons why the search problem is quite different in these two domains. In this paper, we address the problem of providing high quality answers to navigational queries in the intranet (e.g., queries intended to find product or personal home pages, service pages, etc.). Our approach is based on offline identification of navigational pages, intelligent generation of term-variants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. Using a testbed of 5.5M pages from the IBM intranet, we present evaluation results that demonstrate that for navigational queries, our approach of using custom indices produces results of significantly higher precision than those produced by a general purpose search algorithm.
Optimizing web search using social annotations BIBAFull-Text 501-510
  Shenghua Bao; Guirong Xue; Xiaoyuan Wu; Yong Yu; Ben Fei; Zhong Su
This paper explores the use of social annotations to improve web search. Nowadays, many services, e.g. del.icio.us, have been developed for web users to organize and share their favorite webpages on line by using social annotations. We observe that the social annotations can benefit web search in two aspects: 1) the annotations are usually good summaries of corresponding webpages; 2) the count of annotations indicates the popularity of webpages. Two novel algorithms are proposed to incorporate the above information into page ranking: 1) SocialSimRank (SSR) calculates the similarity between social annotations and web queries; 2) SocialPageRank (SPR) captures the popularity of webpages. Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality (popularity) of a webpage from the web users' perspective. We further evaluate the proposed methods empirically with 50 manually constructed queries and 3000 auto-generated queries on a dataset crawled from delicious. Experiments show that both SSR and SPR benefit web search significantly.

Advertisements & click estimates

Robust methodologies for modeling web click distributions BIBAFull-Text 511-520
  Kamal Ali; Mark Scarr
Metrics such as click counts are vital to online businesses, but their measurement has been problematic due to the inclusion of high-variance robot traffic. We posit that, by applying more rigorous statistical methods than have been employed to date, we can build a robust model of the distribution of clicks, from which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions, and current industrial practice eschews this research in favor of conservative ad-hoc click-level thresholds. The prevailing belief is that such distributions are scale-free power laws, but using more rigorous statistical methods we find that the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten data sets from various verticals in the Yahoo domain. Since mixture models can overfit the data, we take care to use the BIC log-likelihood method, which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of "users" that visit the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website.
   Backing up our quantitative analysis is a graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: first, plotting in log-log space allows one to visually differentiate the various exponential distributions; second, cumulative plots are much more robust to outliers. We plan to use the results of this work in applications such as robot removal in web-metrics business intelligence systems.
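The BIC-based model selection described above can be sketched on toy data (this is an editor's illustration, not the paper's code; the counts and parameter grids are hypothetical). A pure Zipf law (one free parameter) and a Zipf-Mandelbrot law (two free parameters) are fit by grid search, then compared via BIC = k·ln(n) − 2·ln(L̂), where the extra parameter of the richer model is penalized:

```python
import math

def zipf_mandelbrot_loglik(counts, s, q):
    """Log-likelihood of rank-frequency counts under p(r) proportional to 1/(r+q)^s."""
    V = len(counts)
    w = [(r + q) ** -s for r in range(1, V + 1)]
    Z = sum(w)
    return sum(c * math.log(wi / Z) for c, wi in zip(counts, w))

def bic(loglik, k, n):
    """Bayesian Information Criterion: lower is better."""
    return k * math.log(n) - 2.0 * loglik

# toy rank-frequency data (e.g. clicks per user, ranked by frequency)
counts = [100, 60, 40, 30, 24, 20, 17, 15, 13, 12]
n = sum(counts)

# pure Zipf: q fixed at 0, so only s is free (k = 1)
best_zipf = max((zipf_mandelbrot_loglik(counts, s, 0.0), s)
                for s in [x / 10 for x in range(5, 31)])
# Zipf-Mandelbrot: both s and q free (k = 2)
best_zm = max((zipf_mandelbrot_loglik(counts, s, q), s, q)
              for s in [x / 10 for x in range(5, 31)]
              for q in [x / 10 for x in range(0, 51)])

bic_zipf = bic(best_zipf[0], k=1, n=n)
bic_zm = bic(best_zm[0], k=2, n=n)
```

The model with the lower BIC wins only if its likelihood gain outweighs the ln(n) penalty per extra parameter.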
Predicting clicks: estimating the click-through rate for new ads BIBAFull-Text 521-530
  Matthew Richardson; Ewa Dominowska; Robert Ragno
Search engine advertising has become a significant element of the Web browsing experience. Choosing the right ads for the query and the order in which they are displayed greatly affects the probability that a user will see and click on each ad. This ranking has a strong impact on the revenue the search engine receives from the ads. Further, showing the user an ad that they prefer to click on improves user satisfaction. For these reasons, it is important to be able to accurately estimate the click-through rate of ads in the system. For ads that have been displayed repeatedly, this is empirically measurable, but for new ads, other means must be used. We show that we can use features of ads, terms, and advertisers to learn a model that accurately predicts the click-through rate for new ads. We also show that using our model improves the convergence and performance of an advertising system. As a result, our model increases both revenue and user satisfaction.
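The learning setup can be illustrated with a minimal sketch: a logistic-regression click predictor trained on per-ad features. The paper uses far richer feature sets and models; the three features and all numbers below are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_ctr(rows, epochs=500, lr=0.1):
    """Logistic regression by plain gradient descent; rows = (features, clicked)."""
    d = len(rows[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in rows:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# hypothetical features per ad: [avg CTR of its bid terms, 1/len(title), has_price]
data = [([0.08, 0.2, 1], 1), ([0.01, 0.1, 0], 0),
        ([0.07, 0.3, 1], 1), ([0.02, 0.1, 0], 0)]
w, b = train_ctr(data)
predict = lambda x: sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

For a new ad, `predict` gives an estimated CTR before any impression data exists, which can then seed the ranking until empirical counts take over.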
Dynamics of bid optimization in online advertisement auctions BIBAFull-Text 531-540
  Christian Borgs; Jennifer Chayes; Nicole Immorlica; Kamal Jain; Omid Etesami; Mohammad Mahdian
We consider the problem of online keyword advertising auctions among multiple bidders with limited budgets, and study a natural bidding heuristic in which advertisers attempt to optimize their utility by equalizing their return-on-investment across all keywords. We show that existing auction mechanisms combined with this heuristic can experience cycling (as has been observed in many current systems), and therefore propose a modified class of mechanisms with small random perturbations. This perturbation is reminiscent of the small time-dependent perturbations employed in the dynamical systems literature to convert many types of chaos into attracting motions. We show that the perturbed mechanism provably converges in the case of first-price auctions and experimentally converges in the case of second-price auctions. Moreover, the point of convergence has a natural economic interpretation as the unique market equilibrium in the case of first-price mechanisms. In the case of second-price auctions, we conjecture that it converges to the "supply-aware" market equilibrium. Thus, our results can be alternatively described as a tatonnement process for convergence to market equilibrium in which prices are adjusted on the side of the buyers rather than the sellers. We also observe that perturbation in mechanism design is useful in a broader context: In general, it can allow bidders to "share" a particular item, leading to stable allocations and pricing for the bidders, and improved revenue for the auctioneer.

Knowledge discovery

Compare&contrast: using the web to discover comparable cases for news stories BIBAFull-Text 541-550
  Jiahui Liu; Earl Wagner; Larry Birnbaum
Comparing and contrasting is an important strategy people employ to understand new situations and create solutions for new problems. Similar events can provide hints for problem solving, as well as larger contexts for understanding the specific circumstances of an event. Lessons can be learned from past experience, insights can be gained about the new situation from familiar examples, and trends can be discovered among similar events. As the largest knowledge base for human beings, the Web provides both an opportunity and a challenge to discover comparable cases in order to facilitate situation analysis and problem solving. In this paper, we present Compare & Contrast, a system that uses the Web to discover comparable cases for news stories, documents about similar situations but involving distinct entities. The system analyzes a news story given by the user and builds a model of the story. With the story model, the system dynamically discovers entities comparable to the main entity in the original story and uses these comparable entities as seeds to retrieve web pages about comparable cases. The system is domain independent, does not require any domain-specific knowledge engineering efforts, and deals with the complexity of unstructured text and noise on the web in a robust way. We evaluated the system with an experiment on a collection of news articles and a user study.
Answering bounded continuous search queries in the world wide web BIBAFull-Text 551-560
  Dirk Kukulenz; Alexandros Ntoulas
Search queries applied to extract relevant information from the World Wide Web over a period of time may be denoted as continuous search queries. The improvement of continuous search queries may concern not only the quality of retrieved results but also the freshness of results, i.e. the time between the availability of a respective data object on the Web and the notification of a user by the search engine. In some cases a user should be notified immediately since the value of the respective information decreases quickly, e.g., news about companies that affect the value of respective stocks, or sales offers for products that may no longer be available after a short period of time.
   In the document filtering literature, the optimization of such queries is usually based on threshold classification. Documents above a quality threshold are returned to a user. The threshold is tuned in order to optimize the quality of retrieved results. The disadvantage of such approaches is that the amount of information returned to a user may hardly be controlled without further user-interaction. In this paper, we consider the optimization of bounded continuous search queries where only the estimated best k elements are returned to a user. We present a new optimization method for bounded continuous search queries based on the optimal stopping theory and compare the new method to methods currently applied by Web search systems. The new method provides results of significantly higher quality for the cases where very fresh results have to be delivered.
Answering relationship queries on the web BIBAFull-Text 561-570
  Gang Luo; Chunqiang Tang; Ying-li Tian
Finding relationships between entities on the Web, e.g., the connections between different places or the commonalities of people, is a novel and challenging problem. Existing Web search engines excel in keyword matching and document ranking, but they cannot handle many relationship queries well. This paper proposes a new method for answering relationship queries on two entities. Our method first retrieves the top Web pages for each entity from a Web search engine. It then matches these Web pages and generates an ordered list of Web page pairs. Each Web page pair consists of one Web page for each entity. The top-ranked Web page pairs are likely to contain the relationships between the two entities. One main challenge in the ranking process is to effectively filter out the large amount of noise in the Web pages without losing much useful information. To achieve this, our method assigns appropriate weights to terms in Web pages and intelligently identifies the potential connecting terms that capture the relationships between the two entities. Only those top potential connecting terms with large weights are used to rank Web page pairs. Finally, the top-ranked Web page pairs are presented to the searcher. For each such pair, the query terms and the top potential connecting terms are properly highlighted so that the relationships between the two entities can be easily identified. We implemented a prototype on top of the Google search engine and evaluated it under a wide variety of query scenarios. The experimental results show that our method is effective at finding important relationships with low overhead.
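The pair-ranking idea can be sketched in a few lines (an illustrative simplification, not the paper's weighting scheme): weight each page's terms by dampened term frequency, treat terms shared between a page for entity A and a page for entity B as potential connecting terms, and score each page pair by the weight of its strongest shared terms:

```python
import math
from collections import Counter

STOP = {"the", "a", "of", "and", "in", "to", "is"}

def term_weights(doc):
    """Dampened term-frequency weights for one page (stopwords removed)."""
    tf = Counter(t for t in doc if t not in STOP)
    return {t: 1.0 + math.log(c) for t, c in tf.items()}

def rank_page_pairs(pages_a, pages_b, top_k=5):
    """Score every (page_a, page_b) pair by the weight of its shared
    'connecting' terms; return (score, i, j) triples, best first."""
    scored = []
    for i, da in enumerate(pages_a):
        wa = term_weights(da)
        for j, db in enumerate(pages_b):
            wb = term_weights(db)
            shared = sorted((min(wa[t], wb[t]) for t in wa.keys() & wb.keys()),
                            reverse=True)[:top_k]
            scored.append((sum(shared), i, j))
    return sorted(scored, reverse=True)
```

A pair of pages that both discuss, say, a prize the two entities share will surface above pairs with no overlapping vocabulary.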

Personalization

Dynamic personalized pagerank in entity-relation graphs BIBAFull-Text 571-580
  Soumen Chakrabarti
Extractors and taggers turn unstructured text into entity-relation (ER) graphs where nodes are entities (email, paper, person, conference, company) and edges are relations (wrote, cited, works-for). Typed proximity search of the form type=person NEAR company~"IBM", paper~"XML" is an increasingly useful search paradigm in ER graphs. Proximity search implementations either perform a Pagerank-like computation at query time, which is slow, or precompute, store and combine per-word Pageranks, which can be very expensive in terms of preprocessing time and space. We present HubRank, a new system for fast, dynamic, space-efficient proximity searches in ER graphs. During preprocessing, HubRank computes and indexes certain "sketchy" random walk fingerprints for a small fraction of nodes, carefully chosen using query log statistics. At query time, a small "active" subgraph is identified, bordered by nodes with indexed fingerprints. These fingerprints are adaptively loaded to various resolutions to form approximate personalized Pagerank vectors (PPVs). PPVs at remaining active nodes are now computed iteratively. We report on experiments with CiteSeer's ER graph and millions of real CiteSeer queries. Some representative numbers follow. On our testbed, HubRank preprocesses and indexes 52 times faster than whole-vocabulary PPV computation. A text index occupies 56 MB. Whole-vocabulary PPVs would consume 102 GB. If PPVs are truncated to 56 MB, precision compared to true Pagerank drops to 0.55; in contrast, HubRank has precision 0.91 at 63 MB. HubRank's average query time is 200-300 milliseconds; query-time Pagerank computation takes 11 seconds on average.
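The personalized Pagerank vectors (PPVs) at the heart of this line of work can be computed by plain power iteration; a minimal sketch on a toy ER-style graph (HubRank's fingerprint machinery is precisely what avoids doing this at query time):

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration for a personalized PageRank vector:
    p = alpha * e_seeds + (1 - alpha) * W^T p, uniform out-edge weights,
    dangling mass restarted at the seed nodes."""
    nodes = list(adj)
    p = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    e = dict(p)
    for _ in range(iters):
        nxt = {u: alpha * e[u] for u in nodes}
        for u in nodes:
            out = adj[u]
            if not out:  # dangling node: send its mass back to the seeds
                for s in seeds:
                    nxt[s] += (1 - alpha) * p[u] / len(seeds)
            else:
                share = (1 - alpha) * p[u] / len(out)
                for v in out:
                    nxt[v] += share
        p = nxt
    return p

# toy ER graph: a person wrote two papers, both appeared at one conference
graph = {"alice": ["p1", "p2"], "p1": ["www"], "p2": ["www"], "www": []}
ppv = personalized_pagerank(graph, seeds={"alice"})
```

Nodes near the seed receive the most mass, which is what makes PPVs usable as proximity scores for typed NEAR queries.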
A large-scale evaluation and analysis of personalized search strategies BIBAFull-Text 581-590
  Zhicheng Dou; Ruihua Song; Ji-Rong Wen
Although personalized search has been proposed for many years and many personalization strategies have been investigated, it is still unclear whether personalization is consistently effective on different queries for different users, and under different search contexts. In this paper, we study this problem and get some preliminary conclusions. We present a large-scale evaluation framework for personalized search based on query logs, and then evaluate five personalized search strategies (including two click-based and three profile-based ones) using 12-day MSN query logs. By analyzing the results, we reveal that personalized search has significant improvement over common web search on some queries but it also has little effect on other queries (e.g., queries with small click entropy). It even harms search accuracy under some situations. Furthermore, we show that straightforward click-based personalization strategies perform consistently and considerably well, while profile-based ones are unstable in our experiments. We also reveal that both long-term and short-term contexts are very important in improving search performance for profile-based personalized search strategies.
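The click entropy used above to characterize queries is a simple statistic; a minimal sketch (the threshold for "small" entropy is the paper's empirical choice, not shown here):

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Click entropy of one query: H(q) = -sum_u P(u|q) * log2 P(u|q),
    over all clicks recorded for that query in the log.
    Low entropy (most users click the same result) suggests little room
    for personalization; high entropy suggests more."""
    counts = Counter(clicked_urls)
    n = len(clicked_urls)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A navigational query like "msn" typically has entropy near zero, while an ambiguous query spreads clicks across many URLs.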
Privacy-enhancing personalized web search BIBAFull-Text 591-600
  Yabo Xu; Ke Wang; Benyu Zhang; Zheng Chen
Personalized web search is a promising way to improve search quality by customizing search results for people with individual information goals. However, users are uncomfortable with exposing private preference information to search engines. On the other hand, privacy is not absolute, and often can be compromised if there is a gain in service or profitability to the user. Thus, a balance must be struck between search quality and privacy protection. This paper presents a scalable way for users to automatically build rich user profiles. These profiles summarize a user's interests into a hierarchical organization according to specific interests. Two parameters for specifying privacy requirements are proposed to help the user to choose the content and degree of detail of the profile information that is exposed to the search engine. Experiments showed that the user profile improved search quality when compared to standard MSN rankings. More importantly, results verified our hypothesis that a significant improvement on search quality can be achieved by only sharing some higher-level user profile information, which is potentially less sensitive than detailed personal information.

Defending against emerging threats

Defeating script injection attacks with browser-enforced embedded policies BIBAFull-Text 601-610
  Trevor Jim; Nikhil Swamy; Michael Hicks
Web sites that accept and display content such as wiki articles or comments typically filter the content to prevent injected script code from running in browsers that view the site. The diversity of browser rendering algorithms and the desire to allow rich content make filtering quite difficult, however, and attacks such as the Samy and Yamanner worms have exploited filtering weaknesses. This paper proposes a simple alternative mechanism for preventing script injection called Browser-Enforced Embedded Policies (BEEP). The idea is that a web site can embed a policy in its pages that specifies which scripts are allowed to run. The browser, which knows exactly when it will run a script, can enforce this policy perfectly. We have added BEEP support to several browsers, and built tools to simplify adding policies to web applications. We found that supporting BEEP in browsers requires only small and localized modifications, modifying web applications requires minimal effort, and enforcing policies is generally lightweight.
Subspace: secure cross-domain communication for web mashups BIBAFull-Text 611-620
  Collin Jackson; Helen J. Wang
Combining data and code from third-party sources has enabled a new wave of web mashups that add creativity and functionality to web applications. However, browsers are poorly designed to pass data between domains, often forcing web developers to abandon security in the name of functionality. To address this deficiency, we developed Subspace, a cross-domain communication mechanism that allows efficient communication across domains without sacrificing security. Our prototype requires only a small JavaScript library, and works across all major browsers. We believe Subspace can serve as a new secure communication primitive for web mashups.
Exposing private information by timing web applications BIBAFull-Text 621-628
  Andrew Bortz; Dan Boneh
We show that the time web sites take to respond to HTTP requests can leak private information, using two different types of attacks. The first, direct timing, directly measures response times from a web site to expose private information such as the validity of a username at a secured site or the number of private photos in a publicly viewable gallery. The second, cross-site timing, enables a malicious web site to obtain information from the user's perspective at another site. For example, a malicious site can learn if the user is currently logged in at a victim site and, in some cases, the number of objects in the user's shopping cart. Our experiments suggest that these timing vulnerabilities are wide-spread. We explain in detail how and why these attacks work, and discuss methods for writing web application code that resists these attacks.
On anonymizing query logs via token-based hashing BIBAFull-Text 629-638
  Ravi Kumar; Jasmine Novak; Bo Pang; Andrew Tomkins
In this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token. We show that statistical techniques may be applied to partially compromise the anonymization. We then analyze the specific risks that arise from these partial compromises, focusing on the revelation of identity from unambiguous names, addresses, and so forth, and the revelation of facts associated with an identity that are deemed to be highly sensitive. Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete analysis of specific techniques that may be effective in breaching privacy, against which other anonymization schemes should be measured.
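The statistical attack can be sketched in its simplest form: hashing preserves each token's frequency, so ranking hashed tokens and reference-corpus tokens by frequency and aligning the two rankings recovers much of the mapping. This toy uses the raw log itself as the attacker's reference corpus purely for illustration; a real attacker would use any corpus with similar query statistics:

```python
import hashlib
from collections import Counter

def hash_token(tok, salt=b"secret"):
    """The 'anonymizing' secure hash applied per token."""
    return hashlib.sha256(salt + tok.encode()).hexdigest()[:12]

# the published "anonymized" log: every token replaced by its hash
raw_queries = ["cheap flights", "cheap hotels", "flights to banff",
               "cheap flights", "banff hotels"]
anon_log = [[hash_token(t) for t in q.split()] for q in raw_queries]

# attacker side: align frequency rankings of hashed and reference tokens
hashed_freq = Counter(t for q in anon_log for t in q)
reference_freq = Counter(t for q in raw_queries for t in q.split())
guess = {h: w for (h, _), (w, _) in zip(hashed_freq.most_common(),
                                        reference_freq.most_common())}
```

The point of the paper is that this works far too well in practice: frequency is a fingerprint the hash cannot hide.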

Passwords and phishing

Cantina: a content-based approach to detecting phishing web sites BIBAFull-Text 639-648
  Yue Zhang; Jason I. Hong; Lorrie F. Cranor
Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF information retrieval algorithm. We also discuss the design and evaluation of several heuristics we developed to reduce false positives. Our experiments show that CANTINA is good at detecting phishing sites, correctly labeling approximately 95% of phishing sites.
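The TF-IDF core of the approach can be sketched as follows (an illustration with toy data, not CANTINA's implementation): rank a suspect page's terms by TF-IDF against a background corpus, take the top terms as a "lexical signature" query, and check whether the suspect site appears among a search engine's results for that query.

```python
import math
from collections import Counter

def top_tfidf_terms(page_tokens, corpus, k=5):
    """Top-k terms of a page by TF-IDF against a background corpus.
    CANTINA-style detection submits these terms to a search engine;
    a legitimate site tends to rank for its own signature, a spoof does not."""
    n_docs = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    tf = Counter(page_tokens)
    scores = {t: c * math.log(n_docs / (1 + df[t])) for t, c in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

On a spoofed login page, brand terms like "paypal" dominate the signature while being rare in the background corpus, which is exactly what makes the signature discriminative.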
Learning to detect phishing emails BIBAFull-Text 649-656
  Ian Fette; Norman Sadeh; Anthony Tomasic
Each month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. This attack method, commonly known as "phishing," is most commonly initiated by sending out emails with links to spoofed websites that harvest information. We present a method for detecting these attacks, which in its most general form is an application of machine learning on a feature set designed to highlight user-targeted deception in electronic communication. This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites. We evaluate this method on a set of approximately 860 such phishing emails, and 6950 non-phishing emails, and correctly identify over 96% of the phishing emails while mis-classifying only on the order of 0.1% of the legitimate emails. We conclude with thoughts on the future of such techniques for identifying deception, particularly with respect to the evolutionary nature of the attacks and the information available.
A large-scale study of web password habits BIBAFull-Text 657-666
  Dinei Florencio; Cormac Herley
We report the results of a large-scale study of password use and password re-use habits. The study involved half a million users over a three-month period. A client component on users' machines recorded a variety of password strength, usage and frequency metrics. This allows us to measure or estimate such quantities as the average number of passwords and average number of accounts each user has, how many passwords she types per day, how often passwords are shared among sites, and how often they are forgotten. We get extremely detailed data on password strength, the types and lengths of passwords chosen, and how they vary by site. This is the first large-scale study of its kind, and it yields numerous other insights into the role passwords play in users' online experience.

Access control and trust on the web

A fault model and mutation testing of access control policies BIBAFull-Text 667-676
  Evan Martin; Tao Xie
To increase confidence in the correctness of specified policies, policy developers can conduct policy testing by supplying typical test inputs (requests) and subsequently checking test outputs (responses) against expected ones. Unfortunately, manual testing is tedious and few tools exist for automated testing of access control policies. We present a fault model for access control policies and a framework to explore it. The framework includes mutation operators used to implement the fault model, mutant generation, equivalent-mutant detection, and mutant-killing determination. This framework allows us to investigate our fault model, evaluate coverage criteria for test generation and selection, and determine a relationship between structural coverage and fault-detection effectiveness. We have implemented the framework and applied it to various policies written in XACML. Our experimental results offer valuable insights into choosing mutation operators in mutation testing and choosing coverage criteria in test generation and selection.
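The mutation-testing loop can be sketched on a toy rule list (an illustration, not the paper's XACML framework; the rule-effect-flip operator is one classic choice among its mutation operators):

```python
def evaluate(policy, request):
    """First-applicable evaluation of a toy policy: each rule is a
    (condition-dict, effect) pair; returns 'Permit', 'Deny', or 'NotApplicable'."""
    for cond, effect in policy:
        if all(request.get(k) == v for k, v in cond.items()):
            return effect
    return "NotApplicable"

def rule_effect_flip_mutants(policy):
    """Mutation operator: flip each rule's effect in turn, one mutant per rule."""
    flip = {"Permit": "Deny", "Deny": "Permit"}
    for i, (cond, effect) in enumerate(policy):
        yield policy[:i] + [(cond, flip[effect])] + policy[i + 1:]

def mutation_score(policy, tests):
    """Fraction of mutants 'killed', i.e. distinguished by some test request."""
    mutants = list(rule_effect_flip_mutants(policy))
    killed = sum(any(evaluate(m, r) != evaluate(policy, r) for r in tests)
                 for m in mutants)
    return killed / len(mutants)

policy = [({"role": "admin"}, "Permit"), ({"role": "guest"}, "Deny")]
tests = [{"role": "admin"}]  # exercises only the first rule
```

A surviving mutant points at a rule no test request reaches, which is how mutation score exposes gaps in a request suite.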
Analyzing web access control policies BIBAFull-Text 677-686
  Vladimir Kolovski; James Hendler; Bijan Parsia
XACML has emerged as a popular access control language on the Web, but because of its rich expressiveness, it has proved difficult to analyze in an automated fashion. In this paper, we present a formalization of XACML using description logics (DL), which are a decidable fragment of first-order logic. This formalization allows us to cover a more expressive subset of XACML than propositional logic-based analysis tools, and in addition we provide a new analysis service (policy redundancy). Also, mapping XACML to description logics allows us to use off-the-shelf DL reasoners for analysis tasks such as policy comparison, verification and querying. We provide an empirical evaluation of a policy analysis tool that was implemented on top of the open-source DL reasoner Pellet.
Compiling cryptographic protocols for deployment on the web BIBAFull-Text 687-696
  Jay A. McCarthy; Shriram Krishnamurthi; Joshua D. Guttman; John D. Ramsdell
Cryptographic protocols are useful for trust engineering in Web transactions. The Cryptographic Protocol Programming Language (CPPL) provides a model wherein trust management annotations are attached to protocol actions, and are used to constrain the behavior of a protocol participant to be compatible with its own trust policy.
   The first implementation of CPPL generated stand-alone, single-session servers, making it unsuitable for deploying protocols on the Web. We describe a new compiler that uses a constraint-based analysis to produce multi-session server programs. The resulting programs run without persistent TCP connections for deployment on traditional Web servers. Most importantly, the compiler preserves existing proofs about the protocols. We present an enhanced version of the CPPL language, discuss the generation and use of constraints, show their use in the compiler, formalize the preservation of properties, present subtleties, and outline implementation details.

Ontologies

Yago: a core of semantic knowledge BIBAFull-Text 697-706
  Fabian M. Suchanek; Gjergji Kasneci; Gerhard Weikum
We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships -- and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.
Ontology summarization based on RDF sentence graph BIBAFull-Text 707-716
  Xiang Zhang; Gong Cheng; Yuzhong Qu
Ontology summarization is very important for the quick understanding and selection of ontologies. In this paper, we study extractive summarization of ontology. We propose a notion of RDF sentence as the basic unit of summarization. An RDF Sentence Graph is proposed to characterize the links between RDF sentences derived from a given ontology. The salience of each RDF sentence is assessed in terms of its "centrality" in the graph. We propose to summarize an ontology by extracting a set of salient RDF sentences according to a re-ranking strategy. We compare several measurements in assessing the salience of RDF sentences and give an overall evaluation of experiment results, which shows that our approach to ontology summarization is feasible.
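The centrality-based extraction can be sketched with the simplest salience measure, normalized degree, on a toy sentence graph (the paper compares several centrality measures and adds a re-ranking step; the linking criterion here is a hypothetical simplification):

```python
def degree_centrality(graph):
    """Salience of each RDF sentence as its normalized degree in the
    sentence graph (fraction of other sentences it links to)."""
    n = len(graph)
    return {u: len(nbrs) / (n - 1) for u, nbrs in graph.items()}

def summarize(graph, k=2):
    """Extract the k most salient RDF sentences, best first."""
    c = degree_centrality(graph)
    return sorted(graph, key=lambda u: -c[u])[:k]

# toy sentence graph: nodes are RDF sentences, undirected edges stand for
# shared blank nodes / terms between sentences (a simplified link notion)
g = {"s1": {"s2", "s3", "s4"}, "s2": {"s1"},
     "s3": {"s1", "s4"}, "s4": {"s1", "s3"}}
```

Sentences that share structure with many others float to the top, which is the intuition behind using graph centrality as a stand-in for importance.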
Just the right amount: extracting modules from ontologies BIBAFull-Text 717-726
  Bernardo Cuenca Grau; Ian Horrocks; Yevgeny Kazakov; Ulrike Sattler
The ability to extract meaningful fragments from an ontology is key for ontology re-use. We propose a definition of a module that guarantees to completely capture the meaning of a given set of terms, i.e., to include all axioms relevant to the meaning of these terms, and study the problem of extracting minimal modules. We show that the problem of determining whether a subset of an ontology is a module for a given vocabulary is undecidable even for rather restricted sub-languages of OWL DL. Hence we propose two "approximations", i.e., alternative definitions of modules for a vocabulary that still provide the above guarantee, but that are possibly too strict, and that may thus result in larger modules: the first approximation is semantic and can be computed using existing DL reasoners; the second is syntactic, and can be computed in polynomial time. Finally, we report on an empirical evaluation of our syntactic approximation which demonstrates that the modules we extract are surprisingly small.

Applications

Toward expressive syndication on the web BIBAFull-Text 727-736
  Christian Halaschek-Wiener; James Hendler
Syndication systems on the Web have attracted vast amounts of attention in recent years. As technologies have emerged and matured, there has been a transition to more expressive syndication approaches; that is, subscribers and publishers are provided with more expressive means of describing their interests and published content, enabling more accurate information filtering. In this paper, we formalize a syndication architecture that utilizes expressive Web ontologies and logic-based reasoning for selective content dissemination. This provides finer grained control for filtering and automated reasoning for discovering implicit subscription matches, both of which are not achievable in less expressive approaches. We then address one of the main limitations with such a syndication approach, namely matching newly published information with subscription requests in an efficient and practical manner. To this end, we investigate continuous query answering for a large subset of the Web Ontology Language (OWL); specifically, we formally define continuous queries for OWL knowledge bases and present a novel algorithm for continuous query answering in a large subset of this language. Lastly, an evaluation of the query approach is shown, demonstrating its effectiveness for syndication purposes.
Exhibit: lightweight structured data publishing BIBAFull-Text 737-746
  David F. Huynh; David R. Karger; Robert C. Miller
The early Web was hailed for giving individuals the same publishing power as large content providers. But over time, large content providers learned to exploit the structure in their data, leveraging databases and server side technologies to provide rich browsing and visualization. Individual authors fall behind once more: neither old-fashioned static pages nor domain-specific publishing frameworks supporting limited customization can match custom database-backed web applications.
   In this paper, we propose Exhibit, a lightweight framework for publishing structured data on standard web servers that requires no installation, database administration, or programming. Exhibit lets authors with relatively limited skills (those same enthusiasts who could write HTML pages for the early Web) publish richly interactive pages that exploit the structure of their data for better browsing and visualization. Such structured publishing in turn makes that data more useful to all of its consumers: individual readers get more powerful interfaces, mashup creators can more easily repurpose the data, and Semantic Web enthusiasts can feed the data to the nascent Semantic Web.
Explorations in the use of semantic web technologies for product information management BIBAFull-Text 747-756
  Jean-Sébastien Brunner; Li Ma; Chen Wang; Lei Zhang; Daniel C. Wolfson; Yue Pan; Kavitha Srinivas
Master data refers to core business entities a company uses repeatedly across many business processes and systems (such as lists or hierarchies of customers, suppliers, accounts, products, or organizational units). Product information is the most important kind of master data, and product information management (PIM) is becoming critical for modern enterprises because it provides a rich business context for various applications. Existing PIM systems are insufficiently flexible and scalable for on-demand business, and too weak to completely capture and use the semantics of master data. This paper explores how to use semantic web technologies to enhance a collaborative PIM system by simplifying modeling and representation while preserving enough dynamic flexibility. Furthermore, we build a semantic PIM system using one of the state-of-the-art ontology repositories and summarize the challenges we encountered based on our experimental results, especially on performance and scalability. We believe that our study and experiences are valuable for both the semantic web and master data management communities.

Similarity and extraction

Measuring semantic similarity between words using web search engines BIBAFull-Text 757-766
  Danushka Bollegala; Yutaka Matsuo; Mitsuru Ishizuka
Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in semantic web-related applications such as community mining, relation extraction, and automatic metadata extraction has used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. We propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We define various similarity scores for two given words P and Q, using the page counts for the queries P, Q and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets. These different similarity scores are integrated using support vector machines to produce a robust semantic similarity measure. Experimental results on the Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a community mining task, and in an entity disambiguation task, thereby verifying the capability of the proposed measure to capture semantic similarity using web content.
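Two of the page-count scores described above can be sketched directly; these are simplified versions (the paper adds a low-count cutoff and further variants, and combines the scores with snippet-pattern features in an SVM):

```python
import math

def web_jaccard(f_p, f_q, f_pq):
    """Jaccard coefficient from page counts:
    f(P AND Q) / (f(P) + f(Q) - f(P AND Q))."""
    return f_pq / (f_p + f_q - f_pq) if f_pq else 0.0

def web_pmi(f_p, f_q, f_pq, n):
    """Pointwise mutual information from page counts, over an index of n pages:
    log2( (f(P AND Q)/n) / ((f(P)/n) * (f(Q)/n)) )."""
    if f_pq == 0:
        return 0.0
    return math.log2((f_pq / n) / ((f_p / n) * (f_q / n)))

# hypothetical page counts for P="apple", Q="fruit", and "apple AND fruit"
sim_jaccard = web_jaccard(5e6, 8e6, 2e6)
sim_pmi = web_pmi(5e6, 8e6, 2e6, 10**10)
```

Each score alone is noisy; the paper's contribution is precisely that an SVM over several such scores plus lexico-syntactic pattern features is far more robust than any single one.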
Using Google distance to weight approximate ontology matches BIBAFull-Text 767-776
  Risto Gligorov; Warner ten Kate; Zharko Aleksovski; Frank van Harmelen
Discovering mappings between concept hierarchies is widely regarded as one of the hardest and most urgent problems facing the Semantic Web. The problem is even harder in domains where concepts are inherently vague and ill-defined, and cannot be given a crisp definition. A notion of approximate concept mapping is required in such domains, but until now, no such notion has been available.
   The first contribution of this paper is a definition for approximate mappings between concepts. Roughly, a mapping between two concepts is decomposed into a number of submappings, and a sloppiness value determines the fraction of these submappings that can be ignored when establishing the mapping.
   A potential problem of such a definition is that with an increasing sloppiness value, it will gradually allow mappings between any two arbitrary concepts. To improve on this trivial behaviour, we need to design a heuristic weighting which minimises the sloppiness required to conclude desirable matches, but at the same time maximises the sloppiness required to conclude undesirable matches. The second contribution of this paper is to show that a Google based similarity measure has exactly these desirable properties.
   We establish these results by experimental validation in the domain of musical genres. We show that this domain does suffer from ill-defined concepts. We take two real-life genre hierarchies from the Web, we compute approximate mappings between them at varying levels of sloppiness, and we validate our results against a handcrafted Gold Standard.
   Our method makes use of the huge amount of knowledge that is implicit in the current Web, and exploits this knowledge as a heuristic for establishing approximate mappings between ill-defined concepts.
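The Google-based measure the paper builds on is commonly formulated as the Normalized Google Distance (Cilibrasi and Vitányi); whether the authors use exactly this formula is an assumption here, but the sketch shows how raw hit counts turn into a distance between two terms.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx, fy, fxy (co-occurrence)
    and n, the (assumed) total number of indexed pages. 0 = identical usage."""
    lx, ly, lxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# hypothetical hit counts for two musical-genre labels
print(round(ngd(9_000_000, 8_000_000, 4_000_000, 10**10), 3))
```

A low distance between two genre concepts would then lower the sloppiness needed to accept a mapping between them, which is the weighting behaviour the paper argues for.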
Hierarchical, perceptron-like learning for ontology-based information extraction BIBAFull-Text 777-786
  Yaoyong Li; Kalina Bontcheva
Recent work on ontology-based Information Extraction (IE) has tried to make use of knowledge from the target ontology in order to improve semantic annotation results. However, very few approaches exploit the ontology structure itself, and those that do so, have some limitations. This paper introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process, by taking into account the relations between concepts. The approach is evaluated on the largest available semantically annotated corpus. The results demonstrate clearly the benefits of using knowledge from the ontology as input to the information extraction process. We also demonstrate the advantages of our approach over other state-of-the-art learning systems on a commonly used benchmark dataset.

Query languages and DBs

From SPARQL to rules (and back) BIBAFull-Text 787-796
  Axel Polleres
As the data and ontology layers of the Semantic Web stack have achieved a certain level of maturity in standard recommendations such as RDF and OWL, the current focus lies on two related aspects. On the one hand, the definition of a suitable query language for RDF, SPARQL, is close to recommendation status within the W3C. On the other hand, the establishment of the rules layer on top of the existing stack marks the next step to be taken, where languages with their roots in logic programming and deductive databases are receiving considerable attention. The purpose of this paper is threefold. First, we discuss the formal semantics of SPARQL, extending recent results in several ways. Second, we provide translations from SPARQL to Datalog with negation as failure. Third, we propose some useful and easy-to-implement extensions of SPARQL based on this translation. As it turns out, the combination serves both for direct implementations of SPARQL on top of existing rules engines and as a basis for more general rules and query languages on top of RDF.
SPARQ2L: towards support for subgraph extraction queries in RDF databases BIBAFull-Text 797-806
  Kemafor Anyanwu; Angela Maduko; Amit Sheth
Many applications in analytical domains often have the need to "connect the dots", i.e., to query about the structure of data. In bioinformatics, for example, it is typical to want to query about interactions between proteins. The aim of such queries is to "extract" relationships between entities, i.e. paths, from a data graph. Often, such queries will specify certain constraints that qualifying results must satisfy, e.g. paths involving a set of mandatory nodes. Unfortunately, most present-day Semantic Web query languages, including the current draft of the anticipated recommendation SPARQL, lack the ability to express queries about arbitrary path structures in data. In addition, many systems that support some limited form of path queries rely on main-memory graph algorithms, limiting their applicability to very large graphs.
   In this paper, we present an approach for supporting path extraction queries. Our proposal comprises (i) a query language, SPARQ2L, which extends SPARQL with path variables and path variable constraint expressions, and (ii) a novel query evaluation framework based on efficient algebraic techniques for solving path problems, which allows path queries to be efficiently evaluated on disk-resident RDF graphs. The effectiveness of our proposal is demonstrated by a performance evaluation of our approach on both real-world and synthetic datasets.
Bridging the gap between OWL and relational databases BIBAFull-Text 807-816
  Boris Motik; Ian Horrocks; Ulrike Sattler
Schema statements in OWL are interpreted quite differently from analogous statements in relational databases. If these statements are meant to be interpreted as integrity constraints (ICs), OWL's interpretation may seem confusing and/or inappropriate. Therefore, we propose an extension of OWL with ICs that captures the intuition behind ICs in relational databases. We discuss the algorithms for checking IC satisfaction for different types of knowledge bases, and show that, if the constraints are satisfied, we can disregard them while answering a broad range of positive queries.
ActiveRDF: object-oriented semantic web programming BIBAFull-Text 817-824
  Eyal Oren; Renaud Delbru; Sebastian Gerke; Armin Haller; Stefan Decker
Object-oriented programming is the current mainstream programming paradigm but existing RDF APIs are mostly triple-oriented. Traditional techniques for bridging a similar gap between relational databases and object-oriented programs cannot be applied directly given the different nature of Semantic Web data, for example in the semantics of class membership, inheritance relations, and object conformance to schemas.
   We present ActiveRDF, an object-oriented API for managing RDF data that offers full manipulation and querying of RDF data, does not rely on a schema and fully conforms to RDF(S) semantics. ActiveRDF can be used with different RDF data stores: adapters have been implemented to generic SPARQL endpoints, Sesame, Jena, Redland and YARS and new adapters can be added easily. In addition, integration with the popular Ruby on Rails framework enables rapid development of Semantic Web applications.
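As a rough illustration of the triple-to-object mapping idea: ActiveRDF itself is a Ruby API, so the class and predicate names below are invented for this sketch, which only shows how predicate lookups can be resolved dynamically against an RDF graph instead of a fixed schema.

```python
class Resource:
    """Minimal sketch of schema-free, triple-backed objects in the spirit of
    ActiveRDF (illustrative only; not the library's actual API)."""

    def __init__(self, uri, triples):
        self._uri = uri
        self._triples = triples  # list of (subject, predicate, object)

    def __getattr__(self, predicate):
        # Resolve attribute access by querying the triple set at call time.
        values = [o for s, p, o in self._triples
                  if s == self._uri and p == predicate]
        if not values:
            raise AttributeError(predicate)
        return values if len(values) > 1 else values[0]

triples = [("ex:eyal", "name", "Eyal Oren"),
           ("ex:eyal", "knows", "ex:armin")]
person = Resource("ex:eyal", triples)
print(person.name)   # predicate "name" resolved against the graph
```

Because lookups happen at access time, no class definition has to be regenerated when the underlying RDF data or vocabulary changes — the property ActiveRDF's schema-free design is after.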

Semantic web and web 2.0

The two cultures: mashing up web 2.0 and the semantic web BIBAFull-Text 825-834
  Anupriya Ankolekar; Markus Krötzsch; Thanh Tran; Denny Vrandecic
A common perception is that there are two competing visions for the future evolution of the Web: the Semantic Web and Web 2.0. A closer look, though, reveals that the core technologies and concerns of these two approaches are complementary and that each field can and must draw from the other's strengths. We believe that future web applications will retain the Web 2.0 focus on community and usability, while drawing on Semantic Web infrastructure to facilitate mashup-like information sharing. However, there are several open issues that must be addressed before such applications can become commonplace. In this paper, we outline a semantic weblogs scenario that illustrates the potential for combining Web 2.0 and Semantic Web technologies, while highlighting the unresolved issues that impede its realization. Nevertheless, we believe that the scenario can be realized in the short term. We point to recent progress made in resolving each of the issues as well as future research directions for each of the communities.
Analysis of topological characteristics of huge online social networking services BIBAFull-Text 835-844
  Yong-Yeol Ahn; Seungyeop Han; Haewoon Kwak; Sue Moon; Hawoong Jeong
Social networking services are a fast-growing business on the Internet. However, it is unknown whether online relationships and their growth patterns are the same as in real-life social networks. In this paper, we compare the structures of three online social networking services: Cyworld, MySpace, and orkut, each with more than 10 million users. We have access to complete data on Cyworld's ilchon (friend) relationships and analyze its degree distribution, clustering property, degree correlation, and evolution over time. We also use the Cyworld data to evaluate the validity of the snowball sampling method, which we use to crawl and obtain partial network topologies of MySpace and orkut. Cyworld, the oldest of the three, demonstrates a changing scaling behavior over time in its degree distribution. The latest Cyworld data's degree distribution exhibits a multi-scaling behavior, while those of MySpace and orkut have simple scaling behaviors with different exponents. Very interestingly, each of the two exponents corresponds to a different segment in Cyworld's degree distribution. Certain online social networking services encourage online activities that cannot be easily copied in real life; we show that they deviate from close-knit online social networks, which show a degree correlation pattern similar to real-life social networks.
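The degree distribution analyzed in the paper is a simple empirical statistic; a minimal sketch over a toy friendship graph (illustrative only — the paper's graphs have tens of millions of nodes, and scaling behavior is read off a log-log plot of this distribution):

```python
from collections import Counter

def degree_distribution(edges):
    """Empirical degree distribution P(k) of an undirected graph,
    given as a list of (u, v) edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())   # how many nodes have each degree k
    n = len(degree)
    return {k: c / n for k, c in sorted(counts.items())}

# toy friendship graph with five users
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5)]
print(degree_distribution(edges))
```

A power-law (simple scaling) distribution appears as a straight line on log-log axes; the multi-scaling the authors report for Cyworld would show up as two straight segments with different slopes.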
P-TAG: large scale automatic generation of personalized annotation tags for the web BIBAFull-Text 845-854
  Paul-Alexandru Chirita; Stefania Costache; Wolfgang Nejdl; Siegfried Handschuh
The success of the Semantic Web depends on the availability of Web pages annotated with metadata. Free-form metadata or tags, as used in social bookmarking and folksonomies, have become more and more popular and successful. Such tags are relevant keywords associated with or assigned to a piece of information (e.g., a Web page), describing the item and enabling keyword-based classification. In this paper we propose P-TAG, a method which automatically generates personalized tags for Web pages. Upon browsing a Web page, P-TAG produces keywords relevant not only to its textual content but also to the data residing on the surfer's desktop, thus expressing a personalized viewpoint. Empirical evaluations with several algorithms pursuing this approach showed very promising results. We are therefore very confident that such a user-oriented automatic tagging approach can provide large-scale personalized metadata annotations as an important step towards realizing the Semantic Web.
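The core idea — keep page keywords that also matter to the user's desktop corpus — can be sketched very simply. This is an assumed simplification, not P-TAG's actual algorithms (which use several keyword-extraction and similarity techniques):

```python
import re
from collections import Counter

def top_terms(text, k):
    """Naive term extraction: most frequent lowercase words of 3+ letters."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w, _ in Counter(words).most_common(k)]

def personalized_tags(page_text, desktop_texts, k=5):
    """Sketch of a P-TAG-like filter: keep page keywords that also occur
    in the user's desktop documents (illustrative only)."""
    desktop_vocab = set()
    for doc in desktop_texts:
        desktop_vocab.update(top_terms(doc, 50))
    return [t for t in top_terms(page_text, 20) if t in desktop_vocab][:k]

page = "semantic web metadata annotation tags for web pages"
desktop = ["notes on semantic web research", "annotation tools survey"]
print(personalized_tags(page, desktop))
```

Here "metadata" and "tags" are dropped because they never appear on the hypothetical desktop, giving a personalized rather than purely content-based tag set.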

Communication in developing regions

Connecting the "bottom of the pyramid": an exploratory case study of India's rural communication environment BIBAFull-Text 855-862
  Sarita Seshagiri; Sagar Aman; Dhaval Joshi
This paper is based on our exploratory study of a South Indian village in the Chamrajanagar district of Karnataka. The study sought to understand the rural communication environment and villagers' communication preferences. We examined people's lifestyle, working conditions and their communication eco-system. Our study revealed that villagers, unlike urban inhabitants, interacted with people outside the village only for specific, rather than casual, purposes. Another interesting aspect of rural communication was the marginal use of the postal system and the ubiquitous use of the pay phone, apart from word of mouth and face-to-face interactions. In fact, personal (face-to-face) interaction was usually preferred among villagers in this region over other kinds of communication, despite infrastructural constraints like poor transport services.
   We also observed that communication frequency increased when the status quo changed to one that required immediate attention. During the analysis we identified certain social, economic and cultural communication gaps (or problems). However, these problems were clear opportunities to connect the unconnected rural users by deploying new communication systems and features. Here, we have highlighted some of our findings and possible design avenues based on them.
Communication as information-seeking: the case for mobile social software for developing regions BIBAFull-Text 863-872
  Beth E. Kolko; Emma J. Rose; Erica J. Johnson
In this paper, we describe several findings from a multi-year, multi-method study of how information and communication technologies have been adopted and adapted in Central Asia. We have found that mobile phone usage is outpacing the rate of Internet adoption, that access to the Internet is primarily through public access sites carrying with it issues regarding privacy and surveillance, that people rely on their social networks as information sources, that public institutions tend to be fairly weak as citizen resources, and that information seeking and communication are conflated in people's usage patterns with different technologies. In addition, in the developed world social networking software has grown rapidly and shown itself to have significant potential for mobilizing a population. Based on the collection of findings from Central Asia and observing patterns of technology usage in other parts of the world, our research leads to the conclusion that exploring mobile social software holds significant potential as an ICT that meshes well with preexisting patterns of communication and information seeking and also leverages the most predominant pattern of technology adoption. Many of the findings from this research echo results from studies in other geographic areas, and so we anticipate that much of this research will be relevant to developing regions generally.
Optimal audio-visual representations for illiterate users of computers BIBAFull-Text 873-882
  Indrani Medhi; Archana Prasad; Kentaro Toyama
We present research leading toward an understanding of the optimal audio-visual representation for illustrating concepts for illiterate and semi-literate users of computers. In our user study, which to our knowledge is the first of its kind, we presented to 200 illiterate subjects each of 13 different health symptoms in one representation randomly selected among the following ten: text, static drawings, static photographs, hand-drawn animations, and video, each with and without voice annotation. The goal was to see how comprehensible these representation types were for an illiterate audience. We used a methodology for generating each of the representations tested in a way that fairly stacks one representational type against the others.
   Our main results are that (1) voice annotation generally helps in speed of comprehension, but bimodal audio-visual information can be confusing for the target population; (2) richer information is not necessarily better understood overall; (3) the relative value of dynamic imagery versus static imagery depends on various factors. Analysis of these statistically significant results and additional detailed results are also provided.

Networking issues in the web

Identifying and discriminating between web and peer-to-peer traffic in the network core BIBAFull-Text 883-892
  Jeffrey Erman; Anirban Mahanti; Martin Arlitt; Carey Williamson
Traffic classification is the ability to identify and categorize network traffic by application type. In this paper, we consider the problem of traffic classification in the network core.
   Classification at the core is challenging because only partial information about the flows and their contributors is available. We address this problem by developing a framework that can classify a flow using only unidirectional flow information. We evaluated this approach using recent packet traces that we collected and pre-classified to establish a "base truth". From our evaluation, we find that flow statistics for the server-to-client direction of a TCP connection provide greater classification accuracy than the flow statistics for the client-to-server direction. Because collection of the server-to-client flow statistics may not always be feasible, we developed and validated an algorithm that can estimate the missing statistics from a unidirectional packet trace.
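To make the idea of classifying from unidirectional flow statistics concrete, here is a toy rule-based sketch. The thresholds and feature names are invented for illustration; the paper's actual framework learns its classification from pre-classified traces rather than using fixed rules:

```python
def classify_flow(avg_pkt_size, duration, bytes_total):
    """Toy sketch: label a flow from server-to-client statistics alone.
    Thresholds are illustrative assumptions, not values from the paper."""
    if avg_pkt_size > 1000 and duration < 60:
        return "web"      # short bursts of near-MTU packets: typical HTTP
    if duration > 300 and bytes_total > 10_000_000:
        return "p2p"      # long-lived, high-volume transfer: typical P2P
    return "unknown"

# a short flow of large packets, as a web download might look
print(classify_flow(avg_pkt_size=1400, duration=12, bytes_total=800_000))
```

The point of the sketch is only that the features are computable from one direction of the connection, which is what makes classification feasible in the network core.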
Long distance wireless mesh network planning: problem formulation and solution BIBAFull-Text 893-902
  Sayandeep Sen; Bhaskaran Raman
Several research efforts as well as deployments have chosen IEEE 802.11 as a low-cost, long-distance access technology to bridge the digital divide. In this paper, we consider the important issue of planning such networks to minimize the system cost. This is a non-trivial task since it involves several sets of variables: the network topology, tower heights, antenna types to be used and their orientations, and radio transmit powers. The task is further complicated by the presence of network performance constraints and the inter-dependence among the variables. Our first contribution in this paper is the formulation of this problem in terms of the variables, constraints and the optimization criterion. Our second contribution is in identifying the dependencies among the variables and breaking down the problem into four tractable sub-parts. In this process, we extensively use domain knowledge to strike a balance between tractability and practicality.
   We have evaluated the proposed algorithms using random input sets as well as real-life instances with success. We have been able to show detailed planning of the network topology, required tower heights, antenna types, and transmit powers for the Ashwini project, a long-distance WiFi network under deployment in Andhra Pradesh, India. In this case, we are able to achieve within 2% additional cost of a lower-bound estimate.
Is high-quality VoD feasible using P2P swarming? BIBAFull-Text 903-912
  Siddhartha Annapureddy; Saikat Guha; Christos Gkantsidis; Dinan Gunawardena; Pablo Rodriguez Rodriguez
Peer-to-peer technologies are increasingly becoming the medium of choice for delivering media content, both professional and home-grown, to large user populations. Indeed, current P2P swarming systems have been shown to be very efficient for large-scale content distribution with few server resources.
   However, such systems have been designed for generic file distribution and provide a limited user experience for viewing media content. For example, users need to wait to download the full video before they can start watching it. In general, the main challenge resides in designing systems that ensure that users can start watching a movie at any point in time, with small start-up times and sustainable playback rates.
   In this work, we address the issues of providing Video-on-Demand (VoD) using P2P mesh-based networks. We show that providing high-quality VoD using P2P is feasible using a combination of techniques including (a) network coding, (b) optimized resource allocation across different parts of the video, and (c) overlay topology management algorithms. Our evaluation also shows that systems that do not use these techniques and do not optimize along all of these dimensions can significantly under-utilize the network resources and deliver poor VoD performance. Our results are based on simulations and on a prototype implementation.

Web modeling

Turning portlets into services: the consumer profile BIBAFull-Text 913-922
  Oscar Díaz; Salvador Trujillo; Sandy Pérez
Portlets strive to play at the front end the same role that Web services currently enjoy at the back end, namely, enablers of application assembly through reusable services. However, it is well known in the component community that the larger the component, the more reduced the reuse. Hence, the coarse-grained nature of portlets (they also encapsulate the presentation layer) can jeopardize this vision of portlets as reusable services. To avoid this situation, this work proposes a perspective shift in portlet development by introducing the notion of the Consumer Profile. While the user profile characterizes the end user (e.g. age, name, etc.), the Consumer Profile captures the idiosyncrasies of the organization through which the portlet is being delivered (e.g. the portal owner) as far as the portlet functionality is concerned. The user profile can be dynamic and hence requires the portlet to be customized at runtime. By contrast, the Consumer Profile is known at registration time, and it is not always appropriate or possible to consider it at runtime. Rather, it is better to customize the code at development time and produce an organization-specific portlet with built-in, custom functionality. In this scenario, we no longer have a portlet but a family of portlets, and the portlet provider becomes the "assembly line" of this family. This work promotes this vision by introducing an organization-aware, WSRP-compliant architecture that lets portlet consumers register and handle "family portlets" in the same way as traditional portlets. In so doing, portlets come nearer to becoming truly reusable services.
A framework for rapid integration of presentation components BIBAFull-Text 923-932
  Jin Yu; Boualem Benatallah; Regis Saint-Paul; Fabio Casati; Florian Daniel; Maristella Matera
The development of user interfaces (UIs) is one of the most time-consuming aspects in software development. In this context, the lack of proper reuse mechanisms for UIs is increasingly becoming manifest, especially as software development is more and more moving toward composite applications. In this paper we propose a framework for the integration of stand-alone modules or applications, where integration occurs at the presentation layer. Hence, the final goal is to reduce the effort required for UI development by maximizing reuse.
   The design of the framework is inspired by lessons learned from application integration, appropriately modified to account for the specificity of the UI integration problem. We provide an abstract component model to specify characteristics and behaviors of presentation components and propose an event-based composition model to specify the composition logic. Components and composition are described by means of a simple XML-based language, which is interpreted by a runtime middleware for the execution of the resulting composite application. A proof-of-concept prototype allows us to show that the proposed component model can also easily be applied to existing presentation components, built with different languages and/or component technologies.
Integrating value-based requirement engineering models to WEBML using VIP business modeling framework BIBAFull-Text 933-942
  Farooque Azam; Zhang Li; Rashid Ahmad
Requirement engineering (RE) is emerging as an increasingly important discipline for supporting Web application development, as Web applications are designed to satisfy diverse stakeholder needs and additional functional, information, multimedia, and usability requirements compared to traditional software applications. Moreover, when considering innovative e-commerce applications, value-based RE is an extremely relevant methodology, which exploits the concept of economic value during the RE activity. In contrast, most of the methodologies proposed for the development of Web applications focus primarily on system design, paying less attention to RE, and specifically to value-based RE. Focusing on this aspect, the paper presents the integration of value-based RE models with WebML models using our recently proposed VIP Business Modeling Framework [1]. We also analyze the framework's potential in linking other modeling approaches, and argue for its significant integration potential with various E-R/OO-based, process-aware Web modeling approaches.

End-user perspectives and measurement in web engineering

Towards effective browsing of large scale social annotations BIBAFull-Text 943-952
  Rui Li; Shenghua Bao; Yong Yu; Ben Fei; Zhong Su
This paper is concerned with the problem of browsing social annotations. Today, many services (e.g., Del.icio.us, Flickr) help users manage and share their favorite URLs and photos based on social annotations. Due to the exponential increase in social annotations, however, more and more users are facing the problem of how to effectively find desired resources within large annotation data. Existing methods such as tag clouds and annotation matching work well only on small annotation sets. Thus, an effective approach for browsing large-scale annotation sets and the associated resources is in great demand by both ordinary users and service providers. In this paper, we propose a novel algorithm, namely the Effective Large Scale Annotation Browser (ELSABer), to browse large-scale social annotation data. ELSABer helps users browse huge numbers of annotations in a semantic, hierarchical and efficient way. More specifically, ELSABer has the following features: 1) the semantic relations between annotations are explored for browsing of similar resources; 2) the hierarchical relations between annotations are constructed for browsing in a top-down fashion; 3) the distribution of social annotations is studied for efficient browsing. By incorporating personal and time information, ELSABer can be further extended for personalized and time-related browsing. A prototype system has been implemented and shows promising results.
Supporting end-users in the creation of dependable web clips BIBAFull-Text 953-962
  Sandeep Lingam; Sebastian Elbaum
Web authoring environments enable end-users to create applications that integrate information from other web sources. Users can create web sites that include built-in components to dynamically incorporate, for example, weather information, stock-quotes, or the latest news from different web sources. Recent surveys conducted among end-users have indicated an increasing interest in creating such applications. Unfortunately, web authoring environments do not provide support beyond a limited set of built-in components. This work addresses this limitation by providing end-user support for "clipping" information from a target web site to incorporate it into the end-user site. The support consists of a mechanism to identify the target clipping with multiple markers to increase robustness, and a dynamic assessment of the retrieved information to quantify its reliability. The clipping approach has been integrated as a feature into a popular web authoring tool on which we present the results of two preliminary studies.
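The multiple-marker idea — anchoring a clip with several redundant markers so it survives changes to any single one — can be sketched as a simple voting scheme. The markers and HTML below are invented for illustration; the paper's mechanism works on richer page structure:

```python
def locate_clip(html, markers):
    """Sketch: find a clipped region by voting over several markers
    (e.g. an id, a heading text, a class name). If too few markers
    still match, report the clip as unreliable instead of guessing."""
    positions = [html.find(m) for m in markers]
    hits = [p for p in positions if p >= 0]
    if len(hits) < len(markers) / 2:
        return None              # majority of markers gone: flag as stale
    return min(hits)             # earliest surviving marker starts the clip

html = '<div id="weather"><h2>Weather</h2><p>Sunny, 21C</p></div>'
print(locate_clip(html, ['id="weather"', "<h2>Weather</h2>", 'class="wx"']))
```

Returning `None` rather than a best guess corresponds to the paper's dynamic reliability assessment: the end user's page can show a warning instead of silently clipping the wrong content.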
Effort estimation: how valuable is it for a web company to use a cross-company data set, compared to using its own single-company data set? BIBAFull-Text 963-972
  Emilia Mendes; Sergio Di Martino; Filomena Ferrucci; Carmine Gravino
Previous studies comparing the prediction accuracy of effort models built using Web cross- and single-company data sets have been inconclusive, and as such replicated studies are necessary to determine under what circumstances a company can place reliance on a cross-company effort model.
   This paper therefore replicates a previous study by investigating how successful a cross-company effort model is: i) to estimate effort for Web projects that belong to a single company and were not used to build the cross-company model; ii) compared to a single-company effort model. Our single-company data set had data on 15 Web projects from a single company and our cross-company data set had data on 68 Web projects from 25 different companies. The effort estimates used in our analysis were obtained by means of two effort estimation techniques, namely forward stepwise regression and case-based reasoning.
   Our results were similar to those from the replicated study, showing that predictions based on the single-company model were significantly more accurate than those based on the cross-company model.

Orchestration and choreography

Towards the theoretical foundation of choreography BIBAFull-Text 973-982
  Zongyan Qiu; Xiangpeng Zhao; Chao Cai; Hongli Yang
With the growth of interest in web services, people pay increasing attention to choreography, that is, to describing collaborations of participants in accomplishing a common business goal from a global viewpoint. In this paper, based on a simple choreography language and a role-oriented process language, we study some fundamental issues related to choreography, especially those related to implementation, including semantics, projection and natural projection, and the dominant role in choices and iterations. We propose the concept of the dominant role and some novel language structures related to it. The study reveals some clues about the language, the semantics, the specification and the implementation of choreography.
Introduction and evaluation of Martlet: a scientific workflow language for abstracted parallelisation BIBAFull-Text 983-992
  Daniel James Goodman
The workflow language Martlet described in this paper implements a new programming model that allows users to write parallel programs and analyse distributed data without having to be aware of the details of the parallelisation. Martlet abstracts the parallelisation of the computation and the splitting of the data through the inclusion of constructs inspired by functional programming. These allow programs to be written as an abstract description that can be adjusted automatically at runtime to match the data set and available resources. Using this model it is possible to write programs to perform complex calculations across a distributed data set, such as Singular Value Decomposition or Least Squares problems, as well as creating an intuitive way of working with distributed systems.
   Having described and evaluated Martlet against other functional languages for parallel computation, this paper goes on to look at how Martlet might develop. In doing so it covers both possible additions to the language itself, and the use of JIT compilers to increase the range of platforms it is capable of running on.
Semi-automated adaptation of service interactions BIBAFull-Text 993-1002
  Hamid Reza Motahari Nezhad; Boualem Benatallah; Axel Martens; Francisco Curbera; Fabio Casati
In today's Web, many functionality-wise similar Web services are offered through heterogeneous interfaces (operation definitions) and business protocols (ordering constraints defined on legal operation invocation sequences). The typical approach to enable interoperation in such a heterogeneous setting is through developing adapters. There have been approaches for classifying possible mismatches between service interfaces and business protocols to facilitate adapter development. However, the hard job is that of identifying, given two service specifications, the actual mismatches between their interfaces and business protocols.
   In this paper we present novel techniques and a tool that provide semi-automated support for identifying and resolving mismatches between service interfaces and protocols, and for generating adapter specifications. We make the following main contributions: (i) we identify mismatches between service interfaces, which leads to finding mismatches of the signature, merge/split, and extra/missing-message types; (ii) we identify all ordering mismatches between service protocols and generate a tree, called the mismatch tree, for mismatches that require developers' input for their resolution. In addition, we provide semi-automated support in analyzing the mismatch tree to help resolve such mismatches. We have implemented the approach in a tool inside IBM WID (WebSphere Integration Developer). Our experiments with some real-world case studies show the viability of the proposed approach. The methods and tool are significant in that they considerably simplify the problem of adapting services so that interoperation is possible.

SLAs and QoS

Reliable QoS monitoring based on client feedback BIBAFull-Text 1003-1012
  Radu Jurca; Boi Faltings; Walter Binder
Service-level agreements (SLAs) establish a contract between service providers and clients concerning Quality of Service (QoS) parameters. Without proper penalties, service providers have strong incentives to deviate from the advertised QoS, causing losses to the clients. Reliable QoS monitoring (and proper penalties computed on the basis of delivered QoS) are therefore essential for the trustworthiness of a service-oriented environment. In this paper, we present a novel QoS monitoring mechanism based on quality ratings from the clients. A reputation mechanism collects the ratings and computes the actual quality delivered to the clients. The mechanism provides incentives for the clients to report honestly, and pays special attention to minimizing cost and overhead.
Preference-based selection of highly configurable web services BIBAFull-Text 1013-1022
  Steffen Lamparter; Anupriya Ankolekar; Rudi Studer; Stephan Grimm
A key challenge for dynamic Web service selection is that Web services are typically highly configurable and service requesters often have dynamic preferences on service configurations. Current approaches, such as WS-Agreement, describe Web services by enumerating the various possible service configurations, an inefficient approach when dealing with numerous service attributes with large value spaces. We model Web service configurations and associated prices and preferences more compactly using utility function policies, which also allows us to draw from multi-attribute decision theory methods to develop an algorithm for optimal service selection. In this paper, we present an OWL ontology for the specification of configurable Web service offers and requests, and a flexible and extensible framework for optimal service selection that combines declarative logic-based matching rules with optimization methods, such as linear programming. Assuming additive price/preference functions, experimental results indicate that our algorithm introduces an overhead of only around 2 seconds compared to random service selection, while giving optimal results. The overhead, as a percentage of total time, decreases as the number of offers and configurations increases.
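   The additive price/preference assumption the abstract mentions can be sketched in a few lines: score each offered configuration as a weighted sum of per-attribute utility functions and pick the maximum. The attribute names, weights, and utility functions below are illustrative assumptions, not the paper's ontology.

```python
def best_offer(offers, weights, utils):
    """Additive multi-attribute utility sketch: each offer is a dict of
    attribute values; `utils` maps each attribute to a utility function
    and `weights` gives its relative importance."""
    def score(cfg):
        return sum(w * utils[a](cfg[a]) for a, w in weights.items())
    return max(offers, key=score)

# Hypothetical offers: lower price and lower latency are both better.
offers = [{"price": 10, "latency": 5}, {"price": 5, "latency": 20}]
utils = {"price": lambda p: -p, "latency": lambda l: -l}
weights = {"price": 1.0, "latency": 0.1}
chosen = best_offer(offers, weights, utils)
```

The full framework additionally matches offers against requests with logic rules; this sketch covers only the final utility-maximization step.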
Speeding up adaptation of web service compositions using expiration times BIBAFull-Text 1023-1032
  John Harney; Prashant Doshi
Web processes must often operate in volatile environments where the quality of service parameters of the participating service providers change during the lifetime of the process. In order to remain optimal, the Web process must adapt to these changes. Adaptation requires knowledge about the parameter changes of each of the service providers and using this knowledge to determine whether the Web process should make a different, more optimal decision. Previously, we defined a mechanism called the value of changed information which measures the impact of expected changes in the service parameters on the Web process, thereby offering a way to query and incorporate those changes that are useful and cost-efficient. However, computing the value of changed information incurs a substantial computational overhead. In this paper, we use service expiration times obtained from pre-defined service level agreements to reduce the computational overhead of adaptation. We formalize the intuition that services whose parameters have not expired need not be considered for querying for revised information. Using two realistic scenarios, we illustrate our approach and demonstrate the associated computational savings.
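   The core intuition, that only services whose SLA-advertised parameters have expired need to be re-queried, reduces to a simple filter. The `expires_at` field and service records below are hypothetical; the paper's formalization is richer than this sketch.

```python
import time

def services_to_query(services, now=None):
    """Return only the services whose advertised parameters have expired.

    `services` is a list of dicts carrying a hypothetical 'expires_at'
    timestamp taken from the SLA; unexpired services are skipped, which
    is where the computational savings come from.
    """
    now = time.time() if now is None else now
    return [s for s in services if s["expires_at"] <= now]
```

In the full mechanism, only the services this filter returns would then have their value of changed information computed.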
DIANE: an integrated approach to automated service discovery, matchmaking and composition BIBAFull-Text 1033-1042
  Ulrich Küster; Birgitta König-Ries; Mirco Stern; Michael Klein
Automated matching of semantic service descriptions is the key to automatic service discovery and binding. But when trying to find a match for a certain request it may often happen that the request cannot be serviced by a single offer but could be handled by combining existing offers. In this case automatic service composition is needed. Although automatic composition is an active field of research it is mainly viewed as a planning problem and treated separately from service discovery. In this paper we argue that an integrated approach to the problem is better than separating these issues as is usually done. We propose an approach that integrates service composition into service discovery and matchmaking to match service requests that ask for multiple connected effects, discuss general issues involved in describing and matching such services and present an efficient algorithm implementing our ideas.

Querying & transforming XML

Multiway SLCA-based keyword search in XML data BIBAFull-Text 1043-1052
  Chong Sun; Chee-Yong Chan; Amit K. Goenka
Keyword search for smallest lowest common ancestors (SLCAs) in XML data has recently been proposed as a meaningful way to identify interesting data nodes in XML data where their subtrees contain an input set of keywords. In this paper, we generalize this useful search paradigm to support keyword search beyond the traditional AND semantics to include both AND and OR boolean operators as well. We first analyze properties of the LCA computation and propose improved algorithms to solve the traditional keyword search problem (with only AND semantics). We then extend our approach to handle general keyword search involving combinations of AND and OR boolean operators. The effectiveness of our new algorithms is demonstrated with a comprehensive experimental performance study.
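   The SLCA semantics can be illustrated with Dewey labels (a node is the tuple of child positions on its root path): compute the LCA of every pair of keyword matches as a longest common prefix, then keep only the LCAs that have no other LCA in their subtree. This naive O(n·m) sketch is for intuition only; the paper's algorithms are far more efficient.

```python
def lca(a, b):
    """LCA of two Dewey-labelled nodes = their longest common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def slca(list1, list2):
    """Smallest LCAs of two keyword match lists (AND semantics):
    pairwise LCAs, minus any LCA that is a proper ancestor of another."""
    lcas = {lca(a, b) for a in list1 for b in list2}
    return sorted(l for l in lcas
                  if not any(m != l and m[:len(l)] == l for m in lcas))
```

For example, matches under sibling subtrees yield their parent, while a match high in the tree is discarded once a deeper (smaller) answer exists.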
Visibly pushdown automata for streaming XML BIBAFull-Text 1053-1062
  Viraj Kumar; P. Madhusudan; Mahesh Viswanathan
We propose the study of visibly pushdown automata (VPA) for processing XML documents. VPAs are pushdown automata where the input determines the stack operation, and XML documents are naturally visibly pushdown with the VPA pushing onto the stack on open-tags and popping the stack on close-tags. In this paper we demonstrate the power and ease visibly pushdown automata give in the design of streaming algorithms for XML documents.
   We study the problems of type-checking streaming XML documents against SDTD schemas, and the problem of typing tags in a streaming XML document according to an SDTD schema. For the latter problem, we consider both pre-order typing and post-order typing of a document, which dynamically determine types at open-tags and close-tags, respectively, as soon as they are encountered. We also generalize the problems of pre-order and post-order typing to prefix querying. We show that a deterministic VPA yields a one-pass algorithm for computing the set of all answers to any query with the property that whether a node satisfies the query is determined solely by the prefix leading to the node. All the streaming algorithms we develop in this paper are based on the construction of deterministic VPAs, and hence, for any fixed problem, the algorithms process each element of the input in constant time and use space O(d), where d is the depth of the document.
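   The "visibly pushdown" discipline is easy to see in miniature: the input symbol alone dictates the stack operation, push on an open-tag, pop on a close-tag, so each event costs constant time and the stack never exceeds the document depth d. The event encoding below is an illustrative stand-in for a real SAX-style stream.

```python
def stream_depths(events):
    """Annotate a well-nested tag stream with depths using one stack,
    mirroring a VPA: open-tags push, close-tags pop. Per-event work is
    constant and space is O(d), where d is the document depth."""
    stack, out = [], []
    for kind, tag in events:
        if kind == "open":
            stack.append(tag)
            out.append((tag, len(stack)))
        else:  # "close"
            assert stack and stack[-1] == tag, "ill-nested document"
            out.append((tag, len(stack)))
            stack.pop()
    return out
```

A deterministic VPA for typing would additionally carry a schema state in each stack frame; the stack discipline is identical.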
Mapping-driven XML transformation BIBAFull-Text 1063-1072
  Haifeng Jiang; Howard Ho; Lucian Popa; Wook-Shin Han
Clio is an existing schema-mapping tool that provides user-friendly means to manage and facilitate the complex task of transformation and integration of heterogeneous data such as XML over the Web or in XML databases. By means of mappings from source to target schemas, Clio can help users conveniently establish the precise semantics of data transformation and integration. In this paper we study the problem of how to efficiently implement such data transformation (i.e., generating target data from the source data based on schema mappings). We present a three-phase framework for high-performance XML-to-XML transformation based on schema mappings, and discuss methodologies and algorithms for implementing these phases. In particular, we elaborate on novel techniques such as streamed extraction of mapped source values and scalable disk-based merging of overlapping data (including duplicate elimination). We compare our transformation framework with alternative methods such as using XQuery or SQL/XML provided by current commercial databases. The results demonstrate that the three-phase framework (simple as it is) is highly scalable and outperforms the alternative methods by orders of magnitude.

Parsing, normalizing, & storing XML

Querying and maintaining a compact XML storage BIBAFull-Text 1073-1082
  Raymond K. Wong; Franky Lam; William M. Shui
As XML database sizes grow, the amount of space used for storing the data and auxiliary data structures becomes a major factor in query and update performance. This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time. In addition to supporting efficient queries, the space requirement of the proposed scheme is within a constant factor of the information theoretic minimum, while insertions and deletions can be performed in near constant time as well. As a result, the proposed structure features a small memory footprint that increases cache locality, whilst still supporting standard APIs, such as DOM, and necessary database operations, such as queries and updates, efficiently. Analysis and experiments show that the proposed structure is space and time efficient.
XML design for relational storage BIBAFull-Text 1083-1092
  Solmaz Kolahi; Leonid Libkin
Design principles for XML schemas that eliminate redundancies and avoid update anomalies have been studied recently. Several normal forms, generalizing those for relational databases, have been proposed. All of them, however, are based on the assumption of a native XML storage, while in practice most XML data is stored in relational databases.
   In this paper we study XML design and normalization for relational storage of XML documents. To be able to relate and compare XML and relational designs, we use an information-theoretic framework that measures information content in relations and documents, with higher values corresponding to lower levels of redundancy. We show that most common relational storage schemes preserve the notion of being well-designed (i.e., anomalies- and redundancy-free). Thus, existing XML normal forms guarantee well-designed relational storages as well. We further show that if this perfect option is not achievable, then a slight restriction on XML constraints guarantees a "second-best" relational design, according to possible values of the information-theoretic measure. We finally consider an edge-based relational representation of XML documents, and show that while it has similar information-theoretic properties to other relational representations, it can behave significantly worse in terms of enforcing integrity constraints.
A high-performance interpretive approach to schema-directed parsing BIBAFull-Text 1093-1114
  Morris Matsa; Eric Perkins; Abraham Heifets; Margaret Gaitatzes Kostoulas; Daniel Silva; Noah Mendelsohn; Michelle Leger
XML delivers key advantages in interoperability due to its flexibility, expressiveness, and platform-neutrality. As XML has become a performance-critical aspect of the next generation of business computing infrastructure, however, it has become increasingly clear that XML parsing often carries a heavy performance penalty, and that current, widely-used parsing technologies are unable to meet the performance demands of an XML-based computing infrastructure. Several efforts have been made to address this performance gap through the use of grammar-based parser generation. While the performance of generated parsers has been significantly improved, adoption of the technology has been hindered by the complexity of compiling and deploying the generated parsers. Through careful analysis of the operations required for parsing and validation, we have devised a set of specialized byte codes, designed for the task of XML parsing and validation. These byte codes are designed to engender the benefits of fine-grained composition of parsing and validation that make existing compiled parsers fast, while being coarse-grained enough to minimize interpreter overhead. This technique of using an interpretive, validating parser balances the need for performance against the requirements of simple tooling and robust scalable infrastructure. Our approach is demonstrated with a specialized schema compiler, used to generate byte codes which in turn drive an interpretive parser. With almost as little tooling and deployment complexity as a traditional interpretive parser, the byte code-driven parser usually demonstrates performance within 20% of the fastest fully compiled solutions.

Developing regions

Collaborative ICT for Indian business clusters BIBAFull-Text 1115-1116
  Soumya Roy; Shantanu Biswas
Indian business clusters have contributed immensely to the country's industrial output, poverty alleviation and employment generation. However, with recent globalization these clusters can lose out to international competitors if they do not continuously innovate and take advantage of the new opportunities that are available through economic liberalization. In this paper, we discuss how information and communication technologies (ICT) can help in improving the productivity and growth of these clusters.
Delay tolerant applications for low bandwidth and intermittently connected users: the aAQUA experience BIBAFull-Text 1117-1118
  Saurabh Sahni; Krithi Ramamritham
With the explosive growth and spread of the Internet, web access from mobile and rural users has become significant. But these users face problems of low bandwidth and intermittent Internet connectivity. To make the benefits of the Internet reach the common man in developing countries, accessibility and availability of the information has to be improved. aAQUA is an online multilingual, multimedia agricultural portal for disseminating information from and to rural communities. Considering resource constrained rural environments, we have designed and implemented an offline solution which provides an online experience to users in disconnected mode. Our solution is based on heterogeneous database synchronization which involves only a small synchronization payload ensuring an efficient use of available bandwidth. Offline aAQUA has been deployed in the field and systematic studies of our solution show that user experience has improved tremendously not only in disconnected mode but also in connected mode.

Search

A cautious surfer for PageRank BIBAFull-Text 1119-1120
  Lan Nie; Baoning Wu; Brian D. Davison
This work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.
A clustering method for web data with multi-type interrelated components BIBAFull-Text 1121-1122
  Levent Bolelli; Seyda Ertekin; Ding Zhou; C. Lee Giles
Traditional clustering algorithms work on "flat" data, making the assumption that the data instances can only be represented by a set of homogeneous and uniform features. Much real-world data, however, is heterogeneous in nature, comprising multiple types of interrelated components. We present a clustering algorithm, K-SVMeans, that integrates the well known K-Means clustering with the highly popular Support Vector Machines (SVM) in order to utilize the richness of data. Our experimental results on authorship analysis of scientific publications show that K-SVMeans achieves better clustering performance than homogeneous data clustering.
A large-scale study of robots.txt BIBAFull-Text 1123-1124
  Yang Sun; Ziming Zhuang; C. Lee Giles
Search engines largely rely on Web robots to collect information from the Web. Due to the unregulated open-access nature of the Web, robot activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforcement standard, ethical robots (and many commercial ones) will follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from education, government, news, and business domains. Five crawls have been conducted in succession to study the temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robots rules at the Web scale. The results also show that the usage of robots.txt has increased over time.
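   The Robots Exclusion Protocol the study surveys can be exercised with Python's standard `urllib.robotparser`. The robots.txt body and agent names below are hypothetical examples, not data from the study.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: all agents are barred from /private/,
# while GoodBot is explicitly allowed everywhere (empty Disallow).
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: GoodBot
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def allowed(agent, url):
    """An ethical crawler checks this before fetching a URL."""
    return rp.can_fetch(agent, url)
```

A crawler measuring robots.txt usage would fetch each site's `/robots.txt` and feed the lines to `parse` in the same way.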
A link-based ranking scheme for focused search BIBAFull-Text 1125-1126
  Philip O'Brien; Tony Abou-Assaleh; Tapajyoti Das; Weizheng Gao; Yingbo Miao; Zhen Zhen
This paper introduces a novel link-based ranking algorithm based on a model of focused Web surfers. FocusedRank is described and compared to implementations of PageRank and Topic-Sensitive PageRank. We report a user study that measures the relevance and precision of each approach. FocusedRank gives superior relevance over PageRank, while significantly reducing the computational complexity compared to Topic-Sensitive PageRank.
A link classification based approach to website topic hierarchy generation BIBAFull-Text 1127-1128
  Nan Liu; Christopher C. Yang
Hierarchical models are commonly used to organize a Website's content. A Website's content structure can be represented by a topic hierarchy, a directed tree rooted at a Website's homepage in which the vertices and edges correspond to Web pages and hyperlinks. In this work, we propose a new method for constructing the topic hierarchy of a Website. We model the Website's link structure as a weighted directed graph, in which the edge weights are computed using a classifier that predicts if an edge connects a pair of nodes representing a topic and a sub-topic. We then pose the problem of building the topic hierarchy as finding the shortest-path tree and directed minimum spanning tree in the weighted graph. We have conducted extensive experiments using real Websites and obtained very promising results.
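   The shortest-path-tree formulation can be sketched with Dijkstra's algorithm: each page's parent in the hierarchy is its predecessor on the cheapest path from the homepage, where a low edge weight means the classifier judged the link likely to be topic-to-subtopic. The graph and fixed weights below are hypothetical stand-ins for classifier outputs.

```python
import heapq

def topic_hierarchy(graph, root):
    """Dijkstra shortest-path tree over a weighted link graph.
    `graph` maps a page to [(neighbor, weight), ...]; returns a parent
    map defining the topic hierarchy rooted at the homepage."""
    dist, parent = {root: 0.0}, {root: None}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent
```

Note that a page reachable by a cheap two-hop topical path is attached under the intermediate page rather than directly under the homepage.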
A search-based Chinese word segmentation method BIBAFull-Text 1129-1130
  Xin-Jing Wang; Yong Qin; Wen Liu
In this paper, we propose a novel Chinese word segmentation method which leverages the vast collection of Web documents and search technology. It simultaneously solves the ambiguous phrase boundary resolution and unknown word identification problems. Evaluations demonstrate its effectiveness.
Anchor-based proximity measures BIBAFull-Text 1131-1132
  Amruta Joshi; Ravi Kumar; Benjamin Reed; Andrew Tomkins
We present a family of measures of proximity of an arbitrary node in a directed graph to a pre-specified subset of nodes, called the anchor. Our measures are based on three different propagation schemes and two different uses of the connectivity structure of the graph. We consider a web-specific application of the above measures with two disjoint anchors -- good and bad web pages -- and study the accuracy of these measures in this context.
Automatic search engine performance evaluation with click-through data analysis BIBAFull-Text 1133-1134
  Yiqun Liu; Yupeng Fu; Min Zhang; Shaoping Ma; Liyun Ru
Performance evaluation is an important issue in Web search engine research. Traditional evaluation methods rely heavily on human effort and are therefore quite time-consuming. Using click-through data analysis, we propose an automatic search engine performance evaluation method. This method generates navigational-type query topics and answers automatically based on search users' querying and clicking behavior. Experimental results based on a commercial Chinese search engine's user logs show that the automatic method produces evaluation results similar to traditional assessor-based ones.
Automatic searching of tables in digital libraries BIBAFull-Text 1135-1136
  Ying Liu; Kun Bai; Prasenjit Mitra; C. Lee Giles
Tables are ubiquitous. Unfortunately, no search engine supports table search. In this paper, we propose a novel table-specific search engine, TableSeer, to facilitate table extraction, indexing, searching, and sharing. In addition, we propose an extensive set of medium-independent metadata to precisely describe tables. Given a query, TableSeer ranks the returned results using an innovative ranking algorithm -- TableRank with a tailored vector space model and a novel term weighting scheme. Experimental results show that TableSeer outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results.
Bayesian network based sentence retrieval model BIBAFull-Text 1137-1138
  Keke Cai; Jiajun Bu; Chun Chen; Kangmiao Liu; Wei Chen
This paper makes an intensive investigation of the application of Bayesian networks to sentence retrieval and introduces three Bayesian network based sentence retrieval models with or without consideration of term relationships. Term relationships in this paper are considered from two perspectives: relationships between pairs of terms and relationships between terms and term sets. Experiments demonstrate the effectiveness of Bayesian networks for sentence retrieval. In particular, retrieval that considers the second kind of term relationship performs better in improving retrieval precision.
Brand awareness and the evaluation of search results BIBAFull-Text 1139-1140
  Bernard J. Jansen; Mimi Zhang; Ying Zhang
We investigate the effect of search engine brand (i.e., the identifying name or logo that distinguishes a product from its competitors) on evaluation of system performance. This research is motivated by the large amount of search traffic directed to a handful of Web search engines, even though most are of equal technical quality with similar interfaces. We conducted a laboratory study with 32 participants to measure the effect of four search engine brands while controlling for the quality of search engine results. There was a 25% difference between the most highly rated search engine and the lowest using average relevance ratings, even though search engine results were identical in both content and presentation. Qualitative analysis suggests branding affects user views of popularity, trust and specialization. We discuss implications for search engine marketing and the design of search engine quality studies.
Causal relation of queries from temporal logs BIBAFull-Text 1141-1142
  Yizhou Sun; Kunqing Xie; Ning Liu; Shuicheng Yan; Benyu Zhang; Zheng Chen
In this paper, we study the new problem of mining causal relations between queries in search engine query logs. A causal relation between two queries means that an event on one query is the cause of some event on the other. We first detect events in query logs using an efficient statistical frequency threshold. Then the causal relations of queries are mined using the geometric features of the events. Finally, the Granger Causality Test (GCT) is utilized to further re-rank the causal relations of queries according to their GCT coefficients. In addition, we develop a 2-dimensional visualization tool to display the detected relationships of events in a more intuitive way. The experimental results on the MSN search engine query logs demonstrate that our approach can accurately detect events in temporal query logs and that causal relations between queries are detected effectively.
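   The Granger idea can be sketched as a residual-ratio check: if lagged values of series x reduce the error of an autoregressive model for series y, x is a candidate cause. This simplified score (assuming NumPy) stands in for the full GCT statistic used in the paper; the lag count and series are illustrative.

```python
import numpy as np

def granger_score(x, y, lag=2):
    """Ratio of residual sums of squares: AR model for y using only y's
    lags (restricted) vs. y's and x's lags (full). A score well above 1
    suggests x helps predict y; the paper uses the full GCT statistic."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    T = len(y)
    Y = y[lag:]
    lags_y = np.column_stack([y[lag - i - 1:T - i - 1] for i in range(lag)])
    lags_x = np.column_stack([x[lag - i - 1:T - i - 1] for i in range(lag)])
    ones = np.ones((T - lag, 1))
    rss = lambda A: np.sum((Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]) ** 2)
    restricted = rss(np.hstack([ones, lags_y]))
    full = rss(np.hstack([ones, lags_y, lags_x]))
    return restricted / max(full, 1e-12)
```

Since the restricted model is nested in the full one, the score is always at least 1; query pairs would be re-ranked by how far above 1 they score.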
Classifying web sites BIBAFull-Text 1143-1144
  Christoph Lindemann; Lars Littig
In this paper, we present a novel method for the classification of Web sites. This method exploits both structure and content of Web sites in order to discern their functionality. It allows for distinguishing between eight of the most relevant functional classes of Web sites. We show that a pre-classification of Web sites utilizing structural properties considerably improves a subsequent textual classification with standard techniques. We evaluate this approach on a dataset comprising more than 16,000 Web sites with about 20 million crawled and 100 million known Web pages. Our approach achieves an accuracy of 92% for the coarse-grained classification of these Web sites.
Comparing apples and oranges: normalized pagerank for evolving graphs BIBAFull-Text 1145-1146
  Klaus Berberich; Srikanta Bedathur; Gerhard Weikum; Michalis Vazirgiannis
PageRank is the best known technique for link-based importance ranking. The computed importance scores, however, are not directly comparable across different snapshots of an evolving graph. We present an efficiently computable normalization for PageRank scores that makes them comparable across graphs. Furthermore, we show that the normalized PageRank scores are robust to non-local changes in the graph, unlike the standard PageRank measure.
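   Why raw scores are incomparable is easy to see: PageRank sums to 1, so the average score shrinks as the graph grows. The sketch below runs power iteration and applies the simplest possible normalization, scaling by graph size n; the paper's normalization is more refined, so treat this only as an illustration of the problem.

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank over `adj` (node -> list of out-links),
    returning scores scaled by n so the average is 1 regardless of
    graph size -- a naive cross-snapshot normalization."""
    nodes = list(adj)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = d * pr[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:  # dangling node: distribute its mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return {u: pr[u] * n for u in nodes}
```

On any graph, the normalized scores average to 1, so "score 2" means "twice the average importance" in every snapshot.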
Designing efficient sampling techniques to detect webpage updates BIBAFull-Text 1147-1148
  Qingzhao Tan; Ziming Zhuang; Prasenjit Mitra; C. Lee Giles
Due to resource constraints, Web archiving systems and search engines usually have difficulties keeping the entire local repository synchronized with the Web. We advance the state of the art of sampling-based synchronization techniques by answering a challenging question: Given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and the popularity of the webpages. We run extensive experiments on a large dataset of approximately 300,000 webpages to demonstrate that it is most likely to find more updated webpages in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in terms of detecting important changes.
Determining the user intent of web search engine queries BIBAFull-Text 1149-1150
  Bernard J. Jansen; Danielle L. Booth; Amanda Spink
Determining the user intent of Web searches is a difficult problem due to the sparse data available concerning the searcher. In this paper, we examine a method to determine the user intent underlying Web search engine queries. We qualitatively analyze samples of queries from seven transaction logs from three different Web search engines containing more than five million queries. From this analysis, we identified characteristics of user queries based on three broad classifications of user intent. The classifications of informational, navigational, and transactional represent the type of content destination the searcher desired as expressed by their query. We implemented our classification algorithm and automatically classified a separate Web search engine transaction log of over a million queries submitted by several hundred thousand users. Our findings show that more than 80% of Web queries are informational in nature, with about 10% each being navigational and transactional. In order to validate the accuracy of our algorithm, we manually coded 400 queries and compared the classification to the results from our algorithm. This comparison showed that our automatic classification has an accuracy of 74%. Of the remaining 25% of the queries, the user intent is generally vague or multi-faceted, pointing to the need for probabilistic classification. We illustrate how knowledge of searcher intent might be used to enhance future Web search engines.
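   A rule-based three-way classifier of the kind the abstract describes can be sketched with simple surface cues. The cue lists below are illustrative assumptions; the paper's algorithm draws on many more query characteristics.

```python
def classify_intent(query):
    """Toy intent classifier: transactional if the query contains an
    action term, navigational if it looks like a site name, otherwise
    informational (the majority class in the study)."""
    q = query.lower().strip()
    if any(term in q for term in ("buy", "download", "lyrics", "order")):
        return "transactional"
    if q.startswith("www.") or q.endswith((".com", ".org", ".net")):
        return "navigational"
    return "informational"
```

Falling back to "informational" mirrors the finding that more than 80% of queries are informational, making it the sensible default class.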
EPCI: extracting potentially copyright infringement texts from the web BIBAFull-Text 1151-1152
  Takashi Tashiro; Takanori Ueda; Taisuke Hori; Yu Hirate; Hayato Yamana
In this paper, we propose EPCI, a new system for extracting potentially copyright-infringing texts from the Web. EPCI extracts them in the following way: (1) generating a set of queries based on a given copyright-reserved seed-text, (2) submitting every query to a search engine API, (3) gathering the search result Web pages from the highest ranking downward until the similarity between the given seed-text and the result pages falls below a given threshold value, and (4) merging all the gathered pages, then re-ranking them in order of their similarity. Our experimental results using 40 seed-texts show that EPCI is able to extract 132 potentially copyright-infringing Web pages per copyright-reserved seed-text with 94% precision on average.
Efficient training on biased minimax probability machine for imbalanced text classification BIBAFull-Text 1153-1154
  Xiang Peng; Irwin King
The Biased Minimax Probability Machine (BMPM) constructs a classifier for imbalanced learning tasks. In this paper, we propose a Second Order Cone Programming (SOCP) based algorithm to train the model. We outline the theoretical derivation of the biased classification model, and address text classification tasks where negative training documents significantly outnumber positive ones using the proposed strategy. We evaluate the learning scheme in comparison with traditional solutions on three different datasets. Empirical results show that our method is more effective and robust in handling imbalanced text classification problems.
Electoral search using the VerkiezingsKijker: an experience report BIBAFull-Text 1155-1156
  Valentin Jijkoun; Maarten Marx; Maarten de Rijke; Frank van Waveren
The Netherlands had parliamentary elections on November 22, 2006. We built a system which helped voters to make an informed choice among the many participating parties. One of the most important pieces of information in the Dutch election and subsequent coalition government formation is the party program, a text document with an average length of 45 pages. Our system provides the voter with focused access to party programs, enabling her to make a topic-wise comparison of parties' viewpoints. We complemented this type of access ("What do the parties promise?") with access to news ("What happens around these topics?") and blogs ("What do people say about them?"). We describe the system, including design and technical details, and user statistics.
Exploration of query context for information retrieval BIBAFull-Text 1157-1158
  Keke Cai; Chun Chen; Jiajun Bu; Peng Huang; Zhiming Kang
A number of existing information retrieval systems propose the notion of query context, which combines knowledge of the query and the user to reveal the most exact description of the user's information needs. In this paper we interpret query context as a document consisting of sentences related to the current query. This kind of query context is used to re-estimate the relevance probabilities of top-ranked documents and then re-rank them. The experiments show that the proposed context-based approach to information retrieval can greatly improve the relevance of search results.
First-order focused crawling BIBAFull-Text 1159-1160
  Qingyang Xu; Wanli Zuo
This paper reports a new general framework of focused web crawling based on "relational subgroup discovery". Predicates are used explicitly to represent the relevance clues of those unvisited pages in the crawl frontier, and then first-order classification rules are induced using the subgroup discovery technique. The learned relational rules with sufficient support and confidence then guide the crawling process. We present the many interesting features of our proposed first-order focused crawler, together with promising preliminary experimental results.
Academic web search engine: generating a survey automatically BIBAFull-Text 1161-1162
  Ye Wang; Zhihua Geng; Sheng Huang; Xiaoling Wang; Aoying Zhou
Given a document repository, a search engine is very helpful for retrieving information. Currently, vertical search is a hot topic, and Google Scholar [4] is an example of academic search. However, most vertical search engines only return a flat ranked list without an effective presentation of results for users. We study this problem and design a vertical search engine prototype, Dolphin, in which flexible user-oriented templates can be defined and survey-like results are presented according to the template.
Generative models for name disambiguation BIBAFull-Text 1163-1164
  Yang Song; Jian Huang; Isaac G. Councill; Jia Li; C. Lee Giles
Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we present an efficient framework by using two novel topic-based models, extended from Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. Experiments indicate that our approach consistently outperforms other unsupervised methods including spectral and DBSCAN clustering. Scalability is addressed by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
GigaHash: scalable minimal perfect hashing for billions of URLs BIBAFull-Text 1165-1166
  Kumar Chellapilla; Anton Mityagin; Denis Charles
A minimal perfect hash function maps a static set of n keys onto the range of integers {0,1,2,...,n - 1}. We present a scalable, high-performance algorithm based on random graphs for constructing minimal perfect hash functions (MPHFs). For a set of n keys, our algorithm outputs a description of an MPHF h in expected time O(n). The evaluation of h(x) requires three memory accesses for any key x, and the description of h takes up 0.89n bytes (7.13n bits). This is the best (most space-efficient) result known to date. Using a simple heuristic and Huffman coding, the space requirement is further reduced to 0.79n bytes (6.86n bits). We present a high-performance architecture that is easy to parallelize and scales well to the very large data sets encountered in internet search applications. Experimental results on a one-billion-URL dataset obtained from Live Search crawl data show that the proposed algorithm (a) finds an MPHF for one billion URLs in less than 4 minutes, and (b) requires only 6.86 bits/key for the description of h.
How NAGA uncoils: searching with entities and relations BIBAFull-Text 1167-1168
  Gjergji Kasneci; Fabian M. Suchanek; Maya Ramanath; Gerhard Weikum
Current keyword-oriented search engines for the World Wide Web do not allow users to specify the semantics of queries. We address this limitation with NAGA, a new semantic search engine. NAGA builds on a large semantic knowledge base of binary relationships (facts) derived from the Web. NAGA provides a simple yet expressive query language to query this knowledge base. The results are then ranked with an intuitive scoring mechanism. We show the effectiveness and utility of NAGA by comparing its output with that of Google on some interesting queries.
Identifying ambiguous queries in web search BIBAFull-Text 1169-1170
  Ruihua Song; Zhenxiao Luo; Ji-Rong Wen; Yong Yu; Hsiao-Wuen Hon
It is widely believed that some queries submitted to search engines are by nature ambiguous (e.g., java, apple). However, few studies have investigated the questions of "how many queries are ambiguous?" and "how can we automatically identify an ambiguous query?" This paper deals with these issues. First, we construct a taxonomy of query ambiguity and ask human annotators to manually classify queries based upon it. From the manually labeled results, we find that query ambiguity is to some extent predictable. We then use a supervised learning approach to automatically classify queries as ambiguous or not. Experimental results show that we can correctly identify 87% of labeled queries. Finally, we estimate that about 16% of queries in a real search log are ambiguous.
Web page classification with heterogeneous data fusion BIBAFull-Text 1171-1172
  Zenglin Xu; Irwin King; Michael R. Lyu
Web pages are more than text: they contain much contextual and structural information, e.g., the title, the meta data, the anchor text, etc., each of which can be seen as a data source or a representation. Due to the differing dimensionality and representation forms of these heterogeneous data sources, simply putting them together does not greatly enhance classification performance. We observe that, via a kernel function, data sources of different dimensions and types can be represented in a common format, the kernel matrix, which can be seen as a generalized similarity measure between a pair of web pages. In this sense, a kernel learning approach is employed to fuse these heterogeneous data sources. Experimental results on a collection from the ODP database validate the advantages of the proposed method over traditional methods based on any single data source or a uniformly weighted combination of them.
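The fusion step described above can be sketched in a few lines: every data source, whatever its dimensionality, induces an n x n kernel matrix over the same n pages, and a convex combination of positive semi-definite kernels is again a valid kernel. The sketch below uses hand-fixed weights purely for illustration; the paper's point is to *learn* the combination rather than weight uniformly.

```python
import numpy as np

def linear_kernel(X):
    """n x d feature matrix -> n x n similarity (kernel) matrix."""
    return X @ X.T

def fuse_kernels(kernels, weights):
    """Convex combination of kernel matrices; PSD inputs stay PSD."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize to a convex combination
    return sum(wi * K for wi, K in zip(w, kernels))
```

For example, body-text features of shape (n, 1000) and anchor-text features of shape (n, 50) both reduce to (n, n) kernels, which can then be fused and fed to any kernel classifier such as an SVM.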
Learning information diffusion process on the web BIBAFull-Text 1173-1174
  Xiaojun Wan; Jianwu Yang
Many text documents on the Web are not originally created but forwarded or copied from other source documents. This phenomenon of document forwarding or transmission between web sites is denoted as Web information diffusion. This paper focuses on mining information diffusion processes for specific topics on the Web. A novel system called LIDPW is proposed to address this problem using machine learning techniques. The source site and source document of each document are identified, and the diffusion process, composed of a sequence of diffusion relationships, is visually presented to users. The effectiveness of LIDPW is validated on a real data set. A preliminary user study shows that LIDPW helps users monitor the information diffusion process of a specific topic and discover its diffusion start and diffusion center.
MedSearch: a specialized search engine for medical information BIBAFull-Text 1175-1176
  Gang Luo; Chunqiang Tang; Hao Yang; Xing Wei
People are thirsty for medical information. Existing Web search engines cannot handle medical search well because they do not consider its special requirements. Often a medical information searcher is uncertain about his exact questions and unfamiliar with medical terminology. Therefore, he prefers to pose long queries, describing his symptoms and situation in plain English, and receive comprehensive, relevant information from search results. This paper presents MedSearch, a specialized medical Web search engine, to address these challenges. MedSearch can assist ordinary Internet users to search for medical information, by accepting queries of extended length, providing diversified search results, and suggesting related medical phrases.
Mining contiguous sequential patterns from web logs BIBAFull-Text 1177-1178
  Jinlin Chen; Terry Cook
Finding Contiguous Sequential Patterns (CSPs) is an important problem in Web usage mining. In this paper we propose a new data structure, the UpDown Tree, for CSP mining. An UpDown Tree combines a suffix tree and a prefix tree for efficient storage of all the sequences that contain a given item. The special structure of the UpDown Tree ensures efficient detection of CSPs. Experiments show that the UpDown Tree improves CSP mining in terms of both time and memory usage compared to previous approaches.
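As a point of reference for what the UpDown Tree accelerates, here is a naive contiguous-pattern miner: it simply enumerates every contiguous subsequence (substring) of every session, quadratic in session length, and keeps those meeting a support threshold. This baseline is ours for illustration; it does not use the paper's data structure.

```python
from collections import Counter

def mine_csp(sessions, min_support):
    """Count contiguous patterns of page-visit sessions (naive baseline).

    Each session contributes a given pattern at most once; patterns occurring
    in at least min_support sessions are returned with their support counts.
    """
    counts = Counter()
    for seq in sessions:
        seen = set()
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                seen.add(tuple(seq[i:j]))    # contiguous subsequence
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}
```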
Monitoring the evolution of cached content in Google and MSN BIBAFull-Text 1179-1180
  Ioannis Anagnostopoulos
In this paper, we describe a capture-recapture experiment conducted on Google's and MSN's cached directories. The aim of this work was to monitor evolution rates in these web search services and to measure their ability to index and maintain fresh, up-to-date results in their cached directories.
Multi-factor clustering for a marketplace search interface BIBAFull-Text 1181-1182
  Neel Sundaresan; Kavita Ganesan; Roopnath Grandhi
Search engines provide a small window onto the vast repository of data they index and search against. They try their best to return documents relevant to the user, but often a large number of results is returned. Users struggle to manage this vast result set while looking for the items of interest. Clustering search results is one way of alleviating this navigational pain. In this paper we describe a system that enables clustering of search results in an online marketplace search system.
On ranking techniques for desktop search BIBAFull-Text 1183-1184
  Sara Cohen; Carmel Domshlak; Naama Zwerdling
This paper addresses the desktop search problem by considering various techniques for ranking results of a search query over the file system. First, basic ranking techniques, which are based on a single file feature (e.g., file name, file content, access date, etc.) are considered. Next, two learning-based ranking schemes are presented, and are shown to be significantly more effective than the basic ranking methods. Finally, a novel ranking technique, based on query selectiveness is considered, for use during the cold-start period of the system. This method is also shown to be empirically effective, even though it does not involve any learning.
Query-driven indexing for peer-to-peer text retrieval BIBAFull-Text 1185-1186
  Gleb Skobeltsyn; Toan Luu; Karl Aberer; Martin Rajman; Ivana Podnar Zarko
We describe a query-driven indexing framework for scalable text retrieval over structured P2P networks. To cope with the bandwidth consumption problem that has been identified as the major obstacle to full-text retrieval in P2P networks, we truncate the posting lists associated with indexing features to a constant size, storing only the top-k ranked document references. To compensate for the loss of information caused by the truncation, we extend the set of indexing features with carefully chosen term sets. Indexing term sets are selected based on query statistics extracted from query logs; thus we index only those combinations that are (a) frequently present in user queries and (b) non-redundant w.r.t. the rest of the index. The distributed index is compact and efficient, as it constantly evolves to adapt to the current query popularity distribution. Moreover, it is possible to control the tradeoff between the storage/bandwidth requirements and the quality of query answering by tuning the indexing parameters. Our theoretical analysis and experimental results indicate that we can indeed achieve scalable P2P text retrieval for very large document collections and deliver good retrieval performance.
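The truncation step described above — keeping only the top-k highest-ranked document references per indexing feature — can be sketched as follows. This is a simplified illustration with hypothetical data shapes; the scoring model, term-set selection, and P2P distribution are beyond this toy.

```python
import heapq

def truncate_index(index, k):
    """Keep only the k highest-scored (score, docid) refs per indexing feature."""
    return {feat: heapq.nlargest(k, postings, key=lambda ref: ref[0])
            for feat, postings in index.items()}
```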
Query topic detection for reformulation BIBAFull-Text 1187-1188
  Xuefeng He; Jun Yan; Jinwen Ma; Ning Liu; Zheng Chen
In this paper, we show that most multiple-term queries include more than one topic, and that users usually reformulate their queries by topics instead of terms. To provide empirical evidence on users' reformulation behavior and to help search engines better handle the query reformulation problem, we focus on detecting internal topics in the original query and analyzing users' reformulation with respect to those topics. In particular, we utilize Interaction Information (II) to measure the degree to which a sub-query forms a topic, based on the local search results. Experimental results on a query log show that most users reformulate queries at the topical level, and that the proposed II-based algorithm is a good method for detecting topics in original queries.
Review spam detection BIBAFull-Text 1189-1190
  Nitin Jindal; Bing Liu
It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products. They are used by potential customers to find opinions of existing users before deciding to purchase a product. They are also used by product manufacturers to identify problems with their products and to find competitive intelligence about their competitors. Unfortunately, this importance of reviews also provides a strong incentive for spam, which contains false positive or malicious negative opinions. In this paper, we make an attempt to study review spam and spam detection. To the best of our knowledge, there is still no reported study on this problem.
SCAN: a small-world structured P2P overlay for multi-dimensional queries BIBAFull-Text 1191-1192
  Xiaoping Sun
This paper presents SCAN, a structured P2P overlay that augments the CAN overlay with long links based on Kleinberg's small-world model in a d-dimensional Cartesian space. The construction of long links does not require an estimate of the network size. Queries in a multi-dimensional data space can be resolved in O(log n) hops by equipping each node with O(log n) long links and O(d) short links.
SRing: a structured non DHT P2P overlay supporting string range queries BIBAFull-Text 1193-1194
  Xiaoping Sun; Xue Chen
This paper presents SRing, a structured non-DHT P2P overlay that efficiently supports exact and range queries on multiple attribute values. In SRing, all attribute values are interpreted as strings over a base alphabet and are published in lexicographical order. Two virtual rings are built: the N-ring is built in a skip-list fashion for range partitioning and queries; the D-ring is built in a small-world fashion for the construction of the N-ring. A leave-and-join based load balancing method balances range overload in networks with heterogeneous nodes.
Search engine retrieval of changing information BIBAFull-Text 1195-1196
  Yang Sok Kim; Byeong Ho Kang; Paul Compton; Hiroshi Motoda
In this paper we analyze the Web coverage of three search engines, Google, Yahoo and MSN. We conducted a 15 month study collecting 15,770 Web content or information pages linked from 260 Australian federal and local government Web pages. The key feature of this domain is that new information pages are constantly added but the 260 web pages tend to provide links only to the more recently added information pages. Search engines list only some of the information pages and their coverage varies from month to month. Meta-search engines do little to improve coverage of information pages, because the problem is not the size of web coverage, but the frequency with which information is updated. We conclude that organizations such as governments which post important information on the Web cannot rely on all relevant pages being found with conventional search engines, and need to consider other strategies to ensure important information can be found.
Search engines and their public interfaces: which APIs are the most synchronized? BIBAFull-Text 1197-1198
  Frank McCown; Michael L. Nelson
Researchers of commercial search engines often collect data using the application programming interface (API) or by "scraping" results from the web user interface (WUI), but anecdotal evidence suggests the interfaces produce different results. We provide the first in-depth quantitative analysis of the results produced by the Google, MSN and Yahoo API and WUI interfaces. After submitting a variety of queries to the interfaces for 5 months, we found significant discrepancies in several categories. Our findings suggest that the API indexes are not older, but they are probably smaller for Google and Yahoo. Researchers may use our findings to better understand the differences between the interfaces and choose the best API for their particular types of queries.
Spam and popularity ratings for combating link spam BIBAFull-Text 1199-1200
  Mukesh Dalal
We present a new approach for propagating spam scores in web graphs in order to combat link spam. The resulting spam rating is then used for propagating popularity scores like PageRank. Both propagations work even in the presence of censure links that represent distrust. Initial testing using a C++ prototype on small examples shows more reasonable results than other published approaches.
Summary attributes and perceived search quality BIBAFull-Text 1201-1202
  Daniel E. Rose; David Orr; Raj Gopal Prasad Kantamneni
We conducted a series of experiments in which surveyed web search users answered questions about the quality of search results on the basis of the result summaries alone. Summaries shown to different groups of users were editorially constructed so that they differed in only one attribute, such as length. Some attributes had no effect on users' quality judgments, while in other cases changing an attribute had a "halo effect" that caused seemingly unrelated dimensions of result quality to be rated higher by users.
Tag clouds for summarizing web search results BIBAFull-Text 1203-1204
  Byron Y-L Kuo; Thomas Hentrich; Benjamin M. Good; Mark D. Wilkinson
In this paper, we describe PubCloud, an application that uses tag clouds for the summarization of results from queries over the PubMed database of biomedical literature. PubCloud responds to queries of this database with tag clouds generated from words extracted from the abstracts returned by the query. A user study comparing PubCloud's tag-cloud summarization of query results with the standard result list provided by PubMed indicated that the tag-cloud interface is advantageous in presenting descriptive information and in reducing user frustration, but that it is less effective at enabling users to discover relations between concepts.
Towards efficient dominant relationship exploration of the product items on the web BIBAFull-Text 1205-1206
  Zhenglu Yang; Lin Li; Botao Wang; Masaru Kitsuregawa
In recent years, search engines have prevailed as the means of finding useful information on the Web, as they efficiently exploit the hyperlinks between web pages, which define a natural graph structure that yields good rankings. Unfortunately, current search engines cannot effectively rank relational data, which exists on dynamic websites supported by online databases. In this study, to rank such structured data (i.e., to find the "best" items), we propose an integrated online system consisting of a compressed data structure that encodes the dominance relationships in the relational data. Efficient querying strategies and an updating scheme are devised to facilitate the ranking process. Extensive experiments illustrate the effectiveness and efficiency of our methods. As such, we believe the work in this poster can be complementary to traditional search engines.
Understanding web search via a learning paradigm BIBAFull-Text 1207-1208
  Bernard J. Jansen; Brian Smith; Danielle L. Booth
Investigating whether Web searching can be viewed as a learning process, we examined the searching characteristics of 41 participants engaged in 246 searching tasks. We classified the searching tasks according to an updated version of Bloom's taxonomy, a six-level categorization of cognitive learning. Results show that Applying takes the most searching effort, as measured by queries per session and specific topics searched per session. The lower-level categories of Remembering and Understanding exhibit searching characteristics similar to the higher-order learning of Evaluating and Creating. It appears that searchers rely primarily on their internal knowledge for Evaluating and Creating, using searching primarily for fact checking and verification. One implication is that the commonly held notion that Web searchers have simple information needs may not be correct. We discuss the implications for Web searching, including designing interfaces to support exploration.
Using d-gap patterns for index compression BIBAFull-Text 1209-1210
  Jinlin Chen; Terry Cook
Sequential patterns of d-gaps exist pervasively in the inverted lists of Web document collection indices due to the cluster property. In this paper the information in d-gap sequential patterns is used as a new dimension for improving inverted index compression. We first detect d-gap sequential patterns using a novel data structure, the UpDown Tree. Based on the detected patterns, we substitute each pattern with its pattern id in the inverted lists that contain it. The resulting inverted lists are then coded with an existing coding scheme. Experiments show that this approach can effectively improve the compression ratio of existing codes.
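For background, the d-gap representation that the pattern mining operates on stores a sorted posting list as differences between consecutive docids, which are then coded with an existing scheme such as variable-byte coding. A minimal sketch of that baseline (without the paper's pattern-substitution step):

```python
def encode_postings(docids):
    """Encode a strictly increasing docid list as variable-byte coded d-gaps."""
    out = bytearray()
    prev = 0
    for d in docids:
        gap = d - prev          # clustered docids give small gaps
        prev = d
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)  # high bit marks the final byte of a code
    return bytes(out)

def decode_postings(data):
    docids, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:            # final byte: emit the accumulated gap
            prev += cur
            docids.append(prev)
            cur, shift = 0, 0
        else:
            shift += 7
    return docids
```

Since most gaps in clustered lists fit in one byte, recurring gap sequences that the UpDown Tree detects become recurring byte patterns amenable to further substitution.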
Utility analysis for topically biased PageRank BIBAFull-Text 1211-1212
  Christian Kohlschütter; Paul-Alexandru Chirita; Wolfgang Nejdl
PageRank is known to be an efficient metric for computing general document importance on the Web. While commonly used as a one-size-fits-all measure, its ability to produce topically biased ranks has not yet been fully explored. In particular, it has remained unclear at what granularity of "topic" the computation of biased page ranks makes sense. In this paper we present the results of a thorough quantitative and qualitative analysis of biasing PageRank on Open Directory categories. We show that the MAP quality of biased PageRank generally increases with the ODP level up to a certain point, supporting the use of more specialized categories to bias PageRank on in order to improve topic-specific search.
Sliding window technique for the web log analysis BIBAFull-Text 1213-1214
  Nikolai Buzikashvili
The results of Web query log analysis may be significantly skewed by the fraction of agents (non-human clients) that are not excluded from the log. To detect and exclude agents, Web log studies use threshold values for the number of requests submitted by a client during the observation period. However, different studies use different observation periods, and a threshold assigned to one period is usually incomparable with a threshold assigned to another. We propose a uniform method that works equally well across different observation periods. The method is based on the sliding window technique: a threshold is assigned to a sliding window rather than to the whole observation period. In addition, we determine sub-optimal values for the method's parameters, the window size and the threshold, and recommend 5-7 unique queries as an upper bound on the threshold for a 1-hour sliding window.
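The sliding-window test can be sketched directly: a client is flagged as a likely agent if any 1-hour window of its activity contains more unique queries than the recommended threshold. A minimal illustration (function and parameter names are ours, not the paper's):

```python
from collections import Counter

def max_unique_queries(events, window=3600.0):
    """Max number of distinct queries found in any `window`-second span.

    events: iterable of (timestamp_seconds, query_string) pairs.
    """
    events = sorted(events)
    in_window = Counter()
    left, best = 0, 0
    for t, q in events:
        in_window[q] += 1
        while events[left][0] <= t - window:   # expire events outside window
            old = events[left][1]
            in_window[old] -= 1
            if in_window[old] == 0:
                del in_window[old]
            left += 1
        best = max(best, len(in_window))
    return best

def is_agent(events, window=3600.0, threshold=7):
    """Flag a client whose unique-query count in some window exceeds threshold."""
    return max_unique_queries(events, window) > threshold
```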
A password stretching method using user specific salts BIBAFull-Text 1215-1216
  ChangHee Lee; Heejo Lee
In this paper, we present a password stretching method using user-specific salts. Our scheme takes a similar time to stretch a password as recent password stretching algorithms, but it increases the complexity of a pre-computation attack by a factor of 10^8 and the storage required for the pre-computation result by a factor of 10^8.
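The paper's specific scheme is not reproduced here, but the general pattern of salted password stretching it builds on is standard: derive the stored verifier by iterating a hash keyed with a per-user random salt, so a pre-computed dictionary must be rebuilt for every salt value. A sketch using PBKDF2:

```python
import hashlib
import hmac
import os

def stretch(password, salt, iterations=100_000):
    """Iterated salted hashing (PBKDF2-HMAC-SHA256) to slow brute force."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)

def register(password):
    salt = os.urandom(16)          # 128-bit user-specific salt
    return salt, stretch(password, salt)

def verify(password, salt, stored):
    return hmac.compare_digest(stretch(password, salt), stored)
```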
Simple authentication for the web BIBAFull-Text 1217-1218
  Timothy W. van der Horst; Kent E. Seamons
Automated email-based password reestablishment (EBPR) is an efficient, cost-effective means to deal with forgotten passwords. In this technique, email providers authenticate users on behalf of web sites. This method works because web sites trust email providers to deliver messages to their intended recipients. Simple Authentication for the Web (SAW) improves upon this basic approach to user authentication to create an alternative to password-based logins. SAW: 1) Removes the setup and management costs of passwords at sites that accept the risks of EBPR; 2) Provides single sign-on without a specialized identity provider; 3) Thwarts all passive attacks.

Semantic web

A management and performance framework for semantic web servers BIBAFull-Text 1219-1220
  Malena Mesarina; Venugopal Srinivasmurthy; Nic Lyons; Craig Sayers
The unification of Semantic Web query languages under the SPARQL standard and the development of commercial-quality implementations are encouraging industries to use semantic technologies for managing information. Current implementations, however, lack the performance monitoring and management services that the industry expects. In this paper, we present a performance and management framework interface to a generic SPARQL web server. We leverage existing standards for instrumentation to make the system ready-to-manage through existing monitoring applications, and we provide a performance framework which has the distinct feature of providing measurement results through the same SPARQL interface used to query data, eliminating the need for special interfaces.
A probabilistic semantic approach for discovering web services BIBAFull-Text 1221-1222
  Jiangang Ma; Jinli Cao; Yanchun Zhang
Service discovery is one of the challenging issues in service-oriented computing. Currently, most existing service discovery and matching approaches are keyword-based, which is inefficient and time-consuming. In this paper, we present a novel approach for discovering web services. Building on the currently dominant mechanisms for discovering and describing Web services with UDDI and WSDL, the proposed approach utilizes Probabilistic Latent Semantic Analysis (PLSA) to capture the semantic concepts hidden behind the words in queries and the advertisements of services, so that service matching can be carried out at the concept level. We also present related algorithms and preliminary experiments to evaluate the effectiveness of our approach.
Acquiring ontological knowledge from query logs BIBAFull-Text 1223-1224
  Satoshi Sekine; Hisami Suzuki
We present a method for acquiring ontological knowledge using search query logs. We first use query logs to identify important contexts associated with terms belonging to a semantic category; we then use these contexts to harvest new words belonging to this category. Our evaluation on selected categories indicates that the method works very well to help harvesting terms, achieving 85% to 95% accuracy in categorizing newly acquired terms.
Altering document term vectors for classification: ontologies as expectations of co-occurrence BIBAFull-Text 1225-1226
  Meenakshi Nagarajan; Amit Sheth; Marcos Aguilera; Kimberly Keeton; Arif Merchant; Mustafa Uysal
In this paper we extend the state-of-the-art in utilizing background knowledge for supervised classification by exploiting the semantic relationships between terms explicated in ontologies. Preliminary evaluations indicate that the new approach generally improves precision and recall, more so for hard-to-classify cases, and reveals patterns indicating the usefulness of such background knowledge.
Building and managing personalized semantic portals BIBAFull-Text 1227-1228
  Melike Sah; Wendy Hall
This paper presents a semantic portal, SEMPort, which provides better user support with personalized views, semantic navigation, ontology-based search, and three different kinds of semantic hyperlinks. Distributed content editing and provisioning are supported for real-time maintenance of the contents. As a case study, SEMPort is tested on the Course Modules Web Page (CMWP) of the School of Electronics and Computer Science (ECS).
Deriving knowledge from figures for digital libraries BIBAFull-Text 1229-1230
  Xiaonan Lu; James Z. Wang; Prasenjit Mitra; C. Lee Giles
Figures in digital documents contain important information. Current digital libraries do not summarize and index information available within figures for document retrieval. We present our system on automatic categorization of figures and extraction of data from 2-D plots. A machine-learning based method is used to categorize figures into a set of predefined types based on image features. An automated algorithm is designed to extract data values from solid line curves in 2-D plots. The semantic type of figures and extracted data values from 2-D plots can be integrated with textual information within documents to provide more effective document retrieval services for digital library users. Experimental evaluation has demonstrated that our system can produce results suitable for real-world use.
Development of a semantic web based mobile local search system BIBAFull-Text 1231-1232
  Joo Seong Jeon; Gi Jeong Lee
This paper describes the development of a semantic web and ontology based local search system that can be used in wireless mobile communication services.
Estimating the cardinality of RDF graph patterns BIBAFull-Text 1233-1234
  Angela Maduko; Kemafor Anyanwu; Amit Sheth; Paul Schliekelman
Most RDF query languages allow for graph structure search through a conjunction of triples which is typically processed using join operations. A key factor in optimizing joins is determining the join order which depends on the expected cardinality of intermediate results. This work proposes a pattern-based summarization framework for estimating the cardinality of RDF graph patterns. We present experiments on real world and synthetic datasets which confirm the feasibility of our approach.
Extending WebML towards semantic web BIBAFull-Text 1235-1236
  Federico Michele Facca; Marco Brambilla
Available methodologies for developing Semantic Web applications do not fully exploit the potential of interaction with ontological data sources. Here we introduce an extension of the WebML modeling framework that fulfills most of the design requirements emerging in the new area of the Semantic Web. We generalize the development process to support Semantic Web applications and introduce a set of new primitives for ontology importing and querying.
Image annotation by hierarchical mapping of features BIBAFull-Text 1237-1238
  Qiankun Zhao; Prasenjit Mitra; C. Lee Giles
In this paper, we propose a novel approach of image annotation by constructing a hierarchical mapping between low-level visual features and text features utilizing the relations within and across both visual features and text features. Moreover, we propose a novel annotation strategy that maximizes both the accuracy and the diversity of the generated annotation by generalizing or specifying the annotation in the corresponding annotation hierarchy.
   Experiments with 4500 scientific images from Royal Society of Chemistry journals show that the proposed annotation approach produces satisfactory results at different levels of annotations.
Integrating web directories by learning their structures BIBAFull-Text 1239-1240
  Christopher C. Yang; Jianfeng Lin
Documents on the Web are often organized into category trees by information providers (e.g. CNN, BBC) or search engines (e.g. Google, Yahoo!). Such category trees are commonly known as Web directories. The category tree structures from different internet content providers may be similar to some extent but are usually not exactly the same. As a result, it is desirable to integrate these category trees so that web users need only browse through a unified category tree to extract information from multiple providers. In this paper, we address this problem by capturing the structural information of multiple category trees, which embed the knowledge of professionals in organizing documents. Our experiments with real Web data show that the proposed technique is promising.
Learning ontologies to improve the quality of automatic web service matching BIBAFull-Text 1241-1242
  Hui Guo; Anca Andreea Ivan; Rama Akkiraju; Richard Goodwin
This paper presents a novel technique that significantly improves the quality of semantic Web service matching by (1) automatically generating ontologies based on Web service descriptions and (2) using these ontologies to guide the mapping between Web services.
Ontology engineering using volunteer labor BIBAFull-Text 1243-1244
  Benjamin M. Good; Mark D. Wilkinson
We describe an approach designed to reduce the costs of ontology development through the use of untrained, volunteer knowledge engineers. Results are provided from an experiment in which volunteers were asked to judge the correctness of automatically inferred subsumption relationships in the biomedical domain. The experiment indicated that volunteers can be recruited fairly easily but that their attention is difficult to hold, that most do not understand the subsumption relationship without training, and that incorporating learned estimates of trust into voting systems is beneficial to aggregate performance.
Semantic personalization of web portal contents BIBAFull-Text 1245-1246
  Christina Tziviskou; Marco Brambilla
Enriching Web applications with personalized data is of major interest for facilitating the user access to the published contents, and therefore, for guaranteeing successful user navigation. We propose a conceptual model for extracting personalized recommendations based on user profiling, ontological domain models, and semantic reasoning. The approach offers a high-level representation of the designed application based on a domain-specific metamodel for Web applications called WebML.
The largest scholarly semantic network...ever. BIBAFull-Text 1247-1248
  Johan Bollen; Marko A. Rodriguez; Herbert Van de Sompel; Lyudmila L. Balakireva; Aric Hagberg
Scholarly entities, such as articles, journals, authors and institutions, are now mostly ranked according to expert opinion and citation data. The Andrew W. Mellon Foundation funded MESUR project at the Los Alamos National Laboratory is developing metrics of scholarly impact that can rank a wide range of scholarly entities on the basis of their usage. The MESUR project starts with the creation of a semantic network model of the scholarly community that integrates bibliographic, citation, and usage data collected from publishers and repositories world-wide. It is estimated that this scholarly semantic network will include approximately 50 million articles, 1 million authors, 10,000 journals and conference proceedings, 500 million citations, and 1 billion usage-related events; the largest scholarly semantic network ever created. The developed scholarly semantic network will then serve as a standardized platform for the definition and validation of new metrics of scholarly impact. This poster describes the MESUR project's data aggregation and processing techniques including the OWL scholarly ontology that was developed to model the scholarly communication process.

Services

A kernel based structure matching for web services search BIBAFull-Text 1249-1250
  Yu Jianjun; Guo Shengmin; Su Hao; Zhang Hui; Xu Ke
This paper describes a kernel-based Web Services (abbreviated as service) matching mechanism for service discovery and integration. The matching mechanism tries to exploit the latent semantics in the structure of services. Using textual similarity and n-spectrum kernel values as low-level and mid-level features, we build a model to estimate the functional similarity between services, whose parameters are learned by a Ranking-SVM. Experimental results show that our approach improves several metrics for the retrieval of services.
A novel collaborative filtering-based framework for personalized services in m-commerce BIBAFull-Text 1251-1252
  Qiudan Li; Chunheng Wang; Guanggang Geng; Ruwei Dai
With the rapid growth of wireless technologies and handheld devices, m-commerce is becoming a promising research area. Personalization is especially important to the success of m-commerce. This paper proposes a novel collaborative filtering-based framework for personalized services in m-commerce. The framework extends our previous work by using Online Analytical Processing (OLAP) to represent the relations among user, content and context information, and by adopting a multi-dimensional collaborative filtering model to perform inference. It provides a powerful and well-founded mechanism for personalization in m-commerce. We implemented it in an existing m-commerce platform, and experimental results demonstrate its feasibility and correctness.
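The multi-dimensional inference over a (user, content, context) cube can be pictured with a toy sketch; the prediction and fallback scheme below is our own illustration, not the paper's OLAP-based model:

```python
def predict(ratings, user, item, context):
    """Context-aware prediction over a (user, item, context) cube:
    prefer other users' ratings for the same item in the same context,
    fall back to the item's overall mean, else abstain (None).
    `ratings` maps (user, item, context) -> numeric rating."""
    same_ctx = [r for (u, i, c), r in ratings.items()
                if i == item and c == context and u != user]
    if same_ctx:
        return sum(same_ctx) / len(same_ctx)
    any_ctx = [r for (u, i, c), r in ratings.items() if i == item]
    if any_ctx:
        return sum(any_ctx) / len(any_ctx)
    return None
```

A richer model would weight neighbors by user similarity within each context slice; this sketch only shows how the context dimension narrows the neighborhood.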
Towards service pool based approach for services discovery and subscription BIBAFull-Text 1253-1254
  Xuanzhe Liu; Li Zhou; Gang Huang; Hong Mei
In current web service discovery and subscription, consumers must spend too much time on manual selection and cannot easily benefit from the wide QoS spectrum offered by the proliferating services. In our approach, we introduce the service pool as a "virtual service" that groups identical services together and dispatches consumer requests to the proper service according to QoS requirements.
Crawling multiple UDDI business registries BIBAFull-Text 1255-1256
  Eyhab Al-Masri; Qusay H. Mahmoud
As Web services proliferate, the size and number of UDDI Business Registries (UBRs) are likely to increase. The ability to discover Web services of interest across multiple UBRs then becomes a major challenge, especially when using the primitive search methods provided by existing UDDI APIs. Clients do not have the time to endlessly search accessible UBRs to find appropriate services, particularly when operating via mobile devices. Finding services of interest should be time-effective and highly productive. This paper addresses issues relating to the efficient access and discovery of Web services across multiple UBRs and introduces a novel exploration engine, the Web Service Crawler Engine (WSCE). WSCE is capable of crawling multiple UBRs and enables the establishment of a centralized Web services repository that can be used to discover Web services much more efficiently. The paper presents experimental validation, results, and analysis of the proposed ideas.
Discovering the best web service BIBAFull-Text 1257-1258
  Eyhab Al-Masri; Qusay H. Mahmoud
Major research challenges in discovering Web services include provisioning of services across multiple or heterogeneous registries, differentiating between services that share similar functionalities, improving end-to-end Quality of Service (QoS), and enabling clients to customize the discovery process. The proliferation and interoperability of this multitude of Web services have led to the emergence of new standards for how services can be published, discovered, or used (i.e. UDDI, WSDL, SOAP). Such standards can potentially provide many of these features and much more; however, there are technical challenges associated with existing standards. One of these challenges is the client's ability to control the discovery process across accessible service registries for finding services of interest. This work proposes a solution to this problem and introduces the Web Service Relevancy Function (WsRF), used for measuring the relevancy ranking of a particular Web service based on QoS metrics and client preferences. We present experimental validation, results, and analysis of the presented ideas.
Mobile shopping assistant: integration of mobile applications and web services BIBAFull-Text 1259-1260
  Huaigu Wu; Yuri Natchetoi
The goal of this poster is to describe our implementation of a new architecture enabling efficient integration between mobile phone applications and Web Services. Using this architecture, we have implemented the mobile shopping assistant described below. To build this architecture, we designed an innovative XML compression mechanism to facilitate data exchange between mobile phones and Web Services. We also designed a smart connection manager to control asynchronous communication over all possible channels of a mobile phone. In addition, we used diverse input modes to extend users' access to Web Services.
On automated composition for web services BIBAFull-Text 1261-1262
  Zhongnan Shen; Jianwen Su
We develop a framework to compose services through discovery and orchestration for a given goal service. Tightening techniques are used in composition algorithms to achieve "completeness".
Providing session management as core business service BIBAFull-Text 1263-1264
  Ismail Ari; Jun Li; Riddhiman Ghosh; Mohamed Dekhil
It is extremely hard for a global organization with services over multiple channels to capture a consistent and unified view of its data, services, and interactions. While SOA and web services are addressing integration and interoperability problems, it is painful for an operational organization with legacy systems to quickly switch to service-based methods. We need methods that combine the advantages of traditional (i.e. web, desktop, or mobile) application development environments and service-based deployments.
   In this paper, we focus on the design and implementation of session management as a core service to support business processes and go beyond application-specific sessions and web sessions. We develop local session components for different platforms and complement them with a remote "session service" that is independent of applications and platforms. We aim to close the gap between the two worlds by combining their performance, availability and interoperability advantages.
Towards automating regression test selection for web services BIBAFull-Text 1265-1266
  Michael E. Ruth; Shengru Tu
This paper reports a safe regression test selection (RTS) approach that is designed for verifying Web services in an end-to-end manner. The Safe RTS technique has been integrated into a systematic method that monitors distributed code modifications and automates the RTS and RT processes.
Towards environment generated media: object-participation-type weblog in home sensor network BIBAFull-Text 1267-1268
  Takuya Maekawa; Yutaka Yanagisawa; Takeshi Okadome
The environment generated media (EGM) are defined here as being generated from a massive amount of and/or incomprehensible environmental data by compressing them into averages or representative values and/or by converting them into such user-friendly media as text, figures, charts, and animations. As an application of EGM, an object-participation-type weblog is introduced, where anthropomorphic indoor objects with sensor nodes post weblog entries and comments about what happened to them in a sensor networked environment.

Social networks

BlogScope: spatio-temporal analysis of the blogosphere BIBAFull-Text 1269-1270
  Nilesh Bansal; Nick Koudas
We present BlogScope (www.blogscope.net), a system for analyzing the Blogosphere. BlogScope is an information discovery and text analysis system that offers a set of unique features, including spatio-temporal analysis of blogs, flexible navigation of the Blogosphere through information bursts, keyword correlations and burst synopses, as well as enhanced ranking functions for improved query answer relevance. We describe the system, its design and the features of the current version of BlogScope.
EOS: expertise oriented search using social networks BIBAFull-Text 1271-1272
  Juanzi Li; Jie Tang; Jing Zhang; Qiong Luo; Yunhao Liu; Mingcai Hong
In this paper, we present the design and implementation of our expertise oriented search system, EOS (http://www.arnetminer.net). EOS is a researcher social network system. It has gathered information about half a million computer science researchers from the Web and constructed a social network among the researchers through their co-authorship. In particular, the relationships in the social network are used both in ranking experts for a given topic and in searching for associations between researchers. Our experimental results demonstrate that the proposed methods for expert finding and association search in a social network are both more effective and efficient than the baseline methods.
Exploring social dynamics in online media sharing BIBAFull-Text 1273-1274
  Martin J. Halvey; Mark T. Keane
It is now feasible to view media at home as easily as text-based pages were viewed when the World Wide Web (WWW) first emerged. This development has supported media sharing and search services providing hosting, indexing and access to large, online media repositories. Many of these sharing services also have a social aspect to them. This paper provides an initial analysis of the social interactions on a video sharing and search service (www.youtube.com). Results show that many users do not form social networks in the online community and a very small number do not appear to contribute to the wider community. However, it does seem that those people who do use the available tools have a much greater tendency to form social connections.
Finding community structure in mega-scale social networks: [extended abstract] BIBAFull-Text 1275-1276
  Ken Wakita; Toshiyuki Tsurumi
The community analysis algorithm proposed by Clauset, Newman, and Moore (the CNM algorithm) finds community structure in social networks. Unfortunately, the CNM algorithm does not scale well, and its use is practically limited to networks of up to 500,000 nodes. We show that this inefficiency is caused by merging communities in an unbalanced manner, and that a simple heuristic that attempts to merge community structures in a balanced manner can dramatically improve community structure analysis. The proposed techniques are tested using data sets obtained from an existing social networking service that hosts 5.5 million users. We have tested three variations of the heuristic. The fastest method processes an SNS friendship network with 1 million users in 5 minutes (70 times faster than CNM) and another friendship network with 4 million users in 35 minutes. Another variation processes a network with 500,000 nodes in 50 minutes (7 times faster than CNM), finds community structures with improved modularity, and scales to a network with 5.5 million nodes.
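A minimal sketch of the balanced-merge idea, assuming a CNM-style greedy modularity merge whose gain is scaled by a community-size ratio (the paper's actual heuristics and data structures differ and are far more scalable):

```python
from collections import defaultdict

def detect_communities(edges):
    """Greedy modularity merging with a balance heuristic: each
    candidate merge's modularity gain is scaled by the size ratio
    min(|a|,|b|)/max(|a|,|b|), so similar-sized communities merge first."""
    m = len(edges)
    comm = {}       # node -> community id
    members = {}    # community id -> set of member nodes
    for u, v in edges:
        for n in (u, v):
            if n not in comm:
                comm[n] = len(members)
                members[comm[n]] = {n}
    while True:
        e = defaultdict(float)     # fraction of edges between community pairs
        deg = defaultdict(float)   # fraction of edge endpoints per community
        for u, v in edges:
            a, b = comm[u], comm[v]
            deg[a] += 1 / (2 * m)
            deg[b] += 1 / (2 * m)
            if a != b:
                e[(min(a, b), max(a, b))] += 1 / m
        best, best_score = None, 0.0
        for (a, b), eab in e.items():
            dq = 2 * (eab - deg[a] * deg[b])   # CNM modularity gain
            small = min(len(members[a]), len(members[b]))
            big = max(len(members[a]), len(members[b]))
            score = dq * (small / big)         # balanced-merge heuristic
            if score > best_score:
                best, best_score = (a, b), score
        if best is None:                       # no positive gain left
            break
        a, b = best
        for n in members[b]:
            comm[n] = a
        members[a] |= members.pop(b)
    return list(members.values())
```

On two triangles joined by a single bridge edge, the greedy merge stops once cross-community merges would decrease modularity, recovering the two triangles.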
Life is sharable: mechanisms to support and sustain blogging life experience BIBAFull-Text 1277-1278
  Yun-Maw Cheng; Tzu-Chuan Chou; Wai Yu; Li-Chieh Chen; Ching-Long Yeh; Meng-Chang Chen
Recent trends in the development of mobile devices, wireless communications, sensor technologies, weblogs, and peer-to-peer communications have prompted new design opportunities for enhancing social interactions. This paper introduces our preliminary experiences in designing a prototype that utilizes the aforementioned technologies to share life experience. Users equipped with camera phones coupled with short-range communication technology, such as RFID, can capture life experience and share it as weblogs with other people. However, in reality, this is easier said than done. The success of weblogs relies on the active participation and willingness of people to contribute. To encourage active participation, a ranking system, AgreeRank, is specifically developed to keep users motivated.
Measuring credibility of users in an e-learning environment BIBAFull-Text 1279-1280
  Wei Wei; Jimmy Lee; Irwin King
Learning Villages (LV) is an E-learning platform on which people hold online discussions and frequently cite one another's postings. In this paper, we propose a novel method to rank the credibility of authors in the LV system. We first propose a k-EACM graph to describe the article citation structure in the LV system. We then build a weighted graph model, the k-UCM graph, to reveal the implicit relationship between authors hidden behind the citations among their articles. Furthermore, we design a graph-based ranking algorithm, the Credit Author Ranking (CAR) algorithm, which can be applied to rank nodes in a graph with negative edges. Finally, we perform experimental evaluations by simulations. The results illustrate that the proposed method performs well in ranking the credibility of users in the LV system.
Modeling user behavior in recommender systems based on maximum entropy BIBAFull-Text 1281-1282
  Tomoharu Iwata; Kazumi Saito; Takeshi Yamada
We propose a model for user purchase behavior in online stores that provide recommendation services. We model the purchase probability given recommendations for each user based on the maximum entropy principle, using features that capture recommendations and user interests. The proposed model enables us to measure the effect of recommendations on user purchase behavior, and this effect can be used to evaluate recommender systems. We show the validity of our model using the log data of an online cartoon distribution service, and measure the recommendation effects to evaluate its recommender system.
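With binary features, a maximum-entropy model of purchase probability reduces to logistic regression; the sketch below, with an illustrative feature vector (recommendation flag plus bias) and plain SGD training, is our assumption rather than the paper's exact setup:

```python
import math

def prob(w, x):
    """Maxent/logistic purchase probability for feature vector x."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def fit_maxent(samples, lr=0.5, epochs=500):
    """Fit weights by maximizing log-likelihood with SGD.
    `samples` is a list of (feature_vector, purchased) pairs, where
    purchased is 0 or 1."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            p = prob(w, x)
            for j, xj in enumerate(x):
                w[j] += lr * (y - p) * xj   # gradient of log-likelihood
    return w
```

The learned weight on the recommendation feature then quantifies the recommendation effect on purchase probability, in the spirit of the evaluation described above.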
Parallel crawling for online social networks BIBAFull-Text 1283-1284
  Duen Horng Chau; Shashank Pandit; Samuel Wang; Christos Faloutsos
Given a huge online social network, how do we retrieve information from it through crawling? Even better, how do we improve the crawling performance by using parallel crawlers that work independently? In this paper, we present a framework of parallel crawlers for online social networks, utilizing a centralized queue. To show how this works in practice, we describe our implementation of the crawlers for an online auction website. The crawlers work independently, so the failure of one crawler does not affect the others at all. The framework ensures that no redundant crawling occurs. Using the crawlers that we built, we visited a total of approximately 11 million auction users, about 66,000 of which were completely crawled.
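The centralized-queue design can be sketched as follows; the in-memory fetch function, idle timeout, and worker count are illustrative stand-ins for the real crawler:

```python
import queue
import threading

def parallel_crawl(seed, fetch, num_workers=4):
    """Independent workers share one centralized frontier queue.
    `fetch(url)` returns the outgoing links of a page (a stand-in for
    the real HTTP fetch). The shared `seen` set guarantees no URL is
    crawled twice; a failing fetch stops only the current request."""
    frontier = queue.Queue()
    seen = {seed}
    lock = threading.Lock()
    visited = []
    frontier.put(seed)

    def worker():
        while True:
            try:
                url = frontier.get(timeout=0.2)   # idle workers eventually exit
            except queue.Empty:
                return
            try:
                links = fetch(url)
            except Exception:
                frontier.task_done()              # one failure affects no one else
                continue
            with lock:
                visited.append(url)
                for link in links:
                    if link not in seen:          # enqueue each URL exactly once
                        seen.add(link)
                        frontier.put(link)
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited
```

Because duplicate checking happens under the lock before enqueueing, redundant crawling is prevented even though the workers proceed independently.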
Personalized social & real-time collaborative search BIBAFull-Text 1285-1286
  Mukesh Dalal
This paper presents Adaptive Web Search (AWS), a novel search technique that combines personalized, social, and real-time collaborative search. Preliminary empirical results from a small sample suggest that an AWS prototype built on WAMP platform using Yahoo! Web Search API generates more relevant results and allows faster discovery of information.
Towards extracting flickr tag semantics BIBAFull-Text 1287-1288
  Tye Rattenbury; Nathan Good; Mor Naaman
We address the problem of extracting semantics of tags -- short, unstructured text-labels assigned to resources on the Web -- based on each tag's metadata patterns. In particular, we describe an approach for extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website supporting time and location (latitude/longitude) metadata. The approach can be generalized to other domains where text terms can be extracted and associated with metadata patterns, such as geo-annotated web pages.
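One way to read "metadata patterns" for place semantics is spatial concentration of a tag's photos; the statistic and threshold below are our own toy assumptions, not the paper's method:

```python
from statistics import pstdev

def is_place_tag(coords, spread_threshold=0.5):
    """Classify a tag as having place semantics when the locations of
    its photos are tightly concentrated. `coords` is a list of
    (lat, lon) pairs drawn from the tag's photo metadata; the spread
    statistic (sum of coordinate std deviations, in degrees) and the
    threshold are illustrative assumptions."""
    lats = [c[0] for c in coords]
    lons = [c[1] for c in coords]
    spread = pstdev(lats) + pstdev(lons)
    return spread < spread_threshold
```

An analogous test over timestamps (concentration in time rather than space) would flag event semantics, mirroring the place/event distinction in the abstract.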

Systems

A no-frills architecture for lightweight answer retrieval BIBAFull-Text 1289-1290
  Marius Pasca
In a new model for answer retrieval, document collections are distilled offline into large repositories of facts. Each fact constitutes a potential direct answer to questions seeking a particular kind of entity or relation, such as questions asking about the date of particular events. Question answering becomes equivalent to online fact retrieval, which greatly simplifies the de-facto system architecture.
AutoPerf: an automated load generator and performance measurement tool for multi-tier software systems BIBAFull-Text 1291-1292
  Shrirang Sudhir Shirodkar; Varsha Apte
We present a load generator and performance measurement tool AutoPerf which requires minimal input and configuration from the user, and produces a comprehensive capacity analysis as well as server-side resource usage profile of a Web-based distributed system, in an automated fashion. The tool requires only the workload and deployment description of the distributed system, and automatically sets typical parameters that load generator programs need, such as maximum number of users to be emulated, number of users for each experiment, warm-up time, etc. The tool also does all the co-ordination required to generate a critical type of measure, namely, resource usage per transaction or per user for each software server. This is a necessary input for creating a performance model of a software system.
Construction by linking: the linkbase method BIBAFull-Text 1293-1294
  Johannes Meinecke; Frederic Majer; Martin Gaedke
The success of many innovative Web applications is not based on the content they produce -- but on how they combine and link existing content. Older Web Engineering methods lack flexibility in the sense that they rely strongly on a-priori knowledge of existing content structures and do not take into account initially unknown content sources. We propose the adoption of principles that are also found in Component-based Software Engineering, to assemble highly extensible solutions from reusable artifacts. The main contribution of our work is a support system, consisting of a central service that manages n:m relationships between arbitrary Web resources, and of Web application components that realize navigation, presentation, and interaction for the linked content.
Image collector III: a web image-gathering system with bag-of-keypoints BIBAFull-Text 1295-1296
  Keiji Yanai
We propose a new system to mine visual knowledge on the Web.
   The Web holds huge amounts of image data as well as text data. However, mining image data from the Web has received less attention than mining text data, since handling the semantics of images is much more difficult. In this paper, we propose introducing a state-of-the-art image recognition technique, the bag-of-keypoints representation, into the Web image-gathering task. Our experiments show that the proposed system greatly outperforms our previous systems and Google Image Search.
Mirror site maintenance based on evolution associations of web directories BIBAFull-Text 1297-1298
  Ling Chen; Sourav Bhowmick; Wolfgang Nejdl
Mirroring Web sites is a well-known technique commonly used in the Web community. A mirror site should be updated frequently to ensure that it reflects the content of the original site. Existing mirroring tools apply page-level strategies that check each page of a site, which is inefficient and expensive. In this paper, we propose a novel site-level mirror maintenance strategy. Our approach studies the evolution of Web directory structures and mines association rules between ancestor-descendant Web directories. The discovered rules indicate evolution correlations between Web directories. Thus, when maintaining the mirror of a Web site (directory), we can optimally skip subdirectories that are negatively correlated with it with respect to undergoing significant changes. Preliminary experimental results show that our approach improves the efficiency of the mirror maintenance process significantly while sacrificing only slightly the "freshness" of the mirrors.
On building graphs of documents with artificial ants BIBAFull-Text 1299-1300
  Hanane Azzag; Julien Lavergne; Christiane Guinot; Gilles Venturini
We present an incremental algorithm for building a neighborhood graph from a set of documents. This algorithm is based on a population of artificial agents that imitate the way real ants build structures with self-assembly behaviors. We show that our method outperforms standard algorithms for building such neighborhood graphs (up to 2230 times faster on the tested databases with equal quality) and how the user may interactively explore the graph.
Towards a scalable search and query engine for the web BIBAFull-Text 1301-1302
  Aidan Hogan; Andreas Harth; Jürgen Umrich; Stefan Decker
Current search engines do not fully leverage semantically rich datasets, or specialise in indexing just one domain-specific dataset.
   We present a search engine that uses the RDF data model to enable interactive query answering over richly structured and interlinked data collected from many disparate sources on the Web.
Web4CE: accessing web-based applications on consumer devices BIBAFull-Text 1303-1304
  Walter Dees; Paul Shrubsole
In a world where all devices will be interconnected, the boundaries between the different devices will start to disappear. Devices will be able to access each other's applications; sessions can be suspended on one device and resumed on another; devices can serve as each other's input and output devices; and all devices will be able to connect to the Internet. This will give true mobility to the user, who will no longer be restricted to the time and location at which he or she accesses an application. Of course, a variety of different mechanisms and technologies are needed to enable this, such as: remote rendering of UIs on other devices in the network; infrastructure for discovering clients and servers in a network; mechanisms to exchange capability information between devices, and to adapt the UI based on these capabilities; mechanisms to deal with session migration; and support for a wide range of consumer devices, ranging from mobile phones to high-end TVs.
   This requires technologies that cross different domains, i.e. the PC domain, mobile domain, and TV domain. Several major companies within these different domains have decided to work together on these issues. One of the results is a framework for remote user interfaces for both UPnP networks and the Internet. This framework is called Web4CE (a.k.a. CEA-2014) [1], and has been accepted as the baseline remote user interface technology within the Digital Living Network Alliance (DLNA) [2], which is a large industry-wide effort for creating true interoperability between network-enabled devices.
   This paper provides a short overview of the Web4CE framework, and some of the use cases that it enables.
Web mashup scripting language BIBAFull-Text 1305-1306
  Marwan Sabbouh; Jeff Higginson; Salim Semy; Danny Gagne
The Web Mashup Scripting Language (WMSL) enables an end-user (you) working from his browser, i.e. not needing any other infrastructure, to quickly write mashups that integrate any two or more web services on the Web. The end-user accomplishes this by writing a web page that combines HTML, metadata in the form of mapping relations, and a small piece of code, or script. The mapping relations enable not only the discovery and retrieval of WMSL pages, but also effect a new programming paradigm that abstracts many programming complexities from the script writer. Furthermore, the WMSL Web pages or scripts that disparate end-users write can be harvested by crawlers to automatically generate the concepts needed to build lightweight ontologies containing the local semantics of a web service and its data model, to extend context ontologies or middle ontologies, and to develop links, or mappings, between these ontologies. This enables an open-source model of building ontologies based on the WMSL Web pages or scripts that end users write.

User interfaces & accessibility

A browser for a public-domain SpeechWeb BIBAFull-Text 1307-1308
  Richard A. Frost; Xiaoli Ma; Y. Shi
A SpeechWeb is a collection of hyperlinked applications, which are accessed remotely by speech browsers running on end-user devices. Links are activated through spoken commands. Despite the fact that protocols and technologies for creating and deploying speech applications have been readily available for several years, we have not seen the development of a Public-Domain SpeechWeb. In this paper, we show how freely available software and commonly used communication protocols can be used to change this situation.
A novel clustering-based RSS aggregator BIBAFull-Text 1309-1310
  Xin Li; Jun Yan; Zhihong Deng; Lei Ji; Weiguo Fan; Benyu Zhang; Zheng Chen
In recent years, various commercial Weblog subscribing systems have been proposed to return stories from users' subscribed feeds. In this paper, we propose a novel clustering-based RSS aggregator called RSS Clusgator System (RCS) for Weblog reading. Note that an RSS feed may cover several different topics, and a user may be interested in only a subset of them. In addition, there may be many different stories from multiple RSS feeds that discuss a similar topic from different perspectives; a user may be interested in this topic but not know how to collect all feeds related to it. In contrast to many previous works, we cluster all stories in RSS feeds into a hierarchical structure to better serve readers. In this way, users can easily find all the stories they are interested in. To keep the system current, we propose a flexible time window for incremental clustering. RCS utilizes both link information and content information for efficient clustering. Experiments show the effectiveness of RCS.
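The time-windowed incremental clustering might look like this sketch; the similarity measure, threshold, and windowing policy are our assumptions, and real RCS also exploits link structure and a topic hierarchy:

```python
def jaccard(a, b):
    """Set overlap similarity between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def incremental_cluster(stories, window, threshold=0.3):
    """Incrementally cluster (timestamp, term_set) stories. Each new
    story joins the most similar cluster, but only stories inside the
    sliding time window attract new members, so a revived topic after
    a long gap starts a fresh cluster."""
    clusters = []   # each cluster: list of (timestamp, term_set)
    for ts, terms in sorted(stories, key=lambda s: s[0]):
        best, best_sim = None, threshold
        for c in clusters:
            recent = [s for t, s in c if ts - t <= window]
            if not recent:
                continue            # cluster aged out of the window
            rep = set().union(*recent)
            sim = jaccard(terms, rep)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([(ts, terms)])
        else:
            best.append((ts, terms))
    return clusters
```

With real feeds, the term sets would come from tokenized story titles and summaries, and the window length would be tuned per topic drift rate.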
Adaptive faceted browser for navigation in open information spaces BIBAFull-Text 1311-1312
  Michal Tvarozek; Maria Bielikova
Open information spaces have several unique characteristics such as their changeability, large size, complexity and diverse user base. These result in novel challenges during user navigation, information retrieval and data visualization in open information spaces.
   We propose a method of navigation in open information spaces based on an enhanced faceted browser with support for dynamic facet generation and adaptation based on user characteristics.
An assessment of tag presentation techniques BIBAFull-Text 1313-1314
  Martin J. Halvey; Mark T. Keane
With the growth of social bookmarking, a new approach for metadata creation called tagging has emerged. In this paper we evaluate tag presentation techniques. The main goal of our evaluation is to investigate the effect of some of the different properties that can be utilized in presenting tags, e.g. alphabetization, larger fonts, etc. We show that a number of these factors affect the ease with which users can find tags and use the tools that present tags to them.
An information state-based dialogue manager for making voice web smarter BIBAFull-Text 1315-1316
  Marta Gatius; Meritxell González; Elisabet Comelles
In this paper we propose the integration of intelligent components technologies (natural language and discourse management) in voice web interfaces to make them smarter. We describe how we have integrated reusable components of dialogue management and language processing in a multilingual voice system to improve its friendliness and portability. The dialogue management component deals with complex dialogue phenomena, such as user-initiative dialogues, and follows the information state-based theory. The resulting dialogue system supports friendly communication (through the telephone and the web) in several languages: English, Spanish, Catalan and Italian. The dialogue system has been adapted to guide the users to access online public administration services.
Behavior based web page evaluation BIBAFull-Text 1317-1318
  Ganesan Velayathan; Seiji Yamada
This paper describes our efforts to investigate factors in users' browsing behavior in order to automatically evaluate web pages in which a user shows interest. To evaluate web pages automatically, we developed a client-side logging/analyzing tool: the GINIS Framework. This work focuses primarily on client-side user behavior, using a customized web browser and AJAX technologies. First, GINIS unobtrusively gathers logs of user behavior through the user's natural interaction with the web browser. Then it analyzes the logs and extracts effective rules for evaluating web pages using the C4.5 machine learning system. Eventually, GINIS becomes able to automatically evaluate web pages using these learned rules.
Generating efficient labels to facilitate web accessibility BIBAFull-Text 1319-1320
  Leo Spalteholz; Kin Fun Li; Nigel Livingston
For many users with a disability it can be difficult or impossible to use a computer mouse to navigate the web. An alternative way to select elements on a web page is the label typing approach, in which users select elements by typing part of the label. In most cases, these labels are specified by the page authors, but some selectable elements do not have an obvious textual description, thus requiring that a label be generated. The set of element labels on a web page must be both efficient to select by text input and meaningful to the user. This paper discusses our approach to this problem, using page structural analysis and user history to determine important elements of a page, and then matching this information with the efficiency model of the input device.
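A toy version of generating efficient labels is assigning each selectable element the shortest typing prefix that selects it unambiguously; the paper's actual model additionally weighs page structure, user history, and the input device's efficiency:

```python
def shortest_unique_prefixes(labels):
    """Map each label to the shortest lowercase prefix that matches it
    alone among all labels, so the user can select an element with the
    fewest keystrokes. Duplicate labels fall back to the full label."""
    out = {}
    for label in labels:
        for n in range(1, len(label) + 1):
            prefix = label[:n].lower()
            if sum(1 for l in labels if l.lower().startswith(prefix)) == 1:
                out[label] = prefix
                break
        else:
            out[label] = label.lower()
    return out
```

A device-aware variant would score candidate prefixes by per-character input cost (e.g. on a scanning keyboard) instead of raw length.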
Generation, documentation and presentation of mathematical equations and symbolic scientific expressions using pure HTML and CSS BIBAFull-Text 1321-1322
  Kehinde Alabi
This paper describes a comprehensive method for presenting mathematical equations and expressions using only pure HTML and CSS. The method renders the equations portable and editable, and contrasts with previous procedures that represent equations as whole graphic objects. Methods for generating and documenting the equations using HTML and JavaScript are also described, such that the equations can be interpreted and converted to or from other formats such as LaTeX, MathML, or linear representations.
GeoTV: navigating geocoded RSS to create an IPTV experience BIBAFull-Text 1323-1324
  Yih-Farn Chen; Giuseppe Di Fabbrizio; David Gibbon; Rittwik Jana; Serban Jora; Bernard Renger; Bin Wei
The Web is rapidly moving towards a platform for mass collaboration in content production and consumption across three screens: computers, mobile phones, and TVs. While there has been a surge of interest in making Web content accessible from mobile devices, there is a significant lack of progress when it comes to making the web experience suitable for viewing on a television. Towards this end, we describe a novel concept, namely GeoTV, where we explore a framework by which web content can be presented or pushed in a meaningful manner to create an entertainment experience for the TV audience. Fresh content on a variety of topics, people, and places is being created and made available on the Web at breathtaking speed. Navigating fresh content effectively on TV demands a new browsing paradigm that requires fewer clicks and user interactions from the remote control. GeoTV provides novel geospatial and temporal browsing techniques that allow users to aggregate and navigate RSS-enabled content in a timely, personalized, and automatic manner for viewing in an IPTV environment. This poster is an extension of our previous work on GeoTracker, which utilizes both a geospatial representation and a temporal (chronological) presentation to help users spot the most relevant updates quickly within the context of a Web-enabled environment. We demonstrate 1) the usability of such a tool, which greatly enhances users' ability to locate and browse videos based on their geographical interests, and 2) various innovative interface designs for showing RSS-enabled information in an IPTV environment.
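The geospatial navigation idea can be sketched at its simplest: filter geocoded feed items by the map region currently on screen. The sketch assumes feeds carry GeoRSS-Simple `<georss:point>` elements; the actual GeoTV pipeline is not described at this level of detail:

```python
import xml.etree.ElementTree as ET

# Toy feed with GeoRSS-Simple points (hypothetical content).
FEED = """<rss xmlns:georss="http://www.georss.org/georss"><channel>
  <item><title>Banff trail cam</title><georss:point>51.18 -115.57</georss:point></item>
  <item><title>Calgary festival</title><georss:point>51.05 -114.07</georss:point></item>
</channel></rss>"""

NS = {"georss": "http://www.georss.org/georss"}

def items_in_box(feed_xml, lat_min, lat_max, lon_min, lon_max):
    """Return titles of items whose point falls inside the bounding box."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        point = item.find("georss:point", NS)
        if point is None:
            continue
        lat, lon = map(float, point.text.split())
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            hits.append(item.findtext("title"))
    return hits

print(items_in_box(FEED, 51.0, 51.3, -116.0, -115.0))  # ['Banff trail cam']
```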
Summarization of online image collections via implicit feedback BIBAFull-Text 1325-1326
  Shane Ahern; Simon King; Mor Naaman; Rahul Nair
The availability of map interfaces and location-aware devices makes a growing amount of unstructured, geo-referenced information available on the Web. In particular, over twelve million geo-referenced photos are now available on Flickr, a popular photo-sharing website. We show a method to analyze the Flickr data and generate aggregate knowledge in the form of "representative tags" for arbitrary areas in the world. We display these tags on a map interface in an interactive web application along with images associated with each tag. We then use the implicit feedback of the aggregate user interactions with the tags and images to learn which images best describe the area shown on the map.
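A minimal sketch of "representative tag" selection for a map area, assuming a TF-IDF-style weighting (tags frequent inside the area but rare in the whole corpus score highest); the paper's actual scoring and its implicit-feedback loop are more involved:

```python
from collections import Counter
import math

def representative_tags(area_photos, all_photos, k=2):
    """Rank tags that are frequent in the area relative to the whole corpus."""
    area = Counter(t for p in area_photos for t in p)
    total = Counter(t for p in all_photos for t in p)
    # Term frequency in the area, damped by overall document frequency.
    scores = {t: n * math.log(len(all_photos) / total[t]) for t, n in area.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])][:k]

all_photos = [{"bridge", "fog"}, {"bridge", "bay"}, {"fog"}, {"park"}, {"park"}]
area_photos = all_photos[:2]          # photos taken inside the map viewport
print(representative_tags(area_photos, all_photos))  # ['bridge', 'bay']
```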
System for reminding a user of information obtained through a web browsing experience BIBAFull-Text 1327-1328
  Tetsushi Morita; Tetsuo Hidaka; Akimichi Tanaka; Yasuhisa Kato
We propose a system for reminding a user of information obtained through a web browsing experience. The system extracts keywords from the content of the web page currently being viewed and retrieves the context of past web browsing related to those keywords. We define a context as a sequence of web browsing in which many web pages related to the keyword were viewed intensively, because we assume that much of the information connected to the current content was obtained in that sequence.
   The information is not only which pages you viewed but also how you found those pages and what knowledge you acquired from them. Specifically, when you browse web pages, this system automatically displays a list of the contexts judged to be important in relation to the current web page. If you select a context, its details are shown graphically, with marks indicating characteristic activities.
The ScratchPad: sensemaking support for the web BIBAFull-Text 1329-1330
  David Gotz
The World Wide Web is a powerful platform for a wide range of information tasks. Dramatic advances in technology, such as improved search capabilities and the AJAX application model, have enabled entirely new web-based applications and usage patterns, making many tasks easier to perform than ever before. However, few tools have been developed to assist with sensemaking tasks: complex research behaviors in which users gather and comprehend information from many sources to answer potentially vague, non-procedural questions. Sensemaking tasks are common and include, for example, researching vacation destinations or deciding how to invest. This paper presents the ScratchPad, an extension to the standard browser interface that is designed to capture, organize, and exploit the information discovered while performing a sensemaking task.
Towards multi-granularity multi-facet e-book retrieval BIBAFull-Text 1331-1332
  Chong Huang; Yonghong Tian; Zhi Zhou; Tiejun Huang
Generally speaking, digital libraries have multiple granularities of semantic units: book, chapter, page, paragraph, and word. However, current eBook retrieval systems have two limitations: (1) the granularity of retrievable units is either too big or too small, ignoring intermediate scales such as chapters and paragraphs; and (2) retrieval results are not grouped by facets, which would facilitate users' browsing and exploration. To overcome these limitations, we propose a multi-granularity, multi-facet eBook retrieval approach.
Visualizing structural patterns in web collections BIBAKFull-Text 1333-1334
  M. S. Ali; Mariano P. Consens; Flavio Rizzolo
We present DescribeX, a tool for exploring and visualizing the structural patterns present in collections of XML documents. DescribeX can be employed by developers to interactively discover those XPath expressions that will actually return elements present in a collection of XML files.
Keywords: RSS, XML, XPath, structural summaries, visualization
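The structural summary behind tools of this kind can be sketched by collecting the distinct tag paths that actually occur in a collection; only XPath expressions matching one of these paths can return elements. This is a simplified label-path summary, not DescribeX's full axis-based summaries:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_summary(documents):
    """Count occurrences of every root-to-element tag path in the collection."""
    counts = Counter()
    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        counts[path] += 1
        for child in elem:
            walk(child, path)
    for doc in documents:
        walk(ET.fromstring(doc), "")
    return counts

docs = ["<rss><channel><item/><item/></channel></rss>",
        "<rss><channel><title/></channel></rss>"]
summary = path_summary(docs)
print(summary["/rss/channel/item"])    # 2
print("/rss/channel/link" in summary)  # False
```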
Adaptive record extraction from web pages BIBAFull-Text 1335-1336
  Justin Park; Denilson Barbosa
We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering for obtaining data extraction patterns.
   We compare our method experimentally to the state-of-the-art, and show that our approach is very competitive for rigidly-structured records (such as product descriptions) and far superior for loosely-structured records (such as entries on blogs).
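A weighted tree-matching similarity between two DOM subtrees can be sketched as follows; the weighting here (matching tags score 1, children compared pairwise) is our own illustration, and the paper's metric aligns children optimally rather than positionally:

```python
import xml.etree.ElementTree as ET

def tree_match(a, b):
    """Simple top-down match score: shared tags weighted by subtree overlap."""
    if a.tag != b.tag:
        return 0.0
    score = 1.0
    for ca, cb in zip(list(a), list(b)):  # pairwise; real matching aligns optimally
        score += tree_match(ca, cb)
    return score

r1 = ET.fromstring("<div><span/><a/><img/></div>")
r2 = ET.fromstring("<div><span/><a/></div>")
r3 = ET.fromstring("<ul><li/><li/></ul>")
print(tree_match(r1, r2))  # 3.0  -- similar records cluster together
print(tree_match(r1, r3))  # 0.0
```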
Exploit sequencing views in semantic cache to accelerate XPath query evaluation BIBAFull-Text 1337-1338
  Jianhua Feng; Na Ta; Yong Zhang; Guoliang Li
In XML databases, materializing queries and their results into views in a semantic cache can improve the performance of query evaluation by reducing computational complexity and I/O cost. Although there are a number of proposals of semantic caches for XML queries, the issues of fast cache lookup and compensation query construction deserve further study. In this paper, based on sequential XPath queries, we propose fastCLU, a fast Cache LookUp algorithm, and effiCQ, an efficient Compensation Query construction algorithm, to solve these two problems. Experimental results show that our algorithms outperform previous algorithms and achieve good query evaluation performance.
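The lookup-plus-compensation idea can be sketched for the simplest case, assuming a query is answerable from a cached view when the view's path is a string prefix of the query path; fastCLU and effiCQ handle far richer containment cases than this:

```python
def cache_lookup(cache, query):
    """Return (view, compensation) if some cached view path contains the query."""
    for view in cache:
        if query == view:
            return view, "."  # exact hit, no compensation needed
        if query.startswith(view + "/"):
            # Compensation query: evaluate the remainder over the view's result.
            return view, "." + query[len(view):]
    return None

cache = ["/site/people", "/site/regions/asia"]
print(cache_lookup(cache, "/site/people/person/name"))
# ('/site/people', './person/name')
print(cache_lookup(cache, "/site/closed_auctions"))  # None
```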
Extensible schema documentation with XSLT 2.0 BIBAFull-Text 1339-1340
  Felix Michel; Erik Wilde
XML Schema documents are defined using an XML syntax, which means that the idea of generating schema documentation through standard XML technologies is intriguing. We present X2Doc, a framework for generating schema documentation solely through XSLT. The framework uses SCX, an XML syntax for XML Schema components, as an intermediate format and produces XML-based output formats. Using a modular set of XSLT stylesheets, X2Doc is highly configurable and carefully crafted towards extensibility. This proves especially useful for composite schemas, where additional schema information such as Schematron rules is embedded into XML Schemas.
Preserving XML queries during schema evolution BIBAFull-Text 1341-1342
  Mirella M. Moro; Susan Malaika; Lipyeow Lim
In XML databases, new schema versions may be released as frequently as once every two weeks. This poster describes a taxonomy of changes for XML schema evolution. It examines the impact of those changes on schema validation and query evaluation. Based on that study, it proposes guidelines for XML schema evolution and for writing queries in such a way that they continue to operate as expected across evolving schemas.
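One guideline of this kind can be sketched concretely: prefer descendant-axis queries over brittle absolute paths, so a query keeps working when a schema revision wraps an element. The example schemas are ours, for illustration only:

```python
import xml.etree.ElementTree as ET

# Two schema versions of the same document: v2 wraps <price> in <summary>.
v1 = ET.fromstring("<order><price>9.99</price></order>")
v2 = ET.fromstring("<order><summary><price>9.99</price></summary></order>")

# Absolute child path: breaks after the wrapper element is introduced.
print(v1.findtext("price"), v2.findtext("price"))        # 9.99 None
# Descendant query: survives this kind of schema evolution.
print(v1.findtext(".//price"), v2.findtext(".//price"))  # 9.99 9.99
```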
SPath: a path language for XML schema BIBAFull-Text 1343-1344
  Erik Wilde; Felix Michel
XML is increasingly being used as a typed data format, and therefore it becomes more important to gain access to the type system; very often this is an XML Schema. The XML Schema Path Language (SPath) presented in this paper provides access to XML Schema components by extending the well-known XPath language to also cover the domain of XML Schemas. Using SPath, XML developers gain access to XML Schemas and can thus more easily develop software that is type- or schema-aware, and therefore more robust.
The use of XML to express a historical knowledge base BIBAFull-Text 1345-1346
  Katsuko T. Nakahira; Masashi Matsui; Yoshiki Mikami
Since conventional historical records have been written assuming human readers, they are not well-suited for computers to collect and process automatically. If computers could understand descriptions in historical records and process them automatically, it would be easy to analyze them from different perspectives. In this paper, we review a number of existing frameworks used to describe historical events, and make a comparative assessment of these frameworks in terms of usability, based on the "deep cases" of Fillmore's case grammar. Based on this assessment, we propose a new description framework, and have created a microformat vocabulary set suitable for that framework.
U-REST: an unsupervised record extraction system BIBAFull-Text 1347-1348
  Yuan Kui Shen; David R. Karger
In this paper, we describe a system that can extract record structures from web pages with no direct human supervision.
   Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products, company profiles, etc. We present a simplified framework for studying the problem of unsupervised record extraction, one which separates the algorithms from the feature engineering.
   Our system, U-REST, formalizes an approach to the problem of unsupervised record extraction using a simple two-stage machine learning framework. The first stage involves clustering, where structurally similar regions are discovered, and the second stage involves classification, where discovered groupings (clusters of regions) are ranked by their likelihood of being records. We describe and summarize the results of an extensive survey of features for both stages, and conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging improvements in extraction accuracy.
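The two-stage idea can be sketched as: (1) cluster structurally similar regions by a tag-path signature, then (2) rank clusters by a crude "record-likeness" score. Cluster size serves as the score here purely for illustration; U-REST learns a classifier over many such features:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def signature(elem):
    """Shallow structural signature: own tag plus sorted child tags."""
    return (elem.tag, tuple(sorted(c.tag for c in elem)))

def candidate_records(root):
    # Stage 1: cluster subtrees that share a structural signature.
    clusters = defaultdict(list)
    for elem in root.iter():
        for child in elem:
            clusters[signature(child)].append(child)
    # Stage 2: rank groupings; bigger repeated structures look more record-like.
    return sorted(clusters.items(), key=lambda kv: -len(kv[1]))

page = ET.fromstring(
    "<body><ul>"
    "<li><b/><span/></li><li><b/><span/></li><li><b/><span/></li>"
    "</ul><p/></body>")
best_sig, members = candidate_records(page)[0]
print(best_sig, len(members))  # ('li', ('b', 'span')) 3
```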
XML-based multimodal interaction framework for contact center applications BIBAFull-Text 1349-1350
  Nikolay Anisimov; Brian Galvin; Herbert Ristock
In this paper, we consider a way to represent contact center applications as a set of multiple XML documents written in different markups, including VoiceXML and CCXML. Applications can comprise a dialog with an IVR, call routing, and agent scripting functionalities. We also consider how such applications can be executed in a run-time contact center environment.
XML-based XML schema access BIBAFull-Text 1351-1352
  Erik Wilde; Felix Michel
XML Schema's abstract data model consists of components, which are the structures that eventually define a schema as a whole. XML Schema's XML syntax, on the other hand, is not a direct representation of the schema components, and it proves to be surprisingly hard to derive a schema's components from the XML syntax. The Schema Component XML Syntax (SCX) is a representation which attempts to map schema components as faithfully as possible to XML structures. SCX serves as the starting point for applications which need access to schema components and want to do so using standardized and widely available XML technologies.