Proceedings of the 2008 International Conference on the World Wide Web

Fullname: Proceedings of the 17th International Conference on World Wide Web
Editors: Jinpeng Huai; Robin Chen; Hsiao-Wuen Hon; Yunhao Liu; Wei-Ying Ma; Andrew Tomkins; Xiaodong Zhang
Location: Beijing, China
Dates: 2008-Apr-21 to 2008-Apr-25
Publisher: ACM
Standard No: ISBN 1-60558-085-6, 978-1-60558-085-2
Papers: 235
Pages: 1288
Links: Conference Home Page
  1. Browsers and UI
  2. Data mining: log analysis
  3. Data mining: learning
  4. Data mining: modeling
  5. Data mining: algorithms
  6. Internet monetization: online advertising
  7. Internet monetization: recommendation and security
  8. Internet monetization: sponsored search
  9. Mobility
  10. Performance and scalability
  11. Rich media
  12. Search: query analysis
  13. Search: corpus characterization and search performance
  14. Search: ranking and retrieval enhancement
  15. Search: crawlers
  16. Search: applications
  17. Security I: misc
  18. Security II: web client security
  19. Semantic web I
  20. Semantic web II
  21. Semantic web III
  22. Social networks: analysis of social networks & online interactive spaces
  23. Social networks: discovery & evolution of communities
  24. Social networks: applications and infrastructures
  25. Web engineering -- applications
  26. Web engineering -- web service composition
  27. Web engineering -- web service deployment
  28. XML I
  29. XML II
  30. Industrial track session
  31. Technology for developing regions
  32. WWW in China: mining the Chinese web
  33. WWW in China: Chinese web innovations
  34. Posters
  35. Panels
  36. Workshops

Browsers and UI

Personalized web exploration with task models, pp. 1-10
  Jae-wook Ahn; Peter Brusilovsky; Daqing He; Jonathan Grady; Qi Li
Personalized Web search has emerged as one of the hottest topics for both the Web industry and academic researchers. However, the majority of studies on personalized search have focused on a rather simple type of search, leaving an important research topic -- personalization in exploratory searches -- under-studied. In this paper, we present a study of personalization in task-based information exploration using a system called TaskSieve. TaskSieve is a Web search system that utilizes a relevance feedback based profile, called a "task model", for personalization. Its innovations include flexible and user-controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models. Through an empirical study in which human subjects conducted task-based exploratory searches, we demonstrate that TaskSieve pushes significantly more relevant documents to the top of search result lists than a traditional search system. TaskSieve helps users select significantly more accurate information for their tasks, allows them to do so with higher productivity, and is viewed more favorably by subjects on several usability-related characteristics.
Keywords: adaptive search, empirical study, personalization, task model, task-based information exploration, user profile
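A minimal sketch of the kind of score fusion such a system performs, assuming cosine similarity over bag-of-words vectors; the alpha weight, the Counter-based task model, and the toy documents are illustrative assumptions, not TaskSieve's actual implementation:

    # Blend query relevance with task-model relevance; alpha=1.0 ignores
    # the task model, alpha=0.0 ranks purely by it.
    from collections import Counter
    import math

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def rank(docs, query, task_model, alpha=0.5):
        q = Counter(query.split())
        scored = []
        for doc in docs:
            d = Counter(doc.split())
            score = alpha * cosine(q, d) + (1 - alpha) * cosine(task_model, d)
            scored.append((score, doc))
        return sorted(scored, reverse=True)

    # Task model built from relevance feedback: term -> weight.
    task_model = Counter({"visa": 3, "embassy": 2, "application": 1})
    print(rank(["embassy visa rules", "tourist spots"], "visa", task_model))

Exposing alpha as a user control mirrors the paper's theme of flexible, user-controlled integration of queries and task models.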
Validating the use and role of visual elements of web pages in navigation with an eye-tracking study, pp. 11-20
  Yeliz Yesilada; Caroline Jay; Robert Stevens; Simon Harper
This paper presents an eye-tracking study that examines how people use the visual elements of Web pages to complete certain tasks. While sighted users can rely on these elements to accomplish such tasks, visually disabled users cannot; this lack of access to some visual elements of a page hinders visually disabled users in accomplishing these tasks. Our previous work introduced a framework that identifies these elements and then reengineers Web pages so that these elements can play their intended roles in an audio, as well as visual, presentation. To further improve our understanding of how these elements are used and to validate our framework, we track the eye movements of sighted users performing a number of different tasks. The resulting gaze data show that there is a strong relationship between the aspects of a page that receive visual attention and the objects identified by our framework. The study also reveals some limitations, as well as yielding information to address these shortcomings. Perhaps the most important result is the support provided for a particular kind of object called a Way Edge -- the visual construct used to group content into sections. There is a significant effect of Way Edges on the distribution of attention across tasks. This result not only provides strong evidence for the utility of re-engineering, but also has consequences for our understanding of how people allocate attention to different parts of a page. We speculate that the phenomenon of 'Banner Blindness' owes as much to Way Edges as it does to colour and font size.
Keywords: eye-tracking, navigation, travel, travel objects, visual disability
Improving relevance judgment of web search results with image excerpts, pp. 21-30
  Zhiwei Li; Shuming Shi; Lei Zhang
Current web search engines return result pages containing mostly text summaries, even though the matched web pages may contain informative pictures. A text excerpt (i.e., snippet) is generated for each returned page by selecting keywords around the matched query terms, to provide context for the user's relevance judgment. However, in many scenarios, we found that the pictures in web pages, if selected properly, could be added to search result pages and provide richer contextual description, because a picture is worth a thousand words. We call this new type of summary an image excerpt. Through a well-designed user study, we demonstrate that image excerpts can help users make much quicker relevance judgments of search results for a wide range of query types. To implement this idea, we propose a practicable approach to automatically generate image excerpts in the result pages by considering the dominance of each picture in each web page and the relevance of the picture to the query. We also outline an efficient way to incorporate image excerpts in web search engines. Web search engines can adopt our approach by slightly modifying their index and inserting a few low-cost operations into their workflow. Our experiments on a large web dataset indicate that the performance of the proposed approach is very promising.
Keywords: dominant image, image excerpts, usability, user interface, web search
Keysurf: a character controlled browser for people with physical disabilities, pp. 31-40
  Leo Spalteholz; Kin Fun Li; Nigel Livingston; Foad Hamidi
For many users with a physical or motor disability, using a computer mouse or other pointing device to navigate the web is cumbersome or impossible due to problems with pointing accuracy. At the same time, web accessibility using a keyboard in major browsers is rudimentary, requiring many key presses to select links or other elements. We introduce KeySurf, a character-controlled web navigation system that addresses this situation by presenting an interface that allows a user to activate any web page element with only two or three keystrokes. Through an implementation of a user-centric incremental search algorithm, elements are matched according to user expectation as characters are entered into the interface. We show how our interface can be integrated with speech recognition input, as well as with specialized on-screen keyboards for people with disabilities. Using the user's browsing history, we improve the efficiency of the selection process and find potentially interesting page links for the user within the current web page. We present the results from a pilot study evaluating the performance of various components of our system.
Keywords: keyboard access, web accessibility, web navigation

Data mining: log analysis

Query-sets: using implicit feedback and query patterns to organize web documents, pp. 41-50
  Barbara Poblete; Ricardo Baeza-Yates
In this paper we present a new document representation model based on implicit user feedback obtained from search engine queries. The main objective of this model is to achieve better results in non-supervised tasks, such as clustering and labeling, through the incorporation of usage data obtained from search engine queries. This type of model allows us to discover the motivations of users when visiting a certain document. The terms used in queries can provide a better choice of features, from the user's point of view, for summarizing the Web pages that were clicked from these queries. In this work we extend the existing but not very well known idea of a "query view" for document representation, formalizing it as a "query model". Furthermore, we create a novel model based on "frequent query patterns", called the "query-set model". Our evaluation shows that both "query-based" models outperform the vector-space model when used for clustering and labeling documents in a website. In our experiments, the query-set model reduces by more than 90% the number of features needed to represent a set of documents and improves by over 90% the quality of the results. We believe that this is because our model chooses better features and provides more accurate labels according to the user's expectations.
Keywords: feature selection, labeling, search engine queries, usage mining, web page organization
Mining the search trails of surfing crowds: identifying relevant websites from user activity, pp. 51-60
  Mikhail Bilenko; Ryen W. White
The paper proposes identifying relevant information sources from the history of combined searching and browsing behavior of many Web users. While it has been previously shown that user interactions with search engines can be employed to improve document ranking, browsing behavior that occurs beyond search result pages has been largely overlooked in prior work. The paper demonstrates that users' post-search browsing activity strongly reflects implicit endorsement of visited pages, which allows estimating topical relevance of Web resources by mining large-scale datasets of search trails. We present heuristic and probabilistic algorithms that rely on such datasets for suggesting authoritative websites for search queries. Experimental evaluation shows that exploiting complete post-search browsing trails outperforms alternatives in isolation (e.g., clickthrough logs), and yields accuracy improvements when employed as a feature in learning to rank for Web search.
Keywords: implicit feedback, learning from user behavior, mining search and browsing logs, web search
Using the wisdom of the crowds for keyword generation, pp. 61-70
  Ariel Fuxman; Panayiotis Tsaparas; Kannan Achan; Rakesh Agrawal
In the sponsored search model, search engines are paid by businesses that are interested in displaying ads for their site alongside the search results. Businesses bid for keywords, and their ad is displayed when the keyword is queried to the search engine. An important problem in this process is 'keyword generation': given a business that is interested in launching a campaign, suggest keywords that are related to that campaign. We address this problem by making use of the query logs of the search engine. We identify queries related to a campaign by exploiting the associations between queries and URLs as they are captured by users' clicks. These queries form good keyword suggestions since they capture the "wisdom of the crowd" as to what is related to a site. We formulate the problem as a semi-supervised learning problem, and propose algorithms within the Markov Random Field model. We perform experiments with real query logs, and we demonstrate that our algorithms scale to large query logs and produce meaningful results.
Keywords: absorbing random walks, keyword generation, markov random fields, query click logs, sponsored search
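To make the click-graph idea concrete, here is a hedged sketch of label propagation on a bipartite query-URL click graph, much simpler than the paper's Markov Random Field formulation but in the same spirit: the campaign's seed URLs act as absorbing nodes with relevance 1, and each query's score converges toward its probability of being absorbed there. The example edges and weights are invented:

    from collections import defaultdict

    def propagate(edges, seeds, iters=50):
        """edges: (query, url, clicks) triples; seeds: absorbing URLs."""
        nbrs = defaultdict(list)
        for q, u, w in edges:
            nbrs[q].append((u, w))
            nbrs[u].append((q, w))
        score = {n: (1.0 if n in seeds else 0.0) for n in nbrs}
        for _ in range(iters):
            for n in nbrs:
                if n in seeds:
                    continue  # absorbing nodes keep their label
                total = sum(w for _, w in nbrs[n])
                score[n] = sum(w * score[m] for m, w in nbrs[n]) / total
        return score

    edges = [("cheap flights", "airline.example", 9),
             ("cheap flights", "news.example", 1),
             ("flight deals", "airline.example", 5)]
    print(propagate(edges, seeds={"airline.example"}))

High-scoring queries outside the advertiser's current keyword list become the keyword suggestions.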

Data mining: learning

Floatcascade learning for fast imbalanced web mining, pp. 71-80
  Xiaoxun Zhang; Xueying Wang; Honglei Guo; Zhili Guo; Xian Wu; Zhong Su
This paper is concerned with the problem of Imbalanced Classification (IC) in web mining, which often arises on the web due to the "Matthew Effect". As web IC applications usually need to provide online services for users and deal with large volumes of data, classification speed emerges as an important issue to be addressed. In face detection, Asymmetric Cascade is used to speed up imbalanced classification by building a cascade structure of simple classifiers, but it often causes a loss of classification accuracy due to the iterative feature addition in its learning procedure. In this paper, we adopt the idea of cascade classifiers in imbalanced web mining for fast classification and propose a novel asymmetric cascade learning method called FloatCascade to improve the accuracy. To this end, FloatCascade selects fewer yet more effective features at each stage of the cascade classifier. In addition, a decision-tree scheme is adopted to enhance feature diversity and discrimination capability for FloatCascade learning. We evaluate FloatCascade through two typical IC applications in web mining: web page categorization and citation matching. Experimental results demonstrate the effectiveness and efficiency of FloatCascade compared to state-of-the-art IC methods such as Asymmetric Cascade, Asymmetric AdaBoost and Weighted SVM.
Keywords: cascade learning, citation matching, fast imbalanced classification, float searching, web page categorization
Recommending questions using the mdl-based tree cut model, pp. 81-90
  Yunbo Cao; Huizhong Duan; Chin-Yew Lin; Yong Yu; Hsiao-Wuen Hon
The paper is concerned with the problem of question recommendation. Specifically, given a question as a query, the task is to retrieve and rank other questions according to their likelihood of being good recommendations for the queried question. A good recommendation provides alternative aspects around users' interest. We tackle the problem of question recommendation in two steps: first represent questions as graphs of topic terms, and then rank recommendations on the basis of the graphs. We formalize both steps as tree-cutting problems and then employ the MDL (Minimum Description Length) principle to select the best cuts. Experiments have been conducted with real questions posted at Yahoo! Answers. The questions are from two domains, 'travel' and 'computers & internet'. Experimental results indicate that the MDL-based tree cut model significantly outperforms the baseline methods of word-based VSM and phrase-based VSM. The results also show that the use of the MDL-based tree cut model is essential to our approach.
Keywords: minimum description length, query suggestion, question recommendation, tree cut model
Learning to classify short and sparse text & web with hidden topics from large-scale data collections, pp. 91-100
  Xuan-Hieu Phan; Le-Minh Nguyen; Susumu Horiguchi
This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from large-scale data collections. The main motivation of this work is that many classification tasks working with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy due to data sparseness. We therefore propose to draw on external knowledge, both to make the data less sparse and to expand the coverage of classifiers so they handle future data better. The underlying idea of the framework is that for each classification task, we collect a large-scale external data collection called a "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is general enough to be applied to different data domains and genres ranging from Web search results to medical text. We carefully evaluated the framework on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks: "Web search domain disambiguation" and "disease categorization for medical text", and achieved significant quality enhancement.
Keywords: sparse text, topic analysis, web data analysis/classification
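A minimal sketch of the framework's central move, using scikit-learn's LDA as a stand-in topic model (the paper's inference machinery may differ): train topics on a large "universal dataset", then append each short text's inferred topic distribution to its sparse word features before training a classifier. The corpus contents and topic count are placeholders:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.sparse import hstack, csr_matrix

    universal = ["long wikipedia article about jazz music",
                 "long medline abstract about diabetes care"]
    short_texts = ["jazz gig", "blood sugar"]

    vec = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(vec.fit_transform(universal))      # topics from the big corpus

    words = vec.transform(short_texts)         # sparse, very few nonzeros
    topics = lda.transform(words)              # dense topic proportions
    features = hstack([words, csr_matrix(topics)])  # enriched features
    print(features.shape)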

Data mining: modeling

Topic modeling with network regularization, pp. 101-110
  Qiaozhu Mei; Deng Cai; Duo Zhang; ChengXiang Zhai
In this paper, we formally define the problem of topic modeling with network structure (TMN). We propose a novel solution to this problem, which regularizes a statistical topic model with a harmonic regularizer based on a graph structure in the data. The proposed method bridges topic modeling and social network analysis, leveraging the power of both statistical topic models and discrete regularization. The output of this model summarizes topics in text, maps topics onto the network, and discovers topical communities. With a concrete selection of a topic model and a graph-based regularizer, our model can be applied to text mining problems such as author-topic analysis, community discovery, and spatial text mining. Empirical experiments on two different genres of data show that our approach is effective, improving upon both text-oriented and network-oriented methods. The proposed model is general; it can be applied to any text collection with a mixture of topics and an associated network structure.
Keywords: graph-based regularization, social networks, statistical topic models
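In our notation (not necessarily the paper's exact formulation), the objective such a regularized model minimizes combines a topic-model log-likelihood L(C) over the collection C with a harmonic smoothness penalty over the network G = (V, E):

    O(C, G) = -(1 - \lambda)\, L(C)
              + \frac{\lambda}{2} \sum_{(u,v) \in E} w(u,v)
                \sum_{j=1}^{k} \bigl( p(\theta_j \mid u) - p(\theta_j \mid v) \bigr)^2

The first term fits the k topics \theta_j to the text; the second pushes nodes u and v joined by a heavy edge w(u,v) toward similar topic distributions, with \lambda trading off the two.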
Modeling online reviews with multi-grain topic models, pp. 111-120
  Ivan Titov; Ryan McDonald
In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews [18, 19, 7, 12, 27, 36, 21]. Our models are based on extensions to standard topic modeling methods such as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models are more appropriate for our task since standard models tend to produce topics that correspond to global properties of objects (e.g., the brand of a product type) rather than the aspects of an object that tend to be rated by a user. The models we present not only extract ratable aspects, but also cluster them into coherent topics, e.g., 'waitress' and 'bartender' are part of the same topic 'staff' for restaurants. This differentiates our models from much of the previous work, which extracts aspects through term frequency analysis with minimal clustering. We evaluate the multi-grain models both qualitatively and quantitatively to show that they improve significantly upon standard topic models.
Keywords: clustering, opinion mining, topic models
Opinion integration through semi-supervised topic modeling, pp. 121-130
  Yue Lu; Chengxiang Zhai
Web 2.0 technology has enabled more and more people to freely express their opinions on the Web, making the Web an extremely valuable source for mining user opinions about all kinds of topics. In this paper, we study how to automatically integrate opinions expressed in a well-written expert review with the many opinions scattered across various sources such as blogs and forums. We formally define this new integration problem and propose to use semi-supervised topic models to solve it in a principled way. Experiments on integrating opinions about two quite different topics (a product and a political figure) show that the proposed method is effective for both topics and can generate useful aligned integrated opinion summaries. The proposed method is quite general. It can be used to integrate a well-written review with opinions in an arbitrary text collection about any topic, to potentially support many interesting applications in multiple domains.
Keywords: expert review, opinion integration, probabilistic topic modeling, semi-supervised

Data mining: algorithms

Efficient similarity joins for near duplicate detection, pp. 131-140
  Chuan Xiao; Wei Wang; Xuemin Lin; Jeffrey Xu Yu
With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. Experimental results show that our proposed algorithms can achieve up to 2.6x -- 5x speed-up over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
Keywords: near duplicate detection, similarity join
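The prefix-filtering principle these algorithms build on is compact enough to sketch; this hedged baseline (without the paper's additional ordering-based filters) assumes Jaccard similarity with threshold t: once each record's tokens are sorted by global rarity, two records can reach the threshold only if their short prefixes share at least one token.

    from collections import defaultdict
    import math

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def similarity_join(records, t=0.6):
        freq = defaultdict(int)
        for r in records:
            for tok in set(r):
                freq[tok] += 1
        # Rarest tokens first, so prefixes are maximally selective.
        canon = [sorted(set(r), key=lambda x: (freq[x], x)) for r in records]
        index = defaultdict(list)      # prefix token -> record ids
        pairs = set()
        for i, r in enumerate(canon):
            prefix_len = len(r) - math.ceil(t * len(r)) + 1
            for tok in r[:prefix_len]:
                for j in index[tok]:
                    pairs.add((j, i))  # candidate sharing a prefix token
                index[tok].append(i)
        return [(i, j, jaccard(canon[i], canon[j]))
                for i, j in sorted(pairs)
                if jaccard(canon[i], canon[j]) >= t]

    recs = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y"]]
    print(similarity_join(recs))       # [(0, 1, 0.6)]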
Learning multiple graphs for document recommendations, pp. 141-150
  Ding Zhou; Shenghuo Zhu; Kai Yu; Xiaodan Song; Belle L. Tseng; Hongyuan Zha; C. Lee Giles
The Web offers rich relational data with different semantics. In this paper, we address the problem of document recommendation in a digital library, where the documents in question are networked by citations and are associated with other entities by various relations. Due to the sparsity of a single graph and noise in graph construction, we propose a new method for combining multiple graphs to measure document similarities, where different factorization strategies are used based on the nature of different graphs. In particular, the new method seeks a single low-dimensional embedding of documents that captures their relative similarities in a latent space. Based on the obtained embedding, a new recommendation framework is developed using semi-supervised learning on graphs. In addition, we address the scalability issue and propose an incremental algorithm. The new incremental method significantly improves the efficiency by calculating the embedding for new incoming documents only. The new batch and incremental methods are evaluated on two real world datasets prepared from CiteSeer. Experiments demonstrate significant quality improvement for our batch method and significant efficiency improvement with tolerable quality loss for our incremental method.
Keywords: collaborative filtering, recommender systems, semi-supervised learning, social network analysis, spectral clustering
Enhanced hierarchical classification via isotonic smoothing, pp. 151-160
  Kunal Punera; Joydeep Ghosh
Hierarchical topic taxonomies have proliferated on the World Wide Web [5, 18], and exploiting the output space decompositions they induce in automated classification systems is an active area of research. In many domains, classifiers learned on a hierarchy of classes have been shown to outperform those learned on a flat set of classes. In this paper we argue that the hierarchical arrangement of classes leads to intuitive relationships between the corresponding classifiers' output scores, and that enforcing these relationships as a post-processing step after classification can improve its accuracy. We formulate the task of smoothing classifier outputs as a regularized isotonic tree regression problem, and present a dynamic programming based method that solves it optimally. This new problem generalizes the classic isotonic tree regression problem, and both the new formulation and the algorithm may be of independent interest. In our empirical analysis of two real-world text classification scenarios, we show that our approach to smoothing classifier outputs results in improved classification accuracy.
Keywords: dynamic programming, hierarchical classification, regularized isotonic regression, taxonomy

Internet monetization: online advertising

Externalities in online advertising, pp. 161-168
  Arpita Ghosh; Mohammad Mahdian
Most models for online advertising assume that an advertiser's value from winning an ad auction, which depends on the clickthrough rate or conversion rate of the advertisement, is independent of other advertisements served alongside it in the same session. This ignores an important 'externality effect': as the advertising audience has a limited attention span, a high-quality ad on a page can detract attention from other ads on the same page. That is, the utility to a winner in such an auction also depends on the set of other winners.
   In this paper, we introduce the problem of modeling externalities in online advertising, and study the winner determination problem in these models. Our models are based on choice models on the audience side. We show that in the most general case, the winner determination problem is hard even to approximate. However, we give an approximation algorithm for this problem with an approximation factor that is logarithmic in the ratio of the maximum to the minimum bid. Furthermore, we show that there are some interesting special cases, such as the case where the audience preferences are single peaked, where the problem can be solved exactly in polynomial time. For all these algorithms, we prove that the winner determination algorithm can be combined with VCG-style payments to yield truthful mechanisms.
Keywords: advertising, approximation algorithms, auctions, externalities
A combinatorial allocation mechanism with penalties for banner advertising, pp. 169-178
  Uriel Feige; Nicole Immorlica; Vahab Mirrokni; Hamid Nazerzadeh
Most current banner advertising is sold through negotiation thereby incurring large transaction costs and possibly suboptimal allocations. We propose a new automated system for selling banner advertising. In this system, each advertiser specifies a collection of host webpages which are relevant to his product, a desired total quantity of impressions on these pages, and a maximum per-impression price. The system selects a subset of advertisers as 'winners' and maps each winner to a set of impressions on pages within his desired collection. The distinguishing feature of our system as opposed to current combinatorial allocation mechanisms is that, mimicking the current negotiation system, we guarantee that winners receive at least as many advertising opportunities as they requested or else receive ample compensation in the form of a monetary payment by the host. Such guarantees are essential in markets like banner advertising where a major goal of the advertising campaign is developing brand recognition.
   As we show, the problem of selecting a feasible subset of advertisers with maximum total value is inapproximable. We thus present two greedy heuristics and discuss theoretical techniques to measure their performance. Our first algorithm iteratively selects advertisers and corresponding sets of impressions which contribute maximum marginal per-impression profit to the current solution. We prove a bi-criteria approximation for this algorithm, showing that it generates approximately as much value as the optimum algorithm on a slightly harder problem. However, this algorithm might perform poorly on instances in which the value of the optimum solution is quite large, a clearly undesirable failure mode. Hence, we present an adaptive greedy algorithm which again iteratively selects advertisers with maximum marginal per-impression profit, but additionally reassigns impressions at each iteration. For this algorithm, we prove a structural approximation result, a newly defined framework for evaluating heuristics [10]. We thereby prove that this algorithm has a better performance guarantee than the simple greedy algorithm.
Keywords: combinatorial auctions, internet advertising, structural approximation, supply guarantee
Dynamic cost-per-action mechanisms and applications to online advertising, pp. 179-188
  Hamid Nazerzadeh; Amin Saberi; Rakesh Vohra
We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme in online advertising. In this scheme, instead of paying per click, the advertisers pay only when a user takes a specific action (e.g. fills out a form) or completes a transaction on their websites.
   We focus on designing efficient and incentive compatible mechanisms that use this charging scheme. We describe a mechanism based on a sampling-based learning algorithm that under suitable assumptions is asymptotically individually rational, asymptotically Bayesian incentive compatible and asymptotically ex-ante efficient.
   In particular, we demonstrate our mechanism for the case where the utility functions of the advertisers are independent and identically-distributed random variables as well as the case where they evolve like independent reflected Brownian motions.
Keywords: cost-per-action, internet advertising, mechanism design

Internet monetization: recommendation and security

Optimal marketing strategies over social networks, pp. 189-198
  Jason Hartline; Vahab Mirrokni; Mukund Sundararajan
We discuss the use of social networks in implementing viral marketing strategies. While influence maximization has been studied in this context (see Chapter 24 of [10]), we study revenue maximization, arguably a more natural objective. In our model, a buyer's decision to buy an item is influenced by the set of other buyers that own the item and the price at which the item is offered.
   We focus on the algorithmic question of finding revenue maximizing marketing strategies. When the buyers are completely symmetric, we can find the optimal marketing strategy in polynomial time. In the general case, motivated by hardness results, we investigate approximation algorithms for this problem. We identify a family of strategies called influence-and-exploit strategies that are based on the following idea: Initially influence the population by giving the item for free to a carefully chosen set of buyers. Then extract revenue from the remaining buyers using a 'greedy' pricing strategy. We first argue why such strategies are reasonable and then show how to use recently developed set-function maximization techniques to find the right set of buyers to influence.
Keywords: marketing, monetizing social networks, pricing, submodular maximization
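A toy sketch of an influence-and-exploit strategy, under an invented value model in which a buyer's willingness to pay grows with the number of friends who already own the item; the half-random seed set and the fixed exploit price stand in for the paper's carefully chosen influence set and greedy pricing:

    import random

    def influence_and_exploit(values, friends, seed_fraction=0.5, price=2.0):
        """values: buyer -> base value; friends: buyer -> set of friends."""
        random.seed(0)
        buyers = list(values)
        random.shuffle(buyers)
        k = int(len(buyers) * seed_fraction)
        owners = set(buyers[:k])        # influence step: free items
        revenue = 0.0
        for b in buyers[k:]:            # exploit step: posted price
            v = values[b] * (1 + len(friends[b] & owners))
            if v >= price:
                owners.add(b)
                revenue += price
        return revenue

    values = {"a": 1.0, "b": 0.5, "c": 1.5, "d": 0.8}
    friends = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
    print(influence_and_exploit(values, friends))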
Trust-based recommendation systems: an axiomatic approach, pp. 199-208
  Reid Andersen; Christian Borgs; Jennifer Chayes; Uriel Feige; Abraham Flaxman; Adam Kalai; Vahab Mirrokni; Moshe Tennenholtz
High-quality, personalized recommendations are a key feature in many online systems. Since these systems often have explicit knowledge of social network structures, the recommendations may incorporate this information. This paper focuses on networks that represent trust and recommendation systems that incorporate these trust relationships. The goal of a trust-based recommendation system is to generate personalized recommendations by aggregating the opinions of other users in the trust network.
   In analogy to prior work on voting and ranking systems, we use the axiomatic approach from the theory of social choice. We develop a set of five natural axioms that a trust-based recommendation system might be expected to satisfy. Then, we show that no system can simultaneously satisfy all the axioms. However, for any subset of four of the five axioms we exhibit a recommendation system that satisfies those axioms. Next we consider various ways of weakening the axioms, one of which leads to a unique recommendation system based on random walks. We consider other recommendation systems, including systems based on personalized PageRank, majority of majorities, and minimum cuts, and search for alternative axiomatizations that uniquely characterize these systems.
   Finally, we determine which of these systems are incentive compatible, meaning that groups of agents interested in manipulating recommendations cannot induce others to share their opinion by lying about their votes or modifying their trust links. This is an important property for systems deployed in a monetized environment.
Keywords: axiomatic approach, recommendation systems, reputation systems, trust networks
Secure or insure?: a game-theoretic analysis of information security games, pp. 209-218
  Jens Grossklags; Nicolas Christin; John Chuang
Despite general awareness of the importance of keeping one's system secure, and widespread availability of consumer security technologies, actual investment in security remains highly variable across the Internet population, allowing attacks such as distributed denial-of-service (DDoS) and spam distribution to continue unabated. By modeling security investment decision-making in established (e.g., weakest-link, best-shot) and novel games (e.g., weakest-target), and allowing expenditures in self-protection versus self-insurance technologies, we can examine how incentives may shift between investment in a public good (protection) and a private good (insurance), subject to factors such as network size, type of attack, loss probability, loss magnitude, and cost of technology. We can also characterize Nash equilibria and social optima for different classes of attacks and defenses. In the weakest-target game, an interesting result is that, for almost all parameter settings, more effort is exerted at Nash equilibrium than at the social optimum. We may attribute this to the "strategic uncertainty" of players seeking to self-protect at just slightly above the lowest protection level.
Keywords: economics of the internet, game theory, incentive-centered design and engineering, protection, public goods, security, self-insurance

Internet monetization: sponsored search

Analyzing search engine advertising: firm behavior and cross-selling in electronic markets, pp. 219-226
  Anindya Ghose; Sha Yang
The phenomenon of sponsored search advertising is gaining ground as the largest source of revenues for search engines. Firms across different industries are beginning to adopt this as the primary form of online advertising. This process works on an auction mechanism in which advertisers bid for different keywords, and the final rank for a given keyword is allocated by the search engine. But how different are firms' actual bids from their optimal bids? Moreover, in what other ways can firms potentially benefit from sponsored search advertising? Based on the model and estimates from prior work [10], we conduct a number of policy simulations in order to investigate to what extent an advertiser can benefit from bidding optimally for its keywords. Further, we build a Hierarchical Bayesian modeling framework to explore the potential for cross-selling or spillover effects from a given keyword advertisement across multiple product categories, and estimate the model using Markov Chain Monte Carlo (MCMC) methods. Our analysis suggests that advertisers are not bidding optimally with respect to maximizing profits. We conduct a detailed analysis with product-level variables to explore the extent of cross-selling opportunities across different categories from a given keyword advertisement. We find that there exists significant potential for cross-selling through search keyword advertisements, in that consumers often end up buying products from other categories in addition to the product they were searching for. Latency (the time it takes for a consumer to place a purchase order after clicking on the advertisement) and the presence of a brand name in the keyword are associated with consumer spending on product categories different from the one the consumer was originally searching for on the Internet.
Keywords: electronic commerce, hierarchical bayesian modeling, online advertising, paid search advertising, search engines, web 2.0
Online learning from click data for sponsored search, pp. 227-236
  Massimiliano Ciaramita; Vanessa Murdock; Vassilis Plachouras
Sponsored search is one of the enabling technologies for today's Web search engines. It corresponds to matching and showing ads related to the user query on the search engine results page. Users are likely to click on topically related ads and the advertisers pay only when a user clicks on their ad. Hence, it is important to be able to predict if an ad is likely to be clicked, and maximize the number of clicks. We investigate the sponsored search problem from a machine learning perspective with respect to three main sub-problems: how to use click data for training and evaluation, which learning framework is more suitable for the task, and which features are useful for existing models. We perform a large scale evaluation based on data from a commercial Web search engine. Results show that it is possible to learn and evaluate directly and exclusively on click data encoding pairwise preferences following simple and conservative assumptions. We find that online multilayer perceptron learning, based on a small set of features representing content similarity of different kinds, significantly outperforms an information retrieval baseline and other learning models, providing a suitable framework for the sponsored search task.
Keywords: online learning, perceptrons, ranking, sponsored search
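A minimal sketch of the pairwise-preference idea: treat each (clicked ad, skipped ad) pair from the logs as a training constraint and update a linear model perceptron-style when the constraint is violated. Feature extraction is left abstract; per the paper, a real system would use content-similarity features between query and ad:

    import numpy as np

    def train_pairwise_perceptron(pairs, dim, epochs=5, lr=0.1):
        """pairs: (clicked_features, skipped_features) arrays."""
        w = np.zeros(dim)
        for _ in range(epochs):
            for pos, neg in pairs:
                if w @ (pos - neg) <= 0:   # preference violated
                    w += lr * (pos - neg)
        return w

    pairs = [(np.array([1.0, 0.2]), np.array([0.1, 0.9])),
             (np.array([0.8, 0.1]), np.array([0.3, 0.7]))]
    w = train_pairwise_perceptron(pairs, dim=2)
    print(w, "clicked-ad score:", w @ np.array([1.0, 0.2]))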

Mobility

Supporting anonymous location queries in mobile environments with privacygrid, pp. 237-246
  Bhuvan Bamba; Ling Liu; Peter Pesti; Ting Wang
This paper presents PrivacyGrid -- a framework for supporting anonymous location-based queries in mobile information delivery systems. The PrivacyGrid framework offers three unique capabilities. First, it provides a location privacy protection preference profile model, called location P3P, which allows mobile users to explicitly define their preferred location privacy requirements in terms of both location hiding measures (e.g., location k-anonymity and location l-diversity) and location service quality measures (e.g., maximum spatial resolution and maximum temporal resolution). Second, it provides fast and effective location cloaking algorithms for location k-anonymity and location l-diversity in a mobile environment. We develop dynamic bottom-up and top-down grid cloaking algorithms with the goal of achieving a high anonymization success rate and efficiency in terms of both time complexity and maintenance cost. A hybrid approach that carefully combines the strengths of both bottom-up and top-down cloaking approaches to further reduce the average anonymization time is also developed. Last but not least, PrivacyGrid incorporates temporal cloaking into the location cloaking process to further increase the success rate of location anonymization. We also discuss PrivacyGrid mechanisms for supporting anonymous location queries. Experimental evaluation shows that the PrivacyGrid approach can provide close to optimal location k-anonymity as defined by per-user location P3P without introducing significant performance penalties.
Keywords: k-anonymity, l-diversity, location privacy
Learning transportation mode from raw gps data for geographic applications on the web, pp. 247-256
  Yu Zheng; Like Liu; Longhao Wang; Xing Xie
Geographic information has spawned many novel Web applications, in which the global positioning system (GPS) plays an important role in bridging the applications and end users. Learning knowledge from users' raw GPS data can provide rich context information for both geographic and mobile applications. However, so far, raw GPS data are still used directly without much understanding. In this paper, an approach based on supervised learning is proposed to automatically infer transportation mode from raw GPS data. The transportation mode, such as walking or driving, implied in a user's GPS data can provide valuable knowledge for understanding the user. It also enables context-aware computing based on the user's present transportation mode and the design of innovative user interfaces for Web users. Our approach consists of three parts: a change point-based segmentation method, an inference model, and a post-processing algorithm based on conditional probability. The change point-based segmentation method was compared with two baselines: uniform duration based and uniform length based methods. Meanwhile, four different inference models, including Decision Tree, Bayesian Net, Support Vector Machine (SVM) and Conditional Random Field (CRF), are studied in the experiments. We evaluated the approach using GPS data collected by 45 users over a six-month period. The change point based method achieved a higher degree of accuracy than the other two segmentation methods in predicting transportation modes and detecting transitions between them, and Decision Tree outperformed the other inference models over the change point based segmentation.
Keywords: GPS, classification, geographic application, machine learning, transportation mode
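A hedged sketch of what change point-based segmentation can look like: split the GPS track where the point-wise speed crosses a walking-speed threshold, so that each segment is a candidate single-mode leg for the inference model. The threshold and the toy track are invented:

    def segment_by_speed(points, walk_max_mps=1.8):
        """points: (timestamp_seconds, metres_from_previous_point) pairs."""
        segments, current, walking, prev_t = [], [], None, None
        for t, dist in points:
            speed = dist / (t - prev_t) if prev_t is not None else 0.0
            is_walk = speed <= walk_max_mps
            if walking is not None and is_walk != walking:
                segments.append(current)   # candidate mode change
                current = []
            current.append((t, round(speed, 2)))
            walking, prev_t = is_walk, t
        if current:
            segments.append(current)
        return segments

    track = [(0, 0), (10, 12), (20, 14), (30, 150), (40, 160)]
    print(segment_by_speed(track))   # a walking leg, then a driving leg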
Deciphering mobile search patterns: a study of Yahoo! mobile search queries, pp. 257-266
  Jeonghee Yi; Farzin Maghoul; Jan Pedersen
In this paper we study the characteristics of search queries submitted from mobile devices using various Yahoo! oneSearch applications during a two-month period in the second half of 2007, and report the query patterns derived from 20 million English sample queries submitted by users in the US, Canada, Europe, and Asia. We examine the query distribution and the topical categories the queries belong to in order to find new trends. We compare and contrast the search patterns of US and international queries, and of queries from various search interfaces (XHTML/WAP, Java widgets, and SMS). We also compare our results with previous studies wherever possible, either to confirm previous findings or to find interesting differences in the query distribution and patterns.
Keywords: cell phones, mobile applications, mobile devices, mobile query analysis, mobile search, oneSearch, personal devices, query analysis, query log analysis, wireless devices

Performance and scalability

Service-oriented data denormalization for scalable web applications, pp. 267-276
  Zhou Wei; Jiang Dejun; Guillaume Pierre; Chi-Hung Chi; Maarten van Steen
Many techniques have been proposed to scale web applications. However, the data interdependencies between the database queries and transactions issued by the applications limit their efficiency. We claim that major scalability improvements can be gained by restructuring the web application data into multiple independent data services with exclusive access to their private data store. While this restructuring does not provide performance gains by itself, the implied simplification of each database workload allows a much more efficient use of classical techniques. We illustrate the data denormalization process on three benchmark applications: TPC-W, RUBiS and RUBBoS. We deploy the resulting service-oriented implementation of TPC-W across an 85-node cluster and show that restructuring its data can provide at least an order of magnitude improvement in the maximum sustainable throughput compared to master-slave database replication, while preserving strong consistency and transactional properties.
Keywords: data denormalization, scalability, web applications
Anycast CDNs revisited, pp. 277-286
  Hussein A. Alzoubi; Seungjoon Lee; Michael Rabinovich; Oliver Spatscheck; Jacobus Van der Merwe
Because it is an integral part of the Internet routing apparatus, and because it allows multiple instances of the same service to be "naturally" discovered, IP Anycast has many attractive features for any service that involves the replication of multiple instances across the Internet. While briefly considered as an enabler when content distribution networks (CDNs) first emerged, the use of IP Anycast was deemed infeasible in that environment. The main reasons for this decision were the lack of load awareness of IP Anycast and unwanted side effects of Internet routing changes on the IP Anycast mechanism. Prompted by recent developments in route control technology, as well as a better understanding of the behavior of IP Anycast in operational settings, we revisit this decision and propose a load-aware IP Anycast CDN architecture that addresses these concerns while benefiting from inherent IP Anycast features. Our architecture makes use of route control mechanisms to take server and network load into account to realize load-aware Anycast. We show that the resulting redirection requirements can be formulated as a Generalized Assignment Problem and present practical algorithms that address these requirements while at the same time limiting the session disruptions that plague regular IP Anycast. We evaluate our algorithms through trace-based simulation using traces obtained from an operational CDN.
Keywords: CDN, anycast, autonomous system, load balancing, routing
A comparative analysis of web and peer-to-peer traffic, pp. 287-296
  Naimul Basher; Aniket Mahanti; Anirban Mahanti; Carey Williamson; Martin Arlitt
Peer-to-Peer (P2P) applications continue to grow in popularity, and have reportedly overtaken Web applications as the single largest contributor to Internet traffic. Using traces collected from a large edge network, we conduct an extensive analysis of P2P traffic, compare P2P traffic with Web traffic, and discuss the implications of increased P2P traffic. In addition to studying the aggregate P2P traffic, we also analyze and compare the two main constituents of P2P traffic in our data, namely BitTorrent and Gnutella. The results presented in the paper may be used for generating synthetic workloads, gaining insights into the functioning of P2P applications, and developing network management strategies. For example, our results suggest that new models are necessary for Internet traffic. As a first step, we present flow-level distributional models for Web and P2P traffic that may be used in network simulation and emulation experiments.
Keywords: peer-to-peer, traffic characterization, traffic models, web

Rich media

Generating diverse and representative image search results for landmarks, pp. 297-306
  Lyndon S. Kennedy; Mor Naaman
Can we leverage the community-contributed collections of rich media on the web to automatically generate representative and diverse views of the world's landmarks? We use a combination of context- and content-based tools to generate representative sets of images for location-driven features and landmarks, a common search task. To do that, we use location and other metadata, as well as tags associated with images, and the images' visual features. We present an approach to extracting tags that represent landmarks. We show how to use unsupervised methods to extract representative views and images for each landmark. This approach can potentially scale to provide better search and representation for landmarks worldwide. We evaluate the system in the context of image search using a real-life dataset of 110,000 images from the San Francisco area.
Keywords: geo-referenced photographs, photo collections, social media
Pagerank for product image search, pp. 307-316
  Yushi Jing; Shumeet Baluja
In this paper, we cast the image-ranking problem into the task of identifying "authority" nodes on an inferred visual similarity graph and propose an algorithm to analyze the visual link structure that can be created among a group of images. Through an iterative procedure based on the PageRank computation, a numerical weight is assigned to each image; this measures its relative importance to the other images being considered. The incorporation of visual signals in this process differs from the majority of large-scale commercial-search engines in use today. Commercial search-engines often solely rely on the text clues of the pages in which images are embedded to rank images, and often entirely ignore the content of the images themselves as a ranking signal. To quantify the performance of our approach in a real-world system, we conducted a series of experiments based on the task of retrieving images for 2000 of the most popular products queries. Our experimental results show significant improvement, in terms of user satisfaction and relevancy, in comparison to the most recent Google Image Search results.
Keywords: graph theory, pagerank, visual similarity
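The core computation is ordinary PageRank power iteration, run over an image-image similarity graph rather than the hyperlink graph; in this hedged sketch the similarity values are made up, whereas the paper derives them from visual features:

    import numpy as np

    def image_rank(S, d=0.85, iters=100):
        """S: symmetric nonnegative similarity matrix, zero diagonal."""
        n = len(S)
        col = S / S.sum(axis=0, keepdims=True)  # column-stochastic walk
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - d) / n + d * col @ r
        return r

    S = np.array([[0.0, 0.9, 0.8, 0.1],
                  [0.9, 0.0, 0.7, 0.1],
                  [0.8, 0.7, 0.0, 0.2],
                  [0.1, 0.1, 0.2, 0.0]])
    print(image_rank(S).round(3))   # the tightly similar trio ranks highest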
Graph theoretical framework for simultaneously integrating visual and textual features for efficient web image clustering, pp. 317-326
  Manjeet Rege; Ming Dong; Jing Hua
With the explosive growth of the Web and the recent development of digital media technology, the number of images on the Web has grown tremendously. Consequently, Web image clustering has emerged as an important application. Some of the initial efforts along this direction revolved around clustering Web images based on the visual features of images, or on textual features extracted from the text surrounding the images. However, not much work has been done on using multimodal information for clustering Web images. In this paper, we propose a graph theoretical framework for simultaneously integrating visual and textual features for efficient Web image clustering. Specifically, we model visual features, images and words from surrounding text using a tripartite graph. Partitioning this graph leads to clustering of the Web images. Although the graph partitioning approach has been adopted before, the main contribution of this work lies in a new algorithm that we propose -- Consistent Isoperimetric High-order Co-clustering (CIHC) -- for partitioning the tripartite graph. Computationally, CIHC is very fast, as it requires only a simple solution to a sparse system of linear equations. Our theoretical analysis and extensive experiments performed on real Web images demonstrate the performance of CIHC in terms of quality, efficiency and scalability in partitioning the visual feature-image-word tripartite graph.
Keywords: co-clustering, consistency, isoperimetric, web images
Flickr tag recommendation based on collective knowledge, pp. 327-336
  Börkur Sigurbjörnsson; Roelof van Zwol
Online photo services such as Flickr and Zooomr allow users to share their photos with family, friends, and the online community at large. An important facet of these services is that users manually annotate their photos using so-called tags, which describe the contents of the photo or provide additional contextual and semantic information. In this paper we investigate how we can assist users in the tagging phase. The contribution of our research is twofold. We analyse a representative snapshot of Flickr and present the results by means of a tag characterisation focusing on how users tag photos and what information is contained in the tagging. Based on this analysis, we present and evaluate tag recommendation strategies to support the user in the photo annotation task by recommending a set of tags that can be added to the photo. The results of the empirical evaluation show that we can effectively recommend relevant tags for a variety of photos with different levels of exhaustiveness of original tagging.
Keywords: aggregated tag suggestion, collective knowledge, flickr, photo annotations, tag characterisation, tag co-occurrence, tag recommendation
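A minimal sketch of co-occurrence-based tag recommendation of the kind the paper evaluates: for each tag the user has already entered, look up the tags that co-occur with it most often across the collection, then aggregate the votes. The toy co-occurrence counts stand in for Flickr-scale statistics, and raw-count voting is the simplest of the possible aggregation strategies:

    from collections import Counter

    def recommend(user_tags, cooc, k=3):
        votes = Counter()
        for t in user_tags:
            for cand, count in cooc.get(t, {}).items():
                if cand not in user_tags:
                    votes[cand] += count
        return [tag for tag, _ in votes.most_common(k)]

    cooc = {"eiffel": {"paris": 50, "france": 30, "tower": 20},
            "paris": {"france": 80, "eiffel": 50, "louvre": 25}}
    print(recommend({"eiffel", "paris"}, cooc))  # ['france', 'louvre', 'tower']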

Search: query analysis

Modeling anchor text and classifying queries to enhance web document retrieval, pp. 337-346
  Atsushi Fujii
Several types of queries are widely used on the World Wide Web and the expected retrieval method can vary depending on the query type. We propose a method for classifying queries into informational and navigational types. Because terms in navigational queries often appear in anchor text for links to other pages, we analyze the distribution of query terms in anchor texts on the Web for query classification purposes. While content-based retrieval is effective for informational queries, anchor-based retrieval is effective for navigational queries. Our retrieval system combines the results obtained with the content-based and anchor-based retrieval methods, in which the weight for each retrieval result is determined automatically depending on the result of the query classification. We also propose a method for improving anchor-based retrieval. Our retrieval method, which computes the probability that a document is retrieved in response to the given query, identifies synonyms of query terms in the anchor texts on the Web and uses these synonyms for smoothing purposes in the probability estimation. We use the NTCIR test collections and show the effectiveness of individual methods and the entire Web retrieval system experimentally.
Keywords: anchor text, query classification, web retrieval
Unsupervised query segmentation using generative language models and wikipedia, pp. 347-356
  Bin Tan; Fuchun Peng
In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query's underlying concepts that compose its original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia.
   Experiments show that our approach dramatically improves performance over the traditional approach that is based on mutual information, and produces comparable results with a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual information based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total of 46% improvement (from 0.530 to 0.774).
Keywords: concept discovery, query segmentation
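A toy dynamic program conveys the flavor of the approach; the invented corpus counts stand in for the generative language model, a flat bonus stands in for the Wikipedia evidence, and the length-weighted segment score is a crude proxy for the MDL objective:

    import math

    COUNTS = {"new": 100, "york": 80, "times": 90, "new york": 60,
              "new york times": 40, "york times": 2}
    WIKI_TITLES = {"new york", "new york times"}

    def seg_score(seg, nwords):
        c = COUNTS.get(seg, 0)
        if c == 0:
            return float("-inf")
        bonus = 2.0 if seg in WIKI_TITLES else 0.0
        return nwords * (math.log(c) + bonus)  # crude reward for concepts

    def best_segmentation(words):
        n = len(words)
        best = [(0.0, [])] + [(float("-inf"), None)] * n
        for i in range(1, n + 1):
            for j in range(i):
                seg = " ".join(words[j:i])
                cand = best[j][0] + seg_score(seg, i - j)
                if cand > best[i][0]:
                    best[i] = (cand, best[j][1] + [seg])
        return best[n][1]

    print(best_segmentation("new york times".split()))  # ['new york times']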
Spatial variation in search engine queries, pp. 357-366
  Lars Backstrom; Jon Kleinberg; Ravi Kumar; Jasmine Novak
Local aspects of Web search -- associating Web content and queries with geography -- are a topic of growing interest. However, the underlying question of how spatial variation is manifested in search queries is still not well understood. Here we develop a probabilistic framework for quantifying such spatial variation; on complete Yahoo! query logs, we find that our model is able to localize large classes of queries to within a few miles of their natural centers based only on the distribution of activity for the query. Our model provides not only an estimate of a query's geographic center, but also a measure of its spatial dispersion, indicating whether it has highly local interest or broader regional or national appeal. We also show how variations on our model can track geographically shifting topics over time, annotate a map with each location's "distinctive queries", and delineate the "spheres of influence" for competing queries in the same general domain.
Keywords: geolocation, web search
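A natural parameterization for such a model (our recollection of this line of work; the paper's exact form may differ) is that the probability that a user at distance d from a query's center issues the query decays polynomially:

    p(d) = C \, d^{-\alpha}

Here C is the query frequency at the center and the exponent \alpha measures spatial dispersion: a large \alpha marks a tightly local query, a small \alpha one with broad regional or national appeal. Both parameters, and the center itself, can be fit by maximum likelihood from the query's geolocated activity.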

Search: corpus characterization and search performance

Genealogical trees on the web: a search engine user perspective, pp. 367-376
  Ricardo Baeza-Yates; Álvaro Pereira; Nivio Ziviani
This paper presents an extensive study about the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study in detail these components, characterizing the copies and identifying the relation between a source of content and a search engine, by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are more frequently returned by queries and more clicked than other documents.
Keywords: content evolution, search engine, text, web, web mining
A graph-theoretic approach to webpage segmentation, pp. 377-386
  Deepayan Chakrabarti; Ravi Kumar; Kunal Punera
We consider the problem of segmenting a webpage into visually and semantically cohesive pieces. Our approach is based on formulating an appropriate optimization problem on weighted graphs, where the weights capture whether two nodes in the DOM tree should be placed together or apart in the segmentation; we present a learning framework to learn these weights from manually labeled data in a principled manner. Our work is a significant departure from previous heuristic and rule-based solutions to the segmentation problem. The results of our empirical analysis bring out interesting aspects of our framework, including variants of the optimization problem and the role of learning.
Keywords: correlation clustering, energy minimization, graph cuts, webpage sectioning, webpage segmentation
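A toy version of the clustering objective above: given signed affinities between DOM nodes (positive means keep together, negative means separate), greedily grow segments. This illustrative greedy pass stands in for the paper's learned weights and graph-cut optimization.

    def greedy_segmentation(nodes, affinity):
        """Assign each DOM node to the existing segment with the highest
        total signed affinity, or open a new segment when every choice is
        negative. affinity(u, v) > 0 favors putting u and v together,
        < 0 favors separating them."""
        segments = []
        for u in nodes:
            best_gain, best_seg = 0.0, None
            for seg in segments:
                gain = sum(affinity(u, v) for v in seg)
                if gain > best_gain:
                    best_gain, best_seg = gain, seg
            if best_seg is None:
                segments.append([u])      # nothing attractive: new segment
            else:
                best_seg.append(u)
        return segments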
Performance of compressed inverted list caching in search engines BIBAKFull-Text 387-396
  Jiangong Zhang; Xiaohui Long; Torsten Suel
Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination.
   We focus on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. We perform a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that have not been studied before. We then evaluate different inverted list caching policies on large query traces, and finally study the possible performance benefits of combining compression and caching. The overall goal of this paper is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size.
Keywords: index caching, index compression, inverted index, search engines
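Two of the building blocks discussed above are easy to illustrate: variable-byte compression of d-gaps (one common convention flags the last byte of each number with the high bit) and an LRU cache for compressed lists. Both are textbook versions, not the specific variants evaluated in the paper.

    from collections import OrderedDict

    def vbyte_encode(gaps):
        """Variable-byte encode positive d-gaps: 7 data bits per byte, with
        the high bit set on the final byte of each number."""
        out = bytearray()
        for g in gaps:
            chunks = []
            while True:
                chunks.append(g & 0x7F)
                g >>= 7
                if g == 0:
                    break
            for c in reversed(chunks[1:]):    # most-significant chunks first
                out.append(c)
            out.append(chunks[0] | 0x80)      # flag the terminating byte
        return bytes(out)

    def vbyte_decode(data):
        gaps, g = [], 0
        for b in data:
            if b & 0x80:
                gaps.append((g << 7) | (b & 0x7F))
                g = 0
            else:
                g = (g << 7) | b
        return gaps

    class LRUListCache:
        """Keep compressed posting lists in memory up to a byte budget,
        evicting the least recently used list first."""
        def __init__(self, capacity_bytes):
            self.cap, self.used = capacity_bytes, 0
            self.store = OrderedDict()        # term -> compressed bytes
        def get(self, term):
            if term not in self.store:
                return None                   # caller fetches from disk
            self.store.move_to_end(term)
            return self.store[term]
        def put(self, term, compressed):
            if term in self.store:
                self.used -= len(self.store[term])
            self.store[term] = compressed
            self.store.move_to_end(term)
            self.used += len(compressed)
            while self.used > self.cap and self.store:
                _, evicted = self.store.popitem(last=False)
                self.used -= len(evicted)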

Search: ranking and retrieval enhancement

Ranking refinement and its application to information retrieval BIBAKFull-Text 397-406
  Rong Jin; Hamed Valizadegan; Hang Li
We consider the problem of ranking refinement, i.e., improving the accuracy of an existing ranking function with a small set of labeled instances. We are particularly interested in learning a better ranking function using two complementary sources of information: ranking information given by the existing ranking function (i.e., a base ranker) and that obtained from users' feedback. This problem is very important in information retrieval, where feedback is collected gradually. The key challenge in combining the two sources of information arises from the fact that the ranking information presented by the base ranker tends to be imperfect, while the ranking information obtained from users' feedback tends to be noisy. We present a novel boosting framework for ranking refinement that can effectively leverage both sources of information. Our empirical study shows that the proposed algorithm is effective for ranking refinement and, furthermore, significantly outperforms the baseline algorithms that incorporate the outputs from the base ranker as an additional feature.
Keywords: background information, boosting, incremental learning, learning to rank
Learning to rank relational objects and its application to web search BIBAKFull-Text 407-416
  Tao Qin; Tie-Yan Liu; Xu-Dong Zhang; De-Sheng Wang; Wen-Ying Xiong; Hang Li
Learning to rank is a new statistical learning technology for creating a ranking model that sorts objects. The technology has been successfully applied to web search and is becoming one of the key machineries for building search engines. Existing approaches to learning to rank, however, do not consider cases in which relationships exist between the objects to be ranked, despite the fact that such situations are very common in practice. For example, in web search, given a query, certain relationships usually exist among the retrieved documents, e.g., URL hierarchy, similarity, etc., and sometimes it is necessary to utilize this information in ranking the documents. This paper addresses the issue and formulates it as a novel learning problem, referred to as 'learning to rank relational objects'. In the new learning task, the ranking model is defined as a function of not only the contents (features) of objects but also the relations between objects. The paper further focuses on one setting of the learning problem in which the way of using relation information is predetermined, and formalizes the learning task as an optimization problem in this setting. The paper then proposes a new method to perform the optimization task, particularly an implementation based on SVM. Experimental results show that the proposed method outperforms the baseline methods on two ranking tasks (Pseudo Relevance Feedback and Topic Distillation) in web search, indicating that the proposed method can indeed make effective use of relation information and content information in ranking.
Keywords: learning to rank relational objects, pseudo relevance feedback, relational ranking svm, topic distillation
Contextual advertising by combining relevance with click feedback BIBAKFull-Text 417-426
  Deepayan Chakrabarti; Deepak Agarwal; Vanja Josifovski
Contextual advertising supports much of the Web's ecosystem today. User experience and revenue (shared by the site publisher and the ad network) depend on the relevance of the displayed ads to the page content. As with other document retrieval systems, relevance is provided by scoring the match between individual ads (documents) and the content of the page where the ads are shown (query). In this paper we show how this match can be improved significantly by augmenting the ad-page scoring function with extra parameters from a logistic regression model on the words in the pages and ads. A key property of the proposed model is that it can be mapped to standard cosine similarity matching and is suitable for efficient and scalable implementation over inverted indexes. The model parameter values are learnt from logs containing ad impressions and clicks, with shrinkage estimators being used to combat sparsity. To scale our computations to an extremely large training corpus consisting of several gigabytes of data, we parallelize our fitting algorithm in a Hadoop framework [10]. Experimental evaluation shows improved click prediction over a holdout set of impression and click events from a large-scale real-world ad placement engine. Our best model achieves a 25% lift in precision, relative to a traditional information retrieval model based on cosine similarity, when recalling 10% of the clicks in our test data.
Keywords: clickthrough rate, interaction effects, modeling
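The key property claimed above (click-trained parameters that still fit a cosine-style, inverted-index-friendly match) can be illustrated as per-word multipliers folded into a tf-idf dot product. The click_weight dictionary below stands in for the logistic-regression coefficients the paper learns; all names here are assumptions for illustration.

    import math

    def click_adjusted_score(page_terms, ad_terms, idf, click_weight):
        """Cosine-style tf-idf match between a page and an ad, with per-word
        multipliers (click_weight) standing in for coefficients learned from
        impression/click logs."""
        shared = set(page_terms) & set(ad_terms)
        dot = sum(idf.get(w, 0.0) ** 2 * click_weight.get(w, 1.0) for w in shared)
        norm_page = math.sqrt(sum(idf.get(w, 0.0) ** 2 for w in set(page_terms)))
        norm_ad = math.sqrt(sum(idf.get(w, 0.0) ** 2 for w in set(ad_terms)))
        return dot / (norm_page * norm_ad) if norm_page and norm_ad else 0.0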

Search: crawlers

IRLbot: scaling to 6 billion pages and beyond BIBAKFull-Text 427-436
  Hsin-Tsang Lee; Derek Leonard; Xiaoming Wang; Dmitri Loguinov
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
Keywords: IRLbot, crawling, large-scale
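One of the scaling problems named above, verifying URL uniqueness without an in-memory set, is commonly handled by batching candidates and sort-merging them against sorted on-disk data; IRLbot's actual mechanism is a considerably more refined purpose-built disk structure. A minimal sketch under the assumption of a sorted one-URL-per-line file:

    def filter_unseen_urls(candidate_urls, seen_path):
        """Batched uniqueness check: sort the candidate batch and merge it
        against a sorted on-disk file of previously seen URLs (one per line).
        Returns the genuinely new URLs and appends them to the file; a real
        crawler would merge back in sorted order rather than append."""
        batch = sorted(set(candidate_urls))
        fresh = []
        with open(seen_path, "r", encoding="utf-8") as f:
            lines = (ln.rstrip("\n") for ln in f)
            seen = next(lines, None)
            for url in batch:
                while seen is not None and seen < url:
                    seen = next(lines, None)
                if seen != url:
                    fresh.append(url)
        with open(seen_path, "a", encoding="utf-8") as f:
            f.writelines(u + "\n" for u in fresh)
        return fresh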
Recrawl scheduling based on information longevity BIBAKFull-Text 437-446
  Christopher Olston; Sandeep Pandey
It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time.
   In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.
Keywords: information longevity, recrawling
iRobot: an intelligent crawler for web forums BIBAKFull-Text 447-456
  Rui Cai; Jiang-Ming Yang; Wei Lai; Yida Wang; Lei Zhang
In this paper, we study the Web forum crawling problem, a fundamental step in many Web applications such as search engines and Web data mining. As a typical form of user-created content (UCC), Web forums have become an important resource on the Web due to the rich information contributed by millions of Internet users every day. However, Web forum crawling is not a trivial problem, due to in-depth link structures, large numbers of duplicate pages, and many invalid pages caused by login failures. In this paper, we propose and build a prototype of an intelligent forum crawler, iRobot, which understands the content and structure of a forum site and then decides how to choose traversal paths among different kinds of pages. To do this, we first randomly sample (download) a few pages from the target forum site, and use page content layout as the characteristic by which to group those pre-sampled pages and reconstruct the forum's sitemap. After that, we select an optimal crawling path that traverses only informative pages and skips invalid and duplicate ones. Extensive experimental results on several forums show the performance of our system in the following aspects: 1) Effectiveness -- compared to a generic crawler, iRobot significantly decreases the number of duplicate and invalid pages; 2) Efficiency -- at the small cost of pre-sampling a few pages to learn the necessary knowledge, iRobot saves substantial network bandwidth and storage, as it fetches only informative pages from a forum site; and 3) Long threads that are divided into multiple pages can be re-concatenated and archived as whole threads, which is of great help for further indexing and data mining.
Keywords: forum crawler, sitemap construction, traversal path selection

Search: applications

Automatic online news issue construction in web environment BIBAKFull-Text 457-466
  Canhui Wang; Min Zhang; Shaoping Ma; Liyun Ru
In many cases, rather than performing a keyword search, people simply want to see what is going on through the Internet. This requires integrated, comprehensive information on news topics, which we call news issues, including the background, history, current progress, different opinions and discussions, etc. Traditionally, news issues are generated manually by website editors. This is quite time-consuming, and hence real-time updates are difficult to perform. In this paper, a three-step automatic online algorithm for news issue construction is proposed. The first step is a topic detection process, in which newly appearing stories are clustered into new topic candidates. The second step is a topic tracking process, where those candidates are compared with previous topics and either merged into old ones or used to generate a new one. In the final step, news issues are constructed by combining related topics and updated by inserting new topics. An automatic online news issue construction process under practical Web circumstances is simulated to perform news issue construction experiments. The F-measure of the best results is either above (topic detection) or close to (topic detection and tracking) 90%. News issue construction results are successfully generated at four different time granularities: one meets needs like "what's new", and the other three answer questions like "what's hot" or "what's going on". Through the proposed algorithm, news issues can be constructed effectively and automatically with real-time updates, freeing much human effort from tedious manual work.
Keywords: clustering, news issue, topic detection and tracking
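The detection step described above can be illustrated with single-pass incremental clustering, a classic topic-detection baseline: attach each incoming story to the most similar existing topic, or open a new topic when no similarity clears a threshold. The paper's full three-step pipeline is richer; the similarity function and threshold below are assumed inputs.

    def detect_topics(stories, similarity, threshold=0.5):
        """Single-pass incremental clustering over a stream of stories.
        `similarity` compares two stories (e.g., cosine over term vectors);
        `threshold` controls when a new topic is created."""
        topics = []
        for story in stories:
            best_sim, best_topic = 0.0, None
            for topic in topics:
                sim = max(similarity(story, member) for member in topic)
                if sim > best_sim:
                    best_sim, best_topic = sim, topic
            if best_topic is not None and best_sim >= threshold:
                best_topic.append(story)
            else:
                topics.append([story])
        return topics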
Finding the right facts in the crowd: factoid question answering over social media BIBAKFull-Text 467-476
  Jiang Bian; Yandong Liu; Eugene Agichtein; Hongyuan Zha
Community Question Answering has emerged as a popular and effective paradigm for a wide range of information needs. For example, to find out an obscure piece of trivia, it is now possible and even very effective to post a question on a popular community QA site such as Yahoo! Answers, and to rely on other users to provide answers, often within minutes. The importance of such community QA sites is magnified as they create archives of millions of questions and hundreds of millions of answers, many of which are invaluable for the information needs of other searchers. However, to make this immense body of knowledge accessible, effective answer retrieval is required. In particular, as any user can contribute an answer to a question, the majority of the content reflects personal, often unsubstantiated opinions. A ranking that combines both relevance and quality is required to make such archives usable for factual information retrieval. This task is challenging, as the structure and the contents of community QA archives differ significantly from the web setting. To address this problem we present a general ranking framework for factual information retrieval from social media. Results of a large-scale evaluation demonstrate that our method is highly effective at retrieving well-formed, factual answers to questions, as evaluated on a standard factoid QA benchmark. We also show that our learning framework can be tuned with minimal manual labeling. Finally, we provide result analysis to gain a deeper understanding of which features are significant for social media search and retrieval. Our system can be used as a crucial building block for combining results from a variety of social media content with general web search results, and to better integrate social media content for effective information access.
Keywords: community, question answering, ranking
Personalized interactive faceted search BIBAKFull-Text 477-486
  Jonathan Koren; Yi Zhang; Xue Liu
Faceted search is becoming a popular method to allow users to interactively search and navigate complex information spaces. A faceted search system presents users with key-value metadata that is used for query refinement. While popular in e-commerce and digital libraries, not much research has been conducted on which metadata to present to a user in order to improve the search experience. Nor are there repeatable benchmarks for evaluating a faceted search engine. This paper proposes the use of collaborative filtering and personalization to customize the search interface to each user's behavior. This paper also proposes a utility based framework to evaluate the faceted interface. In order to demonstrate these ideas and better understand personalized faceted search, several faceted search algorithms are proposed and evaluated using the novel evaluation methodology.
Keywords: collaborative recommendation, evaluation, faceted search, interactive search, personalization, user modeling

Security I: misc

Privacy-enhanced sharing of personal content on the web BIBAKFull-Text 487-496
  Mohammad Mannan; Paul C. van Oorschot
Publishing personal content on the web is gaining increased popularity with dramatic growth in social networking websites, and availability of cheap personal domain names and hosting services. Although the Internet enables easy publishing of any content intended to be generally accessible, restricting personal content to a selected group of contacts is more difficult. Social networking websites partially enable users to restrict access to a selected group of users of the same network by explicitly creating a "friends' list." While this limited restriction supports users' privacy on those (few) selected websites, personal websites must still largely be protected manually by sharing passwords or obscure links. Our focus is the general problem of privacy-enabled web content sharing from any user-chosen web server. By leveraging the existing "circle of trust" in popular Instant Messaging (IM) networks, we propose a scheme called IM-based Privacy-Enhanced Content Sharing (IMPECS) for personal web content sharing. IMPECS enables a publishing user's personal data to be accessible only to her IM contacts. A user can put her personal web page on any web server she wants (vs. being restricted to a specific social networking website), and maintain privacy of her content without requiring site-specific passwords. Our prototype of IMPECS required only minor modifications to an IM server, and PHP scripts on a web server. The general idea behind IMPECS extends beyond IM and IM circles of trust; any equivalent scheme, (ideally) containing pre-arranged groups, could similarly be leveraged.
Keywords: access control, circle of trust, privacy, sharing
Detecting image spam using visual features and near duplicate detection BIBAKFull-Text 497-506
  Bhaskar Mehta; Saurabh Nangia; Manish Gupta; Wolfgang Nejdl
Email spam is a much-studied topic, but even as current email spam detection software has been gaining a competitive edge against text-based email spam, new advances in spam generation have posed a new challenge: image-based spam, i.e., email that embeds the spam message in images, in binary form rather than machine-readable text. In this paper, we study the characteristics of image spam and propose two solutions for detecting image-based spam, drawing a comparison with existing techniques. The first solution, which uses visual features for classification, offers an accuracy of about 98%, i.e., an improvement of at least 6% over existing solutions. SVMs (Support Vector Machines) are used to train classifiers on judiciously chosen color, texture and shape features. The second solution offers a novel approach for near-duplicate detection in images. It involves clustering image GMMs (Gaussian Mixture Models) based on the Agglomerative Information Bottleneck (AIB) principle, using Jensen-Shannon (JS) divergence as the distance measure.
Keywords: email spam, image analysis, machine learning
Better abstractions for secure server-side scripting BIBAKFull-Text 507-516
  Dachuan Yu; Ajay Chander; Hiroshi Inamura; Igor Serikov
It is notoriously difficult to program a solid web application. Besides addressing web interactions, state maintenance, and whimsical user navigation behaviors, programmers must also avoid a minefield of security vulnerabilities. The problem is twofold. First, we lack a clear understanding of the new computation model underlying web applications. Second, we lack proper abstractions for hiding common and subtle coding details that are orthogonal to the business functionalities of specific web applications.
   This paper addresses both issues. First, we present a language BASS for declarative server-side scripting. BASS allows programmers to work in an ideal world, using new abstractions to tackle common but problematic aspects of web programming. The meta properties of BASS provide useful security guarantees. Second, we present a language MOSS reflecting realistic web programming concepts and scenarios, thus articulating the computation model behind web programming. Finally, we present a translation from BASS to MOSS, demonstrating how the ideal programming model and security guarantees of BASS can be implemented in practice.
Keywords: server-side scripting, web application security

Security II: web client security

Sessionlock: securing web sessions against eavesdropping BIBAKFull-Text 517-524
  Ben Adida
Typical web sessions can be hijacked easily by a network eavesdropper in attacks that have come to be designated "sidejacking." The rise of ubiquitous wireless networks, often unprotected at the transport layer, has significantly aggravated this problem. While SSL can protect against eavesdropping, its usability disadvantages often make it unsuitable when the data is not considered highly confidential. Most web-based email services, for example, use SSL only on their login page and are thus vulnerable to sidejacking.
   We propose SessionLock, a simple approach to securing web sessions against eavesdropping without extending the use of SSL. SessionLock is easily implemented by web developers using only JavaScript and simple server-side logic. Its performance impact is negligible, and all major web browsers are supported. Interestingly, it is particularly easy to implement on single-page AJAX web applications, e.g. Gmail or Yahoo mail, with approximately 200 lines of JavaScript and 60 lines of server-side verification code.
Keywords: web security
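The heart of the approach can be illustrated with a per-session secret that never travels in the clear (SessionLock keeps it in the URL fragment, which browsers do not transmit) and an HMAC over each request. The fields signed below, the header names, and the replay window are illustrative assumptions, not the paper's exact wire format.

    import hashlib
    import hmac
    import time

    def sign_request(secret: bytes, method: str, url: str) -> dict:
        """Client side: authenticate a request with an HMAC over the method,
        URL and a timestamp, keyed by the session secret."""
        ts = str(int(time.time()))
        mac = hmac.new(secret, f"{method} {url} {ts}".encode(),
                       hashlib.sha256).hexdigest()
        return {"X-Timestamp": ts, "X-Signature": mac}

    def verify_request(secret: bytes, method: str, url: str, headers: dict,
                       max_skew: int = 300) -> bool:
        """Server side: recompute the MAC and check the timestamp window to
        limit replay of sniffed requests."""
        if abs(int(headers["X-Timestamp"]) - int(time.time())) > max_skew:
            return False
        expected = hmac.new(secret,
                            f"{method} {url} {headers['X-Timestamp']}".encode(),
                            hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, headers["X-Signature"])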
Forcehttps: protecting high-security web sites from network attacks BIBAKFull-Text 525-534
  Collin Jackson; Adam Barth
As wireless networks proliferate, web browsers operate in an increasingly hostile network environment. The HTTPS protocol has the potential to protect web users from network attackers, but real-world deployments must cope with misconfigured servers, which cause imperfect web sites and users to compromise browsing sessions inadvertently. ForceHTTPS is a simple browser security mechanism that web sites or users can use to opt in to stricter error processing, improving the security of HTTPS by preventing network attacks that leverage the browser's lax error processing. By augmenting the browser with a database of custom URL rewrite rules, ForceHTTPS allows sophisticated users to transparently retrofit security onto some insecure sites that support HTTPS. We provide a prototype implementation of ForceHTTPS as a Firefox browser extension.
Keywords: HTTPS, eavesdropping, pharming, same-origin policy
SMash: secure component model for cross-domain mashups on unmodified browsers BIBAKFull-Text 535-544
  Frederik De Keukelaere; Sumeer Bhola; Michael Steiner; Suresh Chari; Sachiko Yoshihama
Mashup applications mix and merge content (data and code) from multiple content providers in a user's browser, to provide high-value web applications that can rival the user experience provided by desktop applications. Current browser security models were not designed to support such applications and they are therefore implemented with insecure workarounds. In this paper, we present a secure component model, where components are provided by different trust domains, and can interact using a communication abstraction that allows ease of specification of a security policy. We have developed an implementation of this model that works currently in all major browsers, and addresses challenges of communication integrity and frame-phishing. An evaluation of the performance of our implementation shows that this approach is not just feasible but also practical.
Keywords: browser, component model, mashup, phishing, web 2.0
Compoweb: a component-oriented web architecture BIBAKFull-Text 545-554
  Rui Guo; Bin B. Zhu; Min Feng; Aimin Pan; Bosheng Zhou
In this paper, client-side Web mashups are studied from a component-oriented perspective, and CompoWeb, a component-oriented Web architecture, is proposed. In CompoWeb, a Web application is decomposed into Web components called gadgets. A gadget is an abstraction of a functional or logical Web component. It is isolated from other gadgets for security and reliability; contract-based channels are the only way gadgets can interact with one another. An abstraction of the contract-based channels supported or required by a gadget is also presented. It enables binding of gadgets at deployment time and promotes interchangeable gadgets. Unlike the model of a normal function call, where the function logic is executed in the caller's context, CompoWeb ensures that the function logic is executed in the callee's context, so that both the caller and callee are protected. An implementation of a prototype CompoWeb system and its performance are also presented.
Keywords: browser, component, delayed-binding, encapsulation, interface, isolation, mashup, protection, reuse, same-origin policy, security, web

Semantic web I

Structured objects in owl: representation and reasoning BIBAKFull-Text 555-564
  Boris Motik; Bernardo Cuenca Grau; Ulrike Sattler
Applications of semantic technologies often require the representation of and reasoning with structured objects -- that is, objects composed of parts connected in complex ways. Although OWL is a general and powerful language, its class descriptions and axioms cannot be used to describe arbitrarily connected structures. An OWL representation of structured objects can thus be underconstrained, which reduces the inferences that can be drawn and causes performance problems in reasoning. To address these problems, we extend OWL with description graphs, which allow for the description of structured objects in a simple and precise way. To represent conditional aspects of the domain, we also allow for SWRL-like rules over description graphs. Based on an observation about the nature of structured objects, we ensure decidability of our formalism. We also present a hypertableau-based decision procedure, which we implemented in the HermiT reasoner. To evaluate its performance, we have extracted description graphs from the GALEN and FMA ontologies, classified them successfully, and even detected a modeling error in GALEN.
Keywords: description logics, owl, structured objects
Computing minimum cost diagnoses to repair populated DL-based ontologies BIBAKFull-Text 565-574
  Jianfeng Du; Yi-Dong Shen
Ontology population is prone to cause inconsistency, because the populating process is imprecise or the populated data may conflict with the original data. Assuming that the intensional part of the populated DL-based ontology is fixed and that each removable ABox assertion is given a removal cost, we repair the ontology by deleting a subset of removable ABox assertions in which the sum of removal costs is minimal. We call such a subset a minimum cost diagnosis. We show that, unless P=NP, the problem of finding a minimum cost diagnosis for a DL-Lite ontology cannot be solved in PTIME w.r.t. data complexity. In spite of that, we present a feasible computational method for more general (i.e., SHIQ) ontologies. It transforms a SHIQ ontology into a set of disjoint propositional programs, thus reducing the original problem to a set of independent subproblems. Each such subproblem computes an optimal model and is solvable with a logarithmic number of calls to a SAT solver. Experimental results show that the method can handle moderately complex ontologies with thousands of ABox assertions, where all ABox assertions can be assumed removable.
Keywords: description logics, diagnosis, disjunctive datalog, ontologies
Scalable querying services over fuzzy ontologies BIBAKFull-Text 575-584
  Jeff Z. Pan; Giorgos Stamou; Giorgos Stoilos; Stuart Taylor; Edward Thomas
Fuzzy ontologies are envisioned to be useful in the Semantic Web. Existing fuzzy ontology reasoners are not scalable enough to handle the scale of data that the Web provides. In this paper, we propose a framework of fuzzy query languages for fuzzy ontologies, and present query answering algorithms for these query languages over fuzzy DL-Lite ontologies. Moreover, this paper reports on the implementation of our approach in the fuzzy DL-Lite query engine of the ONTOSEARCH2 system, together with preliminary, but encouraging, benchmarking results. To the best of our knowledge, this is the first scalable query engine for fuzzy ontologies.
Keywords: fuzzy SPARQL, fuzzy ontology, lightweight ontology language, scalable query answering, semantic web

Semantic web II

Networked graphs: a declarative mechanism for SPARQL rules, SPARQL views and RDF data integration on the web BIBAKFull-Text 585-594
  Simon Schenk; Steffen Staab
Easy reuse and integration of declaratively described information in a distributed setting is one of the main motivations for building the Semantic Web. Despite this claim, reuse and recombination of RDF data today is mostly done using data replication and procedural code. A simple declarative mechanism for reusing and combining RDF data would help users generate content for the Semantic Web. With such a mechanism, the Semantic Web could better benefit from user-generated content, which is broadly present in the so-called Web 2.0, as well as from better linkage of existing content.
   We propose Networked Graphs, which allow users to define RDF graphs both by extensionally listing content and by using views on other graphs. These views can be used to include parts of other graphs, to transform data before including it, and to denote rules. The relationships between graphs are described declaratively using SPARQL queries and an extension of the SPARQL semantics. Networked Graphs are easily exchangeable between, and interpretable on, different computers. Using existing protocols, Networked Graphs can be evaluated in a distributed setting.
Keywords: SPARQL, distributed rules, rules, semantic web, views, well founded semantics
SPARQL basic graph pattern optimization using selectivity estimation BIBAKFull-Text 595-604
  Markus Stocker; Andy Seaborne; Abraham Bernstein; Christoph Kiefer; Dave Reynolds
In this paper, we formalize the problem of Basic Graph Pattern (BGP) optimization for SPARQL queries and main-memory graph implementations of RDF data. We define and analyze the characteristics of heuristics for selectivity-based static BGP optimization. The heuristics range from simple triple pattern variable counting to more sophisticated selectivity estimation techniques. Customized summary statistics for RDF data enable the selectivity estimation of joined triple patterns and the development of efficient heuristics. Using the Lehigh University Benchmark (LUBM), we evaluate the performance of the heuristics on the queries provided by the LUBM and discuss some of them in more detail.
Keywords: SPARQL, query optimization, selectivity estimation
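The simplest of the heuristics mentioned above can be sketched as a greedy static reordering: evaluate the triple pattern with the smallest estimated selectivity first. The 0.1 discounts for bound terms below are invented constants for illustration; the paper derives its estimates from summary statistics over the RDF data.

    def order_triple_patterns(patterns, predicate_freq):
        """Greedy static reordering for a basic graph pattern.
        `patterns` are (s, p, o) tuples with None marking variables;
        `predicate_freq` maps a predicate to its relative frequency."""
        def estimated_selectivity(tp):
            s, p, o = tp
            sel = predicate_freq.get(p, 1.0) if p is not None else 1.0
            if s is not None:
                sel *= 0.1    # hypothetical bonus for a bound subject
            if o is not None:
                sel *= 0.1    # hypothetical bonus for a bound object
            return sel
        return sorted(patterns, key=estimated_selectivity)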
Scaling RDF with Time BIBAKFull-Text 605-614
  Andrea Pugliese; Octavian Udrea; V. S. Subrahmanian
The World Wide Web Consortium's RDF standard primarily consists of (subject, property, object) triples that specify the value that a given subject has for a given property. However, it is frequently the case that even for a fixed subject and property, the value varies with time. As a consequence, efforts have been made to annotate RDF triples with "valid time" intervals. However, to date, no proposals exist for efficient indexing of such temporal RDF databases. It is clearly beneficial to store RDF data in a relational DB -- however, standard relational indexes are inadequately equipped to handle RDF's graph structure. In this paper, we propose the tGRIN index structure that builds a specialized index for temporal RDF that is physically stored in an RDBMS. Past efforts to store RDF in relational stores include Jena2 from HP, Sesame from OpenRDF.org, and 3store from the University of Southampton. We show that even when these efforts are augmented with well known temporal indexes like R+ trees, SR-trees, ST-index, and MAP21, the tGRIN index exhibits superior performance. In terms of index build time, tGRIN takes two thirds or less of the time used by any other system, and it uses a comparable amount of memory and less disk space than Jena, Sesame and 3store. More importantly, tGRIN can answer queries three to six times faster for average query graph patterns and five to ten times faster for complex queries than these systems.
Keywords: RDF indexing, resource description framework, temporal RDF

Semantic web III

Wiki content templating BIBAKFull-Text 615-624
  Angelo Di Iorio; Fabio Vitali; Stefano Zacchiroli
Wiki content templating enables reuse of content structures among wiki pages. In this paper we present a thorough study of this widespread feature, showing how its two state-of-the-art models (functional and creational templating) are sub-optimal. We then propose a third, better model, called lightly constrained (LC) templating, and show its implementation in the Moin wiki engine. We also show how LC templating implementations are the appropriate technologies to push forward semantically rich web pages on the lines of (lowercase) semantic web and microformats.
Keywords: microformats, semantic web, template, wiki engine
Querying for meta knowledge BIBAKFull-Text 625-634
  Bernhard Schueler; Sergej Sizov; Steffen Staab; Duc Thanh Tran
The Semantic Web is based on accessing and reusing RDF data from many different sources, to which one may assign different levels of authority and credibility. Existing Semantic Web query languages, like SPARQL, have targeted the retrieval, combination and reuse of facts, but have so far ignored all aspects of meta knowledge, such as origins, authorship, recency or certainty of data, to name but a few.
   In this paper, we present an original, generic, formalized and implemented approach for managing many dimensions of meta knowledge, like source, authorship, certainty and others. The approach re-uses existing RDF modeling possibilities in order to represent meta knowledge. Then, it extends SPARQL query processing in such a way that given a SPARQL query for data, one may request meta knowledge without modifying the original query. Thus, our approach achieves highly flexible and automatically coordinated querying for data and meta knowledge, while completely separating the two areas of concern.
Keywords: RDF, SPARQL, semantic web
Automatically refining the wikipedia infobox ontology BIBAKFull-Text 635-644
  Fei Wu; Daniel S. Weld
The combined efforts of human volunteers have recently extracted numerous facts from Wikipedia, storing them as machine-harvestable object-attribute-value triples in Wikipedia infoboxes. Machine learning systems, such as Kylin, use these infoboxes as training data, accurately extracting even more semantic knowledge from natural language text. But in order to realize the full power of this information, it must be situated in a cleanly-structured ontology. This paper introduces KOG, an autonomous system for refining Wikipedia's infobox-class ontology towards this end. We cast the problem of ontology refinement as a machine learning problem and solve it using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. We present experiments demonstrating the superiority of the joint-inference approach and evaluating other aspects of our system. Using these techniques, we build a rich ontology, integrating Wikipedia's infobox-class schemata with WordNet. We demonstrate how the resulting ontology may be used to enhance Wikipedia with improved query processing and other features.
Keywords: markov logic networks, ontology, semantic web, wikipedia

Social networks: analysis of social networks & Online Interactive Spaces

Statistical analysis of the social network and discussion threads in slashdot BIBAKFull-Text 645-654
  Vicenç Gómez; Andreas Kaltenbrunner; Vicente López
We analyze the social network emerging from the user comment activity on the website Slashdot. The network presents common features of traditional social networks, such as a giant component, small average path length and high clustering, but differs from them in showing moderate reciprocity and neutral assortativity by degree. Using Kolmogorov-Smirnov statistical tests, we show that the degree distributions are better explained by log-normal than by power-law distributions. We also study the structure of discussion threads using an intuitive radial tree representation. Threads show strong heterogeneity and self-similarity throughout the different nesting levels of a conversation. We use these results to propose a simple measure to evaluate the degree of controversy provoked by a post.
Keywords: bulletin board, h-index, log-normal, online communities, power-law, radial tree, social networks, thread, weblogs
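A rough analogue of the distribution comparison above, at toy scale: fit log-normal and continuous power-law (Pareto) models to a degree sample and compare Kolmogorov-Smirnov distances. Function and variable names are illustrative; scipy is assumed available.

    import numpy as np
    from scipy import stats

    def compare_degree_fits(degrees):
        """Fit a log-normal and a Pareto model to positive degree values and
        report the KS distance for each; the smaller distance indicates the
        better fit. (Strictly, KS p-values are biased when the parameters
        are fitted from the same data; this sketch ignores that subtlety.)"""
        data = np.asarray([d for d in degrees if d > 0], dtype=float)
        ln_params = stats.lognorm.fit(data, floc=0)
        d_ln, _ = stats.kstest(data, "lognorm", args=ln_params)
        pl_params = stats.pareto.fit(data, floc=0)
        d_pl, _ = stats.kstest(data, "pareto", args=pl_params)
        return {"lognormal": d_ln, "powerlaw": d_pl}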
Yes, there is a correlation: -- from social networks to personal behavior on the web BIBAKFull-Text 655-664
  Parag Singla; Matthew Richardson
Characterizing the relationship that exists between a person's social group and his/her personal behavior has been a long standing goal of social network analysts. In this paper, we apply data mining techniques to study this relationship for a population of over 10 million people, by turning to online sources of data. The analysis reveals that people who chat with each other (using instant messaging) are more likely to share interests (their Web searches are the same or topically similar). The more time they spend talking, the stronger this relationship is. People who chat with each other are also more likely to share other personal characteristics, such as their age and location (and, they are likely to be of opposite gender). Similar findings hold for people who do not necessarily talk to each other but do have a friend in common. Our analysis is based on a well-defined mathematical formulation of the problem, and is the largest such study we are aware of.
Keywords: demographics, instant messaging, search, social networks
Knowledge sharing and yahoo answers: everyone knows something BIBAKFull-Text 665-674
  Lada A. Adamic; Jun Zhang; Eytan Bakshy; Mark S. Ackerman
Yahoo Answers (YA) is a large and diverse question-answer forum, acting not only as a medium for sharing technical knowledge, but as a place where one can seek advice, gather opinions, and satisfy one's curiosity about a countless number of things. In this paper, we seek to understand YA's knowledge sharing and activity. We analyze the forum categories and cluster them according to content characteristics and patterns of interaction among the users. While interactions in some categories resemble expertise sharing forums, others incorporate discussion, everyday advice, and support. With such a diversity of categories in which one can participate, we find that some users focus narrowly on specific topics, while others participate across categories. This not only allows us to map related categories, but to characterize the entropy of the users' interests. We find that lower entropy correlates with receiving higher answer ratings, but only for categories where factual expertise is primarily sought after. We combine both user attributes and answer characteristics to predict, within a given category, whether a particular answer will be chosen as the best answer by the asker.
Keywords: expertise finding, help seeking, knowledge sharing, online communities, question answering, social network analysis
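The entropy of a user's interests mentioned above is plain Shannon entropy over the user's distribution of activity across categories; low entropy indicates narrow focus, high entropy broad participation. A minimal sketch:

    import math
    from collections import Counter

    def interest_entropy(user_answer_categories):
        """Shannon entropy (bits) of the distribution of a user's answers
        over categories."""
        counts = Counter(user_answer_categories)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    # e.g. interest_entropy(["Programming", "Programming", "Travel"]) ≈ 0.918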

Social networks: discovery & evolution of communities

Tag-based social interest discovery BIBAKFull-Text 675-684
  Xin Li; Lei Guo; Yihong Eric Zhao
The success and popularity of social network systems, such as del.icio.us, Facebook, MySpace, and YouTube, have generated many interesting and challenging problems for the research community. Among others, discovering social interests shared by groups of users is very important because it helps to connect people with common interests and encourages people to contribute and share more content. The main challenge in solving this problem comes from the difficulty of detecting and representing the interests of users. The existing approaches are all based on the online connections of users and are thus unable to identify the common interests of users who have no online connections.
   In this paper, we propose a novel social interest discovery approach based on user-generated tags. Our approach is motivated by the key observation that in a social network, human users tend to use descriptive tags to annotate the contents they are interested in. Our analysis of a large amount of real-world traces reveals that, in general, user-generated tags are consistent with the web content they are attached to, while being more concise and closer to the understanding and judgment of human users about the content. Thus, patterns of frequent co-occurrences of user tags can be used to characterize and capture topics of user interest. We have developed an Internet Social Interest Discovery system, ISID, to discover common user interests and cluster users and their saved URLs by different interest topics. Our evaluation shows that ISID can effectively cluster similar documents by interest topic and discover user communities with common interests, regardless of whether they have any online connections.
Keywords: ISID, del.icio.us, social networks, tag
Facetnet: a framework for analyzing communities and their evolutions in dynamic networks BIBAKFull-Text 685-694
  Yu-Ru Lin; Yun Chi; Shenghuo Zhu; Hari Sundaram; Belle L. Tseng
We discover communities from social network data and analyze the community evolution. These communities are inherent characteristics of human interaction in online social networks, as well as in paper citation networks. Communities may also evolve over time, due to changes in individuals' roles and social status in the network as well as changes in individuals' research interests. We present an innovative algorithm that deviates from the traditional two-step approach to analyzing community evolutions. In the traditional approach, communities are first detected for each time slice and then compared to determine correspondences. We argue that this approach is inappropriate in applications with noisy data. In this paper, we propose FacetNet for analyzing communities and their evolutions through a robust unified process. In this novel framework, communities not only generate evolutions, they are also regularized by the temporal smoothness of evolutions. As a result, the framework discovers communities that jointly maximize the fit to the observed data and the temporal evolution. Our approach relies on formulating the problem in terms of non-negative matrix factorization, where communities and their evolutions are factorized in a unified way. We then develop an iterative algorithm, with proven low time complexity, which is guaranteed to converge to an optimal solution. We perform extensive experimental studies, on both synthetic and real datasets, to demonstrate that our method discovers meaningful communities and provides additional insights not directly obtainable from traditional methods.
Keywords: community, community net, evolution, evolution net, non-negative matrix factorization, soft membership
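The factorization at the core of the framework above can be illustrated with plain non-negative matrix factorization using multiplicative updates; the temporal-smoothness regularization that couples successive snapshots, and the soft community memberships, are omitted from this sketch.

    import numpy as np

    def nmf(W, k, iters=200, eps=1e-9):
        """Non-negative matrix factorization with multiplicative updates,
        minimizing ||W - X @ Y||_F^2 for a non-negative n-by-m matrix W."""
        n, m = W.shape
        rng = np.random.default_rng(0)
        X = rng.random((n, k)) + eps          # e.g. community memberships
        Y = rng.random((k, m)) + eps
        for _ in range(iters):
            Y *= (X.T @ W) / (X.T @ X @ Y + eps)
            X *= (W @ Y.T) / (X @ Y @ Y.T + eps)
        return X, Y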
Statistical properties of community structure in large social and information networks BIBAKFull-Text 695-704
  Jure Leskovec; Kevin J. Lang; Anirban Dasgupta; Michael W. Mahoney
A large body of work has been devoted to identifying community structure in networks. A community is often thought of as a set of nodes that has more connections between its members than to the remainder of the network. In this paper, we characterize the statistical and structural properties of such sets of nodes as a function of their size. We define the network community profile plot, which characterizes the "best" possible community -- according to the conductance measure -- over a wide range of size scales, and we study over 70 large sparse real-world networks taken from a wide range of application domains. Our results suggest a significantly more refined picture of community structure in large real-world networks than has been appreciated previously.
   Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from small social networks that have served as testbeds of community detection algorithms. We have found, however, that a generative model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community structure similar to our observations.
Keywords: community structure, conductance, graph partitioning, random walks, social networks
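The conductance measure underlying the network community profile plot is straightforward to compute for a candidate node set; a minimal sketch over an adjacency-dict graph representation:

    def conductance(adj, S):
        """Conductance of node set S in an undirected graph given as
        adj: node -> set of neighbors. It is the number of cut edges divided
        by the smaller side's edge volume; lower values indicate more
        community-like sets."""
        S = set(S)
        cut = sum(1 for u in S for v in adj[u] if v not in S)
        vol_S = sum(len(adj[u]) for u in S)
        vol_rest = sum(len(adj[u]) for u in adj if u not in S)
        denom = min(vol_S, vol_rest)
        return cut / denom if denom else float("inf")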

Social networks: applications and infrastructures

Why web 2.0 is good for learning and for research: principles and prototypes BIBAKFull-Text 705-714
  Carsten Ullrich; Kerstin Borau; Heng Luo; Xiaohong Tan; Liping Shen; Ruimin Shen
The term "Web 2.0" is used to describe applications that distinguish themselves from previous generations of software by a number of principles. Existing work shows that Web 2.0 applications can be successfully exploited for technology-enhance learning. However, in-depth analyses of the relationship between Web 2.0 technology on the one hand and teaching and learning on the other hand are still rare. In this article, we will analyze the technological principles of the Web 2.0 and describe their pedagogical implications on learning. We will furthermore show that Web 2.0 is not only well suited for learning but also for research on learning: the wealth of services that is available and their openness regarding API and data allow to assemble prototypes of technology-supported learning applications in amazingly small amount of time. These prototypes can be used to evaluate research hypotheses quickly. We will present two example prototypes and discuss the lessons we learned from building and using these prototypes.
Keywords: education, research, web 2.0
Exploring social annotations for information retrieval BIBAKFull-Text 715-724
  Ding Zhou; Jiang Bian; Shuyi Zheng; Hongyuan Zha; C. Lee Giles
Social annotation has gained increasing popularity in many Web-based applications, leading to an emerging research area in text analysis and information retrieval. This paper is concerned with developing probabilistic models and computational algorithms for social annotations. We propose a unified framework to combine the modeling of social annotations with the language modeling-based methods for information retrieval. The proposed approach consists of two steps: (1) discovering topics in the contents and annotations of documents while categorizing the users by domains; and (2) enhancing document and query language models by incorporating user domain interests as well as topical background models. In particular, we propose a new general generative model for social annotations, which is then simplified to a computationally tractable hierarchical Bayesian network. Then we apply smoothing techniques in a risk minimization framework to incorporate the topical information to language models. Experiments are carried out on a real-world annotation data set sampled from del.icio.us. Our results demonstrate significant improvements over the traditional approaches.
Keywords: folksonomy, information retrieval, language modeling, social annotations
Lock-free consistency control for web 2.0 applications BIBAKFull-Text 725-734
  Jiangming Yang; Haixun Wang; Ning Gu; Yiming Liu; Chunsong Wang; Qiwei Zhang
Online collaboration and sharing is the central theme of many web-based services that create the so-called Web 2.0 phenomenon. Using the Internet as a computing platform, many Web 2.0 applications set up mirror sites to provide large-scale availability and to achieve load balance. However, in the age of Web 2.0, where every user is also a writer and publisher, the deployment of mirror sites makes consistency maintenance a Web-scale problem. Traditional concurrency control methods (e.g., two-phase locking, serialization, etc.) are not up to the task for several reasons. First, the large network latency between mirror sites would make two-phase locking a throughput bottleneck. Second, locking would block a large portion of concurrent operations, making it impossible to provide large-scale availability. On the other hand, most Web 2.0 operations do not need strict serializability -- it is not the intention of a user who is correcting a typo in a shared document to block another who is adding a comment, as long as consistency can still be achieved. Thus, in order to enable maximal online collaboration and sharing, we need a lock-free mechanism that can maintain consistency among mirror sites on the Web. In this paper, we propose a flexible and efficient method to achieve consistency maintenance in the Web 2.0 world. Our experiments show a good performance improvement compared with existing methods based on distributed locks.
Keywords: concurrency control, consistency maintenance, xml

Web engineering -- applications

Mining, indexing, and searching for textual chemical molecule information on the web BIBAKFull-Text 735-744
  Bingjun Sun; Prasenjit Mitra; C. Lee Giles
Current search engines do not support user searches for chemical entities (chemical names and formulae) beyond simple keyword searches. Usually a chemical molecule can be represented in multiple textual ways. A simple keyword search would retrieve only the exact match and not the others. We show how to build a search engine that enables searches for chemical entities and demonstrate empirically that it improves the relevance of returned documents. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports the different query models that a scientist may require. We propose a model of hierarchical conditional random fields for chemical formula tagging that considers long-term dependencies at the sentence level. To support substring searches of chemical names, a search engine must index substrings of chemical names. Indexing all possible subsequences is not feasible in practice. We propose an algorithm for independent frequent subsequence mining to discover sub-terms of chemical names along with their probabilities. We then propose an unsupervised hierarchical text segmentation (HTS) method to represent a sequence with a tree structure based on the discovered independent frequent subsequences, so that sub-terms on the HTS tree can be indexed. Query models with corresponding ranking functions are introduced for chemical name searches. Experiments show that our approaches to chemical entity tagging perform well. Furthermore, we show that index pruning can reduce the index size and query time without significantly changing the returned ranked results. Finally, experiments show that our approaches outperform traditional methods for document search with ambiguous chemical terms.
Keywords: conditional random fields, entity extraction, hierarchical text segmentation, independent frequent subsequence, index pruning, ranking, similarity search, substring search
Value-driven design for "infosuasive" web applications BIBAKFull-Text 745-754
  Davide Bolchini; Franca Garzotto; Paolo Paolini
An infosuasive web application is intended to be at the same time informative and persuasive, i.e., it aims at supporting knowledge needs while also having the (declared or undeclared) goal of influencing users' opinions, attitudes and behaviors. Most web applications, in fact, are infosuasive (except those whose aim is mainly operational). In this paper, we investigate the complex set of elements that informs the very early design of infosuasive web applications. We propose a conceptual framework aimed at supporting the actors involved in this process to integrate their different viewpoints, to organize the variety of issues that need to be analyzed, to find a direction among the numerous design options, and to represent the results of this activity in an effective way.
   Our approach is value-driven since it is centered around the concept of communication value, regarded as a vehicle to fulfill communication goals on specific communication targets. We place the analysis of these aspects in the wider context of web requirements analysis, highlighting their relationships with business values analysis and user needs analysis. We pinpoint how values and communication goals impact on various design dimensions of infosuasive web application -- contents, information architecture, interaction, operations, and lay-out. Our approach is multidisciplinary, and was inspired to goal-based and value-based requirements engineering (often used in web engineering), to brand design (often used in marketing), and to value-centered design "frameworks"(as proposed by the HCI community). A case study exemplifies our methodological proposal, discussing a large project in which we are currently involved.
Keywords: brand, communication goals, conceptual model, globalization, persuasive design, requirements, value, value-driven design, web design, web engineering
Organizing and sharing distributed personal web-service data BIBAKFull-Text 755-764
  Roxana Geambasu; Cherie Cheung; Alexander Moshchuk; Steven D. Gribble; Henry M. Levy
The migration from desktop applications to Web-based services is scattering personal data across a myriad of Web sites, such as Google, Flickr, YouTube, and Amazon S3. This dispersal poses new challenges for users, making it more difficult for them to: (1) organize, search, and archive their data, much of which is now hosted by Web sites; (2) create heterogeneous, multi-Web-service object collections and share them in a protected way; and (3) manipulate their data with standard applications or scripts.
   In this paper, we show that a Web-service interface supporting standardized naming, protection, and object-access services can solve these problems and can greatly simplify the creation of a new generation of object-management services for the Web. We describe the implementation of Menagerie, a proof-of-concept prototype that provides these services for Web-based applications. At a high level, Menagerie creates an integrated file and object system from heterogeneous, personal Web-service objects dispersed across the Internet. We present several object-management applications we developed on Menagerie to show the practicality and benefits of our approach.
Keywords: data organization, data sharing, menagerie, web services

Web engineering -- web service composition

Matching independent global constraints for composite web services BIBAKFull-Text 765-774
  Nalaka Gooneratne; Zahir Tari
Service discovery employs matching techniques to select services by comparing their descriptions against user constraints. Semantic-based matching approaches achieve higher recall than syntactic-based ones, as they employ ontological reasoning mechanisms to match syntactically heterogeneous descriptions. However, semantic-based approaches still have problems, e.g., a lack of scalability, as an exhaustive search is often performed to locate services conforming to constraints. This paper proposes two approaches that deal with the problem of scalability/performance for composite service location. First, services are indexed based on the values they assign to their restricted attributes (the attributes restricted by a given constraint). Then, services that assign "conforming values" to those attributes are combined to form composite services. Since identifying such values is NP-hard, the first proposed approach extends a local optimisation technique to perform this step. However, this approach returns false negatives, since the local optimisation technique does not consider all the values. Hence, a second approach that derives conforming values using domain rules is defined. The rules used are returned with each composite service so that a user can understand the context in which it is retrieved. Results obtained from experiments that varied the number of available services demonstrate that the performance of the local optimisation-based approach is 76% better than existing semantic-based approaches, while recall is 98% higher than syntactic-based approaches.
Keywords: composite matching, service discovery, service matching
Wishful search: interactive composition of data mashups BIBAKFull-Text 775-784
  Anton V. Riabov; Eric Boillet; Mark D. Feblowitz; Zhen Liu; Anand Ranganathan
With the emergence of Yahoo Pipes and several similar services, data mashup tools have started to gain the interest of business users. Making these tools simple and accessible to users with little or no programming experience has become a pressing issue. In this paper we introduce MARIO (Mashup Automation with Runtime Orchestration and Invocation), a new tool that radically simplifies data mashup composition. We have developed an intelligent automatic composition engine in MARIO, together with a simple user interface using an intuitive "wishful search" abstraction. It thus allows users to explore the space of potentially composable data mashups and preview composition results as they iteratively refine their "wishes", i.e. mashup composition goals. It also lets users discover and make use of system capabilities without having to understand the capabilities of individual components, and instantly reflects changes made to the components by presenting an aggregate view of the changed capabilities of the entire system. We describe our experience with using MARIO to compose flows of Yahoo Pipes components.
Keywords: composition, programmable web, tag cloud
Extending the compatibility notion for abstract WS-BPEL processes BIBAKFull-Text 785-794
  Dieter König; Niels Lohmann; Simon Moser; Christian Stahl; Karsten Wolf
WS-BPEL defines a standard for executable processes. Executable processes are business processes which can be automated through an IT infrastructure. The WS-BPEL specification also introduces the concept of abstract processes: in contrast to their executable siblings, abstract processes are not executable and can have parts where business logic is disguised. Nevertheless, the WS-BPEL specification introduces a notion of compatibility between such an under-specified abstract process and a fully specified executable one. Basically, this compatibility notion defines a set of syntactical rules that can be augmented or restricted by profiles. So far, two such profiles exist: the Abstract Process Profile for Observable Behavior and the Abstract Process Profile for Templates. Neither of these profiles defines a concept of behavioral equivalence. Therefore, both profiles are too strict with respect to the rules they impose when deciding whether an executable process is compatible with an abstract one. In this paper, we propose a novel profile that extends the existing Abstract Process Profile for Observable Behavior by defining a behavioral relationship. We also show that our novel profile allows for more flexibility when deciding whether an executable and an abstract process are compatible.
Keywords: WS-BPEL, abstract profile, compliance, petri nets

Web engineering -- web service deployment

Investigating web services on the world wide web BIBAKFull-Text 795-804
  Eyhab Al-Masri; Qusay H. Mahmoud
Searching for Web service access points is no longer tied to service registries, as Web search engines have become a major new source for discovering Web services. In this work, we conduct a thorough analytical investigation of the plurality of Web service interfaces that exist on the Web today. Using our Web Service Crawler Engine (WSCE), we collect metadata service information on retrieved interfaces through accessible UBRs, service portals and search engines. We use this data to determine Web service statistics and distribution based on object sizes, types of technologies employed, and the number of functioning services. This statistical data can be used to help determine the current status of Web services. We find the intriguing result that 63% of the available Web services on the Web are active. We further use our findings to provide insights on improving the service retrieval process.
Keywords: UDDI, UDDI business registries, WSCE, WSDL, crawler, crawling, interface, searching, service portals, web services
Restful web services vs. "big" web services: making the right architectural decision BIBAKFull-Text 805-814
  Cesare Pautasso; Olaf Zimmermann; Frank Leymann
Recent technology trends in the Web Services (WS) domain indicate that a solution eliminating the presumed complexity of the WS-* standards may be in sight: advocates of REpresentational State Transfer (REST) have come to believe that their ideas explaining why the World Wide Web works are just as applicable to solving enterprise application integration problems and to simplifying the plumbing required to build service-oriented architectures. In this paper we objectify the WS-* vs. REST debate by giving a quantitative technical comparison based on architectural principles and decisions. We show that the two approaches differ in the number of architectural decisions that must be made and in the number of available alternatives. This discrepancy between freedom-from-choice and freedom-of-choice explains the perceived difference in complexity. However, we also show that there are significant differences in the consequences of certain decisions in terms of resulting development and maintenance costs. Our comparison helps technical decision makers to assess the two integration styles and technologies more objectively and select the one that best fits their needs: REST is well suited for basic, ad hoc integration scenarios, whereas WS-* is more flexible and addresses advanced quality-of-service requirements commonly occurring in enterprise computing.
Keywords: HTTP, REST, SOAP, WS-* vs. REST, WSDL, architectural decision modeling, resource oriented architectures, service oriented architectures, technology comparison, web services
Non-intrusive monitoring and service adaptation for WS-BPEL BIBAKFull-Text 815-824
  Oliver Moser; Florian Rosenberg; Schahram Dustdar
Web service processes currently lack monitoring and dynamic (runtime) adaptation mechanisms. In highly dynamic processes, services frequently need to be exchanged for a variety of reasons. In this paper we present VieDAME, a system which allows monitoring of BPEL processes according to Quality of Service (QoS) attributes and replacement of existing partner services based on various (pluggable) replacement strategies. The chosen replacement services can be syntactically or semantically equivalent to the BPEL interface. Services can be replaced automatically at runtime without any downtime of the overall system. We implemented our solution with an aspect-oriented approach, intercepting SOAP messages and allowing services to be exchanged at runtime with a small performance penalty, as shown in our experiments, thereby making our approach suitable for high-availability BPEL environments.
Keywords: BPEL, message mediation, monitoring, quality of service, service selection

XML I

Learning deterministic regular expressions for the inference of schemas from XML data BIBAKFull-Text 825-834
  Geert Jan Bex; Wouter Gelade; Frank Neven; Stijn Vansummeren
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, no algorithm is capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated on both real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
Keywords: XML, regular expressions, schema inference
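As a point of reference for the k-ORE notion above, the following Python sketch (an illustrative helper, not the paper's learner) computes the smallest k for which a sample of child-element sequences stays within the k-ORE class, i.e. the largest number of times any symbol repeats within a single word:

    from collections import Counter

    def min_consistent_k(words):
        # Smallest k such that no symbol occurs more than k times in any
        # single sample word, i.e. the sample fits the k-ORE class.
        k = 1
        for word in words:
            if word:
                k = max(k, max(Counter(word).values()))
        return k

    # Child-element sequences of some hypothetical element, one per document.
    sample = [["title", "author", "author", "year"],
              ["title", "author", "year"]]
    print(min_consistent_k(sample))  # 2 -> "author" repeats twice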
Efficient evaluation of generalized path pattern queries on XML data BIBAKFull-Text 835-844
  Xiaoying Wu; Stefanos Souldatos; Dimitri Theodoratos; Theodore Dalamagas; Timos Sellis
Finding the occurrences of structural patterns in XML data is a key operation in XML query processing. Existing algorithms for this operation focus almost exclusively on path patterns or tree patterns. Requirements for flexible querying of XML data have recently motivated the introduction of query languages that allow a partial specification of path patterns in a query. In this paper, we focus on the efficient evaluation of partial path queries, a generalization of path pattern queries. Our approach explicitly deals with repeated labels (that is, multiple occurrences of the same label in a query).
   We show that partial path queries can be represented as rooted dags for which a topological ordering of the nodes exists. We present three algorithms for the efficient evaluation of these queries under the indexed streaming evaluation model. The first one exploits a structural summary of the data to generate a set of path patterns that together are equivalent to a partial path query. To evaluate these path patterns, we extend PathStack so that it can work on path patterns with repeated labels. The second one extracts a spanning tree from the query dag, uses a stack-based algorithm to find the matches of the root-to-leaf paths in the tree, and merge-joins the matches to compute the answer. Finally, the third one exploits multiple pointers of stack entries and a topological ordering of the query dag to apply a stack-based holistic technique. An analysis of the algorithms and an extensive experimental evaluation show that the holistic algorithm outperforms the others.
Keywords: XML, Xpath query evaluation
On incremental maintenance of 2-hop labeling of graphs BIBAKFull-Text 845-854
  Ramadhana Bramandia; Byron Choi; Wee Keong Ng
Recent interest in XML, the Semantic Web, and Web ontologies, among other topics, has sparked a renewed interest in graph-structured databases. A fundamental query on graphs is the reachability test of nodes. Recently, 2-hop labeling has been proposed to index large collections of XML and/or graphs for efficient reachability tests. However, there has been little work on updates of 2-hop labeling, even though Web data changes over time. In response, this paper studies the incremental maintenance of 2-hop labeling. We identify the main reason for the inefficiency of updates of existing 2-hop labels. We propose two updatable 2-hop labelings, hybrids of 2-hop labeling, and their incremental maintenance algorithms. The proposed 2-hop labeling is derived from graph connectivities, as opposed to the set cover used by all previous work. Our experimental evaluation illustrates the space efficiency and update performance of various kinds of 2-hop labeling. The main conclusion is that there is a natural way to trade some index size for update performance in 2-hop labeling.
Keywords: 2-hop, graph indexing, incremental maintenance, reachability test
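For readers unfamiliar with the index being maintained, the 2-hop reachability test itself is simple: each node u stores an out-label and each node v an in-label, and u reaches v iff the two labels share a hop node. A minimal Python sketch, with hand-built toy labels assumed for the path a -> b -> c:

    def reachable(u, v, label_out, label_in):
        # u reaches v iff the out-label of u and the in-label of v
        # share at least one hop node.
        return not label_out[u].isdisjoint(label_in[v])

    # Hand-built labels for the path a -> b -> c, with b as the hop node.
    label_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
    label_in  = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
    print(reachable("a", "c", label_out, label_in))  # True
    print(reachable("c", "a", label_out, label_in))  # False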

XML II

Utility-driven load shedding for xml stream processing BIBAKFull-Text 855-864
  Mingzhu Wei; Elke A. Rundensteiner; Murali Mani
Because of high volumes and unpredictable arrival rates, stream processing systems may not always be able to keep up with the input data streams -- resulting in buffer overflow and uncontrolled loss of data. Load shedding, the prevalent strategy for solving this overflow problem, has so far only been considered for relational stream processing, but not for XML. Shedding applied to XML stream processing brings new opportunities and challenges due to the complex nested nature of XML structures. In this paper, we tackle this unsolved XML shedding problem using a three-pronged approach. First, we develop an XQuery preference model that enables users to specify the relative importance of preserving different subpatterns in the XML result structure. This transforms shedding into the problem of rewriting the user query into shed queries that return approximate query answers whose utility is measured by the given user preference model. Second, we develop a cost model to compare the performance of alternative shed queries. Third, we develop two shedding algorithms, OptShed and FastShed. OptShed is guaranteed to find an optimal solution, but at the cost of exponential complexity. FastShed, as confirmed by our experiments, achieves a close-to-optimal result in a wide range of test cases. Finally, we describe the in-automaton shedding mechanism for XQuery stream engines. The experiments show that our proposed utility-driven shedding solutions consistently achieve higher utility results compared to the existing relational shedding techniques.
Keywords: load shedding, preference model, xml query processing, xml streams
Xml data dissemination using automata on top of structured overlay networks BIBAKFull-Text 865-874
  Iris Miliaraki; Zoi Kaoudi; Manolis Koubarakis
We present a novel approach for filtering XML documents using nondeterministic finite automata and distributed hash tables. Our approach differs architecturally from recent proposals that deal with distributed XML filtering; they assume an XML broker architecture, whereas our solution is built on top of distributed hash tables. The essence of our work is a distributed implementation of YFilter, a state-of-the-art automata-based XML filtering system on top of Chord. We experimentally evaluate our approach and demonstrate that our algorithms can scale to millions of XPath queries under various filtering scenarios, and also exhibit very good load balancing properties.
Keywords: automata, load balancing, structured overlay networks, xml data dissemination
Sewnet: a framework for creating services utilizing telecom functionality BIBAKFull-Text 875-884
  Sumit Mittal; Dipanjan Chakraborty; Sunil Goyal; Sougata Mukherjea
With the Telecom market reaching saturation in many geographies and revenues from voice calls decreasing, Telecom operators are trying to identify new sources of revenue. For this purpose, these operators can take advantage of their core functionalities, like Location and Call Control, by exposing them as services to be composed by developers with third-party offerings available over the Web. To hide the complexity of the underlying Telecom protocols from application developers, the operators are steadily adopting Service Oriented Architecture (SOA) and reference standards like Parlay-X and IMS. However, a number of challenges still remain in the rapid utilization of Telecom functionalities for creating new applications -- the existence of multiple protocols, different classes of developers, and the need to coordinate and manage the usage of these functionalities. In this paper, we present SewNet, a framework for creating applications exploiting Telecom functionality exposed over a (converged) IP network. More specifically, SewNet a) provides an abstraction model for encapsulating invocation, coordination and enrichment of the Telecom functionalities, b) renders a service creation environment on top of this model, and c) caters to different categories of developers. With the help of two use-case scenarios, we demonstrate how SewNet can create services utilizing rich Telecom functionality.
Keywords: abstraction model, service composition, telecom, web 2.0

Industrial track session

Characterizing typical and atypical user sessions in clickstreams BIBAKFull-Text 885-894
  Narayanan Sadagopan; Jie Li
Millions of users retrieve information from the Internet using search engines. Mining these user sessions can provide valuable information about the quality of user experience and the perceived quality of search results. Search engines often rely on accurate estimates of Click Through Rate (CTR) to evaluate the quality of user experience. The vast heterogeneity in the user population and the presence of automated software programs (bots) can result in high variance in the estimates of CTR. To improve the estimation accuracy of user experience metrics like CTR, we argue that it is important to identify typical and atypical user sessions in clickstreams. Our approach to identifying these sessions is based on detecting outliers using the Mahalanobis distance in the user session space. Our user session model incorporates several key clickstream characteristics, including a novel conformance score obtained by Markov Chain analysis. Editorial results show that our approach to identifying typical and atypical sessions has a precision of about 89%. Filtering out these atypical sessions reduces the uncertainty (95% confidence interval) of the mean CTR by about 40%. These results demonstrate that our approach to identifying typical and atypical user sessions is extremely valuable for cleaning "noisy" user session data for increased accuracy in evaluating user experience.
Keywords: clickstream analysis, outlier detection, web search
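The outlier-detection step can be sketched as follows in Python/NumPy (the session features shown are hypothetical stand-ins for the paper's clickstream characteristics): sessions far from the mean profile under the Mahalanobis distance are flagged as atypical.

    import numpy as np

    def mahalanobis_distances(sessions):
        # Distance of each session vector from the mean session profile;
        # large values indicate atypical sessions (e.g. bots).
        mu = sessions.mean(axis=0)
        inv_cov = np.linalg.pinv(np.cov(sessions, rowvar=False))
        diffs = sessions - mu
        return np.sqrt(np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs))

    # Rows: sessions; columns: hypothetical features (clicks, mean dwell).
    X = np.array([[3, 40.0], [4, 35.0], [2, 50.0],
                  [3, 45.0], [5, 38.0], [70, 2.0]])
    d = mahalanobis_distances(X)
    print(d)                       # the bot-like last session dominates
    print(d > d.mean() + d.std())  # one simple cutoff for "atypical"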
Video suggestion and discovery for youtube: taking random walks through the view graph BIBAKFull-Text 895-904
  Shumeet Baluja; Rohan Seth; D. Sivakumar; Yushi Jing; Jay Yagnik; Shankar Kumar; Deepak Ravichandran; Mohamed Aly
The rapid growth of the number of videos in YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present a novel method based upon the analysis of the entire user-video graph to provide personalized video suggestions for users. The resulting algorithm, termed Adsorption, provides a simple method to efficiently propagate preference information through a variety of graphs. We extensively test the results of the recommendations on a three month snapshot of live data from YouTube.
Keywords: collaborative filtering, label propagation, random walks, recommendation systems, video search
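A rough sketch of Adsorption-style label propagation on a toy user-video graph follows (uniform edge weights and a simple re-injection parameter beta are assumptions of this sketch; the paper's algorithm works on the weighted view graph):

    def propagate(graph, inject, iters=20, beta=0.5):
        # Every node averages its neighbours' label distributions; nodes
        # with injected labels blend those back in with weight beta.
        # (The real Adsorption also uses edge weights and abandonment
        # probabilities, omitted here.)
        dist = {n: dict(inject.get(n, {})) for n in graph}
        for _ in range(iters):
            new = {}
            for n, nbrs in graph.items():
                acc = {}
                for m in nbrs:
                    for lab, w in dist[m].items():
                        acc[lab] = acc.get(lab, 0.0) + w / len(nbrs)
                if n in inject:
                    acc = {lab: (1 - beta) * w for lab, w in acc.items()}
                    for lab, w in inject[n].items():
                        acc[lab] = acc.get(lab, 0.0) + beta * w
                new[n] = acc
            dist = new
        return dist

    # Toy user-video view graph; each video injects itself as a label.
    graph = {"u1": ["v1", "v2"], "u2": ["v2", "v3"],
             "v1": ["u1"], "v2": ["u1", "u2"], "v3": ["u2"]}
    inject = {v: {v: 1.0} for v in ("v1", "v2", "v3")}
    scores = propagate(graph, inject)["u1"]
    print(sorted(scores, key=scores.get, reverse=True))
    # u1's own videos rank first, but unwatched v3 also gets mass via u2.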
How people use the web on mobile devices BIBAKFull-Text 905-914
  Yanqing Cui; Virpi Roto
This paper describes a series of user studies on how people use the Web via mobile devices. The data primarily comes from contextual inquiries with 47 participants between 2004 and 2007, and is complemented with a phone log analysis of 577 panelists in 2007. We report four key contextual factors in using the Web on mobile devices and propose a mobile Web activity taxonomy. The framework contains three user activity categories identical to those of previous stationary Web studies -- information seeking, communication, and transaction -- and a new category: personal space extension. The new category refers to the practice of people putting their content on the Web for personal access, thereby extending their personal information space.
Keywords: activity taxonomy, content object handling, information seeking, mobile web, personal space extension
Planetary-scale views on a large instant-messaging network BIBAKFull-Text 915-924
  Jure Leskovec; Eric Horvitz
We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numbers of people, rather than the actions and characteristics of individuals. The dataset contains summary properties of 30 billion conversations among 240 million people. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We report on multiple aspects of the dataset and the synthesized graph. We find that the graph is well connected and robust to node removal. We investigate, on a planetary scale, the oft-cited claim that people are separated by "six degrees of separation," and find that the average path length among Messenger users is 6.6. We find that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than same-gender conversations.
Keywords: communication networks, large data, online communication, social networks, user demographics
Online auctions efficiency: a survey of ebay auctions BIBAKFull-Text 925-934
  Wenyan Hu; Alvaro Bolivar
Online auctions have become a pervasive transaction mechanism for e-commerce. As the largest online marketplace in the world, eBay is an attractive case study that enables the study of online auctions using data involving real people and transactions.
   In this paper, we present a detailed investigation and analysis of multiple online auction properties, including consumer surplus, sniping, bidding strategy and their cross-relationships. Our goal is to evaluate the theoretical foundations of online auctions and discover patterns and behaviors previously hidden by the lack of real and extensive transaction data. Among our findings, we uncover an important correlation between sniping and high surplus ratios, which implies uncertainty about true value in a competitive environment. The key issue is the wrong assumption that bidders' valuations are independent of each other, which leads to inefficient auctions.
   In order to address the inefficiencies of current online formats, we introduce a declining-price auction model customized for online transactions. Conceptually, this model ought to deal with the complexities of competition in an online environment while maximizing social welfare.
Keywords: auction efficiency, auction theory, online auctions

Technology for developing regions

Organizing the unorganized -- employing IT to empower the under-privileged BIBAKFull-Text 935-944
  Arun Kumar; Nitendra Rajput; Sheetal Agarwal; Dipanjan Chakraborty; Amit Anil Nanavati
Various sectors in developing countries are typically dominated by the presence of a large number of small and micro-businesses that operate in an informal, unorganized manner. Many of these are single-person-run micro-businesses that cannot afford to buy and maintain their own IT infrastructure. For others, the easy availability of cheap labour provides a convenient alternative, even though it results in inefficiency, as little or no records are maintained, and only manual, paper-based processes are followed. This results in high response times for customers, no formal accountability and higher charges. For the businesses this translates to lower earnings and losses due to inefficiencies. In this paper, we look at a few such micro-business segments and explore their current models of operation, while identifying existing inefficiencies and pain points. We build upon the findings and propose an approach for delivering the benefits of IT solutions to such micro-business segments. Finally, we present technology that realizes the proposed approach in the specific context of two such segments.
Keywords: developing regions, virtual communities, voice applications
Dtwiki: a disconnection and intermittency tolerant wiki BIBAKFull-Text 945-952
  Bowei Du; Eric A. Brewer
Wikis have proven to be a valuable tool for collaboration and content generation on the web. Simple semantics and ease-of-use make wiki systems well suited for meeting many emerging region needs in the areas of education, collaboration and local content generation. Despite their usefulness, current wiki software does not work well in the network environments found in emerging regions. For example, it is common to have long-lasting network partitions due to cost, power and poor connectivity. Network partitions make a traditional centralized wiki architecture unusable due to the unavailability of the central server. Existing solutions towards addressing connectivity problems include web-caching proxies and snapshot distribution. While proxies and snapshots allow wiki data to be read while disconnected, they prevent users from contributing updates back to the wiki.
   In this paper we detail the design and implementation of DTWiki, a wiki system which explicitly addresses the problem of operating a wiki system in an intermittent environment. The DTWiki system is able to cope with long-lasting partitions and bad connectivity while providing the functionality of popular wiki software such as MediaWiki and TWiki.
Keywords: delay-tolerant networking, developing regions, wiki
Action science approach to nonprofit housing services using web 2.0 mapping tools BIBAKFull-Text 953-958
  Yao-Jen Chang; Hsin-Yu Hsu; Tsen-Yung Wang
The study follows an action science approach to the problem of nonprofit housing services. After 4 months of action science-based activities, such as organized participant observations, in-depth interviews, field work, and focus group studies, the main findings are: (1) Web 2.0 suits nonprofit organizations better than traditional webs in terms of maintenance cost and usability; (2) mapping tools make a better GUI for web-based housing services; and (3) context-aware personalization can translate to better user experiences, as shown by an RDFS-based working prototype that has been built and tested. A user survey shows a high level of user satisfaction. Although the case study was carried out in a nonprofit housing organization, the practices of the action research approach can be applied to other NPOs as well.
Keywords: RDFS, action research, context-aware services, mapping tools, web 2.0

WWW in China: mining the Chinese web

Hidden sentiment association in Chinese web opinion mining BIBAKFull-Text 959-968
  Qi Su; Xinying Xu; Honglei Guo; Zhili Guo; Xian Wu; Xiaoxun Zhang; Bin Swen; Zhong Su
The boom of product review websites, blogs and forums on the web has attracted many research efforts on opinion mining. Recently, there has been growing interest in finer-grained opinion mining, which detects opinions on different review features as opposed to the whole review level. Research on feature-level opinion mining mainly relies on identifying the explicit relatedness between product feature words and opinion words in reviews. However, the sentiment relatedness between the two objects is usually complicated. In many cases, product feature words are implied by the opinion words in reviews. The detection of such hidden sentiment associations is still a big challenge in opinion mining. In particular, feature-level opinion mining is an even harder task on Chinese reviews due to the nature of the Chinese language. In this paper, we propose a novel mutual reinforcement approach to deal with the feature-level opinion mining problem. More specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information; 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. Moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. Based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for cases without the explicit appearance of product feature words in reviews. Thus it provides a more accurate opinion evaluation. The experimental results demonstrate that our method outperforms state-of-the-art algorithms.
Keywords: association, mutual reinforcement, opinion mining, opinion word, product feature
Can Chinese web pages be classified with English data source? BIBAKFull-Text 969-978
  Xiao Ling; Gui-Rong Xue; Wenyuan Dai; Yun Jiang; Qiang Yang; Yong Yu
As the World Wide Web in China grows rapidly, mining knowledge in Chinese Web pages becomes more and more important. Mining Web information usually relies on machine learning techniques, which require a large amount of labeled data to train credible models. Although the number of Chinese Web pages is increasing quickly, labeled Chinese data are still scarce. However, there are relatively sufficient labeled English Web pages. These labeled data, though in a different linguistic representation, share a substantial amount of semantic information with Chinese ones, and can be utilized to help classify Chinese Web pages. In this paper, we propose an information bottleneck based approach to address this cross-language classification problem. Our algorithm first translates all the Chinese Web pages to English. Then, all the Web pages, both Chinese and English ones, are encoded through an information bottleneck which allows only limited information to pass. Therefore, in order to retain as much useful information as possible, the common part between Chinese and English Web pages tends to be encoded to the same code (i.e. class label), which makes the cross-language classification accurate. We evaluated our approach using Web pages collected from the Open Directory Project (ODP). The experimental results show that our method significantly improves several existing supervised and semi-supervised classifiers.
Keywords: cross-language classification, information bottleneck
Substructure similarity measurement in Chinese recipes BIBAKFull-Text 979-988
  Liping Wang; Qing Li; Na Li; Guozhu Dong; Yu Yang
Improving the precision of information retrieval has been a challenging issue on the Chinese Web. As exemplified by Chinese recipes on the Web, it is not easy or natural for people to use keywords (e.g. recipe names) to search for recipes, since the names can be so abstract that they bear little, if any, information on the underlying ingredients or cooking methods. In this paper, we investigate the underlying features of Chinese recipes, and based on workflow-like cooking procedures, we model recipes as graphs. We further propose a novel similarity measurement based on frequent patterns, and devise an effective filtering algorithm to prune unrelated data so as to support efficient on-line searching. Benefiting from the characteristics of graphs, frequent common patterns can be mined from a cooking graph database. So in our prototype system, called RecipeView, we extend the subgraph mining algorithm FSG to cooking graphs and combine it with our proposed similarity measurement, resulting in an approach that caters well for specific users' needs. Our initial experimental studies show that the filtering algorithm can efficiently prune unrelated cooking graphs without affecting the retrieval performance, and that the similarity measurement achieves relatively higher precision/recall than its counterparts.
Keywords: cooking graph, filtering, recipes, similarity measurement, subgraph mining

WWW in China: Chinese web innovations

Efficient multi-keyword search over p2p web BIBAKFull-Text 989-998
  Hanhua Chen; Hai Jin; Jiliang Wang; Lei Chen; Yunhao Liu; Lionel M. Ni
Current search mechanisms of DHT-based P2P systems handle single-keyword search well. Beyond single-keyword search, multi-keyword search is quite popular and useful in many real applications. Simply reusing the solution for single-keyword search would require distributed intersection/union operations in wide-area networks, leading to unacceptable traffic costs. Since the Bloom Filter (BF) is well known to be effective in reducing traffic, we would like to use BF encoding to handle multi-keyword search.
   Applying BF is not difficult, but obtaining optimal results is not trivial. In this study we show, through mathematical proof, that the optimal setting of the BF in terms of traffic cost is determined by the global statistical information of keywords, not by the minimized false positive rate as claimed by previous methods. Through extensive experiments, we demonstrate how to obtain optimal settings. We further argue that the intersection order between sets is important for multi-keyword search. Thus, we design optimal order strategies based on BF for both "and" and "or" queries. To better evaluate the performance of this design, we conduct extensive simulations on the TREC WT10G test collection and the query log of a commercial search engine. Results show that our design reduces the search traffic of the existing approach by 73%.
Keywords: DHT, P2P, bloom filter, multi-keyword search
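The core BF trick for an "and" query can be sketched as follows: the node holding the posting list for one keyword ships only a Bloom filter of its document ids, and the remote node intersects locally, tolerating false positives. A minimal Python sketch (filter size m and hash count k are arbitrary here; choosing them from global keyword statistics is the paper's contribution and is not shown):

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def __contains__(self, item):
            return all((self.bits >> pos) & 1 for pos in self._positions(item))

    # Node A holds doc ids for "web"; node B holds doc ids for "search".
    docs_web, docs_search = {"d1", "d2", "d5", "d9"}, {"d2", "d7", "d9"}
    bf = BloomFilter()
    for d in docs_web:
        bf.add(d)
    # A ships only bf (m bits) instead of its full posting list; B then
    # computes the candidate intersection locally. False positives are
    # possible, false negatives are not.
    print({d for d in docs_search if d in bf})  # {'d2', 'd9'} (+ rare FPs)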
Towards a global schema for web entities BIBAKFull-Text 999-1008
  Conglei Yao; Yongjian Yu; Sicong Shou; Xiaoming Li
Popular entities often have thousands of instances on the Web. In this paper, we focus on the case where they are presented in a table-like format, namely appearing with their attribute names. It is observed that, on the one hand, for the same entity, different web pages often incorporate different attributes; on the other hand, for the same attribute, different web pages often use different attribute names (labels). Therefore, it is imaginably difficult to produce a global attribute schema for all the web entities of a given entity type based on their web instances, although a global attribute schema is usually highly desired in web entity instance integration and web object extraction. To this end, we propose a novel framework for automatically learning a global attribute schema for all web entities of one specific entity type. Under this framework, an iterative instance extraction procedure is first employed to extract sufficient web entity instances to discover enough attribute labels. Next, based on the labels, entity instances, and related web pages, a maximum entropy-based schema discovery approach is adopted to learn the global attribute schema for the target entity type. Experimental results on the Chinese Web achieve weighted average F-scores of 0.7122 and 0.7123 on two global attribute schemas, for person-type and movie-type web entities respectively. These results show that our framework is general, efficient and effective.
Keywords: attribute labels, entity instances, global attribute schema, web entities
Web video topic discovery and tracking via bipartite graph reinforcement model BIBAKFull-Text 1009-1018
  Lu Liu; Lifeng Sun; Yong Rui; Yao Shi; Shiqiang Yang
Automatic topic discovery and tracking on web-shared videos can greatly benefit both web service providers and end users. Most current solutions for topic detection and tracking were developed for news and cannot be directly applied to web videos, because the semantic information of web videos is much sparser than that of news videos. In this paper, we propose a bipartite graph model to address this issue. The bipartite graph represents the correlation between web videos and their keywords, and automatic topic discovery is achieved through two steps -- coarse topic filtering and fine topic re-ranking. First, a weight-updating co-clustering algorithm is employed to filter out topic candidates at a coarse level. Then the videos on each topic are re-ranked by analyzing the link structures of the corresponding bipartite graph. After the topics are discovered, interesting ones can also be tracked over a period of time using the same bipartite graph model. The key is to propagate the relevance scores and keywords from the videos of interest to other relevant ones through the bipartite graph links. Experimental results on real web videos from YouKu, a YouTube counterpart in China, demonstrate the effectiveness of the proposed methods. We report very promising results.
Keywords: bipartite graph model, co-clustering, reinforcement, topic discovery, topic tracking, web videos

Posters

The seamless browser: enhancing the speed of web browsing by zooming and preview thumbnails BIBAKFull-Text 1019-1020
  ByungIn Yoo; JongHo Lea; YeunBae Kim
In this paper, we present a new web browsing system, the Seamless Browser, for fast link traversal on a large screen such as a TV. In navigating the web, users mainly suffer from the cognitive overhead of determining whether or not to follow links. This overhead can be reduced by providing preview information about the destination of links, and also by providing semantic cues about the nearest location in relation to the anchor. In order to reduce disorientation and annoyance from the preview information, we propose that users focus on a small area around a pointer, and that a small number of hyperlink previews in that focused area appear differently depending on the distances between the pointer and the hyperlinks: the nearer the distance, the richer the content of the information scent. We also propose that users can navigate link paths by controlling the pointer and the zooming interface, so that they may go backward and forward seamlessly along several possible link paths. We found that combining the pointer and a zoom significantly improved performance on navigational tasks.
Keywords: preview thumbnails, seamless browser, zoomable interaction
What do they think?: aggregating local views about news events and topics BIBAKFull-Text 1021-1022
  Jiahui Liu; Larry Birnbaum
The web has become an important medium for news delivery and consumption. Fresh content about a variety of topics and events is constantly being created and published on the web by many sources. As intuitively understood by readers, and studied in journalism, news articles produced by different social groups present different attitudes towards and interpretations of the same news issues. In this paper, we propose a new paradigm for aggregating news articles according to the news sources related to the stakeholders of the news issues. We implement this paradigm in a prototype system called LocalSavvy. The system provides users with the capability to aggregate and browse various local views about the news issues in which they are interested.
Keywords: local opinion, news aggregation
Personalized view-based search and visualization as a means for deep/semantic web data access BIBAKFull-Text 1023-1024
  Michal Tvarozek; Maria Bielikova
Effective access to and navigation in information stored in deep Web ontological repositories or relational databases has yet to be realized due to issues with the usability of user interfaces, the overall scope and complexity of the information, and the nature of exploratory user tasks. We propose the integration and adaptation of novel navigation and visualization approaches to faceted browsing, such as visual depiction of facets and restrictions, visual navigation in (clusters of) search results, and graph-like exploration of individual search results' properties.
Keywords: graph visualization, navigation, personalized faceted browsing, view-based search, visualization
Personalized multimedia web summarizer for tourist BIBAKFull-Text 1025-1026
  Xiao Wu; Jintao Li; Yongdong Zhang; Sheng Tang; Shi-Yong Neo
In this paper, we highlight the use of multimedia technology in generating intrinsic summaries of tourism-related information. The system utilizes an automated process to gather, filter and classify information on various tourist spots on the Web. The end result presented to the user is a personalized multimedia summary, generated with respect to the user's queries, filled with text, images, video and real-time news, and made retrievable on mobile devices. Preliminary experiments demonstrate the superiority of our presentation scheme over traditional methods.
Keywords: classification, summarization, video and image analysis
A generic framework for collaborative multi-perspective ontology acquisition BIBAKFull-Text 1027-1028
  Maayan Zhitomirsky-Geffet; Judit Bar-Ilan; Yitzchak Miller; Snunith Shoham
The research objective of this work is to develop a general framework that incorporates collaborative social tagging with a novel ontology scheme conveying multiple perspectives. We propose a framework where multiple users tag the same object (an image in our case), and an ontology is extended based on these tags while being tolerant about different points of view. We are not aware of any other work that attempted to devise such an environment and to study its dynamics. The proposed framework characterizes the underlying processes for controlled collaborative development of a multi-perspective ontology and its application to improve image annotation, searching and browsing. Our case study experiment with a set of selected annotated images indicates the soundness of the proposed ontological model.
Keywords: collaborative multi-perspective ontology
As we may perceive: finding the boundaries of compound documents on the web BIBAKFull-Text 1029-1030
  Pavel Dmitriev
This paper considers the problem of identifying compound documents (cDocs) on the Web -- groups of web pages that in aggregate constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several html pages, or a set of pages describing the specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications, including web and intranet search, user navigation, automated collection generation, and information extraction.
   In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Building on our previous work [4], this paper proposes a different scenario for discovering cDocs, and compares in this new setting the local machine-learned clustering algorithm from [4] with a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes.
Keywords: compound documents, machine learning, www
Personalized search and exploration with mytag BIBKFull-Text 1031-1032
  Max Braun; Klaas Dellschaft; Thomas Franz; Dominik Hering; Peter Jungen; Hagen Metzler; Eugen Müller; Alexander Rostilov; Carsten Saathoff
Keywords: personomy, searching, tagging
Emergence of terminological conventions as an author-searcher coordination game BIBAKFull-Text 1033-1034
  David Bodoff; Sheizaf Rafaeli
All information exchange on the Internet -- whether through full text, controlled vocabularies, ontologies, or other mechanisms -- ultimately requires that an information provider and seeker use the same word or symbol. In this paper, we investigate what happens when both searchers and authors are dynamically choosing terms to match the other side. With each side trying to anticipate the other, does a terminological convention ever emerge, or do searchers and providers continue to miss potential partners through a mismatch of terms? We use a game-theoretic setup to frame the questions, and learning theory to make predictions about whether and which term will emerge as a convention.
Keywords: game theory, keyword advertising, learning, manual indexing
System II: a hypergraph based native rdf repository BIBAKFull-Text 1035-1036
  Gang Wu; Juanzi Li; Kehong Wang
To manage the increasing amount of RDF data, an RDF repository should provide not only the necessary scalability and efficiency, but also sufficient inference capabilities. In this paper, we propose a native RDF repository, System II, to pursue a better tradeoff among the above requirements. System II takes the hypergraph representation of RDF as the data model for its persistent storage, which effectively avoids the cost of data model transformation when accessing RDF data. In addition, a set of efficient semantic query processing techniques is designed. The results of a performance evaluation on the LUBM benchmark show that System II has a better combined metric value than other comparable systems.
Keywords: hypergraph, rdf, repository
Efficient vectorial operators for processing xml twig queries BIBAKFull-Text 1037-1038
  Guoliang Li; Jianhua Feng; Jianyong Wang; Feng Lin; Lizhu Zhou
This paper proposes several vectorial operators for processing XML twig queries, which are easy to perform and inherently efficient for both Ancestor-Descendant (A-D) and Parent-Child (P-C) relationships. We develop optimizations on the vectorial operators to improve the efficiency of answering twig queries holistically. We also propose an algorithm to answer GTP queries based on our vectorial operators.
Keywords: holistic twig join, subsequence match, vectorial operators
User behavior oriented web spam detection BIBAKFull-Text 1039-1040
  Yiqun Liu; Min Zhang; Shaoping Ma; Liyun Ru
Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific, known types of Web spam and are ineffective and inefficient against newly appearing spam. By analyzing user behavior in Web access logs, we propose a spam page detection algorithm based on Bayes learning. Preliminary experiments on Web access data collected by a commercial Web site (containing over 2.74 billion user clicks in 2 months) show the effectiveness of the proposed detection framework and algorithm.
Keywords: spam detection, user behavior analysis, web search engine
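The Bayes-learning step can be illustrated with a naive Bayes classifier over binary user-behaviour features (the features and training examples below are invented for illustration; the paper derives its own features from Web access logs):

    import math
    from collections import defaultdict

    def train_nb(sessions, labels):
        # Count class frequencies and per-class feature frequencies.
        n_y = defaultdict(int)
        n_fy = defaultdict(lambda: defaultdict(int))
        for feats, y in zip(sessions, labels):
            n_y[y] += 1
            for f in feats:
                n_fy[y][f] += 1
        return n_y, n_fy, len(labels)

    def classify(feats, n_y, n_fy, n):
        # Pick the class maximizing log P(y) + sum log P(f|y), with
        # Laplace smoothing so unseen features do not zero out a class.
        post = {}
        for y, cy in n_y.items():
            logp = math.log(cy / n)
            for f in feats:
                logp += math.log((n_fy[y][f] + 1) / (cy + 2))
            post[y] = logp
        return max(post, key=post.get)

    data = [({"direct_url", "no_query"}, "spam"),
            ({"query_issued", "result_clicked"}, "normal"),
            ({"direct_url", "short_dwell"}, "spam"),
            ({"query_issued", "long_dwell"}, "normal")]
    n_y, n_fy, n = train_nb([f for f, _ in data], [y for _, y in data])
    print(classify({"direct_url", "short_dwell"}, n_y, n_fy, n))  # 'spam'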
Feature weighting in content based recommendation system using social network analysis BIBAKFull-Text 1041-1042
  Souvik Debnath; Niloy Ganguly; Pabitra Mitra
We propose a hybridization of collaborative filtering and content-based recommendation. Attributes used for content-based recommendation are assigned weights depending on their importance to users. The weight values are estimated from a set of linear regression equations obtained from a social network graph which captures human judgment about the similarity of items.
Keywords: feature similarity, recommender system, social network
Extracting spam blogs with co-citation clusters BIBAKFull-Text 1043-1044
  Kazunari Ishida
This paper reports the estimated number of spam blogs in order to assess their current state in the blogosphere. To extract spam blogs, I developed a method that traverses co-citation clusters of blogs from a spam seed. Spam seeds were collected based on high out-degree and spam keywords. According to the experiment, a mixed seed set composed of high out-degree and spam keyword seeds is more effective than the individual seed sets in terms of F-Measure. In conclusion, mixing seeds from different methods is effective in improving the F-Measure results of spam extraction with co-citation clusters.
Keywords: advertisement link, co-citation cluster, spam blog extraction
Race: finding and ranking compact connected trees for keyword proximity search over xml documents BIBAKFull-Text 1045-1046
  Guoliang Li; Jianhua Feng; Jianyong Wang; Bei Yu; Yukai He
In this paper, we study the problem of keyword proximity search over XML documents, addressing both efficiency and effectiveness. We take the disjunctive semantics among input keywords into consideration and identify meaningful compact connected trees as the answers to keyword proximity queries. We introduce the notions of Compact Lowest Common Ancestor (CLCA) and Maximal CLCA (MCLCA), and propose Compact Connected Trees (CCTrees) and Maximal CCTrees (MCCTrees) to efficiently and effectively answer keyword queries. We propose a novel ranking mechanism, RACE, to Rank compAct Connected trEes, taking into consideration both structural similarity and textual similarity. Our extensive experimental study shows that our method achieves both high search efficiency and effectiveness, and outperforms existing approaches significantly.
Keywords: compact lca (clca), lowest common ancestor (lca), maximal clca (mclca)
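CLCAs and MCLCAs build on the standard lowest common ancestor of keyword-match nodes. With Dewey IDs, which are commonly used in XML keyword search, the LCA of two nodes is simply the longest common prefix of their labels; a minimal sketch of that building block (not the paper's CLCA computation itself):

    def lca_dewey(a, b):
        # LCA of two nodes given as Dewey IDs: the longest common prefix
        # of their label paths.
        prefix = []
        for x, y in zip(a.split("."), b.split(".")):
            if x != y:
                break
            prefix.append(x)
        return ".".join(prefix)

    # Two keyword matches under the same section share that section as LCA.
    print(lca_dewey("1.2.3.1", "1.2.5"))  # '1.2'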
The scale-free nature of semantic web ontology BIBAKFull-Text 1047-1048
  Hongyu Zhang
Semantic web ontology languages, such as OWL, have been widely used for knowledge representation. Through empirical analysis of real-world ontologies, we discover that, like many natural and social phenomena, the semantic web ontology is also "scale-free".
Keywords: ontology, power-law, scale-free, semantic web
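A crude way to check such a claim on one's own data is to fit a line to the degree distribution on log-log axes; a roughly straight line with negative slope is consistent with a power law (a back-of-the-envelope Python sketch with toy data; rigorous power-law fitting would use maximum likelihood):

    import math
    from collections import Counter

    def loglog_slope(degrees):
        # Least-squares slope of log(frequency) versus log(degree).
        freq = Counter(degrees)
        pts = [(math.log(d), math.log(c)) for d, c in freq.items() if d > 0]
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        num = sum((x - mx) * (y - my) for x, y in pts)
        den = sum((x - mx) ** 2 for x, _ in pts)
        return num / den

    # Toy term-degree sample: many rare terms, a few hubs.
    degrees = [1] * 60 + [2] * 18 + [3] * 8 + [4] * 5 + [8] * 2 + [20]
    print(loglog_slope(degrees))  # clearly negative for heavy-tailed data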
Asymmetrical query recommendation method based on bipartite network resource allocation BIBAKFull-Text 1049-1050
  Zhiyuan Liu; Maosong Sun
This paper presents a new query recommendation method that generates a recommended query list by mining large-scale user logs. Starting from the click-through data in user logs, we construct a bipartite network where the nodes on one side correspond to unique queries and on the other side to unique URLs. Inspired by the bipartite network based resource allocation method, we try to extract the hidden information from the query-URL bipartite network. The recommended queries generated by the method are asymmetrical, which means two related queries may have different strengths when recommending each other. To evaluate the method, we use one week of user logs from the Chinese search engine Sogou. The method is not only 'content ignorant', but can also be easily implemented in a parallel manner, which makes it feasible for commercial search engines to handle large-scale user logs.
Keywords: asymmetrical query recommendation, bipartite network, network resource allocation, user log analysis
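The two-step resource-allocation process on a query-URL bipartite network can be sketched as follows (a minimal Python reading of bipartite resource allocation with a toy click-through graph; the paper's exact weighting may differ):

    from collections import defaultdict

    def recommend(query_urls, source, top=5):
        # Step 0: invert the query -> URL click graph.
        url_queries = defaultdict(list)
        for q, urls in query_urls.items():
            for u in urls:
                url_queries[u].append(q)
        # Step 1: unit resource at the source query spreads evenly to
        # its URLs. Step 2: each URL spreads its share evenly back to
        # all queries that clicked it.
        scores = defaultdict(float)
        share = 1.0 / len(query_urls[source])
        for u in query_urls[source]:
            for q in url_queries[u]:
                scores[q] += share / len(url_queries[u])
        scores.pop(source, None)          # don't recommend the query itself
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

    clicks = {"mp3": ["a.com", "b.com"],
              "music": ["a.com"],
              "video": ["b.com", "c.com"]}
    print(recommend(clicks, "mp3"))    # [('music', 0.25), ('video', 0.25)]
    print(recommend(clicks, "music"))  # [('mp3', 0.5)] -- note the asymmetry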
Efficient mining of frequent sequence generators BIBAKFull-Text 1051-1052
  Chuancong Gao; Jianyong Wang; Yukai He; Lizhu Zhou
Sequential pattern mining has attracted great interest in the data mining research field in recent years. However, to the best of our knowledge, no existing work studies the problem of frequent sequence generator mining. In this paper we present a novel algorithm, FEAT (abbr. Frequent sEquence generATor miner), to perform this task. Experimental results show that FEAT is more efficient than traditional sequential pattern mining algorithms while generating a more concise result set, and is very effective for classifying Web product reviews.
Keywords: sequence, sequence generators, web mining
Efficiently querying rdf data in triple stores BIBAKFull-Text 1053-1054
  Ying Yan; Chen Wang; Aoying Zhou; Weining Qian; Li Ma; Yue Pan
Efficiently querying RDF data is an important factor in applying Semantic Web technologies to real-world applications. In this context, many efforts have been made to store and query RDF data in relational databases using particular schemas. In this paper, we propose a new scheme to store, index, and query RDF data in triple stores. The graph features of RDF data are taken into consideration, which can help reduce the join costs of the vertical database structure. We partition RDF triples into overlapping groups, store them in a triple table with one additional column for group identity, and build a signature tree to index them. Based on this infrastructure, a complex RDF query is decomposed into multiple sub-queries which can be easily filtered to a few RDF groups using the signature tree index, and is finally evaluated with a composed and optimized SQL statement with specific constraints. We compare the performance of our method with prior art on typical queries over large-scale LUBM and UOBM benchmark data (more than 10 million triples). For some extreme cases, the improvement reaches 3 to 4 orders of magnitude.
Keywords: graph partitioning, indexing, rdf, signature
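The signature-based filtering idea can be illustrated with superimposed coding: each group's signature ORs together one hash bit per contained predicate, and a group can be safely pruned whenever its signature fails to cover the query's. A sketch (hash choice and signature width are arbitrary assumptions here, and Python's built-in hash is randomized across runs):

    def signature(items, bits=64):
        # Superimposed coding: OR one hash bit per item into the signature.
        sig = 0
        for item in items:
            sig |= 1 << (hash(item) % bits)
        return sig

    def may_contain(group_sig, query_sig):
        # Necessary condition only: a failed test safely prunes the group,
        # a passed test may still be a false positive.
        return (query_sig & group_sig) == query_sig

    groups = {"g1": {"teaches", "advises"},
              "g2": {"enrolledIn", "takesCourse"}}
    sigs = {g: signature(preds) for g, preds in groups.items()}
    qsig = signature({"takesCourse"})
    print([g for g, s in sigs.items() if may_contain(s, qsig)])  # likely ['g2']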
Collaborative knowledge semantic graph image search BIBAKFull-Text 1055-1056
  Jyh-Ren Shieh; Yang-Ting Yeh; Chih-Hung Lin; Ching-Yung Lin; Ja-Ling Wu
In this paper, we propose a Collaborative Knowledge Semantic Graph Image Search (CKSGIS) system. It provides a novel way to conduct image search by utilizing the collaborative nature of Wikipedia and by performing network analysis to form semantic graphs for search-term expansion. The collaborative article editing process used by Wikipedia's contributors is formalized as bipartite graphs that are folded into networks between terms. When a user types in a search term, CKSGIS automatically retrieves an interactive semantic graph of related terms that allows users to easily find related images not limited to the specific search term. The interactive semantic graph then serves as an interface to retrieve images through existing commercial search engines. This method significantly saves users' time by avoiding the multiple search keywords usually required in generic search engines. It benefits both naïve users who do not possess a large vocabulary and professionals who look for images on a regular basis. In our experiments, 85% of the participants favored the CKSGIS system over commercial search engines.
Keywords: keyword expansion, re-ranking, social network
A logical framework for modeling and reasoning about semantic web services contract BIBAKFull-Text 1057-1058
  Hai Liu; Qing Li; Naijie Gu; An Liu
In this paper, we incorporate concrete domains and action theory into a very expressive Description Logic (DL), called ALCQO. Notably, this extension can significantly augment the expressive power for modeling and reasoning about the dynamic aspects of services contracting. Meanwhile, the original nature and advantages of classical DLs are preserved to the extent possible.
Keywords: dls, semantic web services, services contract
Dissemination of heterogeneous xml data BIBAKFull-Text 1059-1060
  Yuan Ni; Chee-Yong Chan
A lot of recent research has focused on the content-based dissemination of XML data. However, due to the heterogeneous data schemas used by different data publishers, even for data in the same domain, an important challenge is how to efficiently and effectively disseminate relevant data to subscribers whose subscriptions might be specified based on schemas that are different from those used by the data publishers. This paper examines the options for resolving this schema heterogeneity problem in XML data dissemination, and proposes a novel paradigm based on data rewriting. Our experimental results demonstrate the effectiveness of the data rewriting paradigm and identify the tradeoffs of the various approaches.
Keywords: data rewriting, dissemination, heterogeneous, xml
Sailer: an effective search engine for unified retrieval of heterogeneous xml and web documents BIBAKFull-Text 1061-1062
  Guoliang Li; Jianhua Feng; Jianyong Wang; Xiaoming Song; Lizhu Zhou
This paper studies the problem of unified ranked retrieval of heterogeneous XML documents and Web data. We propose an effective search engine called Sailer to adaptively and versatilely answer keyword queries over the heterogeneous data. We model Web pages and XML documents as graphs. We propose the concept of pivotal trees to effectively answer keyword queries and present an effective method to identify the top-k pivotal trees with the highest ranks from the graphs. Moreover, we propose effective indexes to facilitate unified ranked retrieval. We have conducted an extensive experimental study using real datasets, and the experimental results show that Sailer achieves both high search efficiency and accuracy, and outperforms existing approaches significantly.
Keywords: keyword search, unified keyword search, web pages, xml
Personalized tag suggestion for flickr BIBAKFull-Text 1063-1064
  Nikhil Garg; Ingmar Weber
We present a system for personalized tag suggestion for Flickr: While the user is entering/selecting new tags for a particular picture, the system is suggesting related tags to her, based on the tags that she or other people have used in the past along with (some of) the tags already entered. The suggested tags are dynamically updated with every additional tag entered/selected. We describe three algorithms which can be applied to this problem. In experiments, our best-performing method yields an improvement in precision of 10-15% over a baseline method very similar to the system currently used by Flickr. Our system is accessible at http://ltaa5.epfl.ch/flickr-tags/.
   To the best of our knowledge, this is the first study on tag suggestion in a setting where (i) no full text information is available, such as for blogs, (ii) no item has been tagged by more than one person, such as for social bookmarking sites, and (iii) suggestions are dynamically updated, requiring efficient yet effective algorithms.
Keywords: experiments, flickr, tag suggestions, tagging systems
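A generic baseline for such dynamic suggestion (not necessarily one of the paper's three algorithms) scores candidate tags by their co-occurrence with the tags entered so far and re-ranks after every new tag:

    from collections import defaultdict

    def build_cooccurrence(tagged_photos):
        cooc = defaultdict(lambda: defaultdict(int))
        for tags in tagged_photos:
            for t in tags:
                for s in tags:
                    if s != t:
                        cooc[t][s] += 1
        return cooc

    def suggest(entered, cooc, top=3):
        # Score candidates by total co-occurrence with the tags entered
        # so far; call again after each new tag for dynamic updates.
        scores = defaultdict(int)
        for t in entered:
            for s, c in cooc[t].items():
                if s not in entered:
                    scores[s] += c
        return sorted(scores, key=scores.get, reverse=True)[:top]

    history = [{"beach", "sea", "sunset"}, {"beach", "sand", "sea"},
               {"sunset", "sky"}]
    cooc = build_cooccurrence(history)
    print(suggest({"beach"}, cooc))         # e.g. ['sea', 'sunset', 'sand']
    print(suggest({"beach", "sea"}, cooc))  # updates as more tags come in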
Larger is better: seed selection in link-based anti-spamming algorithms BIBAKFull-Text 1065-1066
  Qiancheng Jiang; Lei Zhang; Yizhen Zhu; Yan Zhang
Seed selection is of significant importance for biased PageRank algorithms such as TrustRank that combat link spamming. Previous work usually uses a small seed set, which has the drawback that the top-ranking results are strongly biased towards the seeds. In this paper, we analyze the relationship between this bias and the number of seeds. Furthermore, we experimentally show that an automatically selected large seed set can work better than a carefully selected small seed set.
Keywords: biased pagerank, link spamming, seed selection
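The seed bias analyzed above can be made concrete with the standard biased-PageRank iteration, in which the teleport vector is uniform over the seed set. Below is a minimal Python sketch; the toy graph and seed set are hypothetical inputs, not the poster's data.

  # Biased (personalized) PageRank: teleport only to seed pages.
  def biased_pagerank(graph, seeds, damping=0.85, iters=50):
      teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in graph}
      rank = dict(teleport)
      for _ in range(iters):
          new = {p: (1 - damping) * teleport[p] for p in graph}
          for p, out in graph.items():
              if out:
                  share = damping * rank[p] / len(out)
                  for q in out:
                      new[q] += share
          rank = new
      return rank

  graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
  print(biased_pagerank(graph, seeds={"a"}))

With a single seed, rank mass concentrates around the seed's link neighborhood; enlarging the seed set spreads the teleport mass and weakens this bias, which is the effect the poster quantifies.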
Using subspace analysis for event detection from web click-through data BIBAKFull-Text 1067-1068
  Ling Chen; Yiqun Hu; Wolfgang Nejdl
Although most existing research detects events by analyzing the content or structural information of Web documents, a recent direction is to study usage data. In this paper, we focus on detecting events from Web click-through data generated by Web search engines. We propose a novel approach that effectively detects events from click-through data based on robust subspace analysis. We first transform click-through data to the 2D polar space. Next, an algorithm based on Generalized Principal Component Analysis (GPCA) is used to estimate subspaces of the transformed data such that each subspace contains query sessions of similar topics. Then, we prune uninteresting subspaces which do not contain query sessions corresponding to real events, by considering both the semantic certainty and the temporal certainty of query sessions in each subspace. Finally, various events are detected from the interesting subspaces using a nonparametric clustering technique. Our experimental results on real-life click-through data show that, compared with existing approaches, the proposed approach is more accurate in detecting real events and more effective in determining the number of events.
Keywords: click-through data, event detection, gpca, subspace estimation
A domain-specific language for the model-driven construction of advanced web-based dialogs BIBAKFull-Text 1069-1070
  Patrick Freudenstein; Martin Nussbaumer; Florian Allerding; Martin Gaedke
Complex dialogs with comprehensive underlying data models are gaining increasing importance in today's Web applications. This in turn heightens the need for highly dynamic dialogs that offer guidance to users and thus reduce cognitive overload. Beyond that, requirements from the fields of aesthetics, Web accessibility, platform-independence, and Web service integration arise. To this end, we present an evolutionary, extensible approach for the model-driven construction of advanced dialogs. It is based on a Domain-specific Language (DSL) focusing on simplicity and fostering collaboration with stakeholders.
Keywords: dsl, model-driven, user interaction, web engineering
Web people search: results of the first evaluation and the plan for the second BIBAKFull-Text 1071-1072
  Javier Artiles; Satoshi Sekine; Julio Gonzalo
This paper presents the motivation, resources and results for the first Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. We also describe a survey and a proposal for a new task, "attribute extraction", which is planned for inclusion in the second evaluation in autumn 2008.
Keywords: attributes of people, disambiguation, information extraction, person names
Context-based page unit recommendation for web-based sensemaking tasks BIBAKFull-Text 1073-1074
  Wen-Huang Cheng; David Gotz
Sensemaking tasks require users to perform complex research behaviors to gather and comprehend information from many sources. Such tasks are common and include, for example, researching vacation destinations or deciding how to invest. In this paper, we present an algorithm and interface that provides context-based page unit recommendation to assist in connection discovery during sensemaking tasks. We exploit the natural note-taking activity common to sensemaking behavior as the basis for a task-specific context model. Each web page visited by a user is dynamically analyzed to determine the most relevant content fragments which are then recommended to the user. Our initial evaluations indicate that our approach improves user performance.
Keywords: recommendation, search, sensemaking, www
Web page rank prediction with markov models BIBAKFull-Text 1075-1076
  Michalis Vazirgiannis; Dimitris Drosos; Pierre Senellart; Akrivi Vlachou
In this paper we propose a method for predicting the ranking position of a Web page. Assuming a set of successive past top-k rankings, we study the evolution of Web pages in terms of ranking trend sequences, which are used to train Markov models that in turn predict future rankings. The predictions are highly accurate for all experimental setups and similarity measures.
Keywords: markov models, ranking prediction
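As a rough illustration of the idea (not the authors' exact model), each page's past rank movements can be encoded as symbols such as U(p), D(own) and S(table), a first-order Markov chain trained over these trend sequences, and the most probable next movement read off the learned transition probabilities:

  from collections import Counter, defaultdict

  # First-order Markov chain over rank-trend symbols.
  def train(trend_sequences):
      counts = defaultdict(Counter)
      for seq in trend_sequences:
          for prev, cur in zip(seq, seq[1:]):
              counts[prev][cur] += 1
      return {s: {t: c / sum(ctr.values()) for t, c in ctr.items()}
              for s, ctr in counts.items()}

  def predict_next(model, last_symbol):
      return max(model[last_symbol], key=model[last_symbol].get)

  model = train(["UUDSU", "UDSSU", "SUUDS"])   # hypothetical trend strings
  print(predict_next(model, "U"))   # most likely movement after an upward move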
Ajax for mobility: mobileweaver ajax framework BIBAKFull-Text 1077-1078
  Louenas Hamdi; Huaigu Wu; Serhan Dagtas; Abdel Benharref
This paper provides a brief description of a research project on using Ajax to enhance mobile Web applications for enterprise use. The project, known as MobileWeaver Ajax Framework, leverages enterprise SOA (Service Oriented Architecture) and the latest web technologies on mobile devices.
Keywords: ajax, enterprise soa, mobility, rest, soap, web 2.0
Cm-pmi: improved web-based association measure with contextual label matching BIBAKFull-Text 1079-1080
  Xiaojun Wan
WebPMI is a popular web-based association measure that evaluates the semantic similarity between two queries (i.e. words or entities) by leveraging search results returned by search engines. This paper proposes a novel measure named CM-PMI to evaluate query similarity at a finer granularity than WebPMI, under the assumption that a query is usually associated with more than one aspect, and two queries are deemed semantically related if their associated aspect sets are highly consistent with each other. CM-PMI first extracts contextual labels from search results to represent the aspects of a query, and then uses the optimal matching method to assess the consistency between the aspects of two queries. Experimental results on the benchmark Miller-Charles dataset demonstrate the effectiveness of the proposed CM-PMI measure. Moreover, we fuse WebPMI and CM-PMI to obtain further improved results.
Keywords: association measure, cm-pmi, web mining
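For reference, the WebPMI baseline that CM-PMI refines is computed from search-engine hit counts; a minimal version, with hypothetical hit counts standing in for live search-engine queries and a common low-count cutoff, is:

  import math

  # WebPMI from page counts: h_p and h_q are hit counts for the two queries,
  # h_pq the count for the conjunctive query, n the number of indexed pages.
  def web_pmi(h_p, h_q, h_pq, n, min_hits=5):
      if h_pq < min_hits:   # too few co-occurrences: treat as unrelated
          return 0.0
      return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))

  # Hypothetical counts for the queries "apple" and "fruit":
  print(web_pmi(h_p=2_000_000, h_q=5_000_000, h_pq=400_000, n=10**10))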
Web user de-identification in personalization BIBAKFull-Text 1081-1082
  Jiaqian Zheng; Jing Yao; Junyu Niu
It is a privacy infraction in personalized web services if a user profile submitted to one web site is transferred to another site without the user's permission: the second web site can then easily re-identify to whom the personal data belong, whether the transfer is sanctioned or the result of hacking.
   This paper presents a portable solution for users to bind their sensitive web data to an appointed domain. Such data, including query logs, user accounts, click streams etc., could be used to identify sensitive information about a particular user. With our domain-stretching de-identification method, if personal data leak from domain A to domain B, the web user still cannot be identified even if he logs in to sites under domain B using the same name and password. In an experiment implemented in JavaScript, we show the flexibility and efficiency of our de-identification approach.
Keywords: de-identification, domain, user profile
A systematic approach for cell-phone worm containment BIBAKFull-Text 1083-1084
  Liang Xie; Hui Song; Trent Jaeger; Sencun Zhu
Cell phones are increasingly becoming attractive targets of various worms, which cause the leakage of user privacy, extra service charges and depletion of battery power. In this work, we study propagation of cell-phone worms, which exploit Multimedia Messaging Service (MMS) and/or Bluetooth for spreading. We then propose a systematic countermeasure against the worms. At the terminal level, we adopt Graphic Turing test and identity-based signature to block unauthorized messages from leaving compromised phones; at the network level, we propose a push-based automated patching scheme for cleansing compromised phones. Through experiments on phone devices and a wide variety of networks, we show that cellular systems taking advantage of our defense can achieve a low infection rate (e.g., less than 3% within 30 hours) even under severe attacks.
Keywords: cell phone, patch, turing test
Information retrieval and knowledge discovery on the semantic web of traditional Chinese medicine BIBAKFull-Text 1085-1086
  Zhaohui Wu; Tong Yu; Huajun Chen; Xiaohong Jiang; Yi Feng; Yuxin Mao; Heng Wang; Jingming Tang; Chunying Zhou
We conduct the first systematic adoption of Semantic Web solutions in the integration, management, and utilization of TCM information and knowledge resources. As a result, the largest TCM Semantic Web ontology is engineered as the uniform knowledge representation mechanism; an ontology-based query and search engine is deployed, mapping legacy and heterogeneous relational databases to the Semantic Web layer for query and search across database boundaries; the first global herb-drug interaction network is mapped through semantic integration; and a semantic graph mining methodology is implemented for discovering and interpreting interesting patterns in this network. The platform and underlying methodology prove effective in TCM-related drug usage, discovery, and safety analysis.
Keywords: information retrieval, knowledge discovery, semantic web
Topigraphy: visualization for large-scale tag clouds BIBAKFull-Text 1087-1088
  Ko Fujimura; Shigeru Fujimura; Tatsushi Matsubayashi; Takeshi Yamada; Hidenori Okuda
This paper proposes a new method for displaying large-scale tag clouds. We use a topographical image, which helps users grasp the relationships among tags intuitively, as a background to the tag clouds. We apply this interface to a blog navigation system and show that the proposed method enables users to find the desired tags easily even if the tag clouds are very large (5,000 tags and above). Our approach is also effective for understanding the overall structure of a large collection of tagged documents.
Keywords: blog, clustering, search, tag clouds, topigraphy, topography
Gsp-exr: gsp protocol with an exclusive right for keyword auctions BIBAKFull-Text 1089-1090
  Yuko Sakurai; Atsushi Iwasaki; Yasumasa Saito; Makoto Yokoo
We propose a keyword auction protocol called the Generalized Second Price with an Exclusive Right (GSP-ExR). In existing keyword auctions, the number of displayed advertisements is determined in advance. We therefore consider adjusting the number of advertisements dynamically based on bids. In the GSP-ExR, the number of slots can be either 1 or K. When K slots are displayed, the protocol is identical to the GSP. If the value per click of the highest-ranked bidder is large enough, this bidder can display her advertisement exclusively by paying a premium. The pricing scheme is thus relatively simple, and seller revenue is at least as high as in the GSP. Also, in the GSP-ExR, the highest-ranked bidder has no incentive to change the number of slots by over- or under-bidding as long as she retains the top position.
Keywords: electronic commerce, keyword auction, mechanism design, web advertising
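For context, the GSP pricing that GSP-ExR extends is simple: rank bidders by bid (ignoring quality weights here) and charge each slot the next-ranked bid per click. The sketch below shows that baseline plus a stubbed exclusive-right rule; the actual trigger condition and premium are defined in the paper, so the threshold used here is purely an assumption.

  # Schematic GSP with an exclusive-right option (illustrative only;
  # assumes at least two bidders).
  def gsp_allocate(bids, k, exclusive_threshold):
      """bids: list of (bidder, value-per-click); k: number of slots."""
      ranked = sorted(bids, key=lambda b: b[1], reverse=True)
      top_bidder, top_bid = ranked[0]
      # Assumed rule: a sufficiently dominant top bid buys a single
      # exclusive slot at a premium over the runner-up's bid.
      if top_bid >= exclusive_threshold * ranked[1][1]:
          return [(top_bidder, exclusive_threshold * ranked[1][1])]
      # Otherwise standard GSP: slot i pays the (i+1)-th highest bid.
      return [(ranked[i][0], ranked[i + 1][1])
              for i in range(min(k, len(ranked) - 1))]

  print(gsp_allocate([("x", 10.0), ("y", 3.0), ("z", 2.0)],
                     k=2, exclusive_threshold=3.0))

In this toy run the top bidder pays 9.0 for exclusivity, more than the 3.0 she would pay under plain GSP, consistent with the claim that seller revenue is at least as high as in the GSP.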
Finding similar pages in a social tagging repository BIBAKFull-Text 1091-1092
  Alex Penev; Raymond K. Wong
Social tagging describes a community of users labeling web content with tags. It is a simple activity that enriches our knowledge about resources on the web. For a computer to help users search the tagged repository, it must know when tags are good or bad. We describe TagScore, a scoring function that rates the goodness of tags. The tags and their ratings give us a succinct synopsis for a page. We 'find similar' pages in Del.icio.us by comparing synopses. Our approach gives good correlation to the full cosine similarity but is hundreds of times faster.
Keywords: del.icio.us, social bookmarking, tagging, web
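The 'find similar' step then reduces to comparing two compact synopses; a minimal cosine over tag-goodness vectors, with made-up scores standing in for TagScore output:

  import math

  # Each synopsis maps tag -> goodness score (values here are hypothetical).
  def cosine(syn_a, syn_b):
      dot = sum(syn_a[t] * syn_b[t] for t in set(syn_a) & set(syn_b))
      norm = math.sqrt(sum(v * v for v in syn_a.values())) * \
             math.sqrt(sum(v * v for v in syn_b.values()))
      return dot / norm if norm else 0.0

  page1 = {"python": 0.9, "tutorial": 0.6, "web": 0.2}
  page2 = {"python": 0.8, "web": 0.5, "django": 0.4}
  print(cosine(page1, page2))

Because a synopsis keeps only a handful of highly rated tags per page, this comparison touches far fewer terms than full cosine similarity over all tags, which is where the reported speedup comes from.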
Folksoviz: a subsumption-based folksonomy visualization using wikipedia texts BIBAKFull-Text 1093-1094
  Kangpyo Lee; Hyunwoo Kim; Chungsu Jang; Hyoung-Joo Kim
In this paper, targeting del.icio.us tag data, we propose a method, FolksoViz, for deriving subsumption relationships between tags using Wikipedia texts, and for visualizing a folksonomy. To this end, we propose a statistical model for deriving subsumption relationships based on the frequency of each tag in the Wikipedia texts, as well as a TSD (Tag Sense Disambiguation) method for mapping each tag to a corresponding Wikipedia text. The derived subsumption pairs are visualized effectively on the screen. The experiment shows that FolksoViz finds the correct subsumption pairs with high accuracy.
Keywords: collaborative tagging, folksonomy, subsumption, visualization, web 2.0, wikipedia
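Frequency-based subsumption criteria of this general kind can be stated in a few lines; the sketch below implements the classic co-occurrence rule (x subsumes y when y almost always co-occurs with x but not vice versa), a generic stand-in rather than the paper's exact statistical model:

  # Generic co-occurrence subsumption test with a tunable threshold t.
  def subsumes(count_x, count_y, count_xy, t=0.8):
      p_x_given_y = count_xy / count_y   # P(x|y)
      p_y_given_x = count_xy / count_x   # P(y|x)
      return p_x_given_y >= t and p_y_given_x < t

  # Hypothetical tag frequencies over Wikipedia texts:
  print(subsumes(count_x=10_000, count_y=500, count_xy=450))  # "music" subsumes "jazz"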
Size matters: word count as a measure of quality on wikipedia BIBAKFull-Text 1095-1096
  Joshua E. Blumenstock
Wikipedia, "the free encyclopedia", now contains over two million English articles, and is widely regarded as a high-quality, authoritative encyclopedia. Some Wikipedia articles, however, are of questionable quality, and it is not always apparent to the visitor which articles are good and which are bad. We propose a simple metric -- word count -- for measuring article quality. In spite of its striking simplicity, we show that this metric significantly outperforms the more complex methods described in related work.
Keywords: information quality, wikipedia, word count
Understanding internet video sharing site workload: a view from data center design BIBAKFull-Text 1097-1098
  Xiaozhu Kang; Hui Zhang; Guofei Jiang; Haifeng Chen; Xiaoqiao Meng; Kenji Yoshihira
In this paper we measured and analyzed the workload on Yahoo! Video, the 2nd largest U.S. video sharing site, to understand its nature and its impact on online video data center design. We discovered interesting statistical properties on both static and temporal dimensions of the workload, including file duration and popularity distributions, arrival rate dynamics and predictability, and workload stationarity and burstiness. Complementing the measurements with queueing-theoretic techniques, we further extended our understanding of the data through a virtual design of the workload and capacity management components of a data center under the measured workload, which reveals key results regarding the impact of Service Level Agreements (SLAs) and workload scheduling schemes on the design and operation of such large-scale video distribution systems.
Keywords: capacity planning, data center design, measurement, online video, queueing model, sla, workload management
Representing a web page as sets of named entities of multiple types: a model and some preliminary applications BIBAKFull-Text 1099-1100
  Nan Di; Conglei Yao; Mengcheng Duan; Jonathan Zhu; Xiaoming Li
As opposed to representing a document as a "bag of words" as in most information retrieval applications, we propose a model that represents a web page as sets of named entities of multiple types. Specifically, four types of named entities are extracted: person, geographic location, organization, and time. Moreover, the relations among these entities are also extracted, weighted, classified and marked by labels. On top of this model, some interesting applications are demonstrated. In particular, we introduce a notion of person-activity, which contains four different elements: person, location, time and activity. With this notion, and based on a reasonably large set of web pages, we show how one person's activities can be attributed by time and location, which gives a good idea of the mobility of the person in question.
Keywords: named entity, web content mining, web page model
Falcons: searching and browsing entities on the semantic web BIBAKFull-Text 1101-1102
  Gong Cheng; Weiyi Ge; Yuzhong Qu
As of today, the amount of data on the Semantic Web has grown considerably. The services for searching and browsing entities on the Semantic Web are in demand. To provide such services, we developed the Falcons system. In this poster, we present the features of the Falcons system.
Keywords: indexing, search engine, semantic web, summarization
Influencers and their barriers to technology BIBAKFull-Text 1103-1104
  Ann Hsieh; Todd Hausman; Nerija Titus
Modern Day Marco Polos, young influencers in emerging markets who have more access to technology than their peers, often act as the gateway to new websites and technology in their respective countries. However, as they influence their peers through account gifting, they are often met with barriers such as language.
Keywords: emerging markets, ethnography, international
A semantic layer for publishing and localizing xml data for a p2p xquery mediator BIBAKFull-Text 1105-1106
  Florin Dragan; Georges Gardarin; Laurent Yeh
In this poster, we present the P2P XML data localization layer of XLive, an XQuery mediation system developed at University of Versailles [2]. A major challenge in the evaluation of XQuery over a P2P network in the context of multiple XML sources is to abstract from structural heterogeneity. Most existing approaches have mainly exploited varied types of 1:1 semantic mappings between peer schemas and/or ontologies. Complex query algorithms are required to exploit these semantic links between peers. In contrast, our approach focuses on a simple semantic layer. It is built on a Chord [4] DHT for indexing semantic descriptions of mediated data sources. A mapping process transforms an XML view of a mediator (each peer is equipped with a mediator) to a semantic form that is published into the P2P network and stored in the DHT. For a given user query, a search in the DHT retrieves all relevant data sources able to contribute to the result elaboration. Then, every peer that contains relevant data is directly queried to collect data for elaborating the final answer.
Keywords: p2p, semantic mediation, xquery
Mining for personal name aliases on the web BIBAKFull-Text 1107-1108
  Danushka Bollegala; Taiki Honma; Yutaka Matsuo; Mitsuru Ishizuka
We propose a novel approach to find aliases of a given name from the web. We exploit a set of known names and their aliases as training data and extract lexical patterns that convey information related to aliases of names from text snippets returned by a web search engine. The patterns are then used to find candidate aliases of a given name. We use anchor texts and hyperlinks to design a word co-occurrence model and define numerous ranking scores to evaluate the association between a name and its candidate aliases. The proposed method outperforms numerous baselines and previous work on alias extraction on a dataset of personal names, achieving a statistically significant mean reciprocal rank of 0.6718. Moreover, the aliases extracted using the proposed method improve recall by 20% in a relation-detection task.
Keywords: name alias extraction, semantic web, web mining
Application of bitmap index to information retrieval BIBAKFull-Text 1109-1110
  Kengo Fujioka; Yukio Uematsu; Makoto Onizuka
We developed the HS-bitmap index for efficient information retrieval. The HS-bitmap index is a hierarchical document-term matrix: the original document-term matrix is called the leaf matrix, and each upper matrix is a summary of the matrix below it. Our experimental results show that the HS-bitmap index outperforms the inverted index with only a minor space overhead.
Keywords: bitmap index, information retrieval
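The hierarchical idea can be sketched directly: OR together blocks of leaf rows into a summary matrix and consult the summary first, so whole blocks of documents can be skipped. A toy two-level version follows; the block size and bit layout are assumptions, not the paper's exact design.

  # Two-level bitmap index: leaf rows are per-document term bitmaps;
  # each summary row is the OR of a block of leaf rows.
  BLOCK = 4

  def build_summary(leaf_rows):
      summary = []
      for i in range(0, len(leaf_rows), BLOCK):
          acc = 0
          for row in leaf_rows[i:i + BLOCK]:
              acc |= row
          summary.append(acc)
      return summary

  def search(leaf_rows, summary, term_mask):
      hits = []
      for s, bits in enumerate(summary):
          if bits & term_mask != term_mask:   # block cannot match: skip it
              continue
          for d in range(s * BLOCK, min((s + 1) * BLOCK, len(leaf_rows))):
              if leaf_rows[d] & term_mask == term_mask:
                  hits.append(d)
      return hits

  docs = [0b1010, 0b0011, 0b1000, 0b0110, 0b1011, 0b0001]
  print(search(docs, build_summary(docs), term_mask=0b0010))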
Pivotbrowser: a tag-space image searching prototype BIBAKFull-Text 1111-1112
  Xiaoyan Li; Lidan Shou; Gang Chen; Xiaolong Zhang; Tianlei Hu; Jinxiang Dong
We propose a novel iterative searching and refining prototype for tagged images. This prototype, named PivotBrowser, captures semantically similar tag sets in a structure called a pivot. By constructing a pivot for a textual query, PivotBrowser first selects candidate images possibly relevant to the query. The tags contained in these candidate images are then selected according to their relevance to the pivot. The shortlisted tags are clustered, and one of the tag clusters is used to select the results from the candidate images. Ranking of the images in each partition is based on their relevance to the tag cluster. Guided by the tag clusters presented, a user can perform searching and iterative query refinement.
Keywords: ambiguity, inconsistency, relevance, tag
Measuring extremal dependencies in web graphs BIBAKFull-Text 1113-1114
  Yana Volkovich; Nelly Litvak; Bert Zwart
We analyze dependencies in power law graph data (a Web sample, a Wikipedia sample and a preferential attachment graph) using statistical inference for multivariate regular variation. The well-developed theory of regular variation is widely applied in extreme value theory, telecommunications and mathematical finance, and it provides a natural mathematical formalism for analyzing dependencies between variables with power laws. However, most of the proposed methods have never been used in Web graph data mining. The present work fills this gap. The new insights this yields are striking: the three above-mentioned data sets are shown to have totally different dependence structures between graph parameters such as in-degree and PageRank.
Keywords: pagerank, preferential attachment, regular variation, web, wikipedia
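A standard tool from this literature is the Hill estimator for the tail index of a power-law distribution; the minimal version below is generic extreme-value machinery, not the authors' full multivariate analysis:

  import math
  import random

  # Hill estimator of a power-law tail index from the k largest values.
  def hill_tail_index(data, k):
      xs = sorted(data, reverse=True)
      logs = [math.log(x) for x in xs[:k + 1]]
      gamma = sum(logs[i] - logs[k] for i in range(k)) / k
      return 1.0 / gamma   # estimated alpha; larger means a lighter tail

  # Hypothetical in-degree-like sample with a Pareto tail of index 2.1:
  random.seed(0)
  sample = [random.paretovariate(2.1) for _ in range(10_000)]
  print(hill_tail_index(sample, k=500))   # should land near 2.1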
Determining user's interest in real time BIBAKFull-Text 1115-1116
  Ranbir S. Sanasam; Hema A. Murthy; Timothy A. Gonsalves
Most search engine optimization techniques attempt to predict users' interest by learning from past information collected from different sources. However, a user's current interest often depends on many factors that are not captured in past information. In this paper, we attempt to identify a user's current interest in real time from the information the user provides in the current query session. By identifying the user's interest in real time, the engine can adapt differently to different users. Experimental verification indicates that our approach is promising for short queries.
Keywords: real time, search engine, users clicks, users' interest
How to influence my customers?: the impact of electronic market design BIBAKFull-Text 1117-1118
  Nan Hu; Ling Liu; Bin Chen; Jialie Shen
This paper investigates the strategic decisions of online vendors to offer different mechanisms, such as sampling and online reviews of information products, to increase their online sales. Focusing on measuring the effectiveness of electronic market design (offering reviews, sampling, or both), our study shows that online markets behave as communication markets, and consumers learn product quality information both passively (reading online reviews) and actively but subjectively (listening to music samples). Using data from Amazon, we first show that sampling alone is a strong product quality signal that reduces product uncertainty after controlling for the halo effect. In general, products with a sampling option enjoy a higher conversion rate (which leads to better sales) than those without, because sampling decreases the uncertainty of consuming experience goods. Second, the impact of online reviews on the sales conversion rate is lower for experience goods with a sampling option than for those without. Third, when the uncertainty of the societal reviews is higher, sampling plays a more important role because it mitigates the uncertainty introduced by online reviews.
Keywords: conversion rate, electronic market design, halo effect, online product reviews, sampling
Improving web spam detection with re-extracted features BIBAKFull-Text 1119-1120
  Guanggang Geng; Chunheng Wang; Qiudan Li
Web spam detection has become one of the top challenges for the Internet search industry. Instead of using heuristic rules, we propose a feature re-extraction strategy to optimize the detection result. Based on the spamicity predicted in a preliminary detection step, three types of features are extracted through the host-level web graph. Experiments on the WEBSPAM-UK2006 benchmark show that this strategy markedly improves the performance of web spam detection.
Keywords: content spam, link spam, machine learning, web spam
The world wide telecom web browser BIBAKFull-Text 1121-1122
  Sheetal Agarwal; Arun Kumar; Amit Anil Nanavati; Nitendra Rajput
As the number of telephony voice applications grows, there will be a need for a browser to surf the Web of interconnected voice applications (called VoiceSites). These VoiceSites are accessed through a telephone over an audio channel. We present the concept and architecture of the T-Web Browser, a World Wide Telecom Web browser that enables browsing the Web of voice applications through an ordinary phone. This browser will support rich browsing features such as history and bookmarking.
Keywords: developing regions, hstp, voice browser, wwtw
VoiKiosk: increasing reachability of kiosks in developing regions BIBAKFull-Text 1123-1124
  Sheetal Agarwal; Arun Kumar; Amit Anil Nanavati; Nitendra Rajput
One of the several initiatives to bridge the digital divide in developing countries has been the deployment of information kiosks or knowledge centers in rural villages. These kiosks provide services ranging from email, chat and browsing to distance education programs, agricultural services and eGovernance services. A kiosk typically comprises a computer with a printer, web cam, multimedia system and Internet connectivity, and is owned by a local entrepreneur. Moving away from the PC-based kiosk model, we present an alternative platform to create and host such information kiosks in the telephony network. We call these VoiKiosks; they are accessible through voice interaction over an ordinary phone call.
Keywords: developing regions, kiosks, voigen
Semantic similarity based on compact concept ontology BIBAKFull-Text 1125-1126
  Ce Zhang; Yu-Jing Wang; Bin Cui; Gao Cong
This paper presents a new method of calculating the semantic similarity between two articles based on WordNet. To further improve the performance of the proposed method, we build a new Compact Concept Ontology (CCO) from WordNet by combining the words with similar semantic meanings. The experimental results show that our approach significantly outperforms a recent proposal of computing semantic similarity, and demonstrate the superiority of the proposed CCO method.
Keywords: concept, ontology, similarity
Layman tuning of websites: facing change resilience BIBAKFull-Text 1127-1128
  Oscar Díaz; Cristóbal Arellano; Jon Iturrioz
Client scripting permits end users to customize content, layout or style of their favourite websites. However, current scripting suffers from tight coupling with the website: if the page changes, the scripting can fall apart. Websites evolve frequently, and this can jeopardize all the scripting effort. To avoid this situation, this work enriches websites with a "modding interface" in an attempt to decouple laymen's scripts from website upgrades. From the website viewpoint, this interface ensures safe scripting, i.e. scripts that do not break the page. From a scripter perspective, this interface limits tuning but increases change resilience. The approach tries to find a balance between openness (free inspection by scripters) and modularity (isolation of scripters from website design decisions) that permits scripting to scale up as a mature software practice. The approach is realized for Greasemonkey scripts.
Keywords: personalization, semantic web, web 2.0
Model bloggers' interests based on forgetting mechanism BIBAKFull-Text 1129-1130
  Yuan Cheng; Guang Qiu; Jiajun Bu; Kangmiao Liu; Ye Han; Can Wang; Chun Chen
Blogs have expanded at an incredible speed in recent years. Their wealth of personal information makes blogs a popular source for mining user profiles. In this paper, we propose a novel approach to modeling bloggers' interests based on a forgetting mechanism. A new forgetting function is introduced to track interest drift. Based on it, Short Term Interest Models (STIM) and Long Term Interest Models (LTIM) are constructed to describe bloggers' short-term and long-term interests. The experiments show that both models identify bloggers' preferences well.
Keywords: blog, interest forgetting, long term interests, short term interests
Temporal views over rdf data BIBAKFull-Text 1131-1132
  Geetha Manjunath; R. Badrinath; Craig Sayers; K. S. Venugopal
Supporting fast access to large RDF stores has been one of the key challenges for enabling use of the Semantic Web in real-life applications, all the more so in sensor-based systems where large amounts of historic data need to be stored. We propose a semantics-based temporal view mechanism that enables faster access to time-varying data by caching into memory only the required subset of RDF triples. We describe our experience implementing such a framework in the context of a wide area network monitoring system. Our preliminary results show that our solution significantly improves client access time and scales well for moderate data sets.
Keywords: owl, rdf, semantic web, sensors, time varying data, views
A teapot graph and its hierarchical structure of the Chinese web BIBAKFull-Text 1133-1134
  Jonathan J. H. Zhu; Tao Meng; Zhengmao Xie; Geng Li; Xiaoming Li
The shape of the Web in terms of its graph structure has been a topic of wide interest. Two graphs, Bow Tie and Daisy, have stood out from previous research. In this work, we take a different approach, viewing the Web as a hierarchy of three levels, namely page level, host level, and domain level. These structures are analyzed and compared on a snapshot of the Chinese Web from early 2006, involving 830 million pages, 17 million hosts, and 0.8 million domains. Some interesting results have emerged. For example, the Chinese Web appears more like a teapot (with a large SCC, a medium-sized IN and a small OUT) at page level than the classic bow tie or daisy shape. Some challenging phenomena are also observed. For example, the INs become much smaller than the OUTs at host and domain levels. Future work will tackle these puzzles.
Keywords: bow tie graph, daisy graph, self similarity, teapot graph
Collaborative filtering on skewed datasets BIBAKFull-Text 1135-1136
  Somnath Banerjee; Krishnan Ramanathan
Many real-life datasets have skewed event distributions, in which the probability of observing a few events far exceeds that of the others. In this paper, we observed that on skewed datasets the state-of-the-art collaborative filtering methods perform worse than a simple probabilistic model. Our test bench includes a real ad click-stream dataset, which is naturally skewed. The same conclusion is obtained even on the popular movie rating dataset when we pose the binary prediction problem of whether a user will give a movie the maximum rating or not.
Keywords: collaborative filtering, plsa, skewed dataset
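A 'simple probabilistic model' for such skewed binary data can be as little as a smoothed per-item event rate; the sketch below is one plausible (assumed) formulation of such a baseline, with Laplace smoothing:

  from collections import Counter

  # Per-item probability of a positive event (e.g. a click), Laplace-smoothed.
  def item_event_rates(events):
      """events: list of (item, outcome) pairs with outcome in {0, 1}."""
      shown, positive = Counter(), Counter()
      for item, outcome in events:
          shown[item] += 1
          positive[item] += outcome
      return {i: (positive[i] + 1) / (shown[i] + 2) for i in shown}

  log = [("ad1", 1), ("ad1", 0), ("ad1", 0), ("ad2", 1), ("ad2", 1)]
  print(item_event_rates(log))   # rank items by these probabilities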
2lip: the step towards the web3d BIBAKFull-Text 1137-1138
  Jacek Jankowski; Sebastian Ryszard Kruk
The World Wide Web allows users to create and publish a variety of resources, including multimedia ones. Most contemporary best practices for designing web interfaces, however, do not take 3D techniques into account.
   In this paper we present a novel approach for designing interactive web applications -- 2-Layer Interface Paradigm (2LIP). The background layer of the 2LIP-type user interface is a 3D scene, which a user cannot directly interact with. The foreground layer is HTML content. Only taking an action on this content (e.g. pressing a hyperlink, scrolling a page) can affect the 3D scene.
   We introduce a reference implementation of 2LIP: Copernicus -- The Virtual 3D Encyclopedia, which shows one of the potential paths of the evolution of Wikipedia towards Web 3.0. Based on an evaluation of Copernicus we show that designing web interfaces according to 2LIP provides users with a better browsing experience, without harming the interaction.
Keywords: design, web3d, wikipedia, wpf
Protecting web services from remote exploit code: a static analysis approach BIBAKFull-Text 1139-1140
  Xinran Wang; Yoon-chan Jhi; Sencun Zhu; Peng Liu
We propose STILL, a signature-free remote exploit binary code injection attack blocker to protect web servers and web applications. STILL is robust to almost all anti-signature, anti-static-analysis and anti-emulation obfuscation.
Keywords: code injection attack, http, static analysis
Composing and optimizing data providing web services BIBAKFull-Text 1141-1142
  Mahmoud Barhamgi; Djamal Benslimane; Aris M. Ouksel
In this paper, we propose a new approach to automatically compose data providing Web services. Our approach exploits existing mature work on data integration systems. Specifically, data providing services are modeled as RDF parameterized views over mediated ontologies. An RDF-oriented algorithm then composes services using query rewriting techniques. We also apply an optimization algorithm to the generated composition to speed up its execution. The results of our experiments show that the algorithms scale very well to large numbers of services, thus covering most realistic applications.
Keywords: optimization, rdf, web service composition
Psst: a web-based system for tracking political statements BIBAKFull-Text 1143-1144
  Samantha Kleinberg; Bud Mishra
Determining candidates' views on important issues is critical in deciding whom to support and vote for, but finding their statements and votes on an issue can be laborious. In this paper we present PSST (Political Statement and Support Tracker), a search engine to facilitate analysis of political statements and votes over time. We show that prior tools for text analysis can be combined with minimal manual processing to provide a first step towards the full automation of this process.
Keywords: elections, politics, search
Exploiting semantic web technologies to model web form interactions BIBAKFull-Text 1145-1146
  Wolfgang Holzinger; Bernhard Kruepl; Robert Baumgartner
Form mapping is the key problem that needs to be solved in order to get access to the hidden web. Currently available solutions for fully automatic mapping are not ready for commercial meta-search engines, which still have to rely on hand-crafted code and are hard to maintain.
   We believe that a thorough formal description of the problem with semantic web technologies provides a promising perspective to develop a new class of vertical search engines that is more robust and easier to maintain than existing solutions.
   In this paper, instead of trying to tackle the mapping problem, we model the interaction necessary to fill out a web form. First, during a user-assisted phase, the connection from the visible elements on the form to the domain concepts is established. Then, with help from background knowledge about the possible interaction steps, a plan for filling out the form is derived.
Keywords: hidden web, web data extraction, web form mapping
Groupme! BIBAKFull-Text 1147-1148
  Fabian Abel; Nicola Henze; Daniel Krause
This paper presents the GroupMe! system, a resource sharing system with advanced tagging functionality. GroupMe! provides a novel user interface, which enables users to organize and arrange arbitrary Web resources into groups. The content of such groups can be surveyed and inspected at a glance, as resources are visualized in a multimedia-based fashion. In this paper, we furthermore introduce new folksonomy-based ranking strategies that exploit the group structure shipped with GroupMe! folksonomies. Experiments show that these strategies significantly improve the performance of such ranking algorithms.
Keywords: folksonomies, groupme!, ranking, search, social media
Microscale evolution of web pages BIBAKFull-Text 1149-1150
  Carrie Grimes
We track a large set of "rapidly" changing web pages and examine the assumption that the arrival of content changes follows a Poisson process on a microscale. We demonstrate that there are significant differences in the behavior of pages that can be exploited to maintain freshness in a web corpus.
Keywords: change detection, rate of change, web evolution
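Under the Poisson assumption being tested, a page with change rate λ is still fresh t time units after its last crawl with probability e^(-λt). A small sketch, assuming every change during the observation window is observed:

  import math

  # Poisson change model: rate = changes per unit time; the local copy of a
  # page crawled t units ago is fresh with probability exp(-rate * t).
  def change_rate(num_changes, observation_time):
      return num_changes / observation_time

  def prob_fresh(rate, t_since_crawl):
      return math.exp(-rate * t_since_crawl)

  rate = change_rate(num_changes=12, observation_time=30.0)  # per day
  print(prob_fresh(rate, t_since_crawl=2.0))  # expected freshness after 2 days

A crawler can schedule recrawls so that this expected freshness stays above a target; the poster's point is that for rapidly changing pages the Poisson fit itself breaks down on a microscale.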
Web page sectioning using regex-based template BIBAKFull-Text 1151-1152
  Rupesh R. Mehta; Amit Madaan
This work aims to provide a novel, site-specific web page segmentation and section importance detection algorithm, which leverages structural, content, and visual information. The structural and content information is leveraged via a template, a generalized regular expression learnt over a set of pages. The template, along with visual information, results in high sectioning accuracy. The experimental results demonstrate the effectiveness of the approach.
Keywords: site-specific noise elimination, site-specific segmentation, tree-based reg-ex
Towards a programming language for services computing BIBAKFull-Text 1153-1154
  Arun Kumar; D. Janakiram
Services Computing is emerging as a new discipline. The acceptance of web services technology stems from the fact that services enable easy integration and interoperation of enterprise-level distributed systems. However, software developers are currently forced to translate business-level service requirements and encode them into programs using low-level abstractions such as objects. We propose to introduce language constructs for Service Oriented Programming that would raise programming abstractions from objects to services.
Keywords: object orientation, programming languages, semantic web services
Histrace: building a search engine of historical events BIBAKFull-Text 1155-1156
  Lian'en Huang; Jonathan J. H. Zhu; Xiaoming Li
In this paper, we describe an experimental search engine built on our Chinese web archive, which dates back to 2001. The original data set contains nearly 3 billion Chinese web pages crawled over the past 5 years. From the collection, 430 million "article-like" pages are selected and then partitioned into 68 million sets of similar pages. The titles and publication dates are determined for the pages, and an index is built. When searching, the system returns related pages in chronological order. This way, a user interested in news reports or commentaries on some past event can conveniently find a rich set of highly related pages.
Keywords: replica detection, text mining, web archive
Online change detection in individual web user behaviour BIBAKFull-Text 1157-1158
  Peter I. Hofgesang; Jan Peter Patist
Based on our field studies and consultations with field experts, we identified three main problems that are of key importance to online web personalization and customer relationship management: 1) detecting changes in individual behaviour, 2) reporting on user actions that may need special care, and 3) detecting changes in visitation frequency.
   We propose solutions to these problems and experiment on real-world data from an investment bank, collected over 1.5 years of web traffic. These solutions can be applied to any domain where individuals tend to revisit the website and can be identified accurately.
Keywords: change detection, incremental web mining, user profiling
Webanywhere: enabling a screen reading interface for the web on any computer BIBAKFull-Text 1159-1160
  Jeffrey P. Bigham; Craig M. Prince; Richard E. Ladner
People often use computers other than their own to access web content, but blind users are restricted to using computers equipped with expensive, special-purpose screen reading programs that they use to access the web. WebAnywhere is a web-based, self-voicing web application that enables blind web users to access the web from almost any computer that can produce sound without installing new software. WebAnywhere could serve as a convenient, low-cost solution for blind users on-the-go, for blind users unable to afford another screen reader and for web developers targeting accessible design. This paper describes the implementation of WebAnywhere, overviews an evaluation of it by blind web users, and summarizes a survey of public terminals that shows it can run on most public computers.
Keywords: blind users, screen reader, web accessibility
Guanxi in the Chinese web -- a study of mutual linking BIBAKFull-Text 1161-1162
  Valerie King; Louis Lei Yu; Yan Zhuang
Guanxi is a type of dyadic social interaction based on feelings ("qing") and trust ("xin"). Long studied by scholars of Chinese origin, it has recently drawn the attention of researchers outside of China. We define the concept of guanxi as applied to the interaction between web sites. We explore methods to identify guanxi in the Chinese web, show the unique characteristics of the Chinese web which result from it, and introduce a mechanism for simulating guanxi in a web graph model.
Keywords: social networks, stochastic graph modeling, web graph, web measurement
Towards robust trust establishment in web-based social networks with socialtrust BIBAKFull-Text 1163-1164
  James Caverlee; Ling Liu; Steve Webb
We propose the SocialTrust framework for tamper-resilient trust establishment in online social networks. Two of the salient features of SocialTrust are its dynamic revision of trust by (i) distinguishing relationship quality from trust; and (ii) incorporating a personalized feedback mechanism for adapting as the social network evolves.
Keywords: social networks, trust
Plurality: a context-aware personalized tagging system BIBAKFull-Text 1165-1166
  Robert Graham; Brian Eoff; James Caverlee
We present the design of Plurality, an interactive tagging system. Plurality's modular architecture allows users to automatically generate high-quality tags over Web content, as well as over archival and personal content typically beyond the reach of existing Web 2.0 social tagging systems. Three of the salient features of Plurality are: (i) its self-learning and feedback-sensitive capabilities based on a user's personalized tagging style; (ii) its leveraging of the collective intelligence of existing social tagging services; and (iii) its context-awareness for optimizing tag suggestions, e.g., based on spatial or temporal features.
Keywords: context-sensitive, personalization, social annotation, tags
Web graph similarity for anomaly detection (poster) BIBAKFull-Text 1167-1168
  Panagiotis Papadimitriou; Ali Dasdan; Hector Garcia-Molina
Web graphs are approximate snapshots of the web, created by search engines. Their creation is an error-prone procedure that relies on the availability of Internet nodes and the faultless operation of multiple software and hardware units. Checking the validity of a web graph requires a notion of graph similarity. Web graph similarity helps measure the amount and significance of changes in consecutive web graphs. These measurements validate how well search engines acquire content from the web. In this paper we study five similarity schemes: three of them adapted from existing graph similarity measures and two adapted from well-known document and vector similarity methods. We compare and evaluate all five schemes using a sequence of web graphs from Yahoo! and study whether the schemes can identify anomalies that may occur due to hardware or other problems.
Keywords: anomaly detection, graph similarity, web graph lsh
Static query result caching revisited BIBAKFull-Text 1169-1170
  Rifat Ozcan; Ismail Sengor Altingovde; Özgür Ulusoy
Query result caching is an important mechanism for search engine efficiency. In this study, we first review several query features that are used to determine the contents of a static result cache. Next, we introduce a new feature that more accurately represents the popularity of a query by measuring the stability of query frequency over a set of time intervals. Experimental results show that this new feature achieves hit ratios better than those of the previously proposed features.
Keywords: query result caching, search engine, static caching
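One plausible reading of such a stability feature (an assumption, not the paper's exact definition): split the training log into intervals and damp the total frequency of queries whose per-interval frequencies deviate strongly from their mean, so that only consistently popular queries enter the static cache:

  # Stability-aware popularity: total frequency damped by how much the
  # per-interval frequencies deviate from their mean (assumed form).
  def frequency_stability(interval_freqs):
      mean = sum(interval_freqs) / len(interval_freqs)
      if mean == 0:
          return 0.0
      deviation = sum(abs(f - mean) for f in interval_freqs) / len(interval_freqs)
      return sum(interval_freqs) / (1.0 + deviation / mean)

  print(frequency_stability([40, 42, 38, 41]))  # steady query: barely damped
  print(frequency_stability([0, 0, 160, 1]))    # bursty query: heavily damped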
A larger scale study of robots.txt BIBAKFull-Text 1171-1172
  Santanu Kolay; Paolo D'Alberto; Ali Dasdan; Arnab Bhattacharjee
A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The rules in the protocol enable the site to allow or disallow part or all of its content to certain crawlers, resulting in a favorable or unfavorable bias towards some of them. A 2007 survey of the robots.txt usage of 7,593 sites found some evidence of such biases, the news of which led to widespread discussions on the web. In this paper, we report on our survey of about 6 million sites. Our survey corrects the shortcomings of the previous survey and shows the lack of any significant preference towards any particular search engine.
Keywords: crawler, robots exclusion, robots.txt, search engine
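Probing how a site's robots.txt treats different crawlers is straightforward with Python's standard-library parser; a survey of the kind described can loop a check like the following over many sites and user agents (example.com is a placeholder):

  from urllib.robotparser import RobotFileParser

  # Compare what one site's robots.txt allows to different crawlers.
  rp = RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()   # fetches and parses the file

  for agent in ("Googlebot", "Slurp", "msnbot", "SomeRandomBot"):
      print(agent, rp.can_fetch(agent, "https://example.com/some/page.html"))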
Defection detection: predicting search engine switching BIBAKFull-Text 1173-1174
  Allison P. Heath; Ryen W. White
Searchers have a choice about which Web search engine they use when looking for information online. If they are unsuccessful on one engine, users may switch to a different engine to continue their search. By predicting when switches are likely to occur, the search experience can be modified to retain searchers or ensure a quality experience for incoming searchers. In this poster, we present research on a technique for predicting search engine switches. Our findings show that prediction is possible at a reasonable level of accuracy, particularly when personalization or user grouping is employed. These findings have implications for the design of applications to support more effective online searching.
Keywords: search engine switching
Algorithm for stochastic multiple-choice knapsack problem and application to keywords bidding BIBAKFull-Text 1175-1176
  Yunhong Zhou; Victor Naroditskiy
We model budget-constrained keyword bidding in sponsored search auctions as a stochastic multiple-choice knapsack problem (S-MCKP) and design an algorithm to solve S-MCKP and the corresponding bidding optimization problem. Our algorithm selects items online based on a threshold function which can be built/updated using historical data. Our algorithm achieved about 99% of the offline optimum's performance when applied to a real bidding dataset. With synthetic datasets and i.i.d. item-sets, its performance ratio against the offline optimum empirically converges to one as the number of periods increases.
Keywords: keyword bidding, multiple-choice knapsack problem, sponsored search auction, stochastic optimization
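The flavor of a threshold-based online rule: in each period, accept the choice whose value-per-cost clears a threshold that tightens as the budget is consumed. The sketch below follows that generic pattern with a simple assumed threshold schedule, not the paper's threshold function built from historical data:

  # Online knapsack-style selection: take the best affordable choice whose
  # efficiency (value per unit cost) clears a budget-dependent threshold.
  def simulate(periods, budget, base_threshold=1.0):
      spent, value = 0.0, 0.0
      for choices in periods:            # each period: list of (value, cost)
          threshold = base_threshold * (1.0 + spent / budget)  # assumed schedule
          best = None
          for v, c in choices:
              if c > 0 and spent + c <= budget and v / c >= threshold:
                  if best is None or v / c > best[0] / best[1]:
                      best = (v, c)
          if best:
              value += best[0]
              spent += best[1]
      return value, spent

  periods = [[(5, 2), (3, 1)], [(2, 2)], [(9, 3)]]
  print(simulate(periods, budget=5.0))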
Simrank++: query rewriting through link analysis of the clickgraph (poster) BIBAKFull-Text 1177-1178
  Ioannis Antonellis; Hector Garcia-Molina; Chi-Chao Chang
We focus on the problem of query rewriting for sponsored search. We base rewrites on a historical click graph that records the ads that have been clicked on in response to past user queries. Given a query q, we first consider Simrank [2] as a way to identify queries similar to q, i.e., queries whose ads a user may be interested in. We argue that Simrank fails to properly identify query similarities in our application, and we present two enhanced versions of Simrank: one that exploits weights on click graph edges and another that exploits "evidence". We experimentally evaluate our new schemes against Simrank, using actual click graphs and queries from Yahoo!, and using a variety of metrics. Our results show that the enhanced methods can yield more and better query rewrites.
Keywords: click graph, link analysis, similarity metric, sponsored search
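For orientation, basic SimRank on a bipartite click graph iterates 'two queries are similar if they click similar ads, and two ads are similar if similar queries click them'; the compact (Gauss-Seidel-style) variant below works over a query-to-ads adjacency map, and the paper's weighted and evidence-based versions build on this:

  import itertools

  # Basic SimRank on a bipartite query->ads click graph.
  def simrank(q_to_a, C=0.8, iters=5):
      a_to_q = {}
      for q, clicked in q_to_a.items():
          for a in clicked:
              a_to_q.setdefault(a, set()).add(q)
      qs, ads = list(q_to_a), list(a_to_q)
      qsim = {(x, y): float(x == y) for x in qs for y in qs}
      asim = {(x, y): float(x == y) for x in ads for y in ads}

      def step(sim, nbrs, other_sim, nodes):
          new = dict(sim)
          for x, y in itertools.permutations(nodes, 2):
              nx, ny = nbrs[x], nbrs[y]
              if nx and ny:
                  total = sum(other_sim[(i, j)] for i in nx for j in ny)
                  new[(x, y)] = C * total / (len(nx) * len(ny))
          return new

      for _ in range(iters):
          qsim = step(qsim, q_to_a, asim, qs)
          asim = step(asim, a_to_q, qsim, ads)
      return qsim

  clicks = {"flowers": {"ad1", "ad2"}, "roses": {"ad2"}, "cars": {"ad3"}}
  print(round(simrank(clicks)[("flowers", "roses")], 3))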
An initial investigation on evaluating semantic web instance data BIBAKFull-Text 1179-1180
  Li Ding; Jiao Tao; Deborah L. McGuinness
Many emerging semantic web applications include ontologies from one set of authors and instance data from another (often much larger) set of authors. Often ontologies are reused and instance data is integrated in manners unanticipated by their authors. Not surprisingly, many instance data rich applications encounter instance data that is not compatible with the expectations of the original ontology author(s). This line of work focuses on issues related to semantic expectation mismatches in instance data. Our initial results include a customizable and extensible service-oriented evaluation architecture, and a domain implementation called PmlValidator, which checks instance data using the corresponding ontologies and additional style requirements.
Keywords: architecture, instance data evaluation, semantic web
Mining tag clouds and emoticons behind community feedback BIBAKFull-Text 1181-1182
  Kavita A. Ganesan; Neelakantan Sundaresan; Harshal Deo
In this paper we describe our mining system which automatically mines tags from feedback text in an eCommerce scenario. It renders these tags in a visually appealing manner. Further, emoticons are attached to mined tags to add sentiment to the visual aspect.
Keywords: algorithms, experimentation, feedback, measurement, search, web 2.0
Investigation of partial query proximity in web search BIBAKFull-Text 1183-1184
  Jing Bai; Yi Chang; Hang Cui; Zhaohui Zheng; Gordon Sun; Xin Li
Proximity of query terms in a document is an important criterion in IR. However, no investigation has been made to determine the most useful term sequences for which proximity should be considered. In this study, we test the effectiveness of using proximity of partial term sequences (n-grams) for Web search. We observe that the proximity of sequences of 3 to 5 terms is most effective for long queries, while shorter or longer sequences appear less useful. This suggests that combinations of 3 to 5 terms can best capture the intention in user queries. In addition, we also experiment with weighing the importance of query sub-sequences using query log frequencies. Our preliminary tests show promising empirical results.
Keywords: information retrieval, term proximity
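A simple way to operationalize n-gram proximity: for each query sub-sequence of 3 to 5 terms, find the smallest document window containing all of its terms and reward small windows. The sketch below uses generic minimal-window scoring, not the paper's exact weighting:

  # Proximity scoring for query n-grams: smaller covering windows score higher.
  def min_window(doc, terms):
      """Length of the smallest window of doc covering all terms, or None."""
      need = set(terms)
      best = None
      for i, tok in enumerate(doc):
          if tok not in need:
              continue
          seen = set()
          for j in range(i, len(doc)):
              if doc[j] in need:
                  seen.add(doc[j])
              if seen == need:
                  width = j - i + 1
                  best = width if best is None else min(best, width)
                  break
      return best

  def proximity_score(doc, query_terms, n_min=3, n_max=5):
      score = 0.0
      for n in range(n_min, n_max + 1):
          for k in range(len(query_terms) - n + 1):
              width = min_window(doc, query_terms[k:k + n])
              if width:
                  score += 1.0 / width
      return score

  doc = "cheap flights from new york to london in december".split()
  print(proximity_score(doc, ["flights", "new", "york", "london"]))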
Identifying regional sensitive queries in web search BIBAKFull-Text 1185-1186
  Srinivas Vadrevu; Ya Zhang; Belle Tseng; Gordon Sun; Xin Li
In Web search ranking, the expected results for some queries can vary greatly depending upon the location of the user. We name such queries regional sensitive queries. Identifying the regional sensitivity of queries is important to meet users' needs. The objective of this work is to identify whether a user expects only regional results for a query. We present three novel features generated from search logs and build a meta query classifier to identify regional sensitive queries. Experimental results show that the proposed method achieves high accuracy in identifying regional sensitive queries.
Keywords: query classification, regional sensitive, search
Offline matching approximation algorithms in exchange markets BIBAKFull-Text 1187-1188
  Zeinab Abbassi; Laks V. S. Lakshmanan
Motivated by several marketplace applications on rapidly growing online social networks, we study efficient offline matching algorithms for online exchange markets. We consider two main models: one-shot markets and exchange markets over time. For one-shot markets, we study three main variants of the problem: the one-to-one exchange market problem, the exchange market problem with short cycles, and the probabilistic exchange market problem. We show that all of the above problems are NP-hard, and propose heuristics and approximation algorithms for them. Experiments show that the number of items exchanged increases when exchanges through cycles are allowed. Exploring algorithms for markets over time is an interesting direction for future work.
Keywords: edge partitioning, exchange markets, recommendation algorithms, set packing, social networks
An efficient two-phase service discovery mechanism BIBAKFull-Text 1189-1190
  Shuiguang Deng; Zhaohui Wu; Jian Wu; Ying Li
We bring forward a two-phase semantic service discovery mechanism that supports both operation matchmaking and operation-composition matchmaking. A series of experiments on a service management framework shows that the mechanism achieves better recall and precision than a traditional matchmaker.
Keywords: bipartite graph matching, composition, service discovery, service matchmaking, web service
User oriented link function classification BIBAKFull-Text 1191-1192
  Mingliang Zhu; Weiming Hu; Ou Wu; Xi Li; Xiaoqin Zhang
Currently most link-related applications treat all links in the same web page as identical. A link-related application usually requires a certain property of hyperlinks, but in practice not all links have this property, or they have it to different degrees. Based on a study of how human users judge links, this paper introduces the idea of link function classification (LFC). Link functions reflect the purpose for which links are created by web page designers and the way they are used by viewers. Links in a given function class imply a particular relationship between the adjacent pages, and thus can be assumed to have similar properties. An algorithm is proposed to analyze link functions based on both vision and structure features, simulating how human users react to links. Existing applications can be enhanced by LFC through a more accurate modeling of the web graph. New mining methods can also be developed by making stronger assumptions about links within each function class, owing to the purer property set they share.
Keywords: link function classification, structure features, vision features, web mining
Extraction and mining of an academic social network BIBAKFull-Text 1193-1194
  Jie Tang; Jing Zhang; Limin Yao; Juanzi Li
This paper addresses several key issues in the extraction and mining of an academic social network: 1) extraction of a researcher social network from the existing Web; 2) integration of publications from existing digital libraries; 3) expertise search on a given topic; and 4) association search between researchers. We developed a social network system, called ArnetMiner, based on the methods proposed for the above problems. In total, 448,470 researcher profiles and 981,599 publications have been extracted/integrated over the system's two years of operation. The paper describes the architecture and main features of the system, and briefly presents experimental results for the proposed methods.
Keywords: expertise search, information extraction, social network
Keyword extraction for contextual advertisement BIBAKFull-Text 1195-1196
  Xiaoyuan Wu; Alvaro Bolivar
As the largest online marketplace, eBay strives to promote its inventory throughout the Web via different types of online advertisement. Contextually relevant links to eBay assets on third-party sites are one example of such advertisement avenues. Keyword extraction is the task at the core of any contextual advertisement system. In this paper, we explore a machine learning approach to this problem. The proposed solution uses linear and logistic regression models learnt from human-labeled data, combined with document, text and eBay-specific features. In addition, we propose a solution to identify the prevalent category of eBay items in order to resolve keyword ambiguity.
Keywords: contextual advertisement, keyword extraction
Which "Apple" are you talking about? BIBAKFull-Text 1197-1198
  Mandar A. Rahurkar; Dan Roth; Thomas S. Huang
In a higher-level task such as clustering of web results or word sense disambiguation, knowledge of all possible distinct concepts in which an ambiguous word can be expressed would be advantageous, for instance in determining the number of clusters when clustering web search results. We propose an algorithm to generate such a ranked list of distinct concepts associated with an ambiguous word. Concepts which are popular in terms of usage are ranked higher.
   We evaluate the coverage of the concepts inferred by our algorithm on the results retrieved by querying the ambiguous word using a major search engine, and show a coverage of 85% for the top 30 documents, averaged over all keywords.
Keywords: clustering, wikipedia
Making BPEL flexible: adapting in the context of coordination constraints using WS-BPEL BIBAKFull-Text 1199-1200
  Yunzhou Wu; Prashant Doshi
Although WS-BPEL is emerging as the prominent language for modeling executable business processes, its support for designing flexible processes is limited.
   An important need of many adaptive processes is for concurrent activities in the process to respect coordination constraints. These require that concurrent activities coordinate their behaviors in response to exogenous events.
We show how coordination-inducing constraints may be represented in WS-BPEL, and use generalized adaptation and constraint enforcement models to transform a traditional BPEL process into an adaptive one. The final outcome is an executable WS-BPEL process without extensions, able to adapt to events while respecting coordination constraints between activities.
Keywords: adaptation, bpel, coordination, web services
Speeding up web service composition with volatile information BIBAKFull-Text 1201-1202
  John F. Harney; Prashant Doshi
This paper introduces a novel method for composing Web services in the presence of external volatile information. Our approach, which we call informed-presumptive, is compared to previous state-of-the-art approaches for Web service composition in volatile environments. We show empirically that the informed-presumptive strategy produces compositions in significantly less time than the other strategies, with fewer backtracks.
Keywords: adaptation, expiration times, volatile environments, web services
Context-sensitive QoS model: a rule-based approach to web service composition BIBAKFull-Text 1203-1204
  Tao Zhou; Xiaolin Zheng; William Wei Song; Xiaofeng Du; Deren Chen
Generally, web services are provided with different QoS values, so they can be selected dynamically during the service composition process. However, the conventional context-free composition QoS model does not consider the changeability of QoS values or context-sensitive constraints during the composition process. In this paper, we propose a rule-based context-sensitive QoS model to support both the changeability of QoS values and context-sensitive constraints. By considering context in the QoS model, web service composition can be applied widely and flexibly to real-world business.
Keywords: QoS, context model, rule, web service composition
A unified framework for name disambiguation BIBAKFull-Text 1205-1206
  Jie Tang; Jing Zhang; Duo Zhang; Juanzi Li
The name ambiguity problem has been a challenging issue for a long time. In this paper, we make a thorough investigation of the whole problem. Specifically, we formalize the name disambiguation problem in a unified framework that can incorporate both attributes and relationships into a probabilistic model. We explore a dynamic approach for automatically estimating the number of persons K, and employ an adaptive distance measure to estimate the distance between objects. Experimental results show that our proposed framework significantly outperforms the baseline method.
Keywords: digital library, name disambiguation, probabilistic model
Low-load server crawler: design and evaluation BIBAKFull-Text 1207-1208
  Katsuko T. Nakahira; Tetsuya Hoshino; Yoshiki Mikami
This paper proposes a method of crawling Web servers connected to the Internet without imposing a high processing load. We use the crawler for a field survey of the digital divide, including the ability to connect to the network. Rather than employing a normal Web 'page' crawling algorithm, which usually collects all pages found on the target server, we have developed a 'server' crawling algorithm, which collects only a minimal set of pages from each server, achieving low-load, high-speed crawling of servers.
Keywords: global digital divide, server crawler
KC3 browser: semantic mash-up and link-free browsing BIBAKFull-Text 1209-1210
  Michiaki Iwazume; Ken Kaneiwa; Koji Zettsu; Takafumi Nakanishi; Yutaka Kidawara; Yasushi Kiyoki
This paper proposes a general framework for a system with a semantic browsing and visualization interface, called the Knowledge Communication, Collaboration and Creation Browser (KC3 Browser). It integrates multimedia contents and web services on grid networks, and builds a semantic mash-up called a knowledge workspace (k-workspace) with various visual gadgets according to users' contexts (e.g., their interests, purposes, and computational environments).
   KC3 Browser also achieves link-free browsing for seamless knowledge access by generating semantic links based on arbitrary knowledge models such as ontologies and vector space models. It helps users survey and make sense of various social and natural events from web content.
   We have implemented a prototype of KC3 Browser and applied it to an international project on risk intelligence against natural disasters.
Keywords: semantic browsing, semantic mash-up, semantic web
Generating hypotheses from the web BIBAKFull-Text 1211-1212
  Wei Jin; Rohini Srihari; Abhishek Singh
Hypothesis generation is a crucial initial step for making scientific discoveries. This paper addresses the problem of automatically discovering interesting hypotheses from the web. Given a query containing one or two entities of interest, our algorithm automatically generates a semantic profile describing the specified entity or provides the potential connections between two entities of interest. We implemented a prototype on top of the Google search engine and the experimental results demonstrate the effectiveness of our algorithms.
Keywords: hypothesis generation, link analysis, web search
Using graphics processors for high-performance IR query processing BIBAKFull-Text 1213-1214
  Shuai Ding; Jinru He; Hao Yan; Torsten Suel
Web search engines are facing formidable performance challenges as they need to process thousands of queries per second over billions of documents. To deal with this heavy workload, current engines use massively parallel architectures of thousands of machines that require large hardware investments.
   We investigate new ways to build such high-performance IR systems based on Graphics Processing Units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains including databases and scientific computing [2,3,5], but we are not aware of previous attempts to use GPUs for large-scale web search.
   Our contribution here is to design a basic system architecture for GPU-based high-performance IR, and to describe how to perform highly efficient query processing within such an architecture. Preliminary experimental results based on a prototype implementation suggest that significant gains in query processing performance might be obtainable with such an approach.
Keywords: gpu, query processing
A framework for fast community extraction of large-scale networks BIBAKFull-Text 1215-1216
  Yutaka I. Leon-Suematsu; Kikuo Yuta
Most of the faster community extraction algorithms are based on the Clauset, Newman and Moore (CNM) algorithm, which has been employed for networks with sizes up to 500,000 nodes. The modification proposed by Danon, Diaz and Arenas (DDA) obtains better modularity than CNM and its variations, but no improvement in speed, as its authors acknowledged. In this paper, we identify inefficiencies in the data structure employed by the former algorithms. We propose a new framework for the algorithm and a modification of DDA that makes it applicable to large-scale networks. For instance, community extraction on a network with 1 million nodes and 5 million edges took about 14 minutes, in contrast to the original CNM, which required 45 hours (a 192-fold speedup over the original CNM, with better modularity).
   The scalability of our improvements is shown by applying them to networks with sizes up to 10 million nodes, obtaining the best modularity and execution time compared to the former algorithms.
Keywords: clustering, community analysis, large-scale networks
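For context, a sketch of the CNM-style greedy modularity maximization that this paper accelerates, using the implementation shipped with networkx; the paper's contribution, a faster data structure for the same greedy scheme, is not shown.
   import networkx as nx
   from networkx.algorithms.community import greedy_modularity_communities
   G = nx.karate_club_graph()  # a tiny stand-in for a large social network
   for i, c in enumerate(greedy_modularity_communities(G)):
       print(i, sorted(c))     # each community as a sorted node list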
Enabling secure digital marketplace BIBAKFull-Text 1217-1218
  Hongxia Jin; Vladimir Zbarsky
The fast development of the Web provides new ways for the effective distribution of network-based digital goods. A digital marketplace provides a platform that enables Web users to effectively acquire, share, market and distribute digital content. However, the success of digital marketplace business models hinges on securely managing the digital rights and usage of the digital content. For example, the digital content should be consumable only by paying users. This paper describes a Web-based system that enables the secure exchange of digital content between Web users and prevents users from illegally re-selling the digital content. Part of our solution is based on broadcast encryption technology.
Keywords: content protection, download, drm, security
Extracting XML schema from multiple implicit xml documents based on inductive reasoning BIBAKFull-Text 1219-1220
  Masaya Eki; Tadachika Ozono; Toramatsu Shintani
We propose a method of classifying XML documents and extracting XML schema from them by inductive inference based on constraint logic programming. The goal of this work is to type a large collection of XML approximately but efficiently. The method can also process XML code written in a different schema, or even code that is schema-less. Our approach is intended to achieve identification based on the syntax and semantics of XML documents through information extraction using an ontology, and to support retrieval and data management. Our approach has three steps: the first step translates XML into predicates; the second compares predicates and classifies structures that represent similar meanings in different forms; and the last step converts predicates into rules using the ontology and maintains the XML Schema. We evaluate the similarity of data types and data ranges using an ontology dictionary, and the XML Schema is built from the results of the second and third steps.
Keywords: inductive reasoning, predicate logic, xml
Visualizing historical content of web pages BIBAKFull-Text 1221-1222
  Adam Jatowt; Yukiko Kawai; Katsumi Tanaka
Recently, along with the rapid growth of the Web, preservation efforts have also increased. As a consequence, large amounts of past Web data are stored in Web archives. This historical data can be used to better understand long-term page topics and characteristics. In this paper, we propose an interactive visualization system called Page History Explorer for exploring page histories. It allows for roughly portraying the evolution of pages and summarizing their content over time. We use a temporal term cloud as a structure for visualizing prevailing and active terms appearing on pages in the past.
Keywords: history summarization, page history visualization, past web, web archive
Integrating the IAC neural network in ontology mapping BIBAKFull-Text 1223-1224
  Ming Mao; Yefei Peng; Michael Spring
Ontology mapping seeks to find semantic correspondences between similar elements of different ontologies. This paper proposes a neural network based approach to search for a global optimal solution that best satisfies ontology constraints. Experiments on OAEI benchmark tests show it dramatically improves the performance of preliminary mapping results.
Keywords: constraint satisfaction problem (csp), interactive activation and competition (IAC) neural network, ontology mapping, prior+
Fast algorithms for top-k personalized PageRank queries BIBAKFull-Text 1225-1226
  Manish Gupta; Amit Pathak; Soumen Chakrabarti
In entity-relation (ER) graphs (V,E), nodes V represent typed entities and edges E represent typed relations. For dynamic personalized PageRank queries, nodes are ranked by their steady-state probabilities obtained using the standard random surfer model. In this work, we propose a framework to answer top-k graph conductance queries. Our top-k ranking technique leads to a 4x speedup, and overall, our system executes queries 200-1600x faster than whole-graph PageRank. Some queries may contain hard predicates, i.e., predicates that must be satisfied by the answer nodes; e.g., we may seek authoritative papers on public key cryptography, but only those written during 1997. We extend our system to handle hard predicates. Our system achieves these substantial query speedups while consuming only 10-20% of the space taken by a regular text index.
Keywords: hubrank, node-deletion, pagerank, personalized, top-k
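A minimal sketch of the quantity being queried: personalized PageRank by power iteration, where the random surfer teleports to the query nodes instead of uniformly. Whole-graph iteration is exactly what the paper's indexing avoids; this only illustrates the semantics.
   import numpy as np
   A = np.array([[0, 1, 1, 0],   # adjacency matrix of a toy 4-node graph
                 [0, 0, 1, 0],
                 [1, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
   P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
   alpha = 0.85
   r = np.array([1.0, 0.0, 0.0, 0.0])    # teleport only to query node 0
   pi = np.full(4, 0.25)
   for _ in range(100):
       pi = alpha * (pi @ P) + (1 - alpha) * r
   print(np.argsort(-pi)[:2])  # top-2 nodes for this personalization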
Reasoning about similarity queries in text retrieval tasks BIBAKFull-Text 1227-1228
  Xiaohui Yu; Yang Liu
In many text retrieval tasks, it is highly desirable to obtain a "similarity profile" of the document collection for a given query. We propose sampling-based techniques to address this need, using calibration techniques to improve the accuracy. Experimental results confirm the effectiveness of the proposed approaches.
Keywords: calibration, query evaluation, sampling, text retrieval
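A minimal sketch of the plain-sampling baseline, assuming tf-idf-style vectors and cosine similarity: sample documents, compute their similarities to the query, and use the histogram as the estimated similarity profile (the paper's calibration step is not shown).
   import numpy as np
   rng = np.random.default_rng(0)
   collection = rng.random((10000, 50))  # toy document vectors
   query = rng.random(50)
   sample = collection[rng.choice(len(collection), size=200, replace=False)]
   sims = sample @ query / (np.linalg.norm(sample, axis=1)
                            * np.linalg.norm(query))
   hist, edges = np.histogram(sims, bins=10, range=(0.0, 1.0))
   print(hist)  # estimated distribution of query-document similarities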
Mashups for semantic user profiles BIBAKFull-Text 1229-1230
  Riddhiman Ghosh; Mohamed Dekhil
In this paper, we discuss challenges and provide solutions for capturing and maintaining accurate models of user profiles using semantic web technologies, by aggregating and sharing distributed fragments of user profile information spread over multiple services. Our framework for profile management allows for evolvable, extensible and expressive user profiles. We have implemented a prototype, targeting the retail domain, on the HP Labs Retail Store Assistant.
Keywords: information management, personalization, semantic web, user profiles
Using CEP technology to adapt messages exchanged by web services BIBAKFull-Text 1231-1232
  Yehia Taher; Marie-Christine Fauvet; Marlon Dumas; Djamal Benslimane
Web services may be unable to interact with each other because of incompatibilities between their interfaces. In this paper, we present an event-driven approach that aims at adapting the messages exchanged during service interactions. The proposed framework relies on Complex Event Processing (CEP) technology, which provides an environment for developing applications that need to continuously process, analyse and respond to event streams. Our main contribution is a system that enables developers to design and implement CEP-based adapters. These adapters are deployed in a CEP engine, which is responsible for continuously receiving messages and processing them according to the rules implemented by the adapters. The resulting transformed messages are then forwarded to their intended service recipients.
Keywords: cep technology, web service adaptation
Finding core members in virtual communities BIBAKFull-Text 1233-1234
  Haiqiang Chen; Xueqi Cheng; Yue Liu
Finding the core members of a virtual community is an important problem in community analysis. Here we present a simulated annealing algorithm that solves this problem by optimizing the concentration ratio of user interests within user groups. As an example, we test this algorithm on a virtual community site and evaluate its results against a human-annotated "gold standard".
Keywords: core members finding, simulated annealing, virtual community
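A minimal sketch of simulated annealing for selecting a core-member subset; the objective below is a stand-in for the paper's interest-concentration ratio, and all data are synthetic.
   import math, random
   random.seed(0)
   users = list(range(20))
   interest = {u: random.random() for u in users}  # toy interest weights
   def concentration(group):                       # stand-in objective
       return sum(interest[u] for u in group) / len(group) ** 0.5 if group else 0.0
   current = set(random.sample(users, 5))
   best, T = set(current), 1.0
   for _ in range(5000):
       cand = set(current)
       cand.symmetric_difference_update({random.choice(users)})  # toggle one user
       delta = concentration(cand) - concentration(current)
       if delta > 0 or random.random() < math.exp(delta / T):
           current = cand               # accept improvements always,
       if concentration(current) > concentration(best):  # worsenings with prob e^(delta/T)
           best = set(current)
       T *= 0.999                       # geometric cooling schedule
   print(sorted(best))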
Improving personalized services in mobile commerce by a novel multicriteria rating approach BIBAKFull-Text 1235-1236
  Qiudan Li; Chunheng Wang; Guanggang Geng
With the rapid growth of wireless technologies and mobile devices, there is a great demand for personalized services in m-commerce. Collaborative filtering (CF) is one of the most successful techniques for producing personalized recommendations for users. This paper proposes a novel approach to improve CF algorithms, in which the contextual information of a user and the multicriteria ratings of an item are considered alongside the typical information on users and items. The multilinear singular value decomposition (MSVD) technique is utilized to explore both explicit and implicit relations among users, items and criteria. We implement the approach on an existing m-commerce platform, and encouraging experimental results demonstrate its effectiveness.
Keywords: collaborative filtering, m-commerce, personalized service
Automatic web image selection with a probabilistic latent topic model BIBAKFull-Text 1237-1238
  Keiji Yanai
We propose a new method for selecting images relevant to given keywords from images gathered from the Web, based on the Probabilistic Latent Semantic Analysis (PLSA) model, a probabilistic latent topic model originally proposed for text document analysis. The experimental results show that the proposed method is almost equivalent to or outperforms existing methods. In addition, our method selects more diverse images than existing SVM-based methods.
Keywords: image recognition, web image mining
R-U-in?: doing what you like, with people whom you like BIBAKFull-Text 1239-1240
  Nilanjan Banerjee; Dipanjan Chakraborty; Koustuv Dasgupta; Sumit Mittal; Seema Nagar
This paper presents R-U-In? -- a social networking application that leverages Web 2.0 and IMS-based Converged Networks technologies to create a rich next-generation service. R-U-In? allows a user to search (in real-time) and solicit participation of like-minded partners for an activity of mutual interest (e.g. a rock concert, a soccer game, or a movie). It is an example of a situational mashup application that exploits content and capabilities of a Telecom operator, blended with Web 2.0 technologies, to provide an enhanced, value-added service experience.
Keywords: NGN, mashups, social networking, web 2.0
Behavioral classification on the click graph BIBAKFull-Text 1241-1242
  Martin Szummer; Nick Craswell
A bipartite query-URL graph, where an edge indicates that a document was clicked for a query, is a useful construct for finding groups of related queries and URLs. Here we use this behavior graph for classification. We choose a click graph sampled from two weeks of image search activity, and the task of "adult" filtering: identifying content in the graph that is inappropriate for minors. We show how to perform classification using random walks on this graph, and two methods for estimating classifier parameters.
Keywords: classification, click data
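A minimal sketch of classification by random walks on a bipartite query-URL click graph: the score of a labeled "adult" query is propagated back and forth along click edges until it stabilizes. The graph, labels, and clamping scheme are toy assumptions rather than the paper's exact estimators.
   import numpy as np
   # Rows = queries, columns = URLs; an entry is a click count.
   C = np.array([[3, 1, 0],
                 [0, 2, 0],
                 [0, 1, 4]], dtype=float)
   Pq = C / C.sum(axis=1, keepdims=True)      # query -> URL walk step
   Pu = (C / C.sum(axis=0, keepdims=True)).T  # URL -> query walk step
   q_score = np.array([1.0, 0.0, 0.0])        # query 0 is labeled adult
   for _ in range(50):
       u_score = q_score @ Pq                 # push scores to URLs
       q_score = u_score @ Pu                 # and back to queries
       q_score[0] = 1.0                       # keep the labeled query clamped
   print(u_score)  # URL scores: likelihood of adult content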
Budget constrained bidding in keyword auctions and online knapsack problems BIBAKFull-Text 1243-1244
  Yunhong Zhou; Deeparnab Chakrabarty; Rajan Lukose
We consider the budget-constrained bidding optimization problem for sponsored search auctions, and model it as an online (multiple-choice) knapsack problem. We design both deterministic and randomized algorithms for the online (multiple-choice) knapsack problem that achieve a provably optimal competitive ratio. This translates back into fully automatic bidding strategies maximizing either profit or revenue for the budget-constrained advertiser. Our bidding strategy for revenue maximization is oblivious (i.e., it requires no knowledge) of other bidders' prices and/or click-through rates for those positions. We evaluate our bidding algorithms using both synthetic data and real bidding data gathered manually, and also discuss a sniping heuristic that strictly improves bidding performance. With sniping and parameter tuning enabled, our bidding algorithms achieve a performance ratio above 90% against the optimum obtained by an omniscient bidder.
Keywords: keyword bidding, multiple-choice knapsack problem, online knapsack problem, sponsored search auction
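A minimal sketch of a threshold rule for the online knapsack problem in the spirit of this paper: accept an item only if its value-to-weight ratio beats a threshold that rises exponentially as the budget fills. L and U are assumed lower/upper bounds on the ratio; the paper's full bidding strategies differ in detail.
   import math
   budget, spent, L, U = 100.0, 0.0, 1.0, 10.0
   def psi(z):
       # Threshold as a function of the fraction z of budget already spent:
       # psi(0) = L/e (accept almost anything), psi(1) = U (accept almost nothing).
       return (U * math.e / L) ** z * (L / math.e)
   def bid(value, weight):
       """Place the bid only if it clears the current threshold."""
       global spent
       if weight <= budget - spent and value / weight >= psi(spent / budget):
           spent += weight
           return True
       return False
   for v, w in [(9.0, 1.0), (2.0, 1.0), (5.0, 1.0)]:
       print(bid(v, w))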
Social and semantics analysis via non-negative matrix factorization BIBAKFull-Text 1245-1246
  Zhi-li Wu; Chi-Wa Cheng; Chun-hung Li
Social media such as Web forums often exhibit dense interactions between users and content, for which network models are appropriate for analysis. A joint non-negative matrix factorization model of participation and content data, which can be viewed as a bipartite graph model between users and media, is proposed for analyzing social media. The factorizations allow simultaneous automatic discovery of leaders and sub-communities in a Web forum as well as the core latent topics of the forum. Results on topic detection in Web forums and cluster analysis show that social features are highly effective for forum analysis.
Keywords: algorithm, graph theory
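A minimal sketch of the core building block, non-negative matrix factorization with Lee-Seung multiplicative updates; the joint factorization of participation and content data described above couples two such factorizations and is not shown.
   import numpy as np
   rng = np.random.default_rng(0)
   X = rng.random((8, 6))        # toy user-by-term matrix (e.g. word counts)
   k = 2                         # number of latent topics / sub-communities
   W = rng.random((8, k)) + 0.1  # user-topic weights
   H = rng.random((k, 6)) + 0.1  # topic-term weights
   for _ in range(200):
       H *= (W.T @ X) / (W.T @ W @ H + 1e-9)  # multiplicative updates keep
       W *= (X @ H.T) / (W @ H @ H.T + 1e-9)  # every factor non-negative
   print(np.round(H, 2))         # latent topics over terms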
Incremental web page template detection BIBAKFull-Text 1247-1248
  Yu Wang; Bingxing Fang; Xueqi Cheng; Li Guo; Hongbo Xu
Most template detection methods process web pages in batches, so a newly crawled page cannot be processed until enough pages have been collected. This results in large storage consumption and a long delay in data refreshing. In this paper, we present an incremental framework for detecting templates, in which a page is processed as soon as it has been crawled. In this framework, we do not need to cache any web pages. Experiments show that our framework consumes less than 7% of the storage of traditional methods, and the speed of data refreshing is accelerated because of the incremental manner.
Keywords: automatic template removal, web page segmentation
Rogue access point detection using segmental TCP jitter BIBAKFull-Text 1249-1250
  Gaogang Xie; Tingting He; Guangxing Zhang
Rogue Access Points (RAPs) pose serious security threats to local networks. An analytic model of the prior probability distribution of Segmental TCP Jitter (STJ) is deduced from the mechanism of the IEEE 802.11 MAC Distributed Coordination Function (DCF) and used to differentiate wired and WLAN connections, which is the crucial step in detecting RAPs. As a detection metric, STJ reflects the characteristics of the 802.11 MAC better than ACK-pairs, since it eliminates the delay caused by packet transmission. Experiments on an operational network show that the average detection ratio of the STJ-based algorithm is more than 92.8% and the average detection time is less than 1s, improvements of 20% and 60%, respectively, over the ACK-pair approach. Furthermore, no WLAN training trace is needed by the detection algorithm.
Keywords: analytic model, rogue ap, segmental tcp jitter, sequential hypothesis testing

Panels

Towards the policy-aware web: the real web 3.0? BIBAKFull-Text 1251-1252
  Renato Iannella
The sharing and control of information on the Web need to be balanced to provide the web community with the best experience and outcomes. The Policy-Aware Web is an emerging direction that may hold the key to getting this balance right by allowing future policy languages to maintain this harmonization. This panel will discuss, highlight, and debate the challenges in moving towards the Policy-Aware Web and the paths it could take. The target audience includes all web stakeholders, as most policies, such as privacy and rights, are intrinsic to all web users, information, services and content.
Keywords: identity, policy-aware web, privacy, rights
Protecting the web: phishing, malware, and other security threats BIBAKFull-Text 1253-1254
  Greg Aaron; Katharine A. Bostik; Rod Rasmussen; Edmon Chung
Web site operators, Internet users, and online service providers are besieged by a growing array of abuses and threats. Spam leads users to online scams and phishing Web pages, which cyber-criminals implant on innocent sites to fool the unwary into revealing their financial data and passwords. Other criminals use Web sites to spread malware, which can steal personal data or conscript users' computers into a botnet that can be used to send spam or mount cyber-attacks against Web sites and other Internet services. Together, these abuses undermine user trust, hamper e-commerce, and cost the Internet community huge losses in money, service and support costs, and time.
   What should Web site operators and online service providers do to protect themselves and their users? What are Internet companies, organizations, and law enforcement doing (and not doing) to combat these problems? And how can the international Internet community work together on these problems? The panel brings together representatives from the chain of organizations that respond to Internet abuse problems, and promises a lively, compelling, and relevant discussion.
Keywords: dns, security
The future of online social interactions: what to expect in 2020 BIBAKFull-Text 1255-1256
  Ajita John; Lada Adamic; Marc Davis; Frank Nack; David A. Shamma; Doree D. Seligmann
Blogs, wikis, tagging, podcasts, and social networking websites such as MySpace, Facebook, Flickr and YouTube have radically changed user interactions on the World Wide Web from a static, one-way consumption model to a dynamic, multi-way participation model. Broad user power and flexibility have changed how people engage in and experience their interconnections, interests, and collaborations. Online social interactions will evolve in the next decade to address the growing needs of their user community and will enter many aspects of our lives. This evolution may well be among the most exciting of our times, one in which the individual and collective power of people to contribute and share content, experiences, ideas, expertise, etc. is enhanced even further than it is today. That enhancement may be shaped by a better understanding of user needs and behavior, and enabled by the seamless convergence of multi-modal technologies, new applications, new domains, data mining, and better navigation and search capabilities. Some of these changes will also permeate the workplace and change the way we work. This panel will discuss how online social interactions may evolve in the next decade and what impact this may have on diverse dimensions of our world.
Keywords: online interactions, social media, social networks
Information "uptrieval": exploring models for content assimilation and aggregation for developing regions BIBAKFull-Text 1257-1258
  Sheetal K. Agarwal; Arun Kumar; Sougata Mukherjea; Amit A. Nanavati; Nitendra Rajput
Information retrieval on the WWW is important because it is hard to find what one is looking for: there is a plethora of information available, and searching for relevant information is a challenge. In developing regions, we have the opposite problem: 1) information about global markets is scarce; most consumers and producers (of information as well as goods) are relegated to local markets in their geographical vicinity, and in order to reach wider markets, it is important for this local information to reach wider audiences (local information for global consumption, the LIG model); 2) at the same time, locally relevant information, such as delays in bus/train timings, mobile medical van schedule changes, and electricity outage timings, is not easily available either (local information for local consumption, the LIL model). We introduce the term Information Uptrieval to denote the reverse problem of acquiring, assimilating, aggregating and uploading global and local information relevant to developing regions onto a platform that improves the reach of that information. While the WWW is an obvious example of such a platform, given the low Internet penetration in these regions we need to explore effective alternatives. Several innovative but disconnected approaches have been attempted to address the information uptrieval problem, ranging from the use of DVDs (eSagu, http://www.esagu.in/esagu) to wireless stations on motorcycles (First Mile Solutions, http://www.firstmilesolutions.com). Many of these have met with reasonable success in their pilot deployments.
Keywords: developing regions, information management
Rich media and web 2.0 BIBAKFull-Text 1259-1260
  Edward Chang; Ken Ong; Susanne Boll; Wei-Ying Ma
Rich media data, such as video, imagery, music, and gaming, no longer play just a supporting role to text data on the World Wide Web. Thanks to Web 2.0, rich media is the primary content on sites such as Flickr, PicasaWeb, YouTube, and QQ. Because of massive user-generated content, the volume of rich media transmitted on the Internet has surpassed that of text. It is vital to manage these data properly to ensure efficient bandwidth utilization, to support effective indexing and search, and to safeguard copyrights (to name a few concerns). This panel invites both researchers and practitioners to discuss the challenges of Web-scale media-data management. In particular, the panelists will address issues such as leveraging rich media and Web 2.0, indexing, search, and scalability.
Keywords: rich media, web 2.0

Workshops

Location and the web (LocWeb 2008) BIBAKFull-Text 1261-1262
  Susanne Boll; Christopher Jones; Eric Kansa; Puneet Kishor; Mor Naaman; Ross Purves; Arno Scharl; Erik Wilde
The World Wide Web has become the world's largest networked information resource, but references to geographical locations remain unstructured and typically implicit in nature. This lack of explicit spatial knowledge within the Web makes it difficult to serve user needs for location-specific information. At present, spatial knowledge is hidden in many small information fragments such as addresses on Web pages, photos annotated with GPS co-ordinates, geographic mapping applications, and geotags in user-generated content. Several emerging formats that primarily or secondarily include location metadata, like GeoRSS, KML, and microformats, aim to improve this state of affairs. However, the question remains how to extract, index, mine, find, view, mash up, and exploit Web content using its location semantics. This workshop brings together researchers from academia and industry labs to discuss and present the latest results and trends in all facets of the relationships between location concepts and Web information.
Keywords: geographic data, geolocation, geospatial, location, location-based services
WS3: international workshop on context-enabled source and service selection, integration and adaptation (CSSSIA 2008) BIBAKFull-Text 1263-1264
  Quan Z. Sheng; Ullas Nambiar; Amit P. Sheth; Biplav Srivastava; Zakaria Maamar; Said Elnaffar
This write-up provides a summary of the International Workshop on Context-enabled Source and Service Selection, Integration and Adaptation (CSSSIA 2008), organized in conjunction with WWW 2008 in Beijing, China, on April 22, 2008. We outline the motivation for organizing the workshop, briefly describe the organizational details and program of the workshop, and summarize each of the papers accepted by the workshop. More information about the workshop can be found at http://www.cs.adelaide.edu.au/~csssia08/.
Keywords: context awareness, service and data integration, web service
Linked data on the web (LDOW2008) BIBAKFull-Text 1265-1266
  Christian Bizer; Tom Heath; Kingsley Idehen; Tim Berners-Lee
The Web is increasingly understood as a global information space consisting not just of linked documents, but also of Linked Data. More than just a vision, the resulting Web of Data has been brought into being by the maturing of the Semantic Web technology stack, and by the publication of an increasing number of datasets according to the principles of Linked Data.
   The Linked Data on the Web (LDOW2008) workshop brings together researchers and practitioners working on all aspects of Linked Data. The workshop provides a forum to present the state of the art in the field and to discuss ongoing and future research challenges. In this workshop summary we will outline the technical context in which Linked Data is situated, describe developments in the past year through initiatives such as the Linking Open Data community project, and look ahead to the workshop itself.
Keywords: linked data, rdf, semantic web, web of data
Fourth international workshop on adversarial information retrieval on the web (AIRWeb 2008) BIBAKFull-Text 1267-1268
  Carlos Castillo; Kumar Chellapilla; Dennis Fetterly
Adversarial IR in general, and search engine spam in particular, are engaging research topics with real-world impact for Web users, advertisers and publishers. The AIRWeb workshop will bring researchers and practitioners in these areas together to present and discuss state-of-the-art techniques as well as real-world experiences. Given the continued growth in search engine spam creation and detection efforts, we expect interest in this AIRWeb workshop to surpass that of the previous three editions (held jointly with WWW 2005, SIGIR 2006, and WWW 2007, respectively).
Keywords: adversarial information retrieval, search engine spam, spamdexing, web spam
First workshop on targeting and ranking for online advertising BIBAKFull-Text 1269-1270
  Ewa Dominowska; Vanja Josifovski
Online advertising is a rapidly growing, multi-billion dollar industry. It has become a significant element of the Web browsing experience. Online advertising providers use sophisticated ad targeting and ranking algorithms with the dual aim of maximizing revenue while providing a superior user experience. As a result, advertising optimization is a very complex research problem, since it combines relevance with user interaction models, advertiser valuations, and commercial constraints. Online advertising integrates a number of core research areas: machine learning, data mining, search, auction theory, and user modeling.
   This workshop is intended to serve as an open forum for discussion of new ideas and current research in the field of online advertising. We expect that the workshop will promote a community of researchers interested in this area and yield future collaboration and exchanges. The research included should address problems faced by advertisers, end-users, advertising platforms, and the market.
Keywords: online advertising, ranking, targeting of online advertising
WS7 -- MobEA VI: personal rich social media BIBAKFull-Text 1271-1272
  Rittwik Jana; Daniel Appelquist; Galit Zadok; Bin Wei
2008 will be the "Mobile Eureka! Moment". With the Olympics just around the corner and the mobile market experiencing phenomenal growth, we believe that WWW 2008 will truly be the climax and focal point in the midst of this mobile revolution. The Mobile Web Initiative, spearheaded by the W3C, is making a strong stand on how to realize the vision of pervasive mobile computing. Three-screen services (TV, PC, phone) demand new architectures. Furthermore, the need to go beyond technology demands an embrace of the human-centric aspects of mobile computing. The objective of this workshop is to provide a single forum for researchers, sociologists, and technologists to discuss the state of the art, present their contributions, and set future directions in personalized applications for mobile users, with a focus on rich social media.
   The transition from the wired Internet to the mobile web has generated a wide array of novel services bringing in billions of dollars in revenue in the recent years. So there are a few natural questions that one may ask. "What is "the" next big thing in the mobile web as far as applications and services are concerned?", "How will operators provide these services efficiently with this next wave of applications?" and "What are the emerging business models that drive the next generation of services?".
   In a nutshell, to answer some of the questions above, we believe that in general there is a convergence of the fixed and mobile worlds by means of standards like IP Multimedia Subsystems (IMS) [1]. This convergence allows enterprise and consumer applications to be blended via a unified infrastructure. This unification results in a number of advantages, such as a) application availability from any access method, b) enhanced user experience for combined/blended services (presence, messaging, address book, etc.), c) centralized user profiles shared between applications and common provisioning, d) flexible charging and billing of multimedia services in particular, e) scalable deployments with guaranteed Quality of Service and, last but not least, f) secure solutions through built-in identity management, authentication and authorization procedures.
   There are a variety of novel applications that are starting to emerge. A general shift from static content to a more dynamic nature has been observed in recent years. In particular, multimedia and social networking are the foundations of emerging applications. One such example is the delivery of IPTV to the third screen (mobile, PDA). Social networking/messaging (applications like Twitter) and peer-to-peer applications are also on the verge of breakthroughs in the mobile landscape. Interestingly, a lot of the momentum has been reinvigorated with the introduction of smart phones like Apple's iPhone [2]. Devices are becoming more powerful and standards are being written to take advantage of these smart features.
   Finally, we believe that the mobile web is at the forefront of technology and this workshop will discuss many of these technical and business challenges that are on the wish list of everyday practitioners.
Keywords: mobile web, rich media
Report on semantic web for health care and life sciences workshop BIBAKFull-Text 1273-1274
  Huajun Chen; Kei Cheung; Michel Dumontier; Eric Prudhommeaux; Alan Ruttenberg; Susie Stephens; Yimin Wang
The Semantic Web for Health Care and Life Sciences Workshop will be held in Beijing, China, on April 22, 2008. The goal of the workshop is to foster the development and advancement of Semantic Web technologies to facilitate collaboration, research and development, and innovation adoption in the domains of health care and life sciences. We also encourage the participation of all research communities in this event, with enhanced participation from Asia due to the location of the event. The workshop consists of two invited keynote talks, eight peer-reviewed presentations, and one panel discussion.
Keywords: biomedical informatics, data integration, health care, life sciences, ontology, semantic web
International workshop on question answering on the web (QAWeb2008) BIBAKFull-Text 1275-1276
  Liu Wenyin; Qing Li; Xuedong Huang
This half-day, single-track workshop is designed to gather academic researchers and industrial practitioners to share ideas and know-how, and to discuss all relevant issues, including business models, enabling technologies, and killer applications, of Web-based question answering (QA), especially user-interactive QA services and applications. The workshop program consists of two sessions, one for academic papers and the other for industrial practice papers. Each session consists of a leading talk followed by three short presentations. Sufficient time is allocated for brainstorming discussions in each session.
Keywords: question answering
WWW 2008 workshop: NLPIX2008 summary BIBAKFull-Text 1277-1278
  Hiroshi Nakagawa; Kentaro Torisawa; Masaru Kitsuregawa
The amount of information available on the Web has increased rapidly, reaching levels that few would ever have imagined possible. We live in what could be called the "information-explosion era," and this situation poses new problems for computer scientists. Users demand useful and reliable information from the Web in the shortest time possible, but the obstacles to fulfilling this demand are many, including language barriers and the so-called "long tail." Even worse, users may provide only vague specifications of the information that they actually want, so that a more concrete specification must somehow be inferred by Web access tools. Natural language processing (NLP) is one of the key technologies for solving these Web usability problems. Almost all Web pages provide their essential information in the form of natural language text, and the amount of this text is huge. To offer solutions to these problems, we must search and extract information from Web texts using NLP technologies. The aim of this workshop, NLP Challenges in the Information Explosion Era (NLPIX 2008), is to bring researchers and practitioners together to discuss our most pressing needs with respect to accessing information on the Web, and to discuss new ideas in NLP technologies that might offer viable solutions to those issues.
Keywords: natural language processing, web
Workshop on social web and knowledge management (SWKM2008) BIBAKFull-Text 1279-1280
  Peter Dolog; Markus Kroetzsch; Sebastian Schaffert; Denny Vrandecic
This paper provides an overview of the synergies between the social web and knowledge management, the workshop topics, the program committee members, and a summary of the accepted papers for the SWKM2008 workshop.
Keywords: knowledge management, semantic web, social web, web 2.0
WWW 2008 workshop on social web search and mining: SWSM2008 BIBKFull-Text 1281-1282
  Juanzi Li; Gui-rong Xue; Jie Tang; Ying Ding
Keywords: social web, web mining, web search