HCI Bibliography Home | HCI Conferences | CIKM Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
CIKM Tables of Contents: 0809101112131415

Proceedings of the 2009 ACM Conference on Information and Knowledge Management

Fullname:Proceedings of the 18th ACM conference on Information and knowledge management
Editors:David Cheung; Il-Yeol Song; Wesley Chu; Xiaohua Hu; Jimmy Lin
Location:Hong Kong, China
Dates:2009-Nov-02 to 2009-Nov-06
Standard No:ISBN: 978-1-60558-512-3; ACM DL: Table of Contents; hcibib: CIKM09
Links:Conference Website
  1. KM information extraction I
  2. IR web search
  3. DB XML data processing, filtering, routing, & algorithms
  4. IR domain specific retrieval II
  5. KM information extraction II
  6. IR personalization & social search II
  7. DB string databases, blogs, & social search
  8. KM advance mining techniques
  9. KM text mining
  10. IR crawling & indexing
  11. DB novel data management & data mining tools
  12. KM semantic techniques & applications
  13. KM graph mining
  14. IR evaluation
  15. DB information integration, data provenance, probabilistic databases
  16. Industry information retrieval
  17. KM information filtering and recommender systems
  18. IR ranking and retrieval models I
  19. DB streams, network databases
  20. Panel information extraction meets relational databases: where are we heading?
  21. KM classification and clustering II
  22. IR domain-specific retrieval I
  23. DB data warehousing and OLAP
  24. Industry data mining framework and applications
  25. KM link analysis and social computing
  26. KM data summarization
  27. Industry data and query similarity
  28. IR personalization and social search I
  29. IR ranking and retrieval models II
  30. KM classification and clustering II
  31. Industry call and web center, e-commerce related technologies
  32. Poster session 1: DB track
  33. Poster session 2: DB track
  34. Poster session 3: IR track
  35. Poster session 4: KM track
  36. Poster session 5: KM track
  37. Poster session 6: IR track
  38. Poster session 7: IR track
  39. Poster session 8: IR track
  40. Demo session 1: stream data management and OLAP
  41. Demo session 2: semantic web, information extraction & knowledge management
  42. Demo session 3: advanced techniques & applications
  43. Demo session 4: XML data processing & system architecture
  44. CIKM 2009 co-located workshop overviews
DB-IR integration and its application to a massively-parallel search engine BIBAFull-Text 1-2
  Kyu-Young Whang
Nowadays, as there is an increasing need to integrate the DBMS (for structured data) with Information Retrieval (IR) features (for unstructured data), DB-IR integration is becoming one of major challenges in the database area[1,2]. Extensible architectures provided by commercial object-relational DBMS (ORDBMS) vendors can be used for DB-IR integration. Here, extensions are implemented using a high-level (typically, SQL-level) interface. We call this architecture loose-coupling. The advantage of loose-coupling is ease of implementation. But, loose-coupling is not preferable for implementing new data types and operations in large databases when high performance is required. In this talk, we present a new DBMS architecture applicable to DB-IR integration, which we call tight-coupling. In tight-coupling, new data types and operations are integrated into the core of the DBMS engine in the extensible type layer. Thus, they are incorporated as the "first-class citizens"[1] within the DBMS architecture and are supported in a consistent manner with high performance. This tight-coupling architecture is being used to incorporate IR features and spatial database features into the Odysseus ORDBMS that has been under development at KAIST/AITrc for over 19 years. In this talk, we introduce Odysseus and explain its tightly-coupled IR features (U.S. patented in 2002[2]). Then, we demonstrate excellence in performance of tight-coupling by showing benchmark results. We have built a web search engine that is capable of managing 100 million web pages per node in a non-parallel configuration using Odysseus. This engine has been successfully tested in many commercial environments. This work won the Best Demonstration Award from the IEEE ICDE conference held in Tokyo, Japan, in April 2005[3]. Last, we present a design of a massively-parallel search engine using Odysseus. Recently, parallel search engines have been implemented based on scalable distributed file systems (e.g., GFS). Nevertheless, building a massively-parallel search engine using a DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system while providing scalability. The parallel search engine designed is capable of indexing 30 billion web pages with a performance comparable to or better than those of state-of-the-art search engines.
Confucius and "its" intelligent disciples BIBAFull-Text 3
  Edward Y. Chang
Confucius is a great teacher in ancient China. His theories and principles were effectively spread throughout China by his disciples. Confucius is the product code name of Google's Knowledge Search product, which is developed at Google Beijing office. In this talk, I present Knowledge Search's key disciples, which are data management subroutines that generate labels for questions, that match existing answers to a question, that evaluate quality of answers, that rank users based on their contributions, that distill high-quality answers for search engines to index, and that route questions to domain experts, and etc. This talk presents scalable algorithms that we have devised to make these disciples effective in dealing with huge datasets. Efforts in making these algorithms run even faster on thousands of machines, and some open research problems will also be presented.
Advanced metasearch engines BIBAFull-Text 5
  Clement T. Yu
A metasearch engine is a system, which is connected to different search engines. In response to a user query, it invokes suitable search engines for the query, merges the information returned by these search engines and output the merged result. There are two types of metasearch engines: one type for unstructured data (mostly text) and the other for structured data. In comparison to a text search engine, a metasearch engine can have a higher coverage of the Web and can have more timely information. A metasearch engine for structured data facilitates comparison shopping and services and is convenient to use. In this talk, we discuss the problems and their potential solutions. In addition, challenges and unsolved problems are sketched.

KM information extraction I

StereoTrust: a group based personalized trust model BIBAFull-Text 7-16
  Xin Liu; Anwitaman Datta; Krzysztof Rzadca; Ee-Peng Lim
Trust plays important roles in diverse decentralized environments, including our society at large. Computational trust models help to, for instance, guide users' judgements in online auction sites about other users; or determine quality of contributions in web 2.0 sites. Most of the existing trust models, however, require historical information about past behavior of a specific agent being evaluated -- information that is not always available. In contrast, in real life interactions among users, in order to make the first guess about the trustworthiness of a stranger, we commonly use our "instinct" -- essentially stereotypes developed from our past interactions with "similar" people. We propose StereoTrust, a computational trust model inspired by real life stereotypes. A user forms stereotypes using her previous transactions with other agents. A stereotype contains certain features of agents and an expected outcome of the transaction. These features can be taken from agents' profile information, or agents' observed behavior in the system. When facing a stranger, the stereotypes matching stranger's profile are aggregated to derive his expected trust. Additionally, when some information about stranger's previous transactions is available, StereoTrust uses it to refine the stereotype matching. According to our experiments, StereoTrust compares favorably with existing trust models that use different kind of information and more complete historical information. Moreover, because evaluation is done according to user's personal stereotypes, the system is completely distributed and the result obtained is personalized. StereoTrust can be used as a complimentary mechanism to provide the initial trust value for a stranger, especially when there is no trusted, common third parties.
An empirical study on using hidden Markov model for search interface segmentation BIBAFull-Text 17-26
  Ritu Khare; Yuan An
This paper describes a hidden Markov model (HMM) based approach to perform search interface segmentation. Automatic processing of an interface is a must to access the invisible contents of deep Web. This entails automatic segmentation, i.e., the task of grouping related components of an interface together. While it is easy for a human to discern the logical relationships among interface components, machine processing of an interface is difficult. In this paper, we propose an approach to segmentation that leverages the probabilistic nature of the interface design process. The design process involves choosing components based on the underlying database query requirements, and organizing them into suitable patterns. We simulate this process by creating an "artificial designer" in the form of a 2-layered HMM. The learned HMM acquires the implicit design knowledge required for segmentation. We empirically study the effectiveness of the approach across several representative domains of deep Web. In terms of segmentation accuracy, the HMM-based approach outperforms an existing state-of-the-art approach by at least 10% in most cases. Furthermore, our cross-domain investigation shows that a single HMM trained on data having varied and frequent design patterns can accurately segment interfaces from multiple domains.
Query by analogical example: relational search using web search engine indices BIBAFull-Text 27-36
  Makoto P. Kato; Hiroaki Ohshima; Satoshi Oyama; Katsumi Tanaka
We describe methods to search with a query by example in a known domain for information in an unknown domain by exploiting Web search engines. Relational search is an effective way to obtain information in an unknown field for users. For example, if an Apple user searches for Microsoft products, similar Apple products are important clues for the search. Even if the user does not know keywords to search for specific Microsoft products, the relational search returns a product name by querying simply an example of Apple products. More specifically, given a tuple containing three terms, such as (Apple, iPod, Microsoft), the term Zune can be extracted from the Web search results, where Apple is to iPod what Microsoft is to Zune. As a previously proposed relational search requires a huge text corpus to be downloaded from the Web, the results are not up-to-date and the corpus has a high construction cost. We introduce methods for relational search by using Web search indices. We consider methods based on term co-occurrence, on lexico-syntactic patterns, and on combinations of the two approaches. Our experimental results showed that the combination methods got the highest precision, and clarified the characteristics of the methods.
Semi-supervised learning of semantic classes for query understanding: from the web and for the web BIBAFull-Text 37-46
  Ye-Yi Wang; Raphael Hoffmann; Xiao Li; Jakub Szymanski
Understanding intents from search queries can improve a user's search experience and boost a site's advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of needed manual annotations, but much human effort is still required to maintain these as search topics evolve over time.
   This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance -- even with far less training data. We focus our study on understanding the correct objectives for the semi-supervised lexicon learning algorithms that are crucial for the success of query tagging. Prior work on lexicon acquisition has largely focused on the precision of the lexicons, but we show that precision is not important if the lexicons are used for query tagging. A more adequate criterion should emphasize a trade-off between maximizing the recall of semantic class instances in the data, and minimizing the confusability. This ensures that the similar levels of precision and recall are observed on both training and test set, hence prevents over-fitting the lexicon features. Experimental results on retail product queries show that enhancing a query tagger with lexicons learned with this objective reduces word level tagging errors by up to 25% compared to the baseline tagger that does not use any lexicon features. In contrast, lexicons obtained through a precision-centric learning algorithm even degrade the performance of a tagger compared to the baseline. Furthermore, the proposed method outperforms one in which semantic class lexicons have been extracted from a database.
Efficient record-level wrapper induction BIBAFull-Text 47-56
  Shuyi Zheng; Ruihua Song; Ji-Rong Wen; C. Lee Giles
Web information is often presented in the form of record, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, how to identify relevant records as well as their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have issues dealing with web records since they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel "broom" structure to represent both records and generated wrappers. With such representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99% extraction accuracy in terms of F1-Value.

IR web search

What happens after an ad click?: quantifying the impact of landing pages in web advertising BIBAFull-Text 57-66
  Hila Becker; Andrei Broder; Evgeniy Gabrilovich; Vanja Josifovski; Bo Pang
Unbeknownst to most users, when a query is submitted to a search engine two distinct searches are performed: the organic or algorithmic search that returns relevant Web pages and related data (maps, images, etc.), and the sponsored search that returns paid advertisements. While an enormous amount of work has been invested in understanding the user interaction with organic search, surprisingly little research has been dedicated to what happens after an ad is clicked, a situation we aim to correct.
   To this end, we define and study the process of context transfer, that is, the user's transition from Web search to the context of the landing page that follows an ad-click. We conclude that in the vast majority of cases the user is shown one of three types of pages, namely, Homepage (the homepage of the advertiser), Category browse (a browse-able sub-catalog related to the original query), and Search transfer (the search results of the same query re-executed on the target site). We show that these three types of landing pages can be accurately distinguished using automatic text classification. Finally, using such an automatic classifier, we correlate the landing page type with conversion data provided by advertisers, and show that the conversion rate (i.e., users' response rate to ads) varies considerably according to the type. We believe our findings will further the understanding of users' response to search advertising in general, and landing pages in particular, and thus help advertisers improve their Web sites and help search engines select the most suitable ads.
Characterizing commercial intent BIBAFull-Text 67-76
  Azin Ashkan; Charles L. A. Clarke
Understanding the intent underlying user's queries may help personalize search results and therefore improve user satisfaction. We develop a methodology for using the content of search engine result pages (SERPs) along with the information obtained from query strings to study characteristics of query intent, with a particular focus on sponsored search. This work represents an initial step towards the development and evaluation of an ontology for commercial search, considering queries that reference specific products, brands and retailers. The characteristics of query categories are studied with respect to aggregated user's clickthrough behavior on advertising links. We present a model for clickthrough behavior that considers the influence of such factors as the location of ads and the rank of ads, along with query category. We evaluate our work using a large corpus of clickthrough data obtained from a major commercial search engine. Our findings suggest that query based features, along with the content of SERPs, are effective in detecting query intent. The clickthrough behavior is found to be consistent with the classification for the general categories of query intent, while for product, brand and retailer categories, all is true to a lesser extent.
Analyzing and evaluating query reformulation strategies in web search logs BIBAFull-Text 77-86
  Jeff Huang; Efthimis N. Efthimiadis
Users frequently modify a previous search query in hope of retrieving better results. These modifications are called query reformulations or query refinements. Existing research has studied how web search engines can propose reformulations, but has given less attention to how people perform query reformulations. In this paper, we aim to better understand how web searchers refine queries and form a theoretical foundation for query reformulation. We study users' reformulation strategies in the context of the AOL query logs. We create a taxonomy of query refinement strategies and build a high precision rule-based classifier to detect each type of reformulation. Effectiveness of reformulations is measured using user click behavior. Most reformulation strategies result in some benefit to the user. Certain strategies like add/remove words, word substitution, acronym expansion, and spelling correction are more likely to cause clicks, especially on higher ranked results. In contrast, users often click the same result as their previous query or select no results when forming acronyms and reordering words. Perhaps the most surprising finding is that some reformulations are better suited to helping users when the current results are already fruitful, while other reformulations are more effective when the results are lacking. Our findings inform the design of applications that can assist searchers; examples are described in this paper.
Characterizing and predicting search engine switching behavior BIBAFull-Text 87-96
  Ryen W. White; Susan T. Dumais
Search engine switching describes the voluntarily transition from one Web search engine to another. In this paper we present a study of search engine switching behavior that combines large-scale log-based analysis and survey data. We characterize aspects of switching behavior, and develop and evaluate predictive models of switching behavior using features of the active query, the current session, and user search history. Our findings provide insight into the decision-making processes of search engine users and demonstrate the relationship between switching and factors such as dissatisfaction with the quality of the results, the desire for broader topic coverage or verification of encountered information, and user preferences. The findings also reveal sufficient consistency in users' search behavior prior to engine switching to afford accurate prediction of switching events. Predictive models may be useful for search engines who may want to modify the search experience if they can accurately anticipate a switch.
Clustering and exploring search results using timeline constructions BIBAFull-Text 97-106
  Omar Alonso; Michael Gertz; Ricardo Baeza-Yates
Time is an important dimension of any information space and can be very useful in information retrieval and in particular clustering and exploration of search results. Search result clustering is a feature integrated in some of today's search engines, allowing users to further explore search results. However, only little work has been done on exploiting temporal information embedded in documents for the presentation, clustering, and exploration of search results along well-defined timelines.
   In this paper, we present an add-on to traditional information retrieval applications in which we exploit various temporal information associated with documents to present and cluster documents along timelines. Temporal information expressed in the form of, e.g., date and time tokens or temporal references, appear in documents as part of the textual context or metadata. Using temporal entity extraction techniques, we show how temporal expressions are made explicit and used in the construction of multiple-granularity timelines. We discuss how hit-list based search results can be clustered according to temporal aspects, anchored in the constructed timelines, and how time-based document clusters can be used to explore search results that include temporal snippets. We also outline a prototypical implementation and evaluation that demonstrates the feasibility and functionality of our framework.

DB XML data processing, filtering, routing, & algorithms

Effective, design-independent XML keyword search BIBAFull-Text 107-116
  Arash Termehchy; Marainne Winslett
Keyword search techniques that take advantage of XML structure make it very easy for ordinary users to query XML databases, but current approaches to processing these queries rely on intuitively appealing heuristics that are ultimately ad hoc. These approaches often retrieve irrelevant answers, overlook relevant answers, and cannot rank answers appropriately. To address these problems for data-centric XML, we propose coherency ranking (CR), a domain- and database design-independent ranking method for XML keyword queries that is based on an extension of the concept of mutual information. With CR, the results of a keyword query are invariant under schema reorganization. We analyze how previous approaches to XML keyword search approximate CR, and present efficient algorithms to perform CR. Our empirical evaluation with 65 user-supplied queries over two real-world XML data sets shows that CR has better precision and recall and provides better ranking than all previous approaches.
Efficient processing of twig pattern matching in fuzzy XML BIBAFull-Text 117-126
  Jian Liu; Z. M. Ma; Li Yan
In order to find all occurrences of a twig pattern in XML documents, a considerable amount of twig pattern matching algorithms have been proposed. At the same time, previous work mainly focuses on twig pattern query under the complete semantics. However, there is often a need to produce partial answers because XML data may have missing sub-elements. Furthermore, the existed works fall short in their ability to support twig pattern query under different semantics in fuzzy XML. In this paper, we study the problem of twig matches in fuzzy XML. We begin by introducing the extended region scheme to accurately and effectively represent nodes information in fuzzy XML. We then discuss the fuzzy query semantics and compute the membership information by using Einstein operator instead of Zadeh's min-max technique. On the basis, we propose two efficient algorithms for querying twig under complete and incomplete semantics in fuzzy XML. The experimental results show that our proposed algorithms can perform on the fuzzy twig pattern matching efficiently.
Dissemination of heterogeneous XML data in publish/subscribe systems BIBAFull-Text 127-136
  Yuan Ni; Chee Yong Chan
The publish-subscribe paradigm is an effective approach for data publishers to asynchronously disseminate relevant data to a large number of data subscribers. A lot of recent research has focused on extending this paradigm to support content-based delivery of XML data using more expressive XML-based subscription specifications that allow constraints on both data contents as well as structure. However, due to the heterogeneous data schemas used by different data publishers even for data in the same domain, an important challenge is how to efficiently and effectively disseminate relevant data to subscribers whose subscriptions might be specified based on schemas that are different from those used by the data publishers. In this paper, we examine the options to resolve this schema heterogeneity problem in XML data dissemination, and propose a novel paradigm that is based on data rewriting. Our experimental results demonstrate the effectiveness of the data rewriting paradigm and identifies the tradeoffs of the various approaches.
Linear inclusion for XML regular expression types BIBAFull-Text 137-146
  Dario Colazzo; Giorgio Ghelli; Luca Pardini; Carlo Sartiani
Type inclusion is a fundamental operation in every type-checking compiler, but it is quite expensive for XML manipulation languages. We recently presented an inclusion checking algorithm for an expressive family of XML type languages which is polynomial, but runs in quadratic time both in the best and in the worst cases. We present here an algorithm that has a linear-time backbone, and resorts to the quadratic approach for some specific parts of the compared types. Our experiments show that the new algorithm typically runs in linear time, hence can be used as a building block for a practical type-checking compiler.
Effective XML content and structure retrieval with relevance ranking BIBAFull-Text 147-156
  Xiping Liu; Changxuan Wan; Lei Chen
XML documents can be retrieved by means of not only content-only (CO) queries, but also content-and-structure (CAS) queries. Though promising better retrieval precision, CAS queries introduce several new challenges. To address these challenges, we propose a novel approach for XML CAS retrieval. The distinctive feature of the approach is that it adopts a content-oriented point of view. Specifically, the approach first decomposes a CAS query into several fragments, then retrieves results for each query fragment in a content-centric way, and finally scores each answer node. The approach is adaptive to versatile homogeneous and heterogeneous data environments. To assess the relevance of retrieval results to a query fragment, we present a scoring strategy that measures relevance from both content and structure perspectives. In addition, an effective approach is proposed to infer answer nodes based on the CAS query and document structure. An efficient algorithm is also presented for CAS retrieval. Finally, we demonstrate the effectiveness of the proposed methods through comprehensive experimental studies.

IR domain specific retrieval II

Intention-focused active reranking for image object retrieval BIBAFull-Text 157-166
  Jen-Hao Hsiao; Ming-Syan Chen
We consider the problem of ranking refinement for image object retrieval, whose goal is to improve an existing ranking function by a small number of labeled instances. To retrieve the relevant image object, one state-of-the-art approach is to use the relevance feedback: it first ranks the images in database based on a given ranking function (i.e., base ranker), and then rerank the initial result by further introducing user's feedback information. The key challenge of combining the information from the base ranker and user's feedback comes from the fact that the base ranker tends to give an imperfect result and the information obtained from user's feedback tends to be very noisy. This paper describes an Intention-Focused Active Reranking, an approach for automatically finding the right information to re-estimate the query model. Three novel strategies are proposed to boost the performance of the base ranker: (1) an active selection criterion, which obtains a small number of feedback images that are the most informative to the base ranker for user labeling; (2) the user intention verification, which captures the user's intention in object level to alleviate the query drift problem; (3) a discriminative query model re-estimation, which augments the generative approach with a model of the discriminative information conveyed by positive and negative feedback information. Experiments on a real world data set demonstrate the effectiveness of the proposed approach and furthermore it significantly outperforms the baseline visual bag-of-words retrieval.
A translation model for matching reviews to objects BIBAFull-Text 167-176
  Nilesh Dalvi; Ravi Kumar; Bo Pang; Andrew Tomkins
We develop a generic method for the review matching problem, which is to match unstructured text reviews to a list of objects, where each object has a set of attributes. To this end, we propose a translation model for generating reviews from a structured description of objects. We develop an EM-based method to estimate the model parameters and use this model to find, given a review, the object most likely to be the topic of the review. We conduct extensive experiments on two large-scale datasets: a collection of restaurant reviews from Yelp and a collection of movie reviews from IMDb. The experiments show that our translation model-based method is superior to traditional tf-idf based methods as well as a recent mixture model-based method for the review matching problem.
Learning better transliterations BIBAFull-Text 177-186
  Jeff Pasternack; Dan Roth
We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m2 * k2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
Supervised semantic indexing BIBAFull-Text 187-196
  Bing Bai; Jason Weston; David Grangier; Ronan Collobert; Kunihiko Sadamasa; Yanjun Qi; Olivier Chapelle; Kilian Weinberger
In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
Ranking model adaptation for domain-specific search BIBAFull-Text 197-206
  Bo Geng; Linjun Yang; Chao Xu; Xian-Sheng Hua
Recently, various domain-specific search engines emerge, which are restricted to specific topicalities or document formats, and vertical to the broad-based search. Simply applying the ranking model trained for the broad-based search to the verticals cannot achieve a sound performance due to the domain differences, while building different ranking models for each domain is both laborious for labeling sufficient training samples and time-consuming or the training process. In this paper, to address the above difficulties, we investigate two problems: (1) whether we can adapt the ranking model learned for existing Web page search or verticals, to the new domain, so that the amount of labeled data and the training cost is reduced, while the performance requirement is still satisfied; and (2) how to adapt the ranking model from auxiliary domains to a new target domain. We address the second problem from the regularization framework and an algorithm called ranking adaptation SVM is proposed. Our algorithm is flexible enough, which needs only the prediction from the existing ranking model, rather than the internal representation of the model or the data from auxiliary domains. The first problem is addressed by the proposed ranking adaptability measurement, which quantitatively estimates if an existing ranking model can be adapted to the new domain. Extensive experiments are performed over Letor benchmark dataset and two large scale datasets crawled from different domains through a commercial internet search engine, where the ranking model learned for one domain will be adapted to the other. The results demonstrate the applicabilities of the proposed ranking model adaptation algorithm and the ranking adaptability measurement.

KM information extraction II

Data-driven compound splitting method for English compounds in domain names BIBAFull-Text 207-214
  Sanjeet Khaitan; Arumay Das; Sandeep Gain; Adithi Sampath
Significant amount of literature is available on compound splitting of long words albeit for non-English languages -- especially European. Not surprisingly, there has been not much work for English as it is not a compounding language like some of its European counterparts. However, Internet domain names in general are compound English words, e.g. bankofamerica.com". Compound splitting can be effectively employed to extract information from domain names. In this paper, an data-driven learning technique for splitting English compound words is described which among others uses features like normalized frequency, length of parts and n-gram. The splitting F-measure is higher than the published approaches. We applied this technique on a real life web search application where the queries are mistyped domain names routed through sources like ISPs and browsers. Relevant and meaningful keywords were extracted out and shown to the user as a value added search option. Results show a very high click-through rate and increased commercial value.
Named entity disambiguation by leveraging wikipedia semantic knowledge BIBAFull-Text 215-224
  Xianpei Han; Jun Zhao
Name ambiguity problem has raised an urgent demand for efficient, high-quality named entity disambiguation methods. The key problem of named entity disambiguation is to measure the similarity between occurrences of names. The traditional methods measure the similarity using the bag of words (BOW) model. The BOW, however, ignores all the semantic relations such as social relatedness between named entities, associative relatedness between concepts, polysemy and synonymy between key terms. So the BOW cannot reflect the actual similarity. Some research has investigated social networks as background knowledge for disambiguation. Social networks, however, can only capture the social relatedness between named entities, and often suffer the limited coverage problem.
   To overcome the previous methods' deficiencies, this paper proposes to use Wikipedia as the background knowledge for disambiguation, which surpasses other knowledge bases by the coverage of concepts, rich semantic information and up-to-date content. By leveraging Wikipedia's semantic knowledge like social relatedness between named entities and associative relatedness between concepts, we can measure the similarity between occurrences of names more accurately. In particular, we construct a large-scale semantic network from Wikipedia, in order that the semantic knowledge can be used efficiently and effectively. Based on the constructed semantic network, a novel similarity measure is proposed to leverage Wikipedia semantic knowledge for disambiguation. The proposed method has been tested on the standard WePS data sets. Empirical results show that the disambiguation performance of our method gets 10.7% improvement over the traditional BOW based methods and 16.7% improvement over the traditional social network based methods.
Helping editors choose better seed sets for entity set expansion BIBAFull-Text 225-234
  Vishnu Vyas; Patrick Pantel; Eric Crestan
Sets of named entities are used heavily at commercial search engines such as Google, Yahoo and Bing. Acquiring sets of entities typically consists of combining semi-supervised expansion algorithms with manual cleaning of the resulting expanded sets. In this paper, we study the effects of different seed sets in a state-of-the-art semi-supervised expansion system and show a tremendous variation in expansion performance depending on the choice of seeds. We further show that human editors, in general, provide very bad seed sets, which perform well-below the average random seed set. We identify three factors of seed set composition, namely prototypicality, ambiguity and coverage, and we investigate their effects on expansion performance. Finally, we propose various automatic systems for improving editor-generated seed sets, which seek to remove ambiguous and other error-prone seed instances. An extensive experimental analysis shows that expansion quality, measured in R-precision, can be improved on average by a maximum of 46% by removing the right seeds from a seed set. Our automatic methods outperform the human editors seed sets and on average improve expansion performance by up to 34% over the original seed sets.
Using multiple ontologies in information extraction BIBAFull-Text 235-244
  Daya C. Wimalasuriya; Dejing Dou
Ontology-Based Information Extraction (OBIE) has recently emerged as a subfield of Information Extraction (IE). Here, ontologies -- which provide formal and explicit specifications of conceptualizations -- play a crucial role in the information extraction process. Several OBIE systems have been implemented previously but all of them use a single ontology although multiple ontologies have been designed for many domains. We have studied the theoretical basis for using multiple ontologies in information extraction and have developed information extraction systems that use them. These systems investigate the two major scenarios for having multiple ontologies for the same domain: specializing in sub-domains and providing different perspectives. The domain of universities has been used for the former scenario through a corpus collected from university websites. For the latter, the domain of terrorist attacks and a corpus used by a previous Message Understanding Conference (MUC) have been used. The results from these two case studies indicate that using multiple ontologies in information extraction has led to a clear improvement in performance measures.

IR personalization & social search II

Computational community interest for ranking BIBAFull-Text 245-254
  Xiaozhong Liu; Vadim von Brzeski
Ranking documents with respect to users' information needs is a challenging task, due, in part, to the dynamic nature of users' interest with respect to a query, which can change over time. In this paper, we propose an innovative method for characterizing the interests of a community of users at a specific point in time and for using this characterization to alter the ranking of documents retrieved for a query. By generating a community interest vector (CIV) for a given query, we measure the community interest by computing a score in a specific document or web page retrieved by the query. This score is based on a continuously updated set of recent (daily or past few hours) user-oriented text data. When applying our method in ranking Yahoo! Buzz results, the CIV score improves relevant results by 16% as determined by real-world user evaluation.
Adaptive relevance feedback in information retrieval BIBAFull-Text 255-264
  Yuanhua Lv; ChengXiang Zhai
Relevance Feedback has proven very effective for improving retrieval accuracy. A difficult yet important problem in all relevance feedback methods is how to optimally balance the original query and feedback information. In the current feedback methods, the balance parameter is usually set to a fixed value across all the queries and collections. However, due to the difference in queries and feedback documents, this balance parameter should be optimized for each query and each set of feedback documents.
   In this paper, we present a learning approach to adaptively predict the optimal balance coefficient for each query and each collection. We propose three heuristics to characterize the balance between query and feedback information. Taking these three heuristics as a road map, we explore a number of features and combine them using a regression approach to predict the balance coefficient. Our experiments show that the proposed adaptive relevance feedback is more robust and effective than the regular fixed-coefficient feedback.
The use of categorization information in language models for question retrieval BIBAFull-Text 265-274
  Xin Cao; Gao Cong; Bin Cui; Christian Søndergaard Jensen; Ce Zhang
Community Question Answering (CQA) has emerged as a popular type of service meeting a wide range of information needs. Such services enable users to ask and answer questions and to access existing question-answer pairs. CQA archives contain very large volumes of valuable user-generated content and have become important information resources on the Web. To make the body of knowledge accumulated in CQA archives accessible, effective and efficient question search is required. Question search in a CQA archive aims to retrieve historical questions that are relevant to new questions posed by users. This paper proposes a category-based framework for search in CQA archives. The framework embodies several new techniques that use language models to exploit categories of questions for improving question-answer search. Experiments conducted on real data from Yahoo! Answers demonstrate that the proposed techniques are effective and efficient and are capable of outperforming baseline methods significantly.
Improving search engines using human computation games BIBAFull-Text 275-284
  Hao Ma; Raman Chandrasekar; Chris Quirk; Abhishek Gupta
Work on evaluating and improving the relevance of web search engines typically use human relevance judgments or clickthrough data. Both these methods look at the problem of learning the mapping from queries to web pages. In this paper, we identify some issues with this approach, and suggest an alternative approach, namely, learning a mapping from web pages to queries. In particular, we use human computation games to elicit data about web pages from players that can be used to improve search. We describe three human computation games that we developed, with a focus on Page Hunt, a single-player game. We describe experiments we conducted with several hundred game players, highlight some interesting aspects of the data obtained and define the 'findability' metric. We also show how we automatically extract query alterations for use in query refinement using techniques from bitext matching. The data that we elicit from players has several other applications including providing metadata for pages and identifying ranking issues.

DB string databases, blogs, & social search

Space-economical partial gram indices for exact substring matching BIBAFull-Text 285-294
  Nan Tang; Lefteris Sidirourgos; Peter Boncz
Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.
AS-index: a structure for string search using n-grams and algebraic signatures BIBAFull-Text 295-304
  Cédric du Mouza; Witold Litwin; Philippe Rigaux; Thomas Schwarz
AS-Index is a new index structure for exact string search in disk resident databases. It uses hashing, unlike known alternatives, whether based on trees or tries. It typically indexes every n-gram in the database, though non-dense indexing is also possible. The hash function uses the algebraic signatures of all n-grams. Use of hashing provides for constant index access time for arbitrarily long patterns, unlike other structures whose search cost is at best logarithmic. The storage overhead of AS-Index is basically 500 -- 600%, similar to that of alternatives or smaller. We show the index structure, our use of algebraic signatures and the search algorithm. We discuss the design trade-offs and present the theoretical and experimental performance analysis. We compare the AS-Index to main alternatives. We conclude that AS-Index is an attractive structure and we indicate directions for future work.
Robust record linkage blocking using suffix arrays BIBAFull-Text 305-314
  Timothy de Vries; Hui Ke; Sanjay Chawla; Peter Christen
Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.
Efficient algorithms for approximate member extraction using signature-based inverted lists BIBAFull-Text 315-324
  Jiaheng Lu; Jialong Han; Xiaofeng Meng
We study the problem of approximate membership extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. We solve this problem in two steps. In the first step, for each substring in the text, we filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. We develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text. Our experimental study of the proposed algorithms on real and synthetic datasets showed that our solutions significantly outperform existing methods in the literature.

KM advance mining techniques

An integrated discriminative probabilistic approach to information extraction BIBAFull-Text 325-334
  Xiaofeng Yu; Wai Lam; Bo Chen
Probabilistic graphical models for sequence data enable us to effectively deal with inherent uncertainty in many real-world domains. However, they operate on a mostly propositional level. Logic approaches, on the other hand, can compactly represent a wide variety of knowledge, especially first-order ones, but treat uncertainty only in limited ways. Therefore, combining probability and first-order logic is highly desirable for information extraction which requires uncertainty modeling as well as dependency and deeper knowledge representation. In this paper, we model both segmentations in observation sequence and relations of segments simultaneously in our proposed integrated discriminative probabilistic framework. We propose the Metropolis-Hastings, a Markov chain Monte Carlo (MCMC) algorithm for approximate Bayesian inference to find the maximum a posteriori assignment of all the variables of this model. This integrated model has several advantages over previous probabilistic graphical models, and it offers a great capability of extracting implicit relations and new relation discovery for relation extraction from encyclopedic documents, and capturing sub-structures in named entities for named entity recognition. We performed extensive experiments on the above two well-established information extraction tasks, illustrating the feasibility and promise of our approach.
Mining linguistic cues for query expansion: applications to drug interaction search BIBAFull-Text 335-344
  Sheng Guo; Naren Ramakrishnan
Given a drug under development, what are other drugs or biochemical compounds that it might interact with? Early answers to this question, by mining the literature, are valuable for pharmaceutical companies, both monetarily and in avoiding public relations nightmares. Inferring drug-drug interactions is also important in designing combination therapies for complex diseases including cancers. We study this problem as one of mining linguistic cues for query expansion. By using (only) positive instances of drug interactions, we show how we can extract linguistic cues which can then be used to expand and reformulate queries to improve the effectiveness of drug interaction search. Our approach integrates many learning paradigms: partially supervised classification, association measures for collocation mining, and feature selection in supervised learning. We demonstrate compelling results on using positive examples from the DrugBank database to seed MEDLINE searches for drug interactions. In particular, we show that purely data-driven linguistic cues can be effectively mined and applied to realize a successful domain-specific query expansion framework.
Message family propagation for ising mean field based on iteration tree BIBAFull-Text 345-354
  Yarui Chen; Shizhong Liao
Ising mean field is a basic variational inference method for Ising model, which can provide an effective approximate solution for large-scale inference problem. The main idea is to transform a probabilistic inference problem into a functional extremum problem by variational calculus, and solve the functional extremum problem to obtain approximate marginal distributions. The process of solving the functional extremum is an important step and a computational core for variational inference. But the traditional full variational iteration methods make the variable information intercross with each other deeply. From the view of incomplete variational iterations, we propose a message family propagation method for Ising mean field to compute a marginal distribution family of object variable.
   First we define the concepts of iteration tree and pruning iteration tree to describe the iteration computation process of Ising mean field inference. Then we design the message family propagation method based on the iteration trees. The method propagates mean field message families and belief message families from bottom to top of the iteration tree, and presents a marginal distribution family of variable in root node. Finally we prove the marginal distribution bound theorem, which shows that the marginal distribution family computed by the method in the pruning iteration tree contains the exact marginal distribution. Theoretical and experimental results illustrate that the message family propagation method is valid and the marginal distribution bounds are tight.
Efficient itemset generator discovery over a stream sliding window BIBAFull-Text 355-364
  Chuancong Gao; Jianyong Wang
Mining generator patterns has raised great research interest in recent years. The main purpose of mining itemset generators is that they can form equivalence classes together with closed itemsets, and can be used to generate simple classification rules according to the MDL principle. In this paper, we devise an efficient algorithm called StreamGen to mine frequent itemset generators over a stream sliding window. We adopt a novel enumeration tree structure to help keep the information of mined generators and the border between generators and non-generators, and propose some optimization techniques to speed up the mining process. We further extend the algorithm to directly mine a set of high quality classification rules over stream sliding windows while keeping high performance. The extensive performance study shows that our algorithm outperforms other state-of-the-art algorithms which perform similar tasks in terms of both runtime and memory usage efficiency, and has high utility in terms of classification.

KM text mining

Learning document aboutness from implicit user feedback and document structure BIBAFull-Text 365-374
  Deepa Paranjpe
Capturing the "aboutness" of documents has been a key research focus throughout the history of automated textual information processing. In this work, we represent aboutness using words and phrases that best reflect the central topics of a document. We present a machine learning approach that learns to score and rank words and phrases in a document according to their relevance to the document. We use implicit user feedback available in search engine click logs to characterize the user-perceived notion of term relevance. Using a small set of manually generated training data, we show that the surrogate training data from click logs correlates well with this data, thus eliminating the need to create data for training manually which is both expensive and fundamentally difficult to obtain for such a task. Further, we use a diverse set of features in our learning model that capitalize heavily on the structural and visual properties of web documents. In our extensive experimentation, we pay particular attention to tail web pages and show that our approach trained on mainly head web pages generalizes and performs well on all kinds of documents. In several evaluation methods using manually generated summaries and term relevance judgments, our system shows 25% improvement over other aboutness solutions.
Joint sentiment/topic model for sentiment analysis BIBAFull-Text 375-384
  Chenghua Lin; Yulan He
Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework based on Latent Dirichlet Allocation (LDA), called joint sentiment/topic model (JST), which detects sentiment and topic simultaneously from text. Unlike other machine learning approaches to sentiment classification which often require labeled corpora for classifier training, the proposed JST model is fully unsupervised. The model has been evaluated on the movie review dataset to classify the review sentiment polarity and minimum prior information have also been explored to further improve the sentiment classification accuracy. Preliminary experiments have shown promising results achieved by JST.
Generating comparative summaries of contradictory opinions in text BIBAFull-Text 385-394
  Hyun Duk Kim; ChengXiang Zhai
This paper presents a study of a novel summarization problem called contrastive opinion summarization (COS). Given two sets of positively and negatively opinionated sentences which are often the output of an existing opinion summarizer, COS aims to extract comparable sentences from each set of opinions and generate a comparative summary containing a set of contrastive sentence pairs. We formally formulate the problem as an optimization problem and propose two general methods for generating a comparative summary using the framework, both of which rely on measuring the content similarity and contrastive similarity of two sentences. We study several strategies to compute these two similarities. We also create a test data set for evaluating such a novel summarization problem. Experiment results on this test set show that the proposed methods are effective for generating comparative summaries of contradictory opinions.
sDoc: exploring social wisdom for document enhancement in web mining BIBAFull-Text 395-404
  Xiaoxun Zhang; Lichun Yang; Xian Wu; Honglei Guo; Zhili Guo; Shenghua Bao; Yong Yu; Zhong Su
Web document could be seen to be composed of textual content as well as social metadata of various forms (e.g., anchor text, search query and social annotation), both of which are valuable to indicate the semantic content of the document. However, due to the free nature of the web, the two streams of web data suffer from the serious problems of noise and sparseness, which have actually become the major challenges to the success of many web mining applications. Previous work has shown that it could enhance the content of web document by integrating anchor text and search query. In this paper, we study the problem of exploring emergent social annotation for document enhancement and propose a novel reinforcement framework to generate "social representation" of document. Distinguishing from prior work, textual content and social annotation are enhanced simultaneously in our framework, which is achieved by exploiting a kind of mutual reinforcement relationship behind them. Two convergent models, social content model and social annotation model, are symmetrically derived from the framework to represent enhanced textual content and enhanced social annotation respectively. The enhanced document is referred to as Social Document or sDoc in that it could embed complementary viewpoints from many web authors and many web visitors. In this sense, the document semantics is enhanced exactly by exploring social wisdom. We build the framework on a large Del.icio.us data and evaluate it through three typical web mining applications: annotation, classification and retrieval. Experimental results demonstrate that social representation of web document could boost the performance of these applications significantly.
Terminology mining in social media BIBAFull-Text 405-414
  Magnus Sahlgren; Jussi Karlgren
The highly variable and dynamic word usage in social media presents serious challenges for both research and those commercial applications that are geared towards blogs or other user-generated non-editorial texts. This paper discusses and exemplifies a terminology mining approach for dealing with the productive character of the textual environment in social media. We explore the challenges of practically acquiring new terminology, and of modeling similarity and relatedness of terms from observing realistic amounts of data. We also discuss semantic evolution and density, and investigate novel measures for characterizing the preconditions for terminology mining.

IR crawling & indexing

Compact full-text indexing of versioned document collections BIBAFull-Text 415-424
  Jinru He; Hao Yan; Torsten Suel
We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia or the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, such that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of new techniques and the existing techniques in the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches.
On the feasibility of multi-site web search engines BIBAFull-Text 425-434
  Ricardo Baeza-Yates; Aristides Gionis; Flavio Junqueira; Vassilis Plachouras; Luca Telloli
Web search engines are often implemented as centralized systems. Designing and implementing a Web search engine in a distributed environment is a challenging engineering task that encompasses many interesting research questions. However, distributing a search engine across multiple sites has several advantages, such as utilizing less compute resources and exploiting data locality. In this paper we investigate the cost-effectiveness of building a distributed Web search engine. We propose a model for assessing the total cost of a distributed Web search engine that includes the computational costs and the communication cost among all distributed sites. We then present a query-processing algorithm that maximizes the amount of queries answered locally, without sacrificing the quality of the results compared to a centralized search engine. We simulate the algorithm on real document collections and query workloads to measure the actual parameters needed for our cost model, and we show that a distributed search engine can be competitive compared to a centralized architecture with respect to real cost.
On-line index maintenance using horizontal partitioning BIBAFull-Text 435-444
  Sairam Gurajada; Sreenivasa Kumar P
In this paper, we propose a new merge-based index maintenance strategy for Information Retrieval systems. The new model is based on partitioning of the inverted index across the terms in it. We exploit the query log to partition the on-disk inverted index into two types of sub-indexes. Inverted lists of the terms contained in the queries that are frequently posed to the Information Retrieval systems are kept in one partition, called frequent-term index and the other inverted lists form another partition, called infrequent-term index. We use a lazy-merge strategy for maintaining infrequent-term sub-indexes, and an active merge strategy for maintaining frequent-term sub-indexes. The sub-indexes are also similarly split into frequent and in-frequent parts. Experimental results show that the proposed method improves both index maintenance performance and query performance compared to the existing merge-based strategies.
Adaptive geospatially focused crawling BIBAFull-Text 445-454
  Dirk Ahlers; Susanne Boll
Location information on the Web is a precious asset for a multitude of applications and is becoming an increasingly important dimension in Web search. Even though more and more Web pages carry location information, they form only a small share of all pages and are scattered over the Web. To efficiently find and index location-related Web content, we propose an efficient crawling strategy that retrieves precisely those pages that are geospatially relevant while minimizing the amount of the non-spatially-relevant pages within the crawled pages. We propose to address this challenge by expanding the technique of focused crawling to exploit location references on Web pages to specifically retrieve geospatial topics on the Web.
   In this paper, we describe the design and development of a focused crawler with an adaptive geospatial focus that efficiently retrieves and identifies location-relevant documents on the Web. Drawing from geospatial features of both Web pages and the link graph, a crawl strategy based on Bayesian classifiers prioritizes promising links and pages, leading to a faster coverage of the desired geospatial topic as a means for fast creation of precise geospatial Web indexes.
   We present evaluations of the system's performance and share our findings on the geospatial Web graph and the distribution of location references on the Web.
Low-cost management of inverted files for online full-text search BIBAFull-Text 455-464
  Giorgos Margaritis; Stergios V. Anastasiadis
In dynamic environments with frequent content updates, we require online full-text search that scales to large data collections and achieves low search latency. Several recent methods that support fast incremental indexing of documents typically keep on disk multiple partial index structures that they continuously update as new documents are added. However, spreading indexing information across multiple locations on disk tends to considerably decrease the search responsiveness of the system. In the present paper, we take a fresh look at the problem of online full-text search with consideration of the architectural features of modern systems. Selective Range Flush is a greedy method that we introduce to manage the index in the system by using fixed-size blocks to organize the data on disk and dynamically keep low the cost of data transfer between memory and disk. As we experimentally demonstrate with the Proteus prototype implementation that we developed, we retrieve indexing information at latency that matches the lowest achieved by existing methods. Additionally, we reduce the total building cost by 30% in comparison to methods with similar retrieval time.

DB novel data management & data mining tools

Bitmap indexes for relational XML twig query processing BIBAFull-Text 465-474
  Kyong-Ha Lee; Bongki Moon
Due to an increasing volume of XML data, it is considered prudent to store XML data on an industry-strength database system instead of relying on a domain specific application or a file system. For shredded XML data stored in the relational tables, however, it may not be straightforward to apply existing algorithms for twig query processing, because most of the algorithms require XML data to be accessed in a form of streams of elements grouped by their tags and sorted in a particular order. In order to support XML query processing within the common framework of relational database systems, we first propose several bitmap indexes for supporting holistic twig joins on XML data stored in the relational tables. Since bitmap indexes are well supported in most of the commercial and open-source database systems, the proposed bitmap indexes and twig query processing algorithms can be incorporated into the relational query processing framework with more ease. The proposed query processing algorithms are efficient in terms of both time and space, since the compressed bitmap indexes stay compressed during query processing. In addition, we propose a hybrid index which computes twig query solutions with only bit-vectors, without accessing labeled XML elements stored in the relational tables.
Answering XML queries using materialized views revisited BIBAFull-Text 475-484
  Xiaoying Wu; Dimitri Theodoratos; Wendy Hui Wang
Answering queries using views is a well-established technique in databases. In this context, two outstanding problems can be formulated. The first one consists in deciding whether a query can be answered exclusively using one or multiple materialized views. Given the many alternative ways to compute the query from the materialized views, the second problem consists in finding the best way to compute the query from the materialized views. In the realm of XML, there is a restricted number of contributions in the direction of these problems due to the many limitations associated with the use of materialized views in traditional XML query evaluation models.
   In this paper, we adopt a recent evaluation model, called inverted lists model, and holistic algorithms which together have been established as the prominent technique for evaluating queries on large persistent XML data, and we address the previous two problems. This new context revises these problems since it requires new conditions for view usability and new techniques for computing queries from materialized views. We suggest an original approach for materializing views which stores for every view node only the list of XML nodes necessary for computing the answer of the view. We specify necessary and sufficient conditions for answering a tree-pattern query using one or multiple materialized views in terms of homomorphisms from the views to the query. In order to efficiently answer queries using materialized views, we design a stack-based algorithm which compactly encodes in polynomial time and space all the homomorphisms from a view to a query. We further propose space and time optimizations by using bitmaps to encode view materializations and by employing bitwise operations to minimize the evaluation cost of the queries. Finally, we conducted an extensive experimentation which demonstrates that our approach yields impressive query hit rates in the view pool, achieves significant time and space savings and shows smooth scalability.
A query language for analyzing networks BIBAFull-Text 485-494
  Anton Dries; Siegfried Nijssen; Luc De Raedt
With more and more large networks becoming available, mining and querying such networks are increasingly important tasks which are not being supported by database models and querying languages. This paper wants to alleviate this situation by proposing a data model and a query language for facilitating the analysis of networks. Key features include support for executing external tools on the networks, flexible contexts on the network each resulting in a different graph, primitives for querying subgraphs (including paths) and transforming graphs.
   The data model provides for a closure property, in which the output of every query can be stored in the database and used for further querying.
Probabilistic models for topic learning from images and captions in online biomedical literatures BIBAFull-Text 495-504
  Xin Chen; Caimei Lu; Yuan An; Palakorn Achananuparp
Biomedical images and captions are one of the major sources of information in online biomedical publications. They often contain the most important results to be reported, and provide rich information about the main themes in published papers. In the data mining and information retrieval community, there has been much effort on using text mining and language modeling algorithms to extract knowledge from the text content of online biomedical publications; however, the problem of knowledge extraction from biomedical images and captions has not been fully studied yet. In this paper, a hierarchical probabilistic topic model with background distribution (HPB) is introduced to uncover the latent semantic topics from the co-occurrence patterns of caption words, visual words and biomedical concepts. With downloaded biomedical figures, restricted captions are extracted with regard to each individual image panel. During the indexing stage, the 'bag-of-words' representation of captions is supplemented by an ontology-based concept indexing to alleviate the synonym and polysemy problems. As the visual counterpart of text words, the visual words are extracted and indexed from corresponding image panels. The model is estimated via collapsed Gibbs sampling algorithm. We compare the performance of our model with the extension of the Correspondence LDA (Corr-LDA) model under the same biomedical image annotation scenario using cross-validation. Experimental results demonstrate that our model is able to accurately extract latent patterns from complicated biomedical image-caption pairs and facilitate knowledge organization and understanding in online biomedical literatures.
Learning to rank with a novel kernel perceptron method BIBAFull-Text 505-512
  Xue-wen Chen; Haixun Wang; Xiaotong Lin
While conventional ranking algorithms, such as the PageRank, rely on the web structure to decide the relevancy of a web page, learning to rank seeks a function capable of ordering a set of instances using a supervised learning approach. Learning to rank has gained increasing popularity in information retrieval and machine learning communities. In this paper, we propose a novel nonlinear perceptron method for rank learning. The proposed method is an online algorithm and simple to implement. It introduces a kernel function to map the original feature space into a nonlinear space and employs a perceptron method to minimize the ranking error by avoiding converging to a solution near the decision boundary and alleviating the effect of outliers in the training dataset. Furthermore, unlike existing approaches such as RankSVM and RankBoost, the proposed method is scalable to large datasets for online learning. Experimental results on benchmark corpora show that our approach is more efficient and achieves higher or comparable accuracies in instance ranking than state of the art methods such as FRank, RankSVM and RankBoost.

KM semantic techniques & applications

Towards a universal wordnet by learning from combined evidence BIBAFull-Text 513-522
  Gerard de Melo; Gerhard Weikum
Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
Event detection from flickr data through wavelet-based spatial analysis BIBAFull-Text 523-532
  Ling Chen; Abhishek Roy
Detecting events from web resources has attracted increasing
   research interests in recent years. Our focus in this paper is to detect events from photos on Flickr, an Internet image community website. The results can be used to facilitate user searching and browsing photos by events. The problem is challenging considering: (1) Flickr data is noisy, because there are photos unrelated to real-world events; (2) It is not easy to capture the content of photos. This paper presents our effort in detecting events from Flickr photos by exploiting the tags supplied by users to annotate photos. In particular, the temporal and locational distributions of tag usage are analyzed in the first place, where a wavelet transform is employed to suppress noise. Then, we identify tags related with events, and further distinguish between tags of aperiodic events and those of periodic events. Afterwards, event-related tags are clustered such that each cluster, representing an event, consists of tags with similar temporal and locational distribution patterns as well as with similar associated photos. Finally, for each tag cluster, photos corresponding to the represented event are extracted. We evaluate the performance of our approach using a set of real data collected from Flickr. The experimental results demonstrate that our approach is effective in detecting events from the Flickr photo collection.
Msuggest: a semantic recommender framework for traditional Chinese medicine book search engine BIBAFull-Text 533-542
  Shi Shaomin; Wei Baogang; Yang Yan
Learning traditional Chinese medicine knowledge from the digital library is becoming more and more important these days in China. In medicine learning, many readers want to find out the intrinsic relation between two medicines or among thousands of medicines. A semantic recommender system is useful for readers to understand something quickly by means of analogy which is a cognitive process of transferring information from a particular subject to another if they are similar in some aspects. In view of these above, we present a novel recommender framework called Msuggest to give the diverse semantic recommended medicine terminologies and book pages when a reader searching for medicine information in digital library. Users can choose various aspects including medicine property, efficacy, clinical application, place of origin, book provenance and etc. to see different recommended results. We evaluate Msuggest under the t-test on the samples from random sampling. The result shows that Msuggest is effective and efficient in giving the recommended words and book pages.
Interactive, topic-based visual text summarization and analysis BIBAFull-Text 543-552
  Shixia Liu; Michelle X. Zhou; Shimei Pan; Weihong Qian; Weijia Cai; Xiaoxiao Lian
We are building an interactive, visual text analysis tool that aids users in analyzing a large collection of text. Unlike existing work in text analysis, which focuses either on developing sophisticated text analytic techniques or inventing novel visualization metaphors, ours is tightly integrating state-of-the-art text analytics with interactive visualization to maximize the value of both. In this paper, we focus on describing our work from two aspects. First, we present the design and development of a time-based, visual text summary that effectively conveys complex text summarization results produced by the Latent Dirichlet Allocation (LDA) model. Second, we describe a set of rich interaction tools that allow users to work with a created visual text summary to further interpret the summarization results in context and examine the text collection from multiple perspectives. As a result, our work offers two unique contributions. First, we provide an effective visual metaphor that transforms complex and even imperfect text summarization results into a comprehensible visual summary of texts. Second, we offer users a set of flexible visual interaction tools as the alternatives to compensate for the deficiencies of current text summarization techniques. We have applied our work to a number of text corpora and our evaluation shows the promise of the work, especially in support of complex text analyses.

KM graph mining

P-Rank: a comprehensive structural similarity measure over information networks BIBAFull-Text 553-562
  Peixiang Zhao; Jiawei Han; Yizhou Sun
With the ubiquity of information networks and their broad applications, the issue of similarity computation between entities of an information network arises and draws extensive research interests. However, to effectively and comprehensively measure "how similar two entities are within an information network" is nontrivial, and the problem becomes even more challenging when the information network to be examined is massive and diverse. In this paper, we propose a new similarity measure, P-Rank (Penetrating Rank), toward effectively computing the structural similarities of entities in real information networks. P-Rank enriches the well-known similarity measure, SimRank, by jointly encoding both in- and out-link relationships into structural similarity computation. P-Rank is proven to be a unified structural similarity framework, under which all state-of-the-art similarity measures, including CoCitation, Coupling, Amsler and SimRank, are just its special cases. Based on its recursive nature of P-Rank, we propose a fixed point algorithm to reinforce structural similarity of vertex pairs beyond the localized neighborhood scope toward the entire information network. Our experimental studies demonstrate the power of P-Rank as an effective similarity measure in different information networks. Meanwhile, under the same time/space complexity, P-Rank outperforms SimRank as a comprehensive and more meaningful structural similarity measure, especially in large real information networks.
Independent informative subgraph mining for graph information retrieval BIBAFull-Text 563-572
  Bingjun Sun; Prasenjit Mitra; C. Lee Giles
In order to enable scalable querying of graph databases, intelligent selection of subgraphs to index is essential. An improved index can reduce response times for graph queries significantly. For a given subgraph query, graph candidates that may contain the subgraph are retrieved using the graph index and subgraph isomorphism tests are performed to prune out unsatisfied graphs. However, since the space of all possible subgraphs of the whole set of graphs is prohibitively large, feature selection is required to identify a good subset of subgraph features for indexing. Thus, one of the key issues is: given the set of all possible subgraphs of the graph set, which subset of features is the optimal such that the algorithm retrieves the smallest set of candidate graphs and reduces the number of subgraph isomorphism tests? We introduce a graph search method for subgraph queries based on subgraph frequencies. Then, we propose several novel feature selection criteria, Max-Precision, Max-Irredundant-Information, and Max-Information-Min-Redundancy, based on mutual information. Finally we show theoretically and empirically that our proposed methods retrieve a smaller candidate set than previous methods. For example, using the same number of features, our method improve the precision for the query candidate set by 4%-13% in comparison to previous methods. As a result the response time of subgraph queries also is improved correspondingly.
Graph classification based on pattern co-occurrence BIBAFull-Text 573-582
  Ning Jin; Calvin Young; Wei Wang
Subgraph patterns are widely used in graph classification, but their effectiveness is often hampered by large number of patterns or lack of discrimination power among individual patterns. We introduce a novel classification method based on pattern co-occurrence to derive graph classification rules. Our method employs a pattern exploration order such that the complementary discriminative patterns are examined first. Patterns are grouped into co-occurrence rules during the pattern exploration, leading to an integrated process of pattern mining and classifier learning. By taking advantage of co-occurrence information, our method can generate strong features by assembling weak features. Unlike previous methods that invoke the pattern mining process repeatedly, our method only performs pattern mining once. In addition, our method produces a more interpretable classifier and shows better or competitive classification effectiveness in terms of accuracy and execution time.
Frequent subgraph pattern mining on uncertain graph data BIBAFull-Text 583-592
  Zhaonian Zou; Jianzhong Li; Hong Gao; Shuo Zhang
Graph data are subject to uncertainties in many applications due to incompleteness and imprecision of data. Mining uncertain graph data is semantically different from and computationally more challenging than mining exact graph data. This paper investigates the problem of mining frequent subgraph patterns from uncertain graph data. The frequent subgraph pattern mining problem is formalized by designing a new measure called expected support. An approximate mining algorithm is proposed to find an approximate set of frequent subgraph patterns by allowing an error tolerance on the expected supports of the discovered subgraph patterns. The algorithm uses an efficient approximation algorithm to determine whether a subgraph pattern can be output or not. The analytical and experimental results show that the algorithm is very efficient, accurate and scalable for large uncertain graph databases.
L2 norm regularized feature kernel regression for graph data BIBAFull-Text 593-600
  Hongliang Fei; Jun Huan
Features in many real world applications such as Cheminformatics, Bioinformatics and Information Retrieval have complex internal structure. For example, frequent patterns mined from graph data are graphs. Such graph features have different number of nodes and edges and usually overlap with each other. In conventional data mining and machine learning applications, the internal structure of features are usually ignored. In this paper we consider a supervised learning problem where the features of the data set have intrinsic complexity, and we further assume that the feature intrinsic complexity may be measured by a kernel function. We hypothesize that by regularizing model parameters using the information of feature complexity, we can construct simple yet high quality model that captures the intrinsic structure of the data. Towards the end of testing this hypothesis, we focus on a regression task and have designed an algorithm that incorporate the feature complexity in the learning process, using a kernel matrix weighted L2 norm for regularization, to obtain improved regression performance over conventional learning methods that does not consider the additional information of the feature. We have tested our algorithm using 5 different real-world data sets and have demonstrate the effectiveness of our method.

IR evaluation

Improvements that don't add up: ad-hoc retrieval results since 1998 BIBAFull-Text 601-610
  Timothy G. Armstrong; Alistair Moffat; William Webber; Justin Zobel
The existence and use of standard test collections in information retrieval experimentation allows results to be compared between research groups and over time. Such comparisons, however, are rarely made. Most researchers only report results from their own experiments, a practice that allows lack of overall improvement to go unnoticed. In this paper, we analyze results achieved on the TREC Ad-Hoc, Web, Terabyte, and Robust collections as reported in SIGIR (1998-2008) and CIKM (2004-2008). Dozens of individual published experiments report effectiveness improvements, and often claim statistical significance. However, there is little evidence of improvement in ad-hoc retrieval technology over the past decade. Baselines are generally weak, often being below the median original TREC system. And in only a handful of experiments is the score of the best TREC automatic run exceeded. Given this finding, we question the value of achieving even a statistically significant result over a weak baseline. We propose that the community adopt a practice of regular longitudinal comparison to ensure measurable progress, or at least prevent the lack of it from going unnoticed. We describe an online database of retrieval runs that facilitates such a practice.
Empirical justification of the gain and discount function for nDCG BIBAFull-Text 611-620
  Evangelos Kanoulas; Javed A. Aslam
The nDCG measure has proven to be a popular measure of retrieval effectiveness utilizing graded relevance judgments. However, a number of different instantiations of nDCG exist, depending on the arbitrary definition of the gain and discount functions used (1) to dictate the relative value of documents of different relevance grades and (2) to weight the importance of gain values at different ranks, respectively. In this work we discuss how to empirically derive a gain and discount function that optimizes the efficiency or stability of nDCG. First, we describe a variance decomposition analysis framework and an optimization procedure utilized to find the efficiency- or stability-optimal gain and discount functions. Then we use TREC data sets to compare the optimal gain and discount functions to the ones that have appeared in the IR literature with respect to (a) the efficiency of the evaluation, (b) the induced ranking of systems, and (c) the discriminative power of the resulting nDCG measure.
Expected reciprocal rank for graded relevance BIBAFull-Text 621-630
  Olivier Chapelle; Donald Metlzer; Ya Zhang; Pierre Grinspan
While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position has always the same gain and discount independently of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with clicks metrics than other editorial metrics.
Usage based effectiveness measures: monitoring application performance in information retrieval BIBAFull-Text 631-640
  Leif Azzopardi
The aim of an Information Retrieval (IR) application is to support the user accessing relevant information effectively and efficiently. It is well known that system performance, in terms of finding relevant information is heavily dependent upon the IR application (i.e. the IR system exposed through the application's interface), as well as how the application is used by the user (i.e. how the user interacts with the system through the interface). Thus, a very pragmatic evaluation question that arises at the application level is: what is the effectiveness experienced by the user during the usage of the application? To be able to answer this question, we represent the usage of an application by the stream of documents the user encounters while interacting with the application. This representation enables us to monitor and track the performance over time and usage. By taking a stream-based, time-centric view of the IR process, instead of a rank-list, topic/task centric view, the evaluation can be performed on any IR based application. To illustrate the difference and the utility of this approach, we demonstrate how a new suite of usage based effectiveness measures can be applied. This work provides the conceptual foundations for measuring, monitoring and modeling the performance of any IR application which needs to be evaluated over time and in context.
Post-rank reordering: resolving preference misalignments between search engines and end users BIBAFull-Text 641-650
  Chao Liu; Mei Li; Yi-Min Wang
No search engine is perfect. A typical type of imperfection is the preference misalignment between search engines and end users, e.g., from time to time, web users skip higher-ranked documents and click on lower-ranked ones. Although search engines have been aggressively incorporating clickthrough data in their ranking, it is hard to eliminate such misalignments across millions of queries. Therefore, we, in this paper, propose to accompany a search engine with an "always-on" component that reorders documents on a per-query basis, based on user click patterns. Because of positional bias and dependencies between clicks, we show that a simple sort based on click counts (and its variants), albeit intuitive and useful, is not precise enough.
   In this paper, we put forward a principled approach to reordering documents by leveraging existing click models. Specifically, we compute the preference probability that a lower-ranked document is preferred to a higher-ranked one from the Click Chain Model (CCM), and propose to swap the two documents if the probability is sufficiently high. Because CCM models positional bias and dependencies between clicks, this method readily accounts for many twisted heuristics that have to be manually encoded in sort-based approaches. For this approach to be practical, we further devise two approximation schemes that make online computation of the preference probability feasible. We carried out a set of experiments based on real-world data from a major search engine, and the result clearly demonstrates the effectiveness of the proposed approach.

DB information integration, data provenance, probabilistic databases

Probabilistic skyline queries BIBAFull-Text 651-660
  Christian Böhm; Frank Fiedler; Annahita Oswald; Claudia Plant; Bianca Wackersreuther
The ability to deal with uncertain information is becoming increasingly important for modern database applications. Whereas a conventional (certain) object is usually represented by a vector from a multidimensional feature space, an uncertain object is represented by a multivariate probability density function (PDF). This PDF can be defined either discretely (e.g. by a histogram) or continuously in parametric form (e.g. by a Gaussian Mixture Model). For a database of uncertain objects, the users expect similar data analysis techniques as for a conventional database of certain objects. An important analysis technique for certain objects is the skyline operator which finds maximal or minimal vectors with respect to any possible attribute weighting. In this paper, we propose the concept of probabilistic skylines, an extension of the skyline operator for uncertain objects. In addition, we propose efficient and effective methods for determining the probabilistic skyline of uncertain objects which are defined by a PDF in parametric form (e.g. a Gaussian function or a Gaussian Mixture Model). To further accelerate the search, we elaborate how the computation of the probabilistic skyline can be supported by an index structure for uncertain objects. An extensive experimental evaluation demonstrates both the effectiveness and the efficiency of our technique.
Density-based clustering using graphics processors BIBAFull-Text 661-670
  Christian Böhm; Robert Noll; Claudia Plant; Bianca Wackersreuther
During the last few years, GPUs have evolved from simple devices for the display signal preparation into powerful coprocessors that do not only support typical computer graphics tasks but can also be used for general numeric and symbolic computation tasks. As major advantage GPUs provide extremely high parallelism combined with a high bandwidth in memory transfer at low cost. We want to exploit these advantages in density-based clustering, an important paradigm in clustering since typical algorithms of this category are noise and outlier robust and search for clusters of an arbitrary shape in metric and vector spaces. Moreover, with a time complexity ranging from O(n log n) to O(n2) these algorithms are scalable to large data sets in a database system. In this paper, we propose CUDA-DClust, a massively parallel algorithm for density-based clustering for the use of a Graphics Processing Unit (GPU). While the result of this algorithm is guaranteed to be equivalent to that of DBSCAN, we demonstrate a high speed-up, particularly in combination with a novel index structure for use in GPUs.
Scalable continuous range monitoring of moving objects in symbolic indoor space BIBAFull-Text 671-680
  Bin Yang; Hua Lu; Christian S. Jensen
Indoor spaces accommodate large populations of individuals. The continuous range monitoring of such objects can be used as a foundation for a wide variety of applications, e.g., space planning, way finding, and security. Indoor space differs from outdoor space in that symbolic locations, e.g., rooms, rather than Euclidean positions or spatial network locations are important. In addition, positioning based on presence sensing devices, rather than, e.g., GPS, is assumed. Such devices report the objects in their activation ranges. We propose an incremental, query-aware continuous range query processing technique for objects moving in this setting. A set of critical devices is determined for each query, and only the observations from those devices are used to continuously maintain the query result. Due to the limitations of the positioning devices, queries contain certain and uncertain results. A maximum-speed constraint on object movement is used to refine the latter results. A comprehensive experimental study with both synthetic and real data suggests that our proposal is efficient and scalable.
Provenance query evaluation: what's so special about it? BIBAFull-Text 681-690
  Anastasios Kementsietsidis; Min Wang
While provenance has been extensively studied in the literature, the efficient evaluation of provenance queries remains an open problem. Traditional query optimization techniques, like the use of general-purpose indexes, or the materialization of provenance data, fail on different fronts to address the problem. Therefore, the need to develop provenance-aware access methods becomes apparent. This paper starts by identifying some key requirements that are to a large extent specific to provenance queries and are necessary for their efficient evaluation. The first such property, called duality, requires that a single access method is used to evaluate both backward provenance queries (which input items of some analysis generate an output item) and forward provenance queries (which outputs of some analysis does an input item generate). The second property, called locality, guarantees that provenance query evaluation times should depend mainly on the size of the provenance query results and should be largely independent of the total size of provenance data. Motivated by the above, we identify proper data structures with the aforementioned properties, we implement them, and through a detailed set of experiments, we illustrate their effectiveness on the evaluation of provenance queries.
Navigational path privacy protection: navigational path privacy protection BIBAFull-Text 691-700
  Ken C. K. Lee; Wang-Chien Lee; Hong Va Leong; Baihua Zheng
Navigational path query, one of the most popular location-based services (LBSs), determines a route from a source to a destination on a road network. However, issuing path queries to some non-trustworthy service providers may pose privacy threats to the users. For instance, given a query requesting for a path from a residential address to a psychiatrist, some adversaries may deduce "who is related to what disease". In this paper, we present an obfuscator framework that reduces the likelihood of path queries being revealed, while supporting different user privacy protection needs and retaining query evaluation efficiency. The framework consists of two major components, namely, an obfuscator and an obfuscated path query processor. The former formulates obfuscated path queries by intermixing true and fake sources and destinations and the latter facilitates efficient evaluation of the obfuscated path queries in an LBS server. The framework supports three types of obfuscated path queries, namely, independent obfuscated path query, shared obfuscated path query, and anti-collusion obfuscated path query. Our proposal strikes a balance between privacy protection strength and query processing overheads, while enhancing privacy protection against collusion attacks. Finally, we validate the proposed ideas and evaluate the performance of our framework based on an extensive set of empirical experiments.

Industry information retrieval

Automatic retrieval of similar content using search engine query interface BIBAFull-Text 701-710
  Ali Dasdan; Paolo D'Alberto; Santanu Kolay; Chris Drome
We consider the coverage testing problem where we are given a document and a corpus with a limited query interface and asked to find if the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve this problem, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the query signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and documents in multiple languages. They also are robust against performance parameter variations.
Mashup-based information retrieval for domain experts BIBAFull-Text 711-720
  Anand Ranganathan; Anton Riabov; Octavian Udrea
In this paper, we tackle the problem of helping domain experts to construct, parameterize and deploy mashups of data and code. We view a mashup as a data processing flow, that describes how data is obtained from one or more sources, processed by one or more components, and finally sent to one or more sinks. Our approach allows specifying patterns of flows, in a language called Cascade. The patterns cover different possible variations of the flows, including variations in the structure of the flow, the components in the flow and the possible parameterizations of these components. We present a tool that makes use of this knowledge of flow patterns and associated metadata to allow domain experts to explore the space of possible flows described in the pattern. The tool uses an AI planning approach to automatically build a flow, belonging to the flow pattern, from a high-level goal, specified as a set of tags. We describe examples from the financial services domain to show the use of flow patterns in allowing domain experts to construct a large variety of mashups rapidly.
A study of information retrieval on accumulative social descriptions using the generation features BIBAFull-Text 721-730
  Lichun Yang; Shengliang Xu; Shenghua Bao; Dingyi Han; Zhong Su; Yong Yu
This paper is concerned with the study of information retrieval (IR) on Accumulative Social Descriptions (ASDs). ASDs refer to Web texts that accumulated by many Web users describing certain Web resources, such as anchor texts, search logs and social annotations. There have been some studies working on leveraging ASDs for improving search performance in both internet and intranet. However, to the best of our knowledge, no prior study has concerned the specific generation features of ASDs, which are the focus point of this paper. Specifically, we consider the generation features from two perspectives, the generation processes and the generated distributions. Further, three probabilistic IR models are derived based on them. The three models are first demonstrated with one toy dataset and then empirically evaluated with two real datasets: an internet dataset consisting of 90,295 Web pages, with 25,845,818 social annotations crawled from Del.icio.us and 31,320,005 pieces of anchor texts crawled through Yahoo! API, and an intranet dataset consisting of 179,835 Web pages with 1,245,522 annotations dumped from the intranet tagging system in IBM, named as Dogear. Extensive experimental results show that the proposed methods, which fully leverage the generation features of ASDs, improve the performance of both internet and intranet search significantly.
iMecho: an associative memory based desktop search system BIBAFull-Text 731-740
  Jidong Chen; Hang Guo; Wentao Wu; Wei Wang
Traditional desktop search engines only support keyword based search that needs exact keyword matching to find resources. However, users generally have a vague picture of what is stored but forget the exact location and keywords of the resource. According to observations of human associative memory, people tend to remember things from some memory fragments in their brains and these memory fragments are connected by memory cues of user activity context. We developed iMecho (My Memory Echo), an associative memory based desktop search system, which exploits such associations and contexts to enhance traditional desktop search. Desktop resources are connected with semantic links mined from explicit and implicit user activities according to specific access patterns. Using these semantic links, associations among memory fragments can be built or rebuilt in a user's brain during a search. Moreover, our personalized ranking scheme uses these links together with a user's personal preferences to rank results by both relevance and importance to the user. In addition, the system provides a faceted search feature and association graph navigation to help users refine and associate search results generated by full-text keyword search. Our experiments investigating precision and recall quality of iMecho prototype show that the association-based search system is superior to the traditional keyword search in personal search engines since it is closer to the way that human associative memory works.
Product query classification BIBAFull-Text 741-750
  Dou Shen; Ying Li; Xiao Li; Dengyong Zhou
Web query classification is an effective way to understand Web user intents, which can further improve Web search and online advertising relevance. However, Web queries are usually very short which cannot fully reflect their meanings. What is more, it is quite hard to obtain enough training data for training accurate classifiers. Therefore, previous work on query classification has focused on two issues. One is how to represent Web queries through query expansion. The other is how to increase the amount of training data. In this paper, we took product query classification as an example, which is to classify Web queries into a predefined product taxonomy, and systematically studied the impact of query expansion and the size of training data. We proposed two methods of enriching Web queries and three approaches of collecting training data. Thereafter, we conducted a series of experiments to compare the classification performance of using different combinations of training data and query representations over a real data set. The data set consists of hundreds of thousands queries collected from a popular commercial search engine. From the experiments, we found some interesting observations, which were not discussed before. Finally, we proposed an effective and efficient product query classification method based on our observations.

KM information filtering and recommender systems

Learning to recommend questions based on user ratings BIBAFull-Text 751-758
  Ke Sun; Yunbo Cao; Xinying Song; Young-In Song; Xiaolong Wang; Chin-Yew Lin
At community question answering services, users are usually encouraged to rate questions by votes. The questions with the most votes are then recommended and ranked on the top when users browse questions by category. As users are not obligated to rate questions, usually only a small proportion of questions eventually gets rating. Thus, in this paper, we are concerned with learning to recommend questions from user ratings of a limited size. To overcome the data sparsity, we propose to utilize questions without users rating as well. Further, as there exist certain noises within user ratings (the preference of some users expressed in their ratings diverges from that of the majority of users), we design a new algorithm called 'majority-based perceptron algorithm' which can avoid the influence of noisy instances by emphasizing its learning over data instances from the majority users. Experimental results from a large collection of real questions confirm the effectiveness of our proposals.
Probabilistic latent preference analysis for collaborative filtering BIBAFull-Text 759-766
  Nathan N. Liu; Min Zhao; Qiang Yang
A central goal of collaborative filtering (CF) is to rank items by their utilities with respect to individual users in order to make personalized recommendations. Traditionally, this is often formulated as a rating prediction problem. However, it is more desirable for CF algorithms to address the ranking problem directly without going through an extra rating prediction step. In this paper, we propose the probabilistic latent preference analysis (pLPA) model for ranking predictions by directly modeling user preferences with respect to a set of items rather than the rating scores on individual items. From a user's observed ratings, we extract his preferences in the form of pairwise comparisons of items which are modeled by a mixture distribution based on Bradley-Terry model. An EM algorithm for fitting the corresponding latent class model as well as a method for predicting the optimal ranking are described. Experimental results on real world data sets demonstrated the superiority of the proposed method over several existing CF algorithms based on rating predictions in terms of ranking performance measure NDCG.
Semi-nonnegative matrix factorization with global statistical consistency for collaborative filtering BIBAFull-Text 767-776
  Hao Ma; Haixuan Yang; Irwin King; Michael R. Lyu
Collaborative Filtering, considered by many researchers as the most important technique for information filtering, has been extensively studied by both academic and industrial communities. One of the most popular approaches to collaborative filtering recommendation algorithms is based on low-dimensional factor models. The assumption behind such models is that a user's preferences can be modeled by linearly combining item factor vectors using user-specific coefficients. In this paper, aiming at several aspects ignored by previous work, we propose a semi-nonnegative matrix factorization method with global statistical consistency. The major contribution of our work is twofold: (1) We endow a new understanding on the generation or latent compositions of the user-item rating matrix. Under the new interpretation, our work can be formulated as the semi-nonnegative matrix factorization problem. (2) Moreover, we propose a novel method of imposing the consistency between the statistics given by the predicted values and the statistics given by the data. We further develop an optimization algorithm to determine the model complexity automatically. The complexity of our method is linear with the number of the observed ratings, hence it is scalable to very large datasets. Finally, comparing with other state-of-the-art methods, the experimental analysis on the EachMovie dataset illustrates the effectiveness of our approach.
Voting in social networks BIBAFull-Text 777-786
  Paolo Boldi; Francesco Bonchi; Carlos Castillo; Sebastiano Vigna
A voting system is a set of rules that a community adopts to take collective decisions. In this paper we study voting systems for a particular kind of community: electronically mediated social networks. In particular, we focus on delegative democracy (a.k.a. proxy voting) that has recently received increased interest for its ability to combine the benefits of direct and representative systems, and that seems also perfectly suited for electronically mediated social networks. In such a context, we consider a voting system in which users can only express their preference for one among the people they are explicitly connected with, and this preference can be propagated transitively, using an attenuation factor.
   We present this system and we study its properties. We also take into consideration the problem of missing votes, which is particularly relevant in online networks, as some recent case shows. Our experiments on real-world networks provide interesting insight into the significance and stability of the results obtained with the suggested voting system.
User-induced links in collaborative tagging systems BIBAFull-Text 787-796
  Ching-man Au Yeung; Nicholas Gibbins; Nigel Shadbolt
Collaborative tagging systems allow users to use tags to describe their favourite online documents. Two documents that are maintained in the collection of the same user and/or assigned similar sets of tags can be considered as related from the perspective of the user, even though they may not be connected by hyperlinks. We call this kind of implicit relations user-induced links between documents. We consider two methods of identifying user-induced links in collaborative tagging, and compare these links with existing hyperlinks on the Web. Our analyses show that user-induced links have great potentials to enrich the existing link structure of the Web. We also propose to use these links as a basis for predicting how documents would be tagged. Our experiments show that they achieve much higher accuracy than existing hyperlinks. This study suggests that by studying the collective behaviour of users we are able to enhance navigation and organisation of Web documents.

IR ranking and retrieval models I

A signal-to-noise approach to score normalization BIBAFull-Text 797-806
  Avi Arampatzis; Jaap Kamps
Score normalization is indispensable in distributed retrieval and fusion or meta-search where merging of result-lists is required. Distributional approaches to score normalization with reference to relevance, such as binary mixture models like the normal-exponential, suffer from lack of universality and troublesome parameter estimation especially under sparse relevance. We develop a new approach which tackles both problems by using aggregate score distributions without reference to relevance, and is suitable for uncooperative engines. The method is based on the assumption that scores produced by engines consist of a signal and a noise component which can both be approximated by submitting well-defined sets of artificial queries to each engine. We evaluate in a standard distributed retrieval testbed and show that the signal-to-noise approach yields better results than other distributional methods. As a significant by-product, we investigate query-length distributions.
Nonlinear static-rank computation BIBAFull-Text 807-816
  Shuming Shi; Bin Lu; Yunxiao Ma; Ji-Rong Wen
Mainstream link-based static-rank algorithms (e.g. PageRank and its variants) express the importance of a page as the linear combination of its in-links and compute page importance scores by solving a linear system in an iterative way. Such linear algorithms, however, may give apparently unreasonable static-rank results for some link structures. In this paper, we examine the static-rank computation problem from the viewpoint of evidence combination and build a probabilistic model for it. Based on the model, we argue that a nonlinear formula should be adopted, due to the correlation or dependence between links. We focus on examining some simple formulas which only consider the correlation between links in the same domain. Experiments conducted on 100 million web pages (with multiple static-rank quality evaluation metrics) show that higher quality static-rank could be yielded by the new nonlinear algorithms. The convergence of the new algorithms is also proved in this paper by nonlinear functional analysis.
A general magnitude-preserving boosting algorithm for search ranking BIBAFull-Text 817-826
  Chenguang Zhu; Weizhu Chen; Zeyuan Allen Zhu; Gang Wang; Dong Wang; Zheng Chen
Traditional boosting algorithms for the ranking problems usually employ the pairwise approach and convert the document rating preference into a binary-value label, like RankBoost. However, such a pairwise approach ignores the information about the magnitude of preference in the learning process. In this paper, we present the directed distance function (DDF) as a substitute for binary labels in pairwise approach to preserve the magnitude of preference and propose a new boosting algorithm called MPBoost, which applies GentleBoost optimization and directly incorporates DDF into the exponential loss function. We give the boundedness property of MPBoost through theoretic analysis. Experimental results demonstrate that MPBoost not only leads to better NDCG accuracy as compared to state-of-the-art ranking solutions in both public and commercial datasets, but also has good properties of avoiding the overfitting problem in the task of learning ranking functions.
Learning to rank from Bayesian decision inference BIBAFull-Text 827-836
  Jen-Wei Kuo; Pu-Jen Cheng; Hsin-Min Wang
Ranking is a key problem in many information retrieval (IR) applications, such as document retrieval and collaborative filtering. In this paper, we address the issue of learning to rank in document retrieval. Learning-based methods, such as RankNet, RankSVM, and RankBoost, try to create ranking functions automatically by using some training data. Recently, several learning to rank methods have been proposed to directly optimize the performance of IR applications in terms of various evaluation measures. They undoubtedly provide statistically significant improvements over conventional methods; however, from the viewpoint of decision-making, most of them do not minimize the Bayes risk of the IR system. In an attempt to fill this research gap, we propose a novel framework that directly optimizes the Bayes risk related to the ranking accuracy in terms of the IR evaluation measures. The results of experiments on the LETOR collections demonstrate that the framework outperforms several existing methods in most cases.
Reducing the risk of query expansion via robust constrained optimization BIBAFull-Text 837-846
  Kevyn Collins-Thompson
We introduce a new theoretical derivation, evaluation methods, and extensive empirical analysis for an automatic query expansion framework in which model estimation is cast as a robust constrained optimization problem. This framework provides a powerful method for modeling and solving complex expansion problems, by allowing multiple sources of domain knowledge or evidence to be encoded as simultaneous optimization constraints. Our robust optimization approach provides a clean theoretical way to model not only expansion benefit, but also expansion risk, by optimizing over uncertainty sets for the data. In addition, we introduce risk-reward curves to visualize expansion algorithm performance and analyze parameter sensitivity. We show that a robust approach significantly reduces the number and magnitude of expansion failures for a strong baseline algorithm, with no loss in average gain. Our approach is implemented as a highly efficient post-processing step that assumes little about the baseline expansion method used as input, making it easy to apply to existing expansion methods. We provide analysis showing that this approach is a natural and effective way to do selective expansion, automatically reducing or avoiding expansion in risky scenarios, and successfully attenuating noise in poor baseline methods.

DB streams, network databases

A code generation approach to optimizing high-performance distributed data stream processing BIBAFull-Text 847-856
  Bugra Gedik; Henrique Andrade; Kun-Lung Wu
We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S -- a large-scale, distributed stream processing platform.
Efficient join processing on uncertain data streams BIBAFull-Text 857-866
  Xiang Lian; Lei Chen
Join processing in the streaming environment has many practical applications such as data cleaning and outlier detection. Due to the inherent uncertainty in the real-world data, it has become an increasingly important problem to consider the join processing on uncertain data streams, where the incoming data at each timestamp are uncertain and imprecise. Different from the static databases, processing uncertain data streams has its own requirements such as the limited memory, small response time, and so on. To tackle the challenges with respect to efficiency and effectiveness, in this paper, we formalize the problem of join on uncertain data streams (USJ), which can guarantee the accuracy of USJ answers over uncertain data, and propose effective pruning methods to filter out false alarms. We integrate the pruning methods into an efficient query procedure for incrementally maintaining USJ answers. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of our approaches.
Fast shortest path distance estimation in large networks BIBAFull-Text 867-876
  Michalis Potamias; Francesco Bonchi; Carlos Castillo; Aristides Gionis
In this paper we study approximate landmark-based methods for point-to-point distance estimation in very large networks. These methods involve selecting a subset of nodes as landmarks and computing offline the distances from each node in the graph to those landmarks. At runtime, when the distance between a pair of nodes is needed, it can be estimated quickly by combining the precomputed distances. We prove that selecting the optimal set of landmarks is an NP-hard problem, and thus heuristic solutions need to be employed. We therefore explore theoretical insights to devise a variety of simple methods that scale well in very large networks. The efficiency of the suggested techniques is tested experimentally using five real-world graphs having millions of edges. While theoretical bounds support the claim that random landmarks work well in practice, our extensive experimentation shows that smart landmark selection can yield dramatically more accurate results: for a given target accuracy, our methods require as much as 250 times less space than selecting landmarks at random. In addition, we demonstrate that at a very small accuracy loss our techniques are several orders of magnitude faster than the state-of-the-art exact methods. Finally, we study an application of our methods to the task of social search in large graphs.
Evaluating top-k queries over incomplete data streams BIBAFull-Text 877-886
  Parisa Haghani; Sebastian Michel; Karl Aberer
We study the problem of continuous monitoring of top-k queries over multiple non-synchronized streams. Assuming a sliding window model, this general problem has been a well addressed research topic in recent years. Most approaches, however, assume synchronized streams where all attributes of an object are known simultaneously to the query processing engine. In many streaming scenarios though, different attributes of an item are reported in separate non-synchronized streams which do not allow for exact score calculations. We present how the traditional notion of object dominance changes in this case such that the k dominance set still includes all and only those objects which have a chance of being among the top-k results in their life time. Based on this, we propose an exact algorithm which builds on generating multiple instances of the same object in a way that enables efficient object pruning. We show that even with object pruning the necessary storage for exact evaluation of top-k queries is linear in the size of the sliding window. As data should reside in main memory to provide fast answers in an online fashion and cope with high stream rates, storing all this data may not be possible with limited resources. We present an approximate algorithm which leverages correlation statistics of pairs of streams to evict more objects while maintaining accuracy. We evaluate the efficiency of our proposed algorithms with extensive experiments.
Mining data streams with periodically changing distributions BIBAFull-Text 887-896
  Yingying Tao; M. Tamer Ozsu
Dynamic data streams are those whose underlying distribution changes over time. They occur in a number of application domains, and mining them is important for these applications. Coupled with the unboundedness and high arrival rates of data streams, the dynamism of the underlying distribution makes data mining challenging. In this paper, we focus on a large class of dynamic streams that exhibit periodicity in distribution changes. We propose a framework, called DMM, for mining this class of streams that includes a new change detection technique and a novel match-and-reuse approach. Once a distribution change is detected, we compare the new distribution with a set of historically observed distribution patterns and use the mining results from the past if a match is detected. Since, for two highly similar distributions, their mining results should also present high similarity, by matching and reusing existing mining results, the overall stream mining efficiency is improved while the accuracy is maintained. Our experimental results confirm this conjecture.

Panel information extraction meets relational databases: where are we heading?

Information extraction meets relation databases BIBAFull-Text 897
  Davood Rafiei; Andrei Broder; Edward Chang; Patrick Pantel
Information extraction from unstructured text has much in common with querying in databases systems. Despite some differences on how data is modeled or represented, the general goal remains the same, i.e. to retrieve data or tag elements that satisfy some user-specified constraints. In recent years, the two paradigms have become much closer thanks to the large volume of data on the World Wide Web and the need for more automated search tools for information extraction and often the need for relating the extracted pieces. Several developments have contributed to the growth of the area including the work on named entity recognition (marked by MUC-6 and subsequent conferences) and natural language processing, Web information retrieval and mining, and Web query languages inspired by the query languages in the relational world. This panel explores the areas where the two paradigms overlap, the impacts and contributions they have had on each other and the areas that may be open for further research. The panel will bring together researchers who have worked in some established areas that closely relate to extracting structured information from unstructured text. In the first (role-playing) round, each panelist will strongly take a side on where the intersection is heading, arguing that one area will subsume the other area in near future. In the second round, the panelists will counter one or two others, pointing out the challenges that one area would be facing in subsuming the other and implications for future research directions.

KM classification and clustering II

Clustering web queries BIBAFull-Text 899-908
  John S. Whissell; Charles L. A. Clarke; Azin Ashkan
Despite the wide applicability of clustering methods, their evaluation remains a problem. In this paper, we present a metric for the evaluation of clustering methods. The data set to be clustered is viewed as a sample from a larger population, with clustering quality measured in terms of our predicted ability to discriminate between members of this population. We measure this property by training a classifier to recognize each cluster and measuring the accuracy of this classifier, normalized by a notion of expected accuracy. To demonstrate the applicability of this metric we apply it to Web queries. We investigated a commercially oriented data set of 1700 queries and a general data set of 4000 queries. Both sets are taken from the logs of a commercial Web search engine. Clustering is based on the contents of search engine result pages generated by executing the queries on the search engine from which they were taken. Multiple clustering algorithms are crossed with various weighting schemes to produce multiple clusterings of each query set. Our metric is used evaluate these clusterings. The results on the commercially oriented data set are compared to two pre-existing manual labelings, and are also used in an ad clickthrough experiment.
Evidence of quality of textual features on the web 2.0 BIBAFull-Text 909-918
  Flavio Figueiredo; Fabiano Belém; Henrique Pinto; Jussara Almeida; Marcos Gonçalves; David Fernandes; Edleno Moura; Marco Cristo
The growth of popularity of Web 2.0 applications greatly increased the amount of social media content available on the Internet. However, the unsupervised, user-oriented nature of this source of information, and thus, its potential lack of quality, have posed a challenge to information retrieval (IR) services. Previous work focuses mostly only on tags, although a consensus about its effectiveness as supporting information for IR services has not yet been reached. Moreover, other textual features of the Web 2.0 are generally overseen by previous research.
   In this context, this work aims at assessing the relative quality of distinct textual features available on the Web 2.0. Towards this goal, we analyzed four features (title, tags, description and comments) in four popular applications (CiteULike, Last.FM, Yahoo! Video, and YouTube). Firstly, we characterized data from these applications in order to extract evidence of quality of each feature with respect to usage, amount of content, descriptive and discriminative power as well as of content diversity across features. Afterwards, a series of classification experiments were conducted as a case study for quality evaluation. Characterization and classification results indicate that: 1) when considered separately, tags is the most promising feature, achieving the best classification results, although its absence in a non-negligible fraction of objects may affect its potential use; and 2) each feature may bring different pieces of information, and combining their contents can improve classification.
Exploiting internal and external semantics for the clustering of short texts using world knowledge BIBAFull-Text 919-928
  Xia Hu; Nan Sun; Chao Zhang; Tat-Seng Chua
Clustering of short texts, such as snippets, presents great challenges in existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. As short texts do not provide sufficient term occurring information, traditional text representation methods, such as "bag of words" model, have several limitations when directly applied to short texts tasks. In this paper, we propose a novel framework to improve the performance of short texts clustering by exploiting the internal semantics from original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity problem of original short texts and reconstruct the corresponding feature space with the integration of multiple semantic knowledge bases -- Wikipedia and WordNet. Empirical evaluation with Reuters and real web dataset demonstrates that our approach is able to achieve significant improvement as compared to the state-of-the-art methods.
SELC: a self-supervised model for sentiment classification BIBAFull-Text 929-936
  Likun Qiu; Weishi Zhang; Changjian Hu; Kai Zhao
This paper presents the SELC Model (SElf-Supervised, (Lexicon-based and (Corpus-based Model) for sentiment classification. The SELC Model includes two phases. The first phase is a lexicon-based iterative process. In this phase, some reviews are initially classified based on a sentiment dictionary. Then more reviews are classified through an iterative process with a negative/positive ratio control. In the second phase, a supervised classifier is learned by taking some reviews classified in the first phase as training data. Then the supervised classifier applies on other reviews to revise the results produced in the first phase. Experiments show the effectiveness of the proposed model. SELC totally achieves 6.63% F1-score improvement over the best result in previous studies on the same data (from 82.72% to 89.35%). The first phase of the SELC Model independently achieves 5.90% improvement (from 82.72% to 88.62%). Moreover, the standard deviation of F1-scores is reduced, which shows that the SELC Model could be more suitable for domain-independent sentiment classification.
Graph-based transfer learning BIBAFull-Text 937-946
  Jingrui He; Yan Liu; Richard Lawrence
Transfer learning is the task of leveraging the information from labeled examples in some domains to predict the labels for examples in another domain. It finds abundant practical applications, such as sentiment prediction, image classification and network intrusion detection. In this paper, we propose a graph-based transfer learning framework. It propagates the label information from the source domain to the target domain via the example-feature-example tripartite graph, and puts more emphasis on the labeled examples from the target domain via the example-example bi-partite graph. Our framework is semi-supervised and non-parametric in nature and thus more flexible. We also develop an iterative algorithm so that our framework is scalable to large-scale applications. It enjoys the theoretical property of convergence. Compared with existing transfer learning methods, the proposed framework propagates the label information to both the features irrelevant to the source domain and the unlabeled examples in the target domain via the common features in a principled way. Experimental results on 3 real data sets demonstrate the effectiveness of our algorithm.

IR domain-specific retrieval I

A unified relevance model for opinion retrieval BIBAFull-Text 947-956
  Xuanjing Huang; W. Bruce Croft
Representing the information need is the greatest challenge for opinion retrieval. Typical queries for opinion retrieval are composed of either just content words, or content words with a small number of cue "opinion" words. Both are inadequate for retrieving opinionated documents. In this paper, we develop a general formal framework -- the opinion relevance model -- to represent an information need for opinion retrieval. We explore a series of methods to automatically identify the most appropriate opinion words for query expansion, including using query independent sentiment resources. We also propose a relevance feedback-based approach to extract opinion words. Both query-independent and query-dependent methods can also be integrated into a more effective mixture relevance model. Finally, opinion retrieval experiments are presented for the Blog06 and COAE08 text collections. The results show that, significant improvements can always be obtained by this opinion relevance model whether sentiment resources are available or not.
Detecting topic evolution in scientific literature: how can citations help? BIBAFull-Text 957-966
  Qi He; Bi Chen; Jian Pei; Baojun Qiu; Prasenjit Mitra; Lee Giles
Understanding how topics in scientific literature evolve is an interesting and important problem. Previous work simply models each paper as a bag of words and also considers the impact of authors. However, the impact of one document on another as captured by citations, one important inherent element in scientific literature, has not been considered. In this paper, we address the problem of understanding topic evolution by leveraging citations, and develop citation-aware approaches. We propose an iterative topic evolution learning framework by adapting the Latent Dirichlet Allocation model to the citation network and develop a novel inheritance topic model. We evaluate the effectiveness and efficiency of our approaches and compare with the state of the art approaches on a large collection of more than 650,000 research papers in the last 16 years and the citation network enabled by CiteSeerX. The results clearly show that citations can help to understand topic evolution better.
Efficient information retrieval in mobile peer-to-peer networks BIBAFull-Text 967-976
  Lijiang Chen; Bin Cui; Heng Tao Shen; Wei Lu; Xiaofang Zhou
Mobile devices have become indispensable in daily life, and hence how to take advantage of these portable and powerful facilities to share resources and information begins to emerge as an interesting problem. In this paper, we investigate the problem of information retrieval in a mobile peer-to-peer network. The prevailing approach to information retrieval is to apply flooding methods because of its quick response and easy maintenance. Obviously, this kind of approach wastes a huge amount of communication bandwidth which greatly affects the availability of the network, and the battery power which significantly shortens the serving time of mobile devices in the network. To tackle this problem, we propose a novel approach by mimicking different human behaviors of social networks, which takes advantages of Intelligence Accuracy (IA) mechanism that evaluates the distance from a node to certain resources in the network. Extensive experimental results show the efficiency and effectiveness of our approach as well as its scalability in a volatile environment.
Language-model-based ranking for queries on RDF-graphs BIBAFull-Text 977-986
  Shady Elbassuoni; Maya Ramanath; Ralf Schenkel; Marcin Sydow; Gerhard Weikum
The success of knowledge-sharing communities like Wikipedia and the advances in automatic information extraction from textual and Web sources have made it possible to build large "knowledge repositories" such as DBpedia, Freebase, and YAGO. These collections can be viewed as graphs of entities and relationships (ER graphs) and can be represented as a set of subject-property-object (SPO) triples in the Semantic-Web data model RDF. Queries can be expressed in the W3C-endorsed SPARQL language or by similarly designed graph-pattern search. However, exact-match query semantics often fall short of satisfying the users' needs by returning too many or too few results. Therefore, IR-style ranking models are crucially needed.
   In this paper, we propose a language-model-based approach to ranking the results of exact, relaxed and keyword-augmented graph pattern queries over RDF graphs such as ER graphs. Our method estimates a query model and a set of result-graph models and ranks results based on their Kullback-Leibler divergence with respect to the query model. We demonstrate the effectiveness of our ranking model by a comprehensive user study.
Heterogeneous cross domain ranking in latent space BIBAFull-Text 987-996
  Bo Wang; Jie Tang; Wei Fan; Songcan Chen; Zi Yang; Yanzhu Liu
Traditional ranking mainly focuses on one type of data source, and effective modeling still relies on a sufficiently large number of labeled or supervised examples. However, in many real-world applications, in particular with the rapid growth of the Web 2.0, ranking over multiple interrelated (heterogeneous) domains becomes a common situation, where in some domains we may have a large amount of training data while in some other domains we can only collect very little. One important question is: "if there is not sufficient supervision in the domain of interest, how could one borrow labeled information from a related but heterogenous domain to build an accurate model?". This paper explores such an approach by bridging two heterogeneous domains via the latent space. We propose a regularized framework to simultaneously minimize two loss functions corresponding to two related but different information sources, by mapping each domain onto a "shared latent space", capturing similar and transferable concepts. We solve this problem by optimizing the convex upper bound of the non-continuous loss function and derive its generalization bound. Experimental results on three different genres of data sets demonstrate the effectiveness of the proposed approach.

DB data warehousing and OLAP

Supporting ranking pattern-based aggregate queries in sequence data cubes BIBAFull-Text 997-1006
  Chun Kit Chui; Eric Lo; Ben Kao; Wai-Shing Ho
Sequence data processing has been studied extensively in the literature.
   In recent years, the warehousing and online-analytical processing (OLAP) of archived sequence data have received growing attentions. In particular, the concept of sequence OLAP is recently proposed with the objective of evaluating various kinds of so-called Pattern-Based Aggregate (PBA) queries so that various kinds of data analytical tasks on sequence data can be carried out efficiently. This paper studies the evaluation of ranking PBA queries, which rank the results of PBA queries and return only the top-ranked ones to users. We discuss how ranking PBA queries drastically improve the usability of S-OLAP systems and present techniques that can evaluate various kinds of ranking PBA queries efficiently.
Fuzzy semantic web ontology learning from fuzzy UML model BIBAFull-Text 1007-1016
  Fu Zhang; Z. M. Ma; Jingwei Cheng; Xiangfu Meng
How to quickly and cheaply construct Web ontologies has become a key technology to enable the Semantic Web. Classical ontologies are not sufficient for handling imprecise and uncertain information that is commonly found in many application domains. In this paper, we propose an approach for constructing fuzzy ontologies from fuzzy UML models, in which the fuzzy ontology consists of fuzzy ontology structure and instances. Firstly, the fuzzy UML model is investigated in detail, and a kind of formal definition of fuzzy UML models is proposed. Then, a kind of fuzzy ontology called fuzzy OWL DL ontology is introduced. Furthermore, we consider the fuzzy UML model and the corresponding fuzzy UML instantiations (i.e., object diagrams) simultaneously, and translate them into the fuzzy ontology structure and the fuzzy ontology instances, respectively. In addition, since a fuzzy OWL DL ontology is equivalent to a fuzzy Description Logic f-SHOIN(D) knowledge base, how the reasoning problems of fuzzy UML models (e.g., consistency, subsumption, equivalence, and redundancy) may be reasoned through reasoning mechanism of f-SHOIN(D) is investigated, which can help to construct fuzzy ontologies more exactly.
Efficient joins with compressed bitmap indexes BIBAFull-Text 1017-1026
  Kamesh Madduri; Kesheng Wu
We present a new class of adaptive algorithms that use compressed bitmap indexes to speed up evaluation of the range join query in relational databases. We determine the best strategy to process a join query based on a fast sub-linear time computation of the join selectivity (the ratio of the number of tuples in the result to the total number of possible tuples). In addition, we use compressed bitmaps to represent the join output compactly: the space requirement for storing the tuples representing the join of two relations is asymptotically bounded by min(h; n.cb), where h is the number of tuple pairs in the result relation, n is the number of tuples in the smaller of the two relations, and cb is the cardinality of the larger column being joined. We present a theoretical analysis of our algorithms, as well as experimental results on large-scale synthetic and real data sets. Our implementations are efficient, and consistently outperform well-known approaches for a range of join selectivity factors. For instance, our count-only algorithm is up to three orders of magnitude faster than the sort-merge approach, and our best bitmap index-based algorithm is 1.2x-80x faster than the sort-merge algorithm, for various query instances. We achieve these speedups by exploiting several inherent performance advantages of compressed bitmap indexes for join processing: an implicit partitioning of the attributes, space-efficiency, and tolerance of high-cardinality relations.
A framework for semantic link discovery over relational data BIBAFull-Text 1027-1036
  Oktie Hassanzadeh; Anastasios Kementsietsidis; Lipyeow Lim; Renée J. Miller; Min Wang
Discovering links between different data items in a single data source or across different data sources is a challenging problem faced by many information systems today. In particular, the recent Linking Open Data (LOD) community project has highlighted the paramount importance of establishing semantic links among web data sources. Currently, LOD sources provide billions of RDF triples, but only millions of links between data sources. Many of these data sources are published using tools that operate over relational data stored in a standard RDBMS. In this paper, we present a framework for discovery of semantic links from relational data. Our framework is based on declarative specification of linkage requirements by a user. We illustrate the use of our framework using several link discovery algorithms on a real world scenario. Our framework allows data publishers to easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of web.
POkA: identifying Pareto-optimal k-anonymous nodes in a domain hierarchy lattice BIBAFull-Text 1037-1046
  Rinku Dewri; Indrajit Ray; Indrakshi Ray; Darrell Whitley
Data generalization is widely used to protect identities and prevent inference of sensitive information during the public release of microdata. The k-anonymity model has been extensively applied in this context. The model seeks a generalization scheme such that every individual becomes indistinguishable from at least k-1 other individuals and the loss in information while doing so is kept at a minimum. The search is performed on a domain hierarchy lattice where every node is a vector signifying the level of generalization for each attribute. An effort to understand privacy and data utility trade-offs will require knowing the minimum possible information losses of every possible value of k. However, this can easily lead to an exhaustive evaluation of all nodes in the hierarchy lattice. In this paper, we propose using the concept of Pareto-optimality to obtain the desired trade-off information. A Pareto-optimal generalization is one in which no other generalization can provide a higher value of k without increasing the information loss. We introduce the Pareto-Optimal k-Anonymization (POkA) algorithm to traverse the hierarchy lattice and show that the number of node evaluations required to find the Pareto-optimal generalizations can be significantly reduced. Results on a benchmark data set show that the algorithm is capable of identifying all Pareto-optimal nodes by evaluating only 20% of nodes in the lattice.

Industry data mining framework and applications

Practical lessons of data mining at Yahoo! BIBAFull-Text 1047-1056
  Ye Chen; Dmitry Pavlov; Pavel Berkhin; Aparna Seetharaman; Albert Meltzer
The usage of data in many commercial applications has been growing at an unprecedented pace in the last decade. While successful data mining efforts lead to major business advances, there were also numerous, less publicized efforts that for one or another reason failed. In this paper, we discuss practical lessons based on years of our data mining experiences at Yahoo! and offer insights into how to drive the data mining effort to success in a business environment.
   We use two significant Yahoo's applications as illustrative examples: shopping categorization and behavioral targeting; and reflect on four success factors: methodology, data, infrastructure, and people.
Domain driven data mining to improve promotional campaign ROI and select marketing channels BIBAFull-Text 1057-1066
  Thomas Piton; Julien Blanchard; Henri Briand; Fabrice Guillet
The trading activities of materials retail is concerned with an extremely competitive market. However, business people are not well informed about how to proceed and what to do during marketing activities. Data mining methods could be interesting to generate substantial profits for decision makers and to optimize the choice of different marketing activities.
   In this paper, we propose an actionable knowledge discovery methodology, for one-to-one marketing, which allows to contact the right customer through the right communication channel. This methodology first requires a measurement of the tendency for the customers to purchase a given item, and second requires an optimization of the Return On Investment by selecting the most effective communication channels for attracting these customers. Our methodology has been applied to the VM Matériaux company. Thanks to the collaboration between data miners and decision makers, we present a domain-driven view of knowledge discovery satisfying real business needs to improve the efficiency and outcome of several promotional marketing campaigns.
Framework for timely and accurate ads on mobile devices BIBAFull-Text 1067-1076
  Alex Penev; Raymond K. Wong
We propose a framework for mobile advertising covering Value-Added Services where an ad-database is maintained on the device and both selection and display are dictated by the device. Advantages over existing mobile marketing are that ads are more timely, viable on a variety of use cases, can be both location-sensitive and personalized with minimal privacy concerns, and provide an obvious means for subsidizing users' service costs. We construct a suitable selection algorithm and evaluate its execution, accuracy and scalability. We show that ad-serving can be done under the processing constraints imposed by mobiles, which may lead to improvements in mobile marketing effectiveness.
Improving web page classification by label-propagation over click graphs BIBAFull-Text 1077-1086
  Soo-Min Kim; Patrick Pantel; Lei Duan; Scott Gaffney
In this paper, we present a semi-supervised learning method for web page classification, leveraging click logs to augment training data by propagating class labels to unlabeled similar documents. Current state-of-the-art classifiers are supervised and require large amounts of manually labeled data. We hypothesize that unlabeled documents similar to our positive and negative labeled documents tend to be clicked through by the same user queries. Our proposed method leverages this hypothesis and augments our training set by modeling the similarity between documents in a click graph. We experiment with three different web page classifiers and show empirical evidence that our proposed approach outperforms state-of-the-art methods and reduces the amount of human effort to label training data.
Product feature categorization with multilevel latent semantic association BIBAFull-Text 1087-1096
  Honglei Guo; Huijia Zhu; Zhili Guo; XiaoXun Zhang; Zhong Su
In recent years, the number of freely available online reviews is increasing at a high speed. Aspect-based opinion mining technique has been employed to find out reviewers' opinions toward different product aspects. Such finer-grained opinion mining is valuable for the potential customers to make their purchase decisions. Product-feature extraction and categorization is very important for better mining aspect-oriented opinions. Since people usually use different words to describe the same aspect in the reviews, product-feature extraction and categorization becomes more challenging. Manually product-feature extraction and categorization is tedious and time consuming, and practically infeasible for the massive amount of products. In this paper, we propose an unsupervised product-feature categorization method with multilevel latent semantic association. After extracting product-features from the semi-structured reviews, we construct the first latent semantic association (LaSA) model to group words into a set of concepts according to their virtual context documents. It generates the latent semantic structure for each product-feature. The second LaSA model is constructed to categorize the product-features according to their latent semantic structures and context snippets in the reviews. Experimental results demonstrate that our method achieves better performance compared with the existing approaches. Moreover, the proposed method is language- and domain-independent.

KM link analysis and social computing

Completing wikipedia's hyperlink structure through dimensionality reduction BIBAFull-Text 1097-1106
  Robert West; Doina Precup; Joelle Pineau
Wikipedia is the largest monolithic repository of human knowledge. In addition to its sheer size, it represents a new encyclopedic paradigm by interconnecting articles through hyperlinks. However, since these links are created by human authors, links one would expect to see are often missing. The goal of this work is to detect such gaps automatically. In this paper, we propose a novel method for augmenting the structure of hyperlinked document collections such as Wikipedia. It does not require the extraction of any manually defined features from the article to be augmented. Instead, it is based on principal component analysis, a well-founded mathematical generalization technique, and predicts new links purely based on the statistical structure of the graph formed by the existing links. Our method does not rely on the textual content of articles; we are exploiting only hyperlinks. A user evaluation of our technique shows that it improves the quality of top link suggestions over the state of the art and that the best predicted links are significantly more valuable than the 'average' link already present in Wikipedia. Beyond link prediction, our algorithm can potentially be used to point out topics an article misses to cover and to cluster articles semantically.
Scalable learning of collective behavior based on sparse social dimensions BIBAFull-Text 1107-1116
  Lei Tang; Huan Liu
The study of collective behavior is to understand how individuals behave in a social network environment. Oceans of data generated by social media like Facebook, Twitter, Flickr and YouTube present opportunities and challenges to studying collective behavior in a large scale. In this work, we aim to learn to predict collective behavior in social media. In particular, given information about some individuals, how can we infer the behavior of unobserved individuals in the same network? A social-dimension based approach is adopted to address the heterogeneity of connections presented in social media. However, the networks in social media are normally of colossal size, involving hundreds of thousands or even millions of actors. The scale of networks entails scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the social-dimension based approach can efficiently handle networks of millions of actors while demonstrating comparable prediction performance as other non-scalable methods.
Blog cascade affinity: analysis and prediction BIBAFull-Text 1117-1126
  Hui Li; Sourav S. Bhowmick; Aixin Sun
Information propagation within the blogosphere is of much importance in implementing policies, marketing research, launching new products, and other applications. In this paper, we take a microscopic view of the information propagation pattern in blogosphere by investigating blog cascade affinity. A blog cascade is a group of posts linked together discussing about the same topic, and cascade affinity refers to the phenomenon of a blog's inclination to join a specific cascade. We identify and analyze an array of features that may affect a blogger's cascade joining behavior and utilize these features to predict cascade affinity of blogs. Evaluated on a real dataset consisting of 873,496 posts, our svm-based prediction achieved accuracy of 0.723 measured by F1. Our experiments also showed that among all features identified, the number of friends was the most important factor affecting bloggers' inclination to join cascades.
Socializing or knowledge sharing?: characterizing social intent in community question answering BIBAFull-Text 1127-1136
  Eduarda Mendes Rodrigues; Natasa Milic-Frayling
Knowledge sharing communities, such as Wikipedia or Yahoo! Answers, add greatly to the wealth of information available on the Web. They represent complex social ecosystems that rely on user participation and the quality of users' contributions to prosper. However, quality is harder to achieve when knowledge sharing is facilitated through a high degree of personal interactions. The individuals' objectives may change from knowledge sharing to socializing, with a profound impact on the community and the value it delivers to the broader population of Web users. In this paper we provide new insights into the types of content that is shared through Community Question Answering (CQA) services. We demonstrate an approach that combines in-depth content analysis with social network analysis techniques. We adapted the Undirected Inductive Coding method to analyze samples of user questions and arrive at a comprehensive typology of the user intent. In our analysis we focused on two types of intent, social vs. non-social, and define measures of social engagement to characterize the users' participation and content contributions. Our approach is applicable to a broad class of online communities and can be used to monitor the dynamics of community ecosystems.

KM data summarization

Time sequence summarization to scale up chronology-dependent applications BIBAFull-Text 1137-1146
  Quang-Khai Pham; Guillaume Raschia; Noureddine Mouaddib; Regis Saint-Paul; Boualem Benatallah
In this paper, we present the concept of Time Sequence Summarization to support chronology-dependent applications on massive data sources. Time sequence summarization takes as input a time sequence of events that are chronologically ordered. Each event is described by a set of descriptors. Time sequence summarization produces a concise time sequence that can be substituted for the original time sequence in chronology-dependent applications. We propose an algorithm that achieves time sequence summarization based on a generalization, grouping and concept formation process. Generalization expresses event descriptors at higher levels of abstraction using taxonomies while grouping gathers similar events. Concept formation is responsible for reducing the size of the input time sequence of events by representing each group created by one concept. The process is performed in a way such that the overall chronology of events is preserved. The algorithm computes the summary incrementally and has reduced algorithmic complexity. The resulting output is a concise representation, yet, informative enough to directly support chronology-dependent applications. We validate our approach by summarizing one year of financial news provided by Reuters.
Compressing tags to find interesting media groups BIBAFull-Text 1147-1156
  Matthijs van Leeuwen; Francesco Bonchi; Börkur Sigurbjörnsson; Arno Siebes
On photo sharing websites like Flickr and Zooomr, users are offered the possibility to assign tags to their uploaded pictures. Using these tags to find interesting groups of semantically related pictures in the result set of a given query is a problem with obvious applications. We analyse this problem from a Minimum Description Length (MDL) perspective and develop an algorithm that finds the most interesting groups. The method is based on Krimp, which finds small sets of patterns that characterise the data using compression. These patterns are sets of tags, often assigned together to photos. The better a database compresses, the more structure it contains and thus the more homogeneous it is. Following this observation we devise a compression-based measure. Our experiments on Flickr data show that the most interesting and homogeneous groups are found. We show extensive examples and compare to clusterings on the Flickr website.
Efficient feature weighting methods for ranking BIBAFull-Text 1157-1166
  Hwanjo Yu; Jinoh Oh; Wook-Shin Han
Feature weighting or selection is a crucial process to identify an important subset of features from a data set. Removing irrelevant or redundant features can improve the generalization performance of ranking functions in information retrieval. Due to fundamental differences between classification and ranking, feature weighting methods developed for classification cannot be readily applied to feature weighting for ranking. A state of the art feature selection method for ranking, called GAS, has been recently proposed, which exploits importance of each feature and similarity between every pair of features. However, GAS must compute the similarity scores of all pairs of features, thus it is not scalable for high-dimensional data and its performance degrades on nonlinear ranking functions. This paper proposes novel algorithms, RankWrapper and RankFilter, which is scalable for high-dimensional data and also performs reasonably well on nonlinear ranking functions. RankWrapper and RankFilter are designed based on the key idea of Relief algorithm. Relief is a feature selection algorithm for classification, which exploits the notions of hits (data points within the same class) and misses (data points from different classes) for classification. However, there is no such notion of hits or misses in ranking. The proposed algorithms instead utilize the ranking distances of nearest data points in order to identify the key features for ranking. Our extensive experiments show that RankWrapper and RankFilter generate higher accuracy overall than the GAS and traditional Relief algorithms adapted for ranking, and run substantially faster than the GAS on high dimensional data.
Fast and effective histogram construction BIBAFull-Text 1167-1176
  Felix Halim; Panagiotis Karras; Roland H. C. Yap
Histogram construction or sequence segmentation is a basic task with applications in database systems, information retrieval, and knowledge management. Its aim is to approximate a sequence by line segments. Unfortunately, the quadratic algorithm that derives an optimal histogram for Euclidean error lacks the desired scalability. Therefore, sophisticated approximation algorithms have been recently proposed, while several simple heuristics are used in practice. Still, these solutions fail to resolve the efficiency-quality tradeoff in a satisfactory manner. In this paper we take a fresh view on the problem. We propose conceptually clear and scalable algorithms that efficiently derive high-quality histograms. We experimentally demonstrate that existing approximation schemes fail to deliver the desired efficiency and conventional heuristics do not fare well on the side of quality. On the other hand, our schemes match or exceed the quality of the former and the efficiency of the latter.

Industry data and query similarity

Characterizing, constructing and managing resource usage profiles of system S applications: challenges and experience BIBAFull-Text 1177-1186
  Sujay Parekh; Kirsten W. Hildrum; Deepak Rajan; Joel L. Wolf; Kun-Lung Wu
We describe the challenges of characterizing, constructing and managing the usage profiles of System S applications. A running System S application is a directed graph with software processing elements (PEs) as vertices and data streams as edges connecting the PEs. The resource usage of each PE is a critical input to the runtime scheduler for proper resource allocation. We represent the resource usage of PEs in terms of resource functions (RFs) that are used by the System S scheduler, with one RF per resource per PE. The first challenge is that it is difficult to build good RFs that can accurately predict the resource usage of a PE because the PEs perform arbitrary computations. A second set of challenges arises in managing the RFs and performance data so that we can apply them for PEs that are re-run or reused by the same or different applications or users. We report our experience in overcoming these challenges. Specifically, we present an empirical characterization of PE RFs from several real streaming applications running in a System S testbed. This indicates that our simple models of resource usage that build on the data-flow nature of the underlying application can be effective, even for complex PEs. To illustrate our methodology, we evaluate and analyze the performance of these applications as a function of the quality of our resource profile models. The system automatically learns the models from the raw metrics data collected from running PEs. We describe our approach to managing the metrics and RF models, which allows us to construct generalizable RFs and eliminates the learning time for new PEs by intelligently storing and reusing the metrics data.
Generating SQL/XML query and update statements BIBAFull-Text 1187-1196
  Matthias Nicola; Tim Kiefer
The XML support in relational databases and the SQL/XML language are still relatively new as compared to purely relational databases and traditional SQL. Today, most database users have a strong relational and SQL background. SQL/XML enables users to perform queries and updates across XML and relational data, but many struggle with writing SQL/XML statements or XQuery update expressions. One reason is the novelty of SQL/XML and of the XQuery expressions that must be included. Another problem is that the tree structure of the XML data may be unknown or difficult to understand for the user. Evolving XML Schemas as well as hybrid XML/relational schemas make it even harder to write SQL/XML statements. Also, legacy applications use SQL but may require access to XML data without costly code changes.
   Motivated by these challenges, we developed a method to generate SQL/XML query and update statements automatically. The input is either a GUI or a regular SQL statement that uses logical data item names irrespective of their actual location in relational or XML columns in the database. The output is a SQL/XML statement that queries or updates relational and XML data as needed to carry out the original user statement. This relieves the user and simplifies schema evolution and integration. We have prototyped and tested the proposed method on top of DB2 9.5.
A system for detecting xml similarity in content and structure using relational database BIBAFull-Text 1197-1206
  Waraporn Viyanon; Sanjay Kumar Madria
In this paper, we describe a system incorporating an improved technique that detects the similarity of two XML documents based on content and structure similarity using keys. The technique consists of three major components: a subtree generator and validator, a key generator, and similarity components that compare content and structure of the XML documents. First, an XML document is stored in a relational database and extracted into small subtrees using leaf-node parents. The leaf-node parents are considered as a root of a subtree which is then recursively traversed bottom-up for matching. Second, a possible key(s) is identified in order to match XML subtrees from two documents efficiently. Key matchings help in reducing the number of comparisons dramatically. In addition, the number of subtrees to be processed is reduced in the subtree validation phase using instance statistics and taxonomic analyzer. The subtrees are matched by the key(s) first and the remaining subtrees are matched by finding degrees of similarity in content and structure. To obtain improved similarity comparison results, XML element names are transformed according to their semantic similarity. The results show that the clustering points are selected appropriately and the overall execution time is reduced dramatically.
Characteristics of document similarity measures for compliance analysis BIBAFull-Text 1207-1216
  Asad Sayeed; Soumitra Sarkar; Yu Deng; Rafah Hosn; Ruchi Mahindru; Nithya Rajamani
Due to increased competition in the IT Services business, improving quality, reducing costs and shortening schedules has become extremely important. A key strategy being adopted for achieving these goals is the use of an asset-based approach to service delivery, where standard reusable components developed by domain experts are minimally modified for each customer instead of creating custom solutions. One example of this approach is the use of contract templates, one for each type of service offered. A compliance checking system that measures how well actual contracts adhere to standard templates is critical for ensuring the success of such an approach. This paper describes the use of document similarity measures -- Cosine similarity and Latent Semantic Indexing -- to identify the top candidate templates on which a more detailed (and expensive) compliance analysis can be performed. Comparison of results of using the different methods are presented.

IR personalization and social search I

PQC: personalized query classification BIBAFull-Text 1217-1226
  Bin Cao; Jian-Tao Sun; Evan Wei Xiang; Derek Hao Hu; Qiang Yang; Zheng Chen
Query classification (QC) is a task that aims to classify Web queries into topical categories. Since queries are usually short in length and ambiguous, the same query may need to be classified to different categories according to different people's perspectives. In this paper, we propose the Personalized Query Classification (PQC) task and develop an algorithm based on user preference learning as a solution. Users' preferences that are hidden in clickthrough logs are quite helpful for search engines to improve their understandings of users' queries. We propose to connect query classification with users' preference learning from clickthrough logs for PQC. To tackle the sparseness problem in clickthrough logs, we propose a collaborative ranking model to leverage similar users' information. Experiments on a real world clickthrough log data show that our proposed PQC algorithm can gain significant improvement compared with general QC as well as natural baselines. Our method can be applied to a wide range of applications including personalized search and online advertising.
Personalized social search based on the user's social network BIBAFull-Text 1227-1236
  David Carmel; Naama Zwerdling; Ido Guy; Shila Ofek-Koifman; Nadav Har'el; Inbal Ronen; Erel Uziel; Sivan Yogev; Sergey Chernov
This work investigates personalized social search based on the user's social relations -- search results are re-ranked according to their relations with individuals in the user's social network. We study the effectiveness of several social network types for personalization: (1) Familiarity-based network of people related to the user through explicit familiarity connection; (2) Similarity-based network of people "similar" to the user as reflected by their social activity; (3) Overall network that provides both relationship types. For comparison we also experiment with Topic-based personalization that is based on the user's related terms, aggregated from several social applications. We evaluate the contribution of the different personalization strategies by an off-line study and by a user survey within our organization. In the off-line study we apply bookmark-based evaluation, suggested recently, that exploits data gathered from a social bookmarking system to evaluate personalized retrieval. In the on-line study we analyze the feedback of 240 employees exposed to the alternative personalization approaches. Our main results show that both in the off-line study and in the user survey social network based personalization significantly outperforms non-personalized social search. Additionally, as reflected by the user survey, all three SN-based strategies significantly outperform the Topic-based strategy.
Beyond hyperlinks: organizing information footprints in search logs to support effective browsing BIBAFull-Text 1237-1246
  Xuanhui Wang; Bin Tan; Azadeh Shakery; ChengXiang Zhai
While current search engines serve known-item search such as homepage finding very well, they generally cannot support exploratory search effectively. In exploratory search, users do not know their information needs precisely and also often lack the needed knowledge to formulate effective queries, thus querying alone, as supported by the current search engines, is insufficient, and browsing into related information would be very useful. Currently, browsing is mostly done by following hyperlinks embedded on Web pages. In this paper, we propose to leverage search logs to allow a user to browse beyond hyperlinks with a multi-resolution topic map constructed based on search logs. Specifically, we treat search logs as "footprints" left by previous users in the information space and build a multi-resolution topic map to semantically capture and organize them in multiple granularities. Such a topic map can support a user to zoom in, zoom out, and navigate horizontally over the information space, and thus provide flexible and effective browsing capabilities for end users. To test the effectiveness of the proposed methods of supporting browsing, we rely on real search logs and a commercial search engine to implement our proposed methods. Our experimental results show that the proposed topic map is effective to support browsing beyond hyperlinks.
A social recommendation framework based on multi-scale continuous conditional random fields BIBAFull-Text 1247-1256
  Xin Xin; Irwin King; Hongbo Deng; Michael R. Lyu
This paper addresses the issue of social recommendation based on collaborative filtering (CF) algorithms. Social recommendation emphasizes utilizing various attributes information and relations in social networks to assist recommender systems. Although recommendation techniques have obtained distinct developments over the decades, traditional CF algorithms still have these following two limitations: (1) relational dependency within predictions, an important factor especially when the data is sparse, is not being utilized effectively; and (2) straightforward methods for combining features like linear integration suffer from high computing complexity in learning the weights by enumerating the whole value space, making it difficult to combine various information into an unified approach. In this paper, we propose a novel model, Multi-scale Continuous Conditional Random Fields (MCCRF), as a framework to solve above problems for social recommendations. In MCCRF, relational dependency within predictions is modeled by the Markov property, thus predictions are generated simultaneously and can help each other. This strategy has never been employed previously. Besides, diverse information and relations in social network can be modeled by state and edge feature functions in MCCRF, whose weights can be optimized globally. Thus both problems can be solved under this framework. In addition, We propose to utilize Markov chain Monte Carlo (MCMC) estimation methods to solve the difficulties in training and inference processes of MCCRF. Experimental results conducted on two real world data have demonstrated that our approach outperforms traditional CF algorithms. Additional experiments also show the improvements from the two factors of relational dependency and feature combination, respectively.
Enhancing recommender systems under volatile user interest drifts BIBAFull-Text 1257-1266
  Huanhuan Cao; Enhong Chen; Jie Yang; Hui Xiong
This paper presents a systematic study of how to enhance recommender systems under volatile user interest drifts. A key development challenge along this line is how to track user interests dynamically. To this end, we first define four types of interest patterns to understand users' rating behaviors and analyze the properties of these patterns. We also propose a rating graph and rating chain based approach for detecting these interest patterns. For each users' rating series, a rating graph and a rating chain are constructed based on the similarities between rated items. The type of a given user's interest pattern is identified through the density of the corresponding rating graph and the continuity of the corresponding rating chain. In addition, we propose a general algorithm framework for improving recommender systems by exploiting these identified patterns. Finally, experimental results on a real-world data set show that the proposed rating graph based approach is effective for detecting user interest patterns, which in turn help to improve the performance of recommender systems.

IR ranking and retrieval models II

A term dependency-based approach for query terms ranking BIBAFull-Text 1267-1276
  Chia-Jung Lee; Ruey-Cheng Chen; Shao-Hang Kao; Pu-Jen Cheng
Formulating appropriate and effective queries has been regarded as a challenging issue, since a large number of candidate words or phrases could be chosen as query terms to convey users' information needs. In this paper, we propose an approach to rank a set of given query terms according their effectiveness, wherein top ranked terms will be selected as an effective query. Our ranking approach exploits and benefits from the underlying relationship between the query terms, and thereby the effective terms can be properly combined into the query. Two regression models which capture a rich set of linguistic and statistical properties are used in our approach. Experiments on NTCIR-4 ad-hoc retrieval tasks demonstrate that the proposed approach can significantly improve retrieval performance, and can be well applied to other problems such as query expansion and querying by text segments.
Classification-based resource selection BIBAFull-Text 1277-1286
  Jaime Arguello; Jamie Callan; Fernando Diaz
In some retrieval situations, a system must search across multiple collections. This task, referred to as federated search, occurs for example when searching a distributed index or aggregating content for web search. Resource selection refers to the subtask of deciding, given a query, which collections to search. Most existing resource selection methods rely on evidence found in collection content. We present an approach to resource selection that combines multiple sources of evidence to inform the selection decision. We derive evidence from three different sources: collection documents, the topic of the query, and query click-through data. We combine this evidence by treating resource selection as a multiclass machine learning problem. Although machine learned approaches often require large amounts of manually generated training data, we present a method for using automatically generated training data. We make use of and compare against prior resource selection work and evaluate across three experimental testbeds.
Probabilistic models of ranking novel documents for faceted topic retrieval BIBAFull-Text 1287-1296
  Ben Carterette; Praveen Chandar
Traditional models of information retrieval assume documents are independently relevant. But when the goal is retrieving diverse or novel information about a topic, retrieval models need to capture dependencies between documents. Such tasks require alternative evaluation and optimization methods that operate on different types of relevance judgments. We define faceted topic retrieval as a particular novelty-driven task with the goal of finding a set of documents that cover the different facets of an information need. A faceted topic retrieval system must be able to cover as many facets as possible with the smallest number of documents. We introduce two novel models for faceted topic retrieval, one based on pruning a set of retrieved documents and one based on retrieving sets of documents through direct optimization of evaluation measures. We compare the performance of our models to MMR and the probabilistic model due to Zhai et al. on a set of 60 topics annotated with facets, showing that our models are competitive.
Retrieval experiments using pseudo-desktop collections BIBAFull-Text 1297-1306
  Jinyoung Kim; W. Bruce Croft
Desktop search is an important part of personal information management (PIM). However, research in this area has been limited by the lack of shareable test collections, making cumulative progress difficult. In this paper, we define desktop search as a semi-structured document retrieval problem and introduce a methodology to automatically build a reusable collection (the pseudo-desktop) that has many of the same properties as a real desktop collection. We then present a comprehensive evaluation of retrieval methods for semi-structured document retrieval on several pseudo-desktop collections and the TREC Enterprise collection. Our results show that a probabilistic retrieval model using the mapping relation between a query term and a document field (PRM-S) has the best performance in collections with more structure, such as email, and that the query-likelihood language model is better for other document types. We further analyze the observed differences using generated queries and suggest ways to improve PRM-S, which makes the performance gains more significant and consistent.
Incident threading for news passages BIBAFull-Text 1307-1316
  Ao Feng; James Allan
With an overwhelming volume of news reports currently available, there is an increasing need for automatic techniques to analyze and present news to a general reader in a meaningful and efficient manner. We explore incident threading as a possible solution to this problem. All text that describes the occurrence of a real-world happening is merged into a news incident, and incidents are organized in a network with dependencies of predefined types.
   Earlier attempts at this problem have assumed that a news story covers a single topic. We move beyond that limitation to introduce passage threading, which processes news at the passage level. First we develop a new testbed for this research and extend the evaluation methods to address new granularity issues. Then a three-stage algorithm is described that identifies on-subject passages, groups them into incidents, and establishes links between related incidents. Finally, we observe significant improvement over earlier work when we optimize the harmonic mean of the appropriate evaluation measures. The resulting performance exceeds the level that a calibration study shows is necessary to support a reading comprehension task.

KM classification and clustering II

Detection of orthogonal concepts in subspaces of high dimensional data BIBAFull-Text 1317-1326
  Stephan Günnemann; Emmanuel Müller; Ines Färber; Thomas Seidl
In the knowledge discovery process, clustering is an established technique for grouping objects based on mutual similarity. However, in today's applications for each object very many attributes are provided. As multiple concepts described by different attributes are mixed in the same data set, clusters do not appear in all dimensions. In these high dimensional data spaces, each object can be clustered in several projections of the data. However, recent clustering techniques do not succeed in detection of these orthogonal concepts hidden in the data. They either miss multiple concepts for each object by partitioning approaches or provide redundant clusters in very similar subspaces.
   In this work we propose a novel clustering method aiming only at orthogonal concept detection in subspaces of the data. Unlike existing clustering approaches, OSCLU (Orthogonal Subspace CLUstering) detects for each object the orthogonal concepts described by differing attributes while pruning similar concepts. Thus, each detected cluster in an orthogonal subspace provides novel information about the hidden structure of the data. Thorough experiments on real and synthetic data show that OSCLU yields substantial quality improvements over existing clustering approaches.
Large margin transductive transfer learning BIBAFull-Text 1327-1336
  Brian Quanz; Jun Huan
Recently there has been increasing interest in the problem of transfer learning, in which the typical assumption that training and testing data are drawn from identical distributions is relaxed. We specifically address the problem of transductive transfer learning in which we have access to labeled training data and unlabeled testing data potentially drawn from different, yet related distributions, and the goal is to leverage the labeled training data to learn a classifier to correctly predict data from the testing distribution.
   We have derived efficient algorithms for transductive transfer learning based on a novel viewpoint and the Support Vector Machine (SVM) paradigm, of a large margin hyperplane classifier in a feature space. We show that our method can out-perform some recent state-of-the-art approaches for transfer learning on several data sets, with the added benefits of model and data separation and the potential to leverage existing work on support vector machines.
Subspace maximum margin clustering BIBAFull-Text 1337-1346
  Quanquan Gu; Jie Zhou
In text mining, we are often confronted with very high dimensional data. Clustering with high dimensional data is a challenging problem due to the curse of dimensionality. In this paper, to address this problem, we propose an subspace maximum margin clustering (SMMC) method, which performs dimensionality reduction and maximum margin clustering simultaneously within a unified framework. We aim to learn a subspace, in which we try to find a cluster assignment of the data points, together with a hyperplane classifier, such that the resultant margin is maximized among all possible cluster assignments and all possible subspaces. The original problem is transformed from learning the subspace to learning a positive semi-definite matrix, in order to avoid tuning the dimensionality of the subspace. The transformed problem can be solved efficiently via cutting plane technique and constrained concave-convex procedure (CCCP). Since the sub-problem in each iteration of CCCP is joint convex, alternating minimization is adopted to obtain the global optimum. Experiments on benchmark data sets illustrate that the proposed method outperforms the state of the art clustering methods as well as many dimensionality reduction based clustering approaches.
A risk minimization framework for domain adaptation BIBAFull-Text 1347-1356
  Bo Long; Sudarshan Lamkhede; Srinivas Vadrevu; Ya Zhang; Belle Tseng
Supervised learning algorithms usually require high quality labeled training set of large volume. It is often expensive to obtain such labeled examples in every domain of an application. Domain adaptation aims to help in such cases by utilizing data available in related domains. However transferring knowledge from one domain to another is often non trivial due to different data distributions among the domains. Moreover, it is usually very hard to measure and formulate these distribution differences. Hence we introduce a new concept of label-relation function to transfer knowledge among different domains without explicitly formulating the data distribution differences. A novel learning framework, Domain Transfer Risk Minimization (DTRM), is proposed based on this concept. DTRM simultaneously minimizes the empirical risk for the target and the regularized empirical risk for source domain. Under this framework, we further derive a generic algorithm called Domain Adaptation by Label Relation (DALR) that is applicable to various applications in both classification and regression settings. DALR iteratively updates the target hypothesis function and outputs for the source domain until it converges. We provide an in-depth theoretical analysis of DTRM and establish fundamental error bounds. We also experimentally evaluate DALR on the task of ranking search results using real-world data. Our experimental results show that the proposed algorithm effectively and robustly utilizes data from source domains under various conditions: different sizes for source domain data; different noise levels for source domain data, and different difficulty levels for target domain data.

Industry call and web center, e-commerce related technologies

ExSearch: a novel vertical search engine for online barter business BIBAFull-Text 1357-1366
  Lei Ji; Jun Yan; Ning Liu; Wen Zhang; Weiguo Fan; Zheng Chen
E-Commerce has shown its exponentially-growing business value in the past decade. However, in contrast to the successful examples in online sales, such as Amazon1 and eBay2, the online barter business is still underexplored due to the lack of corresponding information aggregation service. In this paper, we design and implement a novel vertical search engine, called ExSearch, to aggregate online barter information for developing the barter market. Different from classical general purpose Web search engines, ExSearch adopts a focused crawler to gather related information from various websites. We propose to automatically extract the barter information from free-text Web pages such that the unstructured information is represented in structured databases. In addition, we utilize the data mining techniques such as regression to fulfill the missing information, which cannot be extracted from the Web pages. Finally, we validate and rank the search results according to user queries. Experimental results show that each component module in our proposed ExSearch system is efficient and effective. The volunteer users are satisfied by and interested in this novel vertical search engine.
iLoc: a framework for incremental location-state acquisition and prediction based on mobile sensors BIBAFull-Text 1367-1376
  Yiming Ma; Rich Hankins; David Racz
Much research focuses on predicting a person's geo-spatial traversal patterns using a history of recorded geo-coordinates. In this paper, we focus on the problem of predicting location-state transitions. Location-states for a user refer to a set of anchoring points/regions in space, and the prediction task produces a sequence of predicted location states for a given query time window. If this problem can be solved accurately and efficiently, it may lead to new location based services (LBS) that can smartly recommend information to a user based on his current and future location states. The proposed iLoc (Incremental (Location-State Acquisition and Prediction) framework solves the prediction problem by utilizing the sensor information provided by a user's mobile device. It incrementally learns the location states by constantly monitoring the signal environment of the mobile device. Further, the framework tightly integrates the learning and prediction modules, allowing iLoc to update location-states continuously and predict future location-states at the same time. Our extensive experiments show that the quality of the location-states learned by iLoc are better than the state-of-the-art. We also show that when other learners failed to produce reasonable predictions, iLoc provides good forecasts. As for the efficiency, iLoc processes the data in a single pass, which fits well to many data stream processing models.
Predicting the conversion probability for items on C2C ecommerce sites BIBAFull-Text 1377-1386
  Xiaoyuan Wu; Alvaro Bolivar
Online ecommerce has been booming for a decade. For instance, as the largest online C2C marketplace (eBay), millions of new items are listed daily. Due to the overwhelming number of items, the process of finding the right items to buy is sometimes daunting. In order to address this problem, this paper describes the idea of predicting the probability that a newly listed item will be sold successfully. And adjust the item exposure chances proportional according to their conversion possibility. Hence, by ranking higher items that users are likely to buy, the chance that users make the purchases could be increased as well as their user satisfaction. For catalog products that have been listed repeatedly, this probability can be measured empirically. However, on C2C sites like eBay, lots of items are not product-based. They are unique, and from different sellers. Therefore, in order to predict whether a new listing will be sold, we collect a large scale item set as the training data, and a set of features were used to model the average buyer shopping decision on C2C sites. Experimental results verified our system's feasibility and effectiveness.
Towards real-time measurement of customer satisfaction using automatically generated call transcripts BIBAFull-Text 1387-1396
  Youngja Park; Stephen C. Gates
Customer satisfaction is a very important indicator of how successful a contact center is at providing services to the customers. Contact centers typically conduct a manual survey with a randomly selected group of customers to measure customer satisfaction. Manual customer satisfaction surveys, however, provide limited values due to high cost and the time lapse between the service and the survey.
   In this paper, we demonstrate that it is possible to automatically measure customer satisfaction by analyzing call transcripts enabling companies to measure customer satisfaction for every call in near real-time. We have identified various features from multiple knowledge sources indicating prosodic, linguistic and behavioral aspects of the speakers, and built machine learning models that predict the degree of customer satisfaction with high accuracy. The machine learning algorithms used in this work include Decision Tree, Naive Bayes, Logistic Regression and Support Vector Machines (SVMs).
   Experiments were conducted for a 5-point satisfaction measurement and a 2-point satisfaction measurement using customer calls to an automotive company. The experimental results show that customer satisfaction can be measured quite accurately both at the end of calls and in the middle of calls. The best performing 5-point satisfaction classification yields an accuracy of 66.09% outperforming the DominantClass baseline by 15.16%. The best performing 2-point classification shows an accuracy of 89.42% and outperforms both the DominantClass baseline and the CSRJudgment baseline by 17.7% and 3.3% respectively. Furthermore, Decision Tree and SVMs achieve higher F-measure than the CSRJudgment baseline in identifying both satisfied customers and dissatisfied customers.
ROSE: retail outlet site evaluation by learning with both sample and feature preference BIBAFull-Text 1397-1404
  Bin Zhang; Ming Xie; Jinyan Shao; Wenjun Yin; Jin Dong
It is critical for retail enterprises to select good sites or locations to open their stores, especially in current competitive retail market. However, evaluating the goodness of sites in real business applications is a complex problem. That is, how to judge whether the market around a store site is good? We don't know the exact mechanism of how a site can be good and it is hard to have correct site goodness values as supervised labels. The Retail Outlet Site Evaluation (ROSE) tool is designed to learn the site evaluation model by integrating city geographic & demographic data and two kinds of expert knowledge: sample preference and feature preference. The feature preference information can help greatly reduce the required number of sample preferences. It enables our application practicable because it is almost impossible to give such amount of sample preference pairs manually by experts when ranking hundreds of data points. In the experiment and case study part, we show that the ROSE tool can achieve good results and useful for users to do site evaluation work in real cases.

Poster session 1: DB track

3se: a semi-structured search engine for heterogeneous data in graph model BIBAFull-Text 1405-1408
  Ming Zhong; Mengchi Liu
As the ubiquitous interplay of structured, semi-structured and unstructured data from different sources, neither DB-style structured query requiring knowledge of full schema and complex language, nor IR-style keyword search ignoring latent structures, can satisfy users. In this paper, we present a novel Semi-Structured Search Engine (3SE) that provides easy, flexible, precise and rapid access to heterogeneous data represented by a semi-structured graph model.
   By using an intuitive 3SE Query Language (3SQL), users are able to pose queries on heterogeneous data in a varying degree of structural constraint according to their knowledge of schema. 3SE evaluates 3SQL queries as the top-k answers composed of "logical units" and relationship paths between them, and thus can extract meaningful information even if the query conditions are vague, ambiguous, and inaccurate.
Minimal common container of tree patterns BIBAFull-Text 1409-1412
  Junhu Wang; Jeffrey Xu Yu; Chaoyi Pang; Chengfei Liu
Tree patterns represent important fragments of XPath. In this paper, we show that some classes of tree patterns exhibit such a property that, given a finite number of tree patterns P1, ..., Pn, there exists another pattern P (tree pattern or DAG-pattern) such that P1, ..., Pn, are all contained in P, and for any tree pattern Q belonging to a given class C, P1, ..., Pn, are contained in Q implies P is contained in Q.
Probabilistic moving range query over RFID spatio-temporal data streams BIBAFull-Text 1413-1416
  Yu Gu; Ge Yu; Na Guo; Yueguo Chen
Moving range query over RFID data streams is one of the most important spatio-temporal queries to support valuable information analysis. However, the location uncertainty challenges the query strategy. In this paper, we propose a probability evaluation model in the RFID-enabled monitoring environments and discuss the query optimization techniques under the scenarios of continuous moving range query, which can also be applied into more situations. The extensive experimental evaluation verifies the efficiency and effectiveness of our proposed model and methods.
Suffix trees for very large genomic sequences BIBAFull-Text 1417-1420
  Marina Barsky; Ulrike Stege; Alex Thomo; Chris Upton
A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale for large inputs. All the existing practical algorithms perform random access to the input string, thus requiring that the input be small enough to be kept in main memory.
   We are the first to present an algorithm which is able to construct suffix trees for input sequences significantly larger than the size of the available main memory. As a proof of concept, we show that our method allows to build the suffix tree for 12GB of real DNA sequences in 26 hours on a single machine with 2GB of RAM. This input is four times the size of the Human Genome, and the construction of suffix trees for inputs of such magnitude was never reported before.
Discovering matching dependencies BIBAFull-Text 1421-1424
  Shaoxu Song; Lei Chen
Matching dependencies (MDs) are recently proposed for various data quality applications such as detecting the violation of integrity constraints and duplicate object identification. In this paper, we study the problem of discovering matching dependencies for a given database instance. First, we formally define the measures, support and confidence, for evaluating the utility of MDs in the given database instance. Then, we study the discovery of MDs with certain utility requirements of support and confidence. Exact algorithms are developed, together with pruning strategies to improve the time performance. Finally, our experimental evaluation demonstrates the efficiency of the proposed methods.
Workload-aware trie indices for XML BIBAFull-Text 1425-1428
  Yuqing Wu; Sofia Brenes; Hyungdae Yi
Well-designed indices can dramatically improve query performance. Including query workload information can produce indices that yield better overall throughput while balancing the space and performance trade-off at the core of index design. In the context of XML, structural indices have proven to be particularly effective in supporting XPath queries by capturing the structural correlation between data components in an XML document. In this paper, we propose a family of novel workload-aware indices by taking advantage of the disk-based Ρ[k]-Trie index framework, which indexes node pairs of an XML document to facilitate index-only evaluation plans. Our indices are designed to be optimal for answering frequent path queries in one index lookup and efficient for answering non-frequent path queries using an index-only plan. Experimental results prove that our indices outperform the APEX index in overall throughput and excel in answering non-frequent queries, queries with predicates, and queries that yield empty results.
Rank-aware clustering of structured datasets BIBAFull-Text 1429-1432
  Julia Stoyanovich; Sihem Amer-Yahia
In online applications such as Yahoo! Personals and Yahoo! Real Estate users define structured profiles in order to find potentially interesting matches. Typically, profiles are evaluated against large datasets and produce thousands of matches. In addition to filtering, users also specify ranking in their profile, and matches are returned in a ranked list. Top results in a list are typically homogeneous, which hinders data exploration. For example, a user looking for 1- or 2-bedroom apartments sorted by price will see a large number of cheap 1-bedrooms in undesirable neighborhoods before seeing a different apartment. An alternative to ranking is to group matches on common attribute values, e.g., cheap 1-bedrooms in good neighborhoods, 2-bedrooms with 2 baths, and choose groups in relationship with ranking. In this paper, we present a novel paradigm of rank-aware clustering, and demonstrate its effectiveness on a large dataset from Yahoo! Personals, a leading online dating site.
Group-by skyline query processing in relational engines BIBAFull-Text 1433-1436
  Ming-Hay Luk; Man Lung Yiu; Eric Lo
The skyline operator was first proposed in 2001 for retrieving interesting tuples from a dataset. Since then, 100+ skyline-related papers have been published; however, we discovered that one of the most intuitive and practical type of skyline queries, namely, group-by skyline queries remains unaddressed. Group-by skyline queries find the skyline for each group of tuples. In this paper, we present a comprehensive study on processing group-by skyline queries in the context of relational engines. Specifically, we examine the composition of a query plan for a group-by skyline query and develop the missing cost model for the BBS algorithm. Experimental results show that our techniques are able to devise the best query plans for a variety of group-by skyline queries. Our focus is on algorithms that can be directly implemented in today's commercial database systems without the addition of new access methods (which would require addressing the associated challenges of maintenance with updates, concurrency control, etc.).
Supporting context-based query in personal DataSpace BIBAFull-Text 1437-1440
  Yukun Li; Xiaofeng Meng
Many users need to refer to content in existing files (pictures, tables, emails, web pages and etc.) when they write documents (programs, presentations, proposals and etc.), and often need to revisit these referenced files for review, revision or reconfirmation. Therefore it is meaningful to discover an approach to help users revisit these references effectively. Traditional approaches (file explorer, desktop search, and etc.) fail to work in this case. In this paper, we propose an efficient solution for this problem. We firstly define a new personal data relationship: Context-based Reference (CR), which is generated by user behaviors. We also propose efficient methods to identify CR relationship and present a new type of query based on it: Context-based Query (C-Query), which helps users efficiently revisit personal documents based on CR relationship. Our experiments validate the effectiveness and efficiency of our methods.
Walking in the crowd: anonymizing trajectory data for pattern analysis BIBAFull-Text 1441-1444
  Noman Mohammed; Benjamin C. M. Fung; Mourad Debbabi
Recently, trajectory data mining has received a lot of attention in both the industry and the academic research. In this paper, we study the privacy threats in trajectory data publishing and show that traditional anonymization methods are not applicable for trajectory data due to its challenging properties: high-dimensional, sparse, and sequential. Our primary contributions are (1) to propose a new privacy model called LKC-privacy that overcomes these challenges, and (2) to develop an efficient anonymization algorithm to achieve LKC-privacy while preserving the information utility for trajectory pattern mining.
Progressive skyline query evaluation and maintenance in wireless sensor networks BIBAFull-Text 1445-1448
  Baichen Chen; Weifa Liang; Jeffrey Xu Yu
Skyline query has been received much attention due to its wide application backgrounds for multi-preference and decision making. In this paper we consider skyline query evaluation and maintenance in wireless sensor networks. We devise an evaluation algorithm for finding skyline points progressively and a maintenance algorithm for skyline maintenance incrementally. We also conduct extensive experiments by simulations to evaluate the performance of the proposed algorithms on various datasets. The experimental results show that the proposed algorithms significantly outperform existing algorithms in terms of network lifetime prolongation.
Incremental similarity joins with edit distance constraints BIBAFull-Text 1449-1452
  Dongbo Dai; Gang Zhao
With the dynamic increase of string data and the need to integrate data from multiple data sources, a challenging issue is to perform similarity joins on dynamically-augmented string sets. Existing methods only exploit domain-oriented filters to speed up join processing on static datasets, which are inefficient for incremental data-generation scenarios.
   In this paper, an efficient approach called ISJ-ED (abbr for Incremental Similarity Joins with Edit Distance constraints) is proposed to tackle similarity join problem on ever-growing string sets. We first design a distance-based filtering technique which exploits an incrementally-built index to improve the filtering capability. Then, for the existent filters, we study the impact of their executing orders on total filtering cost and suggest dynamically-optimized filtering orders. All these optimization strategies work jointly with the existing domain-oriented filters in ISJ-ED, that is, they are complementary to those filter-based methods with edit-distance thresholds. Experimental results demonstrate that on dynamically augmented string sets, our method is more efficient than those only leverage domain-oriented filters with a fixed filtering order.
Structure-aware indexing for keyword search in databases BIBAFull-Text 1453-1456
  Guoliang Li; Jianhua Feng; Jianyong Wang
Most of existing methods of keyword search over relational databases find the Steiner trees composed of relevant tuples as the answers. They identify the Steiner trees by discovering the rich structural relationships between tuples, and neglect the fact that such structural relationships can be pre-computed and indexed. Tuple units that are composed of most relevant tuples are proposed to address this problem. Tuple units can be precomputed and indexed. Existing methods identify a single tuple unit to answer keyword queries. They, however, may involve false negatives as in many cases a single tuple unit cannot answer a keyword query. Instead, multiple tuple units should be integrated to answer keyword queries. To address this problem, in this paper, we study how to integrate multiple related tuple units to effectively answer keyword queries. We devise novel indices and incorporate the structural relationships between different tuple units into the indices. We use the indices to efficiently and progressively identify the top-k relevant answers. We have implemented our method in real database systems, and the experimental results show that our approach achieves high search efficiency and accuracy, and outperforms state-of-the-art methods significantly.
RS-Wrapper: random write optimization for solid state drive BIBAFull-Text 1457-1460
  Da Zhou; Xiaofeng Meng
Solid State Drive (SSD), emerging as new data storage media with high random read speed, has been widely used in laptops, desktops, and data servers to replace hard disk during the past few years. However, poor random write performance becomes the bottle neck in practice. In this paper, we propose to insert unmodified data into random write sequence in order to convert random writes into sequential writes, and thus data sequence can be flushed at the speed of sequential write. Further, we propose a clustering strategy to improve the performance by reducing quantity of unmodified data to read. After exploring the intrinsic parallelism of SSD, we also propose to flush write sequences with the help of the simultaneous program between planes and parallel program between devices for the first time. Comprehensive experiments show that our method outperform the existing random-write solution up to one order of magnitude improvement.
Label correspondence learning for part-of-speech annotation transformation BIBAFull-Text 1461-1464
  Muhua Zhu; Huizhen Wang; Jingbo Zhu
The performance of machine learning methods heavily depends on the volume of used training data. For the purpose of dataset enlargement, it is of interest to study the problem of unifying multiple labeled datasets with different annotation standards. In this paper, we focus on the case of unifying datasets for sequence labeling problems with natural language part-of-speech (POS) tagging as an examplar application. To this end, we propose a probabilistic approach to transforming the annotations of one dataset to the standard specified by another dataset. The key component of the approach, named as label correspondence learning, serves as a bridge of annotations from the datasets. Two methods designed from distinct perspectives are proposed to attack this sub-problem. Experiments on two large-scale part-of-speech datasets demonstrate the efficacy of the transformation and label correspondence learning methods.
Effective anonymization of query logs BIBAFull-Text 1465-1468
  Yuan Hong; Xiaoyun He; Jaideep Vaidya; Nabil Adam; Vijayalakshmi Atluri
User search query logs have proven to be very useful, but have vast potential for misuse. Several incidents have shown that simple removal of identifiers is insufficient to protect the identity of users. Publishing such inadequately anonymized data can cause severe breach of privacy. While significant effort has been expended on coming up with anonymity models and techniques for microdata, there is little corresponding work for query log data. Query logs are different in several important aspects, such as the diversity of queries and the causes of privacy breach. This necessitates the need to design privacy models and techniques specific to this environment. This paper takes a first cut at tackling this challenge. Our main contribution is to define effective anonymization models for query log data along with proposing techniques to achieve such anonymization. We analyze the inherent utility and privacy tradeoff, and experimentally validate the performance of our techniques.
A framework for safely publishing communication traces BIBAFull-Text 1469-1472
  Abhinav Parate; Gerome Miklau
A communication trace is a detailed record of the communication between two entities. Communication traces are vital for research in computer networks and study of network protocols in various domains, but their release is severely constrained by privacy and security concerns. In this paper, we propose a framework in which a trace owner can match an anonymizing transformation with the requirements of analysts. The trace owner can release multiple transformed traces, each customized to an analyst's needs, or a single transformation satisfying all requirements. The framework enables formal reasoning about anonymization policies, for example to verify that a given trace has utility for the analyst, or to obtain the most secure anonymization for the desired level of utility. Because communication traces are typically very large, we also provide techniques that allow efficient application of transformations using relational database systems.
Diverging patterns: discovering significant frequency change dissimilarities in large databases BIBAFull-Text 1473-1476
  Aijun An; Qian Wan; Jiashu Zhao; Xiangji Huang
In this paper, we present a framework for mining diverging patterns, a new type of contrast patterns whose frequency changes significantly differently in two data sets, e.g., it changes from a relatively low to a relatively high value in one dataset, but from high to low in the other. In this framework, a measure called diverging ratio is defined and used to discover diverging patterns. We use a four-dimensional vector to represent a pattern, and define the pattern's diverging ratio based on the angular difference between its vectors in two datasets. An algorithm is proposed to mine diverging patterns from a pair of datasets, which makes use of a standard frequent pattern mining algorithm to compute vector components efficiently. We demonstrate the effectiveness of our approach on real-world datasets, showing that the method can reveal novel knowledge from large databases.

Poster session 2: DB track

Matching stream patterns of various lengths and tolerances BIBAFull-Text 1477-1480
  Huanliang Sun; Ke Deng; Fanyu Meng; Junling Liu
Continuously identifying pre-defined patterns in a streaming time series has strong demand in various applications. While most existing works assume the patterns are in equal length and tolerance, this work focuses on the problem where the patterns have various lengths and tolerances, a common situation in the real world. The challenge of this problem roots on the strict space and time requirements of processing the arriving and expiring data in high-speed stream, combined with difficulty of coping with a large number of patterns with various lengths and tolerances. We introduce a novel concept of converging envelope which bounds the tolerance of a group of patterns in various tolerances and equal length and thus dramatically reduces the number of patterns for similarity computation. The basic idea of converging envelope has potential to more general index problems. To index patterns in various lengths and tolerances, we partition patterns into sub-patterns in equal length and an multi-tree index is developed in this paper.
Efficient processing of group-oriented connection queries in a large graph BIBAFull-Text 1481-1484
  James Cheng; Yiping Ke; Wilfred Ng
We study query processing in large graphs that are fundamental data model underpinning various social networks and Web structures. Given a set of query nodes, we aim to find the groups which the query nodes belong to, as well as the best connection among the groups. Such a query is useful to many applications but the query processing is extremely costly. We define a new notion of Correlation Group (CG), which is a set of nodes that are strongly correlated in a large graph G. We then extract the subgraph from G that gives the best connection for the nodes in a CG. To facilitate query processing, we develop an efficient index built upon the CGs. Our experiments show that the CGs are meaningful as groups and importantly, the meaningfulness of the query results are justifiable. We also demonstrate the high efficiency of CG computation, index construction and query processing.
Dynamic in-page logging for flash-aware B-tree index BIBAFull-Text 1485-1488
  Gap-Joo Na; Sang-Won Lee; Bongki Moon
This paper presents Dynamic IPL B+-tree (d-IPL in short) as a B+-tree index variant for flash-based storage systems. The d-IPL B+-tree adopts a dynamic In-Page Logging (IPL) scheme in order to address a few new problems that are caused by the unique characteristics of B+-tree indexes The d-IPL B+-tree avoids the frequent log overflow problem by allocating a log area in a flash block dynamically. It also addresses elegantly the problem of page evaporation, imposed by the contemporary NAND flash chips, by introducing ghost nodes within the context of the dynamic IPL scheme. This simple but elegant design of the d-IPL B+-tree improves the performance significantly. For a random insertion workload, the d-IPL B+-tree index outperformed a B+-tree with a plain IPL scheme by more than a factor of two in terms of page write and block erase operations.
Multidimensional routing indices for efficient distributed query processing BIBAFull-Text 1489-1492
  Christos Doulkeridis; Akrivi Vlachou; Kjetil Nørvåg; Yannis Kotidis; Michalis Vazirgiannis
Traditional routing indices in peer-to-peer (P2P) networks are mainly designed for document retrieval applications and maintain aggregated one-dimensional values representing the number of documents that can be obtained in a certain direction in the network. In this paper, we introduce the concept of multidimensional routing indices (MRIs), which are suitable for handling multidimensional data represented by minimum bounding regions (MBRs). Depending on data distribution on peers, the aggregation of the MBRs may lead to MRIs that exhibit extremely poor performance, which renders them ineffective. Thus, focusing on a hybrid unstructured P2P network, we analyze the parameters for building MRIs of high selectivity. We present techniques that boost the query routing performance by detecting similar peers and grouping and reassigning these peers to other parts of the hybrid network in a distributed and scalable way. We demonstrate the advantages of our approach using large-scale simulations.
Cluster based rank query over multidimensional data streams BIBAFull-Text 1493-1496
  Dengcheng He; Yongluan Zhou; Lidan Shou; Gang Chen
Many data stream monitoring applications involve rank queries and hence a number of efficient evaluation algorithms are proposed recently. Most of these techniques assume that rank queries are executed directly over the whole data space. However, we observe that many applications often require to perform clustering over the data streams before rank queries are run on each cluster.
   To address the problem, we propose a novel algorithm for integral clustering and ranking processing and we refer to such integrated queries as cluster-based rank queries. The algorithm includes two phases, namely the online phase which maintains the required data structures and statistics, and the query phase which uses these data structures to process queries. Extensive experiments indicate that the proposed algorithm is efficient in both space consumption and query processing.
Online anonymity for personalized web services BIBAFull-Text 1497-1500
  Yabo Xu; Ke Wang; Guoliang Yang; Ada W. C. Fu
To receive personalized web services, the user has to provide personal information and preferences, in addition to the query itself, to the web service. However, detailed personal information could identify the sender of sensitive queries, thus compromise user privacy. We propose the notion of online anonymity to enable users to issue personalized queries to an untrusted web service while with their anonymity preserved. The challenge for providing online anonymity is dealing with unknown and dynamic web users who can get online and offline at any time. We define this problem, discuss its implications and differences from the problems in the literature, and propose a solution.
Towards non-directional Xpath evaluation in a RDBMS BIBAFull-Text 1501-1504
  Sourav S. Bhowmick; Curtis Dyreson; Erwin Leonardi; Zhifeng Ng
XML query languages use directional path expressions to locate data in an XML data collection. They are tightly coupled to the structure of a data collection, and can fail when evaluated on the same data in a different structure. This paper extends path expressions with a new non-directional axis called the rank-distance axis. Given a context node and two positive integers α and β, the rank-distance axis returns those nodes that are ranked between α and β in terms of closeness from the context node in any direction. This paper shows how to evaluate the rank-distance axis in a tree-unaware XML database. A tree-unaware implementation does not invade the database kernel to support XML queries, instead it uses an existing RDBMS such as Microsoft's SQL server as a back-end and provides a front-end layer to translate XML queries to SQL. This paper presents an overview of an algorithm that translates queries with a rank-distance axis to SQL.
Semantic queries in databases: problems and challenges BIBAFull-Text 1505-1508
  Lipyeow Lim; Haixun Wang; Min Wang
Supporting semantic queries in relational databases is essential to many advanced applications. Recently, with the increasing use of ontology in various applications, the need for querying relational data together with its related ontology has become more urgent. In this paper, we identify and discuss the problem of querying relational data with its ontologies. Two fundamental challenges make the problem interesting. First, it is extremely difficult to express queries against graph structured ontology in the relational query language SQL, and second, in many cases where data and its related ontology are complicated, queries are usually not precise, that is, users often have only a vague notion, rather than a clear understanding and definition, of what they query for. We outline a query-by-example approach that enables us to support semantic queries in relational databases with ease. Instead of endeavoring to incorporate ontology into relational form and create new language constructs to express such queries, we ask the user to provide a small number of examples that satisfy the query she has in mind. Using these examples as seeds, the system infers the exact query automatically, and the user is therefore shielded from the complexity of interfacing with the ontology.
Inverted indexes vs. bitmap indexes in decision support systems BIBAFull-Text 1509-1512
  Truls A. Bjørklund; Nils Grimsmo; Johannes Gehrke; Øystein Torbjørnsen
Bitmap indexes are widely used in Decision Support Systems (DSSs) to improve query performance. In this paper, we evaluate the use of compressed inverted indexes with adapted query processing strategies from Information Retrieval as an alternative. In a thorough experimental evaluation on both synthetic data and data from the Star Schema Benchmark, we show that inverted indexes are more compact than bitmap indexes in almost all cases. This compactness combined with efficient query processing strategies results in inverted indexes outperforming bitmap indexes for most queries, often significantly.
Scalable indexing of RDF graphs for efficient join processing BIBAFull-Text 1513-1516
  George H. L. Fletcher; Peter W. Beck
Current approaches to RDF graph indexing suffer from weak data locality, i.e., information regarding a piece of data appears in multiple locations, spanning multiple data structures. Weak data locality negatively impacts storage and query processing costs. Towards stronger data locality, we propose a Three-way Triple Tree (TripleT) secondary memory indexing technique to facilitate flexible and efficient join evaluation on RDF data. The novelty of TripleT is that the index is built over the atoms occurring in the data set, rather than at a coarser granularity, such as whole triples occurring in the data set; and, the atoms are indexed regardless of the roles (i.e., subjects, predicates, or objects) they play in the triples of the data set. We show through extensive empirical evaluation that TripleT exhibits multiple orders of magnitude improvement over the state-of-the-art, in terms of both storage and query processing costs.
Privacy without noise BIBAFull-Text 1517-1520
  Yitao Duan
This paper presents several results on statistical database privacy. We first point out a serious vulnerability in a widely-accepted approach which perturbs query results with additive noise. We then show that for sum queries which aggregate across all records, when the dataset is sufficiently large, the inherent uncertainty associated with unknown quantities is enough to provide similar perturbation and the same privacy can be obtained without external noise. Sum query is a surprisingly general primitive supporting a large number of data mining algorithms such as SVD, PCA, k-means, ID3, SVM, EM, and all the algorithms in the statistical query model. We derive privacy conditions for sum queries and provide the first mathematical proof for the intuition that aggregates across a large number of individuals is private using a widely accepted notion of privacy. We also show how the results can be used to construct simulatable query auditing algorithms with stronger privacy.
Mining frequent itemsets in time-varying data streams BIBAFull-Text 1521-1524
  Yingying Tao; M. Tamer Özsu
Mining frequent itemsets in data streams is beneficial to many real-world applications but is also a challenging task since data streams are unbounded and have high arrival rates. Moreover, the distribution of data streams can change over time, which makes the task of maintaining frequent itemsets even harder. In this paper, we propose a false-negative oriented algorithm, called TWIM, that can find most of the frequent itemsets, detect distribution changes, and update the mining results accordingly. Experimental results show that our algorithm performs as good as other false-negative algorithms on data streams without distribution change, and has the ability to detect changes over time-varying data streams in -time with a high accuracy rate.
The gardener's problem for web information monitoring BIBAFull-Text 1525-1528
  Byron J. Gao; Mingji Xia; Walter Cai; David C. Anastasiu
We introduce and theoretically study the Gardener's problem that well models many web information monitoring scenarios, where numerous dynamically changing web sources are monitored and local information needs to be periodically updated under communication and computation capacity constraints. Typical such examples include maintenance of inverted indexes for search engines and maintenance of extracted structures for unstructured data management systems. We formulate a corresponding multicriteria optimization problem and propose heuristic solutions.
Extraction of a latent blog community based on subject BIBAFull-Text 1529-1532
  Seok-Ho Yoon; Jung-Hwan Shin; Sang-Wook Kim; Sunju Park
In the blogosphere, there exist posts relevant to a particular subject and blogs that show interests in the subject. In this paper, we define a set of such posts and blogs as "blog community" and propose a method for extracting the blog community associated with a particular subject. The proposed method is based on the idea that the blogs who have performed actions to the posts of a particular subject are the ones that have interests in the subject, and that the posts which have received actions from such blogs are the ones that contain the subject. The proposed method selects a small number of seed posts that contain the subject. Then, it selects the blogs that perform actions to the seed posts over some threshold and the posts that have received actions over some threshold. By repeating these two steps, it gradually expands the blog community. The experimental results show that the proposed method exhibits a higher level of accuracy than the methods proposed in prior research.
Context-sensitive document ranking BIBAFull-Text 1533-1536
  Lijun Chang; Jeffrey Xu Yu; Lu Qin
Ranking is a main research issue in IR-styled keyword search over a set of documents. In this paper, we study a new keyword search problem, called context-sensitive document ranking, which is to rank documents with an additional context that provides additional information about the application domain where the documents are to be searched and ranked. The work is motivated by the fact that additional information associated with the documents can possibly assist users to find more relevant documents when they are unable to find the needed documents from the documents alone. In this paper, a context is a multi-attribute graph, which can represent any information maintained in a relational database. The context-sensitive ranking is related to several research issues, how to score documents, how to evaluate the additional information obtained in the context that may contribute the document ranking, how to rank the documents by combining the scores/costs from the documents and the context. More importantly, the relationships between documents and the information stored in a relational database may be uncertain, because they are from different data sources and the relationships are determined systematically using similarity match which causes uncertainty. In this paper, we concentrate ourselves on these research issues, and provide our solution on how to rank the documents in a context where there exist uncertainty between the documents and the context. We confirm the effectiveness of our approaches by conducting extensive experimental studies using real datasets.
(Not) yet another matcher BIBAFull-Text 1537-1540
  Fabien Duchateau; Remi Coletta; Zohra Bellahsene; Renée J. Miller
Discovering correspondences between schema elements is a crucial task for data integration. Most schema matching tools are semi-automatic, e.g. an expert must tune some parameters (thresholds, weights, etc.). They mainly use several methods to combine and aggregate similarity measures. However, their quality results often decrease when one requires to integrate a new similarity measure or when matching particular domain schemas. This paper describes YAM (Yet Another Matcher), which is a schema matcher factory. Indeed, it enables the generation of a dedicated matcher for a given schema matching scenario, according to user inputs. Our approach is based on machine learning since schema matchers can be seen as classifiers. Several bunches of experiments run against matchers generated by YAM and traditional matching tools show how our approach is able to generate the best matcher for a given scenario.
Injecting purpose and trust into data anonymisation BIBAFull-Text 1541-1544
  Xiaoxun Sun; Hua Wang; Jiuyong Li
Most existing works of data anonymisation target at the optimization of the anonymisation metrics to balance the data utility and privacy, whereas they ignore the effects of a requester's trust level and application purposes during the data anonymisation. Our aim of this paper is to propose a much finer level anonymisation scheme with regard to the data requester's trust value and specific application purpose. We prioritize the attributes for anonymisation based on how important and critical they are related to the specified application purposes and propose a trust evaluation strategy to quantify the data requester's reliability, and further build the projection between the trust value and the degree of data anonymization, which intends to determine to what extent the data should be anonymized. The decomposition algorithm is developed to find the desired anonymous solution, which guarantees the uniqueness and correctness.
Exploit the tripartite network of social tagging for web clustering BIBAFull-Text 1545-1548
  Caimei Lu; Xin Chen; E. K. Park
In this poster, we investigate how to enhance web clustering by leveraging the tripartite network of social tagging systems. We propose a clustering method, called "Tripartite Clustering", which cluster the three types of nodes (resources, users and tags) simultaneously based on the links in the social tagging network. The proposed method is experimented on a real-world social tagging dataset sampled from del.icio.us. We also compare the proposed clustering approach with K-means. All the clustering results are evaluated against a human-maintained web directory. The experimental results show that Tripartite Clustering significantly outperforms the content-based K-means approach and achieves performance close to that of social annotation-based K-means whereas generating much more useful information.

Poster session 3: IR track

Multi-task learning for learning to rank in web search BIBAFull-Text 1549-1552
  Jing Bai; Ke Zhou; Guirong Xue; Hongyuan Zha; Gordon Sun; Belle Tseng; Zhaohui Zheng; Yi Chang
Both the quality and quantity of training data have significant impact on the performance of ranking functions in the context of learning to rank for web search. Due to resource constraints, training data for smaller search engine markets are scarce and we need to leverage existing training data from large markets to enhance the learning of ranking function for smaller markets. In this paper, we present a boosting framework for learning to rank in the multi-task learning context for this purpose. In particular, we propose to learn non-parametric common structures adaptively from multiple tasks in a stage-wise way. An algorithm is developed to iteratively discover super-features that are effective for all the tasks. The estimation of the functions for each task is then learned as a linear combination of those super-features. We evaluate the performance of this multi-task learning method for web search ranking using data from a search engine. Our results demonstrate that multi-task learning methods bring significant relevance improvements over existing baseline methods.
Text segmentation via topic modeling: an analytical study BIBAFull-Text 1553-1556
  Hemant Misra; François Yvon; Joemon M. Jose; Olivier Cappe
In this paper, the task of text segmentation is approached from a topic modeling perspective. We investigate the use of latent Dirichlet allocation (LDA) topic model to segment a text into semantically coherent segments. A major benefit of the proposed approach is that along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications like segment retrieval and discourse analysis. The new approach outperforms a standard baseline method and yields significantly better performance than most of the available unsupervised methods on a benchmark dataset.
iRANK: an interactive ranking framework and its application in query-focused summarization BIBAFull-Text 1557-1560
  Furu Wei; Wenjie Li; Wei Wang; Yanxiang He
We address the problem of unsupervised ensemble ranking in this paper. Traditional approaches either combine multiple ranking criteria into a unified representation to obtain an overall ranking score or to utilize certain rank fusion or aggregation techniques to combine the ranking results. Beyond the aforementioned combine-then-rank and rank-then-combine approaches, we propose a novel rank-learn-combine ranking framework, called Interactive Ranking (iRANK), which allows two base rankers to "teach" each other before combination during the ranking process by providing their own ranking results as feedback to the others so as to boost the ranking performance. This mutual ranking refinement process continues until the two base rankers cannot learn from each other any more. The overall performance is improved by the enhancement of the base rankers through the mutual learning mechanism. We apply this framework to the sentence ranking problem in query-focused summarization and evaluate its effectiveness on the DUC 2005 data set. The results are encouraging with consistent and promising improvements.
Adaptive web mining of bilingual lexicons for cross language information retrieval BIBAFull-Text 1561-1564
  Lei Shi
Bilingual web pages contain abundant term translation knowledge which is crucial for query translation in Cross Language Information Retrieval systems. But it is a challenging task to extract term translations from bilingual web pages due to the variation in web page layouts and writing styles. In this paper, based on the observation that translation pairs on the same web page tend to appear following similar patterns, a new extraction model is proposed to adaptively learn extraction patterns and exploit them to facilitate term translation mining from bilingual web pages. Experiments reflect that this model can significantly improve extraction coverage while maintaining high accuracy. It improves query translation in cross-language information retrieval, leading to significantly higher retrieval effectiveness on TREC collections.
Similarity-aware indexing for real-time entity resolution BIBAFull-Text 1565-1568
  Peter Christen; Ross Gayler; David Hawking
Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries.
   A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated.
   Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
Clustering queries for better document ranking BIBAFull-Text 1569-1572
  Yi Liu; Liangjie Zhang; Ruihua Song; Jian-Yun Nie; Ji-Rong Wen
Different queries require different ranking methods. It is however challenging to determine what queries are similar, and how to rank documents for them. In this paper, we propose a new method to cluster queries according to the similarity determined based on URLs in their answers. We then train specific ranking models for each query cluster. In addition, a cluster-specific measure of authority is defined to favor documents from authoritative websites on the corresponding topics. The proposed approach is tested using data from a search engine. It turns out that our proposed topic-dependent models can significantly improve the search results of eight most popular categories of queries.
Effective and efficient structured retrieval BIBAFull-Text 1573-1576
  Le Zhao; Jamie Callan
Search engines that support structured documents typically support structure created by the author (e.g., title, section), and may also support structure added by an annotation process (e.g., part of speech, named entity, semantic role). Exploiting such structure can be difficult. Query structure may fail to match structure in a relevant document for a variety of reasons, thus structured queries, although containing more information than keyword queries, are often less effective than unstructured queries. This paper studies retrieval of sentences with annotations for a question answering task. Three problems of structured retrieval are identified and solutions proposed. Structural mismatch is addressed by query structure expansion of predicted relevant structures. Lack of presence of all key aspects of a question is solved by Boolean filtering of result sentences. The score variations of the annotator generated fields with all the different lengths are accounted for by using field specific smoothing. Experiments show that each solution incrementally improves structured retrieval, and a combination of Boolean filtering, structural expansion, and keyword queries outperforms keyword and simple structured retrieval baselines.
Who tags the tags?: a framework for bookmark weighting BIBAFull-Text 1577-1580
  David Carmel; Haggai Roitman; Elad Yom-Tov
In this work we propose a novel framework for bookmark weighting which allows us to estimate the effectiveness of each of the bookmarks individually. We show that by weighting bookmarks according to their estimated quality we can significantly improve search effectiveness. Using empirical evaluation on real data gathered from two large bookmarking systems, we demonstrate the effectiveness of the new framework for search enhancement.
Web search result summarization: title selection algorithms and user satisfaction BIBAFull-Text 1581-1584
  Tapas Kanungo; Nadia Ghamrawi; Ki Yuen Kim; Lawrence Wai
Eye tracking experiments have shown that titles of Web search results play a crucial role in guiding a user's search process. We present a machine-learned algorithm that trains a boosted tree to pick the most relevant title for a Web search result. We compare two modeling approaches: i) using absolute editorial judgments and ii) using pairwise preference judgments. We find that the pairwise modeling approach gives better results in terms of three offline metrics. We present results of our models in four regions. We also describe a hybrid user satisfaction evaluation process -- search success -- that combines page relevance and user click behavior, and show that our machine-learned algorithm improves in search success.
Context sensitive synonym discovery for web search queries BIBAFull-Text 1585-1588
  Xing Wei; Fuchun Peng; Huihsin Tseng; Yumao Lu; Benoit Dumoulin
We propose a simple yet effective approach to context sensitive synonym discovery for Web search queries based on co-click analysis; i.e., analyzing queries leading to clicking same documents. In addition to deriving word based synonyms, we also derive concept based synonyms with the help of query segmentation. Evaluation results show that this approach dramatically outperforms the thesaurus based synonym replacement method in keeping search intent, from accuracy of 40% to above 80%.
Text summarization model based on the budgeted median problem BIBAFull-Text 1589-1592
  Hiroya Takamura; Manabu Okumura
We propose a multi-document generic summarization model based on the budgeted median problem. Our model selects sentences to generate a summary so that every sentence in the document cluster can be assigned to and be represented by a sentence in the summary as much as possible. The advantage of this model is that it covers the entire relevant part of the document cluster through sentence assignment and can incorporate asymmetric relations between sentences such as textual entailment.
Instance- and bag-level manifold regularization for aggregate outputs classification BIBAFull-Text 1593-1596
  Shuo Chen; Bin Liu; Mingjie Qian; Changshui Zhang
Aggregate outputs learning differs from the classical supervised learning setting in that, training samples are packed into bags with only the aggregate outputs (labels for classification or real values for regression) known. This setting of the problem is associated with several kinds of application background. We focus on the aggregate outputs classification problem in this paper, and set up a manifold regularization framework to deal with it. The framework can be of both instance level and bag level for different testing goals. We propose four concrete algorithms based on our framework, each of which can cope with both binary and multi-class scenarios. The experimental results on several datasets suggest that our algorithms outperform the state-of-art technique.
Utilizing inter-passage and inter-document similarities for re-ranking search results BIBAFull-Text 1597-1600
  Eyal Krikon; Oren Kurland; Michael Bendersky
We present a novel language-model-based approach to re-ranking an initially retrieved list so as to improve precision at top ranks. Our model integrates whole-document information with that induced from passages. Specifically, inter-passage, inter-document, and query-based similarities are integrated in our model. Empirical evaluation demonstrates the effectiveness of our approach.
On domain similarity and effectiveness of adapting-to-rank BIBAFull-Text 1601-1604
  Keke Chen; Jing Bai; Srihari Reddy; Belle Tseng
Adapting to rank address the problem of insufficient domain-specific labeled training data in learning to rank. However, the initial study shows that adaptation is not always effective. In this paper, we investigate the relationship between the domain similarity and the effectiveness of domain adaptation with the help of two domain similarity measure: relevance correlation and sample distribution correlation.
An effective model of using negative relevance feedback for information filtering BIBAFull-Text 1605-1608
  Abdulmohsen Algarni; Yuefeng Li; Yue Xu; Raymond Y. K. Lau
Over the years, people have often held the hypothesis that negative feedback should be very useful for largely improving the performance of information filtering systems; however, we have not obtained very effective models to support this hypothesis. This paper, proposes an effective model that use negative relevance feedback based on a pattern mining approach to improve extracted features. This study focuses on two main issues of using negative relevance feedback: the selection of constructive negative examples to reduce the space of negative examples; and the revision of existing features based on the selected negative examples. The former selects some offender documents, where offender documents are negative documents that are most likely to be classified in the positive group. The later groups the extracted features into three groups: the positive specific category, general category and negative specific category to easily update the weight. An iterative algorithm is also proposed to implement this approach on RCV1 data collections, and substantial experiments show that the proposed approach achieves encouraging performance.
Topic analysis for topic-focused multi-document summarization BIBAFull-Text 1609-1612
  Xiaojun Wan
Topic-focused multi-document summarization has been a challenging task because the created summary is required to be biased to the given topic or query. Existing methods consider the given topic as a single coarse unit and then directly incorporate the relevance between each sentence and the single topic into the sentence evaluation process. However, the given topic is usually not well-defined and it consists of a few explicit or implicit subtopics. In this study, the related subtopics are discovered from the topic's narrative text or document set through topic analysis techniques. Then, the sentence relationships against each subtopic are considered as an individual modality and the multi-modality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities. Experimental results on the DUC benchmark datasets show the promising results of our proposed methods.
MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching BIBAFull-Text 1613-1616
  Zhenjiang Lin; Michael R. Lyu; Irwin King
The problem of measuring similarity between web pages arises in many important Web applications, such as search engines and Web directories. In this paper, we propose a novel neighbor-based similarity measure called MatchSim, which uses only the neighborhood structure of web pages. Technically, MatchSim recursively defines similarity between web pages by the average similarity of the maximum matching between their neighbors. Our method extends the traditional methods which simply count the numbers of common and/or different neighbors. It also successfully overcomes a severe counterintuitive loophole in SimRank, due to its strict consistency with the intuitions of similarity. We give the computational complexity of MatchSim iteration. The accuracy of MatchSim is compared against others on two real datasets. The results show that our method performs best in most cases.
Using opinion-based features to boost sentence retrieval BIBAFull-Text 1617-1620
  Ronald T. Fernández; David E. Losada
Opinion mining has become recently a major research topic. A wide range of techniques have been proposed to enable opinion-oriented information seeking systems. However, little is known about the ability of opinion-related information to improve regular retrieval tasks. Our hypothesis is that standard retrieval methods might benefit from the inclusion of opinion-based features. A sentence retrieval scenario is a natural choice to evaluate this claim. We propose here a formal method to incorporate some opinion-based features of the sentences as query-independent evidence. We show that this incorporation leads to retrieval methods whose performance is significantly better than the performance of state of the art sentence retrieval models.
Exploring multimedia databases via optimization-based relevance feedback and the earth mover's distance BIBAFull-Text 1621-1624
  Marc Wichterich; Christian Beecks; Martin Sundermeyer; Thomas Seidl
Determining similar objects is a fundamental operation both in data mining tasks such as clustering and in query-driven object retrieval. By definition of similarity search, query objects can only be imprecise descriptions of what users are looking for in a database, and even high-quality similarity measures can only be approximations of the users' notion of similarity. To overcome these shortcomings, iterative query refinement systems have been proposed. They utilize user feedback regarding the relevance of intermediate results to adapt the query object and/or the similarity measure.
   We propose an optimization-based relevance feedback approach for adaptable distance measures -- focusing on the Earth Mover's Distance. Our technique enables quicker iterative database exploration as shown by our experiments.
A hybrid index structure for geo-textual searches BIBAFull-Text 1625-1628
  Richard Göbel; Andreas Henrich; Raik Niemann; Daniel Blank
The efficient execution of multi-criteria queries has gained increasing interest over the last years. In the present paper we propose an R-tree based approach for queries addressing textual as well as geographic filter conditions. Whereas most previous approaches use an index structure optimised for a single criterion adding special treatment for the other criterion at the leaf nodes or end points of this index structure, our approach uses a deeper integration. In short, R-trees are maintained for certain subsets of the whole term set. Furthermore, in each of these R-trees bit sets are used within the nodes to indicate whether entries for the terms associated with the single bits can be found in the corresponding sub-tree. Our index structure aims to be both, time and space efficient. The paper investigates the efficiency and applicability of the proposed index structure via practical experiments based on real-world and synthetic data.
Feature engineering on event-centric surrogate documents to improve search results BIBAFull-Text 1629-1632
  Wenhui Liao; Isabelle Moulinier
We investigate the task of re-ranking search results based on query log information. Prior work has considered this problem as either the task of learning document rankings of using features based on user behavior, or as the task of enhancing documents and queries using log data. Our contribution combines both. We distill log information into event-centric surrogate documents (ESDs), and extract features from these ESDs to be used in a learned ranking function. Our experiments on a legal corpus demonstrate that features engineered on surrogate documents lead to improved rankings, in particular when the original ranking is of poor quality.

Poster session 4: KM track

Clustering object moving patterns for prediction-based object tracking sensor networks BIBAFull-Text 1633-1636
  Chih-Chieh Hung; Wen-Chih Peng
Prior works have shown that probabilistic suffix trees (PST) could predict accurately the moving behaviors of objects for prediction-based object tracking sensor networks. However, maintaining PSTs for objects incurs a considerable amount of storage spaces for resource-constrained sensor nodes. In this paper, we derive a distance function between two PSTs and propose an algorithm to determine the similarity between them. By the distance between PSTs, we propose a clustering algorithm to partition objects with similar moving behaviors into groups. Furthermore, for each group, one PST is selected to predict movements of objects within one group. Experimental results show that our proposed approaches not only effectively reduce the storage cost but also provide good prediction accuracy.
Exploiting term relationship to boost text classification BIBAFull-Text 1637-1640
  Dou Shen; Jianmin Wu; Bin Cao; Jian-Tao Sun; Qiang Yang; Zheng Chen; Ying Li
Document classification provides an effective way to handle the explosive online textual data. However, in practical classification settings, we face the so-called feature sparsity problem caused by a lack of training documents or the shortness of text to be classified. In this paper, we solve the sparsity problem by exploiting term relationships along with Naive Bayes classifiers. The first method is to estimate term relationships based on the co-occurrence information of two terms in a certain context. The second method estimates the term relationships based on the distribution of terms over different hierarchical categories in a publicly available document taxonomy. Thereafter, term relationship is used to augment Naive Bayes classifiers. We test our methods on two open-domain data sets to demonstrate its advantages. The experimental results show that our method can significantly improve the classification performance, especially when we do not have enough training data or the texts are Web search queries.
Consistent on-line classification of dbs workload events BIBAFull-Text 1641-1644
  Marc Holze; Claas Gaidies; Norbert Ritter
An important goal of self-managing databases is the autonomic adaptation of the database configuration to evolving workloads. However, the diversity of SQL statements in real-world workloads typically causes the required analysis overhead to be prohibitive for a continuous workload analysis. The workload classification presented in this paper reduces the workload analysis overhead by grouping similar workload events into classes. Our approach employs clustering techniques based upon a general distance function for DBS workload events. To be applicable for a continuous workload analysis, our workload classification specifically addresses a stream-based, lightweight operation, a controllable loss of quality, and self-management.
Automatic web data extraction using tree alignment BIBAFull-Text 1645-1648
  Yingju Xia; Hao Yu; Shu Zhang
This paper investigates the automatic extraction of data from forums, blogs and news web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment to automatically extract data from these types of web pages. A new tree alignment algorithm is presented for determining the optimal matching structure of the input web pages. Based on the alignment, the trees are merged into one union tree whose nodes record statistical information obtained from multiple web pages. A heuristic method is employed for determining the most probable content block and the alignment algorithm detects repeating patterns on the union tree. A wrapper built on the most probable content block and the repeating patterns extracts data from web pages. Experimental results show that the method achieves high extraction accuracy and has steady performance.
LoOP: local outlier probabilities BIBAFull-Text 1649-1652
  Hans-Peter Kriegel; Peer Kröger; Erich Schubert; Arthur Zimek
Many outlier detection methods do not merely provide the decision for a single data object being or not being an outlier but give also an outlier score or "outlier factor" signaling "how much" the respective data object is an outlier. A major problem for any user not very acquainted with the outlier detection method in question is how to interpret this "factor" in order to decide for the numeric score again whether or not the data object indeed is an outlier. Here, we formulate a local density based outlier detection method providing an outlier "score" in the range of [0, 1] that is directly interpretable as a probability of a data object for being an outlier.
MING: mining informative entity relationship subgraphs BIBAFull-Text 1653-1656
  Gjergji Kasneci; Shady Elbassuoni; Gerhard Weikum
Many modern applications are faced with the task of knowledge discovery in entity-relationship graphs, such as domain-specific knowledge bases or social networks. Mining an "informative" subgraph that can explain the relations between k(>= 2) given entities of interest is a frequent knowledge discovery scenario on such graphs. We present MING, a principled method for extracting an informative subgraph for given query nodes. MING builds on a new notion of informativeness of nodes. This is used in a random-walk-with-restarts process to compute the informativeness of entire subgraphs.
Experiments on pattern-based relation learning BIBAFull-Text 1657-1660
  Willy Yap; Timothy Baldwin
Relation extraction is the task of extracting semantic relations -- such as synonymy or hypernymy -- between word pairs from corpus data. Past work in relation extraction has concentrated on manually creating templates to use in directly extracting word pairs for a given semantic relation from corpus text. Recently, there has been a move towards using machine learning to automatically learn these patterns. We build on this research by running experiments investigating the impact of corpus type, corpus size and different parameter settings on learning a range of lexical relations.
Identifying comparable entities on the web BIBAFull-Text 1661-1664
  Alpa Jain; Patrick Pantel
Web search engines are often presented with user queries that involve comparisons of real-world entities. Thus far, this interaction has typically been captured by users submitting appropriately designed keyword queries for which they are presented a list of relevant documents. Richer interactions that explicitly allow for a comparative analysis of entities represent a new potential direction to improve the search experience. With this in mind, we present an initial step of mining comparable entities from sources of information available to a large-scale Web search engine, namely, search query logs and documents from a Web crawl. Our mining methods generate a diverse set of comparables consisting of entities from a broad class of categories, such as medicines, appliances, electronics, and vacation destinations.
Efficient multi-class unlabeled constrained semi-supervised SVM BIBAFull-Text 1665-1668
  Mingjie Qian; Feiping Nie; Changshui Zhang
Semi-supervised learning has been successfully applied to many fields such as knowledge management, information retrieval and data mining as it can utilize both labeled and unlabeled data. In this paper, we propose a general semi-supervised framework for multi-class categorization. Many classical supervised and semi-supervised method dealing with binary classification or multi-class classification including the standard regularization and the manifold regularization can be viewed as special cases of this framework. Based on this framework, we propose a novel method called multi-class unlabeled constrained SVM (MCUCSVM) and its special case: multi-class Laplacian SVM (MCLapSVM). We then put forward a general kernel version semi-supervised dual coordinate descent algorithm to efficiently solve MCUCSVM and makes it more applicable to problems with large number of classes and large scale labeled data. Both rigorous theory and promising experimental results on four real datasets show the great performance and remarkable efficiency of MCUCSVM and MCLapSVM.
Modeling context-dependent information BIBAFull-Text 1669-1672
  Jie Hu; Mengchi Liu
Object properties are often based on their contexts, and contexts can be nested to form complex context-dependent information. Existing data models cannot naturally and directly represent such context-dependent information. In this paper, we propose a novel mechanism called context constructor in an object-oriented framework to solve this problem.
iPoG: fast interactive proximity querying on graphs BIBAFull-Text 1673-1676
  Hanghang Tong; Huiming Qu; Hani Jamjoom; Christos Faloutsos
Given an author-conference graph, how do we answer proximity queries (e.g., what are the most related conferences for John Smith?); how can we tailor the search result if the user provides additional yes/no type of feedback (e.g., what are the most related conferences for John Smith given that he does not like ICML?)? Given the potential computational complexity, we mainly devote ourselves to addressing the computational issues in this paper by proposing an efficient solution (referred to as iPoG-B) for bipartite graphs. Our experimental results show that the proposed fast solution (iPoGB) achieves significant speedup, while leading to the same ranking result.
Using domain ontology for semantic web usage mining and next page prediction BIBAFull-Text 1677-1680
  Nizar R. Mabroukeh; Christie I. Ezeife
This paper proposes the integration of semantic information drawn from a web application's domain knowledge into all phases of the web usage mining process (preprocessing, pattern discovery, and recommendation/prediction). The goal is to have an intelligent semantics-aware web usage mining framework. This is accomplished by using semantic information in the sequential pattern mining algorithm to prune the search space and partially relieve the algorithm from support counting. In addition, semantic information is used in the prediction phase with low order Markov models, for less space complexity and accurate prediction, that will help ambiguous predictions problem.
   Experimental results show that semantics-aware sequential pattern mining algorithms can perform 4 times faster than regular non-semantics-aware algorithms with only 26% of the memory requirement.
Using negative voting to diversify answers in non-factoid question answering BIBAFull-Text 1681-1684
  Palakorn Achananuparp; Christopher C. Yang; Xin Chen
We propose a ranking model to diversify answers of non-factoid questions based on an inverse notion of graph connectivity. By representing a collection of candidate answers as a graph, we posit that novelty, a measure of diversity, is inversely proportional to answer vertices' connectivity. Hence, unlike the typical graph ranking models, which score vertices based on the degree of connectedness, our method assigns a penalty score for a candidate answer if it is strongly connected to other answers. That is, any redundant answers, indicated by a higher inter-sentence similarity, will be ranked lower than those with lower inter-sentence similarity. At the end of the ranking iterations, many redundant answers will be moved toward the bottom on the ranked list. The experimental results show that our method helps diversify answer coverage of non-factoid questions according to F-scores from nugget pyramid evaluation.
A fast and simple method for extracting relevant content from news webpages BIBAFull-Text 1685-1688
  Eduardo Sany Laber; Críston Pereira de Souza; Iam Vita Jabour; Evelin Carvalho Freire de Amorim; Eduardo Teixeira Cardoso; Raúl Pierre Rentería; Lúcio Cunha Tinoco; Caio Dias Valentim
We propose NCE, an efficient algorithm to identify and extract relevant content from news webpages. We define relevant as the textual sections that more objectively describe the main event in the article. This includes the title and the main body section, and excludes comments about the story and presentation elements.
   Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature. It achieves F1 = 90.7% in our test corpus containing 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance. It is at least an order of magnitude faster than methods that use visual features. This characteristic is very suitable for applications that process a large number of pages.
Real-word spelling correction using Google web 1Tn-gram data set BIBAFull-Text 1689-1692
  Aminul Islam; Diana Inkpen
We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well.
Acronym extraction and disambiguation in large-scale organizational web pages BIBAFull-Text 1693-1696
  Shicong Feng; Yuhong Xiong; Conglei Yao; Liwei Zheng; Wei Liu
In this paper, we focus on the automatic extraction and disambiguation of acronyms in large-scale organizational web pages, which is important but difficult due to the diversity of acronyms and the scale of organizational web pages. We propose two novel algorithms to address the key problems in acronym extraction and disambiguation: (1) An unsupervised ranking algorithm to automatically filter out the incorrect acronym-expansion pairs. Different from the existing approaches, our method does not require any hand-crafted rules; (2) A graph-based algorithm to disambiguate ambiguous acronyms, which leverages the hyperlinks of pages to facilitate the acronym disambiguation. We evaluate the proposed approaches using two large-scale, real-world datasets in two different domains. Our experimental results show that our approach is domain independent, and achieves higher precision and recall than the existing methods.
Constrained multi-aspect expertise matching for committee review assignment BIBAFull-Text 1697-1700
  Maryam Karimzadehgan; ChengXiang Zhai
Automatic review assignment can significantly improve the productivity of many people such as conference organizers, journal editors and grant administrators. Most previous works have set the problem up as using a paper as a query to independently "retrieve" a set of reviewers that should review the paper. A more appropriate formulation of the problem would be to simultaneously optimize the assignments of all the papers to an entire committee of reviewers under constraints such as the review quota. In this paper, we solve the problem of committee review assignment with multi-aspect expertise matching by casting it as an integer linear programming problem. The proposed algorithm can naturally accommodate any probabilistic or deterministic method for modeling multiple aspects to automate committee review assignments. Evaluation using an existing data set shows that the proposed algorithm is effective for committee review assignments based on multi-aspect expertise matching.
Automatic link detection: a sequence labeling approach BIBAFull-Text 1701-1704
  James J. Gardner; Li Xiong
The popularity of Wikipedia and other online knowledge bases has recently produced an interest in the machine learning community for the problem of automatic linking. Automatic hyperlinking can be viewed as two sub problems -- link detection which determines the source of a link, and link disambiguation which determines the destination of a link. Wikipedia is a rich corpus with hyperlink data provided by authors. It is possible to use this data to train classifiers to be able to mimic the authors in some capacity. In this paper, we introduce automatic link detection as a sequence labeling problem. Conditional random fields (CRFs) are a probabilistic framework for labeling sequential data. We show that training a CRF with different types of features from the Wikipedia dataset can be used to automatically detect links with almost perfect precision and high recall.
MagicCube: choosing the best snippet for each aspect of an entity BIBAFull-Text 1705-1708
  Yexin Wang; Li Zhao; Yan Zhang
Wikis are currently used in business to provide knowledge management systems, especially for individual organizations. However, building wikis manually is a laborious and time-consuming work. To assist founding wikis, we propose a methodology in this paper to automatically select the best snippets for entities as their initial explanations. Our method consists of two steps. First, we focus on extracting snippets from a given set of web pages for each entity. Starting from a seed sentence, a snippet grows up by adding the most relevant neighboring sentences into itself. The sentences are chosen by the Snippet Growth Model, which employs a distance function and an influence function to make decisions. Secondly, we pick out the best snippet for each aspect of an entity. The combination of all the selected snippets serves as the primary description of the entity. We present three ever-increasing methods to handle selection process. Experimental results based on a real data set show that our proposed method works effectively in producing primary descriptions for entities such as employee names.
Mining and ranking streams of news stories using cross-stream sequential patterns BIBAFull-Text 1709-1712
  Robert Gwadera; Fabio Crestani
We present a new method for mining and ranking streams of news stories using cross-stream sequential patterns and content similarity. In particular, we focus on stories reporting the same event across the streams within a given time window, where an event is defined as a specific thing that happens at a specific time and place. For every discovered cluster of stories reporting the same event we create an itemset-sequence consisting of stream identifiers of the stories in the cluster, where the sequence is ordered according to the timestamps of the stories. Furthermore, we record exact timestamps and content similarities between the respective stories. Given such a collection of itemset-sequences we use it for two tasks: (I) to discover recurrent temporal publishing patterns between the news streams in terms of frequent sequential patterns and content similarity and (II) to rank the streams of news stories with respect to timeliness of reporting important events and content authority. We demonstrate the applicability of the presented method on a multi-stream of news stories was gathered from RSS feeds of major world news agencies.
Mining tourist information from user-supplied collections BIBAFull-Text 1713-1716
  Adrian Popescu; Gregory Grefenstette; Pierre-Alain Moëllic
Tourist photographs constitute a large part of the images uploaded to photo sharing platforms. But filtering methods are needed before one can extract useful knowledge from noisy user-supplied metadata. Here we show how to extract clean trip related information (what people visit, for how long, panoramic spots) from Flickr metadata. We illustrate our technique on a sample of metadata and images covering 183 cities of different size and from different parts of the world.
Cross-domain sentiment classification using a two-stage method BIBAFull-Text 1717-1720
  Kang Liu; Jun Zhao
In this paper, we give out a two-stage approach for domain adaptation problem in sentiment classification. In the first stage, based on our observation that customers often use different words to comment on the similar topics in the different domains, we regard these common topics as the bridge to link the different domain-specific features. We propose a novel topic model named Transfer-PLSA to extract the topic knowledge between different domains. Through these common topics, the features in the source domain are corresponded to the target features, so that those domain-specific knowledge can be transferred across different domains. In the second step, we use the classifier trained on the labeled examples in the source domain to pick up some informative examples in the target domain. Then we retrain the classifier on these selected examples, so that the classifier is adapted for the target domain. Experimental results on sentiment classification in four different domains indicate that our method outperforms other traditional methods.

Poster session 5: KM track

Kernel latent semantic analysis using an information retrieval based kernel BIBAFull-Text 1721-1724
  Laurence A. F. Park; Kotagiri Ramamohanarao
Hidden term relationships can be found within a document collection using Latent semantic analysis (LSA) and can be used to assist in information retrieval. LSA uses the inner product as its similarity function, which unfortunately introduces bias due to document length and term rarity into the term relationships. In this article, we present the novel kernel based LSA method, which uses separate document and query kernel functions to compute document and query similarities, rather than the inner product. We show that by providing an appropriate kernel function, we are able to provide a better fit of our data and hence produce more effective term relationships.
The impact of document structure on keyphrase extraction BIBAFull-Text 1725-1728
  Katja Hofmann; Manos Tsagkias; Edgar Meij; Maarten de Rijke
Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is a time-consuming process, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Document structure may contain useful information about which parts or phrases of a document are important, but has rarely been considered as a source of information for keyphrase extraction.
   We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new, large corpus that consists of full-text journal articles, where the rich collection and document structure available at the publishing stage is explicitly annotated.
   We explore features based on the XML tags contained in the documents, and based on generic section types derived using position and cue words in section titles. For XML tags we find sections, abstract, and title to perform best, but many smaller elements may be beneficial in combination with other features. Of the generic section types, the discussion section is found to be most useful for keyphrase extraction.
XCFS: an XML documents clustering approach using both the structure and the content BIBAFull-Text 1729-1732
  Sangeetha Kutty; Richi Nayak; Yuefeng Li
This paper introduces a clustering approach, XML Clustering using Frequent Substructures (XCFS) that considers both the structural and the content information of XML documents in clustering. XCFS uses frequent substructures in the form of a novel representation, Closed Frequent Embedded (CFE) subtrees to constrain the content in the clustering process. The empirical analysis ascertains that XCFS can effectively cluster even very large XML datasets and outperforms other existing methods.
Enhancing expertise retrieval using community-aware strategies BIBAFull-Text 1733-1736
  Hongbo Deng; Irwin King; Michael R. Lyu
Expertise retrieval has received increased interests in recent years, whose task is to suggest people with relevant expertise. Motivated by the observation that communities could provide valuable insight and distinctive information, we investigate two community-aware strategies to enhance expertise retrieval. We first propose a new smoothing method using the community context instead of the whole collection for statistical language model in the document-based model. Furthermore, a query-sensitive AuthorRank is proposed to model the authors' authorities according to the community co-authorship networks, and then an adaptive ranking refinement method is developed to further enhance expertise retrieval. Experimental results demonstrate the effectiveness and robustness of both community-aware strategies.
Combining labeled and unlabeled data with word-class distribution learning BIBAFull-Text 1737-1740
  Yanjun Qi; Ronan Collobert; Pavel Kuksa; Koray Kavukcuoglu; Jason Weston
We describe a novel simple and highly scalable semi-supervised method called Word-Class Distribution Learning (WCDL), and apply it task of information extraction (IE) by utilizing unlabeled sentences to improve supervised classification methods. WCDL iteratively builds class label distributions for each word in the dictionary by averaging predicted labels over all cases in the unlabeled corpus, and re-training a base classifier adding these distributions as word features. In contrast, traditional self-training or co-training methods self-labeled examples (rather than features) which can degrade performance due to incestuous learning bias. WCDL exhibits robust behavior, and has no difficult parameters to tune. We applied our method on German and English name entity recognition (NER) tasks. WCDL shows improvements over self-training, multi-task semi-supervision or supervision alone, in particular yielding a state-of-the art 75.72 F1 score on the German NER task.
Finding the topical anchors of a context using lexical cooccurrence data BIBAFull-Text 1741-1744
  Aditya Ramana Rachakonda; Srinath Srinivasa
Lexical cooccurrence in textual data is not uniformly random. The statistics inferred from the term-cooccurrence data enable us to model dependencies between terms as graphs, somewhat resembling the way semantic memory is organised in human beings. In this paper we look at cooccurrence patterns to identify topical anchors of a given context. Topical anchors are those terms whose semantics represent the topic of the whole context. This work is based on computing a stationary distribution in the cooccurrence graph. Topical anchors were computed on a set of 100 contexts and were also evaluated by 86 volunteers and the results show that the algorithm correctly identifies the topical anchors around 62% of the time.
Vetting the links of the web BIBAFull-Text 1745-1748
  Na Dai; Brian D. Davison
Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. It is a particular problem for large, well-known directories such as the dmoz Open Directory Project, which maintains links to representative and authoritative external web pages within their various topics. Therefore, such sites involve many editors to manually revisit and revise links that have become out-of-date. To remedy this situation, we propose the novel web mining task of identifying outdated links on the web. We build a general classification model, primarily using local and global temporal features extracted from historical content, topic, link and time-focused changes over time. We evaluate our system via five-fold cross-validation on more than fifteen thousand ODP external links selected from thirteen top-level categories. Our system can predict the actions of ODP editors more than 75% of the time. Our models and predictions could be useful for various applications that depend on analysis of web links, including ranking and crawling.
Building domain-oriented sentiment lexicon by improved information bottleneck BIBAFull-Text 1749-1752
  Weifu Du; Songbo Tan
This paper describes an adapted information bottleneck approach for construction of domain-oriented sentiment lexicon. The basic idea is to use three kinds of relationships (WWinter, WDinter and WDintra), to infer the semantic orientation of the out-of-domain words. The experimental results demonstrate that proposed method could dramatically improve the accuracy of the baseline approach on the construction of out-of-domain sentiment lexicon.
Agglomerating local patterns hierarchically with ALPHA BIBAFull-Text 1753-1756
  Loïc Cerf; Pierre-Nicolas Mougel; Jean-François Boulicaut
To increase the relevancy of local patterns discovered from noisy relations, it makes sense to formalize error-tolerance. Our starting point is to address the limitations of state-of-the-art methods for this purpose. Some extractors perform an exhaustive search w.r.t. a declarative specification of error-tolerance. Nevertheless, their computational complexity prevents the discovery of large relevant patterns. Alpha is a 3-step method that (1) computes complete collections of closed patterns, possibly error-tolerant ones, from arbitrary n-ary relations, (2) enlarges them by hierarchical agglomeration, and (3) selects the relevant agglomerated patterns.
Topic and keyword re-ranking for LDA-based topic modeling BIBAFull-Text 1757-1760
  Yangqiu Song; Shimei Pan; Shixia Liu; Michelle X. Zhou; Weihong Qian
Topic-based text summaries promise to help average users quickly understand a text collection and derive insights. Recent research has shown that the Latent Dirichlet Allocation (LDA) model is one of the most effective approaches to topic analysis. However, the LDA-based results may not be ideal for human understanding and consumption. In this paper, we present several topic and keyword re-ranking approaches that can help users better understand and consume the LDA-derived topics in their text analysis. Our methods process the LDA output based on a set of criteria that model a user's information needs. Our evaluation demonstrates the usefulness of the methods in summarizing several large-scale, real world data sets.
Spatio-temporal association rule mining framework for real-time sensor network applications BIBAFull-Text 1761-1764
  Hamed Chok; Le Gruenwald
In this paper, we present a data mining framework to estimate missing or corrupted data in sensor network applications -- a frequently occurring phenomenon in this domain. The framework is naturally germane to the spatio-temporal analysis of relational data stream evolution. Our method utilizes association rules to capture spatio-temporal correlations in multivariate, dynamically evolving, and unbounded sensor data streams. Existing approaches that tackled this problem do not account for the multi-dimensionality of the node data and their relationship; furthermore they entail simplistic and/or premature assumptions on the temporal and spatial factors to overcome the complexity of the streaming environment. Our technique, called Mining Autonomously Spatio-Temporal Environmental Rules (MASTER), comprehensively formulates the problem of mining patterns in sensor data streams, and yet remains provably adaptive to bounded time and space costs while probabilistically assuring a bounded estimation error. Simulation experiments show MASTER's efficiency in terms of overhead as well as the quality of estimation.
Predicting the volume of comments on online news stories BIBAFull-Text 1765-1768
  Manos Tsagkias; Wouter Weerkamp; Maarten de Rijke
On-line news agents provide commenting facilities for readers to express their views with regard to news stories. The number of user supplied comments on a news article may be indicative of its importance or impact. We report on exploratory work that predicts the comment volume of news articles prior to publication using five feature sets. We address the prediction task as a two stage classification task: a binary classification identifies articles with the potential to receive comments, and a second binary classification receives the output from the first step to label articles "low" or "high" comment volume. The results show solid performance for the former task, while performance degrades for the latter.
ComprehEnRank: estimating comprehension in classroom by absorbing random walks on a cognitive graph BIBAFull-Text 1769-1772
  Nimit Pattanasri; Masayuki Mukunoki; Michihiko Minoh
This paper develops a graph-theoretic framework for estimating comprehension in classroom. To deal with imprecise data gathered in classroom, we propose multi-step comprehension propagation over a semantic graph. Random walks on the graph measure students' comprehension with probabilities absorbed at student nodes.
Interpretable and reconfigurable clustering of document datasets by deriving word-based rules BIBAFull-Text 1773-1776
  Vipin Balachandran; Deepak P; Deepak Khemani
Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop a clustering algorithm toward the outlined goal of building interpretable and reconfigurable cluster models; it works by generating rules with disjunctions and conditions on the frequencies of words, to decide on the membership of a document to a cluster. Each cluster is comprised of precisely the set of documents that satisfy the corresponding rule. We show that our approach outperforms the unsupervised decision tree approach by huge margins. We show that the purity and f-measure losses to achieve interpretability are as little as 5% and 3% respectively using our approach.
CAOFES: an ontological framework for web service retrieval BIBAFull-Text 1777-1782
  Sourish Dasgupta; Satish Bhat; Yugyung Lee
Semantic web search involves retrieval of user-specific web artifacts by utilizing their semantic descriptions. A very specific web artifact that has evolved in recent times is web services. Web services can be semantically described in languages like OWL-S. However, such languages are limited with respect to their expressivity of context. They also lack a formal ontological framework where efficient web service retrieval can be conducted. In this paper we model a service request scenario within the web as an event-driven system. Services as well as user requests are modeled as events. We propose an ontological framework called Context-Aware Ontology Framework for Events and Services (CAOFES) where such event-driven web service retrieval can be efficiently executed by a novel reasoning technique.
Active learning in partially supervised classification BIBAFull-Text 1783-1786
  Priyanka Garg; S. Sundararajan
Positive Example based learners reduce human annotation effort significantly by removing the burden of labeling the negative examples. Various methods have been proposed in literature for building classifiers using positive and unlabeled examples. However, we empirically observe that classification accuracy of the state of the art methods degrades significantly as the number of labeled positive examples decreases. In this paper, we propose an active learning based method to address this issue. The proposed method learns starting from a handful of positively labeled examples and a large number of unlabeled examples. Experimental results on benchmark datasets show that the proposed method performs better than the state of the art methods when the percentage of labeled positive examples is small.
Identifying interesting assertions from the web BIBAFull-Text 1787-1790
  Thomas Lin; Oren Etzioni; James Fogarty
How can we cull the facts we need from the overwhelming mass of information and misinformation that is the Web? The TextRunner extraction engine represents one approach, in which people pose keyword queries or simple questions and TextRunner returns concise answers based on tuples extracted from Web text. Unfortunately, the results returned by engines such as TextRunner include both informative facts (e.g., the FDA banned ephedra) and less useful statements (e.g., the FDA banned products).
   This paper therefore investigates filtering TextRunner results to enable people to better focus on interesting assertions. We first develop three distinct models of what assertions are likely to be interesting in response to a query. We then fully operationalize each of these models as a filter over TextRunner results. Finally, we develop a more sophisticated filter that combines the different models using relevance feedback.
   In a study of human ratings of the interestingness of TextRunner assertions, we show that our approach substantially enhances the quality of TextRunner results. Our best filter raises the fraction of interesting results in the top thirty from 41.6% to 64.1%.
Opinion classification with tree kernel SVM using linguistic modality analysis BIBAFull-Text 1791-1794
  Takeshi S. Kobayakawa; Tadashi Kumano; Hideki Tanaka; Naoaki Okazaki; Jin-Dong Kim; Jun'ichi Tsujii
We propose a method for classifying opinions which captures the role of linguistic modalities in the sentence. We use features than simple bag-of-words or opinion-holding predicates. The method is based on a machine learning and utilizes opinion-holding predicates and linguistic modalities as features. Two different detectors help to classify the opinions: the opinion-holding predicate detector and the modality detector. An opinion in the target is first parsed into a dependency structure, and then the opinion-holding predicates and modalities stick onto the leaf nodes of the dependency tree. The whole tree is regarded as input features of the opinion, and it becomes the input of tree kernel support vector machines. We have applied method to opinions in Japanese about television programs, and have confirmed the effectiveness of the method against conventional bag-of-words features, or against simple opinion-holding predicates features.
Fragment-based clustering ensembles BIBAFull-Text 1795-1798
  Ou Wu; Mingliang Zhu; Weiming Hu
Clustering ensembles combine different clustering solutions into a single robust and stable one. Most of existing methods become highly time-consuming when the data size turns to large. In this paper, we study the properties of the defined 'clustering fragment' and put forward a useful proposition. Solid proofs are presented with two widely used goodness measures for clustering ensembles. Finally, a new ensemble framework termed as fragment-based clustering ensembles is proposed. Theoretically, most of existing methods can be improved by adopting this framework. To evaluate the proposed framework, three new methods are introduced by bring three popular clustering ensemble methods into our framework. The experimental results on several public data sets show that the three introduced methods are greatly improved in computational complexity and also achieved better or similar accurate results than the original methods.
Multi-aspect opinion polling from textual reviews BIBAFull-Text 1799-1802
  Jingbo Zhu; Huizhen Wang; Benjamin K. Tsou; Muhua Zhu
This paper presents an unsupervised approach to aspect-based opinion polling from raw textual reviews without explicit ratings. The key contribution of this paper is three-fold. First, a multi-aspect bootstrapping algorithm is proposed to learn from unlabeled data aspect-related terms of each aspect to be used for aspect identification. Second, an unsupervised segmentation model is proposed to address the challenge of identifying multiple single-aspect units in a multi-aspect sentence. Finally, an aspect-based opinion polling algorithm is presented. Experiments on real Chinese restaurant reviews show that our opinion polling method can achieve 75.5% precision performance.
Blogger-centric contextual advertising BIBAFull-Text 1803-1806
  Teng-Kai Fan; Chia-Hui Chang
This paper addresses the concept of Blogger-Centric Contextual Advertising, which refers to the assignment of personal ads to any blog page, chosen in according to bloggers' interests. As blogs become a platform for expressing personal opinions, they naturally contain various kinds of statements, including facts, comments and statements about personal interests, of both a positive and negative nature. To extend the concept behind the Long Tail theory in contextual advertising, we argue that web bloggers, as the constant visitors of their own blog sites, could be potential consumers who will respond to ads on their own blogs. Hence, in this paper, we propose using text mining techniques to discover bloggers' immediate personal interests in order to improve online contextual advertising. The proposed BCCA (Blogger-Centric Contextual Advertising) framework aims to combine contextual advertising matching with text mining in order to select ads that are related to personal interests as revealed in a blog and rank them according to their relevance. We validate our approach experimentally using a set of data that includes both real ads and actual blog pages. The results indicate that our proposed method could effectively identify those ads that are positively-correlated with a blogger's personal interests.
A co-classification framework for detecting web spam and spammers in social media web sites BIBAFull-Text 1807-1810
  Feilong Chen; Pang-Ning Tan; Anil K. Jain
Social media are becoming increasingly popular and have attracted considerable attention from spammers. Using a sample of more than ninety thousand known spam Web sites, we found between 7% to 18% of their URLs are posted on two popular social media Web sites, digg.com and delicious.com. In this paper, we present a co-classification framework to detect Web spam and the spammers who are responsible for posting them on the social media Web sites. The rationale for our approach is that since both detection tasks are related, it would be advantageous to train them simultaneously to make use of the labeled examples in the Web spam and spammer training data. We have evaluated the effectiveness of our algorithm on the delicious.com data set. Our experimental results showed that the proposed co-classification algorithm significantly outperforms classifiers that learn each detection task independently.

Poster session 6: IR track

A machine learning approach for improved BM25 retrieval BIBAFull-Text 1811-1814
  Krysta M. Svore; Christopher J. C. Burges
Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.
Incremental query evaluation for support vector machines BIBAFull-Text 1815-1818
  Danzhou Liu; Kien A. Hua
Support vector machines (SVMs) have been widely used in multimedia retrieval to learn a concept in order to find the best matches. In such a SVM active learning environment, the system first processes k sampling queries and top-k uncertain queries to select the candidate data items for training. The user's top-k relevant queries are then evaluated to compute the answer. This approach has shown to be effective. However, it suffers from the scalability problem associated with larger database sizes. To address this limitation, we propose an incremental query evaluation technique for these three types of queries. Based on the observation that most queries are not revised dramatically during the iterative evaluation, the proposed technique reuses the results of previous queries to reduce the computation cost. Furthermore, this technique takes advantage of a tuned index structure to efficiently prune irrelevant data. As a result, only a small portion of the data set needs to be accessed for query processing. This index structure also provides an inexpensive means to process the set of candidates to evaluate the final query result. This technique can work with different kernel functions and kernel parameters. Our experimental results indicate that the proposed technique significantly reduces the overall computation cost, and offers a promising solution to the scalability issue.
To obtain orthogonal feature extraction using training data selection BIBAFull-Text 1819-1822
  Ye Xu; Shen Furao; Jinxi Zhao; Osamu Hasegawa
Feature extraction is an effective tool in data mining and machine learning. Many feature extraction methods have been investigated recently. However, few methods can achieve orthogonal components. Non-orthogonal components distort the metric structure of original data space and contain reductant information. In this paper, we propose a feature extraction method, named as incremental orthogonal basis analysis (IOBA), to cope with the challenging endeavors. First, IOBA learns orthogonal components for original data, not only theoretically but also numerically. Second, an innovative way of training data selection is proposed. This selection scheme helps IOBA pick up numerically orthogonal components from training patterns. Third, by designing a self-adaptive threshold technique, no prior knowledge about the number of components is necessary to use IOBA. Moreover, without solving eigenvalue and eigenvector problems, IOBA not only saves large computing loads, but also avoids ill-conditioned problems. Results of experiments show the efficiency of the proposed IOBA.
User interests in social media sites: an exploration with micro-blogs BIBAFull-Text 1823-1826
  Nilanjan Banerjee; Dipanjan Chakraborty; Koustuv Dasgupta; Sumit Mittal; Anupam Joshi; Seema Nagar; Angshu Rai; Sameer Madan
Recent technological advances in mobile-based access to social networking platforms and facilities to update information in real{time (e.g. in Facebook) have allowed an individual's online presence to be as ephemeral and dynamic in nature, as her very thoughts and interests. In this context, micro-blogging has been widely adopted by users as an effective means to capture and disseminate their thoughts and actions to a larger audience on a daily basis. Interestingly, daily chatters of a user obtained from her micro-blogs offer a unique information source to analyze and interpret her context in real-time -- i.e. interests, intentions, and activities. In this paper, we gather data from the public timeline of Twitter spanning across ten worldwide cities over a period of four weeks. We use this dataset to (a) explore how users express interests in real-time through micro-blogs, and (b) understand how text mining techniques can be applied to interpret real-time context of a user based on her tweets. Initial findings reported herein suggest that social media sites like Twitter constitute a promising source for extracting user context that can be exploited by novel social networking applications.
The effect of negation on sentiment analysis and retrieval effectiveness BIBAFull-Text 1827-1830
  Lifeng Jia; Clement Yu; Weiyi Meng
We investigate the problem of determining the polarity of sentiments when one or more occurrences of a negation term such as "not" appear in a sentence. The concept of the scope of a negation term is introduced. By using a parse tree and typed dependencies generated by a parser and special rules proposed by us, we provide a procedure to identify the scope of each negation term. Experimental results show that the identification of the scope of negation improves both the accuracy of sentiment analysis and the retrieval effectiveness of opinion retrieval.
Dynamic hyperparameter optimization for Bayesian topical trend analysis BIBAFull-Text 1831-1834
  Tomonari Masada; Daiji Fukagawa; Atsuhiro Takasu; Tsuyoshi Hamada; Yuichiro Shibata; Kiyoshi Oguri
This paper presents a new Bayesian topical trend analysis. We regard the parameters of topic Dirichlet priors in latent Dirichlet allocation as a function of document timestamps and optimize the parameters by a gradient-based algorithm. Since our method gives similar hyperparameters to the documents having similar timestamps, topic assignment in collapsed Gibbs sampling is affected by timestamp similarities. We compute TFIDF-based document similarities by using a result of collapsed Gibbs sampling and evaluate our proposal by link detection task of Topic Detection and Tracking.
A general Markov framework for page importance computation BIBAFull-Text 1835-1838
  Bin Gao; Tie-Yan Liu; Zhiming Ma; Taifeng Wang; Hang Li
We propose a General Markov Framework for computing page importance. Under the framework, a Markov Skeleton Process is used to model the random walk conducted by the web surfer on a given graph. Page importance is then defined as the product of page reachability and page utility, which can be computed from the transition probability and the mean staying time of the pages in the Markov Skeleton Process respectively. We show that this general framework can cover many existing algorithms as its special cases, and that the framework can help us define new algorithms to handle more complex problems. In particular, we demonstrate the use of the framework with the exploitation of a new process named Mirror Semi-Markov Process. The experimental results validate that the Mirror Semi-Markov Process model is more effective than previous models in several tasks.
Exploiting bidirectional links: making spamming detection easier BIBAFull-Text 1839-1842
  Yan Zhang; Qiancheng Jiang; Lei Zhang; Yizhen Zhu
Previous anti-spamming algorithms based on link structure suffer from either the weakness of the page value metric or the vagueness of the seed selection. In this paper, we propose two page value metrics, AVRank and HVRank. These two "values" of all the web pages can be well assessed by using the bidirectional links' information. Moreover, with the help of bidirectional links, it becomes easier to enlarge the propagation coverage of seed sets. We further discuss the effectiveness of the combination of these two metrics, such as the quadratic mean of them. Our experimental results show that with such two metrics, our method can filter out spam sites and identify reputable ones more effectively than previous algorithms such as TrustRank.
What's behind topic formation and development: a perspective of community core groups BIBAFull-Text 1843-1846
  Tieyun Qian; Qing Li; Bing Liu; Hui Xiong; Jaideep Srivastava; Phillip Sheu
Over the past several years, there has been a great interest in topic detection and tracking (TDT). Recently, analyzing general research trend from the huge amount of history documents also arouses considerable attention. However, existing work on TDT mainly focuses on overall trend analysis, and is unable to address questions such as "what determines the evolution of a topic?" and "when and how does a new topic get formed?".
   In this paper, we propose a core group model to explain the dynamics and further segment topic development. According to the division phase and interphase in the life cycle of a core group, a topic is separated into four states, i.e. birth state, extending state, saturation state and shrinkage state. Experimental results on a real dataset show that the division of a core group brings on the generation of a new topic, and the progress of an entire topic is closely correlated to the growth of a core group during its interphase.
Exploring relevance for clicks BIBAFull-Text 1847-1850
  Rongwei Cen; Yiqun Liu; Min Zhang; Bo Zhou; Liyun Ru; Shaoping Ma
Mining feedback information from user click-through data is an important issue for modern Web retrieval systems in terms of architecture analysis, performance evaluation and algorithm optimization. For commercial search engines, user click-through data contains useful information as well as large amount of inevitable noises. This paper proposes an approach to recognize reliable and meaningful user clicks (referred to as Relevant Clicks, RCs) in click-through data. By modeling user click-through behavior on search result lists, we propose several features to separate RCs from click noises. A learning algorithm is presented to estimate the quality of user clicks. Experimental results on large scale dataset show that: 1) our model effectively identifies RCs in noisy click-through data; 2) Different from previous click-through analysis efforts, our approach works well for both hot queries and long-tail queries.
An efficient clustering algorithm for large-scale topical web pages BIBAFull-Text 1851-1854
  Lei Wang; Peng Chen; Lian'en Huang
The clustering of topic-related web pages has been recognized as a foundational work in exploiting large sets of web pages such as the cases in search engines and web archive systems, which collect and preserve billions of web pages. However, this task faces great challenges both in efficiency and accuracy. In this paper we present a novel clustering algorithm for large scale topical web pages which achieves high efficiency together with considerately high accuracy. In our algorithm, a two-phase divide and conquer framework is developed to solve the efficiency problem, in which both link analysis and content analysis are utilized in mining the topical similarity between pages to achieve a high accuracy. A comprehensive experiment was conducted to evaluate our method in terms of its effectiveness, efficiency, and quality of result.
HyperSum: hypergraph based semi-supervised sentence ranking for query-oriented summarization BIBAFull-Text 1855-1858
  Wei Wang; Furu Wei; Wenjie Li; Sujian Li
Graph based sentence ranking algorithms such as PageRank and HITS have been successfully used in query-oriented summarization. With these algorithms, the documents to be summarized are often modeled as a text graph where nodes represent sentences and edges represent pairwise similarity relationships between two sentences. A deficiency of conventional graph modeling is its incapability of naturally and effectively representing complex group relationships shared among multiple objects. Simply squeezing complex relationships into pairwise ones will inevitably lead to loss of information which can be useful for ranking and learning. In this paper, we propose to take advantage of hypergraph, i.e. a generalization of graph, to remedy this defect. In a text hypergraph, nodes still represent sentences, yet hyperedges are allowed to connect more than two sentences. With a text hypergraph, we are thus able to integrate both group relationships formulated among multiple sentences and pairwise relationships formulated between two sentences in a unified framework. As essential work, it is first addressed in the paper that how a text hypergraph can be built for summarization by applying clustering techniques. Then, a hypergraph based semi-supervised sentence ranking algorithm is developed for query-oriented extractive summarization, where the influence of query is propagated to sentences through the structure of the constructed text hypergraph. When evaluated on DUC data sets, performance of the proposed approach is remarkable.
Relying on topic subsets for system ranking estimation BIBAFull-Text 1859-1862
  Claudia Hauff; Djoerd Hiemstra; Franciska de Jong; Leif Azzopardi
Ranking a number of retrieval systems according to their retrieval effectiveness without relying on costly relevance judgments was first explored by Soboroff et al [6]. Over the years, a number of alternative approaches have been proposed. We perform a comprehensive analysis of system ranking estimation approaches on a wide variety of TREC test collections and topics sets. Our analysis reveals that the performance of such approaches is highly dependent upon the topic or topic subset, used for estimation. We hypothesize that the performance of system ranking estimation approaches can be improved by selecting the "right" subset of topics and show that using topic subsets improves the performance by 32% on average, with a maximum improvement of up to 70% in some cases.
Improving retrievability of patents with cluster-based pseudo-relevance feedback documents selection BIBAFull-Text 1863-1866
  Shariq Bashir; Andreas Rauber
High findability of documents within a certain cut-off rank is considered an important factor in recall-oriented application domains such as patent or legal document retrieval. Findability is hindered by two aspects, namely the inherent bias favoring some types of documents over others introduced by the retrieval model, and the failure to correctly capture and interpret the context of conventionally rather short queries. In this paper, we analyze the bias impact of different retrieval models and query expansion strategies. We furthermore propose a novel query expansion strategy based on document clustering to identify dominant relevant documents. This helps to overcome limitations of conventional query expansion strategies that suffer strongly from the noise introduced by imperfect initial query results for pseudo-relevance feedback documents selection. Experiments with different collections of patent documents suggest that clustering based document selection for pseudo-relevance feedback is an effective approach for increasing the findability of individual documents and decreasing the bias of a retrieval system.
Learning from past queries for resource selection BIBAFull-Text 1867-1870
  Suleyman Cetintas; Luo Si; Hao Yuan
Federated text search provides a unified search interface for multiple search engines of distributed text information sources. Resource selection is an important component for federated text search, which selects a small number of information sources that contain the largest number of relevant documents for a user query. Most prior research of resource selection focused on selecting information sources by analyzing static information of available information sources that is sampled in the offline manner. On the other hand, most prior research ignored a large amount of valuable information like the results from past queries.
   This paper proposes a new resource selection technique (which is called qSim) that utilizes the search results of past queries for estimating the utilities of available information sources for a specific user query. Experiment results demonstrate the effectiveness of the new resource selection algorithm.
Learning to rank graphs for online similar graph search BIBAFull-Text 1871-1874
  Bingjun Sun; Prasenjit Mitra; C. Lee Giles
Many applications in structure matching require the ability to search for graphs that are similar to a query graph, i.e., similarity graph queries. Prior works, especially in chemoinformatics, have used the maximum common edge subgraph (MCEG) to compute the graph similarity. This approach is prohibitively slow for real-time queries. In this work, we propose an algorithm that extracts and indexes subgraph features from a graph dataset. It computes the similarity of graphs using a linear graph kernel based on feature weights learned offline from a training set generated using MCEG. We show empirically that our proposed algorithm of learning to rank graphs can achieve higher normalized discounted cumulative gain compared with existing optimal methods based on MCEG. The running time of our algorithm is orders of magnitude faster than these existing methods.
Matching person names through name transformation BIBAFull-Text 1875-1878
  Jun Gong; Lidan Wang; Douglas W. Oard
Matching person names plays an important role in many applications, including bibliographic databases and indexing systems. Name variations and spelling errors make exact string matching problematic; therefore, it is useful to develop methodologies that can handle variant forms for the same named entity. In this paper, a novel person name matching model is presented. Common name variations in the English speaking world are formalized, and the concept of name transformation paths is introduced; name similarity is measured after the best transformation path has been selected. Supervised techniques are used to learn a similarity function and a decision rule. Experiments with three datasets show the method to be effective.
Learning to rank using evolutionary computation: immune programming or genetic programming? BIBAFull-Text 1879-1882
  Shuaiqiang Wang; Jun Ma; Jiming Liu
Nowadays ranking function discovery approaches using Evolutionary Computation (EC), especially Genetic Programming (GP), have become an important branch in the Learning to Rank for Information Retrieval (LR4IR) field. Inspired by the GP based learning to rank approaches, we provide a series of generalized definitions and a common framework for the application of EC in learning to rank research. Besides, according to the introduced framework, we propose RankIP, a ranking function discovery approach using Immune Programming (IP). Experimental results demonstrate that RankIP evidently outperforms the baselines.
   In addition, we study the differences between IP and GP in theory and experiments. Results show that IP is more suitable for LR4IR due to its high diversity.
To divide and conquer search ranking by learning query difficulty BIBAFull-Text 1883-1886
  Zeyuan Allen Zhu; Weizhu Chen; Tao Wan; Chenguang Zhu; Gang Wang; Zheng Chen
Learning to rank plays an important role in information retrieval. In most of the existing solutions for learning to rank, all the queries with their returned search results are learnt and ranked with a single model. In this paper, we demonstrate that it is highly beneficial to divide queries into multiple groups and conquer search ranking based on query difficulty. To this end, we propose a method which first characterizes a query using a variety of features extracted from user search behavior, such as the click entropy, the query reformulation probability. Next, a classification model is built on these extracted features to assign a score to represent how difficult a query is. Based on this score, our method automatically divides queries into groups, and trains a specific ranking model for each group to conquer search ranking. Experimental results on RankSVM and RankNet with a large-scale evaluation dataset show that the proposed method can achieve significant improvement in the task of web search ranking.
Maximal metric margin partitioning for similarity search indexes BIBAFull-Text 1887-1890
  Hisashi Kurasawa; Daiji Fukagawa; Atsuhiro Takasu; Jun Adachi
We propose a partitioning scheme for similarity search indexes that is called Maximal Metric Margin Partitioning (MMMP). MMMP divides the data on the basis of its distribution pattern, especially for the boundaries of clusters. A partitioning surface created by MMMP is likely to be at maximum distances from the two cluster boundaries. MMMP is the first similarity search index approach to focus on partitioning surfaces and data distribution patterns. We also present an indexing scheme, named the MMMP-Index, which uses MMMP and small ball partitioning. The MMMP-Index prunes many objects that are not relevant to a query, and it reduces the query execution cost. Our experimental results show that MMMP effectively indexes clustered data and reduces the search cost. For clustered vector data, the MMMP-Index reduces the computational cost to less than two thirds that of comparable schemes.

Poster session 7: IR track

What makes categories difficult to classify?: a study on predicting classification performance for categories BIBAFull-Text 1891-1894
  Aixin Sun; Ee-Peng Lim; Ying Liu
In this paper, we try to predict which category will be less accurately classified compared with other categories in a classification task that involves multiple categories. The categories with poor predicted performance will be identified before any classifiers are trained and additional steps can be taken to address the predicted poor accuracies of these categories. Inspired by the work on query performance prediction in ad-hoc retrieval, we propose to predict classification performance using two measures, namely, category size and category coherence. Our experiments on 20-Newsgroup and Reuters-21578 datasets show that the Spearman rank correlation coefficient between the predicted rank of classification performance and the expected classification accuracy is as high as 0.9.
A comparative study of methods for estimating query language models with pseudo feedback BIBAFull-Text 1895-1898
  Yuanhua Lv; ChengXiang Zhai
We systematically compare five representative state-of-the-art methods for estimating query language models with pseudo feedback in ad hoc information retrieval, including two variants of the relevance language model, two variants of the mixture feedback model, and the divergence minimization estimation method. Our experiment results show that a variant of relevance model and a variant of the mixture model tend to outperform other methods. We further propose several heuristics that are intuitively related to the good retrieval performance of an estimation method, and show that the variations in how these heuristics are implemented in different methods provide a good explanation of many empirical observations.
Translating relevance scores to probabilities for contextual advertising BIBAFull-Text 1899-1902
  Deepak Agarwal; Evgeniy Gabrilovich; Robert Hall; Vanja Josifovski; Rajiv Khanna
Information retrieval systems conventionally assess document relevance using the bag of words model. Consequently, relevance scores of documents retrieved for different queries are often difficult to compare, as they are computed on different (or even disjoint) sets of textual features. Many tasks, such as federation of search results or global thresholding of relevance scores, require that scores be globally comparable. To achieve this, in this paper we propose methods for non-monotonic transformation of relevance scores into probabilities for a contextual advertising selection engine that uses a vector space model. The calibration of the raw scores is based on historical click data.
A query model based on normalized log-likelihood BIBAFull-Text 1903-1906
  Edgar Meij; Wouter Weerkamp; Maarten de Rijke
Leveraging information from relevance assessments has been proposed as an effective means for improving retrieval. We introduce a novel language modeling method which uses information from each assessed document and their aggregate. While most previous approaches focus either on features of the entire set or on features of the individual relevant documents, our model exploits features of both the documents and the set as a whole. When evaluated, we show that our model is able to significantly improve over state-of-art feedback methods.
Online community search using thread structure BIBAFull-Text 1907-1910
  Jangwon Seo; W. Bruce Croft; David A. Smith
Online communities are valuable information sources where knowledge is accumulated by interactions between people. Search services provided by online community sites such as forums are often, however, quite poor. To address this, we investigate retrieval techniques that exploit the hierarchical thread structures in community sites. Since these structures are sometimes not explicit or accurately annotated, we use structure discovery techniques. We then make use of thread structures in retrieval experiments. Our results show that using thread structures that have been accurately annotated can lead to significant improvements in retrieval performance compared to strong baselines.
A word clustering approach for language model-based sentence retrieval in question answering systems BIBAFull-Text 1911-1914
  Saeedeh Momtazi; Dietrich Klakow
In this paper we propose a term clustering approach to improve the performance of sentence retrieval in Question Answering (QA) systems. As the search in question answering is conducted over smaller segments of data than in a document retrieval task, the problems of data sparsity and exact matching become more critical. In this paper we propose Language Modeling (LM) techniques to overcome such problems and improve the sentence retrieval performance.
   Our proposed methods include building class-based models by term clustering, and then employing higher order n-grams with the new class-based model. We report our experiments on the TREC 2007 questions from QA track. The results show that the methods investigated here enhanced the mean average precision of sentence retrieval from 23.62% to 29.91%.
Pure spreading activation is pointless BIBAFull-Text 1915-1918
  Michael R. Berthold; Ulrik Brandes; Tobias Kötter; Martin Mader; Uwe Nagel; Kilian Thiel
Almost every application of spreading activation is accompanied by its own set of often heuristic restrictions on the dynamics. We show that in constraint-free scenarios spreading activation would actually yield query-independent results, so that the specific choice of restrictions is not only a pragmatic computational issue, but crucially determines the outcome.
Collaborative resource discovery in social tagging systems BIBAFull-Text 1919-1922
  Bin Bi; Lifeng Shang; Ben Kao
Social tagging systems which allow users to create, edit and share collections of internet resources associated with tags in a collaborative fashion are growing in popularity in recent years. The rapidly growing amount of shared data in these folksonomies, i.e., taxonomies created by the folk, presents new technical challenges involved with discovering resources which are likely of interest to the user. Social tags which reflect the meaning of resources from the user's points of view provide an opportunity to enhance the quality of retrieval. In this paper, we introduce a novel framework to search relevant resources to the user query by incorporating information obtained from folksonomies' underlying data structures consisting of a set of user/tag/resource triplets. In contrast to traditional retrieval and recommendation techniques which represent a collection by a matrix, we represent our data as a third-order tensor on which a novel Cube Latent Semantic Indexing (CubeLSI) technique is proposed to capture latent semantic associations between tags. With the latent semantic representation we show how to rank relevant resources according to their relevance to user queries. The excellent performance of the method is demonstrated by an experimental evaluation on the deli.cio.us dataset.
Smoothing DCG for learning to rank: a novel approach using smoothed hinge functions BIBAFull-Text 1923-1926
  Mingrui Wu; Yi Chang; Zhaohui Zheng; Hongyuan Zha
Discounted cumulative gain (DCG) is widely used for evaluating ranking functions. It is therefore natural to learn a ranking function that directly optimizes DCG. However, DCG is non-smooth, rendering gradient-based optimization algorithms inapplicable. To remedy this, smoothed versions of DCG have been proposed but with only partial success. In this paper, we first present analysis that shows it is ineffective using the gradient of the smoothed DCG to drive the optimization algorithm. We then propose a novel approach, SHF-SDCG, for smoothing DCG by using smoothed hinge functions (SHF). It has the advantage of seamlessly transition from driving the optimization mimicking pairwise learning when the ranking function does not fit the data well, to driving the optimization using DCG when the ranking function becomes more accurate. SHF-SDCG is then extended to REG-SHF-SDCG, an algorithm which gradually transits from pointwise and pairwise to listwise learning. Finally experimental results are provided to validate the effectiveness of SHF-SDCG and REG-SHF-SDCG.
A collaborative filtering approach to ad recommendation using the query-ad click graph BIBAFull-Text 1927-1930
  Tasos Anastasakos; Dustin Hillard; Sanjay Kshetramade; Hema Raghavan
Search engine logs contain a large amount of click-through data that can be leveraged as soft indicators of relevance. In this paper we address the sponsored search retrieval problem which is to find and rank relevant ads to a search query. We propose a new technique to determine the relevance of an ad document for a search query using click-through data. The method builds on a collaborative filtering approach to discover new ads related to a query using a click graph. It is implemented on a graph with several million edges and scales to larger sizes easily. The proposed method is compared to three different baselines that are state-of-the-art for a commercial search engine. Evaluations on editorial data indicate that the model discovers many new ads not retrieved by the baseline methods. The ads from the new approach are on average of better quality than the baselines.
Pseudo relevance feedback using semantic clustering in relevance language model BIBAFull-Text 1931-1934
  Qiang Pu; Daqing He
Pseudo relevance feedback has demonstrated to be in general an effective technique for improving retrieval effectiveness, but the noise in the top retrieved documents still can cause topic drift problem that affects the performance of certain topics. By viewing a document as an interaction of a set of independent hidden topics, we propose a novel semantic clustering technique using independent component analysis. Then within the language modeling framework, we apply the obtained semantic topic clusters into the query sampling process so that the sampling depends on the activated topics rather than on the individual document language model. Therefore, we obtain a semantic cluster based relevance language model, which uses pseudo relevance feedback technique without requiring any relevance training information. We applied the model on five TREC data sets. The experiments show that our model can significantly improve retrieval performance over traditional language models including relevance-based and clustering-based retrieval language models. The main contribution of the improvements comes from the estimation of the relevance model on the semantic clusters that are closely related to the query.
A proactive personalised retrieval system BIBAFull-Text 1935-1938
  Desmond Elliott; Joemon M. Jose
We present a personalised retrieval system that captures explicit relevance feedback to build an evolving user profile with multiple aspects. The user profile is used to proactively retrieve results between search sessions to support multi-session search tasks. This approach to supporting users with their multi-session search tasks is evaluated in a between-subjects multiple time-series study with ten subjects performing two simulated work situation tasks over five sessions. System interaction data shows that subjects using the personalised retrieval system issue fewer queries and interact with fewer results than subjects using a baseline system. The interaction data also shows a trend of subjects interacting with the proactively retrieved results in the personalised retrieval system.
Data extraction from the web using wild card queries BIBAFull-Text 1939-1942
  Davood Rafiei; Haobin Li
This paper presents an overview of our work for searching and retrieving facts and relationships within natural language text sources. In this work, an extraction task over a text collection is expressed as a query that combines text fragments with wild cards, and the query result is a set of facts in the form of unary, binary and general n-ary tuples. Despite being both simple and declarative, the framework can be applied to a wide range of extraction tasks. This paper presents an overview of the work and its various components. We also report some of our experiments and an evaluation of the proposed querying framework in extracting relevant information to a task.
Smoothing document language model with local word graph BIBAFull-Text 1943-1946
  Yunping Huang; Le Sun; Jian-Yun Nie
Smoothing document model with word graph is a new and effective method in information retrieval. Word graph can naturally incorporate the dependency between the words; random walk algorithm based on the graph can be used to estimate the weight of each vertex. In this paper, we present a new way to construct a local word graph for smoothing document model, which exploits the document's k nearest neighbors: the vertices represent the words in the document and its k nearest neighbors, and the weights of the edges are estimated through word co-occurrence in the local document set. We argue that word graph is a key factor to the performance in graph-based smoothing method. By using the local document set, we can obtain a document specific word graph, and achieve better retrieval performance. Experimental results on three TREC collections show that our proposed approach is effective.
Aging effects on query flow graphs for query suggestion BIBAFull-Text 1947-1950
  Ranieri Baraglia; Carlos Castillo; Debora Donato; Franco Maria Nardini; Raffaele Perego; Fabrizio Silvestri
World Wide Web content continuously grows in size and importance. Furthermore, users ask Web search engines to satisfy increasingly disparate information needs. New techniques and tools are constantly developed aimed at assisting users in the interaction with the Web search engine. Query recommender systems suggesting interesting queries to users are an example of such tools. Most query recommendation techniques are based on the knowledge of the behaviors of past users of the search engine recorded in query logs.
   A recent query-log mining approach for query recommendation is based on Query Flow Graphs (QFG). In this paper we propose an evaluation of the effects of time on this query recommendation model. As users interests change over time, the knowledge extracted from query logs may suffer an aging effect as new interesting topics appear. In order to validate experimentally this hypothesis, we build different query flow graphs from the queries belonging to a large query log of a real-world search engine. Each query flow graph is built on distinct query log segments. Then, we generate recommendations on different sets of queries. Results are assessed both by means of human judgments and by using an automatic evaluator showing that the models inexorably age.
Exploiting query views for static index pruning in web search engines BIBAFull-Text 1951-1954
  Ismail Sengor Altingovde; Rifat Ozcan; Özgür Ulusoy
We propose incorporating query views in a number of static pruning strategies, namely term-centric, document-centric and access-based approaches. These query-view based strategies considerably outperform their counterparts for both disjunctive and conjunctive query processing in Web search engines.
Answer typing for information retrieval BIBAFull-Text 1955-1958
  Christopher Pinchak; Davood Rafiei; Dekang Lin
Answer typing is commonly thought of as finding appropriate responses to given questions. We extend the notion of answer typing to information retrieval to ensure results contain plausible answers to queries. Identification of a large class of applicable queries is performed using a discriminative classifier, and discriminative preference ranking methods are employed for the selection of type-appropriate terms. Experimental results show that type-appropriate terms identified by the model are superior to terms most commonly associated with the query, providing strong evidence that answer typing techniques can find meaningful and appropriate terms. Further experiments show that snippets containing correct answers are ranked higher by our model than by the baseline Google search engine in those instances in which a query does indeed seek a short answer.
Exploring path query results through relevance feedback BIBAFull-Text 1959-1962
  Huiping Cao; Yan Qi; K. Selçuk Candan; Maria Luisa Sapino
Feedback driven data exploration schemes have been implemented for non-structured data (such as text) and document-centric XML collections where formulating precise queries is often impossible. In this paper, we study the problem of enabling exploratory access, through ranking, to data-centric XML. Given a path query and a set of results identified by the system to this query over the data, we consider feedback which captures the user's preference for some features over the others. The feedback can be "positive" or "negative". To deal with feedback, we develop a probabilistic feature significance measure and describe how to use this for ranking results in the presence of dependencies between the path features. We bring together these techniques in AXP, a system for adaptive and exploratory path retrieval. The experimental results show the effectiveness of the proposed techniques.
Comparative document summarization via discriminative sentence selection BIBAFull-Text 1963-1966
  Dingding Wang; Shenghuo Zhu; Tao Li; Yihong Gong
Given a collection of document groups, a quick question is what are the differences in these groups. In this paper, we study a novel problem of summarizing the differences between document groups. A discriminative sentence selection method is proposed to extract the most discriminative sentences which represent the specific characteristics of each document group. Experiments on real world data sets demonstrate the effectiveness of our proposed method.
Graph-based seed selection for web-scale crawlers BIBAFull-Text 1967-1970
  Shuyi Zheng; Pavel Dmitriev; C. Lee Giles
One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is not a trivial but very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a repository with more "good" and less "bad" pages. We propose a graph-based framework for crawler seed selection, and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.
An improved feedback approach using relevant local posts for blog feed retrieval BIBAFull-Text 1971-1974
  Yeha Lee; Seung-Hoon Na; Jong-Hyeok Lee
Blog feed search aims to identify a blog feed with a recurring interest in a given topic. In this paper, we investigate the "pseudo-relevance feedback" for blog feed search task, where its unit of relevance judgment is not based on a blog post but a blog feed (the collection of all its constituent posts). This paper focuses on two characteristics of feed search task, blog feed's topical diversity and multifaceted property of query. We propose a novel feed-level selection of local posts which uses only highly relevant local posts in each top-ranked feed, in order to capture the correct and diverse relevant information to a given topic. Experimental results show that the proposed approach outperforms traditional feedback approaches. Especially, the proposed approach gives 2% further increase of nDCG over the best performing result of TREC '08 Blog Distillation Task.

Poster session 8: IR track

Retrieval constraints and word frequency distributions: a log-logistic model for IR BIBAFull-Text 1975-1978
  Stéphane Clinchant; Eric Gaussier
We first present in this paper an analytical view of heuristic retrieval constraints which yields simple tests to determine whether a retrieval function satisfies the constraints or not. We then review empirical findings on word frequency distributions and the central role played by burstiness in this context. This leads us to propose a formal definition of burstiness which can be used to characterize probability distributions wrt this phenomenon. We then introduce the family of information-based IR models which naturally captures heuristic retrieval constraints when the underlying probability distribution is bursty and propose a new IR model within this family, based on the log-logistic distribution. The experiments we conduct on three different collections illustrate the good behavior of the log-logistic IR model: it significantly outperforms the Jelinek-Mercer and Dirichlet prior language models on all three collections, with both short and long queries and for both the MAP and the precision at 10 documents. It also outperforms the InL2 DFR model for the MAP, and yields results on a par with it for the precision at 10.
A scalable and effective full-text search in P2P networks BIBAFull-Text 1979-1982
  Yosi Mass; Yehoshua Sagiv; Michal Shmueli-Scheuer
We consider the problem of full-text search involving multi-term queries in a network of self-organizing, autonomous peers. Existing approaches do not scale well with respect to the number of peers, because they either require access to a large number of peers or incur a high communication cost in order to achieve good query results. In this paper, we present a novel algorithmic framework for processing multi-term queries in P2P networks that achieves high recall while using (per-query) a small number of peers and a low communication cost, thereby enabling high query throughput. Our approach is based on per-query peer-selection strategy using two-dimensional histograms of score distributions. A full utilization of the histograms incurs a high communication cost. We show how to drastically reduce this cost by employing a two-phase peer-selection algorithm. We also describe an adaptive approach to peer selection that further increases the recall. Experiments on a large real-world collection show that the recall is indeed high while the number of involved peers and the communication cost are low.
The influence of the document ranking in expert search BIBAFull-Text 1983-1986
  Craig Macdonald; Iadh Ounis
The retrieval effectiveness of the underlying document search component of an expert search engine can have an important impact on the effectiveness of the generated expert search results. In this large-scale study, we perform novel experiments in the context of the document search and expert search tasks of the TREC Enterprise track, to measure the influence that the performance of the document ranking has on the ranking of candidate experts. In particular, we show, using real and simulated document rankings, that while the expert search system performance is related to the relevance of the retrieved documents, surprisingly, it is not always the case that increasing document search effectiveness causes an increase in expert search performance.
URL normalization for de-duplication of web pages BIBAFull-Text 1987-1990
  Amit Agarwal; Hema Swetha Koppula; Krishna P. Leela; Krishna Prasad Chitrapura; Sachin Garg; Pavan Kumar; Chittaranjan Haty; Anirban Roy; Amit Sasturkar
Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these learnt rules for de-duplication using just URL strings without fetching the content explicitly. Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract specific rules from URLs belonging to each cluster. Preserving each mined rules for de-duplication is not efficient due to the large number of specific rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint to be usable at web-scale. The rule extraction techniques are robust against web-site specific URL conventions. We demonstrate the effectiveness of our techniques through experimental evaluation.
An analysis framework for search sequences BIBAFull-Text 1991-1994
  Qiaozhu Mei; Kristina Klinkner; Ravi Kumar; Andrew Tomkins
In this paper we present a general framework to study sequences of search activities performed by a user. Our framework provides (i) a vocabulary to discuss types of features, models, and tasks, (ii) straightforward feature re-use across problems, (iii) realistic baselines for many sequence analysis tasks we study, and (iv) a simple mechanism to develop baselines for sequence analysis tasks beyond those studied in this paper. Using this framework we study a set of fourteen sequence analysis tasks with a range of features and models. While we show that most tasks benefit from features based on recent history, we also identify two categories of "sequence-resistant" tasks for which simple classes of local features perform as well as richer features and models.
Location cache for web queries BIBAFull-Text 1995-1998
  Mauricio Marin; Flavio Ferrarotti; Marcelo Mendoza; Carlos Gomez-Pantoja; Veronica Gil-Costa
This paper proposes a strategy to reduce the amount of hardware involved in the solution of search engine queries. It proposes using a secondary compact cache that keeps minimal information stored in the query receptionist machine to register the processors that must get involved in the solution of queries which are evicted from the standard result cache or are not admitted in it. This cache strategy produces exact answers by using very few processors.
A study of selective collection enrichment for enterprise search BIBAFull-Text 1999-2002
  Jie Peng; Craig Macdonald; Ben He; Iadh Ounis
Enterprise intranets are often sparse in nature, with limited use of alternative lexical representations between authors, making query expansion (QE) ineffective. Hence, for some enterprise search queries, it can be advantageous to instead use the well-known collection enrichment (CE) method to gather higher quality pseudo-feedback documents from a more diverse external resource. However, it is not always clear for which queries the collection enrichment technique should be applied. In this paper, we study two different approaches, namely a predictor-based approach and a divergence-based approach, to decide on when to apply CE. We thoroughly evaluate both approaches on the TREC Enterprise track CERC test collection and its corresponding topic sets, in combination with three different external resources and nine different query performance predictors. Our results show that both approaches are effective to selectively apply CE for enterprise search. In particular, the divergence-based approach leads to consistent and marked retrieval improvements over the systematic application of QE or CE on all external resources.
Generating synopses for document-element search BIBAFull-Text 2003-2006
  Sumit Bhatia; Shibamouli Lahiri; Prasenjit Mitra
Scientists often search for document-elements like tables, figures, or algorithm pseudo-codes. Domain scientists and researchers report important data, results and algorithms using these document-elements; readers want to compare the reported results with their findings. Some document-element search engines have been proposed (especially to search for tables and figures) to make this task easier. While searching for document-elements today, the end-user is presented with the caption of the document-element and a sentence in the document text that refers to the document-element. Oftentimes, the caption and the reference text do not contain enough information to interpret the document-element. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We also investigate the problem of choosing the optimum synopsis-size that strikes a balance between information content and size of the generated synopses.
Incorporating robustness into web ranking evaluation BIBAFull-Text 2007-2010
  Xin Li; Fan Li; Shihao Ji; Zhaohui Zheng; Yi Chang; Anlei Dong
In many Web search engines, a ranking function is selected for deployment mainly by comparing the relevance measurements over candidates. Due to the dynamical nature of the Web, the ranking features and the query and URL distribution on which the ranking functions are built, may change dramatically over time. The actual relevance of the function may degrade, and thus the previous function selection conclusions become invalid. In this work we suggest to select Web ranking functions according to both their relevance and robustness to the changes that may lead to relevance degradation over time. We argue that the ranking robustness can be effectively measured by taking into account the ranking score distribution across search results. We then propose two alternatives to the NDCG metric that both incorporate ranking robustness into ranking function evaluation and selection. A machine learning approach is developed to learn the parameters that control the metric sensitivity to score turbulence, from human-judged preference data.
Finding good feedback documents BIBAFull-Text 2011-2014
  Ben He; Iadh Ounis
Pseudo-relevance feedback finds useful expansion terms from a set of top-ranked documents. It is often crucial to identify those good feedback documents from which useful expansion terms can be added to the query. In this paper, we propose to detect good feedback documents by classifying all feedback documents using a variety of features such as the distribution of query terms in the feedback document, the similarity between a single feedback document and all top-ranked documents, or the proximity between the expansion terms and the original query terms in the feedback document. By doing this, query expansion is only performed using a selected set of feedback documents, which are predicted to be good among all top-ranked documents. Experimental results on standard TREC test data show that query expansion on the selected feedback documents achieves statistically significant improvements over a strong pseudo-relevance feedback mechanism, which expands the query using all top-ranked documents.
Ensembles in adversarial classification for spam BIBAFull-Text 2015-2018
  Deepak Chinavle; Pranam Kolari; Tim Oates; Tim Finin
The standard method for combating spam, either in email or on the web, is to train a classifier on manually labeled instances. As the spammers change their tactics, the performance of such classifiers tends to decrease over time. Gathering and labeling more data to periodically retrain the classifier is expensive. We present a method based on an ensemble of classifiers that can detect when its performance might be degrading and retrain itself, all without manual intervention. Experiments with a real-world dataset from the blog domain show that our methods can significantly reduce the number of times classifiers are retrained when compared to a fixed retraining schedule, and they maintain classification accuracy even in the absence of manually labeled examples.
Improving binary classification on text problems using differential word features BIBAFull-Text 2019-2024
  Justin Martineau; Tim Finin; Anupam Joshi; Shamit Patel
We describe an efficient technique to weigh word-based features in binary classification tasks and show that it significantly improves classification accuracy on a range of problems. The most common text classification approach uses a document's ngrams (words and short phrases) as its features and assigns feature values equal to their frequency or TFIDF score relative to the training corpus. Our approach uses values computed as the product of an ngram's document frequency and the difference of its inverse document frequencies in the positive and negative training sets. While this technique is remarkably easy to implement, it gives a statistically significant improvement over the standard bag-of-words approaches using support vector machines on a range of classification tasks. Our results show that our technique is robust and broadly applicable. We provide an analysis of why the approach works and how it can generalize to other domains and problems.
Feature selection for ranking using boosted trees BIBAFull-Text 2025-2028
  Feng Pan; Tim Converse; David Ahn; Franco Salvetti; Gianluca Donato
Modern search engines have to be fast to satisfy users, so there are hard back-end latency requirements. The set of features useful for search ranking functions, though, continues to grow, making feature computation a latency bottleneck. As a result, not all available features can be used for ranking, and in fact, much of the time, only a small percentage of these features can be used. Thus, it is crucial to have a feature selection mechanism that can find a subset of features that both meets latency requirements and achieves high relevance. To this end, we explore different feature selection methods using boosted regression trees, including both greedy approaches (selecting the features with highest relative importance as computed by boosted trees; discounting importance by feature similarity and a randomized approach. We evaluate and compare these approaches using data from a commercial search engine. The experimental results show that the proposed randomized feature selection with feature-importance-based backward elimination outperforms greedy approaches and achieves a comparable relevance with 30 features to a full-feature model trained with 419 features and the same modeling parameters.
Evaluation of methods for relative comparison of retrieval systems based on clickthroughs BIBAFull-Text 2029-2032
  Jing He; Chengxiang Zhai; Xiaoming Li
The Cranfield evaluation method has some disadvantages, including its high cost in labor and inadequacy for evaluating interactive retrieval techniques. As a very promising alternative, automatic comparison of retrieval systems based on observed clicking behavior of users has recently been studied. Several methods have been proposed, but there has so far been no systematic way to assess which strategy is better, making it difficult to choose a good method for real applications. In this paper, we propose a general way to evaluate these relative comparison methods with two measures: utility to users (UtU) and effectiveness of differentiation (EoD). We evaluate two state of the art methods by systematically simulating different retrieval scenarios. Inspired by the weakness of these methods revealed through our evaluation, we further propose a novel method by considering the positions of clicked documents. Experiment results show that our new method performs better than the existing methods.
Measuring system performance and topic discernment using generalized adaptive-weight mean BIBAFull-Text 2033-2036
  Chung Tong Lee; Vishwa Vinay; Eduarda Mendes Rodrigues; Gabriella Kazai; Nataša Milic-Frayling; Aleksandar Ignjatovic
Standard approaches to evaluating and comparing information retrieval systems compute simple averages of performance statistics across individual topics to measure the overall system performance. However, topics vary in their ability to differentiate among systems based on their retrieval performance. At the same time, systems that perform well on discriminative queries demonstrate notable qualities that should be reflected in the systems' evaluation and ranking. This motivated research on alternative performance measures that are sensitive to the discriminative value of topics and the performance consistency of systems. In this paper we provide a mathematical formulation of a performance measure that postulates the dependence between the system and topic characteristics. We propose the Generalized Adaptive-Weight Mean (GAWM) measure and show how it can be computed as a fixed point of a function for which the Brouwer Fixed Point Theorem applies. This guarantees the existence of a scoring scheme that satisfies the starting axioms and can be used for ranking of both systems and topics. We apply our method to TREC experiments and compare the GAWM with the standard averages used in TREC.
Automatic query generation for patent search BIBAFull-Text 2037-2040
  Xiaobing Xue; W. Bruce Croft
Patent search is the task of finding relevant existing patents, which is an important part of the patent's examiner's process of validating a patent application. In this paper, we studied how to transform a query patent (the application) into search queries. Three types of search features are explored for automatic query generation for patent search. Furthermore, different types of features are combined with a learning to rank method. Experiments based on a USPTO patent collection demonstrate that the single best search feature is the combination of words and noun-phrases from the summary field and the retrieval performance can be significantly improved by combining three types of search features.
Boosting KNN text classification accuracy by using supervised term weighting schemes BIBAFull-Text 2041-2044
  Iyad Batal; Milos Hauskrecht
The increasing availability of digital documents in the last decade has prompted the development of machine learning techniques to automatically classify and organize text documents. The majority of text classification systems rely on the vector space model, which represents the documents as vectors in the term space. Each vector component is assigned a weight that reflects the importance of the term in the document. Typically, these weights are assigned using an information retrieval (IR) approach, such as the famous tf-idf function. In this work, we study two weighting schemes based on information gain and chi-square statistics. These schemes take advantage of the category label information to weight the terms according to their distributions across the different categories. We show that using these supervised weights instead of conventional unsupervised weights can greatly improve the performance of the k-nearest neighbor (KNN) classifier. Experimental evaluations, carried out on multiple text classification tasks, demonstrate the benefits of this approach in creating accurate text classifiers.
Multidimensional political spectrum identification and analysis BIBAFull-Text 2045-2048
  Leilei Zhu; Prasenjit Mitra
In this work, we show the importance of multidimensional opinion representation in the political context combining domain knowledge and results from principal component analysis. We discuss the differences of feature selection between political spectrum analysis and normal opinion mining tasks. We build regression models on each opinion dimension for scoring and placing new opinion entities, e.g. personal blogs or politicians, onto the political opinion spectrum. We apply our methods on the floor statement records of the United States Senate and evaluate it against the uni-dimensional representation of political opinion space. The experimental results show the effectiveness of the proposed model in explaining the voting records of the Senate.
Automatic generation of topic pages using query-based aspect models BIBAFull-Text 2049-2052
  Niranjan Balasubramanian; Silviu Cucerzan
We investigate the automatic generation of topic pages as an alternative to the current Web search paradigm. We describe a general framework, which combines query log analysis to build aspect models, sentence selection methods for identifying relevant and non-redundant Web sentences, and a technique for sentence ordering. We evaluate our approach on biographical topics both automatically and manually, by using Wikipedia as reference.
Interactive relevance feedback with graded relevance and sentence extraction: simulated user experiments BIBAFull-Text 2053-2056
  Kalervo Järvelin
Research on relevance feedback (RFB) in information retrieval (IR) has given mixed results. Success in RFB seems to depend on the searcher's willingness to provide feedback and ability to identify relevant documents or query keys. The paper is based on simulating many user scenarios regarding the amount and quality of RFB. In addition, we experiment with query-biased sentence extraction for query reformulation. The baselines are initial no-feedback queries and queries based on pseudo-relevance feedback. The core question is: under which conditions would RFB based on sentence extraction be successful? The answer depends on user's behavior, implementation of feedback query formulation, and the evaluation methods. A small amount of feedback from a short browsing window seems to improve the final ranking the most. Longer browsing allows more feedback and better queries but also consumes the available relevant documents.
Easiest-first search: towards comprehension-based web search BIBAFull-Text 2057-2060
  Makoto Nakatani; Adam Jatowt; Katsumi Tanaka
Although Web search engines have become information gateways to the Internet, for queries containing technical terms, search results often contain pages that are difficult to be understood by non-expert users. Therefore, re-ranking search results in a descending order of their comprehensibility should be effective for non-expert users. In our approach, the comprehensibility of Web pages is estimated considering both the document readability and the difficulty of technical terms in the domain of search queries. To extract technical terms, we exploit the domain knowledge extracted from Wikipedia. Our proposed method can be applied to general Web search engines as Wikipedia includes nearly every field of human knowledge. We demonstrate the usefulness of our approach by user experiments.
Stochastic gradient boosted distributed decision trees BIBAFull-Text 2061-2064
  Jerry Ye; Jyh-Herng Chow; Jiang Chen; Zhaohui Zheng
Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most implementations today are computationally expensive and require all training data to be in main memory. As training data becomes ever larger, there is motivation for us to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive and various approaches have been explored in existing literature. Stochastic boosting on the other hand is inherently a sequential process and have not been applied to distributed decision trees. In this work, we present two different distributed methods that generates exact stochastic GBDT models, the first is a MapReduce implementation and the second utilizes MPI on the Hadoop grid environment.

Demo session 1: stream data management and OLAP

M-COPE: a multiple continuous query processing engine BIBAFull-Text 2065-2066
  Hong Kyu Park; Se Jung Shin; Sang Hyuck Na; Won Suk Lee
A data stream management system (DSMS) should support an efficient evaluation scheme for long-running continuous queries over infinite data streams. This demonstration presents a scalable query processing engine, M-COPE (Multiple Continuous Query Processing Engine) developed to evaluate multiple continuous queries efficiently. A multiple query optimization scheme implemented in the system generates a single network of operations as an execution plan for registered queries in order to maximize the reuse of the intermediate results of common sub-expressions in the queries adaptively. In this paper, we describe the overall architecture of M-COPE along with its special features. Network traffic flow streams are used to demonstrate the main features of M-COPE.
DS-Cuber: an integrated OLAP environment for data streams BIBAFull-Text 2067-2068
  Ho Jin Woo; Se Jung Shin; Woo Sock Yang; Won Suk Lee
Most of emerging applications deal with an infinite data stream in an incessant, immense and volatile manner. Consequently, it is very important to analyze not only the varying characteristics of a source data stream in a short-term period but also those in a long-term period. For this purpose, this paper demonstrates an OLAP system, DS-Cuber (Data Stream Cuber) for the analysis of data streams. The proposed system consists of two analytic components: short-term and long-term, so that it can provide an integrated analysis environment for infinite data streams. Furthermore, each of these two components supports diversified exception detection methods which can be used for the automatic identification of abnormality in the data elements of a data stream in order to guide the data cube navigation of a user effectively. Network traffic flow streams are used to demonstrate the features of the DS-Cube system.
A novel distributed P2P simulator architecture: D-P2P-sim BIBAFull-Text 2069-2070
  Spyros Sioutas; George Papaloukopoulos; Evangelos Sakkopoulos; Kostas Tsichlas; Yannis Manolopoulos
In this paper we introduce a novel distributed simulation environment with GUI for P2P simulations (D-P2P-Sim). The key aim is to provide the appropriate integrated set of tools in a single software solution to evaluate the performance of various protocols. The basic architecture of the distributed P2P simulator is based on a multi-threading, asynchronous, message passing and distributed environment with graphical user interface to facilitate ease of use by both researchers and programmers.
Demonstration of an RFID middleware: LIT ALE manager BIBAFull-Text 2071-2072
  Qiang Wang; Wooseok Ryu; Soohan Kim; Bonghee Hong
The sharp increase of RFID tags and readers require dedicated middleware solutions that manage readers and process event data. In this paper, we demonstrate a RFID middleware called LIT ALE Manager system with key features and also illustrate overall functional components of the middleware system with implementation techniques.
OLAP with UDFs in digital libraries BIBAFull-Text 2073-2074
  Carlos Garcia-Alvarado; Zhibo Chen; Carlos Ordonez
Queries on digital libraries generally involve the retrieval of specific documents, but most techniques lack the ability to efficiently explore these collections. The integration of OLAP techniques with digital libraries allows users to navigate throughout these collections on multiple levels. In order to accomplish this, we propose the creation of OLAP networks, a complex data structure that contains summarized representations of the original collection of metadata to enrich traditional retrievals and allow the users to quickly explore the collection. We developed a system that enables OLAP-based exploration on the metadata of digital libraries through the use of a combination of efficient UDFs and optimized SQL queries. In addition, we also incorporated visualization methods into our system to allow fast navigation and exploration.
HDDBrs middleware for implementing highly available distributed databases BIBAFull-Text 2075-2076
  Rim Moussa
Our demo presents HDDBRS, a middle tier offering to clients a highly available distributed database interface using Reed Solomon codes to compute parity data. Parity data is stored in dedicated parity DB backends, is synchronously updated and allows recovering from multiple DB backend unavailability. HDDBRS middle tier is implemented in JAVA using standard technology, and is designed to be interoperable with any database engine that provides a JDBC driver and implements X/open XA protocol.

Demo session 2: semantic web, information extraction & knowledge management

OfCourse: web content discovery, classification and information extraction for online course materials BIBAFull-Text 2077-2078
  Yuhong Xiong; Ping Luo; Yong Zhao; Fen Lin; Shicong Feng; Baoyao Zhou; Liwei Zheng
In this paper we present OfCourse, a vertical search engine for online course materials. These materials have the following characteristics: they are scattered very sparsely in the university Web sites; and are generated by the teachers with totally different HMTL templates and layouts. These characteristics impose some challenges for Web Classification (to identify the course materials) and Web Information Extraction (to extract course metadata, such as course title, time and ID) from the identified course homepages. Here, we describe our proposed method to tackle these challenges, and the features of this system. OfCourse, containing over 60,000 courses from the top 50 universities in the US, is currently available for public access at http://fusion.hpl.hp.com/OfCourse/.
YAM: a schema matcher factory BIBAFull-Text 2079-2080
  Fabien Duchateau; Remi Coletta; Zohra Bellahsene; Renée J. Miller
In this paper, we present YAM, a schema matcher factory. YAM (Yet Another Matcher) is not (yet) another schema matching system as it enables the generation of a la carte schema matchers according to user requirements. These requirements include a preference for recall or precision, a training data set (schemas already matched) and provided expert correspondences. YAM uses a knowledge base that includes a (possibly large) set of similarity measures and classifiers. Based on the user requirements, YAM learns how to best apply these tools (similarity measures and classifiers) in concert to achieve the best matching quality. In our demonstration, we will let users apply YAM to build the best schema matcher for different user requirements.
VRIFA: a nonlinear SVM visualization tool using nomogram and localized radial basis function (LRBF) kernels BIBAFull-Text 2081-2082
  Ngo Anh Vien; Nguyen Hoang Viet; TaeChoong Chung; Hwanjo Yu; Sungchul Kim; Baek Hwan Cho
Prediction problems are prevalent in medical domains. For example, computer-aided diagnosis or prognosis is a key component in a CDSS (Clinical Decision Support System). SVMs, especially SVMs with nonlinear kernels such RBF kernels, have shown superior accuracy in prediction problems. However, they are not favorably used by physicians for medical prediction problems because nonlinear SVMs are difficult to visualize, thus it is hard to provide intuitive interpretation of prediction results to physicians. Nomogram was proposed to visualize SVM classification models. However, it cannot visualize nonlinear SVM models. Localized RBF (LRBF) kernel was proposed which shows comparable accuracy as the RBF kernel while the LRBF kernel is easier to interpret since it can be linearly decomposed. This paper presents a new tool named VRIFA, which integrates the nomogram and LRBF kernel to provide users with an interactive visualization of nonlinear SVM models. VRIFA graphically exposes the internal structure of nonlinear SVM models showing the effect of each feature, the magnitude of the effect, and the change at the prediction output. VRIFA also performs nomogram-based feature selection while training a model in order to remove noise or redundant features and improve the prediction accuracy. The tool has been used by biomedical researchers for computer-aided diagnosis and risk factor analysis for diseases. VRIFA is accessible at http://dm.postech.ac.kr/vrifa.
LuposDate: a semantic web database system BIBAFull-Text 2083-2084
  Jinghua Groppe; Sven Groppe; Andreas Schleifer; Volker Linnemann
Managing and querying Semantic Web are important issues for Semantic Web applications. Therefore, we have developed a Semantic Web database system with logically and physically optimized SPARQL engines to manage and query RDF data, named LuposDate. In order to present the functionalities of the LUPOSDATE system and engines, we have developed an online demonstration, which is available at http://www.ifis.uni-luebeck.de/index.php?id=luposdate-demo.
Constructing evolutionary taxonomy of collaborative tagging systems BIBAFull-Text 2085-2086
  Junjie Yao; Yuxin Huang; Bin Cui
Collaborative tagging systems allow users to label online resources. The tags are generally correlated and evolving according to the change of web contents, and the popularity of tags represent evolution of social interests. Tag taxonomy is a promising solution to organize the data in tagging systems. In this demonstration, we propose to construct the evolutionary taxonomy which incorporates the correlation and evolution of tags, as user generated tags grow and change temporally. We demonstrate that our approach is intuitive and efficient in tag organization which exploits the evolving characteristic of collaborative tagging systems.
SPIDER: a system for scalable, parallel / distributed evaluation of large-scale RDF data BIBAFull-Text 2087-2088
  Hyunsik Choi; Jihoon Son; YongHyun Cho; Min Kyoung Sung; Yon Dohn Chung
RDF is a data model for representing labeled directed graphs, and it is used as an important building block of semantic web. Due to its flexibility and applicability, RDF has been used in applications, such as semantic web, bioinformatics, and social networks. In these applications, large-scale graph datasets are very common. However, existing techniques are not effectively managing them. In this paper, we present a scalable, efficient query processing system for RDF data, named SPIDER, based on the well-known parallel/distributed computing framework, Hadoop. SPIDER consists of two major modules (1) the graph data loader, (2) the graph query processor. The loader analyzes and dissects the RDF data and places parts of data over multiple servers. The query processor parses the user query and distributes sub queries to cluster nodes. Also, the results of sub queries from multiple servers are gathered (and refined if necessary) and delivered to the user. Both modules utilize the MapReduce framework of Hadoop. In addition, our system supports some features of SPARQL query language. This prototype will be foundation to develop real applications with large-scale RDF graph data.

Demo session 3: advanced techniques & applications

AnchorWoman: top-k structured mobile web search engine BIBAFull-Text 2089-2090
  Wookey Lee; James Jung-Hoon Lee; Young-Kuk Kim; Carson Kai-Sang Leung
With advances in technology, mobile handheld devices-such as PDAs-have become very popular. In many real-life situations, users want to find structuring information using these mobile devices, which are convenient to use but have relatively limited resources. In this paper, we present a top-k structured mobile Web search engine. It uses a top-k adaptable search-tree method that utilizes hierarchical structure of hypermedia objects to effectively look for structuring information from the mobile Web model. The engine, which is implemented in the mobile environment, provides users with top-k adaptive Web search recommendations for mobile handheld devices.
OSSOBOOK: database and knowledge-management techniques for archaeozoology BIBAFull-Text 2091-2092
  Hans-Peter Kriegel; Peer Kröger; Henriette Obermaier; Joris Peters; Matthias Renz; Christiaan Hendrikus van der Meijden
This demo describes the OSSOBOOK database system developed for archaeozoology applications providing data storage, data retrieval, and data mining facilities. It shows a case study of integrating state-of-the-art database concepts like intermittently synchronized database system as well as concepts of information retrieval and knowledge representation like similarity search and data mining in order to provide a comprehensive system for an interesting application domain.
A flexible simulation environment for flash-aware algorithms BIBAFull-Text 2093-2094
  Peiquan Jin; Xuan Su; Zhi Li; Lihua Yue
In this paper, we present a flexible simulation environment for the performance evaluation of flash-aware algorithms, which is called Flash-DBSim. The main purpose of Flash-DBSim is to provide a configurable virtual flash disk for upper systems, such as file system and DBMS, so that the algorithms in those systems can be easily evaluated on different types of flash disks. Moreover, it also offers a prototyping environment for those algorithms inside flash disk, e.g. the algorithms for garbage collection or wear-leveling. After an overview of the general features of Flash-DBSim, we discuss the architecture of Flash-DBSim. And finally, a case study of Flash-DBSim's demonstration is presented.
Helping people to choose for whom to vote. a web information system for the 2009 European elections BIBAFull-Text 2095-2096
  Arjan Nusselder; Hendrike Peetz; Anne Schuth; Maarten Marx
We demonstrate a web information system created for the European elections in June 2009. Based on their speeches in the EU parliament and their written questions, we created language models for each of the 736 members of the EU parliament. These language models were used to search for politicians responsible for a given topic, similar to expert search applications. Users prefer to see some kind of evidence for returning a hit after a search. We created a profile of each EU parliamentarian by comparing her personal language model to the language model created from all EU parliamentarians. The top 50 words best separating the individual from the avarege were shown as a wordcloud. These top 50 words and their scores were derived from a parsimonious language model.
RSS watchdog: an instant event monitor on real online news streams BIBAFull-Text 2097-2098
  Chih-Lin Hu; Chung-Kuang Chou
This paper introduces the RSS Watchdog system, which is capable of news clustering and instant event monitoring over multiple real and online RSS news streams. We briefly mention software architecture design, technical implementation, and prototype demonstration. In addition, the results of real case studies are presented to notice the RSS Watchdog's functionality.
RefMed: relevance feedback retrieval system for PubMed BIBAFull-Text 2099-2100
  Hwanjo Yu; Taehoon Kim; Jinoh Oh; Ilhwan Ko; Sungchul Kim
Finding related articles from the PubMed (a large biomedical literature repository) is challenging because it is hard to express the user's specific relevance in the given query interface and a keyword query typically retrieves many results. Biomedical researchers spend a critical amount of time (e.g., often more than several days) in the literature search process. This paper proposes RefMed, a novel search system for PubMed, which supports relevance ranking by enabling relevance feedback on PubMed. RefMed first returns initial result documents for a user's keyword query as in PubMed. The user then makes relevance judgments on some of the resultant documents while browsing them. Once the user "pushes the feedback", the system induces a relevance function using RankSVM and ranks the results according to the function. To realize the ad-hoc relevance retrieval on PubMed, RefMed "tightly" integrates RankSVM within RDBMS and runs the rank learning and process on the fly with a response time of a few minutes. Our qualitative experiments with biomedical researchers show that RefMed substantially reduces the amount of effort required to search related PubMed articles. RefMed is accessible at "http://dm.postech.ac.kr/refmed".

Demo session 4: XML data processing & system architecture

SOIRE: a service-oriented IR evaluation architecture BIBAFull-Text 2101-2102
  Michael Dittenbach; Bernhard Pflugfelder; Andreas Pesenhofer; Giovanna Roda; Helmut Berger
We have developed a system that offers comprehensive analysis functionality for information retrieval experiments combined with a storage facility for persisting experiment data in a uniform fashion to facilitate repeatability and comparability of experiments. Our Service-Oriented IR Evaluation Framework -- SOIRE offers a number of technological interfaces based on open networking standards and modeling languages to connect to other systems while regarding ease-of-use for researchers as the feature of utmost importance.
MRM: an adaptive framework for XML searching BIBAFull-Text 2103-2104
  Ho Lam Lau; Wilfred Ng
In order to deal with the diversified nature of XML documents as well as individual user preferences, we propose a novel Multi-Ranker Model (MRM), which is able to abstract a spectrum of important XML properties and adapt the features to different XML search needs. The model consists of a novel three-level ranking structure and a training module called Ranking Support Vector Machine in a voting Spy Naïve Bayes Framework (RSSF). RSSF is effective in learning search preference and then ranks the returned results adaptively. In this demonstration, we present our prototype developed from the model, which we call it the MRM XML search engine. The MRM engine employs only a list of simple XML tagged keywords as a user query for searching XML fragments from a collection of real XML documents. The demonstration presents an indepth analyses of the effectiveness of adaptive rankers, tailored XML rankers and a spectrum of low level ranking features.
Efficient and reliable merging of XML documents BIBAFull-Text 2105-2106
  Sebastian Rönnau; Geraint Philipp; Uwe M. Borghoff
Many knowledge-based processes rely on XML-based office documents. Up to now, versioning and merging XML documents was a difficult and error-prone task, mostly done manually. The support by tools is still in its infancy. We have presented a novel approach to compare and merge XML documents in a reliable way using context fingerprints. In this demonstration, we show the application of our toolset to ODF documents using a new, simple, and user-friendly graphical user-interface.
A graphical browser for XML schema documents BIBAFull-Text 2107-2108
  Jihyeon Yeom; Hyeokman Kim
Recently, tools for browsing XML Schema documents have become popular, but they cannot support advanced browsing functionalities. We have implemented a new graphical schema browser which provides traversals not only along a composition hierarchy but also along a type hierarchy defined in the documents. In this demonstration, we show how users can quickly and easily understand the semantic structures by freely navigating to any datatypes or elements located at any position in the hierarchies.
XQGen: an algebra-based XPath query generator for micro-benchmarking BIBAFull-Text 2109-2110
  Yuqing Wu; Namrata Lele; Rashmi Aroskar; Sharanya Chinnusamy; Sofia Brenes
We propose XQGen, a stand-alone, algebra-based XPath generator to aid engineers in testing and improving the design of XML query engines. XQGen takes an XML schema sketch and user configurations, such as number of queries, query types, duplication factors, and branching factors as input, and generates a set of queries that conform to the schema and configurations. In addition, given a set of label-paths as workload input, XQGen is capable of generating query sets that honor the workload.
ASIC: algebra-based structural index comparison BIBAFull-Text 2111-2112
  Yuqing Wu; Sofia Brenes; Tejas Totade; Shijin Joshua; Dhaval Damani; Michel Salim
Structural indices play a significant role in improving the efficiency of XML query evaluation. Being able to compare various structural indexing techniques is critical for a DBMS to select which indices to support, for the query optimizer to choose an index to use in query evaluation, and for DBAs to configure a database application. We present ASIC, an Algebra-based Structural Index Comparison framework that aids users in understanding the ability of different types of structural indices in answering XPath queries which have been characterized using the XPath algebra. ASIC allows users to select, configure and construct structural indices for comparison, guides users to compare the selected indices by evaluating queries of a particular XPath sub-algebra, and visually displays the index structures, query evaluation plans, and performance results for analysis and comparison.

CIKM 2009 co-located workshop overviews

Bridging the gap: complex networks meet information and knowledge management BIBAFull-Text 2113-2114
  Jun Wang; Shi Zhou; Dell Zhang
In this article, we briefly summarize the motivation, content and structure of the CNIKM'09 workshop.
TSA'09 workshop summary: topic-sentiment analysis BIBAFull-Text 2115-2116
  Bei Yu; Maojin Jiang
This workshop seeks to bring together researchers in both computer science and social sciences who are interested in developing and using topic-sentiment analysis methods to measure mass opinion, and to foster communications between the research community and industry practitioners as well.
Privacy and anonymization for very large datasets BIBAFull-Text 2117-2118
  Victor Muntés-Mulero; Jordi Nin
With the increase of available public data sources and the interest for analyzing them, privacy issues are becoming the eye of the storm in many applications. The vast amount of data collected on human beings and organizations as a result of cyberinfrastructure advances, or that collected by statistical agencies, for instance, has made traditional ways of protecting social science data obsolete. This has given rise to different techniques aimed at tackling this problem and at the analysis of limitations in such environments, such as the seminal study by Aggarwal of anonymization techniques and their dependency on data dimensionality. The growing accessibility to high-capacity storage devices allows keeping more detailed information from many areas. While this enriches the information and conclusions extracted from this data, it poses a serious problem for most of the previous work presented up to now regarding privacy, focused on quality and paying little attention to performance aspects. In this workshop, we want to gather researchers in the areas of data privacy and anonymization together with researchers in the area of high performance and very large data volumes management. We seek to collect the most recent advances in data privacy and anonymization (i.e. anonymization techniques, statistic disclosure techniques, privacy in machine learning algorithms, privacy in graphs or social networks, etc) and those in High Performance and Data Management (i.e. algorithms and structures for efficient data management, parallel or distributed systems, etc).
CloudDB workshop summary BIBAFull-Text 2119-2120
  Xiaofeng Meng; Haixun Wang; Ying Chen
This is the first workshop in CIKM conference that addresses the challenge of large data management based on cloud computing infrastructure. This workshop will bring together researchers and practitioners in cloud computing and data-intensive system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to maximize performance, minimize cost and improve the scale of their endeavors.
   Totally this workshop attracted 11 submissions from Asia, Canada, Europe, and the United States. The program committee accepted 8 papers, among which there are 5 full papers and 3 short papers. Topics of accepted papers includes query processing & index, service in the Cloud, platform availability, Cloud system implementation, data replication and so on. These proceedings will serve as a valuable reference for cloud data management researchers and developers.