
Proceedings of the 2011 ACM Conference on Information and Knowledge Management

Fullname:Proceedings of the 20th ACM international conference on Information and knowledge management
Editors:Bettina Berendt; Arjen de Vries; Wenfei Fan; Craig Macdonald; Iadh Ounis; Ian Ruthven
Location:Glasgow, Scotland
Dates:2011-Oct-24 to 2011-Oct-28
Publisher:ACM
Standard No:ISBN: 978-1-4503-0717-8; hcibib: CIKM11
Papers:428
Pages:2648
Summary:On behalf of the organizing committee, it is our great pleasure to welcome you to the 20th ACM Conference on Information and Knowledge Management in Glasgow!
    Since its inception, the CIKM conference has provided a unique international forum for the presentation, discussion and dissemination of research findings in data management, information retrieval and knowledge management. The purpose of the conference is to identify challenging problems facing the development of future knowledge and information systems and to shape future research directions through the publication of high quality, applied and theoretical research findings. The conference has been a leading forum in which experts from academia, industry and the public sector gather to exchange ideas, research achievements and technical developments in multidisciplinary research areas.
    CIKM is one of the world's most recognized conferences in the field. This year CIKM received 918 full paper submissions, 220 poster submissions, and 56 demonstration submissions. This is a great demonstration of the lively research areas that contribute to the CIKM area. In addition, CIKM 2011 will host 10 tutorials from leading researchers, 15 workshops on cutting-edge areas of research, a panel session on Social and Collaborative Search and a dedicated Industry Day featuring leading industrial practitioners. We are grateful to all authors who chose to submit their research to CIKM 2011 and are very excited by the final program.
    CIKM values interdisciplinary research and we are proud to present three keynote speakers, Professor Justin Zobel, Professor Maurizio Lenzerini and Professor David Karger, all of whom will give presentations that cross discipline boundaries.
  1. Keynote address
  2. Retrieval models
  3. Techniques for the Web
  4. Exploiting query logs
  5. Sparse data and difficult queries
  6. Type and structure
  7. Machine learning for information retrieval
  8. Information retrieval implementation techniques
  9. Language technology and information retrieval
  10. Results in context
  11. Algorithms
  12. Image retrieval
  13. Social media
  14. Personalization and advertising
  15. Evaluation and analysis
  16. Classification and evaluation
  17. Information filtering
  18. Topics and events
  19. Temporal, stream and spatial information
  20. Text mining
  21. Privacy
  22. Unsupervised and semi-supervised learning
  23. Social networks and communities
  24. Sentiments and other perspectives
  25. Classification and clustering: large-scale statistical techniques
  26. Link prediction
  27. Link, graph and relation mining
  28. Science, the past, and the future
  29. Information extraction and entities
  30. Queries, questions and tags mining
  31. Preparing, mining and evaluating with and for different views
  32. Information extraction and semantic techniques
  33. Data on the web
  34. Query processing and optimization
  35. Semantic web and information retrieval
  36. Query answering and social search
  37. Distributed data management and data integration
  38. Keyword search and ranked queries
  39. Data cleaning and analysis
  40. Graph management and queries
  41. Social, search, and other behaviour
  42. Applications in different areas
  43. Poster session: information retrieval
  44. Poster session: knowledge management
  45. Poster session: databases
  46. Demonstration session 1
  47. Demonstration session 2
  48. Demonstration session 3
  49. Co-located tutorial summaries
  50. Co-located workshop summaries
  51. Panel

Keynote address

Creating user interfaces that entice people to manage better information (pp. 1-2)
  David R. Karger
Much research in information management begins by asking how to manage a given information corpus. But information management systems can only be as good as the information they manage. They struggle and often fail to correctly infer meaning from large blobs of text and the mysterious actions and demands of users. And they are useless for managing information that is never captured.
   Instead of accepting the existing information as an immutable condition, I will argue that there are significant opportunities to help and motivate people to improve the quality and quantity of information their tools manage, and to exploit that better information to benefit its users.
   The greatest challenge in doing so is developing systems, and particularly user interfaces, that overcome humans' perverse reluctance to invest small present-moment effort for large future payoffs. Effective systems must minimize the effort needed to record high-quality information and maximize the perceived future benefits of that information investment.
   I will support these ideas with examples covering structured data management and presentation, notetaking, collaborative filtering, and social media.
Data, health, and algorithmics: computational challenges for biomedicine (pp. 3-4)
  Justin Zobel
In the decade following the completion of the Human Genome Project in 2000, the cost of sequencing DNA fell by a factor of around a million, and continues to fall. Applications of sequencing in health include precise diagnosis of infection and disease, lifestyle management, and development of highly targeted treatments. However, the volume and complexity of the data produced by these technologies present a severe computational challenge. Breakthroughs in methods for search, storage, and analysis are required to keep pace with the flow of data, and to make use of the changes in biomedical knowledge that these technologies are creating. This keynote is an overview of some of these technologies and the new computational obstacles they have engendered, and reviews examples of algorithmic innovations and approaches currently being explored. These illustrate both the kinds of solutions that are required and the challenges that must be addressed to allow this data to be fully exploited.
Ontology-based data management (pp. 5-6)
  Maurizio Lenzerini
Ontology-based data management aims at accessing and using data by means of an ontology, i.e., a conceptual representation of the domain of interest in the underlying information system. This new paradigm provides several interesting features, many of which have already proved effective in managing complex information systems. On the other hand, several important issues remain open, and constitute stimulating challenges for the research community. In this talk we first provide an introduction to ontology-based data management, illustrating the main ideas and techniques for using an ontology to access the data layer of an information system, and then we discuss several important issues that are still the subject of extensive investigation, including the need for inconsistency-tolerant query answering methods, and the need to support update operations expressed over the ontology.

Retrieval models

Lower-bounding term frequency normalization (pp. 7-16)
  Yuanhua Lv; ChengXiang Zhai
In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
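The lower-bounding idea described above can be illustrated with a minimal sketch. It assumes the standard Okapi BM25 term-frequency component and, as one simple way to impose a lower bound, adds a constant `delta` whenever the term occurs in the document. The function names, the default `delta=1.0`, and the parameter defaults are illustrative assumptions; the abstract does not specify the exact formulation.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 TF component with document-length normalization."""
    if tf <= 0:
        return 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (norm + tf)


def lower_bounded_tf(tf, doc_len, avg_doc_len, delta=1.0, k1=1.2, b=0.75):
    """BM25 TF component plus a constant lower bound (delta is an assumed value).

    For a matching term the score never drops below delta, however long the
    document is; a non-matching term still contributes nothing.
    """
    if tf <= 0:
        return 0.0
    return bm25_tf(tf, doc_len, avg_doc_len, k1, b) + delta


if __name__ == "__main__":
    # A single occurrence in a very long document is almost ignored by plain BM25.
    print(bm25_tf(1, doc_len=1_000_000, avg_doc_len=100))
    print(lower_bounded_tf(1, doc_len=1_000_000, avg_doc_len=100))
```

With the plain component, a term occurring once in an extremely long document scores close to zero, which is nearly the same as not occurring at all; the added constant keeps a gap between the two cases.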
A quasi-synchronous dependence model for information retrieval (pp. 17-26)
  Jae Hyun Park; W. Bruce Croft; David A. Smith
Incorporating syntactic features in a retrieval model has had very limited success in the past, with the exception of binary term dependencies. This paper presents a new term dependency modeling approach based on syntactic dependency parsing for both queries and documents. Our model is inspired by a quasi-synchronous stochastic process for machine translation [21]. We model four different types of relationships between syntactically dependent term pairs to perform inexact matching between documents and queries. We also propose a machine learning technique for predicting optimal parameter settings for a retrieval model incorporating syntactic relationships. The results on TREC collections show that the quasi-synchronous dependence model can improve retrieval performance and outperform a strong state-of-the-art sequential dependence baseline when we use predicted optimal parameters.
Improving retrieval accuracy of difficult queries through generalizing negative document language models (pp. 27-36)
  Maryam Karimzadehgan; ChengXiang Zhai
When a query topic is difficult and the search results are very poor, negative feedback is a very useful method to improve the retrieval accuracy and user experience. One challenge in negative feedback is that negative documents tend to be distracting in different ways, thus as training examples, negative examples are sparse. In this paper, we solve the problem of data sparseness in the language modeling framework. We propose an optimization framework, in which we learn from a few top-ranked non-relevant examples, and search in a large space of all language models to build a more general negative language model. This general negative language model has more power in pruning the non-relevant documents, thus potentially improving the performance for difficult queries. Experiment results on representative TREC collections show that the proposed optimization framework can improve negative feedback performance over the state-of-the-art negative feedback method through generalizing negative language models.
S3K: seeking statement-supporting top-K witnesses (pp. 37-46)
  Steffen Metzger; Shady Elbassuoni; Katja Hose; Ralf Schenkel
Traditional information retrieval techniques based on keyword search help to identify a ranked set of relevant documents, which often contains many documents in the top ranks that do not meet the user's intention. By considering the semantics of the keywords and their relationships, both precision and recall can be improved. Using an ontology, mapping keywords to entities/concepts, and identifying the relationship between them that the user is interested in allows for retrieving documents that actually meet the user's intention. In this paper, we present a framework that enables semantic-aware document retrieval. User queries are mapped to semantic statements based on entities and their relationships. The framework searches for documents expressing these statements in different variations, e.g., synonymous names for entities or different textual expressions for relations between them. The size of potential result sets makes ranking documents according to their relevance to the user an essential component of such a system. The ranking model proposed in this paper is based on statistical language models and considers aspects such as the authority of a document and the confidence in the textual pattern representing the queried information.
Finding relevant information of certain types from enterprise data (pp. 47-56)
  Xitong Liu; Hui Fang; Cong-Lei Yao; Min Wang
Search over enterprise data is essential to every aspect of an enterprise because it helps users fulfill their information needs. Similar to Web search, most queries in enterprise search are keyword queries. However, enterprise search is a unique research problem because, compared with the data in traditional IR applications (e.g., text data), enterprise data includes information stored in different formats. In particular, enterprise data include both unstructured and structured information, and all the data center around a particular enterprise. As a result, the relevant information from these two data sources could be complementary to each other. Intuitively, such integrated data could be exploited to improve the enterprise search quality. Despite its importance, this problem has received little attention so far. In this paper, we demonstrate the feasibility of leveraging the integrated information in enterprise data to improve search quality through a case study, i.e., finding relevant information of certain types from enterprise data. Enterprise search users often look for different types of relevant information other than documents, e.g., the contact information of persons working on a product. When formulating a keyword query, search users may specify both content requirements, i.e., what kind of information is relevant, and type requirements, i.e., what type of information is relevant. Thus, the goal is to find information relevant to both requirements specified in the query. Specifically, we formulate the problem as keyword search over structured or semistructured data, and then propose to leverage the complementary unstructured information in the enterprise data to solve the problem. Experiment results over real world enterprise data and simulated data show that the proposed methods can effectively exploit the unstructured information to find relevant information of certain types from structured and semistructured information in enterprise data.

Techniques for the Web

Unsupervised transactional query classification based on webpage form understanding (pp. 57-66)
  Yuchen Liu; Xiaochuan Ni; Jian-Tao Sun; Zheng Chen
Query type classification aims to classify search queries into categories such as navigational, informational and transactional, according to the type of information need behind the queries. Although this problem has drawn much research attention, previous methods usually require editors to label queries as training data or need domain knowledge to edit rules for predicting query type. Also, existing work has mainly focused on the classification of informational and navigational query types; transactional query classification has not been well addressed. In this work, we propose an unsupervised approach for transactional query classification. This method is based on the observation that, after transactional queries are issued to a search engine, many users will click the search result pages and then interact with Web forms on these pages. The interactions, e.g., typing in a text box, making selections from a dropdown list, or clicking on a button to execute actions, are used to specify detailed information of the transaction. By mining toolbar search log data, which records the associations between queries and Web forms clicked by users, we can get a set of good-quality transactional queries without manual labeling. By matching these automatically acquired transactional queries and their associated Web form contents, we can generalize these queries into patterns. These patterns can be used to classify queries which are not covered by the search log. Our experiments indicate that transactional queries produced by this method have good quality. The pattern-based classifier achieves an F1 score of 83%. This is very effective considering that we do not use any labeling effort to train the classifier.
Assigning documents to master sites in distributed search (pp. 67-76)
  Roi Blanco; B. Barla Cambazoglu; Flavio P. Junqueira; Ivan Kelly; Vincent Leroy
An appealing solution to scale Web search with the growth of the Internet is the use of distributed architectures. Distributed search engines rely on multiple sites deployed in distant regions across the world, where each site is specialized to serve queries issued by the users of its region. This paper investigates the problem of assigning each document to a master site. We show that by leveraging similarities between a document and the activity of the users, we can accurately detect which site is the most relevant to place a document. We conduct various experiments using two document assignment approaches, showing performance improvements of up to 20.8% over a baseline technique which assigns the documents to search sites based on their language.
Discovering URLs through user feedback (pp. 77-86)
  Xiao Bai; B. Barla Cambazoglu; Flavio P. Junqueira
Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers, but not discovered by the Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that web search engines can benefit greatly from user feedback in the form of toolbar logs for passive URL discovery.
User browsing behavior-driven web crawling (pp. 87-92)
  Minghai Liu; Rui Cai; Ming Zhang; Lei Zhang
To optimize the performance of web crawlers, various page importance measures have been studied to select and order URLs in crawling. Most sophisticated measures (e.g., breadth-first and PageRank) are based on link structure. In this paper, we treat the problem from another perspective and propose to measure page importance through mining user interest and behaviors from web browse logs. Unlike most existing approaches, which work on single URLs, both the log mining and the crawl ordering in this paper are performed at the granularity of URL patterns. The proposed URL pattern-based crawl orderings are capable of properly predicting the importance of newly created (unseen) URLs. Promising experimental results demonstrate the feasibility of our approach.
Diversifying search results of controversial queries (pp. 93-98)
  Mouna Kacimi; Johann Gamper
Diversifying search results of queries seeking different viewpoints about controversial topics is key to improving user satisfaction. The challenge in finding different opinions is how to maximize the number of discussed arguments without being biased against specific sentiments. This paper addresses the issue by first introducing a new model that represents the patterns occurring in documents about controversial topics, and second, proposing an opinion diversification model that uses (1) relevance of documents, (2) semantic diversification to capture different arguments and (3) sentiment diversification to identify positive, negative and neutral sentiments about the query topic. We have conducted our experiments using queries on various controversial topics and applied our diversification model on the set of documents returned by the Google search engine. The results show that our model outperforms the native ranking of Web pages about controversial topics by a significant margin.
Relevance weighting using within-document term statistics (pp. 99-104)
  Kai Hui; Ben He; Tiejian Luo; Bin Wang
With the rapid development of information technology, it has become difficult to deploy state-of-the-art retrieval models in environments such as peer-to-peer networks and pervasive computing, where it is expensive or even infeasible to maintain global statistics. To this end, this paper presents an investigation into the validity of different statistical assumptions of term distributions. Based on the findings of this investigation, a variety of weighting models, called NG (standing for "no global statistics") models, are derived from the Divergence from Randomness framework, in which only the within-document statistics are used in the relevance weighting. Compared to the state-of-the-art weighting models in extensive experiments on various standard TREC test collections, our proposed NG models can provide acceptable retrieval performance in ad-hoc search, without the use of global statistics.

Exploiting query logs

Suggestion set utility maximization using session logs (pp. 105-114)
  Umut Ozertem; Emre Velipasaoglu; Larry Lai
Assistance technology is undoubtedly one of the important elements of commercial search engines, and steering the user in the right direction throughout the search session is of great importance for providing a good search experience. Most search assistance methods in the literature that involve query generation, query expansion and other techniques consider each suggestion candidate individually, which implies an independence assumption. We challenge this independence assumption and give a method to maximize the utility of a given set of suggestions. For this, we define a measure of conditional utility for query pairs using query-URL bipartite graphs based on the session logs (clicked and viewed URLs). Afterwards, we remove the redundant queries from the suggestion set using a greedy algorithm so that they can be replaced with more useful ones. Both offline (based on user studies and session log analysis) and online (based on millions of user interactions) evaluations show that modeling the conditional utility and maximizing the utility of the set of queries (by eliminating redundant ones) significantly increases the effectiveness of the search assistance both for the presubmit and postsubmit modes.
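The greedy step mentioned above can be sketched as follows. This is not the authors' utility measure: as a stand-in, the conditional utility of a candidate suggestion is taken to be the number of URLs it leads to that are not already reachable from the suggestions chosen so far. The function name `select_suggestions` and the input shape (a mapping from suggestion to a set of URLs) are assumptions made for illustration.

```python
def select_suggestions(candidates, k):
    """Greedily pick up to k suggestions, skipping redundant ones.

    candidates: dict mapping a suggested query to the set of URLs associated
    with it (hypothetical stand-in for a query-URL bipartite graph).
    The gain of a candidate is the number of its URLs not yet covered.
    """
    chosen = []
    covered = set()
    remaining = dict(candidates)
    while remaining and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=lambda q: len(remaining[q] - covered))
        if not remaining[best] - covered:
            break  # every remaining suggestion is redundant
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


if __name__ == "__main__":
    log = {
        "cheap flights": {"a", "b"},
        "cheap flight": {"a", "b"},
        "flight status": {"c"},
    }
    print(select_suggestions(log, k=2))
```

In the example, "cheap flights" and "cheap flight" lead to the same URLs, so only one of them is kept and the freed slot goes to "flight status", which is the effect of dropping the independence assumption.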
Improving context-aware query classification via adaptive self-training (pp. 115-124)
  Minmin Chen; Jian-Tao Sun; Xiaochuan Ni; Yixin Chen
Topical classification of user queries is critical for general-purpose web search systems. It is also a challenging task, due to the sparsity of query terms and the lack of labeled queries. On the other hand, search contexts embedded in query sessions and unlabeled queries free on the web have not been fully utilized in most query classification systems. In this work, we leverage this information to improve query classification accuracy.
   We first incorporate search contexts into our framework using a Conditional Random Field (CRF) model. Discriminative training of CRFs is favored over the traditional maximum likelihood training because of its robustness to noise. We then adapt self-training with our model to exploit the information in unlabeled queries. By investigating different confidence measurements and model selection strategies, we effectively avoid the error-reinforcing nature of self-training. In extensive experiments on real search logs, we achieve an average improvement of around 20% in classification accuracy over other state-of-the-art baselines.
A task level metric for measuring web search satisfaction and its application on improving relevance estimation (pp. 125-134)
  Ahmed Hassan; Yang Song; Li-wei He
Understanding the behavior of satisfied and unsatisfied Web search users is very important for improving users' search experience. Collecting labeled data that characterizes search behavior is a very challenging problem. Most of the previous work used a limited amount of data collected in lab studies or annotated by judges lacking information about the actual intent. In this work, we performed a large scale user study where we collected explicit judgments of user satisfaction with the entire search task. Results were analyzed using sequence models that incorporate user behavior to predict whether the user ended up being satisfied with a search or not. We test our metric on millions of queries collected from real Web search traffic and show empirically that user behavior models trained using explicit judgments of user satisfaction outperform several other search quality metrics. The proposed model can also be used to optimize different search engine components. We propose a method that uses task level success prediction to provide a better interpretation of clickthrough data. Clickthrough data has been widely used to improve relevance estimation. We use our user satisfaction model to distinguish between clicks that lead to satisfaction and clicks that do not. We show that adding new features derived from this metric allowed us to improve the estimation of document relevance.
Multi-view random walk framework for search task discovery from click-through log (pp. 135-140)
  Jianwei Cui; Hongyan Liu; Jun Yan; Lei Ji; Ruoming Jin; Jun He; Yingqin Gu; Zheng Chen; Xiaoyong Du
Search engine users often have clear search tasks hidden behind their queries. Inspired by this, modern search engines are providing an increasing number of services to help users simplify their key tasks. However, the question of which major high-traffic user search tasks search engines should design special services for is still underexplored. In this paper, we propose a novel Multi-view Random Walk (MRW) algorithm to measure the search task oriented similarity between queries, and then group search queries with similar tasks so that the major search tasks of users can be identified from search engine click-through logs. The proposed MRW, which is a general framework to combine knowledge from different views in a random walk process, allows the random surfer to walk across different views to integrate information for search task discovery. Experimental results on the click-through log of a commonly used commercial search engine show that our proposed MRW algorithm can effectively discover user search tasks.
Query sampling for learning data fusion (pp. 141-146)
  Ting-Chu Lin; Pu-Jen Cheng
Data fusion merges the results of multiple independent retrieval models into a single ranked list. Several earlier studies have shown that combining different models can yield better retrieval performance than any of the individual models. Although many promising results have been given by supervised fusion methods, training data sampling has attracted little attention in previous work on data fusion. By observing some evaluations on TREC and NTCIR datasets, we found that the performance of one model varied largely from one training example to another, so that not all training examples were equally effective. In this paper, we propose two novel approaches, greedy and boosting, which select effective training data by query sampling to improve the performance of supervised data fusion algorithms such as BayesFuse, probFuse and MAPFuse. Extensive experiments were conducted on five data sets including TREC-3,4,5 and NTCIR-3,4. The results show that our sampling approaches can significantly improve the retrieval performance of those data fusion methods.
Query session detection as a cascade (pp. 147-152)
  Matthias Hagen; Benno Stein; Tino Rüb
We propose a cascading method for query session detection, the problem of identifying series of consecutive queries a user submits with the same information need. While the existing session detection research mostly deals with effectiveness, our focus also is on efficiency, and we investigate questions related to the analysis trade-off: How expensive (in terms of runtime) is a certain improvement in F-Measure? In this regard, we distinguish two major scenarios where query session knowledge is important: (1) In an online setting, the search engine tries to incorporate knowledge of the preceding queries for an improved retrieval performance. Obviously, the efficiency of the session detection method is a crucial issue as the overall retrieval time should not be influenced too much. (2) In an offline post-retrieval setting, search engine logs are divided into sessions in order to examine what causes users to fail or to identify typical reformulation patterns etc. Here, efficiency might not be as important as in the online scenario but the accuracy of the detected sessions is essential.
   Our cascading method provides a sensible treatment for both scenarios. It involves different steps that form a cascade in the sense that computationally costly and hence time-consuming features are applied only after cheap features "failed." This is different to previous session detection methods, most of which involve many features simultaneously. Experiments on a standard test corpus show the cascading method to save runtime compared to the state of the art while the detected sessions' accuracy is even superior.
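The cascade principle described above (costly features are consulted only after cheap ones fail to decide) can be sketched as below. The concrete steps are assumptions chosen for illustration and are not the steps of the paper: a cheap term-containment test, a cheap time-gap test with a hypothetical 30-minute threshold, and an optional caller-supplied `expensive_check` standing in for any costly feature.

```python
def same_session_cascade(prev_query, next_query, gap_seconds,
                         expensive_check=None, max_gap=1800):
    """Decide whether two consecutive queries belong to the same session.

    Cheap tests run first; expensive_check (a hypothetical callable taking
    both queries and returning a bool) is only invoked when they are undecided.
    """
    prev_terms = set(prev_query.lower().split())
    next_terms = set(next_query.lower().split())

    # Step 1 (cheap): one query repeats, generalizes, or specializes the other.
    if prev_terms and next_terms and (
        prev_terms <= next_terms or next_terms <= prev_terms
    ):
        return True

    # Step 2 (cheap): no shared terms and a long pause suggest a new session.
    if not (prev_terms & next_terms) and gap_seconds > max_gap:
        return False

    # Step 3 (costly): reached only when the cheap steps could not decide.
    if expensive_check is not None:
        return expensive_check(prev_query, next_query)

    # Fallback when no costly feature is supplied: rely on term overlap.
    return bool(prev_terms & next_terms)


if __name__ == "__main__":
    print(same_session_cascade("glasgow hotels", "glasgow hotels cheap", 10))
    print(same_session_cascade("glasgow hotels", "python tutorial", 4000))
```

Ordering the steps by cost means that the runtime of the expensive feature is paid only for the query pairs that the cheap features leave open.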

Sparse data and difficult queries

Discovering missing click-through query language information for web search (pp. 153-162)
  Xing Yi; James Allan
The click-through information in web query logs has been widely used for web search tasks. However, it usually suffers from the data sparseness problem, known as the missing/incomplete click problem, where a large volume of pages receive few or no clicks. In this paper, we adapt two language modeling based approaches to address this issue in the context of using web query logs for web search. The first approach discovers missing click-through query language features for web pages with no or few clicks from their similar pages' click-associated queries in the query logs, to help search. We further propose combining this content based approach with the random walk approach on the click graph to further reduce click-through sparseness for search. The second approach follows the query expansion method and utilizes the queries and their clicked web pages in the query logs to reconstruct a structured variant of the relevance based language models for each user-input query for search. We design experiments with a publicly available query log excerpt and two TREC web search tasks on the GOV2 and ClueWeb09 corpora to evaluate the search performance of different approaches. Our results show that using discovered semantic click-through query language features can statistically significantly improve search performance, compared with the baselines that do not use the discovered information. The combination approach that uses discovered click-through features from both random walk and the content based approach can further improve search performance.
Interactive sense feedback for difficult queries 163-172
  Alexander Kotov; ChengXiang Zhai
Ambiguity of query terms is a common cause of inaccurate retrieval results. Existing work has mostly focused on studying how to improve retrieval accuracy by automatically resolving word sense ambiguity. However, fully automatic sense identification and disambiguation is a very challenging task. In this work, we propose to involve a user in the process of disambiguation through interactive sense feedback and study the potential effectiveness of this novel feedback strategy. We propose several general methods to automatically identify the major senses of query terms based on global analysis of the document collection and to present concise representations of the discovered senses to users. This feedback strategy does not rely on initial retrieval results, and thus can be especially useful for improving the results of difficult queries. We evaluated the effectiveness of the proposed methods for sense identification and presentation through simulation experiments and user studies, which both indicate that the sense feedback strategy is a promising alternative to existing interactive feedback techniques such as relevance feedback and term feedback.
Reranking search results for sparse queries 173-182
  Elif Aktolga; James Allan
It is well known that clickthrough data can be used to improve the effectiveness of search results: broadly speaking, a query's past clicks are a predictor of future clicks on documents. However, when a new or unusual query appears, or when a system is not as widely used as a mainstream web search system, there may be little to no click data available to improve the results. Existing methods to boost query performance for sparse queries extend the query-document click relationship to more documents or queries, but require substantial clickthrough data from other queries. In this work we describe a way to boost rarely-clicked queries in a system where limited clickthrough data is available for all queries. We describe a probabilistic approach for carrying out that estimation and use it to rerank retrieved documents. We utilize information from co-click queries, subset queries, and synonym queries to estimate the clickthrough for a sparse query. Our experiments on a query log from a medical informatics company demonstrate that when overall clickthrough data is sparse, reranking search results using clickthrough information from related queries significantly outperforms reranking that employs clickthrough information from the query alone.
Searching microblogs: coping with sparsity and document quality 183-188
  Nasir Naveed; Thomas Gottron; Jérôme Kunegis; Arifah Che Alhadi
Two of the main challenges in retrieval on microblogs are the inherent sparsity of the documents and difficulties in assessing their quality. The feature sparsity is immanent to the restriction of the medium to short texts. Quality assessment is necessary as the microblog documents range from spam over trivia and personal chatter to news broadcasts, information dissemination and reports of current hot topics. In this paper we analyze how these challenges can influence standard retrieval models and propose methods to overcome the problems they pose. We consider the sparsity's effect on document length normalization and introduce "interestingness" as a static quality measure. Our results show that deliberately ignoring length normalization yields better retrieval results in general and that interestingness improves retrieval for underspecified queries.
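To make "ignoring length normalization" concrete, here is a minimal sketch using the standard BM25 term-weight formula. BM25 is chosen here only as a familiar example of a model with an explicit length-normalization parameter; the abstract does not say which retrieval models the authors studied. Setting `b = 0` removes the dependence on document length entirely.

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Standard BM25 contribution of a single query term to a document score.

    b controls length normalization: b = 1 normalizes fully by document
    length, b = 0 ignores document length altogether.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Two short documents that each contain the query term once.
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=15, idf=1.0)
long_doc = bm25_term_score(tf=1, doc_len=25, avg_doc_len=15, idf=1.0)

# With b = 0 both documents receive the same score for the term.
short_flat = bm25_term_score(tf=1, doc_len=5, avg_doc_len=15, idf=1.0, b=0.0)
long_flat = bm25_term_score(tf=1, doc_len=25, avg_doc_len=15, idf=1.0, b=0.0)
```

With the default `b`, the shorter document is favoured purely because of its length; with `b = 0` the two scores coincide, so the length of a short text no longer influences the ranking.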
Finding images of difficult entities in the long tail 189-194
  Bilyana Taneva; Mouna Kacimi; Gerhard Weikum
While images of famous people and places are abundant on the Internet, they are much harder to retrieve for less popular entities such as notable computer scientists or regionally interesting churches. Querying the entity names in image search engines yields large candidate lists, but they often have low precision and unsatisfactory recall. In this paper, we propose a principled model for finding images of rare or ambiguous named entities. We propose a set of efficient, light-weight algorithms for identifying entity-specific keyphrases from a given textual description of the entity, which we then use to score candidate images based on the matches of keyphrases in the underlying Web pages. Our experiments show the high precision-recall quality of our approach.
Learning to rank user intent 195-200
  Giorgos Giannopoulos; Ulf Brefeld; Theodore Dalamagas; Timos Sellis
Personalized retrieval models aim at capturing user interests to provide personalized results that are tailored to the respective information needs. User interests are however widely spread, subject to change, and cannot always be captured well, thus rendering the deployment of personalized models challenging. We take a different approach and study ranking models for user intent. We exploit user feedback in terms of click data to cluster ranking models for historic queries according to user behavior and intent. Each cluster is finally represented by a single ranking model that captures the contained search interests expressed by users. Once new queries are issued, these are mapped to the clustering and the retrieval process diversifies possible intents by combining relevant ranking functions. Empirical evidence shows that our approach significantly outperforms baseline approaches on a large corporate query log.

Type and structure

Learning to aggregate vertical results into web search results 201-210
  Jaime Arguello; Fernando Diaz; Jamie Callan
Aggregated search is the task of integrating results from potentially multiple specialized search services, or verticals, into the Web search results. The task requires predicting not only which verticals to present (the focus of most prior research), but also predicting where in the Web results to present them (i.e., above or below the Web results, or somewhere in between). Learning models to aggregate results from multiple verticals is associated with two major challenges. First, because verticals retrieve different types of results and address different search tasks, results from different verticals are associated with different types of predictive evidence (or features). Second, even when a feature is common across verticals, its predictiveness may be vertical-specific. Therefore, approaches to aggregating vertical results require handling an inconsistent feature representation across verticals, and, potentially, a vertical-specific relationship between features and relevance. We present 3 general approaches that address these challenges in different ways and compare their results across a set of 13 verticals and 1070 queries. We show that the best approaches are those that allow the learning algorithm to learn a vertical-specific relationship between features and relevance.
Coreference aware web object retrieval 211-220
  Jeffrey Dalton; Roi Blanco; Peter Mika
As user demands become increasingly sophisticated, search engines today are competing in more than just returning document results from the Web. One area of competition is providing web object results from structured data extracted from a multitude of information sources. We address the problem of performing keyword retrieval over a collection of objects containing a large degree of duplication as different Web-based information sources provide descriptions of the same object. We develop a method for coreference aware retrieval that performs topic-specific coreference resolution on retrieved objects in order to improve object search results. Our results demonstrate that coreference has a significant impact on the effectiveness of retrieval in the domain of local search. Our results show that a coreference aware system outperforms naive object retrieval by more than 20% in P5 and P10.
Tag clouds revisited 221-230
  Dimitrios Skoutas; Mohammad Alrifai
Tagging has become a very common feature in Web 2.0 applications, providing a simple and effective way for users to freely annotate resources to facilitate their discovery and management. Subsequently, tag clouds have become popular as a summarized representation of a collection of tagged resources. A tag cloud is typically a visualization of the top-k most frequent tags in the underlying collection. In this paper, we revisit tag clouds, to examine whether frequency is the most suitable criterion for tag ranking. We propose alternative tag ranking strategies, based on methods for random walk on graphs, diversification, and rank aggregation. To enable the comparison of different tag selection and ranking methods, we propose a set of evaluation metrics that consider the use of tag clouds for search, navigation and recommendations. We apply these tag ranking methods and evaluation metrics to empirically compare alternative tag clouds in a dataset obtained from Flickr, comprising 488,112 tagged photos organized in 451 groups, and 112,514 distinct tags.
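The frequency baseline that this abstract revisits, a tag cloud made of the top-k most frequent tags, can be sketched in a few lines. The function name and the choice to count a tag at most once per resource are assumptions made for this illustration; the alternative ranking strategies proposed in the paper are not shown.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def frequency_tag_cloud(tagged_resources: Iterable[Sequence[str]],
                        k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent tags with their resource counts.

    Each resource contributes at most one count per tag, so a tag repeated
    on the same resource is not counted twice.
    """
    counts = Counter(tag for tags in tagged_resources for tag in set(tags))
    return counts.most_common(k)


photos = [
    ["sunset", "beach"],
    ["beach", "sea"],
    ["beach", "sunset"],
    ["city"],
]
cloud = frequency_tag_cloud(photos, k=2)
```

Here `cloud` holds the two most frequent tags, `beach` (3 photos) and `sunset` (2 photos). Tags such as `city` never appear, however useful they might be for navigation, which is the kind of limitation that motivates ranking criteria other than raw frequency.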
Ranking-based processing of SQL queries 231-236
  Hany Azzam; Thomas Roelleke; Sirvan Yahyaei
A growing number of applications are built on top of search engines and issue complex structured queries. This paper contributes a customisable ranking-based processing of such queries, specifically SQL. Similar to how term-based statistics are exploited by term-based retrieval models, ranking-aware processing of SQL queries exploits tuple-based statistics that are derived from sources or, more precisely, derived from the relations specified in the SQL query. To implement this ranking-based processing, we leverage PSQL, a probabilistic variant of SQL, to facilitate probability estimation and the generalisation of document retrieval models to be used for tuple retrieval. The result is a general-purpose framework that can interpret any SQL query and then assign a probabilistic retrieval model to rank the results of that query. The evaluation on the IMDB and Monster benchmarks proves that the PSQL-based approach is applicable to (semi-)structured and unstructured data and structured queries.
Keyword search over RDF graphs 237-242
  Shady Elbassuoni; Roi Blanco
Large knowledge bases consisting of entities and relationships between them have become vital sources of information for many applications. Most of these knowledge bases adopt the Semantic-Web data model RDF as a representation model. Querying these knowledge bases is typically done using structured queries utilizing graph-pattern languages such as SPARQL. However, such structured queries require some expertise from users, which limits the accessibility of such data sources. To overcome this, keyword search must be supported. In this paper, we propose a retrieval model for keyword queries over RDF graphs. Our model retrieves a set of subgraphs that match the query keywords, and ranks them based on statistical language models. We show that our retrieval model outperforms state-of-the-art IR and DB models for keyword search over structured data using experiments over two real-world datasets.
Frequency-aware similarity measures: why Arnold Schwarzenegger is always a duplicate 243-248
  Dustin Lange; Felix Naumann
Measuring the similarity of two records is a challenging problem, but necessary for fundamental tasks, such as duplicate detection and similarity search. By exploiting frequencies of attribute values, many similarity measures can be improved: In a person table with U.S. citizens, Arnold Schwarzenegger is a very rare name. If we find several Arnold Schwarzeneggers in it, it is very likely that these are duplicates. We are then less strict when comparing other attribute values, such as birth date or address. We put this intuition to use by partitioning compared record pairs according to frequencies of attribute values. For example, we could create three partitions from our data: Partition 1 contains all pairs with rare names, Partition 2 all pairs with medium frequent names, and Partition 3 all pairs with frequent names. For each partition, we learn a different similarity measure: we apply machine learning techniques to combine a set of base similarity measures into an overall measure. To determine a good partitioning, we compare different partitioning strategies. We achieved best results with a novel algorithm inspired by genetic programming.
   We evaluate our approach on real-world data sets from a large credit rating agency and from a bibliography database. We show that our learning approach works well for logistic regression, SVM, and decision trees with significant improvements over (i) learning models that ignore frequencies and (ii) frequency-enriched models without partitioning.
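The partitioning step described in this abstract can be sketched as follows. The frequency thresholds, the rule of using the more frequent of the two names, and the fixed per-partition similarity thresholds are all hypothetical stand-ins: the paper learns a separate similarity measure for each partition and searches for a good partitioning, neither of which is reproduced here.

```python
from typing import Dict

# Hypothetical minimum similarity of the remaining attributes (birth date,
# address, ...) needed to call a pair a duplicate. Rare names need less
# supporting evidence; these numbers stand in for learned measures.
REQUIRED_SIMILARITY = {"rare": 0.5, "medium": 0.7, "frequent": 0.9}


def partition_of(name_a: str, name_b: str, name_freq: Dict[str, int],
                 rare_max: int = 10, medium_max: int = 1000) -> str:
    """Assign a record pair to a partition by the frequency of its names.

    The pair is judged by the more frequent of its two names; names missing
    from name_freq count as frequency 0.
    """
    freq = max(name_freq.get(name_a, 0), name_freq.get(name_b, 0))
    if freq <= rare_max:
        return "rare"
    if freq <= medium_max:
        return "medium"
    return "frequent"


def is_duplicate(name_a: str, name_b: str, other_similarity: float,
                 name_freq: Dict[str, int]) -> bool:
    """Apply the partition-specific threshold to the other attributes."""
    partition = partition_of(name_a, name_b, name_freq)
    return other_similarity >= REQUIRED_SIMILARITY[partition]


name_freq = {"Arnold Schwarzenegger": 2, "Maria Keller": 150, "John Smith": 5000}
```

With these numbers, two "Arnold Schwarzenegger" records whose other attributes match only moderately (say 0.6) are treated as duplicates, while two "John Smith" records with the same 0.6 are not.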

Machine learning for information retrieval

A probabilistic method for inferring preferences from clicks 249-258
  Katja Hofmann; Shimon Whiteson; Maarten de Rijke
Evaluating rankers using implicit feedback, such as clicks on documents in a result list, is an increasingly popular alternative to traditional evaluation methods based on explicit relevance judgments. Previous work has shown that so-called interleaved comparison methods can utilize click data to detect small differences between rankers and can be applied to learn ranking functions online. In this paper, we analyze three existing interleaved comparison methods and find that they are all either biased or insensitive to some differences between rankers. To address these problems, we present a new method based on a probabilistic interleaving process. We derive an unbiased estimator of comparison outcomes and show how marginalizing over possible comparison outcomes given the observed click data can make this estimator even more effective.
   We validate our approach using a recently developed simulation framework based on a learning to rank dataset and a model of click behavior. Our experiments confirm the results of our analysis and show that our method is both more accurate and more robust to noise than existing methods.
Intent-aware query similarity 259-268
  Jiafeng Guo; Xueqi Cheng; Gu Xu; Xiaofei Zhu
Query similarity calculation is an important problem and has a wide range of applications in IR, including query recommendation, query expansion, and even advertisement matching. Existing work on query similarity aims to provide a single similarity measure without considering the fact that queries are ambiguous and usually have multiple search intents. In this paper, we argue that query similarity should be defined upon search intents, so-called intent-aware query similarity. By introducing search intents into the calculation of query similarity, we can obtain more accurate and also more informative similarity measures on queries and thus help a variety of applications, especially those related to diversification. Specifically, we first identify the potential search intents of queries, and then measure query similarity under different intents using intent-aware representations. A regularized topic model is employed to automatically learn the potential intents of queries by using both the words from search result snippets and the regularization from query co-clicks. Experimental results confirm the effectiveness of intent-aware query similarity on ambiguous queries, where it provides significantly better similarity scores than the traditional approaches. We also experimentally verified the utility of intent-aware similarity in the application of query recommendation, where it can suggest diverse queries to search users in a structured way.
Semi-supervised learning to rank with preference regularization 269-278
  Martin Szummer; Emine Yilmaz
We propose a semi-supervised learning to rank algorithm. It learns from both labeled data (pairwise preferences or absolute labels) and unlabeled data. The data can consist of multiple groups of items (such as queries), some of which may contain only unlabeled items. We introduce a preference regularizer favoring that similar items are similar in preference to each other. The regularizer captures manifold structure in the data, and we also propose a rank-sensitive version designed for top-heavy retrieval metrics including NDCG and mean average precision.
   The regularizer is employed in SSLambdaRank, a semi-supervised version of LambdaRank. This algorithm directly optimizes popular retrieval metrics and improves retrieval accuracy over LambdaRank, a state-of-the-art ranker that was used as part of the winner of the Yahoo! Learning to Rank challenge 2010. The algorithm runs in linear time in the number of queries, and can work with huge datasets.
Simultaneous clustering of multi-type relational data via symmetric nonnegative matrix tri-factorization 279-284
  Hua Wang; Heng Huang; Chris Ding
The rapid growth of the Internet and modern technologies has brought data involving objects of multiple types that are related to each other, called multi-type relational data. Traditional clustering methods for single-type data rarely work well on them, which calls for more advanced clustering techniques to deal with multiple types of data simultaneously to utilize their interrelatedness. A major challenge in developing simultaneous clustering methods is how to effectively use all available information contained in a multi-type relational data set including inter-type and intra-type relationships. In this paper, we propose a Symmetric Nonnegative Matrix Tri-Factorization (S-NMTF) framework to cluster multi-type relational data at the same time. The proposed S-NMTF approach employs NMTF to simultaneously cluster different types of data using their inter-type relationships, and incorporates the intra-type information through manifold regularization. In order to deal with the symmetric usage of the factor matrix in S-NMTF, we present a new generic matrix inequality to derive the solution algorithm, which involves a fourth-order matrix polynomial, in a principled way. Promising experimental results have validated the proposed approach.
Collaborative online learning of user generated content 285-290
  Guangxia Li; Kuiyu Chang; Steven C. H. Hoi; Wenting Liu; Ramesh Jain
We study the problem of online classification of user generated content, with the goal of efficiently learning to categorize content generated by individual users. This problem is challenging for several reasons. First, the huge amount of user generated content demands a highly efficient and scalable classification solution. Second, the categories are typically highly imbalanced, i.e., the samples from a particular useful class could be few and far between compared to some others (majority class). In some applications like spam detection, identification of the minority class often has significantly greater value than that of the majority class. Last but not least, when learning a classification model from a group of users, there is a dilemma: A single classification model trained on the entire corpus may fail to capture personalized characteristics such as language and writing styles unique to each user. On the other hand, a personalized model dedicated to each user may be inaccurate due to the scarcity of training data, especially at the very beginning, when users have written just a few articles. To overcome these challenges, we propose learning a global model over all users' data, which is then leveraged to continuously refine the individual models through a collaborative online learning approach. The class imbalance problem is addressed via a cost-sensitive learning approach. Experimental results show that our method is effective and scalable for timely classification of user generated content.
Structured learning of two-level dynamic rankings 291-296
  Karthik Raman; Thorsten Joachims; Pannaga Shivaswamy
For ambiguous queries, conventional retrieval systems are bound by two conflicting goals. On the one hand, they should diversify and strive to present results for as many query intents as possible. On the other hand, they should provide depth for each intent by displaying more than a single result. Since both diversity and depth cannot be achieved simultaneously in the conventional static retrieval model, we propose a new dynamic ranking approach. In particular, our proposed two-level dynamic ranking model allows users to adapt the ranking through interaction, thus overcoming the constraints of presenting a one-size-fits-all static ranking. In this model, a user's interactions with the first-level ranking are used to infer this user's intent, so that second-level rankings can be inserted to provide more results relevant to this intent. Unlike previous dynamic ranking models, we provide an algorithm to efficiently compute dynamic rankings with provable approximation guarantees. We also propose the first principled algorithm for learning dynamic ranking functions from training data. In addition to the theoretical results, we provide empirical evidence demonstrating the gains in retrieval quality over conventional approaches.

Information retrieval implementation techniques

Efficiency optimizations for interpolating subqueries 297-306
  Marc-Allen Cartright; James Allan
A large class of queries can be viewed as linear combinations of smaller subqueries. Additionally, many situations arise when part or all of one subquery has been preprocessed or has cached information, while another subquery requires full processing. This type of query is common, for example, in relevance feedback settings where the original query has been run to produce a set of expansion terms, but the expansion terms still need to be processed. We investigate mechanisms to reduce the time needed to process queries of this nature. We use RM3, a variant of the Relevance Model scoring algorithm, as our instantiation of this arrangement. We examine the different scenarios that can arise when we have access to the internal structure of each subquery. Given this additional information, we investigate methods to utilize it, reducing processing costs substantially. Depending on the degree of access we have to the subqueries, we can reduce processing costs by over 80% without affecting the score of the final results.
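The setting in this abstract, a final score that is a linear combination of two subquery scores where one side is already cached, can be sketched as below. The function name, the dictionary representation of scores, the default weight and the treatment of missing documents as score 0.0 are simplifying assumptions; the paper's actual optimizations, which exploit the internal structure of the subqueries, are not shown.

```python
from typing import Dict


def interpolate(cached_original: Dict[str, float],
                expansion_scores: Dict[str, float],
                lam: float = 0.5) -> Dict[str, float]:
    """Combine two subquery score maps as lam * original + (1 - lam) * expansion.

    cached_original is assumed to come from the earlier run of the original
    query, so only expansion_scores has to be computed at this point.
    Documents missing from one side contribute 0.0 for that side, which is
    a simplification of how a real scoring function would behave.
    """
    docs = set(cached_original) | set(expansion_scores)
    return {
        doc: lam * cached_original.get(doc, 0.0)
        + (1.0 - lam) * expansion_scores.get(doc, 0.0)
        for doc in docs
    }


original = {"d1": 2.0, "d2": 1.0}   # cached from the first retrieval pass
expansion = {"d2": 3.0, "d3": 4.0}  # scores for the expansion terms only
combined = interpolate(original, expansion, lam=0.5)
```

In this toy example `d2` and `d3` end up tied at 2.0 and `d1` scores 1.0. The point is only that the original-query side is read from a cache rather than recomputed.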
Efficiently encoding term co-occurrences in inverted indexes 307-316
  Marcus Fontoura; Maxim Gurevich; Vanja Josifovski; Sergei Vassilvitskii
Precomputation of common term co-occurrences has been successfully applied to improve query performance in large scale search engines based on inverted indexes. The results of such precomputations are traditionally stored as additional posting lists in the index. During query evaluation, these precomputed lists are used to reduce the number of query terms, as the results for multiple terms can be accessed through a single precomputed list. In this paper, we expand this paradigm by considering an alternative method for storing term co-occurrences in inverted indexes. For a selected set of terms in the index, we store bitmaps that encode term co-occurrences. A bitmap of size k for term t augments each posting to store the co-occurrences of t with k other terms, across every document in the index. At query evaluation, size k bitmaps can be used to answer queries that involve any of the 2^k combinations of the additional terms. In contrast, a precomputed list, although typically shorter, can only be used to evaluate queries containing all of its terms. We evaluate the bitmaps technique we propose, and the baseline of adding precomputed posting lists and show that they are complementary, as they capture different aspects of the query evaluation cost. We perform an experimental evaluation on the TREC WT10g corpus and show that a hybrid strategy combining both methods significantly lowers the cost of query evaluation compared to each method separately.
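The bitmap idea in this abstract can be illustrated with a small in-memory sketch: each posting of a term t carries a k-bit bitmap whose i-th bit records whether the document also contains the i-th of k selected other terms. The function names, the whitespace tokenization and the list-of-tuples posting layout are assumptions for illustration and say nothing about the index format used in the paper.

```python
from typing import List, Sequence, Tuple

Posting = Tuple[int, int]  # (document id, co-occurrence bitmap)


def build_postings(term: str, extra_terms: Sequence[str],
                   docs: Sequence[str]) -> List[Posting]:
    """Build the posting list of `term`, augmenting each posting with a bitmap.

    Bit i of the bitmap is set when the document also contains extra_terms[i].
    """
    postings = []
    for doc_id, text in enumerate(docs):
        words = set(text.split())
        if term not in words:
            continue
        bitmap = 0
        for i, other in enumerate(extra_terms):
            if other in words:
                bitmap |= 1 << i
        postings.append((doc_id, bitmap))
    return postings


def conjunctive_query(postings: Sequence[Posting], extra_terms: Sequence[str],
                      required: Sequence[str]) -> List[int]:
    """Answer `term AND required...` from the posting list of `term` alone."""
    mask = 0
    for other in required:
        mask |= 1 << extra_terms.index(other)
    return [doc_id for doc_id, bitmap in postings if (bitmap & mask) == mask]


docs = ["new york hotel", "new jersey", "new york city", "york minster"]
extra = ["york", "hotel"]
postings = build_postings("new", extra, docs)
```

Any subset of `extra` can be combined with `new` using only this one list, which corresponds to the 2^k combinations mentioned in the abstract. A precomputed list for "new york hotel" would instead serve only queries containing all three terms.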
SIMD-based decoding of posting lists 317-326
  Alexander A. Stepanov; Anil R. Gangolli; Daniel E. Rose; Ryan J. Ernst; Paramjit S. Oberoi
Powerful SIMD instructions in modern processors offer an opportunity for greater search performance. In this paper, we apply these instructions to decoding search engine posting lists. We start by exploring variable-length integer encoding formats used to represent postings. We define two properties, byte-oriented and byte-preserving, that characterize many formats of interest. Based on their common structure, we define a taxonomy that classifies encodings along three dimensions, representing the way in which data bits are stored and additional bits are used to describe the data. Using this taxonomy, we discover new encoding formats, some of which are particularly amenable to SIMD-based decoding. We present generic SIMD algorithms for decoding these formats. We also extend these algorithms to the most common traditional encoding format. Our experiments demonstrate that SIMD-based decoding algorithms are up to 3 times faster than non-SIMD algorithms.
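As background for the variable-length integer formats this abstract refers to, here is a scalar (non-SIMD) sketch of one widely used byte-oriented encoding, often called varint or VByte: seven data bits per byte, with the high bit marking that more bytes follow. Whether this exact layout is the "most common traditional encoding format" the authors mean is an assumption, and none of the SIMD algorithms from the paper are reproduced. Storing document-id gaps instead of absolute ids is likewise a common convention assumed here.

```python
from typing import Iterable, List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, least significant 7-bit group first."""
    if n < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of the value
            return bytes(out)


def decode_varints(data: bytes) -> List[int]:
    """Decode a concatenation of varints, one byte at a time."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("input ends in the middle of a value")
    return values


def encode_postings(doc_ids: Iterable[int]) -> bytes:
    """Encode an increasing list of document ids as varint-coded gaps."""
    out, previous = bytearray(), 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Invert encode_postings by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in decode_varints(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

The byte-at-a-time loop in `decode_varints`, with a data-dependent branch on every byte, is the kind of scalar baseline that decoding several bytes per instruction aims to improve on.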
Factorization-based lossless compression of inverted indices 327-332
  George Beskales; Marcus Fontoura; Maxim Gurevich; Sergei Vassilvitskii; Vanja Josifovski
Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space as a matrix factorization problem, and prove that finding the optimal solution is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is an increase in the number of terms in the index, which may negatively affect query evaluation performance. To eliminate this effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach.
TOPSIG: topology preserving document signatures 333-338
  Shlomo Geva; Christopher M. De Vries
Comparisons between file signatures and inverted files for text retrieval have shown the shortcomings of traditional file signatures. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures that extends recent advances in semantic hashing and dimensionality reduction. These advances have not previously been linked to general purpose, signature file based search engines. We demonstrate significant improvements in the performance of signature file based indexing and retrieval. Performance is comparable to that of state-of-the-art inverted file based systems, including language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and position the file signatures model in the class of Vector Space retrieval models.
Implementation techniques for large-scale latent semantic indexing applications 339-344
  Roger B. Bradford
The technique of latent semantic indexing (LSI) has wide applicability in information retrieval and data mining tasks. To date, however, most applications of LSI have addressed relatively small collections of data. This has been due partly to hardware and software limitations and partly to overly pessimistic estimates of the processing requirements of the singular value decomposition (SVD) process. In recent years, advances in hardware capabilities and software implementations have enabled much larger LSI applications. Moreover, experience with large LSI indexes has shown that the SVD is not the limitation on scalability that it was long thought to be. This paper describes techniques applicable to creating large-scale (multi-million document) LSI indexes. Detailed data regarding the LSI index creation process is presented for collections of up to 100 million documents. Four key factors are shown to contribute to the scalability of LSI. First, in most situations, the time required for calculation of the singular value decomposition (SVD) of the term-document matrix is not the dominant factor determining the overall time required to build an LSI index. Second, the time required to calculate the SVD in LSI is linear in the number of objects indexed. Third, incremental index creation greatly facilitates use of LSI in dynamic environments. Fourth, distributed query processing can be employed to support large numbers of users. It is shown that LSI is well-suited for implementation in modern distributed computing environments. This paper provides the first measurements of the execution time for large-scale LSI build processes in a cloud environment.

Language technology and information retrieval

Statistical source expansion for question answering 345-354
  Nico Schlaefer; Jennifer Chu-Carroll; Eric Nyberg; James Fan; Wlodek Zadrozny; David Ferrucci
A source expansion algorithm automatically extends a given text corpus with related content from large external sources such as the Web. The expanded corpus is not intended for human consumption but can be used in question answering (QA) and other information retrieval or extraction tasks to find more relevant information and supporting evidence. We propose an algorithm that extends a corpus of seed documents with web content, using a statistical model to select text passages that are both relevant to the topics of the seeds and complement existing information.
   In an evaluation on 1,500 hand-labeled web pages, our algorithm ranked text passages by relevance with 81% MAP, compared to 43% when relying on web search engine ranks alone and 75% when using a multi-document summarization algorithm. Applied to QA, the proposed method yields consistent and significant performance gains. We evaluated the impact of source expansion on over 6,000 questions from the Jeopardy! quiz show and TREC evaluations using Watson, a state-of-the-art QA system. Accuracy increased from 66% to 71% on Jeopardy! questions and from 59% to 64% on TREC questions.
Passage retrieval for incorporating global evidence in sequence labeling 355-364
  Jeffrey Dalton; James Allan; David A. Smith
Many forms of linguistic analysis, such as part of speech tagging, named entity recognition, and other sequence labeling tasks, are performed on short spans of text and assume statistical dependence within a window of only a few tokens. We propose using passage retrieval to induce non-local dependencies in structured classification, generalizing earlier work in context aggregation for named-entity recognition. We introduce a new method for feature expansion inspired by pseudo-relevance feedback (PRF). Our results on the CoNLL 2003 task show that features from cross-document feature expansion improve NER effectiveness over previous aggregation models. Utilizing all the tokens in a sentence for query context consistently performs best on both intrinsic and extrinsic evaluations. Tagging models incorporating feature expansion outperform the leading NER system when evaluated on out of domain data, a collection of publicly available scanned books on the topic of historic Deerfield, MA. Finally, the results show that retrieval based feature expansion using an external collection of unlabeled text can result in further effectiveness improvements.
Effective and efficient polarity estimation in blogs based on sentence-level evidence 365-374
  Jose M. Chenlo; David E. Losada
One of the core tasks in Opinion Mining consists of estimating the polarity of the opinionated documents found. In some scenarios (e.g. blogs), this estimation is severely affected by sentences that are off-topic or that simply do not express any opinion. In fact, the key sentiments in a blog post often appear in specific locations of the text. In this paper we propose several effective and robust polarity detection methods based on different sentence features. We show that we can successfully determine the polarity of documents guided by a sentence-level analysis that takes into account topicality and the location in the blog post of the subjective sentences. Our experimental results show that some of our proposed variants are both highly effective and computationally lightweight.
Sentiment classification based on supervised latent n-gram analysis 375-382
  Dmitriy Bespalov; Bing Bai; Yanjun Qi; Ali Shokoufandeh
In this paper, we propose an efficient embedding for modeling higher-order (n-gram) phrases that projects the n-grams to a low-dimensional latent semantic space, where a classification function can be defined. We utilize a deep neural network to build a unified discriminative framework that allows for estimating the parameters of the latent space as well as the classification function with a bias for the target classification task at hand. We apply the framework to a large-scale sentiment classification task. We present a comparative evaluation of the proposed method on two (large) benchmark data sets for online product reviews. The proposed method achieves superior performance in comparison to the state of the art.
Legal document clustering with built-in topic segmentation 383-392
  Qiang Lu; Jack G. Conrad; Khalid Al-Kofahi; William Keenan
Clustering is a useful tool for helping users navigate, summarize, and organize large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with various degrees of success. Some unique characteristics of legal content as well as the nature of the legal domain present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field that makes the quality (e.g., in terms of both recall and precision) a key differentiator of provided services. This paper introduces a classification-based recursive soft clustering algorithm with built-in topic segmentation. The algorithm integrates existing legal document metadata such as topical classifications, document citations, and click stream data from user behavior databases into a comprehensive clustering framework. Techniques associated with the algorithm have been applied successfully to very large databases of legal documents, which include judicial opinions, statutes, regulations, administrative materials and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the proposed algorithm. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the clusters produced by this algorithm is similar to that of clusters created by domain experts.

Results in context

What and how children search on the web 393-402
  Sergio Duarte Torres; Ingmar Weber
The Internet has become an important part of the daily life of children as a source of information and leisure activities. Nonetheless, given that most of the content available on the web is aimed at the general public, children are constantly exposed to inappropriate content, either because the language goes beyond their reading skills, because their attention span differs from that of grown-ups, or simply because the content is not targeted at children, as is the case with ads and adult content. In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of users ranging from children aged 6 to young adults aged 18. Concretely we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. For this purpose, we quantified their search difficulties based on query metrics (e.g. fraction of queries posed in natural language), session metrics (e.g. fraction of abandoned sessions) and click activity (e.g. fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely we looked for changes in the user interests (e.g. distribution of topics searched), language development (e.g. readability of the content accessed) and cognitive development (e.g. sentiment expressed in the queries) among children and adults. We observed that these metrics clearly demonstrate an increased level of confusion and unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and the demographic characteristics of the users such as age and average educational attainment of the zone in which the user is located.
Personalizing web search results by reading level 403-412
  Kevyn Collins-Thompson; Paul N. Bennett; Ryen W. White; Sebastian de la Chica; David Sontag
Traditionally, search engines have ignored the reading difficulty of documents and the reading proficiency of users in computing a document ranking. This is one reason why Web search engines do a poor job of serving an important segment of the population: children. While there are many important problems in interface design, content filtering, and results presentation related to addressing children's search needs, perhaps the most fundamental challenge is simply that of providing relevant results at the right level of reading difficulty. At the opposite end of the proficiency spectrum, it may also be valuable for technical users to find more advanced material or to filter out material at lower levels of difficulty, such as tutorials and introductory texts. We show how reading level can provide a valuable new relevance signal for both general and personalized Web search. We describe models and algorithms to address the three key problems in improving relevance for search using reading difficulty: estimating user proficiency, estimating result difficulty, and re-ranking based on the difference between user and result reading level profiles. We evaluate our methods on a large volume of Web query traffic and provide a large-scale log analysis that highlights the importance of finding results at an appropriate reading level for the user.
Location-aware click prediction in mobile local search 413-422
  Dimitrios Lymberopoulos; Peixiang Zhao; Christian Konig; Klaus Berberich; Jie Liu
Users increasingly rely on their mobile devices to search, locate and discover places and activities around them while on the go. Their decision process is driven by the information displayed on their devices and their current context (e.g. traffic, driving or walking etc.). Even though recent research efforts have already examined and demonstrated how different context parameters such as weather, time and personal preferences affect the way mobile users click on local businesses, little has been done to study how the location of the user affects the click behavior. In this paper we follow a data-driven methodology where we analyze approximately 2 million local search queries submitted by users across the US, to visualize and quantify how differently mobile users click across locations. Based on the data analysis, we propose new location-aware features for improving local search click prediction and quantify their performance on real user query traces. Motivated by the results, we implement and evaluate a data-driven technique where local search models at different levels of location granularity (e.g. city, state, and country levels) are combined together at run-time to further improve click prediction accuracy. By applying the location-aware features and the multiple models at different levels of location granularity on real user query streams from a major, commercially available search engine, we achieve anywhere from 5% to 47% higher Precision than a single click prediction model across the US can achieve.
Text vs. space: efficient geo-search query processing 423-432
  Maria Christoforaki; Jinru He; Constantinos Dimopoulos; Alexander Markowetz; Torsten Suel
Many web search services allow users to constrain text queries to a geographic location (e.g., yoga classes near Santa Monica). Important examples include local search engines such as Google Local and location-based search services for smart phones. Several research groups have studied the efficient execution of queries mixing text and geography; their approaches usually combine inverted lists with a spatial access method such as an R-tree or space-filling curve. In this paper, we take a fresh look at this problem. We feel that previous work has often focused on the spatial aspect at the expense of performance considerations in text processing, such as inverted index access, compression, and caching. We describe new and existing approaches and discuss their different perspectives. We then compare their performance in extensive experiments on large document collections. Our results indicate that a query processor that combines state-of-the-art text processing techniques with a simple coarse-grained spatial structure can outperform existing approaches by up to two orders of magnitude. In fact, even a naive approach that first uses a simple inverted index and then filters out any documents outside the query range outperforms many previous methods.
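The naive baseline mentioned at the end of this abstract (look up the text terms in a simple inverted index, then filter out documents outside the query range) can be illustrated with a minimal sketch. The function names, the bounding-box representation of the query range, and the toy documents are assumptions made for illustration; the sketch omits the compression and caching techniques the abstract refers to and is not the authors' implementation.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, (text, _location) in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def geo_search(index, docs, query_terms, box):
    """Conjunctive text match first, then drop documents outside the box.

    box is (min_lat, min_lon, max_lat, max_lon); this rectangle is an
    assumed, simplified stand-in for the query range.
    """
    postings = [index.get(term.lower(), set()) for term in query_terms]
    if not postings:
        return []
    candidates = set.intersection(*postings)
    min_lat, min_lon, max_lat, max_lon = box
    return sorted(
        doc_id for doc_id in candidates
        if min_lat <= docs[doc_id][1][0] <= max_lat
        and min_lon <= docs[doc_id][1][1] <= max_lon
    )


# Hypothetical documents: id -> (text, (latitude, longitude)).
docs = {
    1: ("yoga classes for beginners", (34.02, -118.49)),
    2: ("yoga classes downtown", (40.71, -74.01)),
    3: ("pilates classes", (34.01, -118.50)),
}
index = build_inverted_index(docs)
result = geo_search(index, docs, ["yoga", "classes"], (33.9, -118.6, 34.1, -118.4))
```

Here the text step keeps documents 1 and 2, and the spatial filter then removes document 2, which lies outside the box.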

Algorithms

One is enough: distributed filtering for duplicate elimination 433-442
  Georgia Koloniari; Nikos Ntarmos; Evaggelia Pitoura; Dimitris Souravlias
The growth of online services has created the need for duplicate elimination in high-volume streams of events. The sheer volume of data in applications such as pay-per-click clickstream processing, RSS feed syndication and notification services in social sites such as Twitter and Facebook makes traditional centralized solutions hard to scale. In this paper, we propose an approach based on distributed filtering. To this end, we introduce a suite of distributed Bloom filters that exploit different ways of partitioning the event space. To address the continuous nature of event delivery, the filters are extended to support sliding window semantics. Moreover, we examine locality-related tradeoffs and propose a tree-based architecture to allow for duplicate elimination across geographic locations. We cast the design space and present experimental results that demonstrate the pros and cons of our various solutions in different settings.
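As a rough illustration of the general idea of partitioning the event space across several Bloom filters, the sketch below routes each event id to one filter by hashing. The class names, the hash-based partitioning rule, and the filter sizes are assumptions chosen for brevity; the sliding-window extension and the tree-based architecture described in the abstract are not modeled.

```python
import hashlib


class BloomFilter:
    """A basic Bloom filter; one byte per bit is used for readability."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class PartitionedDeduplicator:
    """Routes each event id to one filter by hash (an assumed partitioning rule)."""

    def __init__(self, num_nodes=4):
        self.filters = [BloomFilter() for _ in range(num_nodes)]

    def _node_for(self, event_id):
        digest = hashlib.sha256(event_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.filters)

    def is_new(self, event_id):
        bloom = self.filters[self._node_for(event_id)]
        if event_id in bloom:
            return False  # probably seen before; false positives are possible
        bloom.add(event_id)
        return True


dedup = PartitionedDeduplicator()
events = ["click-1", "click-2", "click-1", "click-3", "click-2"]
accepted = [e for e in events if dedup.is_new(e)]
```

Because a Bloom filter has no false negatives, a repeated event id is never accepted twice; the cost is that a small fraction of genuinely new events may be rejected as duplicates.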
Duplicate detection through structure optimization 443-452
  Luís Leitão; Pável Calado
Detecting and eliminating duplicates in databases is a task of critical importance in many applications. Although solutions for traditional models, such as relational data, have been widely studied, recently there has been some focus on solutions for more complex hierarchical structures as, for instance, XML data. Such data presents many different challenges, among which is the issue of how to exploit the schema structure to determine if two objects are duplicates. In this paper, we argue that structure can indeed have a significant impact on the process of duplicate detection. We propose a novel method that automatically restructures database objects in order to take full advantage of the relations between its attributes. This new structure reflects the relative importance of the attributes in the database and avoids the need to perform a manual selection. To test our approach we applied it to an existing duplicate detection system. Experiments performed on several datasets show that, using the new learned structure, we consistently outperform both the results obtained with the original database structure and those obtained by letting a knowledgeable user manually choose the attributes to compare.
SISP: a new framework for searching the informative subgraph based on PSO 453-462
  Chen Chen; Guoren Wang; Huilin Liu; Junchang Xin; Ye Yuan
A significant number of graph applications require the key relations among a group of query nodes. Given a relational graph such as a social network or a biochemical interaction network, there is a pressing need for an informative subgraph that best explains the relationships among a group of given query nodes. Based on Particle Swarm Optimization (PSO), a new framework, SISP (Searching the Informative Subgraph based on PSO), is proposed. SISP contains three key stages. In the initialization stage, a random spreading method is proposed, which can effectively guarantee the connectivity of the nodes in each particle; in the fitness calculation stage, a fitness function is designed by incorporating a sign function with the goodness score; in the update stage, the intersection-based particle extension method and rule-based particle compression method are proposed. To evaluate the quality of returned subgraphs, the appropriate calculation of the goodness score is studied. Considering the importance and relevance of a node together, we present the PNR method, which makes the definition of informativeness more reliable and the returned subgraph more satisfying. Finally, we present experiments on a real dataset and a synthetic dataset separately. The experimental results confirm that the proposed methods achieve increased accuracy and are efficient for any query set.
Indexes for highly repetitive document collections 463-468
  Francisco Claude; Antonio Fariña; Miguel A. Martínez-Prieto; Gonzalo Navarro
We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection.
   We also introduce compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
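To make the contrast between plain gap encoding and run-length compression of the differential lists concrete, here is a small sketch. It assumes a toy posting list in which a term occurs in many consecutive document versions, so the gap sequence contains long runs of 1; the function names are illustrative, and the Lempel-Ziv and grammar-based variants mentioned in the abstract are not shown.

```python
def gap_encode(postings):
    """Differential (gap) encoding: store each id as its distance from the previous id."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def decode(runs):
    """Rebuild the original posting list from run-length encoded gaps."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the term appears in versions 3-8 and 20-23.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = gap_encode(postings)
runs = run_length_encode(gaps)
```

In this toy case ten gaps collapse into four runs, which hints at why repetitive (versioned) collections benefit from compressing the gap sequence itself rather than encoding each gap independently.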
Partial duplicate detection for large book collections 469-474
  Ismet Zeki Yalniz; Ethem F. Can; R. Manmatha
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as "unique words" and they constitute a small percentage of all the words in a typical book. Along with the order information the set of unique words provides a compact representation which is highly descriptive of the content and the flow of ideas in the book. By aligning the sequence of unique words from two books using the longest common subsequence (LCS) one can discover whether two books are duplicates. Experiments on several datasets show that DUPNIQ is more accurate than traditional methods for duplicate detection such as shingling and is fast. On a collection of 100K scanned English books DUPNIQ detects partial duplicates in 30 min using 350 cores and has precision 0.996 and recall 0.833 compared to shingling with precision 0.992 and recall 0.720. The technique works on other languages as well and is demonstrated for a French dataset.
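The representation described in this abstract (the ordered sequence of words that occur exactly once in a book, aligned across two books with the longest common subsequence) can be sketched as follows. The whitespace tokenization and the normalization of the LCS length by the shorter sequence are assumptions made for illustration; the abstract does not state how the alignment is turned into a duplicate decision.

```python
from collections import Counter


def unique_word_sequence(text):
    """Words that occur exactly once in the text, kept in order of appearance."""
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length(a, b):
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(text_a, text_b):
    """LCS of the unique-word sequences, normalized by the shorter one (assumed scoring)."""
    ua, ub = unique_word_sequence(text_a), unique_word_sequence(text_b)
    if not ua or not ub:
        return 0.0
    return lcs_length(ua, ub) / min(len(ua), len(ub))


book_a = "the whale surfaced near the ship at dawn"
book_b = "the whale surfaced near the vessel at dawn"  # one word differs, as an OCR error might cause
book_c = "a storm delayed every train"
score_ab = overlap_score(book_a, book_b)
score_ac = overlap_score(book_a, book_c)
```

A single differing word only removes one element from the alignment, which is why this kind of representation tolerates scattered OCR errors.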

Image retrieval

This image smells good: effects of image information scent in search engine results pages 475-484
  Faidon Loumakis; Simone Stumpf; David Grayson
Users are confronted with an overwhelming amount of web pages when they look for information on the Internet. Current search engines already aid the user in their information seeking tasks by providing textual results but adding images to results pages could further help the user in judging the relevance of a result. We investigated this problem from an Information Foraging perspective and we report on two empirical studies that focused on the information scent of images. Our results show that images have their own distinct "smell" which is not as strong as that of text. We also found that combining images and text cues leads to a stronger overall scent. Surprisingly, when images were added to search engine results pages, this did not lead our participants to behave significantly differently in terms of effectiveness or efficiency. Even when we added images that could confuse the participants' scent, this had no significantly detrimental impact on their behaviour. However, participants expressed a preference for results pages which included images. We discuss potential challenges and point to future research to ensure the success of adding images to textual results in search engine results pages.
Retrieving and ranking unannotated images through collaboratively mining online search results 485-494
  Songhua Xu; Hao Jiang; Francis Chi-Moon Lau
We present a new image search and ranking algorithm for retrieving unannotated images by collaboratively mining online search results which consist of online image and text search results. The online image search results are leveraged as reference examples to perform content-based image search over unannotated images. The online text search results are utilized to estimate the reference images' relevance to the search query. The key feature of our method is its capability to deal with unreliable online image search results through jointly mining visual and textual aspects of online search results. Through such collaborative mining, our algorithm infers the relevance of an online search result image to a text query. Once we obtain the estimate of query relevance score for each online image search result, we can selectively use query specific online search result images as reference examples for retrieving and ranking unannotated images. We tested our algorithm both on the standard public image datasets and several modestly sized personal photo collections. We also compared our method with two well-known peer methods. The results indicate that our algorithm is superior to existing content-based image search algorithms for retrieving and ranking unannotated images.
Adaptive parallel approximate similarity search for responsive multimedia retrieval 495-504
  George Teodoro; Eduardo Valle; Nathan Mariano; Ricardo Torres; Wagner Meira, Jr.
This paper introduces Hypercurves, a flexible framework for providing similarity search indexing to high throughput multimedia services. Hypercurves efficiently and effectively answers k-nearest neighbor searches on multigigabyte high-dimensional databases. It supports massively parallel processing and adapts its parallelization regimens at runtime to keep answer times optimal for both low and high demands. In order to achieve its goals, Hypercurves introduces new techniques for selecting parallelism configurations and allocating threads to computation cores, including hyperthreaded cores. Its efficiency gains are thoroughly validated on a large database of multimedia descriptors, where it presented near linear speedups and superlinear scaleups. The adaptation reduces query response times by 43% and 74% on the two platforms tested, when compared to the best static parallelism regimens.
A linear-time approximation of the earth mover's distance 505-514
  Min-Hee Jang; Sang-Wook Kim; Christos Faloutsos; Sunju Park
Color descriptors are one of the important features used in content-based image retrieval. The dominant color descriptor (DCD) represents a few perceptually dominant colors in an image through color quantization. For image retrieval based on DCD, the earth mover's distance and the optimal color composition distance have been proposed to measure the dissimilarity between two images. Although they provide good retrieval results, both methods are too time-consuming to be used in a large image database. To solve the problem, we propose a new distance function that calculates an approximate earth mover's distance in linear time. To calculate the dissimilarity in linear time, the proposed approach employs a space-filling curve for the multidimensional color space. To improve the accuracy, the proposed approach uses multiple curves and adjusts the color positions. As a result, our approach achieves order-of-magnitude time improvement but incurs small errors. We have performed extensive experiments to show the effectiveness and efficiency of the proposed approach. The results reveal that our approach achieves almost the same results as the EMD in linear time.
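The core idea of mapping colors onto a one-dimensional curve, where the earth mover's distance reduces to a single sweep over cumulative weight differences, can be sketched as below. This sketch substitutes a Z-order (Morton) curve for whichever space-filling curve the paper uses, relies on a single curve, and sorts the positions first (so the sweep is linear only once positions are ordered); the multiple-curve and position-adjustment refinements from the abstract are not reproduced, and all names are illustrative.

```python
def morton_index(r, g, b, bits=4):
    """Interleave the top `bits` bits of each 8-bit channel into one Z-order key."""
    r, g, b = r >> (8 - bits), g >> (8 - bits), b >> (8 - bits)
    key = 0
    for i in range(bits):
        key |= ((r >> i) & 1) << (3 * i + 2)
        key |= ((g >> i) & 1) << (3 * i + 1)
        key |= ((b >> i) & 1) << (3 * i)
    return key


def curve_histogram(dominant_colors):
    """Turn [((r, g, b), weight), ...] into {curve position: weight}."""
    hist = {}
    for (r, g, b), weight in dominant_colors:
        key = morton_index(r, g, b)
        hist[key] = hist.get(key, 0.0) + weight
    return hist


def emd_1d(hist_a, hist_b):
    """1-D earth mover's distance between two {position: weight} maps of equal total weight.

    One sweep over the sorted positions accumulates the absolute difference
    of the two cumulative distributions times the gap to the next position.
    """
    positions = sorted(set(hist_a) | set(hist_b))
    total, carried = 0.0, 0.0
    for left, right in zip(positions, positions[1:]):
        carried += hist_a.get(left, 0.0) - hist_b.get(left, 0.0)
        total += abs(carried) * (right - left)
    return total


# Hypothetical dominant color descriptors: (color, fraction of the image).
image_a = [((255, 0, 0), 0.6), ((0, 0, 255), 0.4)]
image_b = [((250, 10, 5), 0.6), ((0, 0, 255), 0.4)]
distance = emd_1d(curve_histogram(image_a), curve_histogram(image_b))
```

Distance along the curve is only a proxy for distance in color space, which is one source of the approximation error the abstract mentions.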

Social media

Towards a framework for attribute retrieval 515-524
  Arlind Kopliku; Mohand Boughanem; Karen Pinel-Sauvagnat
In this paper, we propose an attribute retrieval approach which extracts and ranks attributes from HTML tables. We distinguish between class attribute retrieval and instance attribute retrieval. On one hand, given an instance (e.g. University of Strathclyde) we retrieve from the Web its attributes (e.g. principal, location, number of students). On the other hand, given a class (e.g. universities) represented by a set of instances, we retrieve common attributes of its instances. Furthermore, we show we can reinforce instance attribute retrieval if similar instances are available. Our approach uses HTML tables which are probably the largest source for attribute retrieval. Three recall oriented filters are applied over tables to check the following three properties: (i) is the table relational, (ii) has the table a header, and (iii) the conformity of its attributes and values. Candidate attributes are extracted from tables and ranked with a combination of relevance features. Our approach is shown to have a high recall and a reasonable precision. Moreover, it outperforms state of the art techniques.
Building directories for social tagging systems 525-534
  Denis Helic; Markus Strohmaier
Today, a number of algorithms exist for constructing tag hierarchies from social tagging data. While these algorithms were designed with ontological goals in mind, we know very little about their properties from an information retrieval perspective, such as whether these tag hierarchies support efficient navigation in social tagging systems. The aim of this paper is to investigate the usefulness of such tag hierarchies (sometimes also called folksonomies -- from folk-generated taxonomy) as directories that aid navigation in social tagging systems. To this end, we simulate navigation of directories as decentralized search on a network of tags using Kleinberg's model. In this model, a tag hierarchy can be applied as background knowledge for decentralized search. By constraining the visibility of nodes in the directories we aim to mimic typical constraints imposed by a practical user interface (UI), such as limiting the number of displayed subcategories or related categories. Our experiments on five different social tagging datasets show that existing tag hierarchy algorithms can support navigation in theory, but our results also demonstrate that they face tremendous challenges when user interface (UI) restrictions are taken into account. Based on this observation, we introduce a new algorithm that constructs efficiently navigable directories on our datasets. The results are relevant for engineers and scientists aiming to improve navigability of social tagging systems.
Workload-aware indexing for keyword search in social networks 535-544
  Truls A. Bjørklund; Michaela Götz; Johannes Gehrke; Nils Grimsmo
More and more data is accumulated inside social networks. Keyword search provides a simple interface for exploring this content. However, a lot of the content is private, and a search system must enforce the privacy settings of the social network. In this paper, we present a workload-aware keyword search system with access control based on a social network. We make two technical contributions: (1) HeapUnion, a novel union operator that improves processing of search queries with access control by up to a factor of two compared to the best previous solution; and (2) highly accurate cost models that vary in sophistication and accuracy; these cost models provide input to an optimization algorithm that selects the most efficient organization of access control meta-data for a given workload. Our experimental results with real and synthetic data show that our approach outperforms previous work by up to a factor of three.
Effective retrieval of resources in folksonomies using a new tag similarity measure 545-550
  Giovanni Quattrone; Licia Capra; Pasquale De Meo; Emilio Ferrara; Domenico Ursino
Social (or folksonomic) tagging has become a very popular way to describe content within Web 2.0 websites. However, as tags are informally defined, continually changing, and ungoverned, it has often been criticised for lowering, rather than increasing, the efficiency of searching. To address this issue, a variety of approaches have been proposed that recommend to users which tags to use, both when labeling and when looking for resources. These techniques work well in dense folksonomies, but they fail to do so when tag usage exhibits a power law distribution, as often happens in real-life folksonomies. To tackle this issue, we propose an approach that induces the creation of a dense folksonomy, in a fully automatic and transparent way: when users label resources, an innovative tag similarity metric is deployed, so as to enrich the chosen tag set with related tags already present in the folksonomy. The proposed metric, which represents the core of our approach, is based on the mutual reinforcement principle. Our experimental evaluation shows that the accuracy and coverage of searches guaranteed by our metric are higher than those achieved by applying classical metrics.
Content-driven detection of campaigns in social media 551-556
  Kyumin Lee; James Caverlee; Zhiyuan Cheng; Daniel Z. Sui
We study the problem of detecting coordinated free text campaigns in large-scale social media. These campaigns -- ranging from coordinated spam messages to promotional and advertising campaigns to political astro-turfing -- are growing in significance and reach with the commensurate rise of massive-scale social systems. Often linked by common "talking points", there has been little research in detecting these campaigns. Hence, we propose and evaluate a content-driven framework for effectively linking free text posts with common "talking points" and extracting campaigns from large-scale social media. One of the salient aspects of the framework is an investigation of graph mining techniques for isolating coherent campaigns from large message-based graphs. Through an experimental study over millions of Twitter messages we identify five major types of campaigns -- Spam, Promotion, Template, News, and Celebrity campaigns -- and we show how these campaigns may be extracted with high precision and recall.
Exploring categorization property of social annotations for information retrieval 557-562
  Peng Li; Bin Wang; Wei Jin; Jian-Yun Nie; Zhiwei Shi; Ben He
User generated social annotations provide extra information for describing document contents. In this paper, we propose an effective method to model the categorization property of social annotations and explore the potential of combining it with classical language models for improving retrieval performance. Specifically, a novel TR-LDA model is presented to take annotations as an additional source for generating document contents apart from the document itself. We provide strategies for representing and weighting the categorization property and develop an efficient inference algorithm, where space saving is taken into account. Experiments are carried out on synthetic datasets, where documents and queries come from the standard evaluation conference TREC and annotations come from the website Delicious.com. Our results demonstrate the effectiveness of the proposed method on the ad-hoc retrieval task, where it significantly outperforms state-of-the-art baselines.

Personalization and advertising

Context-aware search personalization with concept preference 563-572
  Di Jiang; Kenneth Wai-Ting Leung; Wilfred Ng
As the size of the web is growing rapidly, a well-recognized challenge for developing web search engines is to optimize the search results towards each user's preference. In this paper, we propose and develop a new personalization framework that captures the user's preference in the form of concepts obtained by mining web search contexts. The search context consists of both the user's clickthroughs and query reformulations that satisfy some specific information need, and it is able to provide more information than each individual query in a search session. We also propose a method that discovers search contexts in one pass over the raw search query log. Using the information of the search context, we develop eight strategies that derive conceptual preference judgments. A learning-to-rank approach is employed to combine the derived preference judgments and then a Context-Aware User Profile (CAUP) is created. We further employ CAUP to adapt a personalized ranking function. Experimental results demonstrate that our approach captures accurate and comprehensive user preferences and, in terms of Top-N result quality, outperforms existing concept-based personalization approaches that do not use search contexts.
A framework for personalized and collaborative clustering of search results 573-582
  David C. Anastasiu; Byron J. Gao; David Buttler
How to organize and present search results plays a critical role in the utility of search engines. Due to the unprecedented scale of the Web and diversity of search results, the common strategy of ranked lists has become increasingly inadequate, and clustering has been considered as a promising alternative. Clustering divides a long list of disparate search results into a few topic-coherent clusters, allowing the user to quickly locate relevant results by topic navigation. While many clustering algorithms have been proposed that innovate on the automatic clustering procedure, we introduce ClusteringWiki, the first prototype and framework for personalized clustering that allows direct user editing of the clustering results. Through a Wiki interface, the user can edit and annotate the membership, structure and labels of clusters for a personalized presentation. In addition, the edits and annotations can be shared among users as a mass-collaborative way of improving search result organization and search engine utility.
Using query log and social tagging to refine queries based on latent topics 583-592
  Lidong Bing; Wai Lam; Tak-Lam Wong
An important way to improve users' satisfaction in Web search is to assist them to issue more effective queries. One such approach is query refinement (reformulation), which generates new queries according to the current query issued by users. A common procedure for conducting refinement is to generate some candidate queries first, and then a scoring method is designed to assess the quality of these candidates. Currently, most of the existing methods are context based. They rely heavily on the context relation of terms in the historical queries, and cannot detect and maintain the semantic consistency of queries. In this paper, we propose a graphical model to score queries. The proposed model exploits a latent topic space, which is automatically derived from the query log, to assess the semantic dependency of terms in a query. In the graphical model, both term context dependency and topic context dependency are considered. This also makes it feasible to score some queries which do not have much available historical term context information. We also utilize social tagging data in the candidate query generation process. Based on the observation that different users may tag the same resource with different tags of similar meaning, we propose a method to mine these term pairs for new candidate query construction.
Retrieval models for audience selection in display advertising BIBAFull-Text 593-598
  Sarah K. Tyler; Sandeep Pandey; Evgeniy Gabrilovich; Vanja Josifovski
Web applications often rely on user profiles of observed user actions, such as queries issued, page views, etc. In audience selection for display advertising, the audience that is likely to be responsive to a given ad campaign is identified via such profiles. We formalize the audience selection problem as a ranked retrieval task over an index of known users. We focus on the common case of audience selection where a small seed set of users who have previously responded positively to the campaign is used to identify a broader target audience. The actions of the users in the seed set are aggregated to construct a query, which is then executed against an index of other user profiles to retrieve the highest scoring profiles. We validate our approach on a real-world dataset, demonstrating the trade-offs of different user and query models and that our approach is particularly robust for small campaigns. The proposed user modeling framework is applicable to many other applications requiring user profiles such as content suggestion and personalization.
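The seed-set retrieval step described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: profiles are modeled as plain action-count dictionaries, the query is a simple sum of seed profiles, and cosine similarity stands in for whichever user and query models the paper actually compares. The function names are hypothetical.

```python
import math
from collections import Counter


def build_seed_query(seed_profiles):
    """Aggregate the action counts of all seed users into one weighted query."""
    query = Counter()
    for profile in seed_profiles:
        query.update(profile)  # adds counts action by action
    return query


def cosine(a, b):
    """Cosine similarity between two sparse action-count vectors."""
    dot = sum(w * b.get(action, 0) for action, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_audience(seed_profiles, candidates, k):
    """Rank candidate users against the seed query and return the top k ids."""
    query = build_seed_query(seed_profiles)
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [user for user, _ in ranked[:k]]


# Hypothetical toy data: two seed users and three candidate profiles.
seed = [{"q:shoes": 2, "view:sports": 1}, {"q:shoes": 1}]
candidates = {
    "u1": {"q:shoes": 3},
    "u2": {"q:cars": 5},
    "u3": {"view:sports": 1, "q:shoes": 1},
}
print(select_audience(seed, candidates, k=2))
```

A production system would replace the linear scan with an inverted index over user profiles; the scan is kept here only to keep the sketch short.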
A language model approach to capture commercial intent and information relevance for sponsored search BIBAFull-Text 599-604
  Lei Wang; Mingjiang Ye; Yu Zou
A fundamental task of sponsored search is how to find the best match between web search queries and textual advertisements. To address this problem, we explicitly characterize the criteria for an advertisement to be a 'good match' to a query from two aspects: it should be relevant to the query from an informational perspective, and it should capture and satisfy the commercial intent in the query. Correspondingly, we introduce in this paper a mixture language model of two parts: a commercial model, which characterizes the language bias of commercial intent by leveraging users' clicks on advertisements, and an informational model, which is a traditional language model that considers the entropy of each word to capture informational relevance. We then introduce a regularized expectation-maximization (EM) algorithm for parameter estimation, and integrate query commercial intent into the scoring function to boost overall click efficiency. Empirical evaluation shows that our model achieves better performance than a well-tuned classical language model and a deliberated TFIDF-pLSI model (6% and 5% precision improvement at our operating point in production environment of 30% recall, and 5.3% and 6.3% AUC improvement), and outperforms the KL Divergence language model for tail queries (0.5% nDCG improvement). A live traffic test also shows over 2% CTR lift and 2.5% RPS lift.
Learning to rank audience for behavioral targeting in display ads BIBAFull-Text 605-610
  Jian Tang; Ning Liu; Jun Yan; Yelong Shen; Shaodan Guo; Bin Gao; Shuicheng Yan; Ming Zhang
Behavioral targeting (BT), which aims to sell advertisers those behaviorally related user segments to deliver their advertisements, is facing a bottleneck in serving the rapid growth of long tail advertisers. Due to the small business nature of the tail advertisers, they generally expect to accurately reach a small audience, which is hard to satisfy with classical BT solutions that use large user segments. In this paper, we propose a novel probabilistic generative model named Rank Latent Dirichlet Allocation (RANKLDA) to rank audience according to their ads click probabilities for the long tail advertisers to deliver their ads. Based on the basic assumption that users who clicked the same group of ads will have a higher probability of sharing similar latent search topical interests, RANKLDA combines topic discovery from users' search behaviors and learning to rank users from their ads click behaviors together. In computation, the topic learning could be enhanced by the supervised information of the rank learning and simultaneously, the rank learning could be better optimized by considering the discovered topics as features. In this co-optimization scheme, the two components enhance each other iteratively. Experiments over the real click-through log of display ads in a public ad network show that the proposed RANKLDA model can effectively rank the audience for the tail advertisers.

Evaluation and analysis

Simulating simple user behavior for system effectiveness evaluation BIBAFull-Text 611-620
  Ben Carterette; Evangelos Kanoulas; Emine Yilmaz
Information retrieval effectiveness evaluation typically takes one of two forms: batch experiments based on static test collections, or lab studies measuring actual users interacting with a system. Test collection experiments are sometimes viewed as introducing too many simplifying assumptions to accurately predict the usefulness of a system to its users. As a result, there is great interest in creating test collections and measures that better model user behavior. One line of research involves developing measures that include a parameterized user model; choosing a parameter value simulates a particular type of user. We propose that these measures offer an opportunity to more accurately simulate the variance due to user behavior, and thus to analyze system effectiveness to a simulated user population. We introduce a Bayesian procedure for producing sampling distributions from click data, and show how to use statistical tools to quantify the effects of variance due to parameter selection.
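As a rough illustration of a measure with a parameterized user model, the sketch below uses rank-biased precision (RBP), whose persistence parameter p is the probability that a user moves on to the next result. RBP is chosen here only as a familiar example of this family; the abstract does not say which measures the paper studies. The Beta(5, 2) distribution over p is a placeholder assumption and not the paper's Bayesian procedure, which derives its distributions from click data.

```python
import random


def rbp(relevances, p):
    """Rank-biased precision: (1 - p) * sum_i r_i * p^(i-1), ranks from 1."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))


def simulate_population(relevances, n_users, seed=0):
    """Score one ranking for n simulated users, each with its own persistence.

    The Beta(5, 2) prior over p is an assumption made for this sketch only.
    """
    rng = random.Random(seed)
    return [rbp(relevances, rng.betavariate(5, 2)) for _ in range(n_users)]


ranking = [1, 0, 1, 0, 0]  # hypothetical binary relevance of the top 5 results
scores = simulate_population(ranking, n_users=1000)
mean = sum(scores) / len(scores)
variance = sum((s - mean) ** 2 for s in scores) / len(scores)
print(f"mean RBP = {mean:.3f}, variance over users = {variance:.4f}")
```

The spread of the scores, rather than a single value at a fixed p, is what allows a system to be analyzed against a simulated user population.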
Click the search button and be happy: evaluating direct and immediate information access BIBAFull-Text 621-630
  Tetsuya Sakai; Makoto P. Kato; Young-In Song
We define Direct Information Access as a type of information access where there is no user operation such as clicking or scrolling between the user's click on the search button and the user's information acquisition; we define Immediate Information Access as a type of information access where the user can locate the relevant information within the system output very quickly. Hence, a Direct and Immediate Information Access (DIIA) system is expected to satisfy the user's information need very quickly with its very first response. We propose a nugget-based evaluation framework for DIIA, which takes nugget positions into account in order to evaluate the ability of a system to present important nuggets first and to minimise the amount of text the user has to read. To demonstrate the integrity, usefulness and limitations of our framework, we built a Japanese DIIA test collection with 60 queries and over 2,800 nuggets as well as an offset-based nugget match evaluation interface, and conducted experiments with manual and automatic runs. The results suggest our proposal is a useful complement to traditional ranked retrieval evaluation based on document relevance.
Local computation of PageRank: the ranking side BIBAFull-Text 631-640
  Marco Bressan; Luca Pretto
Imagine you are a social network user who wants to search, in a list of potential candidates, for the best candidate for a job on the basis of their PageRank-induced importance ranking. Is it possible to compute this ranking for a low cost, by visiting only small subnetworks around the nodes that represent each candidate? The fundamental problem underpinning this question, i.e. computing locally the PageRank ranking of k nodes in an n-node graph, was first raised by Chen et al. (CIKM 2004) and then restated by Bar-Yossef and Mashiach (CIKM 2008). In this paper we formalize and provide the first analysis of the problem, proving that any local algorithm that computes a correct ranking must take into consideration Ω(√(kn)) nodes -- even when ranking the top k nodes of the graph, even if their PageRank scores are "well separated", and even if the algorithm is randomized (and we prove a stronger Ω(n) bound for deterministic algorithms). Experiments carried out on large, publicly available crawls of the web and of a social network show that also in practice the fraction of the graph to be visited to compute the ranking may be considerable, both for algorithms that are always correct and for algorithms that employ (efficient) local score approximations.
Prioritizing relevance judgments to improve the construction of IR test collections BIBAFull-Text 641-646
  Mehdi Hosseini; Ingemar J. Cox; Natasa Milic-Frayling; Trevor Sweeting; Vishwa Vinay
We consider the problem of optimally allocating a fixed budget to construct a test collection with associated relevance judgements, such that it can (i) accurately evaluate the relative performance of the participating systems, and (ii) generalize to new, previously unseen systems. We propose a two stage approach. For a given set of queries, we adopt the traditional pooling method and use a portion of the budget to evaluate a set of documents retrieved by the participating systems. Next, we analyze the relevance judgments to prioritize the queries and remaining pooled documents for further relevance assessments. The query prioritization is formulated as a convex optimization problem, thereby permitting efficient solution and providing a flexible framework to incorporate various constraints. Query-document pairs with the highest priority scores are evaluated using the remaining budget. We evaluate our resource optimization approach on the TREC 2004 Robust track collection. We demonstrate that our optimization techniques are cost efficient and yield a significant improvement in the reusability of the test collections.
Evaluating an associative browsing model for personal information BIBAFull-Text 647-652
  Jinyoung Kim; W. Bruce Croft; David Smith; Anton Bakalov
Recent studies suggest that associative browsing can be beneficial for personal information access. Associative browsing is intuitive for the user and complements other methods of accessing personal information, such as keyword search. In our previous work, we proposed an associative browsing model of personal information in which users can navigate through the space of documents and concepts (e.g., person names, events, etc.). Our approach differs from other systems in that it presented a ranked list of associations by combining multiple measures of similarity, whose weights are improved based on click feedback from the user.
   In this paper, we evaluate the associative browsing model we proposed in the context of a known-item finding task. We performed game-based user studies as well as a small scale instrumentation study using a prototype system that helped us to collect a large amount of usage data from the participants. Our evaluation results show that the associative browsing model can play an important role in known-item finding. We also found that the system can learn to improve suggestions for browsing with a small amount of click data.

Classification and evaluation

Semi-supervised SVMs for classification with unknown class proportions and a small labeled dataset BIBAFull-Text 653-662
  Sathiya Keerthi Selvaraj; Bigyan Bhar; Sundararajan Sellamanickam; Shirish Shevade
In the design of practical web page classification systems one often encounters a situation in which the labeled training set is created by choosing some examples from each class, but the class proportions in this set are not the same as those in the test distribution to which the classifier will be actually applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, viz., Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods to deal with this situation. We empirically show that when the labeled training data is small, TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; also, when this estimate is used, both TSVM and ER/EC give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.
A pairwise ranking based approach to learning with positive and unlabeled examples BIBAFull-Text 663-672
  Sundararajan Sellamanickam; Priyanka Garg; Sathiya Keerthi Selvaraj
A large fraction of binary classification problems arising in web applications are of the type where the positive class is well defined and compact while the negative class comprises everything else in the distribution for which the classifier is developed; it is hard to represent and sample from such a broad negative class. Classifiers based only on positive and unlabeled examples reduce human annotation effort significantly by removing the burden of choosing a representative set of negative examples. Various methods have been proposed in the literature for building such classifiers. Of these, the state of the art methods are Biased SVM and Elkan & Noto's methods. While these methods often work well in practice, they are computationally expensive since hyperparameter tuning is very important, particularly when the set of labeled positive examples is small and class imbalance is high. In this paper we propose a pairwise ranking based approach to learn from positive and unlabeled examples (LPU) and we give a theoretical justification for it. We present a pairwise RankSVM (RSVM) based method for our approach. The method is simple and efficient, and its hyperparameters are easy to tune. A detailed experimental study using several benchmark datasets shows that the proposed method gives competitive classification performance compared to the mentioned state of the art methods, while training 3-10 times faster. We also propose an efficient AUC based feature selection technique in the LPU setting and demonstrate its usefulness on the datasets. To get an idea of the goodness of the LPU methods we compare them against supervised learning (SL) methods that also make use of negative examples in training. SL methods give a slightly better performance than LPU methods when there is a rich set of negative examples; however, they are inferior when the number of negative training examples is not large enough.
Robust nonnegative matrix factorization using L21-norm BIBAFull-Text 673-682
  Deguang Kong; Chris Ding; Heng Huang
Nonnegative matrix factorization (NMF) is widely used in data mining and machine learning. However, many datasets contain noise and outliers, so a robust version of NMF is needed. In this paper, we propose a robust formulation of NMF using an L21 norm loss function. We also derive a computational algorithm with rigorous convergence analysis. Our robust NMF approach (1) can handle noise and outliers; (2) provides very efficient and elegant updating rules; and (3) incurs almost the same computational cost as standard NMF, and can thus potentially be used in more real-world application tasks. Experiments on 10 datasets show that the robust NMF provides more faithful basis factors and consistently better clustering results as compared to standard NMF.
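To see why an L21 loss is less sensitive to outliers than the squared Frobenius loss of standard NMF, the sketch below compares the two on a small residual matrix. It assumes the common convention that each column is one data point, so the L21 norm is the sum of the Euclidean norms of the columns; the abstract itself does not fix this orientation, and the residual values are invented for illustration. It does not implement the paper's updating rules.

```python
import math


def l21_norm(residual):
    """Sum over columns (assumed data points) of each column's Euclidean norm."""
    n_cols = len(residual[0])
    return sum(math.sqrt(sum(row[j] ** 2 for row in residual))
               for j in range(n_cols))


def frobenius_sq(residual):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(v * v for row in residual for v in row)


# Hypothetical residual X - FG: two well-fitted points and one outlier column.
residual = [
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 0.0],
]
outlier_share_l21 = 10.0 / l21_norm(residual)       # 10 / 12
outlier_share_fro = 100.0 / frobenius_sq(residual)  # 100 / 102
print(f"outlier share of L21 loss:       {outlier_share_l21:.2f}")
print(f"outlier share of Frobenius loss: {outlier_share_fro:.2f}")
```

Because the outlier's error enters the L21 loss unsquared, it accounts for a smaller share of the objective and pulls the factors less strongly.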
TAKES: a fast method to select features in the kernel space BIBAFull-Text 683-692
  Ye Xu; Furao Shen; Wei Ping; Jinxi Zhao
Feature selection is an effective tool to deal with the "curse of dimensionality". To cope with the non-separable problem, feature selection in the kernel space has been investigated. However, previous studies cannot adequately estimate the intrinsic dimensionality of the kernel space. Thus, it is difficult to accurately preserve the sketch of the kernel space using the learned basis, and the feature selection performance is affected. Moreover, the computational cost of these algorithms is at least cubic in the number of training data. In this paper, we propose a fast framework to conduct feature selection in the kernel space. By designing a fast kernel subspace learning method, we automatically learn the intrinsic dimensionality and construct an orthogonal basis set of the kernel space. The learned basis can accurately preserve the sketch of the kernel space. Then, backed by the constructed basis, we directly select features in the kernel space. The whole proposed framework has quadratic complexity in the number of training data, which is faster than existing kernel methods for feature selection. We evaluate our work on several typical datasets and find that it not only preserves the sketch of the kernel space more accurately but also achieves better classification performance compared with many state-of-the-art methods.
Designing an ensemble classifier over subspace classifiers using iterative convergence routine BIBAFull-Text 693-698
  Bhanukiran Vinzamuri; Kamalakar Karlapalem
There can be multiple classifiers for a given data set. One way to generate multiple classifiers is to use subspaces of the attribute sets. In this paper, we generate subspace classifiers by an iterative convergence routine to build an ensemble classifier. Experimental evaluation covers the cases of both labelled and unlabelled (blind) data separately. We evaluate our approach on many benchmark UC Irvine datasets to assess the robustness of our approach with varying induced noise levels. We explicitly compare and present the utility of the clusterings generated for classification using several diverse clustering dissimilarity metrics. Results show that our ensemble classifier is a more robust classifier in comparison to different multi-class classification approaches.

Information filtering

Bayesian latent variable models for collaborative item rating prediction BIBAFull-Text 699-708
  Morgan Harvey; Mark J. Carman; Ian Ruthven; Fabio Crestani
Collaborative filtering systems based on ratings make it easier for users to find content of interest on the Web and as such they constitute an area of much research. In this paper we first present a Bayesian latent variable model for rating prediction that models ratings over each user's latent interests and also each item's latent topics. We describe a Gibbs sampling procedure that can be used to estimate its parameters and show by experiment that it is competitive with the gradient descent SVD methods commonly used in state-of-the-art systems. We then proceed to make an important and novel extension to this model, enhancing it with user-dependent and item-dependent biases to significantly improve rating estimation. We show by experiment on a large set of real ratings data that these models are able to outperform 3 common baselines, including a very competitive and modern SVD-based model. Furthermore we illustrate other advantages of our approach beyond simply its ability to provide more accurate ratings and show that it is able to perform better on the common and important case where the user profile is short.
Timing when to buy BIBAFull-Text 709-718
  Rakesh Agrawal; Samuel Ieong; Raja Velu
Most e-commerce sites to date have focused on helping consumers decide what to buy and where to buy. We study the complementary question of helping consumers decide when to buy, focusing on consumer durables. We introduce a utility-based model for evaluating different approaches to this question. We focus on how best to make use of forecasts in making recommendations, and propose three natural strategies. We establish a relationship between these strategies, and show that one of them is optimal. We conduct a large-scale experimental study to test the performance and robustness of these strategies. Across a wide range of conditions, the best strategy obtains 90% of the maximum possible gains.
Assisting web search users by destination reachability BIBAFull-Text 719-728
  Chi-Hoon Lee; Alpa Jain; Larry Lai
Search engine users are increasingly performing complex tasks based on the simple keyword-in document-out paradigm. To assist users in accomplishing their tasks effectively, search engines provide query recommendations based on the user's current query. These are suggestions for follow-up queries given the user-provided query. A large number of techniques have been proposed in the past for mining such query recommendations, including mining past user sessions (e.g., sequences of queries within a specified window of time) to identify the most frequently occurring pairs, using click-through graphs (e.g., a bipartite graph of queries and the URLs on which users clicked), and ranking these suggestions using some form of frequency counts from past query logs. Given the limited number of queries that are offered (typically 5), it is important to rank them effectively. In this paper, we present a novel approach to ranking query recommendations which not only considers relevance to the original query but also takes into account the efficiency of a query at accomplishing the user's search task at hand. We formalize the notion of query efficiency and show how our objective function effectively captures this as determined by a human study and eliminates biases introduced by click-through based metrics. To compute this objective function, we present a pseudo-supervised learning technique where no explicit human experts are required to label samples. In addition, our techniques effectively characterize preferred URL destinations and project each query into a higher-dimensional space where each subspace represents user intent using these characteristics. Finally, we present an extensive evaluation of our proposed methods against production systems and show that our method increases task completion efficiency by 15%.
Modeling personalized email prioritization: classification-based and regression-based approaches BIBAFull-Text 729-738
  Shinjae Yoo; Yiming Yang; Jaime Carbonell
Email overload, even after spam filtering, presents a serious productivity challenge for busy professionals and executives. One solution is automated prioritization of incoming emails to ensure the most important are read and processed quickly, while others are processed later as/if time permits in declining priority levels. This paper presents a study of machine learning approaches to email prioritization into discrete levels, comparing ordinal regression versus classifier cascades. Given the ordinal nature of discrete email priority levels, SVM ordinal regression would be expected to perform well, but surprisingly a cascade of SVM classifiers significantly outperforms ordinal regression for email prioritization. In contrast, SVM regression performs well -- better than classifiers -- on selected UCI data sets. This unexpected performance inversion is analyzed and results are presented, providing core functionality for email prioritization systems.
Diversification and refinement in collaborative filtering recommender BIBAFull-Text 739-744
  Rubi Boim; Tova Milo; Slava Novgorodov
This paper considers a popular class of recommender systems that are based on Collaborative Filtering (CF) and proposes a novel technique for diversifying the recommendations that they give to users. Items are clustered based on a unique notion of priority-medoids that provides a natural balance between the need to present highly ranked items vs. highly diverse ones. Our solution estimates item diversity by comparing the rankings that different users gave to the items, thereby enabling diversification even in common scenarios where no semantic information on the items is available. It also provides a natural zoom-in mechanism to focus on items (clusters) of interest and recommend diversified similar items. We present DiRec, a plug-in that implements the above concepts and allows CF recommender systems to diversify their recommendations. We illustrate the operation of DiRec in the context of a movie recommendation system and present a thorough experimental study that demonstrates the effectiveness of our recommendation diversification technique and its superiority over previous solutions.

Topics and events

Emerging topic detection using dictionary learning BIBAFull-Text 745-754
  Shiva Prasad Kasiviswanathan; Prem Melville; Arindam Banerjee; Vikas Sindhwani
Streaming user-generated content in the form of blogs, microblogs, forums, and multimedia sharing sites provides a rich source of data from which invaluable information and insights may be gleaned. Given the vast volume of such social media data being continually generated, one of the challenges is to automatically tease apart the emerging topics of discussion from the constant background chatter. Such emerging topics can be identified by the appearance of multiple posts on a unique subject matter, which is distinct from previous online discourse. We address the problem of identifying emerging topics through the use of dictionary learning. We propose a two stage approach based, respectively, on detection and clustering of novel user-generated content. We derive a scalable approach by using the alternating directions method to solve the resulting optimization problems. Empirical results show that our proposed approach is more effective than several baselines in detecting emerging topics in traditional news story and newsgroup data. We also demonstrate the practical application to social media analysis, based on a study on streaming data from Twitter.
Focusing on novelty: a crawling strategy to build diverse language models BIBAFull-Text 755-764
  Luciano Barbosa; Srinivas Bangalore
Word prediction performed by language models has an important role in many tasks, such as word sense disambiguation, speech recognition, hand-writing recognition, query spelling and query segmentation. Recent research has exploited the textual content of the Web to create language models. In this paper, we propose a new focused crawling strategy to collect Web pages that focuses on novelty in order to create diverse language models. In each crawling cycle, the crawler tries to fill the gaps present in the current language model built from previous cycles, by avoiding visiting pages whose vocabulary is already well represented in the model. It relies on an information theoretic measure to identify these gaps and then learns link patterns to pages in these regions in order to guide its visitation policy. To handle constantly evolving domains, a key feature of our crawler approach is its ability to adjust its focus as the crawl progresses. We evaluate our approach in two different scenarios in which our solution can be useful. First, we demonstrate that our approach produces more effective language models than the ones created by a baseline crawler in the context of a speech recognition task of broadcast news. In fact, in some cases, our crawler was able to obtain similar results to the baseline by crawling only 12.5% of the pages collected by the latter. Secondly, since in the news domain avoiding well-represented content might lead to novelty, i.e. up-to-date pages, we show that our diversity-based crawler can also be helpful to guide the crawler for the most recent content in the news. The results show that our approach was able to obtain on average 50% more up-to-date pages than the baseline crawler.
Natural event summarization BIBAFull-Text 765-774
  Yexi Jiang; Chang-Shing Perng; Tao Li
Event mining is a useful way to understand computer system behaviors. The focus of recent work on event mining has shifted from discovering frequent patterns to event summarization. Event summarization seeks to provide a comprehensible explanation of the event sequence on certain aspects. Previous methods have several limitations, such as ignoring temporal information, generating the same set of boundaries for all event patterns, and providing a summary which is difficult for humans to understand. In this paper, we propose a novel framework called natural event summarization that summarizes an event sequence using inter-arrival histograms to capture the temporal relationship among events. Our framework uses the minimum description length principle to guide the process in order to balance between accuracy and brevity. Also, we use multi-resolution analysis for pruning the problem space. We demonstrate how the principles can be applied to generate summaries with periodic patterns and correlation patterns in the framework. Experimental results on synthetic and real data show that our method is capable of producing usable event summaries, and is robust to noise and scalable.
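An inter-arrival histogram, the basic representation named in this abstract, can be sketched in a few lines. The version below handles only a single event type and integer timestamps, both simplifying assumptions; the function name is hypothetical, and the paper's treatment of correlation patterns between different event types and its MDL-based segmentation are not shown.

```python
from collections import Counter


def inter_arrival_histogram(timestamps):
    """Count how often each gap between consecutive occurrences appears."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return Counter(gaps)


# Hypothetical log: an event firing every 5 time units, plus one stray firing.
occurrences = [0, 5, 10, 15, 17]
histogram = inter_arrival_histogram(occurrences)
for gap, count in sorted(histogram.items()):
    print(f"gap {gap}: {count} time(s)")
```

A histogram dominated by a single gap value suggests a periodic pattern, which is the kind of regularity a summary can describe compactly.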
Transferring topical knowledge from auxiliary long texts for short text clustering BIBAFull-Text 775-784
  Ou Jin; Nathan N. Liu; Kai Zhao; Yong Yu; Qiang Yang
With the rapid growth of social Web applications such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining techniques are designed to handle long text documents. For short text messages, many of the existing techniques are not effective due to the sparseness of text representations. To understand short messages, we observe that it is often possible to find topically related long texts, which can be utilized as the auxiliary data when mining the target short texts data. In this article, we present a novel approach to cluster short text messages via transfer learning from auxiliary long text data. We show that while some previous work exists that enhances short text clustering with related long texts, most of it ignores the semantic and topical inconsistencies between the target and auxiliary data and hurts the clustering performance. To accommodate the possible inconsistency between source and target data, we propose a novel topic model -- Dual Latent Dirichlet Allocation (DLDA) model, which jointly learns two sets of topics on short and long texts and couples the topic parameters to cope with the potential inconsistency between data sets. We demonstrate through large-scale clustering experiments on both advertisements and Twitter data that we can obtain superior performance over several state-of-the-art techniques for clustering short text documents.
LogSig: generating system events from raw textual logs BIBAFull-Text 785-794
  Liang Tang; Tao Li; Chang-Shing Perng
Modern computing systems generate large amounts of log data. System administrators or domain experts utilize the log data to understand and optimize system behaviors. Most system logs are raw, textual and unstructured. One fundamental challenge in automated log analysis is the generation of system events from raw textual logs. Log messages are relatively short text messages but may have a large vocabulary, which often results in poor performance when applying traditional text clustering techniques to the log data. Other related methods have various limitations and only work well for some particular system logs. In this paper, we propose a message signature based algorithm, logSig, to generate system events from textual log messages. By searching for the most representative message signatures, logSig categorizes log messages into a set of event types. logSig can handle various types of log data, and is able to incorporate human domain knowledge to achieve high performance. We conduct experiments on five real system log datasets. Experiments show that logSig outperforms other alternative algorithms in terms of overall performance.

Temporal, stream and spatial information

Coupling or decoupling for KNN search on road networks?: a hybrid framework on user query patterns BIBAFull-Text 795-804
  Ying-Ju Chen; Kun-Ta Chuang; Ming-Syan Chen
We explore in this paper a new KNN algorithm, called the SQUARE algorithm, for searching spatial objects on road networks. Recent works in the literature have discussed the need to support object updates in promising location-based services. Among them, the decoupling spatial search algorithms, which handle the network traversal and the object lookup separately, have been recognized as the most effective approach to cutting the maintenance overhead from updates. However, the queue-based network traversal needs to be performed from scratch for each KNN query until the KNN objects are exactly identified, so the query complexity is proportional to the number of visited network nodes. Query efficiency is a concern for online LBS applications, since they allow only lightweight operations in order to minimize query latency. To improve query scalability while supporting data updates, SQUARE constructs the network index in a way similar to that used in decoupling models, and meanwhile exploits the coupling idea to maintain the KNN information relative to hot regions in the network index. A hot region denotes an area with frequent queries discovered in the query history. Inspired by the prevalently observed 80-20 rule, SQUARE can maximize the query throughput by returning KNN results in quasi-constant time for the 80% of queries that are issued within roughly 20% of the area (hot regions). As validated in our experimental results, SQUARE outperforms previous works and achieves a significant performance improvement without a sacrifice in maintenance overhead for object updates.
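The queue-based network traversal that decoupling approaches run from scratch for every query can be sketched as a Dijkstra-style expansion. This is a generic baseline written for illustration, not the SQUARE algorithm: it assumes objects sit exactly on network nodes, keeps the object lookup in a separate dictionary from the graph, and uses hypothetical names throughout. The returned count of settled nodes makes the per-query cost visible.

```python
import heapq


def knn_network_expansion(graph, objects_at, source, k):
    """Expand the road network from `source` until k objects are found.

    graph:      node -> list of (neighbor, edge_length) pairs
    objects_at: node -> list of object ids located at that node (assumption)
    Returns (list of (object, network_distance), number of settled nodes).
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    settled = set()
    found = []
    while heap and len(found) < k:
        dist, node = heapq.heappop(heap)
        if node in settled:
            continue  # stale queue entry
        settled.add(node)
        # Nodes settle in nondecreasing distance, so objects arrive in order.
        for obj in objects_at.get(node, ()):
            found.append((obj, dist))
            if len(found) == k:
                break
        for neighbor, length in graph.get(node, ()):
            candidate = dist + length
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return found, len(settled)


# Hypothetical four-node road network with two objects.
roads = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 1), ("d", 5)],
    "c": [("a", 4), ("b", 1), ("d", 1)],
    "d": [("b", 5), ("c", 1)],
}
objects = {"c": ["cafe"], "d": ["fuel"]}
print(knn_network_expansion(roads, objects, "a", k=2))
```

Caching the answers of this expansion for frequently queried regions is the intuition behind coupling, at the price of having to refresh the cache when objects change.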
Toward traffic-driven location-based web search BIBAFull-Text 805-814
  Zhiyuan Cheng; James Caverlee; Krishna Yeswanth Kamath; Kyumin Lee
The emergence of location sharing services is rapidly accelerating the convergence of our online and offline activities. In one direction, Foursquare, Google Latitude, Facebook Places, and related services are enriching real-world venues with the social and semantic connections among online users. In analogy to how clickstreams have been successfully incorporated into traditional web ranking based on content and link analysis, we propose to mine traffic patterns revealed through location sharing services to augment traditional location-based search. Concretely, we study location-based traffic patterns revealed through location sharing services and find that these traffic patterns can identify semantically related locations. Based on this observation, we propose and evaluate a traffic-driven location clustering algorithm that can group semantically related locations with high confidence. Through experimental study of 12 million locations from Foursquare, we extend this result through supervised location categorization, wherein traffic patterns can be used to accurately predict the semantic category of uncategorized locations. Based on these results, we show how traffic-driven semantic organization of locations may be naturally incorporated into location-based web search.
CLUES: a unified framework supporting interactive exploration of density-based clusters in streams BIBAFull-Text 815-824
  Di Yang; Zhenyu Guo; Elke A. Rundensteiner; Matthew O. Ward
Although various mining algorithms have been proposed in the literature to efficiently compute clusters, few strides have been made to date in helping analysts to interactively explore such patterns in the stream context. We present a framework called CLUES to both computationally and visually support the process of real-time mining of density-based clusters. CLUES is composed of three major components. First, as foundation of CLUES, we develop an evolution model of density-based clusters in data streams that captures the complete spectrum of cluster evolution types across streaming windows. Second, to equip CLUES with the capability of efficiently tracking cluster evolution, we design a novel algorithm to piggy-back the evolution tracking process into the underlying cluster detection process. Third, CLUES organizes the detected clusters and their evolution interrelationships into a multidimensional pattern space -- presenting clusters at different time horizons and across different abstraction levels. It provides a rich set of visualization and interaction techniques to allow the analyst to explore this multi-dimensional pattern space in real-time. Our experimental evaluation, including performance studies and a user study, using real streams from ground group movement monitoring and from stock transaction domains confirm both the efficiency and effectiveness of our proposed CLUES framework.
e-NSP: efficient negative sequential pattern mining based on identified positive patterns without database rescanning BIBAFull-Text 825-830
  Xiangjun Dong; Zhigang Zheng; Longbing Cao; Yanchang Zhao; Chengqi Zhang; Jinjiu Li; Wei Wei; Yuming Ou
Mining Negative Sequential Patterns (NSP) is much more challenging than mining Positive Sequential Patterns (PSP) due to the high computational complexity and huge search space required in calculating Negative Sequential Candidates (NSC). Very few approaches are available for mining NSP, which mainly rely on re-scanning databases after identifying PSP. As a result, they are very inefficient. In this paper, we propose an efficient algorithm for mining NSP, called e-NSP, which mines for NSP by only involving the identified PSP, without re-scanning databases. First, negative containment is defined to determine whether or not a data sequence contains a negative sequence. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The supports of NSC are then calculated based only on the corresponding PSP. Finally, a simple but efficient approach is proposed to generate NSC. With e-NSP, mining NSP does not require additional database scans, and the existing PSP mining algorithms can be integrated into e-NSP to mine for NSP efficiently. e-NSP is compared with two currently available NSP mining algorithms on 14 synthetic and real-life datasets. Intensive experiments show that e-NSP takes as little as 3% of the runtime of the baseline approaches and is applicable for efficient mining of NSP in large datasets.
Optimising ontology stream reasoning with truth maintenance system BIBAFull-Text 831-836
  Yuan Ren; Jeff Z. Pan
So far, researchers in the Description Logics / Ontology communities have mainly considered ontology reasoning services for static ontologies. The rapid development of the Semantic Web and its emerging data call for reasoning technologies for dynamic knowledge streams. Existing work on stream reasoning is focused on lightweight languages such as RDF and RDFS. In this paper, we introduce the notion of an Ontology Stream Management System (OSMS) and present a stream-reasoning approach based on a Truth Maintenance System (TMS). We present an optimised EL++ algorithm to reduce memory consumption. Our evaluations show that the optimisation enables TMS-enabled EL++ reasoning to deal with relatively large volumes of data and to update efficiently.

Text mining

Harvesting facts from textual web sources by constrained label propagation BIBAFull-Text 837-846
  Yafang Wang; Bin Yang; Lizhen Qu; Marc Spaniol; Gerhard Weikum
There have been major advances on automatically constructing large knowledge bases by extracting relational facts from Web and text sources. However, the world is dynamic: periodic events like sports competitions need to be interpreted with their respective timepoints, and facts such as coaching a sports team, holding political or business positions, and even marriages do not hold forever and should be augmented by their respective timespans. This paper addresses the problem of automatically harvesting temporal facts with such extended time-awareness. We employ pattern-based gathering techniques for fact candidates and construct a weighted pattern-candidate graph. Our key contribution is a system called PRAVDA based on a new kind of label propagation algorithm with a judiciously designed loss function, which iteratively processes the graph to label good temporal facts for a given set of target relations. Our experiments with online news and Wikipedia articles demonstrate the accuracy of this method.
Towards a top-down and bottom-up bidirectional approach to joint information extraction BIBAFull-Text 847-856
  Xiaofeng Yu; Irwin King; Michael R. Lyu
Most high-level information extraction (IE) consists of compound and aggregated subtasks. Such IE problems are generally challenging and they have generated increasing interest recently. We investigate two representative IE tasks: (1) entity identification and relation extraction from Wikipedia, and (2) citation matching, and we formally define joint optimization of information extraction. We propose a joint paradigm integrating three factors -- segmentation, relation, and segmentation-relation joint factors, to solve all relevant subtasks simultaneously. This modeling offers a natural formalism for exploiting bidirectional rich dependencies and interactions between relevant subtasks to capture mutual benefits. Since exact parameter estimation is prohibitively intractable, we present a general, highly-coupled learning algorithm based on variational expectation maximization (VEM) to perform parameter estimation approximately in a top-down and bottom-up manner, such that information can flow bidirectionally and mutual benefits from different subtasks can be well exploited. In this algorithm, both segmentation and relation are optimized iteratively and collaboratively using hypotheses from each other. We conducted extensive experiments using two real-world datasets to demonstrate the promise of our approach.
From names to entities using thematic context distance BIBAFull-Text 857-866
  Anja Pilz; Gerhard Paaß
Name ambiguity arises from the polysemy of names and causes uncertainty about the true identity of entities referenced in unstructured text. This is a major problem in areas like information retrieval or knowledge management, for example when searching for a specific entity or updating an existing knowledge base.
   We approach this problem of named entity disambiguation (NED) using thematic information derived from Latent Dirichlet Allocation (LDA) to compare the entity mention's context with candidate entities in Wikipedia represented by their respective articles. We evaluate various distances over topic distributions in a supervised classification setting to find the best suited candidate entity, which is either covered in Wikipedia or unknown. We compare our approach to a state of the art method and show that it achieves significantly better results in predictive performance, regarding both entities covered in Wikipedia as well as uncovered entities.
   We show that our approach is in general language independent as we obtain equally good results for named entity disambiguation using the English, the German and the French Wikipedia.
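Comparing a mention's context with candidate entities through distances over topic distributions can be illustrated as follows. This is a rough sketch under stated assumptions: the topic vectors are taken as given (as if produced by an LDA model), Jensen-Shannon divergence and Hellinger distance are two common choices that the abstract does not name, and the plain ranking stands in for the supervised classifier the abstract describes.

```python
import math

def kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # Symmetric and finite even when p and q have disjoint support.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

def rank_candidates(context_topics, candidate_topics, distance=jensen_shannon):
    """Order candidate entity names by topic distance to the mention context."""
    return sorted(candidate_topics,
                  key=lambda name: distance(context_topics, candidate_topics[name]))
```

For example, `rank_candidates([0.7, 0.2, 0.1], {"car": [0.6, 0.3, 0.1], "animal": [0.1, 0.1, 0.8]})` places `"car"` first, since its topic mixture is closer to the context.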
Learning conditional random fields with latent sparse features for acronym expansion finding BIBAFull-Text 867-872
  Jie Liu; Jimeng Chen; Yi Zhang; Yalou Huang
The ever-increasing usage of acronyms in many kinds of documents, including web pages, is becoming an obstacle for average readers. This paper studies the task of finding expansions in documents for a given set of acronyms. We cast the expansion finding problem as a sequence labeling task and adapt Conditional Random Fields (CRF) to solve it. While adapting CRFs, we enhance the performance in two respects. First, we introduce nonlinear hidden layers to learn better representations of the input data. Second, we design simple and effective features. We create a hand-labeled evaluation data set based on Wikipedia.org and web crawling. We evaluate the effectiveness of several algorithms in solving the expansion finding problem. The experimental results demonstrate that the new method performs better than Support Vector Machines and standard Conditional Random Fields.
Accounting for data dependencies within a hierarchical dirichlet process mixture model BIBAFull-Text 873-878
  Dongwoo Kim; Alice Oh
We propose a hierarchical nonparametric topic model, based on the hierarchical Dirichlet process (HDP), that accounts for dependencies among the data. The HDP mixture models are useful for discovering an unknown semantic structure (i.e., topics) from a set of unstructured data such as a corpus of documents. For simplicity, HDP makes an exchangeability assumption that any permutation of the data points would result in the same joint probability of the data being generated. This exchangeability assumption poses a problem for some domains where there are clear and strong dependencies among the data. A model that allows for non-exchangeability of data can capture these dependencies and assign higher probabilities to clusters that account for data dependencies, for example, inferring topics that reflect the temporal patterns of the data. Our model incorporates the distance dependent Chinese restaurant process (ddCRP), which clusters data with an inherent bias toward clusters of data points that are near to one another, into a hierarchical construction analogous to the HDP, and we call this new prior the distance dependent Chinese restaurant franchise (ddCRF). When tested with temporal datasets, the ddCRF mixture model shows clear improvements in data fit compared to the HDP in terms of heldout likelihood and complexity. The resulting set of topics shows the sequential emergence and disappearance patterns of topics.
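The bias of the ddCRP toward clustering nearby data points can be seen by sampling from the prior alone. In the sketch below, which is only an illustration and not the ddCRF model, each data point links to itself with weight `alpha` or to another point with weight `decay(distance)`, and clusters are the connected components of the resulting links. One-dimensional positions (for instance time stamps) and the function name are assumptions made for the example.

```python
import random

def sample_ddcrp_partition(positions, alpha, decay, rng):
    """Draw one partition from a ddCRP prior over 1-D positions.

    alpha must be positive so that every point has a nonzero total weight.
    """
    n = len(positions)
    links = []
    for i in range(n):
        # Self-link gets weight alpha; linking to j gets weight decay(d_ij).
        weights = [alpha if j == i else decay(abs(positions[i] - positions[j]))
                   for j in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])

    # Clusters are connected components of the link graph (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With a window decay such as `lambda d: 1.0 if d <= 1.0 else 0.0`, points further apart than the window can never be linked directly, so distant groups stay in separate clusters. That dependence on distance is what breaks exchangeability.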
Summarizing web forum threads based on a latent topic propagation process BIBAFull-Text 879-884
  Zhaochun Ren; Jun Ma; Shuaiqiang Wang; Yang Liu
With an increasing amount of information in web forums, quick comprehension of threads in web forums has become a challenging research problem. To handle this issue, this paper investigates the task of Web Forum Thread Summarization (WFTS), aiming to give a brief statement of each thread that involves multiple dynamic topics. When applied to the task of WFTS, traditional summarization methods are hampered by topic dependencies, topic drifting and text sparseness. Consequently, we explore an unsupervised topic propagation model in this paper, the Post Propagation Model (PPM), to overcome these problems by simultaneously modeling the semantics and the reply relationships existing in each thread. Each post in PPM is considered as a mixture of topics, and a product of Dirichlet distributions in previous posts is employed to model each topic's dependencies during the asynchronous discussion. Based on this model, the task of WFTS is accomplished by extracting the most significant sentences in a thread. The experimental results on two different forum data sets show that WFTS based on the PPM outperforms several state-of-the-art summarization methods in terms of ROUGE metrics.

Privacy

Cloning for privacy protection in multiple independent data publications BIBAFull-Text 885-894
  Muzammil M. Baig; Jiuyong Li; Jixue Liu; Hua Wang
Data anonymization has become a major technique in privacy-preserving data publishing. Many methods have been proposed to anonymize one dataset or a series of datasets of a single data owner. However, no method has been proposed for the anonymization of data in multiple independent data publications, where a data owner publishes a dataset that contains a population overlapping with other datasets published by other independent data owners. In this paper we analyze the privacy risk in such a scenario and the vulnerability of partition-based anonymization methods. We show that no partition-based anonymization method can protect privacy in arbitrary data distributions, and identify a case in which privacy can be protected in this scenario. We propose a new generalization principle, ε-cloning, to protect privacy for multiple independent data publications. We also develop an effective algorithm to achieve ε-cloning. We experimentally show that the proposed algorithm anonymizes data to satisfy the privacy requirement and preserves good data utility.
Privacy-aware querying over sensitive trajectory data BIBAFull-Text 895-904
  Nikos Pelekis; Aris Gkoulalas-Divanis; Marios Vodas; Despina Kopanaki; Yannis Theodoridis
Existing approaches for privacy-aware mobility data sharing aim at publishing an anonymized version of the mobility dataset, operating under the assumption that most of the information in the original dataset can be disclosed without causing any privacy violations. In this paper, we assume that the majority of the information that exists in the mobility dataset must remain private and the data has to stay in-house to the hosting organization. To facilitate privacy-aware sharing of the mobility data we develop a trajectory query engine that allows subscribed users to gain restricted access to the database to accomplish various analysis tasks. The proposed engine (i) audits queries for trajectory data to block potential attacks to user privacy, (ii) supports range, distance, and k-nearest neighbors spatial and spatiotemporal queries, and (iii) preserves user anonymity in answers to queries by (a) augmenting the real trajectories with a set of carefully crafted, realistic fake trajectories, and (b) ensuring that no user-specific sensitive locations are reported as part of the returned trajectories.
Privacy preserving indexing for eHealth information networks BIBAFull-Text 905-914
  Yuzhe Tang; Ting Wang; Ling Liu; Shicong Meng; Balaji Palanisamy
The past few years have witnessed an increasing demand for the next generation health information networks (e.g., NHIN[1]), which hold the promise of supporting large-scale information sharing across a network formed by autonomous healthcare providers. One fundamental capability of such an information network is to support efficient, privacy-preserving (for both users and providers) search over the distributed, access-controlled healthcare documents. In this paper we focus on addressing the privacy concerns of content providers; that is, the search should not reveal the specific association between contents and providers (a.k.a. content privacy). We propose SS-PPI, a novel privacy-preserving index abstraction, which, in conjunction with distributed access control-enforced search protocols, provides theoretically guaranteed protection of content privacy. Compared with existing proposals (e.g., flipping privacy-preserving index[2]), our solution offers a series of distinct features: (a) it incorporates access control policies in the privacy-preserving index, which improves both search efficiency and attack resilience; (b) it employs a fast index construction protocol via a novel use of the secret-sharing scheme in a fully distributed manner (without a trusted third party), requiring only a constant number (typically two) of rounds of communication; (c) it provides information-theoretic security against colluding adversaries during index construction as well as query answering. We conduct both formal analysis and experimental evaluation of SS-PPI and show that it outperforms the state-of-the-art solutions in terms of both privacy protection and execution efficiency.
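The general idea of secret sharing without a trusted third party can be illustrated with plain additive sharing, in which providers jointly compute an aggregate (say, how many of them hold documents matching a term) without any single party seeing another's input. This is a generic textbook sketch, not the SS-PPI construction protocol; the modulus, the function names and the two-step layout are assumptions made for the example.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative choice; any modulus larger than the true sum works

def share(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

def secure_sum(private_values):
    """Simulate n parties summing their private inputs in two message rounds."""
    n = len(private_values)
    # Round 1: party p splits its input and sends share q to party q.
    distributed = [share(v, n) for v in private_values]
    # Round 2: party q adds up what it received and publishes only that partial sum.
    partial_sums = [sum(distributed[p][q] for p in range(n)) % MODULUS
                    for q in range(n)]
    return reconstruct(partial_sums)
```

Any n-1 shares of a value are uniformly random, so an individual input stays hidden unless all the other parties pool what they received.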
Recommendation in the end-to-end encrypted domain BIBAFull-Text 915-924
  Jyh-Ren Shieh; Ching-Yung Lin; Ja-Ling Wu
In recommendation systems, a central host typically requires access to user profiles in order to generate useful recommendations. This access, however, undermines user privacy; the more information is revealed to the host, the more the user's privacy is compromised. In this paper, we propose a novel end-to-end encrypted recommendation mechanism which encrypts sensitive private data at the user end, without ever exposing plaintext private data to the host server. Unlike previously proposed privacy-preserving recommendation mechanisms, the data in this proposed system are lossless -- a pivotal feature to many applications, e.g., in health informatics, business analytics, cyber security, etc. We achieve this goal by developing encrypted-domain polynomial ring homomorphism cryptographic algorithms to compute similarity of encrypted scores on the server, so that collaborative recommendations can be computed in the encryption domain and only an authorized person can decrypt the exact results. We also propose a novel key management system to make sure private information retrieval and recommendation computations can be executed in the encrypted domain in practice. Our experiments show that the proposed scheme offers robust security and lossless accurate recommendation, as well as high efficiency. Our preliminary results show the recommendation accuracy is 21% better than the existing statistical lossy privacy-preserving mechanisms based on random perturbation and user profile distribution. This new approach can potentially be applied to various data mining and cloud computing environments and significantly alleviates the privacy concerns of users.
Privacy preservation by independent component analysis and variance control BIBAFull-Text 925-930
  Chih-Ming Hsu; Ming-Syan Chen
The primary objective of privacy preservation is to protect an individual's confidential information in released data sets. In recent years, several simulation-based approaches for privacy preservation have been proposed. The idea is to generate a synthetic data set with the constraint that the probability distribution is as close as possible to that of the original set. In this paper, we propose two frameworks for simulation-based privacy preservation of multivariate numerical data. The first framework, called PRIMP (PRivacy preserving by Independent coMPonents), is based on independent component analysis (ICA). It is shown empirically that PRIMP outperforms other simulation-based approaches in terms of Spearman's rank correlation and Kendall's tau correlation. The second approach proposed is a hybrid method that combines PRIMP and Cholesky's decomposition technique. It is shown empirically that the hybrid method preserves the covariance matrix of the original data exactly. The method also resolves the problem of generating good seeds for the Cholesky-based approach. Although the empirical results show that the hybrid approach is not always better than the PRIMP in terms of Spearman's rank correlation and Kendall's tau correlation, in theory, the risk of information leakage under the hybrid approach is much less than that under PRIMP.

Unsupervised and semi-supervised learning

Can irrelevant data help semi-supervised learning, why and how? BIBAFull-Text 937-946
  Haiqin Yang; Shenghuo Zhu; Irwin King; Michael R. Lyu
Previous semi-supervised learning (SSL) techniques usually assume unlabeled data are relevant to the target task. That is, they follow the same distribution as the targeted labeled data. In this paper, we address a different and very difficult scenario in SSL, where the unlabeled data may be a mixture of data relevant or irrelevant to the target binary classification task. In our framework, we do not require explicit prior knowledge on the relatedness of the unlabeled data to the target data. In order to alleviate the effect of the irrelevant unlabeled data and utilize the implicit knowledge among all available data, we develop a novel maximum margin classifier, named the tri-class support vector machine (3C-SVM), to seek an inductive rule that separates the target binary classification task well while identifying the irrelevant data as a by-product. To attain this goal, we introduce a new min loss function, which can relieve the impact of the irrelevant data while relying more on the labeled data and the relevant unlabeled data. This loss function can therefore achieve the maximum entropy principle. The 3C-SVM can then generalize standard SVMs, Semi-supervised SVMs, and SVMs learned from the universum as its special cases. We further analyze the properties of 3C-SVM to explain why the irrelevant data can help to improve the model performance. For implementation, we relax and approximate the objective by the convex-concave procedure, which turns the original optimization from an integer programming problem into one solved by a finite number of quadratic programming problems. Empirical results are reported to demonstrate the advantages of our 3C-SVM model.
Toward interactive training and evaluation BIBAFull-Text 947-956
  Gregory Druck; Andrew McCallum
Machine learning often relies on costly labeled data, and this impedes its application to new classification and information extraction problems. This has motivated the development of methods for leveraging abundant prior knowledge about these problems, including methods for lightly supervised learning using model expectation constraints. Building on this work, we envision an interactive training paradigm in which practitioners perform evaluation, analyze errors, and provide and refine expectation constraints in a closed loop. In this paper, we focus on several key subproblems in this paradigm that can be cast as selecting a representative sample of the unlabeled data for the practitioner to inspect. To address these problems, we propose stratified sampling methods that use model expectations as a proxy for latent output variables. In classification and sequence labeling experiments, these sampling strategies reduce accuracy evaluation effort by as much as 53%, provide more reliable estimates of F1 for rare labels, and aid in the specification and refinement of constraints.
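Stratified sampling with a model score as the stratification variable can be sketched as below. This is a simplified illustration, not the authors' estimator: items are sorted by a per-item model confidence (standing in for the model expectations in the abstract), cut into equal-size strata, and a few items per stratum are sent to the practitioner. The callback `is_correct` and all parameter names are hypothetical.

```python
import random

def stratified_accuracy_estimate(confidences, is_correct, n_strata, per_stratum, rng):
    """Estimate accuracy from a small labeled sample.

    confidences: model confidence for each item (the proxy for the latent output)
    is_correct:  callback i -> bool, simulating the practitioner inspecting item i
    Returns (accuracy_estimate, number_of_items_inspected).
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    size = len(order)
    estimate = 0.0
    inspected = 0
    for s in range(n_strata):
        stratum = order[s * size // n_strata:(s + 1) * size // n_strata]
        if not stratum:
            continue
        sample = rng.sample(stratum, min(per_stratum, len(stratum)))
        inspected += len(sample)
        stratum_accuracy = sum(is_correct(i) for i in sample) / len(sample)
        # Weight each stratum by its share of the full data set.
        estimate += stratum_accuracy * len(stratum) / size
    return estimate, inspected
```

When the proxy separates correct from incorrect predictions well, each stratum is nearly homogeneous and a handful of inspected items per stratum already gives a low-variance estimate. That is the intuition behind the reduced evaluation effort.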
Semi-supervised multi-task learning of structured prediction models for web information extraction BIBAFull-Text 957-966
  Paramveer S. Dhillon; Sundararajan Sellamanickam; Sathiya Keerthi Selvaraj
Extracting information from web pages is an important problem; it has several applications such as providing improved search results and construction of databases to serve user queries. In this paper we propose a novel structured prediction method to address two important aspects of the extraction problem: (1) labeled data is available only for a small number of sites and (2) a machine learned global model does not generalize adequately well across many websites. For this purpose, we propose a weight space based graph regularization method. This method has several advantages. First, it can use unlabeled data to address the limited labeled data problem and falls in the class of graph regularization based semi-supervised learning approaches. Second, to address the generalization inadequacy of a global model, this method builds a local model for each website. Viewing the problem of building a local model for each website as a task, we learn the models for a collection of sites jointly; thus our method can also be seen as a graph regularization based multi-task learning approach. Learning the models jointly with the proposed method is very useful in two ways: (1) learning a local model for a website can be effectively influenced by labeled and unlabeled data from other websites; and (2) even for a website with only unlabeled examples it is possible to learn a decent local model. We demonstrate the efficacy of our method on several real-life data; experimental results show that significant performance improvement can be obtained by combining semi-supervised and multi-task learning in a single framework.
Memory-less unsupervised clustering for data streaming by versatile ellipsoidal function BIBAFull-Text 967-972
  Niwan Wattanakitrungroj; Chidchanok Lursinsap
The challenge of clustering data streams is to deal with continuously incoming data, which are unbounded and cannot all be stored. To manage the storage problem, the data must be processed in a single pass, only once after arrival, and are then thrown away. All previously clustered data must be mathematically captured in terms of group features, since those data no longer exist. The proposed data stream clustering algorithm is divided into two main phases, namely on-line and off-line. In the on-line phase, new micro-cluster features are proposed. Our micro-cluster features represent the arriving data better than the traditional micro-cluster features. In the off-line phase, the prepared micro-clusters are categorized by their densities. The proposed method can generate final clusters with different shapes and densities. Based on entropy, purity, Jaccard coefficient, and Rand statistic measures, our algorithm, applied to synthetic and real data, outperforms previous data stream clustering algorithms.
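For context, the traditional micro-cluster feature that the abstract compares against is usually a triple of count, linear sum and squared sum, updated incrementally so that raw points can be discarded. The class below is a minimal sketch of that conventional summary, not the ellipsoidal features proposed in the paper; the class and method names are illustrative.

```python
import math

class MicroCluster:
    """Count / linear-sum / squared-sum summary of the points absorbed so far."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.square_sum = [0.0] * dim

    def insert(self, point):
        # Single-pass update: the point itself is not stored.
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def merge(self, other):
        # The summaries are additive, so two micro-clusters merge exactly.
        self.n += other.n
        for d in range(len(self.linear_sum)):
            self.linear_sum[d] += other.linear_sum[d]
            self.square_sum[d] += other.square_sum[d]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root of the summed per-dimension variance, recovered from the sums alone.
        c = self.centroid()
        var = sum(sq / self.n - m * m for sq, m in zip(self.square_sum, c))
        return math.sqrt(max(var, 0.0))
```

Because this summary yields only a centroid and a single radius, it describes a roughly spherical group, which is one reason a richer feature may represent the arriving data better.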
Coupled nominal similarity in unsupervised learning BIBAFull-Text 973-978
  Can Wang; Longbing Cao; Mingchun Wang; Jinjiu Li; Wei Wei; Yuming Ou
The similarity between nominal objects is not straightforward, especially in unsupervised learning. This paper proposes coupled similarity metrics for nominal objects, which consider not only intra-coupled similarity within an attribute (i.e., value frequency distribution) but also inter-coupled similarity between attributes (i.e. feature dependency aggregation). Four metrics are designed to calculate the inter-coupled similarity between two categorical values by considering their relationships with other attributes. The theoretical analysis reveals their equivalent accuracy and superior efficiency based on intersection against others, in particular for large-scale data. Substantial experiments on extensive UCI data sets verify the theoretical conclusions. In addition, experiments of clustering based on the derived dissimilarity metrics show a significant performance improvement.
Feature selection using hierarchical feature clustering BIBAFull-Text 979-984
  Huawen Liu; Xindong Wu; Shichao Zhang
One of the challenges in data mining is the dimensionality of data, which is often very high and prevalent in many domains, such as text categorization and bio-informatics. The high-dimensionality of data may bring many adverse situations to traditional learning algorithms. To cope with this issue, feature selection has been put forward. Currently, many efforts have been attempted in this field and lots of feature selection algorithms have been developed. In this paper we propose a new selection method to pick discriminative features by using information measurement. The main characteristic of our selection method is that the selection procedure works like feature clustering in a hierarchically agglomerative way, where each feature is considered as a cluster and the between-cluster and within-cluster distances are measured by mutual information and the coefficient of relevancy respectively. Consequently, the final aggregated cluster is the selection result, which has the minimal redundancy among its members and the maximal relevancy with the class labels. The simulation experiments on seven datasets show that the proposed method outperforms other popular feature selection algorithms in classification performance.
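Mutual information between two discrete variables, the quantity the abstract uses to measure distances between feature clusters, can be estimated from co-occurrence counts as sketched below. This is the standard plug-in estimate, shown only as an illustration; the hierarchical clustering procedure and the coefficient of relevancy from the paper are not reproduced here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written directly in terms of counts.
    return sum((c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in count_xy.items())
```

Two features that carry the same information have high mutual information with each other (redundancy), while a useful feature has high mutual information with the class labels (relevancy).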

Social networks and communities

Discovering top-k teams of experts with/without a leader in social networks BIBAFull-Text 985-994
  Mehdi Kargar; Aijun An
We study the problem of discovering a team of experts from a social network. Given a project whose completion requires a set of skills, our goal is to find a set of experts that together have all of the required skills and also have the minimal communication cost among them. We propose two communication cost functions designed for two types of communication structures. We show that the problem of finding the team of experts that minimizes one of the proposed cost functions is NP-hard. Thus, an approximation algorithm with an approximation ratio of two is designed. We introduce the problem of finding a team of experts with a leader. The leader is responsible for monitoring and coordinating the project, and thus a different communication cost function is used in this problem. To solve this problem, an exact polynomial algorithm is proposed. We show that the total number of teams may be exponential with respect to the number of required skills. Thus, two procedures that produce top-k teams of experts with or without a leader in polynomial delay are proposed. Extensive experiments on real datasets demonstrate the effectiveness and scalability of the proposed methods.
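The exact polynomial-time search for a team with a leader can be pictured as follows: try every node as the leader and, for each required skill, pick the closest expert who holds it. The sketch assumes that the leader cost is the sum of shortest-path distances from the leader to the chosen skill holders and that the network is unweighted. Both are assumptions made for illustration, as are the names `graph`, `skills_of` and `best_team_with_leader`.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distances from `source` in an unweighted graph (dict node -> neighbors)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def best_team_with_leader(graph, skills_of, required):
    """Return (cost, leader, {skill: expert}) with minimal cost, or None if infeasible."""
    best = None
    for leader in graph:
        dist = bfs_distances(graph, leader)
        team, cost = {}, 0
        for skill in required:
            holders = [e for e in dist if skill in skills_of.get(e, ())]
            if not holders:
                break  # this leader cannot reach anyone with the skill
            expert = min(holders, key=lambda e: (dist[e], e))
            team[skill] = expert
            cost += dist[expert]
        else:
            # Reached only when every required skill was covered.
            if best is None or cost < best[0]:
                best = (cost, leader, team)
    return best
```

Once the leader is fixed the skills can be handled independently, which is why this variant admits an exact polynomial search, whereas minimizing communication cost among all team members is NP-hard according to the abstract.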
Content based social behavior prediction: a multi-task learning approach BIBAFull-Text 995-1000
  Hongliang Fei; Ruoyi Jiang; Yuhao Yang; Bo Luo; Jun Huan
Information flow studies analyze the principles and mechanisms of social information distribution and are an essential research topic in social networks. Traditional approaches are primarily based on the social network graph topology. However, topology itself cannot accurately reflect user interests or activities. In this paper, we adopt a "microeconomics" approach to study social information diffusion and aim to answer the question of how social information flow and socialization behaviors are related to content similarity and user interests. In particular, we study content-based social activity prediction, i.e., predicting a user's response (e.g. comment or like) to their friends' postings (e.g. blogs) w.r.t. message content. In our solution, we cast the social behavior prediction problem as a multi-task learning problem, in which each task corresponds to a user. We have designed a novel multi-task learning algorithm specifically for learning information flow in social networks. In our model, we apply l1 and Tikhonov regularization to obtain a sparse and smooth model in a linear multi-task learning framework. Using a comprehensive experimental study, we have demonstrated the effectiveness of the proposed learning method.
Improving user interest inference from social neighbors BIBAFull-Text 1001-1006
  Zhen Wen; Ching-Yung Lin
Prior research has provided some evidence of social correlation (i.e., "you are who you know"), which makes it possible to infer one's interests from his or her social neighbors. However, it is also shown to be challenging to consistently obtain high quality inference. This challenge can be partially attributed to the fact that people usually maintain diverse social relationships, in order to tap into diverse information and knowledge. It is unlikely that a person would possess all interests of his/her social neighbors. Instead, s/he may selectively acquire just a subset of them. This paper intends to improve inferring interests from neighbors given this observation. We conduct this study by implementing a privacy-preserving large distributed social sensor system in a large global IT company to capture the multifaceted activities (e.g., emails, instant messaging, social bookmarking, etc.) of 25K+ people. These activities occupy the majority of employees' time, and thus, provide a higher quality view of the diverse aspects of their professional interests compared to the friending activity on online social networking sites. In this paper, we propose a technique that exploits the correlation among the attributes that a person possesses to improve social-correlation-based inference quality. Our technique offers two unique contributions. First, we demonstrate that the proposed technique can significantly improve inference quality by as much as 76.1%. Second, we study the interaction between the two factors: social correlation and attribute correlation under different situations. The results can inform practical applications how the inference quality would change in various scenarios.
CASINO: towards conformity-aware social influence analysis in online social networks BIBAFull-Text 1007-1012
  Hui Li; Sourav S. Bhowmick; Aixin Sun
Social influence analysis in online social networks is the study of people's influence by analyzing the social interactions between individuals. There have been increasing research efforts to understand the influence propagation phenomenon due to its importance to information dissemination among others. Despite the progress achieved by state-of-the-art social influence analysis techniques, a key limitation of these techniques is that they only utilize positive interactions (e.g., agreement, trust) between individuals, ignoring two equally important factors, namely, negative relationships (e.g., distrust, disagreement) between individuals and conformity of people, which refers to a person's inclination to be influenced. In this paper, we propose a novel algorithm CASINO (Conformity-Aware Social INfluence cOmputation) to study the interplay between influence and conformity of each individual. Given a social network, CASINO first extracts a set of topic-based subgraphs where each subgraph depicts the social interactions associated with a specific topic. Then it optionally labels the edges (relationships) between individuals with positive or negative signs. Finally, it computes the influence and conformity indices of each individual in each signed topic-based subgraph. Our empirical study with several real-world social networks demonstrates superior effectiveness and accuracy of CASINO compared to state-of-the-art methods. Furthermore, we revealed several interesting characteristics of "influentials" and "conformers" in these networks.
Mining direct antagonistic communities in explicit trust networks BIBAFull-Text 1013-1018
  David Lo; Didi Surian; Kuan Zhang; Ee-Peng Lim
There has been a recent increase of interest in analyzing trust and friendship networks to gain insights about relationship dynamics among users. Many sites such as Epinions, Facebook, and other social networking sites allow users to declare trusts or friendships between different members of the community. In this work, we are interested in extracting direct antagonistic communities (DACs) within a rich trust network involving trusts and distrusts. Each DAC is formed by two subcommunities with trust relationships among members of each sub-community but distrust relationships across the sub-communities. We develop an efficient algorithm that could analyze large trust networks leveraging the unique property of direct antagonistic community. We have experimented with synthetic and real data-sets (myGamma and Epinions) to demonstrate the scalability of our proposed solution.
Connecting users with similar interests via tag network inference BIBAFull-Text 1019-1024
  Xufei Wang; Huan Liu; Wei Fan
The popularity of social networking greatly increases interaction among people. However, one major challenge remains -- how to connect people who share similar interests. In a social network, the majority of people who share similar interests with a given user are in the long tail that accounts for 80% of the total population. Searching for similar users by following links in a social network has two limitations: it is inefficient and incomplete. Thus, it is desirable to design new methods to find like-minded people. In this paper, we propose to use collective wisdom from the crowd, or tag networks, to solve the problem. In a tag network, each node represents a tag as described by some words, and the weight of an undirected edge represents the co-occurrence of two tags. As such, the tag network describes the semantic relationships among tags. In order to connect to other users of similar interests via a tag network, we use diffusion kernels on the tag network to measure the similarity between pairs of tags. The similarity of people's interests is measured on the basis of the similar tags they share. To recommend people who are alike, we retrieve the top k people sharing the most similar tags. Compared to two baseline methods, triadic closure and LSI, the proposed tag network approach achieves 108% and 27% relative improvements on the BlogCatalog dataset, respectively.
Do all birds tweet the same?: characterizing Twitter around the world BIBAFull-Text 1025-1030
  Barbara Poblete; Ruth Garcia; Marcelo Mendoza; Alejandro Jaimes
Social media services have spread throughout the world in just a few years. They have become not only a new source of information, but also new mechanisms for societies world-wide to organize themselves and communicate. Therefore, social media has a very strong impact in many aspects -- at the personal level, in business, and in politics, among many others. In spite of its fast adoption, little is known about social media usage in different countries, and whether patterns of behavior remain the same or not. Providing a deep understanding of differences between countries can be useful in many ways, e.g., to improve the design of social media systems (which features work best for which country?) and to influence marketing and political campaigns. Moreover, this type of analysis can provide relevant insight into how societies might differ. In this paper we present a summary of a large-scale analysis of Twitter for an extended period of time. We analyze in detail various aspects of social media for the ten countries we identified as most active. We collected one year's worth of data and report differences and similarities in terms of activity, sentiment, use of languages, and network structure. To the best of our knowledge, this is the first on-line social network study of such characteristics.

Sentiments and other perspectives

Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach BIBAFull-Text 1031-1040
  Xiaolong Wang; Furu Wei; Xiaohua Liu; Ming Zhou; Ming Zhang
Twitter is one of the biggest platforms where massive numbers of instant messages (i.e., tweets) are published every day. Users tend to express their real feelings freely on Twitter, which makes it an ideal source for capturing opinions towards various interesting topics, such as brands, products or celebrities. Naturally, people may want an approach that delivers the common sentiment tendency towards these topics directly rather than through reading the huge amount of tweets about them. On the other hand, hashtags, which start with the symbol "#" ahead of keywords or phrases, are widely used in tweets as coarse-grained topics. In this paper, instead of presenting the sentiment polarity of each tweet relevant to the topic, we focus our study on hashtag-level sentiment classification. This task aims to automatically generate the overall sentiment polarity for a given hashtag in a certain time period, which markedly differs from conventional sentence-level and document-level sentiment analysis. Our investigation illustrates that three types of information are useful for addressing the task: (1) the sentiment polarity of tweets containing the hashtag; (2) hashtag co-occurrence relationships; and (3) the literal meaning of hashtags. Consequently, in order to incorporate the first two types of information into a classification framework where hashtags can be classified collectively, we propose a novel graph model and investigate three approximate collective classification algorithms for inference. Going one step further, we show that the performance can be remarkably improved using an enhanced boosting classification setting in which we employ the literal meaning of hashtags as semi-supervised information. Experimental results on a real-life data set consisting of 29,195 tweets and 2,181 hashtags show the effectiveness of the proposed model and algorithms.
Language-independent sentiment classification using three common words BIBAFull-Text 1041-1046
  Zheng Lin; Songbo Tan; Xueqi Cheng
Many methods for cross-lingual processing tasks are resource-dependent and will not work without a machine translation system or bilingual lexicon. In this paper, we propose a novel approach for multilingual sentiment classification that uses just a few seed words. For a given language, the proposed approach learns a sentiment classifier from the initial seed words instead of any labeled data. We employ our method in both supervised and unsupervised learning. Experimental results demonstrate that our method relies less on external resources but performs as well as or better than the baseline.
A cross-domain adaptation method for sentiment classification using probabilistic latent analysis BIBAFull-Text 1047-1052
  Sheng Gao; Haizhou Li
Sentiment classification has become attractive in recent years because of its potential commercial applications. It exploits supervised learning methods to learn the classifiers from annotated training documents. The challenge in sentiment classification lies in the fact that the sentiment domains are diverse, heterogeneous and fast-growing. A classifier trained on one domain (the source domain) may not be able to classify a document from another domain (the target domain). Domain adaptation techniques address the problem by making use of labeled samples in the source domain and unlabeled samples in the target domain. This paper presents a new solution, a cross-domain topic indexing (CDTI) method, with which a common semantic space is found from the prior between-domain term correspondences and the term co-occurrences in the cross-domain documents. These observations are characterized with the mixture model in CDTI, with each component being a possible topic shared by the source and target domains. Such common topics are found to index the cross-domain content. We evaluate the algorithms on a multi-domain sentiment classification task, which shows that CDTI outperforms the state-of-the-art domain adaptation method, i.e., spectral feature alignment (SFA), and the traditional latent semantic indexing method.
Using games with a purpose and bootstrapping to create domain-specific sentiment lexicons BIBAFull-Text 1053-1060
  Albert Weichselbraun; Stefan Gindl; Arno Scharl
Sentiment detection analyzes the positive or negative polarity of text. The field has received considerable attention in recent years, since it plays an important role in providing means to assess user opinions regarding an organization's products, services, or actions. Approaches towards sentiment detection include machine learning techniques as well as computationally less expensive methods. Both approaches rely on the use of language-specific sentiment lexicons, which are lists of sentiment terms with their corresponding sentiment value. The effort involved in creating, customizing, and extending sentiment lexicons is considerable, particularly if less common languages and domains are targeted without access to appropriate language resources. This paper proposes a semi-automatic approach for the creation of sentiment lexicons which assigns sentiment values to sentiment terms via crowd-sourcing. Furthermore, it introduces a bootstrapping process operating on unlabeled domain documents to extend the created lexicons, and to customize them according to the particular use case. This process considers sentiment terms as well as sentiment indicators occurring in the discourse surrounding a particular topic. Such indicators are associated with a positive or negative context in a particular domain, but might have a neutral connotation in other domains. A formal evaluation shows that bootstrapping considerably improves the method's recall. Automatically created lexicons yield a performance comparable to professionally created language resources such as the General Inquirer.
Polarity analysis of texts using discourse structure BIBAFull-Text 1061-1070
  Bas Heerschop; Frank Goossen; Alexander Hogenboom; Flavius Frasincar; Uzay Kaymak; Franciska de Jong
Sentiment analysis has applications in many areas and the exploration of its potential has only just begun. We propose Pathos, a framework which performs document sentiment analysis (partly) based on a document's discourse structure. We hypothesize that by splitting a text into important and less important text spans, and by subsequently making use of this information by weighting the sentiment conveyed by distinct text spans in accordance with their importance, we can improve the performance of a sentiment classifier. A document's discourse structure is obtained by applying Rhetorical Structure Theory on sentence level. When controlling for each considered method's structural bias towards positive classifications, weights optimized by a genetic algorithm yield an improvement in sentiment classification accuracy and macro-level F1 score on documents of 4.5% and 4.7%, respectively, in comparison to a baseline not taking into account discourse structure.
A query-based multi-document sentiment summarizer BIBAFull-Text 1071-1076
  Maria Soledad Pera; Rani Qumsiyeh; Yiu-Kai Ng
Review websites, such as Epinions.com, which offer users a platform to share their opinions on diverse products and services, provide a valuable source of opinion-rich information. Browsing through archived reviews to locate different opinions on a product or service, however, is a time-consuming and tedious task, and in most cases, the large amount of available information is difficult for users to absorb. To facilitate the process of synthesizing opinions expressed in reviews on a product or service P specified in a user query/question Q, we introduce QMSS, a query-based multi-document sentiment summarizer. QMSS creates a summary for Q, which either reflects the general opinions on P or is tailored to specific facets (i.e., features) and/or sentiment of P as specified in Q. QMSS (i) identifies the facets addressed in reviews retrieved for Q, (ii) employs a sentence-based, sentiment classifier to determine the polarity of each sentence in each review, and (iii) clusters sentences in reviews according to the facets captured in the sentences, which are identified using a keyword-label extraction algorithm. This process dictates which sentences in the reviews should be included in the summary for Q. Empirical studies have verified that QMSS is highly effective in generating summaries that satisfy users' information needs and ranks on top among the state-of-the-art query-based multi-document sentiment summarizers.

Classification and clustering: large-scale statistical techniques

Scalable density-based subspace clustering BIBAFull-Text 1077-1086
  Emmanuel Müller; Ira Assent; Stephan Günnemann; Thomas Seidl
For knowledge discovery in high dimensional databases, subspace clustering detects clusters in arbitrary subspace projections. Scalability is a crucial issue, as the number of possible projections is exponential in the number of dimensions. We propose a scalable density-based subspace clustering method that steers mining to few selected subspace clusters. Our novel steering technique reduces subspace processing by identifying and clustering promising subspaces and their combinations directly. Thereby, it narrows down the search space while maintaining accuracy. Thorough experiments on real and synthetic databases show that steering is efficient and scalable, with high quality results. For future work, our steering paradigm for density-based subspace clustering opens research potential for speeding up other subspace clustering approaches as well.
Correlated multi-label feature selection BIBAFull-Text 1087-1096
  Quanquan Gu; Zhenhui Li; Jiawei Han
Multi-label learning studies the problem where each instance is associated with a set of labels. There are two challenges in multi-label learning: (1) the labels are interdependent and correlated, and (2) the data are of high dimensionality. In this paper, we aim to tackle these challenges in one shot. In particular, we propose to learn the label correlation and do feature selection simultaneously. We introduce a matrix-variate Normal prior distribution on the weight vectors of the classifier to model the label correlation. Our goal is to find a subset of features, based on which the label correlation regularized loss of label ranking is minimized. The resulting multi-label feature selection problem is a mixed integer program, which is reformulated as quadratically constrained linear programming (QCLP). It can be solved by a cutting plane algorithm, in each iteration of which a minimax optimization problem is solved by alternating between dual coordinate descent and projected sub-gradient descent. Experiments on benchmark data sets illustrate that the proposed methods outperform single-label feature selection methods and many other state-of-the-art multi-label learning methods.
Pattern change discovery between high dimensional data sets BIBAFull-Text 1097-1106
  Yi Xu; Zhongfei Zhang; Philips Yu; Bo Long
This paper investigates the general problem of pattern change discovery between high-dimensional data sets. Current methods either mainly focus on magnitude change detection of low-dimensional data sets or are under supervised frameworks. In this paper, the notion of the principal angles between the subspaces is introduced to measure the subspace difference between two high-dimensional data sets. Principal angles bear a property to isolate subspace change from the magnitude change. To address the challenge of directly computing the principal angles, we elect to use matrix factorization to serve as a statistical framework and develop the principle of the dominant subspace mapping to transfer the principal angle based detection to a matrix factorization problem. We show how matrix factorization can be naturally embedded into the likelihood ratio test based on the linear models. The proposed method is of an unsupervised nature and addresses the statistical significance of the pattern changes between high-dimensional data sets. We have showcased the different applications of this solution in several specific real-world applications to demonstrate the power and effectiveness of this method.
MTopS: scalable processing of continuous top-k multi-query workloads BIBAFull-Text 1107-1116
  Avani Shastri; Yang Di; Elke A. Rundensteiner; Matthew O. Ward
A continuous top-k query retrieves the k most preferred objects in a data stream according to a given preference function. These queries are important for a broad spectrum of applications ranging from web-based advertising to financial analysis. In various streaming applications, a large number of such continuous top-k queries need to be executed simultaneously against a common popular input stream. To efficiently handle such top-k query workloads, we present a comprehensive framework, called MTopS. Within this MTopS framework, several computational components work collaboratively to first analyze the commonalities across the workload; organize the workload for maximized sharing opportunities; execute the workload queries simultaneously in a shared manner; and output query results whenever any input query requires. In particular, MTopS supports two proposed algorithms, MTopBand and MTopList, which both incrementally maintain the top-k objects over time for multiple queries. As the foundation, we first identify the minimal object set from the data stream that is both necessary and sufficient for accurately answering all top-k queries in the workload. Then, the MTopBand algorithm is presented to incrementally maintain such a minimal object set and eliminate the need for any recomputation from scratch. To further optimize MTopBand, we design the second algorithm, MTopList, which organizes the progressive top-k results of workload queries in a compact structure. MTopList is shown to be memory optimal and also more efficient in terms of CPU time usage than MTopBand. Our experimental study, using real data streams from domains of stock trades and moving object monitoring, demonstrates that both the efficiency and scalability of our proposed techniques are clearly superior to the state-of-the-art solutions.
Probabilistic near-duplicate detection using simhash BIBAFull-Text 1117-1126
  Sadhan Sood; Dmitri Loguinov
This paper offers a novel look at using a dimensionality-reduction technique called simhash to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work, our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.
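As a rough illustration of the idea in this abstract, the sketch below computes a simhash fingerprint and also returns the per-bit weight sums that are usually thrown away once the fingerprint is formed. The tokenization, the use of MD5 as the token hash, and the helper `volatile_bits` (which ranks bits by how close their sum is to zero) are assumptions made for this example, not details taken from the paper.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return (fingerprint, per-bit sums) for an iterable of string tokens."""
    sums = [0] * bits
    for tok in tokens:
        # Assumed token hash: the first bits/8 bytes of MD5.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            sums[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, s in enumerate(sums):
        if s > 0:  # the sign of each sum decides the corresponding bit
            fingerprint |= 1 << i
    return fingerprint, sums


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def volatile_bits(sums, k):
    """Hypothetical helper: the k bit positions whose sums are nearest zero.

    A small edit to the document is most likely to change the sign of these
    sums, so a probabilistic search could try flipping them first.
    """
    return sorted(range(len(sums)), key=lambda i: abs(sums[i]))[:k]


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
fp_a, sums_a = simhash(doc_a)
fp_b, _ = simhash(doc_b)
print(hamming(fp_a, fp_b), volatile_bits(sums_a, 4))
```

Two near-duplicate documents tend to have fingerprints that differ in few bits; the retained sums hint at which bits those are likely to be.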

Link prediction

Collective prediction with latent graphs BIBAFull-Text 1127-1136
  Xiaoxiao Shi; Yao Li; Philip Yu
Collective classification in relational data has become an important and active research topic in the last decade. It exploits the dependencies of instances in a network to improve predictions. Related applications include hyperlinked document classification, social network analysis and collaboration network analysis. Most traditional collective classification models study the scenario in which a large number of labeled examples (labeled nodes) exist. However, in many real-world applications, labeled data are extremely difficult to obtain. For example, in network intrusion detection, there may be only a limited number of identified intrusions whereas there is a huge set of unlabeled nodes. In this situation, most of the data have no connection to labeled nodes; hence, no supervision knowledge can be obtained from the local connections. In this paper, we propose to explore various latent linkages among the nodes and judiciously integrate the linkages to generate a latent graph. This is achieved by finding a graph that maximizes the linkages among the training data with the same label, and maximizes the separation among the data with different labels. The objective is further cast into an optimization problem and is solved with quadratic programming. Finally, we apply label propagation on the latent graph to make predictions. Experiments show that the proposed model LNP (Latent Network Propagation) can improve the learning accuracy significantly. For instance, when only 10% of the examples are labeled, the accuracies of all the comparison models are less than 63%, while that of the proposed model is 74%.
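The final step mentioned in this abstract, label propagation over a weighted graph, can be sketched generically as follows. This is not the LNP model itself: the construction of the latent graph by quadratic programming is omitted, and the function name, the dictionary-based graph format, and the fixed iteration count are assumptions chosen for the example.

```python
def propagate_labels(weights, seeds, iterations=50):
    """Generic label propagation on a weighted undirected graph.

    weights: {node: {neighbor: weight}}, every node appears as a key.
    seeds:   {node: label} for the few labeled nodes (kept fixed).
    Returns {node: predicted label}.
    """
    labels = sorted(set(seeds.values()))
    scores = {v: {c: 0.0 for c in labels} for v in weights}
    for v, c in seeds.items():
        scores[v][c] = 1.0

    for _ in range(iterations):
        updated = {}
        for v, nbrs in weights.items():
            total = sum(nbrs.values())
            if v in seeds or total == 0:
                updated[v] = scores[v]  # clamp seeds, skip isolated nodes
                continue
            # Each unlabeled node takes the weighted average of its neighbors.
            updated[v] = {
                c: sum(w * scores[u][c] for u, w in nbrs.items()) / total
                for c in labels
            }
        scores = updated

    return {v: max(labels, key=lambda c: scores[v][c]) for v in weights}


# Toy chain a - b - c - d with a weak link between b and c.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 0.1},
    "c": {"b": 0.1, "d": 1.0},
    "d": {"c": 1.0},
}
print(propagate_labels(graph, {"a": "x", "d": "y"}))
```

In this toy graph the weak edge between `b` and `c` acts as the boundary, so `b` follows the seed `a` and `c` follows the seed `d`.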
Who will follow you back?: reciprocal relationship prediction BIBAFull-Text 1137-1146
  John Hopcroft; Tiancheng Lou; Jie Tang
We study the extent to which the formation of a two-way relationship can be predicted in a dynamic social network. A two-way (called reciprocal) relationship, usually developed from a one-way (parasocial) relationship, represents a more trustful relationship between people. Understanding the formation of two-way relationships can provide us insights into the micro-level dynamics of the social network, such as what the underlying community structure is and how users influence each other. Employing Twitter as a source for our experimental data, we propose a learning framework to formulate the problem of reciprocal relationship prediction into a graphical model. The framework incorporates social theories into a machine learning model. We demonstrate that it is possible to accurately infer 90% of reciprocal relationships in a dynamic network. Our study provides strong evidence of the existence of structural balance among reciprocal relationships. In addition, we have some interesting findings, e.g., the likelihood of two "elite" users creating a reciprocal relationship is nearly 8 times higher than the likelihood of two ordinary users. More importantly, our findings have potential implications such as how social structures can be inferred from individuals' behaviors.
Link prediction: the power of maximal entropy random walk BIBAFull-Text 1147-1156
  Rong-Hua Li; Jeffrey Xu Yu; Jianquan Liu
Link prediction is a fundamental problem in social network analysis. The key technique in unsupervised link prediction is to find an appropriate similarity measure between nodes of a network. A class of widely used similarity measures is based on random walks on graphs. The traditional random walk (TRW) considers the link structures by treating all nodes in a network equivalently, and ignores the centrality of nodes of a network. However, in many real networks, nodes of a network not only prefer to link to similar nodes, but also prefer to link to the central nodes of the network. To address this issue, we use maximal entropy random walk (MERW) for link prediction, which incorporates the centrality of nodes of the network. First, we study certain important properties of MERW on graph $G$ by constructing an eigen-weighted graph $\bar{G}$. We show that the transition matrix and stationary distribution of MERW on $G$ are identical to the ones of TRW on $\bar{G}$. Based on $\bar{G}$, we further give the maximal entropy graph Laplacians, and show how to quickly compute the hitting time and commute time of MERW. Second, we propose four new graph kernels and two similarity measures based on MERW for link prediction. Finally, to exhibit the power of MERW in link prediction, we compare 27 various link prediction methods over 3 synthetic and 8 real networks. The results show that our newly proposed MERW based methods outperform the state-of-the-art method on most datasets.
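For readers unfamiliar with MERW, the sketch below shows the standard textbook construction on a small connected, undirected graph: with principal eigenvalue λ and eigenvector ψ of the adjacency matrix, the walk moves from i to a neighbor j with probability ψ_j / (λ ψ_i), and its stationary distribution is proportional to ψ_i². The function name, the adjacency-list format, and the use of shifted power iteration are choices made for this example and are not taken from the paper.

```python
def merw(adj, iterations=500):
    """Maximal entropy random walk on a connected undirected graph.

    adj: {node: [neighbors]} without self-loops.
    Returns (P, pi): transition probabilities and stationary distribution.
    """
    nodes = sorted(adj)
    psi = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Power iteration on A + I; the shift avoids oscillation on
        # bipartite graphs and leaves the eigenvectors unchanged.
        nxt = {v: psi[v] + sum(psi[u] for u in adj[v]) for v in nodes}
        norm = sum(x * x for x in nxt.values()) ** 0.5
        psi = {v: x / norm for v, x in nxt.items()}

    # Rayleigh quotient psi^T A psi gives the principal eigenvalue of A.
    lam = sum(psi[v] * sum(psi[u] for u in adj[v]) for v in nodes)

    P = {v: {u: psi[u] / (lam * psi[v]) for u in adj[v]} for v in nodes}
    pi = {v: psi[v] ** 2 for v in nodes}  # psi has unit norm, so pi sums to 1
    return P, pi


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
P, pi = merw(graph)
print(P[2], pi)
```

Unlike the traditional walk, which leaves node 2 with probability 1/3 along each edge, this walk favors the better-connected neighbors 0 and 1 over the pendant node 3.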
Exploiting longer cycles for link prediction in signed networks BIBAFull-Text 1157-1162
  Kai-Yang Chiang; Nagarajan Natarajan; Ambuj Tewari; Inderjit S. Dhillon
We consider the problem of link prediction in signed networks. Such networks arise on the web in a variety of ways when users can implicitly or explicitly tag their relationship with other users as positive or negative. The signed links thus created reflect social attitudes of the users towards each other in terms of friendship or trust. Our first contribution is to show how any quantitative measure of social imbalance in a network can be used to derive a link prediction algorithm. Our framework allows us to reinterpret some existing algorithms as well as derive new ones. Second, we extend the approach of Leskovec et al. (2010) by presenting a supervised machine learning based link prediction method that uses features derived from longer cycles in the network. The supervised method outperforms all previous approaches on 3 networks drawn from sources such as Epinions, Slashdot and Wikipedia. The supervised approach easily scales to these networks, the largest of which has 132k nodes and 841k edges. Most real-world networks have an overwhelmingly large proportion of positive edges and it is therefore easy to get a high overall accuracy at the cost of a high false positive rate. We see that our supervised method not only achieves good accuracy for sign prediction but is also especially effective in lowering the false positive rate.
Structural link analysis and prediction in microblogs BIBAFull-Text 1163-1168
  Dawei Yin; Liangjie Hong; Brian D. Davison
With hundreds of millions of participants, social media services have become commonplace. Unlike a traditional social network service, a microblogging network like Twitter is a hybrid network, combining aspects of both social networks and information networks. Understanding the structure of such hybrid networks and predicting new links are important for many tasks such as friend recommendation, community detection, and modeling network growth. We note that the link prediction problem in a hybrid network is different from previously studied networks. Unlike the information networks and traditional online social networks, the structures in a hybrid network are more complicated and informative. We compare most popular and recent methods and principles for link prediction and recommendation. Finally we propose a novel structure-based personalized link prediction model and compare its predictive performance against many fundamental and popular link prediction methods on real-world data from the Twitter microblogging network. Our experiments on both static and dynamic data sets show that our methods noticeably outperform the state-of-the-art.
Temporal link prediction by integrating content and structure information BIBAFull-Text 1169-1174
  Sheng Gao; Ludovic Denoyer; Patrick Gallinari
In this paper we address the problem of temporal link prediction, i.e., predicting the appearance of new links, in time-evolving networks. This problem appears in applications such as recommender systems, social network analysis or citation analysis. Link prediction in time-evolving networks is usually based on the topological structure of the network only. We propose here a model which exploits multiple information sources in the network in order to predict link occurrence probabilities as a function of time. The model integrates three types of information: the global network structure, the content of nodes in the network if any, and the local or proximity information of a given vertex. The proposed model is based on a matrix factorization formulation of the problem with graph regularization. We derive an efficient optimization method to learn the latent factors of this model. Extensive experiments on several real world datasets suggest that our unified framework outperforms state-of-the-art methods for temporal link prediction tasks.

Link, graph and relation mining

Towards feature selection in network BIBAFull-Text 1175-1184
  Quanquan Gu; Jiawei Han
Traditional feature selection methods assume that the data are independent and identically distributed (i.i.d.). However, in the real world, a tremendous amount of data is distributed in a network. Existing feature selection methods are not suited for networked data because the i.i.d. assumption no longer holds. This motivates us to study feature selection in a network. In this paper, we present a supervised feature selection method based on Laplacian Regularized Least Squares (LapRLS) for networked data. In detail, we use linear regression to utilize the content information, and adopt graph regularization to consider the link information. The proposed feature selection method aims at selecting a subset of features such that the empirical error of LapRLS is minimized. The resultant optimization problem is a mixed integer program, which is difficult to solve. It is relaxed into an L2,1-norm constrained LapRLS problem and solved by an accelerated proximal gradient descent algorithm. Experiments on benchmark networked data sets show that the proposed feature selection method outperforms traditional feature selection methods and state-of-the-art approaches to learning in networks.
Practical representations for web and social graphs BIBAFull-Text 1185-1190
  Francisco Claude; Susana Ladra
In this paper we focus on representing Web and social graphs. Our work is motivated by the need of mining information out of these graphs, thus our representations do not only aim at compressing the graphs, but also at supporting efficient navigation. This allows us to process bigger graphs in main memory, avoiding the slowdown brought by resorting to external memory. We first show that, by just partitioning the graph and combining two existing techniques for Web graph compression, k²-trees [Brisaboa, Ladra and Navarro, SPIRE 2009] and RePair-Graph [Claude and Navarro, TWEB 2010], exploiting the fact that most links are intra-domain, we obtain the best time/space trade-off for direct and reverse navigation when compared to the state of the art. In social networks, splitting the graph to achieve a good decomposition is not easy. For this case, we explore a new proposal for indexing MPK linearizations [Maserrat and Pei, KDD 2010], which have proven to be an effective way of representing social networks in little space by exploiting common dense subgraphs. Our proposal offers better worst case bounds in space and time, and is also a competitive alternative in practice.
Determining the diameter of small world networks BIBAFull-Text 1191-1196
  Frank W. Takes; Walter A. Kosters
In this paper we present a novel approach to determine the exact diameter (longest shortest path length) of large graphs, in particular of the nowadays frequently studied small world networks. Typical examples include social networks, gene networks, web graphs and internet topology networks. Due to complexity issues, the diameter is often calculated based on a sample of only a fraction of the nodes in the graph, or some approximation algorithm is applied. We instead propose an exact algorithm that uses various lower and upper bounds as well as effective node selection and pruning strategies in order to evaluate only the critical nodes which ultimately determine the diameter. We will show that our algorithm is able to quickly determine the exact diameter of various large datasets of small world networks with millions of nodes and hundreds of millions of links, whereas before only approximations could be given.
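The general idea of bounding eccentricities to avoid a breadth-first search from every node can be sketched as follows. This is a simplified illustration under stated assumptions (a connected, undirected, unweighted graph given as an adjacency dict, and a single pruning rule), not the authors' exact selection and pruning strategies; all names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def exact_diameter(adj):
    """Exact diameter of a connected, undirected, unweighted graph."""
    lower = {v: 0 for v in adj}          # lower bound on each eccentricity
    upper = {v: len(adj) for v in adj}   # upper bound on each eccentricity
    candidates = set(adj)
    best = 0                             # largest eccentricity seen so far
    pick_high = True
    while candidates:
        # Alternate between the largest upper bound and the smallest lower bound.
        if pick_high:
            v = max(candidates, key=lambda n: upper[n])
        else:
            v = min(candidates, key=lambda n: lower[n])
        pick_high = not pick_high
        dist = bfs_distances(adj, v)
        ecc = max(dist.values())
        best = max(best, ecc)
        for w in list(candidates):
            d = dist[w]
            lower[w] = max(lower[w], ecc - d, d)
            upper[w] = min(upper[w], ecc + d)   # triangle inequality
            # A node whose eccentricity cannot exceed `best` cannot change the answer.
            if w == v or upper[w] <= best:
                candidates.discard(w)
    return best

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

The result is exact because every node is either searched from directly or discarded only once its upper bound shows it cannot exceed the best eccentricity already found.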
Detecting anomalies in graphs with numeric labels BIBAFull-Text 1197-1202
  Michael Davis; Weiru Liu; Paul Miller; George Redpath
This paper presents Yagada, an algorithm to search labelled graphs for anomalies using both structural data and numeric attributes. Yagada is explained using several security-related examples and validated with experiments on a physical Access Control database. Quantitative analysis shows that in the upper range of anomaly thresholds, Yagada detects twice as many anomalies as the best-performing numeric discretization algorithm. Qualitative evaluation shows that the detected anomalies are meaningful, representing a combination of structural irregularities and numerical outliers.
Extracting multi-dimensional relations: a generative model of groups of entities in a corpus BIBAFull-Text 1203-1208
  Ching-man Au Yeung; Tomoharu Iwata
Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus only on a single type of relations, real world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document.
Distributed social graph embedding BIBAFull-Text 1209-1214
  Anne-Marie Kermarrec; Vincent Leroy; Gilles Trédan
Distributed recommender systems are becoming increasingly important, for they address both scalability and the Big Brother syndrome. Link prediction is one of the core mechanisms in recommender systems and relies on extracting some notion of proximity between entities in a graph. Applied to social networks, defining a proximity metric between users enables the prediction of potentially relevant future relationships. In this paper, we propose SoCS (Social Coordinate Systems), a fully distributed algorithm that embeds any social graph in a Euclidean space, which can easily be used to implement link prediction. To the best of our knowledge, SoCS is the first system explicitly relying on graph embedding. Inspired by recent works on non-isomorphic embeddings, the SoCS embedding preserves the community structure of the original graph, while being easy to decentralize. Nodes thus get assigned coordinates that reflect their social position. We show through experiments on real and synthetic data sets that these coordinates can be exploited for efficient link prediction.
Classification and annotation in social corpora using multiple relations BIBAFull-Text 1215-1220
  Yann Jacob; Ludovic Denoyer; Patrick Gallinari
We consider the problem of learning to annotate documents with concepts or keywords in content information networks, where the documents may share multiple relations. The concepts associated with a document will depend both on its content and on its neighbors in the network through the different relations. We formalize this problem as single and multi-label classification in a multi-graph, the nodes being the documents and the edges representing the different relations. The proposed algorithm learns to weight the different relations according to their importance for the annotation task. We perform experiments on different corpora corresponding to different annotation tasks on scientific articles, emails and Flickr images and show how the model may take advantage of the rich relational information.

Science, the past, and the future

Plagiarism detection based on structural information BIBAFull-Text 1221-1230
  Efstathios Stamatatos
In this paper a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses mainly content terms to represent documents, the proposed method is based on structural information provided by occurrences of a small list of stopwords (i.e., very frequent words). We show that stopword n-grams are able to capture local syntactic similarities between suspicious and original documents. Moreover, an algorithm for detecting the exact boundaries of plagiarized and source passages is proposed. Experimental results on a publicly-available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified by replacing most of the words or phrases with synonyms to hide the similarity with the source documents.
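A minimal sketch of the stopword n-gram idea: keep only the stopwords of each text, in order, and compare the resulting n-grams. The stopword list, the n-gram length and the example sentences below are illustrative assumptions, not the list or settings used in the paper.

```python
import re

# Hypothetical short stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was",
             "it", "for", "with", "that", "on", "by", "at"}

def stopword_ngrams(text, n=3):
    """Set of n-grams over the sequence of stopwords that occur in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stops = [t for t in tokens if t in STOPWORDS]
    return {tuple(stops[i:i + n]) for i in range(len(stops) - n + 1)}

source = "The results of the study were presented in the report to the committee."
suspect = "The outcomes of the investigation were shown in the document to the panel."

shared = stopword_ngrams(source) & stopword_ngrams(suspect)
```

Although every content word in `suspect` has been replaced by a synonym, the stopword skeleton is unchanged, so all stopword trigrams are shared between the two sentences.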
Studying how the past is remembered: towards computational history through large scale text mining BIBAFull-Text 1231-1240
  Ching-man Au Yeung; Adam Jatowt
History helps us understand the present and even predict the future to a certain extent. Given the huge amount of data about the past, we believe computer science will play an increasingly important role in historical studies, with computational history becoming an emerging interdisciplinary field of research. We attempt to study how the past is remembered through large scale text mining. We achieve this by first collecting a large dataset of news articles about different countries and analyzing the data using computational and statistical tools. We show that analysis of references to the past in news articles allows us to gain a lot of insight into the collective memories and societal views of different countries. Our work demonstrates how various computational tools can assist us in studying history by revealing interesting topics and hidden correlations. Our ultimate objective is to enhance history writing and evaluation with the help of algorithmic support.
Combining machine learning and human judgment in author disambiguation BIBAFull-Text 1241-1246
  Yanan Qian; Yunhua Hu; Jianling Cui; Qinghua Zheng; Zaiqing Nie
Author disambiguation in digital libraries becomes increasingly difficult as the number of publications, and consequently the number of ambiguous author names, keeps growing. Fully automatic author disambiguation approaches cannot give satisfactory results due to the lack of signals in many cases. Furthermore, human judgment on the basis of automatic algorithms is also not suitable because the automatically disambiguated results are often mixed and not understandable for humans. In this paper, we propose a Labeling Oriented Author Disambiguation approach, called LOAD, to combine machine learning and human judgment in author disambiguation. LOAD exploits a framework which consists of high precision clustering, high recall clustering, and top dissimilar clusters selection and ranking. In the framework, supervised learning algorithms are used to train the similarity functions between publications and a clustering algorithm is further applied to generate clusters. To validate the effectiveness and efficiency of the proposed LOAD approach, comprehensive experiments are conducted. Compared to conventional author disambiguation algorithms, LOAD yields much more accurate results to assist human labeling. Further experiments show that the LOAD approach can save labeling time dramatically.
Citation count prediction: learning to estimate future citations for literature BIBAFull-Text 1247-1252
  Rui Yan; Jie Tang; Xiaobing Liu; Dongdong Shan; Xiaoming Li
In most cases, scientists depend on previous literature relevant to their research fields for developing new ideas. However, it is neither wise nor possible to track all existing publications because the volume of literature grows extremely fast. Therefore, researchers generally follow, or cite, only a small proportion of the publications they are interested in. For such a large collection, it is interesting to forecast which kind of literature is more likely to attract scientists' response. In this paper, we use citations as a measure of popularity among researchers and study the problem of Citation Count Prediction (CCP) to examine the characteristics of popularity. Estimation of possible popularity is of great significance and is quite challenging. We have utilized several features capturing fundamental characteristics of highly cited papers and have predicted the future popularity of each publication. We have implemented a system which takes a series of features of a particular publication as input and produces as output the estimated citation count of that article after a given time period. We consider several regression models to formulate the learning process and evaluate their performance based on the coefficient of determination (R-square). Experimental results on a large real data set show that the best predictive model achieves an average predictive performance of 0.740 measured in R-square, which significantly outperforms several alternative algorithms.
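For reference, the coefficient of determination used as the evaluation measure can be computed as below. The citation counts are made-up numbers for illustration and are not taken from the paper.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical true and predicted citation counts for four papers.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
score = r_squared(actual, predicted)
```

A score of 1.0 means perfect prediction, while 0.0 means the model does no better than always predicting the mean citation count.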
Extracting cross references from life science databases for search result ranking BIBAFull-Text 1253-1258
  Anja Bachmann; Rene Schult; Matthias Lange; Myra Spiliopoulou
Scholars in life sciences have to process huge amounts of data in a disciplined and efficient way. These data are spread among thousands of databases which overlap in content but differ substantially with respect to interface, formats and data structure. Search engines have the potential of assisting in data retrieval from these structured sources but fall short of providing a relevance ranking of the results that reflects the needs of life science scholars. One such need is to acquire insights into cross-references among entities in the databases, whereby search hits with many cross-references are expected to be more informative than those with few cross-references. In this work, we investigate to what extent this expectation holds. We propose BioXREF, a method that extracts cross-references from multiple life science databases by combining targeted crawling, pointer chasing, sampling and information extraction. We study the retrieval quality of our method and the relationship between manually crafted relevance ranking and relevance ranking based on cross-references, and report on first, promising results.
Extracting collective expectations about the future from large text collections BIBAFull-Text 1259-1264
  Adam Jatowt; Ching-man Au Yeung
News articles often contain information about the future. Given the huge volume of information available nowadays, an automatic way for extracting and summarizing future-related information is desirable. Such information will allow people to obtain a collective image of the future, to recognize possible future scenarios and be prepared for the future events. We propose a model-based clustering algorithm for detecting future events based on information extracted from a text corpus. The algorithm takes into account both textual and temporal similarity of sentences. We demonstrate that our algorithm can be used to discover future events and estimate their probabilities over time.

Information extraction and entities

Towards a unified solution: data record region detection and segmentation BIBAFull-Text 1265-1274
  Lidong Bing; Wai Lam; Yuan Gu
Although the task of data record extraction from Web pages has been studied extensively, existing methods still fail to handle many pages due to their complexity in format or layout. In this paper, we propose a unified method to tackle this task by addressing several key issues in a uniform manner. A new search structure, named the Record Segmentation Tree (RST), is designed, and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. Another characteristic of our method, which differs significantly from previous work, is that it can effectively handle complicated and challenging data record regions. This is achieved by generating subtree groups dynamically from the RST structure during the search process. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. Extensive experiments are conducted on four data sets, including flat, nested, and intertwined records. The experimental results demonstrate that our method achieves higher accuracy compared with three state-of-the-art methods.
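One way to picture an edit distance whose basic unit is a DOM node rather than a character is a Levenshtein distance over sequences of tag names. The sketch below assumes unit costs for insertion, deletion and substitution, and a flattened tag sequence per record; the paper's actual cost function is not given in the abstract.

```python
def token_edit_distance(a, b):
    """Levenshtein distance where each element (e.g. a DOM tag name) is one unit."""
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        curr = [i]
        for j, token_b in enumerate(b, 1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(prev[j] + 1,          # delete token_a
                            curr[j - 1] + 1,      # insert token_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

# Hypothetical flattened tag sequences of two candidate data records.
record_1 = ["div", "img", "a", "span"]
record_2 = ["div", "a", "span", "span"]
distance = token_edit_distance(record_1, record_2)
```

Treating a whole tag as one token keeps the distance insensitive to tag-name length, which a character-level string edit distance would not be.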
Fast metadata-driven multiresolution tensor decomposition BIBAFull-Text 1275-1284
  Claudio Schifanella; K. Selçuk Candan; Maria Luisa Sapino
Tensors (multi-dimensional arrays) are widely used for representing high-order dimensional data, in applications ranging from social networks and sensor data to Internet traffic. Multi-way data analysis techniques, in particular tensor decompositions, allow extraction of hidden correlations among multi-way data and thus are key components of many data analysis frameworks. Intuitively, these algorithms can be thought of as multi-way clustering schemes, which consider multiple facets of the data in identifying clusters, their weights, and contributions of each data element. Unfortunately, algorithms for fitting multi-way models are, in general, iterative and very time consuming. In this paper, we observe that, in many applications, there is a priori background knowledge (or metadata) about one or more domain dimensions. This metadata is often in the form of a hierarchy that clusters the elements of a given data facet (or mode). We investigate whether such single-mode data hierarchies can be used to boost the efficiency of the tensor decomposition process, without significant impact on the final decomposition quality. We consider each domain hierarchy as a guide to help provide higher- or lower-resolution views of the data in the tensor on demand, and we rely on these metadata-induced multi-resolution tensor representations to develop a multiresolution approach to tensor decomposition. We focus on an alternating least squares (ALS) based implementation of the PARAllel FACtors (PARAFAC) decomposition (which decomposes a tensor into a diagonal tensor and a set of factor matrices). Experimental results show that, when the available metadata is used as a rough guide, the proposed multiresolution method helps fit PARAFAC models with consistent (for both dense and sparse tensor representations, under different parameter settings) savings in execution time and memory consumption, while preserving the quality of the decomposition.
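For readers unfamiliar with PARAFAC, the standard three-way form and its ALS update can be written as below. The notation (factor matrices A, B, C, weights λ, rank R) is the common textbook convention and is an assumption here; the abstract does not fix any symbols.

```latex
% Rank-R PARAFAC model of a three-way tensor: a diagonal core \Lambda
% (weights \lambda_r) combined with factor matrices A, B, C.
\mathcal{X} \;\approx\; \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r
  \;=\; \Lambda \times_1 A \times_2 B \times_3 C

% One ALS step: fix B and C, then solve a linear least-squares problem for A.
% X_{(1)} is the mode-1 unfolding, \odot the Khatri-Rao product,
% * the elementwise product, and \dagger the pseudo-inverse.
A \;\leftarrow\; X_{(1)} \, (C \odot B) \, \bigl[ (C^{\top} C) * (B^{\top} B) \bigr]^{\dagger}
```

The analogous updates for B and C are applied in turn and the cycle repeats until the fit stops improving, which is why the procedure is iterative and costly on large tensors.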
Enabling information extraction by inference of regular expressions from sample entities BIBAFull-Text 1285-1294
  Falk Brauer; Robert Rieger; Adrian Mocan; Wojciech M. Barczynski
Regular expressions are the dominant technique to extract business relevant entities (e.g., invoice numbers or product names) from text data (e.g., invoices), since these entity types often follow a strict underlying syntactical pattern. However, the manual construction of regular expressions that guarantee a high recall and precision is a tedious manual task and requires expert knowledge. In this paper, we propose an approach that automatically infers regular expressions from a set of (positive) sample entities, which in turn can be derived either from enterprise databases (e.g., a product catalog) or annotated documents (e.g., historical invoices). The main innovation of our approach is that it learns effective regular expressions that can be easily interpreted and modified by a user. The effectiveness is obtained by a novel method that weights dependent entity features of different granularity (i.e. on character and token level) against each other and selects the most suitable ones to form a regular expression.
Mining entity translations from comparable corpora: a holistic graph mapping approach BIBAFull-Text 1295-1304
  Jinhan Kim; Long Jiang; Seung-won Hwang; Young-In Song; Ming Zhou
This paper addresses the problem of mining named entity translations from comparable corpora, specifically, mining English and Chinese named entity translation. We first observe that existing approaches use one or more of the following named entity similarity metrics: entity, entity context, and relationship. Inspired by this observation, in this paper, we propose a new holistic approach, by (1) combining all similarity types used and (2) additionally considering relationship context similarity between pairs of named entities, a missing quadrant in the taxonomy of similarity metrics. We abstract the named entity translation problem as the matching of two named entity graphs extracted from the comparable corpora. Specifically, named entity graphs are first constructed from comparable corpora to extract relationship between named entities. Entity similarity and entity context similarity are then calculated from every pair of bilingual named entities. A reinforcing method is utilized to reflect relationship similarity and relationship context similarity between named entities. According to our experimental results, our holistic graph-based approach significantly outperforms previous approaches.
Max margin learning on domain-independent web information extraction BIBAFull-Text 1305-1310
  Bin Zhao; Xiaoxin Yin; Eric P. Xing
Domain-independent web information extraction can be addressed as a structured prediction problem where we learn a mapping function from an input web page to the structured and interdependent output variables, labeling each block on the page. In this paper, built upon an HTML parser of Internet Explorer that parses and renders a web page based on HTML tags and visual appearance, we propose a max margin learning approach for web information extraction. Specifically, the output of the parser is a vision tree, which is similar to a DOM tree but with visual information, i.e., how each node is displayed. Based on this hierarchical structure, we develop a max margin learning method for labeling each of its nodes. Due to the rich connections between blocks on the web page, we further introduce edges that connect spatially adjacent nodes on the vision tree, complicating the problem into a cyclic graph labeling task. A max margin learning method on cyclic graphs is developed for this problem, where loopy belief propagation is used for approximate inference. Experimental results on web data extraction show the feasibility and promise of our approach.

Queries, questions and tags mining

Finding dimensions for queries BIBAFull-Text 1311-1320
  Zhicheng Dou; Sha Hu; Yulong Luo; Ruihua Song; Ji-Rong Wen
We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query's top retrieved documents in the style of lists, and query dimensions can be mined out by aggregating these significant lists. Experimental results show that a large number of lists do exist in the top results, and query dimensions generated by grouping these lists are useful for users to learn interesting knowledge about the queries.
Large-scale question classification in cQA by leveraging Wikipedia semantic knowledge BIBAFull-Text 1321-1330
  Li Cai; Guangyou Zhou; Kang Liu; Jun Zhao
With the flourishing of community-based question answering (cQA) services like Yahoo! Answers, more and more web users turn to these sites to satisfy their information needs. Understanding the information needs that users express through their questions is crucial to information providers. Question classification in cQA is studied for this purpose. However, there are two main difficulties in applying traditional methods (question classification in TREC QA and text classification) to cQA: (1) Traditional methods confine themselves to classifying a text or question into two or a few predefined categories, while in cQA the number of categories is much larger; Yahoo! Answers, for example, contains 1,263 categories. Our empirical results show that classification accuracy decreases dramatically as the number of categories grows to a moderate size. (2) Unlike normal texts, questions in cQA are very short and, because of data sparseness, cannot provide sufficient word co-occurrence or shared information for a good similarity measure. In this paper, we propose a two-stage approach for question classification in cQA that can tackle the difficulties of the traditional methods. In the first stage, we perform a search process to prune the large set of categories and focus our classification effort on a small subset. In the second stage, we enrich questions by leveraging Wikipedia semantic knowledge to tackle the data sparseness. As a result, the classification model is trained on the enriched small subset. We demonstrate the performance of our proposed method on Yahoo! Answers with 1,263 categories. The experimental results show that our proposed method significantly outperforms the baseline method (with an error reduction of 23.21%).
Hierarchical tag visualization and application for tag recommendations BIBAFull-Text 1331-1340
  Yang Song; Baojun Qiu; Umer Farooq
Social bookmarking sites typically visualize user-generated tags as tag clouds. While tag clouds effectively show the relative frequency and thus popularity of tags, they fail to convey two aspects to the users: (1) the similarity between tags, and (2) the abstractness of tags. We suggest an alternative to tag clouds known as tag hierarchies. Tag hierarchies are based on a minimum evolution-based greedy algorithm for tag hierarchy construction, which iteratively includes optimal tags into the tree that introduce minimum changes to the existing taxonomy. Our algorithm also uses a global tag ranking method to order tags according to their levels of abstractness as well as popularity such that more abstract tags will appear at higher levels in the taxonomy. Based on the tag hierarchy, we derive a new tag recommendation algorithm, which is a structure-based approach that does not require heavily trained models and thus is highly efficient. User studies and quantitative analysis suggest that (1) the tag hierarchy can potentially reduce the user's tagging time in comparison to tag clouds and other tag tree structures, and (2) the tag recommendation algorithm significantly outperforms existing content-based methods in quality.
Perspective hierarchical dirichlet process for user-tagged image modeling BIBAFull-Text 1341-1346
  Xin Chen; Xiaohua Hu; Yuan An; Zunyan Xiong; Tingting He; E. K. Park
In this paper, we propose a perspective Hierarchical Dirichlet Process (pHDP) model to deal with user-tagged image modeling. The contribution is two-fold. Firstly, we associate image features with image tags. Secondly, we incorporate the users' perspectives into the image tag generation process and introduce new latent variables to determine whether an image tag is generated from the users' perspectives or from the image content. Therefore, the model is able to extract both embedded semantic components and users' perspectives from user-tagged images. Based on the proposed pHDP model, we achieve automatic image tagging with users' perspectives. Experimental results show that the pHDP model achieves better image tagging performance compared to state-of-the-art topic models.
Asking what no one has asked before: using phrase similarities to generate synthetic web search queries BIBAFull-Text 1347-1352
  Marius Pasca
This paper introduces a method for automatically inferring meaningful, not-yet-submitted queries. The inferred queries fill some of the knowledge gaps between documents, on one hand, and known (i.e., already-submitted) queries, on the other hand. Thus, the inferred queries expand query logs and increase their coverage. New candidate queries are over-generated from known queries via phrase similarity data, then filtered against the set of known queries. The accuracy of the generated queries is computed using open-domain questions from standard question answering evaluation sets. Over the ranked lists of questions inferred for each of the evaluation questions, the precision reaches 0.9 at rank 50. The set of inferred queries is more than twice as large as the set of input queries.

Preparing, mining and evaluating with and for different views

Simultaneous joint and conditional modeling of documents tagged from two perspectives BIBAFull-Text 1353-1362
  Pradipto Das; Rohini Srihari; Yun Fu
This paper explores correspondence and mixture topic modeling of documents tagged from two different perspectives. There has been ongoing work in topic modeling of documents with tags (tag-topic models) where words and tags typically reflect a single perspective, namely document content. However, words in documents can also be tagged from different perspectives, for example, syntactic perspective as in part-of-speech tagging or an opinion perspective as in sentiment tagging. The models proposed in this paper are novel in: (i) the consideration of two different tag perspectives -- a document level tag perspective that is relevant to the document as a whole and a word level tag perspective pertaining to each word in the document; (ii) the attribution of latent topics with word level tags and labeling latent topics with images in case of multimedia documents; and (iii) discovering the possible correspondence of the words to document level tags. The proposed correspondence tag-topic model shows better predictive power i.e. higher likelihood on heldout test data than all existing tag topic models and even a supervised topic model. To evaluate the models in practical scenarios, quantitative measures between the outputs of the proposed models and the ground truth domain knowledge have been explored. Manually assigned (gold standard) document category labels in Wikipedia pages are used to validate model-generated tag suggestions using a measure of pairwise concept similarity within an ontological hierarchy like WordNet. Using a news corpus, automatic relationship discovery between person names was performed and compared to a robust baseline.
External evaluation measures for subspace clustering BIBAFull-Text 1363-1372
  Stephan Günnemann; Ines Färber; Emmanuel Müller; Ira Assent; Thomas Seidl
Knowledge discovery in databases requires not only development of novel mining techniques but also fair and comparable quality assessment based on objective evaluation measures. Especially in young research areas where no common measures are available, researchers are unable to provide a fair evaluation. Typically, publications glorify the high quality of one approach only justified by an arbitrary evaluation measure. However, such conclusions can only be drawn if the evaluation measures themselves are fully understood. In this paper, we provide the basis for systematic evaluation in the emerging research area of subspace clustering. We formalize general quality criteria for subspace clustering measures not yet addressed in the literature. We compare the existing external evaluation methods based on these criteria and pinpoint limitations. We propose a novel external evaluation measure which meets the requirements in form of quality properties. In thorough experiments we empirically show characteristic properties of evaluation measures. Overall, we provide a set of evaluation measures that fulfill the general quality criteria as recommendation for future evaluations. All measures and datasets are provided on our website and are integrated in our evaluation framework.
Behavior-driven clustering of queries into topics BIBAFull-Text 1373-1382
  Luca Maria Aiello; Debora Donato; Umut Ozertem; Filippo Menczer
Categorization of web-search queries in semantically coherent topics is a crucial task to understand the interest trends of search engine users and, therefore, to provide more intelligent personalization services. Query clustering usually relies on lexical and clickthrough data, while the information originating from the user actions in submitting their queries is currently neglected. In particular, the intent that drives users to submit their requests is an important element for meaningful aggregation of queries. We propose a new intent-centric notion of topical query clusters and we define a query clustering technique that differs from existing algorithms in both methodology and nature of the resulting clusters. Our method extracts topics from the query log by merging missions, i.e., activity fragments that express a coherent user intent, on the basis of their topical affinity. Our approach works in a bottom-up way, without any a-priori knowledge of topical categorization, and produces good quality topics compared to state-of-the-art clustering techniques. It can also summarize topically-coherent missions that occur far away from each other, thus enabling a more compact user profiling on a topical basis. Furthermore, such a topical user profiling discriminates the stream of activity of a particular user from the activity of others, with a potential to predict future user search activity.
Discovering customer intent in real-time for streamlining service desk conversations BIBAFull-Text 1383-1388
  Ullas Nambiar; Tanveer Faruquie; L. Venkata Subramaniam; Sumit Negi; Ganesh Ramakrishnan
Businesses require contact center agents to meet pre-specified customer satisfaction levels while keeping the cost of operations low or meeting sales targets, objectives that end up being complementary and difficult to achieve in real-time. In this paper, we describe a speech-enabled real-time conversation management system that tracks customer-agent conversations to detect user intent (e.g. gathering information, likely to buy, etc.), which can help agents decide the best sequence of actions for that call. We present an entropy-based decision support system that parses a text stream generated in real-time during an audio conversation and identifies the first instance at which the intent becomes distinct enough for the agent to take subsequent actions. We provide evaluation results displaying the efficiency and effectiveness of our system.
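The notion of finding the first point at which the intent becomes distinct enough can be illustrated with a toy entropy threshold. Everything below is a hypothetical sketch: the intents, the per-word likelihoods, the naive Bayes style update and the threshold of 0.5 bits are assumptions for illustration, not the system described in the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def first_distinct_point(words, likelihoods, intents, threshold=0.5):
    """Index of the first word after which the intent entropy drops below threshold."""
    posterior = {intent: 1.0 / len(intents) for intent in intents}
    for pos, word in enumerate(words):
        for intent in intents:
            # Words without evidence leave the distribution unchanged.
            posterior[intent] *= likelihoods.get(word, {}).get(intent, 1.0)
        total = sum(posterior.values())
        posterior = {intent: p / total for intent, p in posterior.items()}
        if entropy(posterior.values()) < threshold:
            return pos, max(posterior, key=posterior.get)
    return None

intents = ["buy", "info"]
likelihoods = {"price": {"buy": 0.6, "info": 0.4},
               "order": {"buy": 0.9, "info": 0.1}}
stream = ["what", "price", "order", "today"]
decision = first_distinct_point(stream, likelihoods, intents)
```

Here the entropy stays near 1 bit until the word "order" arrives, at which point the distribution is peaked enough on "buy" to cross the threshold.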
Sparse structured probabilistic projections for factorized latent spaces BIBAFull-Text 1389-1394
  Xinquan Qu; Xinlei Chen
Building a common representation for several related data sets is an important problem in multi-view learning. CCA and its extensions have been shown to be effective in finding the shared variation among all data sets. However, these models generally fail to exploit the common structure of the data when the views contain private information. Recently, methods that explicitly factorize the information into a shared part and private parts have been proposed, but they presume prior knowledge about the latent space, which is usually impossible to obtain. In this paper, we propose a probabilistic model that simultaneously learns the structure of the latent space and factorizes the information correctly, so that prior knowledge of the latent space is unnecessary. Furthermore, as a probabilistic model, our method is able to deal with the missing data problem in a natural way. We show that our approach attains the performance of state-of-the-art methods on the task of human pose estimation when the motion capture view is completely missing, and significantly improves the inference accuracy with only a few observed data.

Information extraction and semantic techniques

Automated feature generation from structured knowledge BIBAFull-Text 1395-1404
  Weiwei Cheng; Gjergji Kasneci; Thore Graepel; David Stern; Ralf Herbrich
The prediction accuracy of any learning algorithm highly depends on the quality of the selected features; but often, the task of feature construction and selection is tedious and nonscalable. In recent years, however, there have been numerous projects with the goal of constructing general-purpose or domain-specific knowledge bases with entity-relationship-entity triples extracted from various Web sources or collected from user communities, e.g. YAGO, DBpedia, Freebase, UMLS, etc. This paper advocates the simple and yet far-reaching idea that the structured knowledge contained in such knowledge bases can be exploited to automatically extract features for general learning tasks. We introduce an expressive graph-based language for extracting features from such knowledge bases and a theoretical framework for constructing feature vectors from the extracted features. Our experimental evaluation on different learning scenarios provides evidence that the features derived through our framework can considerably improve the prediction accuracy, especially when the labeled data at hand is sparse.
Filtering and clustering relations for unsupervised information extraction in open domain BIBAFull-Text 1405-1414
  Wei Wang; Romaric Besançon; Olivier Ferret; Brigitte Grau
Information Extraction has recently been extended to new areas by loosening the constraints on the strict definition of the extracted information, allowing the design of more open information extraction systems. In this new domain of unsupervised information extraction, we focus on the task of extracting and characterizing a priori unknown relations between a given set of entity types. One of the challenges of this task is to deal with the large number of candidate relations when extracting them from a large corpus. We propose in this paper an approach for the filtering of such candidate relations based on heuristics and machine learning models. More precisely, we show that the best model for achieving this task is a Conditional Random Field model according to evaluations performed on a manually annotated corpus of about one thousand relations. We also tackle the problem of identifying semantically similar relations by clustering large sets of them. Such clustering is achieved by combining a classical clustering algorithm and a method for the efficient identification of highly similar relation pairs. Finally, we evaluate the impact of our filtering of relations on this semantic clustering with both internal measures and external measures. Results show that the filtering procedure doubles the recall of the clustering while keeping the same precision.
Facilitating pattern discovery for relation extraction with semantic-signature-based clustering BIBAFull-Text 1415-1424
  Yunyao Li; Vivian Chu; Sebastian Blohm; Huaiyu Zhu; Howard Ho
Hand-crafted textual patterns have been the mainstay device of practical relation extraction for decades. However, there has been little work on reducing the manual effort involved in the discovery of effective textual patterns for relation extraction. In this paper, we propose a clustering-based approach to facilitate the pattern discovery for relation extraction. Specifically, we define the notion of semantic signature to represent the most salient features of a textual fragment. We then propose a novel clustering algorithm based on semantic signature, S2C, and its enhancement S2C+. Experiments on two real-world data sets show that, when compared with k-means clustering, S2C and S2C+ are at least an order of magnitude faster, while generating high quality clusters that are at least comparable to the best clusters generated by k-means without requiring any manual tuning. Finally, a user study confirms that our clustering-based approach can indeed help users discover effective textual patterns for relation extraction with only a fraction of the manual effort required by the conventional approach.
Finding all justifications of OWL entailments using TMS and MapReduce BIBAFull-Text 1425-1434
  Gang Wu; Guilin Qi; Jianfeng Du
Finding all justifications of an OWL entailment is an important reasoning service for explaining logical inconsistencies. In this paper, we consider finding all justifications of an entailment in OWL pD* fragment, which is a fragment of OWL that makes possible decidable rule extensions of OWL. We first propose a novel approach to find all justifications of OWL pD* entailments using TMS and show the complexity of this approach. This approach is limited by the hardware capabilities of standalone systems. In order to improve its scalability to handle large scale semantic data, we optimize the proposed approach by exploiting the MapReduce technology. We implement our approach and the optimization, and do experiments on synthetic and real world data sets. Evaluation results show that our approach has the ability to scale to more than one billion triples.

Data on the web

Estimating selectivity for joined RDF triple patterns BIBAFull-Text 1435-1444
  Hai Huang; Chengfei Liu
A fundamental problem related to RDF query processing is selectivity estimation, which is crucial to query optimization for determining a join order of RDF triple patterns. In this paper we focus on selectivity estimation for SPARQL graph patterns. Previous work adopts the join uniformity assumption when estimating the selectivity of joined triple patterns. This assumption leads to highly inaccurate estimations in cases where properties in SPARQL graph patterns are correlated. We take into account the dependencies among properties in SPARQL graph patterns and propose a more accurate estimation model. Since star and chain query patterns are common in SPARQL graph patterns, we first focus on these two basic patterns and propose to use a Bayesian network and a chain histogram, respectively, for estimating their selectivity. Then, for estimating the selectivity of an arbitrary SPARQL graph pattern, we design algorithms for maximally using the precomputed statistics of the star paths and chain paths. The experiments show that our method outperforms existing approaches in accuracy.
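Illustrative note (not from the paper): the effect of correlated properties on a star pattern can be seen in a toy calculation. The sketch below uses a simplified independence-style estimate as a stand-in for the uniformity assumption and compares it with the true fraction of subjects that carry all properties. The data, the property names, and the definition of selectivity as a fraction of subjects are assumptions made for this example only.

```python
def star_selectivity(subject_props, props):
    """Return (independence_estimate, actual) for a star pattern that asks
    for subjects having every property in `props`.

    subject_props maps each subject to the set of properties it has."""
    n = len(subject_props)
    estimate = 1.0
    for p in props:
        # Per-property fraction of subjects, multiplied as if independent.
        estimate *= sum(1 for ps in subject_props.values() if p in ps) / n
    # True fraction of subjects that have all requested properties.
    actual = sum(1 for ps in subject_props.values()
                 if all(p in ps for p in props)) / n
    return estimate, actual

# Hypothetical data: "author" and "title" always occur together.
data = {
    "s1": {"author", "title"},
    "s2": {"author", "title"},
    "s3": {"name"},
    "s4": {"name"},
}
estimate, actual = star_selectivity(data, ["author", "title"])
```

Here each property alone covers half of the subjects, so the independence-style estimate is a quarter, while half of the subjects actually match; this is the kind of gap that a correlation-aware model aims to close.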
Efficient resource attribute retrieval in RDF triple stores BIBAFull-Text 1445-1454
  Andreas Brodt; Oliver Schiller; Bernhard Mitschang
The W3C Resource Description Framework (RDF) is gaining popularity for its ability to manage semi-structured data without a predefined database schema. So far, most RDF query processors have concentrated on finding complex graph patterns in RDF, which typically involves a high number of joins. This works very well to query resources by the relations between them. Yet, obtaining a record-like view on the attributes of resources, as natively supported by RDBMS, imposes unnecessary performance burdens, as the individual attributes must be joined to assemble the final result records. We present an approach to retrieve the attributes of resources efficiently. We first determine the resources in question and then retrieve all their attributes efficiently at once, exploiting contiguous storage in RDF indexes. In addition, we present an index structure which is specifically designed for RDF attribute retrieval. Our measurements show that our approach is clearly superior for larger numbers of attributes.
Effective stratification for low selectivity queries on deep web data sources BIBAFull-Text 1455-1464
  Fan Wang; Gagan Agrawal
We study the problem of estimating the result of an aggregation query with low selectivity when a data source only supports limited data accesses. Existing stratified sampling techniques cannot be applied to such a problem since either it is very hard, if not impossible, to gather certain critical statistics from such a data source, or more importantly, the selective attribute of the query may not be queriable on the data source. In such cases, we need an effective mechanism to stratify the data and form homogeneous strata with respect to the selective attribute of the query, despite not being able to query the data source with the selective attribute.
   This paper presents and evaluates a stratification method for this problem utilizing a queriable auxiliary attribute. The breaking points for the stratification are computed based on a novel Bayesian Adaptive Harmony Search algorithm. This method derives from the existing Harmony Search method, but includes a novel objective function and introduces a technique for dynamically adapting key parameters of this method. Our experiments show that the estimation accuracy achieved using our method is consistently higher than 95%, even for a 0.01% selectivity query and even when there is only a low correlation between the auxiliary attribute and the selective attribute. Furthermore, our method achieves at least a five-fold reduction in estimation error over three other methods, for the same sampling cost.
Finding information nebula over large networks BIBAFull-Text 1465-1474
  Lijun Chang; Jeffrey Xu Yu; Lu Qin; Yuanyuan Zhu; Haixun Wang
Social and information networks have been extensively studied over the years. In this paper, we concentrate on a large information network that is composed of entities and relationships, where entities are associated with sets of keyword terms (kterms) to specify what they are, and relationships describe the link structure among entities, which can be very complex. Our work is motivated by, but different from, existing work that finds a best subgraph to describe how user-specified entities are connected. We compute an information nebula (cloud), which is a set of top-K kterms P that are most correlated to a set of user-specified kterms Q, over a large information network. Our goal is to find how kterms are correlated given the complex information network among entities. Computing the information nebula requires us to take all possible kterms into consideration for the top-K kterms selection, and to measure the similarity between kterms by considering all possible subgraphs that connect them instead of the best single one. In this work, we compute the information nebula using a global structural-context similarity, and our similarity measure is independent of connection subgraphs. To the best of our knowledge, among the link-based similarity methods, none of the existing work considers similarity between two sets of nodes or two kterms. We propose new algorithms to find top-K kterms P for a given set of kterms Q based on the global structural-context similarity, without computing all the similarity scores of kterms in the large information network. We performed extensive performance studies using large real datasets, and confirmed the effectiveness and efficiency of our approach.
Efficient methods for finding influential locations with adaptive grids BIBAFull-Text 1475-1484
  Da Yan; Raymond Chi-Wing Wong; Wilfred Ng
Given a set S of servers and a set C of clients, an optimal-location query returns a location where a new server can attract the greatest number of clients. Optimal-location queries are important in many real-life applications, such as mobile service planning or resource distribution in an area. Previous studies assume that a client always visits its nearest server, which is too strict to be true in reality. In this paper, we relax this assumption and propose a new model to tackle this problem. We further generalize the problem to finding top-k optimal locations. The main challenge is that even the fastest approach in existing studies takes hours to answer an optimal-location query on a typical real-world dataset, which significantly limits the applications of the query. Using our relaxed model, we design an efficient grid-based approximation algorithm for these queries called FILM (Fast Influential Location Miner), which is orders of magnitude faster than the best-known previous work, while the number of clients attracted by a new server in the result location often exceeds 98% of the optimal. The algorithm is extended to finding k influential locations. Extensive experiments are conducted to show the efficiency and effectiveness of FILM on both real and synthetic datasets.

Query processing and optimization

Semi-indexing semi-structured data in tiny space BIBAFull-Text 1485-1494
  Giuseppe Ottaviano; Roberto Grossi
Semi-structured textual formats are gaining increasing popularity for the storage of document collections and rich logs. Their flexibility comes at the cost of having to load and parse a document entirely even if just a small part of it needs to be accessed. For instance, in data analytics massive collections are usually scanned sequentially, selecting a small number of attributes from each document. We propose a technique to attach to a raw, unparsed document (even in compressed form) a "semi-index": a succinct data structure that supports operations on the document tree at speed comparable with an in-memory deserialized object, thus bridging textual formats with binary formats. After describing the general technique, we focus on the JSON format: our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.
Evaluation of set-based queries with aggregation constraints BIBAFull-Text 1495-1504
  Quoc Trung Tran; Chee-Yong Chan; Guoping Wang
Many applications require finding a set of items of interest with respect to some aggregation constraints. For example, a tourist might want to find a set of places of interest to visit in a city such that the total expected duration is no more than six hours and the total cost is minimized. We refer to such queries as SAC queries, for "set-based with aggregation constraints" queries. The usefulness of SAC queries is evidenced by the many variations of SAC queries that have been studied, which differ in the number and types of constraints supported. In this paper, we make two contributions to SAC query evaluation. We first establish the hardness of evaluating SAC queries with multiple count constraints and present a novel, pseudo-polynomial time algorithm for evaluating a non-trivial fragment of SAC queries with multiple sum constraints and at most one of either count, group-by, or content constraint. We also propose a heuristic approach for evaluating general SAC queries. The effectiveness of our proposed solutions is demonstrated by an experimental performance study.
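Illustrative note (not from the paper): a small instance of the tourist example can be solved with a textbook knapsack-style dynamic program, which is pseudo-polynomial in the duration limit. The sketch below assumes one count constraint (exactly k places), one sum constraint (total duration at most a limit, with integer durations), and cost minimization. It is not the algorithm proposed in the paper, and the function name and data are hypothetical.

```python
def min_cost_set(items, k, max_duration):
    """items: list of (duration, cost) pairs with non-negative integer durations.

    Returns the minimum total cost of choosing exactly k items whose total
    duration is at most max_duration, or None if no such set exists."""
    INF = float("inf")
    # best[c][d] = minimum cost of choosing c items with total duration exactly d
    best = [[INF] * (max_duration + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for duration, cost in items:
        # Go through counts in decreasing order so each item is used at most once.
        for c in range(k - 1, -1, -1):
            for d in range(max_duration - duration, -1, -1):
                if best[c][d] < INF:
                    candidate = best[c][d] + cost
                    if candidate < best[c + 1][d + duration]:
                        best[c + 1][d + duration] = candidate
    answer = min(best[k])
    return None if answer == INF else answer

# Hypothetical places as (duration in hours, cost).
places = [(2, 10), (3, 5), (4, 8), (5, 1)]
cheapest_pair = min_cost_set(places, k=2, max_duration=6)
```

The table has (k + 1) x (max_duration + 1) cells, so the running time grows with the numeric value of the duration limit rather than with its encoding length.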
Index structures and top-k join algorithms for native keyword search databases BIBAFull-Text 1505-1514
  Günter Ladwig; Thanh Tran
For supporting keyword search on structured data, current solutions require large indexes to be built that redundantly store subgraphs called neighborhoods. Further, for exploring keyword search results, large graphs have to be loaded into memory. We propose a solution, which employs much more compact index structures for neighborhood lookups. Using these indexes, we reduce keyword search result exploration to the traditional database problem of top-k join processing, enabling results to be computed efficiently. In particular, this computation can be performed on data streams successively loaded from disk (i.e., does not require the entire input to be loaded at once into memory). For supporting this, we propose a top-k procedure based on the rank join operator, which not only computes the k-best results, but also selects query plans in a top-k fashion during the process. In experiments using large real-world datasets, our solution reduced storage requirements and also outperformed the state-of-the-art in terms of performance and scalability.
Optimized processing of multiple aggregate continuous queries BIBAFull-Text 1515-1524
  Shenoda Guirguis; Mohamed A. Sharaf; Panos K. Chrysanthis; Alexandros Labrinidis
Data Streams Management Systems are designed to support monitoring applications, which require the processing of hundreds of Aggregate Continuous Queries (ACQs). These ACQs typically have different time granularities, with possibly different selection predicates and group-by attributes. In order to achieve scalability in the presence of heavy workloads, in this paper, we introduce the concept of 'Weaveability' as an indicator of the potential gains of sharing the processing of ACQs. We then propose Weave Share, a cost-based optimizer that exploits weaveability to optimize the shared processing of ACQs. Our experimental analysis shows that Weave Share outperforms the alternative sharing schemes generating up to four orders of magnitude better quality plans. Finally, we describe a practical implementation of the Weave Share optimizer.
XQuery optimization based on program slicing BIBAFull-Text 1525-1534
  Jesus M. Almendros-Jimenez; Josep Silva; Salvador Tamarit
XQuery has become the standard query language for XML. The efforts put into this language have produced mature and efficient implementations of XQuery processors. However, in practice the efficiency of XQuery programs is strongly dependent on the ability of the programmer to combine different queries, which often affect several XML sources that in turn can be distributed in different branches of the organization. Therefore, techniques to reduce the amount of data loaded and also to reduce the intermediate structures computed by queries are a necessity. In this work we propose a novel technique that allows the programmer to automatically optimize a query in such a way that unnecessary intermediate computations are avoided, and, in addition, it identifies the paths in the source XML documents that are really required to resolve the query.

Semantic web and information retrieval

Learning-based relevance feedback for web-based relation completion BIBAFull-Text 1535-1540
  Zhixu Li; Laurianne Sitbon; Xiaofang Zhou
In a pilot application based on a web search engine, called Web-based Relation Completion (WebRC), we propose to join two columns of entities linked by a predefined relation by mining knowledge from the web through a web search engine. To achieve this, a novel retrieval task, Relation Query Expansion (RelQE), is modelled: given an entity (query), the task is to retrieve documents containing entities in a predefined relation to the given one. Solving this problem entails expanding the query before submitting it to a web search engine to ensure that mostly documents containing the linked entity are returned in the top K search results. In this paper, we propose a novel Learning-based Relevance Feedback (LRF) approach to solve this retrieval task. Expansion terms are learned from training pairs of entities linked by the predefined relation and applied to new entity-queries to find entities linked by the same relation. After describing the approach, we present experimental results on real-world web data collections, which show that the LRF approach always improves the precision of top-ranked search results, by up to 8.6 times the baseline. Using LRF, WebRC also shows performance well above the baseline.
Categorising logical differences between OWL ontologies BIBAFull-Text 1541-1546
  Rafael S. Gonçalves; Bijan Parsia; Ulrike Sattler
The analysis of changes between OWL ontologies (in the form of a diff) is an important service for ontology engineering. A purely syntactic analysis of changes is insufficient to distinguish between changes that have logical impact and those that do not. The current state of the art in semantic diffing ignores logically ineffectual changes and lacks any further characterisation of even significant changes. We present a diff method based on an exhaustive categorisation of effectual and ineffectual changes between ontologies. In order to verify the applicability of our approach we apply it to 88 versions of the National Cancer Institute (NCI) Thesaurus (NCIt), and demonstrate that all categories are realized throughout the corpus. Based on the outcome of the NCIt study we argue that the devised categorisation of changes is helpful for ontology engineers and their understanding of changes carried out between ontologies.
ReDRIVE: result-driven database exploration through recommendations BIBAFull-Text 1547-1552
  Marina Drosou; Evaggelia Pitoura
Typically, users interact with database systems by formulating queries. However, users often do not have a clear understanding of their information needs or the exact content of the database; thus, their queries are of an exploratory nature. In this paper, we propose assisting users in database exploration by recommending to them additional items that are highly related to the items in the result of their original query. Such items are computed based on the most interesting sets of attribute values (or faSets) that appear in the result of the original user query. The interestingness of a faSet is defined based on its frequency both in the query result and in the database instance. Database frequency estimations rely on a novel approach that employs an ε-tolerance closed rare faSets representation. We report evaluation results of the efficiency and effectiveness of our approach on both real and synthetic datasets.
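Illustrative note (not from the paper): one simple way to score a faSet from the two frequencies mentioned in the abstract is the ratio of its relative frequency in the query result to its relative frequency in the whole database. The ratio form, the function names, and the sample rows below are assumptions made for this sketch; the abstract does not state the exact definition.

```python
def support(rows, faset):
    """Fraction of rows (dicts) matching every attribute=value pair in faset."""
    if not rows:
        return 0.0
    matches = sum(1 for r in rows if all(r.get(a) == v for a, v in faset.items()))
    return matches / len(rows)

def interestingness(result, database, faset):
    """Assumed score: support in the query result divided by support in the database."""
    db_support = support(database, faset)
    if db_support == 0:
        return 0.0
    return support(result, faset) / db_support

# Hypothetical movie table and the result of a query on director "X".
database = [
    {"director": "X", "genre": "drama"},
    {"director": "X", "genre": "drama"},
    {"director": "Y", "genre": "comedy"},
    {"director": "Z", "genre": "comedy"},
]
result = [row for row in database if row["director"] == "X"]
score = interestingness(result, database, {"genre": "drama"})
```

A score above 1 under this assumed definition means the faSet is over-represented in the result relative to the database, which makes it a candidate for driving recommendations.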
Information re-finding by context: a brain memory inspired approach BIBAFull-Text 1553-1558
  Tangjian Deng; Liang Zhao; Ling Feng; Wenwei Xue
Re-finding what we have accessed before is a common behavior in real life. Psychological studies show that context under which information was accessed can serve as a powerful cue for information recall. "Finding the sweet recipe that I read at the hotel on the trip to Africa last year" is a context-based re-finding request example. Inspired by users' recall characteristics and human memory, we present a context memory model, where each context unit links to the data created/accessed before. Context units are organized in a clustering and associative manner, and evolve dynamically in life cycles. Based on the context memory, we build a recall-by-context query model. Two methods are devised to evaluate context-based recall queries. Our experiments with synthetic and real data show that evaluation exploring the use of context associations can get the best response time.
Semantic data markets: a flexible environment for knowledge management BIBAFull-Text 1559-1564
  Roberto De Virgilio; Giorgio Orsi; Letizia Tanca; Riccardo Torlone
We present Nyaya, a system for the management of Semantic-Web data which couples a general-purpose and extensible storage mechanism with efficient ontology reasoning and querying capabilities. Nyaya processes large Semantic-Web datasets, expressed in multiple formalisms, by transforming them into a collection of Semantic Data Kiosks. Nyaya uniformly exposes the native meta-data of each kiosk using the Datalog± language, a powerful rule-based modelling language for ontological databases. The kiosks form a Semantic Data Market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. Nyaya is easily extensible and robust to updates of both data and meta-data in the kiosk and can readily adapt to different logical organizations of the persistent storage. The approach has been evaluated using well-known benchmarks, and compared to state-of-the-art research prototypes and commercial systems.
Advancing the discovery of unique column combinations BIBAFull-Text 1565-1570
  Ziawasch Abedjan; Felix Naumann
Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm [9] and "Apriori-based" algorithms [4] are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution HCA-Gordian combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.
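Illustrative note (not from the paper): the brute-force end of the spectrum described above can be sketched as a level-wise search that tests column combinations of increasing size and skips any superset of a combination already found to be unique, which is an Apriori-style pruning step. This is a minimal sketch with hypothetical function names and data, not the Gordian or HCA algorithm.

```python
from itertools import combinations

def is_unique(rows, cols):
    """True if projecting the rows onto the column indexes `cols` yields no duplicates."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

def minimal_uniques(rows, n_cols):
    """Level-wise discovery of minimal unique column combinations."""
    found = []
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # Prune: a superset of a known unique is unique but not minimal.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found

# Hypothetical table with three columns.
table = [
    (1, "a", "x"),
    (2, "a", "y"),
    (3, "b", "x"),
    (4, "b", "y"),
]
uniques = minimal_uniques(table, 3)
```

The number of candidates grows exponentially with the number of columns, which is why the abstract notes that such approaches only scale to small datasets or samples.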
Continuously monitoring the correlations of massive discrete streams BIBAFull-Text 1571-1576
  Yueguo Chen; Wei Wang; Xiaoyong Du; Xiaofang Zhou
The problem of monitoring the correlations of discrete streams is to continuously monitor the temporal correlations among massive discrete streams. A temporal correlation of two streams is defined as a tracking behavior, i.e., the most recent pattern of one stream is very similar to a historical pattern of another stream. The challenge is that both the tracking stream and the tracked stream are evolving, which causes the frequent updates of the correlation-ships. The straightforward way of monitoring correlations by brute-force subsequence matching will be very expensive for massive streams. We propose techniques that are able to significantly reduce the number of expensive subsequence matching calls, by continuously pruning and refining the correlated streams. Extensive experiments on the streaming trajectories show the significant performance improvement achieved by the proposed algorithms.

Query answering and social search

Multiple keyword-based queries over XML streams BIBAFull-Text 1577-1582
  Felipe C. Hummel; Altigran S. da Silva; Mirella M. Moro; Alberto H. F. Laender
In this paper, we propose that various keyword-based queries be processed over XML streams in a multi-query processing way. Our algorithms rely on parsing stacks designed for simultaneously matching terms from several distinct queries and use new query indexes to speed up search operations when processing a large number of queries. Besides defining a new problem and novel solutions, we perform experiments in which aspects related to performance and scalability are examined.
Authentication of location-based skyline queries BIBAFull-Text 1583-1588
  Xin Lin; Jianliang Xu; Haibo Hu
In outsourced spatial databases, the location-based service (LBS) provides query services to the clients on behalf of the data owner. However, if the LBS is not trustworthy, it may return incorrect or incomplete query results. Thus, authentication is needed to verify the soundness and completeness of query results. In this paper, we study the authentication problem for location-based skyline queries, which have recently been receiving increasing attention in LBS applications. We propose two authentication methods: one based on the traditional MR-tree index and the other based on a newly developed MR-Sky-tree. Experimental results demonstrate the efficiency of our proposed methods in terms of the authentication cost.
Matching query processing in high-dimensional space BIBAFull-Text 1589-1594
  Chunyang Ma; Yongluan Zhou; Lidan Shou; Dan Dai; Gang Chen
In many applications, such as online dating or job hunting websites, users often need to search for potential matches based on the requirements or preferences imposed by both sides. We refer to this type of query as a matching query. In spite of their wide applicability, little attention has been devoted to improving their performance. As matching queries often appear in various forms even within a single application, we propose in this paper a general processing framework, which can efficiently process various forms of matching queries. Moreover, we elaborate the detailed processing algorithms for two particular forms of matching queries to illustrate the applicability of this framework. We conduct an extensive experimental study with both synthetic and real datasets. The results indicate that, for various matching queries, our techniques can dramatically improve the query performance, especially when the dimensionality is high.
Answering label-constraint reachability in large graphs BIBAFull-Text 1595-1600
  Kun Xu; Lei Zou; Jeffery Xu Yu; Lei Chen; Yanghua Xiao; Dongyan Zhao
In this paper, we study a variant of reachability queries, called label-constraint reachability (LCR) queries. Specifically, given a label set S and two vertices u1 and u2 in a large directed graph G, we verify whether there exists a path from u1 to u2 under label constraint S. Like traditional reachability queries, LCR queries are very useful in applications such as pathway finding in biological networks, inference over RDF (Resource Description Framework) graphs, and relationship finding in social networks. However, LCR queries are much more complicated than their traditional counterpart. Several techniques are proposed in this paper to minimize the search space in computing path-label transitive closure. Furthermore, we demonstrate the superiority of our method by extensive experiments.
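Illustrative note (not from the paper): the query semantics can be shown with a plain online search, a breadth-first traversal that follows only edges whose label belongs to the constraint set S. This is a naive baseline sketch, not the path-label transitive closure technique of the paper; the edge-list representation and the names are assumptions.

```python
from collections import deque

def lcr_reachable(edges, source, target, allowed):
    """Return True if `target` can be reached from `source` using only
    edges whose label is in `allowed`.

    edges: iterable of (from_vertex, label, to_vertex) triples."""
    adjacency = {}
    for u, label, v in edges:
        if label in allowed:  # drop edges that violate the label constraint
            adjacency.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

# Hypothetical labeled directed graph.
graph = [("u1", "a", "x"), ("x", "b", "u2"), ("u1", "c", "u2")]
via_a_and_b = lcr_reachable(graph, "u1", "u2", {"a", "b"})
via_a_only = lcr_reachable(graph, "u1", "u2", {"a"})
```

Each query here costs a full traversal of the filtered graph, which is what an index over path labels is meant to avoid on large graphs.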
The list Viterbi training algorithm and its application to keyword search over databases BIBAFull-Text 1601-1606
  Silvia Rota; Sonia Bergamaschi; Francesco Guerra
Hidden Markov Models (HMMs) are today employed in a variety of applications, ranging from speech recognition to bioinformatics. In this paper, we present the List Viterbi training algorithm, a version of the Expectation-Maximization (EM) algorithm based on the List Viterbi algorithm instead of the commonly used forward-backward algorithm. We developed the batch and online versions of the algorithm, and we also describe an interesting application in the context of keyword search over databases, where we exploit an HMM for matching keywords into database terms. In our experiments we tested the online version of the training algorithm in a semi-supervised setting that allows us to take into account the feedback provided by the users.
Context-based people search in labeled social networks BIBAFull-Text 1607-1612
  Cheng-Te Li; Man-Kwan Shan; Shou-De Lin
In online social networking services, there is a range of scenarios in which users want to search for a particular person given the targeted person's name. The challenge of such people search is namesakes: many people in the social network share the same name. In this paper, we propose to leverage the query contexts to tackle such problems. For example, given one's graduation year and city and the last names of some individuals, one may wish to find classmates from his/her high school. We formulate this problem as context-based people search. Given a social network in which each node is associated with a set of labels and given a query set of labels consisting of a targeted name label and other context labels, our goal is to return a ranked list of persons who possess the targeted name label and connect to other context labels with minimum communication costs through an effective subgraph in the social network. We consider the interactions among query labels to propose a grouping-based method to solve context-based people search. Our method consists of three major parts. First, we model those nodes with query labels into a group graph, which reduces the search space and improves time efficiency. Second, we identify three different kinds of connectors that connect different groups, and exploit connectors to find the corresponding detailed graph topology from the group graph. Third, we propose a Connector-Steiner Tree algorithm to retrieve a resulting ranked list of individuals who possess the targeted label. Experimental results on the DBLP bibliography data show that our grouping-based method returns persons of quality comparable to a greedy search algorithm while being considerably more time-efficient.
On benchmarking data translation systems for semantic-web ontologies BIBAFull-Text 1613-1618
  Carlos R. Rivero; Inma Hernández; David Ruiz; Rafael Corchuelo
Data translation, also known as data exchange, is an integration task that aims at populating a target model using data from a source model. This task is gaining importance in the context of semantic-web ontologies due to the increasing interest in graph databases and semantic-web agents. Currently, there is a variety of semantic-web technologies that can be used to implement data translation systems. This makes it difficult to assess them from an empirical point of view. In this paper, we present a benchmark that provides a catalogue of seven data translation patterns that can be instantiated by means of seven parameters. This allows us to create a variety of synthetic, domain-independent scenarios one can use to test existing data translation systems. We also illustrate how to analyse three such systems using our benchmark. The main benefit of our benchmark is that it allows data translation systems to be compared side by side within a homogeneous framework.

Distributed data management and data integration

I/O-efficient algorithms for answering pattern-based aggregate queries in a sequence OLAP system BIBAFull-Text 1619-1628
  Chun Kit Chui; Ben Kao; Eric Lo; Reynold Cheng
Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature. In recent years, the concept of Sequence OLAP (S-OLAP) has been proposed. The biggest distinguishing feature of S-OLAP from traditional OLAP is that data sequences managed by an S-OLAP system are characterized by the subsequence/substring patterns they possess. An S-OLAP system thus supports pattern-based grouping and aggregation. Conceptually, an S-OLAP system maintains a sequence data cube which is composed of sequence cuboids. Each sequence cuboid presents the answer of a pattern-based aggregate (PBA) query. This paper focuses on the I/O aspects of evaluating PBA queries. We study the problems of joining plan selection and execution planning, which are the core issues in the design of I/O-efficient cuboid materialization algorithms. Through an empirical study, we show that our algorithms lead to a very I/O-efficient strategy for sequence cuboid materialization.
Tractable XML data exchange via relations BIBAFull-Text 1629-1638
  Rada Chirkova; Leonid Libkin; Juan L. Reutter
We consider data exchange for XML documents: given source and target schemas, a mapping between them, and a document conforming to the source schema, construct a target document and answer target queries in a way that is consistent with source information. The problem has primarily been studied in the relational context, in which data-exchange systems have also been built. Since many XML documents are stored in relations, it is natural to consider using a relational system for XML data exchange. However, there is a complexity mismatch between query answering in relational and XML data exchange, which indicates that restrictions have to be imposed on XML schemas and mappings, and on XML shredding schemes, to make the use of relational systems possible. We isolate a set of five requirements that must be fulfilled in order to have a faithful representation of the XML data-exchange problem by a relational translation. We then demonstrate that these requirements naturally suggest the inlining technique for data-exchange tasks. Our key contribution is to provide shredding algorithms for schemas, documents, mappings and queries, and demonstrate that they enable us to correctly perform XML data-exchange tasks using a relational system.
A parallel algorithm for computing borders BIBAFull-Text 1639-1648
  Nicolas Hanusse; Sofian Maabout
The border concept was introduced by Mannila and Toivonen in their seminal paper [20]. This concept finds many applications, e.g., maximal frequent itemsets, minimal functional dependencies, emerging patterns between consecutive database instances and materialized view selection. For large transactions and relational databases defined on n items or attributes, the running time of any border computation is mainly dominated by the time T (for standard sequential algorithms) required to test the interestingness, in general the frequencies, of sets of candidates.
   In this paper we propose a general parallel algorithm for computing borders whatever the application is. We prove the efficiency of our algorithm by showing that: (i) it generates exactly the same number of candidates as the standard sequential algorithm and, (ii) if the interestingness test time of a candidate is bounded by Δ then for a multi-processor shared memory machine with p cores, we prove that the total interestingness time Tp < T/p + 2 Δ n. We implemented our algorithm in the maximal frequent itemset (MFI) mining setting and our experiments confirm our theoretical performance guarantee.
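The bound in (ii) can be evaluated for concrete values. The sketch below is only arithmetic on the stated inequality Tp < T/p + 2 Δ n; the function name and the sample numbers are hypothetical and do not come from the paper.

```python
def border_time_bound(T, p, delta, n):
    """Upper bound T/p + 2*delta*n on the total interestingness time Tp.

    T     : sequential interestingness-test time
    p     : number of cores
    delta : bound on the test time of a single candidate
    n     : number of items or attributes
    """
    return T / p + 2 * delta * n


# Hypothetical values: T = 1000 time units, 8 cores, delta = 0.5, n = 50.
example_bound = border_time_bound(1000, 8, 0.5, 50)
print(example_bound)  # 1000/8 + 2*0.5*50 = 125 + 50 = 175.0
```

With these assumed numbers the additive term 2 Δ n is comparable to T/p, so the bound is most informative when T is large relative to Δ n.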
Supporting queries spanning across phases of evolving artifacts using Steiner forests BIBAFull-Text 1649-1658
  Siarhei Bykau; John Mylopoulos; Flavio Rizzolo; Yannis Velegrakis
The problem of managing evolving data has attracted considerable research attention. Researchers have focused on the modeling and querying of schema/instance-level structural changes, such as addition, deletion and modification of attributes. Databases with such functionality are known as temporal databases. A limitation of temporal databases is that they treat changes as independent events, while often the appearance (or elimination) of some structure in the database is the result of an evolution of some existing structure. We claim that maintaining the causal relationship between the two structures is of major importance since it allows additional reasoning to be performed and answers to be generated for queries that previously had no answers. We present here a novel framework for exploiting the evolution relationships between the structures in the database. In particular, our system combines different structures that are associated through evolution relationships into virtual structures to be used during query answering. The virtual structures define "possible" database instances, in a fashion similar to the possible worlds in probabilistic databases. The framework includes a query answering mechanism that allows queries to be answered over these possible databases without materializing them. Evaluation of such queries raises many interesting technical challenges, since it requires the discovery of Steiner forests on the evolution graphs. For this problem we have designed and implemented a new dynamic programming algorithm with exponential complexity in the size of the input query and polynomial complexity in terms of both the attribute and the evolution data sizes.
Provenance-based refresh in data-oriented workflows BIBAFull-Text 1659-1668
  Robert Ikeda; Semih Salihoglu; Jennifer Widom
We consider a general workflow setting in which input data sets are processed by a graph of transformations to produce output results. Our goal is to perform efficient selective refresh of elements in the output data, i.e., compute the latest values of specific output elements when the input data may have changed. We explore how data provenance can be used to enable efficient refresh. Our approach is based on capturing one-level data provenance at each transformation when the workflow is run initially. Then at refresh time provenance is used to determine (transitively) which input elements are responsible for given output elements, and the workflow is rerun only on that portion of the data needed for refresh. Our contributions are to formalize the problem setting and the problem itself, to specify properties of transformations and provenance that are required for efficient refresh, and to provide algorithms that apply to a wide class of transformations and workflows. We have built a prototype system supporting the features and algorithms presented in the paper. We report preliminary experimental results on the overhead of provenance capture, and on the crossover point between selective refresh and full workflow recomputation.
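The transitive use of one-level provenance described above can be sketched as follows. This is a minimal illustration and not the authors' system: it assumes a linear chain of transformations (the abstract allows a general graph), and the names trace_inputs, stage1 and stage2 are hypothetical.

```python
def trace_inputs(output_id, provenance_per_stage):
    """Walk one-level provenance backwards to the workflow inputs.

    provenance_per_stage is a list with one dict per transformation, in
    workflow order. Each dict maps an output element id of that
    transformation to the set of input element ids it was derived from.
    """
    frontier = {output_id}
    for stage in reversed(provenance_per_stage):
        previous = set()
        for element in frontier:
            previous |= stage.get(element, set())
        frontier = previous
    return frontier


# Hypothetical two-stage workflow: x* -> a* -> o*
stage1 = {"a1": {"x1", "x2"}, "a2": {"x3"}}
stage2 = {"o1": {"a1"}, "o2": {"a1", "a2"}}

needed_for_o1 = trace_inputs("o1", [stage1, stage2])
print(sorted(needed_for_o1))  # ['x1', 'x2']
```

Under these assumptions, refreshing o1 would require rerunning the workflow only on x1 and x2, leaving x3 untouched.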

Keyword search and ranked queries

Ranking support for keyword search on structured data using relevance models BIBAFull-Text 1669-1678
  Veli Bicer; Thanh Tran; Radoslav Nedkov
Keyword query processing over structured data has gained a lot of interest as keywords have proven to be an intuitive means for accessing complex results in databases. While there is a large body of work that provides different mechanisms for computing keyword search results efficiently, a recent study has shown that the problem of ranking is much neglected. Existing strategies employ heuristics that perform only in ad-hoc experiments but fail to consistently and repeatedly deliver results across different information needs. We provide a principled approach for ranking that focuses on a well-established notion of what constitutes relevant keyword search results. In particular, we adopt relevance-based language models to consider the structure and semantics of keyword search results, and introduce novel strategies for smoothing probabilities in this structured data setting. Using a standardized evaluation framework, we show that our work largely and consistently outperforms all existing systems across datasets and various information needs.
Efficient similarity search: arbitrary similarity measures, arbitrary composition BIBAFull-Text 1679-1688
  Dustin Lange; Felix Naumann
Given a (large) set of objects and a query, similarity search aims to find all objects similar to the query. A frequent approach is to define a set of base similarity measures for the different aspects of the objects, and to build light-weight similarity indexes on these measures. To determine the overall similarity of two objects, the results of these base measures are composed, e.g., using simple aggregates or more involved machine learning techniques. We propose the first solution to this search problem that does not place any restrictions on the similarity measures, the composition technique, or the data set size. We define the query plan optimization problem to determine the best query plan using the similarity indexes. A query plan must choose which individual indexes to access and which thresholds to apply. The plan result should be as complete as possible within some cost threshold. We propose the approximative top neighborhood algorithm, which determines a near-optimal plan while significantly reducing the number of candidate plans to be considered. An exact version of the algorithm determines the optimal solution. Evaluation on real-world data indicates that both versions clearly outperform a complete search of the query plan space.
Learning to rank results in relational keyword search BIBAFull-Text 1689-1698
  Joel Coffman; Alfred C. Weaver
Keyword search within databases has become a hot topic within the research community as databases store increasing amounts of information. Users require an effective method to retrieve information from these databases without learning complex query languages (viz. SQL). Despite the recent research interest, performance and search effectiveness have not received equal attention, and scoring functions in particular have become increasingly complex while providing only modest benefits with regards to the quality of search results. An analysis of the factors appearing in existing scoring functions suggests that some factors previously deemed critical to search effectiveness are at best loosely correlated with relevance. We consider a number of these different scoring factors and use machine learning to create a new scoring function that provides significantly better results than existing approaches. We simplify our scoring function by systematically removing the factors with the lowest weight and show that this version still outperforms the previous state-of-the-art in this area.
Adding structure to top-k: from items to expansions BIBAFull-Text 1699-1708
  Xueyao Liang; Min Xie; Laks V. S. Lakshmanan
Keyword based search interfaces are extremely popular as a means for efficiently discovering items of interest from a huge collection, as evidenced by the success of search engines like Google and Bing. However, most of the current search services still return results as a flat ranked list of items. Considering the huge number of items which can match a query, this list based interface can be very difficult for the user to explore and find important items relevant to their search needs. In this work, we consider a search scenario in which each item is annotated with a set of keywords. E.g., in Web 2.0 enabled systems such as flickr and del.icio.us, it is common for users to tag items with keywords. Based on this annotation information, we can automatically group query result items into different expansions of the query corresponding to subsets of keywords. We formulate and motivate this problem within a top-k query processing framework, but as that of finding the top-k most important expansions. Then we study additional desirable properties for the set of expansions returned, and formulate the problem as an optimization problem of finding the best k expansions satisfying all the desirable properties. We propose several efficient algorithms for this problem. Our problem is similar in spirit to recent works on automatic facets generation, but has the important difference and advantage that we don't need to assume the existence of pre-defined categorical hierarchy which is critical for these works. Through extensive experiments on both real and synthetic datasets, we show our proposed algorithms are both effective and efficient.
TEXplorer: keyword-based object search and exploration in multidimensional text databases BIBAFull-Text 1709-1718
  Bo Zhao; Xide Lin; Bolin Ding; Jiawei Han
We propose a novel system TEXplorer that integrates keyword-based object ranking with the aggregation and exploration power of OLAP in a text database with rich structured attributes available, e.g., a product review database. TEXplorer can be implemented within a multi-dimensional text database, where each row is associated with structural dimensions (attributes) and text data (e.g., a document). The system utilizes the text cube data model, where a cell aggregates a set of documents with matching values in a subset of dimensions. Cells in a text cube capture different levels of summarization of the documents, and can represent objects at different conceptual levels.
   Users query the system by submitting a set of keywords. Instead of returning a ranked list of all the cells, we propose a keyword-based interactive exploration framework that could offer flexible OLAP navigational guides and help users identify the levels and objects they are interested in. A novel significance measure of dimensions is proposed based on the distribution of IR relevance of cells. During each interaction stage, dimensions are ranked according to their significance scores to guide drilling down; and cells in the same cuboids are ranked according to their relevance to guide exploration. We propose efficient algorithms and materialization strategies for ranking top-k dimensions and cells. Finally, extensive experiments on real datasets demonstrate the efficiency and effectiveness of our approach.

Data cleaning and analysis

The quality of the XML web BIBAFull-Text 1719-1724
  Steven Grijzenhout; Maarten Marx
We collect evidence to answer the following question: Is the quality of the XML documents found on the web sufficient to apply XML technology like XQuery, XPath and XSLT? XML collections from the web have been previously studied statistically, but no detailed information about the quality of the XML documents on the web is available to date. We address this shortcoming in this study. We gathered 180K XML documents from the web. Their quality is surprisingly good; 85.4% are well-formed and 99.5% of all specified encodings are correct. Validity needs serious attention. Only 25% of all files contain a reference to a DTD or XSD, of which just one third are actually valid. Errors are studied in detail. Automatic error repair seems promising. Our study is well documented and easily repeatable. This paves the way for a periodic quality assessment of the XML web.
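A well-formedness share like the one reported above could be measured along the following lines. This is a minimal sketch using Python's standard xml.etree.ElementTree parser, not the tooling used in the study; the helper names and the four sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def well_formed_share(documents):
    """Fraction of documents that parse without error (0.0 for an empty list)."""
    if not documents:
        return 0.0
    return sum(is_well_formed(doc) for doc in documents) / len(documents)


# Hypothetical sample: two well-formed documents, two broken ones.
sample_docs = ["<a><b/></a>", "<a><b></a>", "<a>text</a>", "<a>"]
print(well_formed_share(sample_docs))  # 0.5
```

Checking validity against a DTD or XSD is a separate step that the standard-library parser does not perform, so this sketch covers well-formedness only.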
Context-based entity description rule for entity resolution BIBAFull-Text 1725-1730
  Lingli Li; Jianzhong Li; Hongzhi Wang; Hong Gao
In this paper, we consider the entity resolution (ER) problem, which is to identify objects referring to the same real-world entity. Prior work on ER involves expensive similarity comparison and clustering approaches. Additionally, the quality of entity resolution may be low due to insufficient information. To address these problems, we adopt context information of data objects and present a novel framework for entity resolution, context-based entity description (CED), in which context information helps entity resolution. In our framework, each entity is described by a set of CEDs. During entity resolution, objects are only compared with CEDs to determine their corresponding entities. Additionally, we propose efficient algorithms for CED discovery and CED-based entity resolution. We experimentally evaluated our CED-based ER algorithm on the real DBLP datasets, and the experimental results show that our algorithm can achieve both high precision and recall as well as outperform existing methods.
Cost-efficient repair in inconsistent probabilistic databases BIBAFull-Text 1731-1736
  Xiang Lian; Yincheng Lin; Lei Chen
Due to the ubiquitous data uncertainty in many emerging real applications, efficient management of probabilistic databases has become an increasingly important yet challenging problem. In particular, one fundamental task of data management is to identify those unreliable data in the probabilistic database that violate integrity constraints (e.g., functional dependencies), and then quickly resolve data inconsistencies. In this paper, we formulate and tackle an important problem of repairing inconsistent probabilistic databases efficiently by value modification. Specifically, we propose a repair semantic, namely possible-world-oriented repair (PW-repair), which partitions possible worlds into several disjoint groups, and repairs these groups individually with minimum repair costs. Due to the intractable result that finding such a PW-repair strategy is NP-complete, we carefully design a heuristic-based greedy approach for PW-repair, which can efficiently obtain an effective repair of the inconsistent probabilistic database. Through extensive experiments, we show that our approach can achieve the efficiency and effectiveness of the repair on inconsistent probabilistic data.
Approximate tensor decomposition within a tensor-relational algebraic framework BIBAFull-Text 1737-1742
  Mijung Kim; Kasim Selçuk Candan
In this paper, we first introduce a tensor-based relational data model and define algebraic operations on this model. We note that, while in traditional relational algebraic systems the join operation tends to be the costliest operation of all, in the tensor-relational framework presented here, tensor decomposition becomes the computationally costliest operation. Therefore, we consider optimization of tensor decomposition operations within a relational algebraic framework. This leads to a highly efficient, effective, and easy-to-parallelize join-by-decomposition approach and a corresponding KL-divergence based optimization strategy. Experimental results provide evidence that minimizing KL-divergence within the proposed join-by-decomposition helps approximate the conventional join-then-decompose scheme well, without the associated time and space costs.
RFID data analysis using tensor calculus for supply chain management BIBAFull-Text 1743-1748
  Roberto De Virgilio; Franco Milicchio
In current trends of the consumer products market, retailers play an increasingly significant role in the governance of supply chains. RFID is a promising infrastructure-less technology that allows an object to be connected with its virtual counterpart, i.e., its representation within information systems. However, the amount of RFID data in supply chain management is vast, posing significant challenges for attaining acceptable performance in its analysis. Current approaches provide hard-coded solutions with high consumption of resources; moreover, these exhibit very limited flexibility in dealing with multidimensional queries at various levels of granularity and complexity. In this paper we propose a general model for supply chain management based on the first principles of linear algebra, in particular on tensorial calculus. Leveraging our abstract algebraic framework, our technique allows both quick decentralized on-line processing and centralized off-line massive business logic analysis, according to the needs and requirements of supply chain actors. Experimental results show that our approach, utilizing recent linear algebra techniques, can process analyses efficiently when compared to recent approaches. In particular, we are able to carry out the required computations even in highly memory-constrained environments, such as on mobile devices. Moreover, when dealing with massive amounts of data, we are capable of exploiting recent parallel and distributed technologies, subdividing our tensor objects into sub-blocks, and processing them independently.
Spreadsheet-based complex data transformation BIBAFull-Text 1749-1754
  Vu Hung; Boualem Benatallah; Regis Saint-Paul
Spreadsheets are used by millions of users as a routine all-purpose data management tool. It is now increasingly necessary for external applications and services to consume spreadsheet data. In this paper, we investigate the problem of transforming spreadsheet data to structured formats required by these applications and services. Unlike prior methods, we propose a novel approach in which transformation logic is embedded into a familiar and expressive spreadsheet-like formula mapping language. Popular transformation patterns provided by transformation languages and mapping tools, that are relevant to spreadsheet-based data transformation, are supported in the language via formulas. Consequently, the language avoids cluttering the source spreadsheets with transformations and turns out to be helpful when multiple schemas are targeted. We implemented a prototype and evaluated the benefits of our approach via experiments in a real application. The experimental results confirmed the benefits of our approach.

Graph management and queries

High efficiency and quality: large graphs matching BIBAFull-Text 1755-1764
  Yuanyuan Zhu; Lu Qin; Jeffrey Xu Yu; Yiping Ke; Xuemin Lin
Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. Exact matching cannot handle a graph with more than 30 nodes, and the quality of approximate matching can be very poor. We propose a novel two-step approach which can efficiently match two large graphs over thousands of nodes with high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We give the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We conducted extensive testing using real and synthetic datasets, and report our findings.
DELTA: indexing and querying multi-labeled graphs BIBAFull-Text 1765-1774
  Jiong Yang; Shijie Zhang; Wei Jin
With the emergence of social networks and computational biology, more data are in the forms of multi-labeled graphs, where a vertex has multiple labels. Since most algorithms focus only on single labeled graphs, these algorithms perform inefficiently when applied to multi-labeled graphs. In this paper, we investigate the problem of subgraph indexing and matching in the multi-labeled graphs. The label set on a vertex is transformed into a high dimensional box. The R-tree is employed to store and index these boxes. The vertex matching problem can be transformed into spatial range queries on the high dimensional space. In addition, we study two types of queries: location and existence queries. In this paper, detailed algorithms are provided to process these two types of queries. Real and synthetic data sets are employed to demonstrate the efficiency and effectiveness of our subgraph indexing and query processing methods.
Skynets: searching for minimum trees in graphs with incomparable edge weights BIBAFull-Text 1775-1784
  Huiping Cao; K. Selçuk Candan; Maria Luisa Sapino
Query processing over weighted data graphs often involves searching for a minimum weighted subgraph -- a tree -- which covers the nodes satisfying the given query criteria (such as a given set of keywords). Existing works often focus on graphs where the edges have scalar valued weights. In many applications, however, edge weights need to be represented as ranges (or intervals) of possible values. In this paper, we introduce the problem of skynets, for searching minimum weighted subgraphs, covering the nodes satisfying given query criteria, over interval-weighted graphs. The key challenge is that, unlike scalars which are often totally ordered, depending on the application specific semantics of the ≤ operator, intervals may be partially ordered. Naturally, the need to maintain alternative, incomparable solutions can push the computational complexity of the problem (which is already high for the case with totally ordered scalar edge weights) even higher. In this paper, we first provide alternative definitions of the ≤ operator for intervals and show that some of these lend themselves to efficient solutions. To tackle the complexity challenge in the remaining cases, we propose two optimization criteria that can be used to constrain the solution space. We also discuss how to extend existing approximation algorithms for Steiner trees to discover solutions to the skynet problem. For efficient calculation of the results, we introduce a novel skyline union operator. Experiments show that the proposed approach achieves significant gains in efficiency, while providing close to optimal results.
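The partial ordering of intervals mentioned above can be illustrated with a small skyline computation. The abstract does not spell out its definitions of the ≤ operator, so the dominance rule below (both endpoints no larger, and the intervals not identical) is only one assumed possibility, and the function names are hypothetical.

```python
def dominates(a, b):
    """Assumed rule: interval a = (lo, hi) dominates b if both endpoints
    of a are no larger than those of b and the intervals differ."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def skyline(intervals):
    """Keep the intervals that no other interval dominates."""
    return [x for x in intervals if not any(dominates(y, x) for y in intervals)]


# Hypothetical candidate weights, given as (lower, upper) bounds.
candidate_weights = [(1, 5), (2, 3), (2, 6), (4, 4)]
print(skyline(candidate_weights))  # [(1, 5), (2, 3)]
```

Here (1, 5) and (2, 3) are incomparable under the assumed rule, so both must be kept as alternative solutions, which is the source of the extra complexity the abstract describes.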
Fast fully dynamic landmark-based estimation of shortest path distances in very large graphs BIBAFull-Text 1785-1794
  Konstantin Tretyakov; Abel Armas-Cervantes; Luciano García-Bañuelos; Jaak Vilo; Marlon Dumas
Computing the shortest path between a pair of vertices in a graph is a fundamental primitive in graph algorithmics. Classical exact methods for this problem do not scale up to contemporary, rapidly evolving social networks with hundreds of millions of users and billions of connections. A number of approximate methods have been proposed, including several landmark-based methods that have been shown to scale up to very large graphs with acceptable accuracy. This paper presents two improvements to existing landmark-based shortest path estimation methods. The first improvement relates to the use of shortest-path trees (SPTs). Together with appropriate short-cutting heuristics, the use of SPTs makes it possible to achieve higher accuracy with acceptable time and memory overhead. Furthermore, SPTs can be maintained incrementally under edge insertions and deletions, which allows for a fully-dynamic algorithm. The second improvement is a new landmark selection strategy that seeks to maximize the coverage of all shortest paths by the selected landmarks. The improved method is evaluated on the DBLP, Orkut, Twitter and Skype social networks.
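For orientation, the basic landmark estimate that such methods build on can be sketched as follows: distances from each landmark are precomputed, and d(u, v) is bounded from above by d(u, l) + d(l, v). This is the plain baseline only, not the SPT-based improvement or the landmark selection strategy of the paper; it assumes an unweighted, undirected graph stored as an adjacency dict, and all names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def landmark_estimate(u, v, landmark_dists):
    """Upper bound on d(u, v): the minimum over landmarks of d(u, l) + d(l, v)."""
    candidates = [d[u] + d[v] for d in landmark_dists if u in d and v in d]
    return min(candidates) if candidates else float("inf")


# Hypothetical path graph a - b - c - d - e with a single landmark c.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
dists_from_c = [bfs_distances(path_graph, "c")]

print(landmark_estimate("a", "e", dists_from_c))  # 4, exact: c lies on the path
print(landmark_estimate("a", "b", dists_from_c))  # 3, an overestimate of the true 1
```

The second query shows why accuracy depends on where the landmarks sit: the estimate is exact only when some landmark lies on a shortest path between the two vertices.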
CP-index: on the efficient indexing of large graphs BIBAFull-Text 1795-1804
  Yan Xie; Philip S. Yu
Graph search, i.e., finding all graphs in a database D that contain the query graph q, is a classical primitive prevalent in various graph database applications. In the past, an abundance of studies has been devoted to this topic; however, the recent emergence of large information networks poses new challenges to the research community. Most traditional graph search schemes utilize the strategy of graph feature based indexing, but the index construction step, which often involves frequent subgraph mining, becomes a bottleneck for large graphs due to its high computational complexity. Although several methods have been proposed to solve this mining bottleneck, such as summarization of database graphs, the frequent subgraphs thus generated as indexing features are still unsatisfactory because the feature set is in general not only inadequate or deficient for the large graph scenario, but also contains many redundant features. Furthermore, the large size of the graphs makes it too easy for a small feature to be contained in many of them, severely impacting its selectivity and pruning power. Motivated by the issues identified above, in this paper we propose a novel CP-Index (Contact Preservation) for efficient indexing of large graphs. To overcome the low selectivity issue, we reap further pruning opportunities by leveraging each feature's location information in the database graphs. Specifically, we look at how features touch each other in the query, and check whether this contact pattern is preserved in the target graphs. Then, to tackle the deficiency and redundancy problems associated with features, new feature generation and selection methods such as dual feature generation and size-increasing bootstrapping feature selection are introduced to complete our design. Experimental results show that CP-Index is much more effective in indexing large graphs.

Social, search, and other behaviour

Learning to target: what works for behavioral targeting BIBAFull-Text 1805-1814
  Sandeep Pandey; Mohamed Aly; Abraham Bagherjeiran; Andrew Hatch; Peter Ciccolo; Adwait Ratnaparkhi; Martin Zinkevich
Understanding what interests and delights users is critical to effective behavioral targeting, especially in information-poor contexts. As users interact with content and advertising, their passive behavior can reveal their interests towards advertising. Two issues are critical for building effective targeting methods: what metric to optimize for and how to optimize. More specifically, we first attempt to understand what the learning objective should be for behavioral targeting so as to maximize advertiser's performance. While most popular advertising methods optimize for user clicks, as we will show, maximizing clicks does not necessarily imply maximizing purchase activities or transactions, called conversions, which directly translate to advertiser's revenue. In this work we focus on conversions which makes a more relevant metric but also the more challenging one. Second is the issue of how to represent and combine the plethora of user activities such as search queries, page views, ad clicks to perform the targeting. We investigate several sources of user activities as well as methods for inferring conversion likelihood given the activities. We also explore the role played by the temporal aspect of user activities for targeting, e.g., how recent activities compare to the old ones. Based on a rigorous offline empirical evaluation over 200 individual advertising campaigns, we arrive at what we believe are best practices for behavioral targeting. We deploy our approach over live user traffic to demonstrate its superiority over existing state-of-the-art targeting methods.
Large-scale behavioral targeting with a social twist BIBAFull-Text 1815-1824
  Kun Liu; Lei Tang
Behavioral targeting (BT) is a widely used technique for online advertising. It leverages information collected on an individual's web-browsing behavior, such as page views, search queries and ad clicks, to select the most relevant ads to display to the user. With the proliferation of social networks, it is possible to relate the behavior of individuals and their social connections. Although the similarity among connected individuals is well established (i.e., homophily), it is still not clear whether and how we can leverage the activities of one's friends for behavioral targeting, or whether forecasts derived from such social information are more accurate than standard behavioral targeting models. In this paper, we strive to answer these questions by evaluating the predictive power of social data across 60 consumer domains on a large online network of over 180 million users in a period of two and a half months. To the best of our knowledge, this is the most comprehensive study of social data in the context of behavioral targeting on such an unprecedented scale. Our analysis offers interesting insights into the value of social data for developing the next generation of targeting services.
Evolving social search based on bookmarks and status messages from social networks BIBAFull-Text 1825-1834
  Bastian Karweg; Christian Huetter; Klemens Böhm
Social search is a variant of information retrieval where a document or website is considered relevant if individuals from the searcher's social network have interacted with it. Our ranking metric Social Relevance Score (SRS) is based on two factors. First, the engagement intensity quantifies the effort a user has made during an interaction. Second, users can assign a trust score to each person from their social network, which is then refined using social network analysis. We have tested our hypotheses with our search engine www.social-search.com, which extends the existing social bookmarking platform folkd.com. Our search engine integrates information the folkd.com users share through the popular social networks Twitter and Facebook. With permission of 2,385 testers, we have connected to their social graphs to generate a large-scale real-world dataset. Over the course of a two-month field study, 468,889 individuals have generated 24,854,281 website recommendations. We have used those links to enhance their search results while measuring the impact on the search behavior. We have found that social results are available for most queries and usually lead to more satisfying results.
Social ranking for spoken web search BIBAFull-Text 1835-1840
  Shrey Sahay; Nitendra Rajput; Niketan Pansare
Spoken Web is an alternative Web for low-literacy users in the developing world. People can create audio content over the phone and share it on the Spoken Web. This enables easy creation of locally relevant content. Even on the World Wide Web in developed regions, the recent increase in traffic is due to the locally relevant content created on social networking sites. This paper argues that content search and ranking in the new scenario needs a re-look. The generic model of using in-links for ranking such content is not an appropriate measure of the content relevance in such a collaborative Web 2.0 world. This paper aims to bring the social context into Spoken Web ranking. We formulate a relationship function between the query-creator and the content-creator and use this as one measure of the content relevance to the user. The relationship function uses the geographical location of the two people and their prior browsing preferences as parameters to determine the relationship between the two users. Further, we determine the trustability of the content based on the content creator's acceptance measure by the social network. We use these two features in addition to the term-frequency -- inverse-term-frequency match to rank the search results in the context of the social network of the query-creator and provide a more specific and socially relevant result to the user.
Effects of search success on search engine re-use BIBAFull-Text 1841-1846
  Victor Hu; Maria Stone; Jan Pedersen; Ryen W. White
People's experiences when interacting with online services affect their decisions on reuse. Users of Web search engines are primarily focused on obtaining relevant information pertaining to their query. Search engines that fail to satisfy users' information needs may find their market share to be negatively affected. However, despite its importance to search providers, the relationship between search success and search engine reuse is poorly understood. In this paper, we present a longitudinal log-based study with a large cohort of search engine users that quantifies the relationship between success and re-use of search engines. We use time series analysis to define two groups of users: stationary and non-stationary. We find that recent changes in satisfaction rate do correlate moderately with changes in rate of return for stationary users. For non-stationary users, we find that satisfaction and rate of return change together and in the same direction. We also find that some effects are stronger for a smaller player on the market than for a clear market leader, but both are affected. This is the first study to explore these issues in the context of Web search, and our findings have implications for search providers seeking to better understand their users and improve their experience.

Applications in different areas

Enriching textbooks with images BIBAFull-Text 1847-1856
  Rakesh Agrawal; Sreenivas Gollapudi; Anitha Kannan; Krishnaram Kenthapadi
Textbooks have a direct bearing on the quality of education imparted to the students. Therefore, it is of paramount importance that the educational content of textbooks provide a rich learning experience to the students. Recent studies on understanding learning behavior suggest that the incorporation of digital visual material can greatly enhance learning. However, textbooks used in many developing regions are largely text-oriented and lack good visual material. We propose techniques for finding images from the web that are most relevant for augmenting a section of the textbook, while respecting the constraint that the same image is not repeated in different sections of the same chapter. We devise a rigorous formulation of the image assignment problem and present a polynomial time algorithm for solving the problem optimally. We also present two image mining algorithms that utilize orthogonal signals and hence obtain different sets of relevant images. Finally, we provide an ensembling algorithm for combining the assignments. To empirically evaluate our techniques, we use a corpus of high school textbooks in use in India. Our user study utilizing the Amazon Mechanical Turk platform indicates that the proposed techniques are able to obtain images that can help increase the understanding of the textbook material.
Exploring the corporate ecosystem with a semi-supervised entity graph BIBAFull-Text 1857-1866
  Hassan H. Malik; Ian MacGillivray; Måns Olof-Ors; Siming Sun; Shailesh Saroha
Investment decisions in the financial markets require careful analysis of information available from multiple data sources. In this paper, we present Atlas, a novel entity-based information analysis and content aggregation platform that uses heterogeneous data sources to construct and maintain the "ecosystem" around tangible and logical entities such as organizations, products, industries, geographies, commodities and macroeconomic indicators. Entities are represented as vertices in a directed graph, and edges are generated using entity co-occurrences in unstructured documents and supervised information from structured data sources. Significance scores for the edges are computed using a method that combines supervised, unsupervised and temporal factors into a single score. Important entity attributes from the structured content and the entity neighborhood in the graph are automatically summarized as the entity "fingerprint". A highly interactive user interface provides exploratory access to the graph and supports common business use cases. We present results of experiments performed on five years of news and broker research data, and show that Atlas is able to accurately identify important and interesting connections in real-world entities. We also demonstrate that Atlas entity fingerprints are particularly useful in entity similarity queries, with a quality that rivals existing human maintained databases.
Generating links to background knowledge: a case study using narrative radiology reports BIBAFull-Text 1867-1876
  Jiyin He; Maarten de Rijke; Merlijn Sevenster; Rob van Ommering; Yuechen Qian
Automatically annotating texts with background information has recently received much attention. We conduct a case study in automatically generating links from narrative radiology reports to Wikipedia. Such links help users understand the medical terminology and thereby increase the value of the reports. Direct applications of existing automatic link generation systems trained on Wikipedia to our radiology data do not yield satisfactory results. Our analysis reveals that medical phrases are often syntactically regular but semantically complicated, e.g., containing multiple concepts or concepts with multiple modifiers. The latter property is the main reason for the failure of existing systems. Based on this observation, we propose an automatic link generation approach that takes into account these properties. We use a sequential labeling approach with syntactic features for anchor text identification in order to exploit syntactic regularities in medical terminology. We combine this with a sub-anchor based approach to target finding, which is aimed at coping with the complex semantic structure of medical phrases. Empirical results show that the proposed system effectively improves the performance over existing systems.
Information extraction from pathology reports in a hospital setting BIBAFull-Text 1877-1882
  David Martinez; Yue Li
As more health data becomes available, information extraction aims to make an impact on the workflows of hospitals and care centers. One of the targeted areas is the management of pathology reports, which are employed for cancer diagnosis and staging. In this work we integrate text mining tools in the workflow of the Royal Melbourne Hospital, to extract information from pathology reports with minimal expert intervention. Our framework relies on coarse-grained annotation (at document level), making it highly portable. Our evaluation shows that the kind of language used in these reports makes it feasible to extract information with high precision and recall, by means of state-of-the-art classification methods, and feature engineering.
Extract knowledge from semi-structured websites for search task simplification BIBAFull-Text 1883-1888
  Yingqin Gu; Jun Yan; Hongyan Liu; Jun He; Lei Ji; Ning Liu; Zheng Chen
Simplifying the key tasks of search engine users by directly retrieving structured knowledge for them according to their queries is attracting much attention from both industry and academia. A bottleneck of this challenging problem is how to extract the structured knowledge from noisy and complex Web-scale websites automatically. In this paper, we propose an unsupervised automatic wrapper induction algorithm named Scalable Knowledge Extractor from webSites (SKES). SKES induces the wrapper in a divide and conquer mode, i.e., it divides the general wrapper into several sub-wrappers to learn from the data independently. Moreover, by employing techniques such as tag path representation of Web pages, SKES is shown by the experimental results to be efficient and noise-tolerant. Furthermore, based on our automatically extracted knowledge, we also built a prototype to serve structured knowledge to end users for simplifying their key search tasks. Very positive feedback was received on the prototype.
Privacy protected knowledge management in services with emphasis on quality data BIBAFull-Text 1889-1894
  Debapriyo Majumdar; Rose Catherine; Shajith Ikbal; Karthik Visweswariah
Improving the productivity of practitioners through effective knowledge management and delivering high quality service in the Application Management Services (AMS) domain are key focus areas for all IT services organizations. One source of historical knowledge in AMS is the large amount of resolved problem ticket data, which is often confidential and immensely valuable, but the majority of which is of very poor quality. In this paper, we present a knowledge management tool that detects the quality of information present in problem tickets and enables effective knowledge search in tickets by prioritizing quality data in the search ranking. The tool facilitates leveraging of knowledge across different AMS accounts, while preserving data privacy, by masking client confidential information. It also extracts several relevant entities contained in the noisy unstructured text entered in the tickets and presents them to the users. We present several experimental evaluations and a pilot study conducted with an AMS account which show that our tool is effective and leads to substantial improvement in productivity of the practitioners.

Poster session: information retrieval

Search result diversification for enterprise data BIBAFull-Text 1901-1904
  Wei Zheng; Hui Fang; Conglei Yao; Min Wang
Search result diversification aims to return a list of diversified relevant documents in order to satisfy different user information needs. Most efforts have focused on Web search, and few studies have considered another important search domain, i.e., enterprise search. Unlike Web search, enterprise search deals with both unstructured and structured data. In this paper, we propose to integrate the structured and unstructured data to discover meaningful query subtopics in search result diversification. Experimental results show that integrating structured and unstructured information allows us to discover high quality query subtopics, which are effective in diversifying the retrieval results.
Diversification for multi-domain result sets BIBAFull-Text 1905-1908
  Alessandro Bozzon; Marco Brambilla; Piero Fraternali; Marco Tagliasacchi
Multi-domain search answers queries spanning multiple entities, like "Find an affordable house in a city with low criminality index, good schools and medical services", by producing ranked sets of entity combinations that maximize relevance, measured by a function expressing the user's preferences. Due to the combinatorial nature of results, good entity instances (e.g., inexpensive houses) tend to appear repeatedly in top-ranked combinations. To improve the quality of the result set, it is important to balance relevance (i.e., high values of the ranking function) with diversity, which promotes different, yet almost equally relevant, entities in the top-k combinations. This paper explores two different notions of diversity for multi-domain result sets, experimentally compares alternative algorithms for the trade-off between relevance and diversity, and performs a user study for evaluating the utility of diversification in multi-domain queries.
A peer's-eye view: network term clouds in a peer-to-peer system BIBAFull-Text 1909-1912
  Raynor Vliegendhart; Martha Larson; Christoph Kofler; Johan Pouwelse
We investigate term clouds that represent the content available in a peer-to-peer (P2P) network. Such network term clouds are non-trivial to generate in distributed settings. Our term cloud generator was implemented and released in Tribler -- a widely-used, server-free P2P system -- to support users in understanding the sorts of content available. Our evaluation and analysis focuses on three aspects of the clouds: coverage, usefulness and accumulation speed. A live experiment demonstrates that individual peers accumulate substantial network-level information, indicating good coverage of the overall content of the system. The results of a user study carried out on a crowdsourcing platform confirm the usefulness of clouds, showing that they succeed in conveying to users information on the type of content available in the network. An analysis of five example peers reveals that accumulation speeds of terms at new peers can support the development of a semantically diverse term set quickly after a cold start. This work represents the first investigation of term clouds in a live, 100% server-free P2P setting.
RerankEverything: a reranking interface for exploring search results BIBAFull-Text 1913-1916
  Takehiro Yamamoto; Satoshi Nakamura; Katsumi Tanaka
This paper proposes a system called "RerankEverything", which enables users to rerank search results in any search service, such as a Web search engine, an e-commerce site, a hotel reservation site, and so on. This system helps users explore diverse search results. In conventional search services, interactions between users and systems are quite limited and complicated. By using RerankEverything, users can interactively explore search results in accordance with their interests by reranking search results from various viewpoints. Experimental results show that our system potentially helps users search more proactively. When using our system, users were more likely to click search results that were initially low ranked. Users also browsed through more diverse search results by reranking search results after giving various types of feedback with our system.
HealthTrust: trust-based retrieval of you tube's diabetes channels BIBAFull-Text 1917-1920
  Luis Fernandez-Luque; Randi Karlsen; Genevieve B. Melton
The Internet has become one of the main sources of consumer health information. Health consumers have access to ever-growing health information resources, especially since the rise of social media. For example, over 20,000 videos have been uploaded by American hospitals to YouTube. Finding health videos is challenging because of factors like tag spamming and misleading information. Previous studies have found difficulties when searching for good health videos on YouTube, including false information (e.g., herbal cures for diabetes or cancer).
   Our objective was to extract information about the trustworthiness of YouTube's diabetes channels using link analysis of the diabetes online community. To this end we developed an algorithm, called HealthTrust, based on Hyperlink-Induced Topic Search (HITS), for ranking the most authoritative diabetes channels. The ranked list of channels from HealthTrust was compared with the list of the most relevant diabetes channels from YouTube. Two healthcare professionals made a blinded classification of channels based on whether they would recommend the channel to a patient. HealthTrust performed better for retrieving channels recommended by the professional reviewers. HealthTrust performed several times better than YouTube for filtering out the worst channels (i.e., those not recommended by any expert reviewer).
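As an illustration of the link analysis that the abstract builds on, the following is a minimal sketch of plain HITS power iteration. It is not the HealthTrust algorithm itself, whose details are not given here; the function name `hits`, the channel identifiers, and the toy link graph are all hypothetical.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Return (authority, hub) score dicts for a list of directed (src, dst) links."""
    nodes = {node for edge in edges for node in edge}
    authority = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of everything linking to a node.
        new_authority = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_authority[dst] += hub[src]
        # Hub update: sum the fresh authority scores of everything a node links to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_authority[dst]
        # Normalise to unit length so the scores stay bounded.
        a_norm = sqrt(sum(v * v for v in new_authority.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        authority = {node: v / a_norm for node, v in new_authority.items()}
        hub = {node: v / h_norm for node, v in new_hub.items()}
    return authority, hub


# Hypothetical link graph between four channels (not data from the study).
links = [("ch_a", "ch_c"), ("ch_b", "ch_c"), ("ch_d", "ch_c"), ("ch_a", "ch_d")]
authority, hub = hits(links)
ranked = sorted(authority, key=authority.get, reverse=True)
```

In this toy graph `ch_c` receives the most in-links from good hubs and therefore ends up first in `ranked`; a trust-based variant would presumably change how the graph is built or how the scores are seeded.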
Item categorization in the e-commerce domain BIBAFull-Text 1921-1924
  Dan Shen; Jean David Ruvini; Manas Somaiya; Neel Sundaresan
Hierarchical classification is a challenging problem yet has broad application in real-world tasks. Item categorization in the e-commerce domain is such an example. In a large-scale industrial setting such as eBay, a vast number of items need to be categorized into a large number of leaf categories, on top of which a complex topic hierarchy is defined. Other than the scale challenges, item data is extremely sparse, has a skewed distribution over categories, and exhibits heterogeneous characteristics across categories. A common strategy for hierarchical classification is the "gates-and-experts" method, where a high-level classification is made first (the gates), followed by a low-level distinction (the experts). In this paper, we propose to leverage domain-specific feature generation and modeling techniques to greatly enhance the classification accuracy of the experts. In particular, we derive features to encode various rich domain knowledge and linguistic hints, and then adapt an SVM-based model to distinguish several very confusing category groups that appeared as the performance bottleneck of a currently deployed live system at eBay. We use illustrative examples and empirical results to demonstrate the effectiveness of our approach, particularly the merit of smartly designed domain-specific features.
An efficient method for using machine translation technologies in cross-language patent search BIBAFull-Text 1925-1928
  Walid Magdy; Gareth J. F. Jones
Topics in prior-art patent search are typically full patent applications and relevant items are patents often taken from sources in different languages. Cross language patent retrieval (CLPR) technologies support searching for relevant patents across multiple languages. As such, CLPR requires a translation process between topic and document languages. The most popular method for crossing the language barrier in cross language information retrieval (CLIR) in general is machine translation (MT). High quality MT systems are becoming widely available for many language pairs and generally have higher effectiveness for CLIR than dictionary based methods. However for patent search, using MT for translation of the very long search queries requires significant time and computational resources. We present a novel MT approach specifically designed for CLIR in general and CLPR in particular. In this method information retrieval (IR) text pre-processing in the form of stop word removal and stemming are applied to the MT training corpus prior to the training phase of the MT system. Applying this step leads to a significant decrease in the MT computational and resource requirements in both the training and translation phases. Experiments on the CLEF-IP 2010 CLPR task show the new technique to be 5 to 23 times faster than standard MT for query translation, while maintaining statistically indistinguishable IR effectiveness. Furthermore the new method is significantly better than standard MT when only limited translation training resources are available.
Understanding the types of information humans associate with geographic objects BIBAFull-Text 1929-1932
  Ahmet Aker; Robert Gaizauskas
In this paper we investigate what sorts of information humans request about geographical objects of the same type. For example, Edinburgh Castle and the Bodiam Castle are two objects of the same type -- castle. The question is whether specific information is requested for the object type castle and how this information differs for objects of other types, e.g. church, museum or lake. We aim to answer this question using an online survey. In the survey we showed 184 participants 200 images pertaining to urban and rural objects and asked them to write questions for which they would like to know the answers when seeing those objects. Our analysis of 7644 questions collected in the survey shows that humans have shared ideas of what to ask about geographical objects. When the object types resemble each other (e.g. church, temple) the requested information is similar for the objects of these types. Otherwise, the information is specific to an object type. Our results can guide tasks involving automatic generation of templates for image descriptions, and their assessment as well as image indexing and organization.
Google, bing and a new perspective on ranking similarity BIBAFull-Text 1933-1936
  Bruno Cardoso; João Magalhães
In this paper, we propose a framework to characterize and compare the results of two search engines. Typical user queries are ambiguous and, consequently, each search engine will compute ranks in different manners, attempting to answer them in the best possible way. Thus, each search engine will have its own bias. Given the importance of the first page results in Web search engines, in this paper we propose a framework to assess the information presented in the first page by measuring the information entropy and the correlations between two ranks. Employing the recently proposed Rank-Biased Overlap measure [2], we examine the extent to which Bing and Google rankings in fact differ. We also extend this measure and propose a measure for comparing the information entropy present in two ranks. The proposed measure is based on the correlation of two ranks and the application of the Jensen-Shannon divergence between two document sets. Our methodology starts with 40,000 user queries and crawls the search results for these queries on both search engines. The results allow us to determine the search engines' correlations, crawling coverage, information overlap, and information entropy.
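For readers unfamiliar with Rank-Biased Overlap, the sketch below computes its truncated sum, (1 - p) * sum over d of p^(d-1) * A_d, where A_d is the fraction of items shared by the top-d prefixes of the two rankings. It covers only this basic form, not the extension proposed in the abstract; the function name `rbo_truncated`, the persistence value `p=0.9`, and the example URLs are assumptions for illustration.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9, depth=None):
    """Truncated Rank-Biased Overlap of two duplicate-free rankings.

    Computes (1 - p) * sum_{d=1..depth} p**(d-1) * A_d, where A_d is the
    fraction of items shared by the two top-d prefixes. The truncated sum
    is a lower bound on the overlap of the full (unseen) rankings.
    """
    if depth is None:
        depth = min(len(ranking_a), len(ranking_b))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


# Hypothetical first-page results from two engines for the same query.
engine_one = ["url1", "url2", "url3"]
engine_two = ["url2", "url1", "url4"]
similarity = rbo_truncated(engine_one, engine_two)
```

Smaller values of `p` put more weight on agreement at the very top of the two lists, which matches the emphasis on first-page results.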
Effectiveness beyond the first crawl tier BIBAFull-Text 1937-1940
  Rodrygo L. T. Santos; Craig Macdonald; Iadh Ounis
Modern Web crawlers seek to visit quality documents first, and re-visit them more frequently than other documents. As a result, the first-tier crawl of a Web corpus is typically of higher quality compared to subsequent crawls. In this paper, we investigate the impact of first-tier documents on adhoc retrieval performance. In particular, we analyse the retrieval performance of runs submitted to the adhoc task of the TREC 2009 Web track in terms of how they rank first-tier documents and how these documents contribute to the performance of each run. Our results show that the performance of these runs is heavily dependent on their ability to rank first-tier documents. Moreover, we show that, different from leading Web search engines, their attempt to go beyond the first tier almost always results in decreased performance. Finally, we show that selectively removing spam from different tiers can be a direction for fully exploiting documents beyond the first tier.
Worker types and personality traits in crowdsourcing relevance labels BIBAFull-Text 1941-1944
  Gabriella Kazai; Jaap Kamps; Natasa Milic-Frayling
Crowdsourcing platforms offer unprecedented opportunities for creating evaluation benchmarks, but suffer from varied output quality from crowd workers who possess different levels of competence and aspiration. This raises new challenges for quality control and requires an in-depth understanding of how workers' characteristics relate to the quality of their work.
   In this paper, we use behavioral observations (HIT completion time, fraction of useful labels, label accuracy) to define five worker types: Spammer, Sloppy, Incompetent, Competent, Diligent. Using data collected from workers engaged in the crowdsourced evaluation of the INEX 2010 Book Track Prove It task, we relate the worker types to label accuracy and personality trait information along the 'Big Five' personality dimensions.
   We expect that these new insights about the types of crowd workers and the quality of their work will inform how to design HITs to attract the best workers to a task and explain why certain HIT designs are more effective than others.
A nugget-based test collection construction paradigm BIBAFull-Text 1945-1948
  Shahzad Rajput; Virgil Pavlu; Peter B. Golbus; Javed A. Aslam
The problem of building test collections is central to the development of information retrieval systems such as search engines. Starting with a few relevant "nuggets" of information manually extracted from existing TREC corpora, we implement and test a methodology that finds and correctly assesses the vast majority of relevant documents found by TREC assessors -- as well as up to four times more additional relevant documents. Our methodology produces highly accurate test collections that hold the promise of addressing the issues of scalability, reusability, and applicability.
Recency ranking by diversification of result set BIBAFull-Text 1949-1952
  Andrey Styskin; Fedor Romanenko; Fedor Vorobyev; Pavel Serdyukov
In this paper, we propose a web search retrieval approach which automatically detects recency sensitive queries and increases the freshness of the ordinary document ranking by a degree proportional to the probability of the need for recent content. We propose to solve the recency ranking problem by using result diversification principles and deal with the query's non-topical ambiguity appearing when the need for recent content can be detected only with uncertainty. Our offline and online experiments with millions of queries from real search engine users demonstrate the significant increase in satisfaction of users presented with a search result generated by our approach.
Patent query reduction using pseudo relevance feedback BIBAFull-Text 1953-1956
  Debasis Ganguly; Johannes Leveling; Walid Magdy; Gareth J. F. Jones
Queries in patent prior art search are full patent applications and much longer than standard ad hoc search and web search topics. Standard information retrieval (IR) techniques are not entirely effective for patent prior art search because of ambiguous terms in these massive queries. Reducing patent queries by extracting key terms has been shown to be ineffective mainly because it is not clear what the focus of the query is. An optimal query reduction algorithm must thus seek to retain the useful terms for retrieval favouring recall of relevant patents, but remove terms which impair IR effectiveness. We propose a new query reduction technique decomposing a patent application into constituent text segments and computing the Language Modeling (LM) similarities by calculating the probability of generating each segment from the top ranked documents. We reduce a patent query by removing the least similar segments from the query, hypothesising that removal of these segments can increase the precision of retrieval, while still retaining the useful context to achieve high recall. Experiments on the patent prior art search collection CLEF-IP 2010 show that the proposed method outperforms standard pseudo-relevance feedback (PRF) and a naive method of query reduction based on removal of unit frequency terms (UFTs).
Relevance feedback exploiting query-specific document manifolds BIBAFull-Text 1957-1960
  Chang Wang; Emine Yilmaz; Martin Szummer
We incorporate relevance feedback into a learning to rank framework by exploiting query-specific document similarities. Given a few judged feedback documents and many retrieved but unjudged documents for a query, we learn a function that adjusts the initial ranking score of each document. Scores are fit so that documents with similar term content get similar scores, and scores of judged documents are close to their labels. By such smoothing along the manifold of retrieved documents, we avoid overfitting, and can therefore learn a detailed query-specific scoring function with several dozen term weights.
Insights into explicit semantic analysis BIBAFull-Text 1961-1964
  Thomas Gottron; Maik Anderka; Benno Stein
Since its debut the Explicit Semantic Analysis (ESA) has received much attention in the IR community. ESA has been proven to perform surprisingly well in several tasks and in different contexts. However, given the conceptual motivation for ESA, recent work has observed unexpected behavior. In this paper we look at the foundations of ESA from a theoretical point of view and employ a general probabilistic model for term weights which reveals how ESA actually works. Based on this model we explain some of the phenomena that have been observed in previous work and support our findings with new experiments. Moreover, we provide a theoretical grounding on how the size and the composition of the index collection affect the ESA-based computation of similarity values for texts.
On bias problem in relevance feedback BIBAFull-Text 1965-1968
  Qianli Xing; Yi Zhang; Lanbo Zhang
Relevance feedback is an effective approach to improve retrieval quality over the initial query. Typical relevance feedback methods usually select top-ranked documents for relevance judgments, then query expansion or model updating is carried out based on the feedback documents. However, the number of feedback documents is usually limited due to expensive human labeling. Thus relevant documents in the feedback set are hardly representative of all relevant documents and the feedback set is actually biased. As a result, the performance of relevance feedback suffers. In this paper, we first show how and where the bias problem exists through experiments. Then we study how the bias can be reduced by utilizing the unlabeled documents. After analyzing the usefulness of a document to relevance feedback, we propose an approach that extends the feedback set with carefully selected unlabeled documents by heuristics. Our experimental results show that the extended feedback set has less bias than the original feedback set and better performance can be achieved when the extended feedback set is used for relevance feedback.
Selecting related terms in query-logs using two-stage SimRank BIBAFull-Text 1969-1972
  Yunlong Ma; Hongfei Lin; Yuan Lin
It is commonly believed that query logs from Web search are a gold mine for search business, because they reflect users' preferences over Web pages presented by search engines, so a lot of studies based on query logs have been carried out in the last few years. In this study, we assume that two queries are relevant to each other when they have the same clicked page in their result lists, and we also consider the topics of the user's need behind the queries. Thus, we propose a Two-Stage SimRank (called TSS in this paper) algorithm based on SimRank and some clustering algorithms to compute the similarity among queries, and then use it to discover relevant terms for query expansion, considering the information of topics and the global relationships of queries concurrently, with a query log collected by a practical search engine. Experimental results on two TREC test collections show that our approach can discover qualified terms effectively and improve retrieval performance.
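The idea that two queries are related when they share a clicked page can be illustrated with plain SimRank on a query-page click graph. The sketch below is only the basic SimRank iteration, not the two-stage TSS variant, and it omits the clustering step; the function name `simrank`, the decay factor of 0.8, and the toy click log are assumptions.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Basic SimRank over an undirected graph given as {node: set_of_neighbors}.

    Every node (queries and pages alike) must appear as a key.
    """
    nodes = list(neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new_sim = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[a][b] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new_sim[a][b] = 0.0
                else:
                    # Average similarity over all pairs of neighbors, damped by decay.
                    total = sum(sim[x][y] for x in neighbors[a] for y in neighbors[b])
                    new_sim[a][b] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new_sim
    return sim


# Hypothetical click log: q1 and q2 share a clicked page, q3 does not.
clicks = {
    "q1": {"p1"},
    "q2": {"p1", "p2"},
    "q3": {"p3"},
    "p1": {"q1", "q2"},
    "p2": {"q2"},
    "p3": {"q3"},
}
similarity = simrank(clicks)
```

Here `similarity["q1"]["q2"]` is positive because both queries led to a click on `p1`, while `similarity["q1"]["q3"]` stays at zero because the two queries have no clicked pages in common, directly or indirectly.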
On relevance, time and query expansion BIBAFull-Text 1973-1976
  Giuseppe Amodeo; Giambattista Amati; Giorgio Gambosi
We present the results of our exploratory analysis of the relationship between relevance and time. We observe how the number of documents published in a given time interval is related to the probability of relevance and, using time series analysis, we show the existence of a correlation between time and relevance. As an initial application of this analysis, we study query expansion that exploits the detection of publication time peaks over the Blog06 collection. Finally, we propose an effective approach to query expansion in the blog search domain. Our approach is based on the document publication trend and is therefore completely independent of any external resource.
Diverse retrieval via greedy optimization of expected 1-call@k in a latent subtopic relevance model BIBAFull-Text 1977-1980
  Scott Sanner; Shengbo Guo; Thore Graepel; Sadegh Kharazmi; Sarvnaz Karimi
It has been previously observed that optimization of the 1-call@k relevance objective (i.e., a set-based objective that is 1 if at least one document is relevant, otherwise 0) empirically correlates with diverse retrieval. In this paper, we proceed one step further and show theoretically that greedily optimizing expected 1-call@k w.r.t. a latent subtopic model of binary relevance leads to a diverse retrieval algorithm sharing many features of existing diversification approaches. This new result is complementary to a variety of diverse retrieval algorithms derived from alternate rank-based relevance criteria such as average precision and reciprocal rank. As such, the derivation presented here for expected 1-call@k provides a novel theoretical perspective on the emergence of diversity via a latent subtopic model of relevance -- an idea underlying both ambiguous and faceted subtopic retrieval that have been used to motivate diverse retrieval.
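To make the objective concrete, here is a minimal sketch of greedy selection under expected 1-call@k. It assumes that document relevance is independent given the subtopic, so the probability that at least one selected document is relevant is a prior-weighted sum over subtopics. This independence assumption, the function names and the toy probabilities are illustrative and are not taken from the paper's derivation.

```python
def expected_one_call(selected, topic_prior, rel):
    """P(at least one selected doc is relevant), marginalised over latent subtopics."""
    total = 0.0
    for topic, prior in topic_prior.items():
        miss = 1.0  # probability that every selected doc is irrelevant for this subtopic
        for doc in selected:
            miss *= 1.0 - rel[doc].get(topic, 0.0)
        total += prior * (1.0 - miss)
    return total

def greedy_select(docs, topic_prior, rel, k):
    """Repeatedly add the document that most increases expected 1-call."""
    selected, remaining = [], list(docs)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda d: expected_one_call(selected + [d], topic_prior, rel))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical subtopic priors P(t|q) and relevance probabilities P(rel|d,t).
topic_prior = {"t1": 0.6, "t2": 0.4}
rel = {"d1": {"t1": 0.9}, "d2": {"t1": 0.8}, "d3": {"t2": 0.7}}
ranking = greedy_select(["d1", "d2", "d3"], topic_prior, rel, k=2)
```

With these numbers the second pick is d3 rather than d2, even though d2 has the higher stand-alone score: once d1 covers subtopic t1, a document about t2 adds more to the objective. That is the diversification effect the abstract describes.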
Hybrid models for future event prediction BIBAFull-Text 1981-1984
  Giuseppe Amodeo; Roi Blanco; Ulf Brefeld
We present a hybrid method to turn off-the-shelf information retrieval (IR) systems into future event predictors. Given a query, a time series model is trained on the publication dates of the retrieved documents to capture trends and periodicity of the associated events. The periodicity of historic data is used to estimate a probabilistic model to predict future bursts. Finally, a hybrid model is obtained by intertwining the probabilistic and the time-series model. Our empirical results on the New York Times corpus show that autocorrelation functions of time-series suffice to classify queries accurately and that our hybrid models lead to more accurate future event predictions than baseline competitors.
Adaptive term frequency normalization for BM25 BIBAFull-Text 1985-1988
  Yuanhua Lv; ChengXiang Zhai
A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that in order to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function to predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
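For reference, the role of k1 can be seen in the standard BM25 TF component. The sketch below shows only this well-known formula; it does not reproduce the paper's information-gain estimate of a term-specific k1, and the default parameter values are conventional choices rather than values from the paper.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Standard BM25 term-frequency component with document length normalization."""
    if tf <= 0:
        return 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Sublinear in tf and bounded above by k1 + 1.
    return tf * (k1 + 1.0) / (tf + norm)

# How fast the score saturates depends on k1 (illustrative values).
for k1 in (0.5, 1.2, 2.0):
    print(k1, [round(bm25_tf(tf, k1=k1), 3) for tf in (1, 2, 5, 20)])
```

A small k1 makes the component saturate after a few occurrences, while a larger k1 keeps rewarding repeated occurrences for longer, which is why a single global setting may not suit every term.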
An unsupervised ranking method based on a technical difficulty terrain BIBAFull-Text 1989-1992
  Shoaib Jameel; Wai Lam; Ching-man Au Yeung; Sheaujiun Chyan
Users look for information that suits their level of expertise, but it often takes a mammoth effort to trace such information. One has to sift through multiple pages to look for one that fits the appropriate technical background. In this paper, a query-independent ranking system is proposed for technical web pages. The pages returned by the system are sorted by their relative technical difficulty, in either ascending or descending order as specified by the user. The technical difficulty of a document, i.e., of its terms in sequence, is first computed by combining each individual term's geometry in the low-dimensional latent semantic indexing (LSI) space, which can be visualized as a conceptual terrain. Then the pages are ranked based on the expected cost to get over the terrain. Results indicate that our terrain-based method outperforms traditional readability measures.
When close enough is good enough: approximate positional indexes for efficient ranked retrieval BIBAFull-Text 1993-1996
  Tamer Elsayed; Jimmy Lin; Donald Metzler
Previous research has shown that features based on term proximity are important for effective retrieval. However, they incur substantial costs in terms of larger inverted indexes and slower query execution times as compared to term-based features. This paper explores whether term proximity features based on approximate term positions are as effective as those based on exact term positions. We introduce the novel notion of approximate positional indexes based on dividing documents into coarse-grained buckets and recording term positions with respect to those buckets. We propose different approaches to defining the buckets and compactly encoding bucket ids. In the context of linear ranking functions, experimental results show that features based on approximate term positions are able to achieve effectiveness comparable to exact term positions, but with smaller indexes and faster query evaluation.
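As a rough illustration of recording positions at bucket granularity, the sketch below uses fixed-width buckets and stores each bucket id at most once per term. The abstract mentions several ways of defining buckets and of encoding their ids; the fixed width, the function names and the gap feature here are assumptions for illustration only.

```python
def build_bucket_index(tokens, bucket_width):
    """Map each term to the sorted list of bucket ids in which it occurs."""
    index = {}
    for position, term in enumerate(tokens):
        bucket = position // bucket_width
        buckets = index.setdefault(term, [])
        if not buckets or buckets[-1] != bucket:
            buckets.append(bucket)  # repeated hits in one bucket are stored once
    return index

def min_bucket_gap(index, term_a, term_b):
    """Approximate proximity: smallest distance between buckets holding the two terms."""
    a, b = index.get(term_a, []), index.get(term_b, [])
    if not a or not b:
        return None
    return min(abs(x - y) for x in a for y in b)

index = build_bucket_index("a b c d e f a".split(), bucket_width=3)
```

A gap of 0 means the two terms co-occur within one bucket, i.e., fewer than `bucket_width` positions apart; exact distances are no longer recoverable, which is the trade made for a smaller index.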
Index tuning for query-log based on-line index maintenance BIBAFull-Text 1997-2000
  Sairam Gurajada; A Sreenivasa Kumar P.
The existing query-log based on-line index maintenance approaches rely on the frequency distribution of terms in a static query-log. Though these approaches have proved to be efficient, in the real world the frequency distribution of the terms changes over time. This negatively affects the efficiency of the static query-log based approaches. To overcome this problem, we propose an index tuning strategy for reorganizing the indexes according to the latest frequency distribution of the terms captured from query-logs. Experimental results show that the proposed tuning strategy improves the performance of static query-log based approaches.
Efficient phrase querying with flat position index BIBAFull-Text 2001-2004
  Dongdong Shan; Wayne Xin Zhao; Jing He; Rui Yan; Hongfei Yan; Xiaoming Li
A large proportion of search engine queries contain phrases, namely sequences of adjacent words. In this paper, we propose to use a flat position index (a.k.a. schema-independent index) for phrase query evaluation. In the flat position index, the entire document collection is viewed as a huge sequence of tokens. Each token is represented by one flat position, which is a unique position offset from the beginning of the collection. Each indexed term is associated with a list of the flat positions of that term in the sequence. To recover DocID from flat positions efficiently, we propose a novel cache sensitive look-up table (CSLT), which is much faster than existing search algorithms. Experiments on the TREC GOV2 data collection show that the flat position index can reduce the index size and speed up phrase querying substantially, compared with a traditional word-level index.
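The flat position idea can be sketched in a few lines. Note that the DocID lookup below uses an ordinary binary search over document start offsets as a stand-in; it is not the cache sensitive look-up table (CSLT) proposed in the paper, and all function names are hypothetical.

```python
from bisect import bisect_right

def build_flat_index(docs):
    """Treat the collection as one token sequence; record each term's flat positions."""
    index, doc_starts, offset = {}, [], 0
    for tokens in docs:
        doc_starts.append(offset)  # flat position of the document's first token
        for term in tokens:
            index.setdefault(term, []).append(offset)
            offset += 1
    return index, doc_starts

def doc_of(doc_starts, flat_position):
    """Recover the DocID of a flat position by binary search (CSLT stand-in)."""
    return bisect_right(doc_starts, flat_position) - 1

def phrase_query(index, doc_starts, phrase):
    """Return DocIDs that contain the phrase as adjacent tokens."""
    if not phrase:
        return []
    position_sets = [set(index.get(term, [])) for term in phrase]
    hits = []
    for start in index.get(phrase[0], []):
        end = start + len(phrase) - 1
        adjacent = all(start + i in position_sets[i] for i in range(1, len(phrase)))
        # A match must not straddle a document boundary.
        if adjacent and doc_of(doc_starts, start) == doc_of(doc_starts, end):
            doc = doc_of(doc_starts, start)
            if not hits or hits[-1] != doc:
                hits.append(doc)
    return hits

docs = [["new", "york", "city"], ["york", "new"], ["in", "new", "york"]]
index, doc_starts = build_flat_index(docs)
```

Here "city" (last token of document 0) and "york" (first token of document 1) are adjacent in the flat sequence, so the boundary check is what prevents a false match for the phrase "city york".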
Trained trigger language model for sentence retrieval in QA: bridging the vocabulary gap BIBAFull-Text 2005-2008
  Saeedeh Momtazi; Dietrich Klakow
We propose a novel language model for sentence retrieval in Question Answering (QA) systems called trained trigger language model. This model addresses the word mismatch problem in information retrieval. The proposed model captures pairs of trigger and target words while training on a large corpus. The word pairs are extracted based on both unsupervised and supervised approaches while different notions of triggering are used. In addition, we study the impact of corpus size and domain for a supervised model. All notions of the trained trigger model are finally used in a language model-based sentence retrieval framework. Our experiments on TREC QA collection verify that the proposed model significantly improves the sentence retrieval performance compared to the state-of-the-art translation model and class model which address the same problem.
Topic modeling for named entity queries BIBAFull-Text 2009-2012
  Xiaobing Xue; Xiaoxin Yin
Named entities are observed in a large portion of web search queries (named entity queries), where each entity can be associated with many different query terms that refer to various aspects of this entity. Organizing these query terms into topics helps understand major search intents about entities and the discovered topics are useful for applications such as query suggestion. Furthermore, we notice that named entities can often be organized into categories and those from the same category share many generic topics. Therefore, working on a category of named entities instead of individual ones helps avoid the problems caused by the sparsity and noise in the data. In this paper, Named Entity Topic Model (NETM) is proposed to discover generic topics for a category of named entities, where the quality of the generic topics is improved through the model design and the parameter initialization. Experiments based on query log data show that NETM discovers high-quality topics and outperforms the state-of-the-art techniques by 12.8% based on F1 measure.
Semantic convolution kernels over dependency trees: smoothed partial tree kernel BIBAFull-Text 2013-2016
  Danilo Croce; Alessandro Moschitti; Roberto Basili
In recent years, natural language processing techniques have been used more and more in IR. Among others, syntactic and semantic parsing are effective methods for the design of complex applications such as question answering and sentiment analysis. Unfortunately, extracting feature representations suitable for machine learning algorithms from linguistic structures is typically difficult. In this paper, we describe one of the most advanced pieces of technology for automatic engineering of syntactic and semantic patterns. This method merges convolution dependency tree kernels with lexical similarities. It can efficiently and effectively measure the similarity between dependency structures whose lexical nodes are partly or completely different. Its use in powerful algorithms such as Support Vector Machines (SVMs) allows for fast design of accurate automatic systems.
   We report some experiments on question classification, which show an unprecedented result, e.g., a 41% error reduction over the former state of the art, along with an analysis of the nice properties of the approach.
Recommending citations with translation model BIBAFull-Text 2017-2020
  Yang Lu; Jing He; Dongdong Shan; Hongfei Yan
Citation recommendation helps an author find the papers or books that can support the material she is writing about. It is a challenging problem, since the vocabulary used in the content of papers and the vocabulary used in citation contexts are usually quite different. To address this problem, we propose to use a translation model, which can bridge the gap between two heterogeneous languages. We conduct an experiment and find that the translation model can provide much better citation candidates than the state-of-the-art methods.
Extracting adjective facets from community Q&A corpus BIBAFull-Text 2021-2024
  Takehiro Yamamoto; Satoshi Nakamura; Katsumi Tanaka
In this paper, we propose a method for helping users explore information via Web searches by using a question and answer (Q&A) corpus archived in a community Q&A site. When users do not have clear information needs and have little knowledge about the task domain, it is difficult for them to create queries that adequately reflect their information needs. We focused on terms like "famous temples," "historical townscapes," and "delicious sweets," which we call "adjective facets", and developed a method of extracting these facets from question and answer archives at a community Q&A site. We evaluated the effectiveness of our adjective facets by comparing them with several baselines.
A novel framework of training hidden Markov support vector machines from lightly-annotated data BIBAFull-Text 2025-2028
  Deyu Zhou; Yulan He
Natural language understanding (NLU) aims to map sentences to their semantic meaning representations. Statistical approaches to NLU normally require fully-annotated training data where each sentence is paired with its word-level semantic annotations. In this paper, we propose a novel learning framework which trains the Hidden Markov Support Vector Machines (HM-SVMs) without the use of expensive fully-annotated data. In particular, our learning approach takes as input a training set of sentences labeled with abstract semantic annotations encoding underlying embedded structural relations and automatically induces derivation rules that map sentences to their semantic meaning representations. The proposed approach has been tested on the DARPA Communicator Data and achieved 93.18% in F-measure, which outperforms the previously proposed approaches of training the hidden vector state model or conditional random fields from unaligned data, with relative error reduction rates of 43.3% and 10.6%, respectively.
Learning to recommend questions based on public interest BIBAFull-Text 2029-2032
  Jun Wang; Xia Hu; Zhoujun Li; Wenhan Chao; Biyun Hu
This paper is concerned with the problem of question recommendation in the setting of Community Question Answering (CQA). Given a question as a query, our goal is to rank all of the retrieved questions according to their likelihood of being good recommendations for the query. In this paper, we propose a notion of public interest, and show how public interest can boost the performance of question recommendation. In particular, to model public interest in question recommendation, we build a language model that combines a relevance score with respect to the query and a popularity score for the question. Experimental results on a Yahoo! Answers dataset demonstrate that the performance of question recommendation can be greatly improved by considering public interest.
CQC: classifying questions in CQA websites BIBAFull-Text 2033-2036
  Amit Singh; Karthik Visweswariah
Community Question Answering portals like Yahoo! Answers have recently become a popular method for seeking information online. Users express their information need as questions for which other users generate potential answers. These questions are organized into pre-defined hierarchical categories to facilitate effective answering; hence, Question Classification is an important aspect of these systems. In this paper we propose a novel system, CQC, for automatically classifying new questions into one of the hierarchical categories. Experiments conducted on large-scale real data from Yahoo! Answers show that the proposed techniques are effective and outperform existing methods significantly.
Automatic query reformulation with syntactic operators to alleviate search difficulty BIBAFull-Text 2037-2040
  Huizhong Duan; Rui Li; ChengXiang Zhai
Modern search engines usually provide a query language with a set of advanced syntactic operators (e.g., plus sign to require a term's appearance, or quotation marks to require a phrase's appearance) which if used appropriately, can significantly improve the effectiveness of a plain keyword query. However, they are rarely used by ordinary users due to the intrinsic difficulties and users' lack of corpora statistics. In this paper, we propose to automatically reformulate queries that do not work well by selectively adding syntactic operators. Particularly, we propose to perform syntactic operator-based query reformulation when a retrieval system detects users encounter difficulty in search as indicated by users' behaviors such as scanning over top k documents without click-through. We frame the problem of automatic reformulation with syntactic operators as a supervised learning problem, and propose a set of effective features to represent queries with syntactic operators. Experiment results verify the effectiveness of the proposed method and its applicability as a query suggestion mechanism for search engines. As a negative feedback strategy, syntactic operator-based query reformulation also shows promising results in improving search results for difficult queries as compared with existing methods.
Question routing in community question answering: putting category in its place BIBAFull-Text 2041-2044
  Baichuan Li; Irwin King; Michael R. Lyu
This paper investigates a ground-breaking incorporation of question category into Question Routing (QR) in Community Question Answering (CQA) services. The incorporation of question category was designed to estimate answerer expertise for routing questions to potential answerers. Two category-sensitive Language Models (LMs) were developed and evaluated on large-scale real-world data sets. Results demonstrated that higher question routing accuracy was achieved at lower computational cost, relative to the traditional Query Likelihood LM (QLLM), the state-of-the-art Cluster-Based LM (CBLM) and the mixture of Latent Dirichlet Allocation and QLLM (LDALM).
Fact-based question decomposition for candidate answer re-ranking BIBAFull-Text 2045-2048
  Aditya Kalyanpur; Siddharth Patwardhan; Branimir Boguraev; Adam Lally; Jennifer Chu-Carroll
Factoid questions often contain one or more assertions (facts) about their answers. However, existing question-answering (QA) systems have not investigated how the multiple facts may be leveraged to enhance system performance. We argue that decomposing complex factoid questions can benefit QA, as an answer candidate is more likely to be correct if multiple independent facts support it. We categorize decomposable questions as parallel or nested, depending on processing strategy required. We present a novel decomposition framework -- for parallel and nested questions -- which can be overlaid on top of traditional QA systems. It contains decomposition rules for identifying fact sub-questions, a question-rewriting component and a candidate re-ranker. In a particularly challenging domain for our baseline QA system, our framework shows a statistically significant improvement in end-to-end QA performance.
CoDet: sentence-based containment detection in news corpora BIBAFull-Text 2049-2052
  Emre Varol; Fazli Can; Cevdet Aykanat; Oguz Kaya
We study a generalized version of the near-duplicate detection problem which concerns whether a document is a subset of another document. In text-based applications, document containment can be observed in exact-duplicates, near-duplicates, or containments, where the first two are special cases of the third. We introduce a novel method, called CoDet, which focuses particularly on this problem, and compare its performance with four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) that are adapted to containment detection. Our method is expandable to different domains, and especially suitable for streaming news. Experimental results show that CoDet effectively and efficiently produces remarkable results in detecting containments.
Smoothing NDCG metrics using tied scores BIBAFull-Text 2053-2056
  Andrey Kustarev; Yury Ustinovsky; Yury Logachev; Evgeny Grechnikov; Ilya Segalovich; Pavel Serdyukov
One of the promising directions in research on learning to rank concerns the problem of choosing an appropriate objective function to maximize by means of machine learning algorithms. We describe a novel technique for smoothing an arbitrary ranking metric and demonstrate how to utilize it to maximize the retrieval quality in terms of the NDCG metric. The idea behind our listwise ranking model, called TieRank, is artificial probabilistic tying of predicted relevance scores at each iteration of the learning process, which defines a distribution on the set of all permutations of retrieved documents. Such a distribution provides a desired smoothed version of the target retrieval quality metric. This smooth function can be maximized using a gradient descent method. Experiments on LETOR collections show that TieRank outperforms most of the existing learning to rank algorithms.
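For context, the metric being smoothed can be computed as below. The sketch uses the common exponential-gain, log2-discount form of DCG, which is one of several variants in use; it does not implement TieRank's probabilistic tying, and the function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with exponential gain and log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """NDCG of a ranked list of graded relevance labels, optionally cut off at k."""
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0
```

Because NDCG depends on the predicted scores only through the ordering they induce, it is piecewise constant in those scores and its gradient is zero almost everywhere, which is why a smoothed surrogate is needed before gradient-based optimization can be applied.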
Learning to rank with cross entropy BIBAFull-Text 2057-2060
  Yuan Lin; Hongfei Lin; Jiajin Wu; Kan Xu
Learning to rank algorithms are usually grouped into three types according to their input spaces: the pointwise approach, the pairwise approach, and the listwise approach. Much of the prior work is based on these three approaches to learning a ranking model that predicts the relevance of a document to a query. In this paper, we focus on the problem of constructing a new input space based on groups of documents with the same relevance judgment. A novel approach based on cross entropy is proposed to improve the existing ranking method. The experimental results show that our approach leads to significant improvements in retrieval effectiveness.
Predicting document effectiveness in pseudo relevance feedback BIBAFull-Text 2061-2064
  Mostafa Keikha; Jangwon Seo; W. Bruce Croft; Fabio Crestani
Pseudo relevance feedback (PRF) is one of the effective practices in Information Retrieval. In particular, PRF via the relevance model (RM) has been widely used due to its theoretical soundness and effectiveness. In a PRF scenario, an underlying relevance model is inferred by combining language models of the top retrieved documents, where the contribution of each document is assumed to be proportional to its score for the initial query. However, it is not clear that selecting the top retrieved documents only by the initial retrieval scores is actually the optimal way to perform query expansion.
   We show that the initial score of a document is not a good indicator of its effectiveness in query expansion. Our experiments show that if we can estimate the true effectiveness of the top retrieved documents, we can obtain almost 50% improvement over RM. Based on this observation, we introduce various document features that can be used to estimate the effectiveness of documents. Our experiments on the TREC Robust collection show that the proposed features make good predictors, and PRF using the effectiveness predictors can achieve statistically significant improvements over RM.
Learning to rank categories for web queries BIBAFull-Text 2065-2068
  Prashant V. Ullegaddi; Vasudeva Varma
In web search, understanding the user intent plays an important role in improving search experience of the end users. Such an intent can be represented by the categories which the user query belongs to. In this work, we propose an information retrieval based approach to query categorization with an emphasis on learning category rankings. To carry out categorization we first represent a category by web documents (from Open Directory Project) that describe the semantics of the category. Then, we learn the category rankings for the queries using 'learning to rank' techniques. To show that the results obtained are consistent and do not vary across datasets, we evaluate our approach on two datasets including the publicly available KDD Cup dataset. We report an overall improvement of 20% on all evaluation metrics (precision, recall and F-measure) over two baselines: a text categorization baseline and an unsupervised IR baseline.
Supervised language modeling for temporal resolution of texts BIBAFull-Text 2069-2072
  Abhimanu Kumar; Matthew Lease; Jason Baldridge
We investigate temporal resolution of documents, such as determining the date of publication of a story based on its text. We describe and evaluate a model that builds histograms encoding the probability of different temporal periods for a document. We construct histograms based on the Kullback-Leibler Divergence between the language model for a test document and supervised language models for each interval. Initial results indicate this language modeling approach is effective for predicting the dates of publication of short stories, which contain few explicit mentions of years.
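The comparison between a document language model and per-interval language models can be sketched as follows. The unigram models, add-one smoothing, shared vocabulary and toy period texts are all assumptions made for illustration; the abstract does not specify these details, and the step that turns divergences into a histogram is omitted.

```python
import math
from collections import Counter

def smoothed_lm(tokens, vocab):
    """Unigram language model with add-one smoothing over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def rank_periods(doc_tokens, period_tokens):
    """Order candidate periods by divergence from the document model (closest first)."""
    vocab = set(doc_tokens)
    for tokens in period_tokens.values():
        vocab.update(tokens)
    doc_lm = smoothed_lm(doc_tokens, vocab)
    scores = {period: kl_divergence(doc_lm, smoothed_lm(tokens, vocab))
              for period, tokens in period_tokens.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical training text per period.
periods = {
    "1900s": "steam telegraph railway steam".split(),
    "2000s": "internet web email web".split(),
}
ranked = rank_periods("telegraph steam".split(), periods)
```

The period with the smallest divergence is the best guess for the document's date; smoothing keeps every probability non-zero so the logarithm is always defined.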
Context-aware query recommendation by learning high-order relation in query logs BIBAFull-Text 2073-2076
  Xiaohui Yan; Jiafeng Guo; Xueqi Cheng
Query recommendation has been widely used in modern search engines. Recently, several context-aware methods have been proposed to improve the accuracy of recommendation by mining query sequence patterns from query sessions. However, the existing methods usually do not address the ambiguity of queries explicitly and often suffer from the sparsity of the training data. In this paper, we propose a novel context-aware query recommendation approach by modeling the high-order relation between queries and clicks in query log, which captures users' latent search intents. Empirical experiment results demonstrate that our approach outperforms the baseline methods in providing high quality recommendations for ambiguous queries.
Efficient lp-norm multiple feature metric learning for image categorization BIBAFull-Text 2077-2080
  Shuhui Wang; Qingming Huang; Shuqiang Jiang; Qi Tian
Previous metric learning approaches are only able to learn the metric based on single concatenated multivariate feature representation. However, for many real world problems with multiple feature representation such as image categorization, the model trained by previous approaches will degrade because of sparsity brought by significant dimension growth and uncontrolled influence from each feature channel. In this paper, we propose an efficient distance metric learning model which adapts Distance Metric Learning on multiple feature representations. The aim is to learn the Mahalanobis matrices for each independent feature and their non-sparse lp-norm weight coefficients simultaneously by maximizing the margin of the overall learned distance metric among the pairs from the same class and the distance of pairs from different classes. We further extend this method to nonlinear kernel learning and category specific metric learning, which demonstrate the applicability of using many existing kernels for image data and exploring the hierarchical semantic structures for large scale image datasets. Experiments on various datasets demonstrate the promising power of our method.
Re-ranking by local re-scoring for video indexing and retrieval BIBAFull-Text 2081-2084
  Bahjat Safadi; Georges Quénot
Video retrieval can be done by ranking the samples according to the probability scores predicted by classifiers. It is often possible to improve the retrieval performance by re-ranking the samples. In this paper, we propose a re-ranking method that improves the performance of semantic video indexing and retrieval by re-evaluating the scores of the shots according to the homogeneity and the nature of the video they belong to. Compared to previous works, the proposed method provides a framework for re-ranking via the homogeneous distribution of video shot content in a temporal sequence. The experimental results showed that the proposed re-ranking method was able to improve the system performance by about 18% on average on the TRECVID 2010 semantic indexing task, a video collection with homogeneous contents. For TRECVID 2008, a collection of videos with non-homogeneous contents, the system performance was improved by about 11-13%.
Tightly coupling visual and linguistic features for enriching audio-based web browsing experience BIBAFull-Text 2085-2088
  Muhammad Asiful Islam; Faisal Ahmed; Yevgen Borodin; I. V. Ramakrishnan
People who are blind use screen readers for browsing web pages. Since screen readers read out content serially, a naive readout tends to mix irrelevant and relevant content thereby disrupting the coherency of the material being read out and confusing the listener. To address this problem we can partition web pages into coherent segments and narrate each such piece separately. Extant methods to do segmentation use visual and structural cues without taking the semantics into account and consequently create segments containing irrelevant material. In this paper, we describe a new technique for creating coherent segments by tightly coupling visual, structural, and linguistic features present in the content. A notable aspect of the technique is that it produces segments with little irrelevant content. Preliminary experiments indicate that the technique is effective in creating highly coherent segments and the experiences of an early adopter who is blind suggest that it enriches the overall browsing experience.
Robust video fingerprinting based on hierarchical symmetric difference feature BIBAFull-Text 2089-2092
  Jungho Lee; Seungjae Lee; Yongseok Seo; Wonyoung Yoo
The piracy of copyrighted digital content over the Internet infringes copyrights and damages the digital content industry. Accordingly, technology for identifying and monitoring content on online content services, such as fingerprinting, is becoming more valuable with the explosion of digital content sharing. This paper proposes a robust video fingerprinting feature to identify a modified video clip from a large-scale database. A hierarchical symmetric difference feature is proposed in order to offer efficient video fingerprinting. The feature is robust and pairwise independent against various video modifications such as compression, resizing, or cropping. Moreover, videos undergoing a transformation such as flipping or mirroring can be identified by simply disordering the bit pattern of fingerprints. The performance of the proposed feature is extensively evaluated on a 6,482-hour database, and the experimental results show that the proposed fingerprinting is efficient and robust against various modifications.
Image clustering fusion technique based on BFS BIBAFull-Text 2093-2096
  Luca Costantini; Raffaele Nicolussi
With the increase in the number and size of databases dedicated to the storage of visual content, the need for effective retrieval systems has become crucial. The proposed method makes a significant contribution to meeting this need through a technique in which sets of clusters are fused together to create a unique and more significant set of clusters. The images are represented by some features and are then grouped by these features, which are considered one by one. A probability matrix is then built and explored by the breadth-first search algorithm with the aim of selecting a unique set of clusters. Experimental results, obtained using two different datasets, show the effectiveness of the proposed technique. Furthermore, the proposed approach overcomes the drawback of tuning a set of parameters that fuse the similarity measurements obtained from each feature to get an overall similarity between two images.
Efficient retrieval of 3D building models using embeddings of attributed subgraphs BIBAFull-Text 2097-2100
  Raoul Wessel; Sebastian Ochmann; Richard Vock; Ina Blümel; Reinhard Klein
We present a novel method for retrieval and classification of 3D building models that is tailored to the specific requirements of architects. In contrast to common approaches our algorithm relies on the interior spatial arrangement of rooms instead of exterior geometric shape. We first represent the internal topological building structure by a Room Connectivity Graph (RCG). To enable fast and efficient retrieval and classification with RCGs, we transform the structured graph representation into a vector-based one by introducing a new concept of subgraph embeddings. We provide comprehensive experiments showing that the introduced subgraph embeddings yield superior performance compared to state-of-the-art graph retrieval approaches.
Constructing seminal paper genealogy BIBAFull-Text 2101-2104
  Duck-Ho Bae; Se-Mi Hwang; Sang-Wook Kim; Christos Faloutsos
When a researcher starts with a new topic, it would be very useful if seminal papers in the topic and their relationships are provided in advance. We propose an approach to construct seminal paper genealogy and show the effectiveness and efficiency of our approach.
Leveraging Wikipedia concept and category information to enhance contextual advertising BIBAFull-Text 2105-2108
  Zongda Wu; Guandong Xu; Rong Pan; Yanchun Zhang; Zhiwen Hu; Jianfeng Lu
As a prevalent type of Web advertising, contextual advertising refers to the placement of the most relevant ads into a Web page, so as to increase the number of ad-clicks. However, some problems of homonymy and polysemy, low intersection of keywords etc., can lead to the selection of irrelevant ads for a page. In this paper, we present a new contextual advertising approach to overcome the problems, which uses Wikipedia concept and category information to enrich the content representation of an ad (or a page). First, we map each ad and page into a keyword vector, a concept vector and a category vector. Next, we select the relevant ads for a given page based on a similarity metric that combines the above three feature vectors together. Last, we evaluate our approach by using real ads, pages, as well as a great number of concepts and categories of Wikipedia. Experimental results show that our approach can improve the precision of ads-selection effectively.
Beyond relevance in marketplace search BIBAFull-Text 2109-2112
  Nish Parikh; Neel Sundaresan
In this paper we study diversity and its relations to search relevance in the context of an online marketplace. We conduct a large-scale log-based study using click-stream data from a leading eCommerce site. We introduce 3 main metrics -- selection (diversity), trust, and value. In our analysis we also show how these interact with relevance in different ways. We study the benefits of diversity and also show why guaranteeing diversity is important.
Relative effect of spam and irrelevant documents on user interaction with search engines BIBAFull-Text 2113-2116
  Timothy Jones; David Hawking; Paul Thomas; Ramesh Sankaranarayana
Meaningful evaluation of web search must take account of spam. Here we conduct a user experiment to investigate whether satisfaction with search engine result pages as a whole is harmed more by spam or by irrelevant documents. On some measures, search result pages are differentially harmed by the insertion of spam and irrelevant documents. Additionally we find that when users are given two documents of equal utility, the one with the lower spam score will be preferred; a result page without any spam documents will be preferred to one with spam; and an irrelevant document high in a result list is surprisingly more damaging to user satisfaction than a spam document. We conclude that web ranking and evaluation should consider both utility (relevance) and "spamminess" of documents.
Inferring query aspects from reformulations using clustering BIBAFull-Text 2117-2120
  Van Dang; Xiaobing Xue; W. Bruce Croft
When the information need is not clear from the user query, a good strategy would be to return documents that cover as many aspects of the query as possible. To do this, the possible aspects of the query need to be automatically identified. In this paper, we propose to do this by clustering reformulated queries generated from publicly available resources and using each cluster to represent an aspect of the query. Our results show that the automatically generated reformulations for the TREC Web Track queries match up quite well with actual sub-topics of these queries identified by TREC experts. Moreover, agglomerative clustering using query-to-query similarity based on co-occurrence in text passages can provide clusters of high quality that potentially can be used to identify aspects.
Advertiser-centric approach to understand user click behavior in sponsored search BIBAFull-Text 2121-2124
  Sungchul Kim; Tao Qin; Hwanjo Yu; Tie-Yan Liu
Sponsored search is the major business model of commercial search engines. The number of clicks on ads is a key indicator of success for both advertisers and search engines, and increasing ad clicks is a goal of both. Many existing works take the view of search engines, focusing on how to help search engines earn more revenue by accurately predicting ad clicks. Unlike the existing works, this paper aims at understanding user clicks on ads from "the view of advertisers", in order to help advertisers improve their ad quality and therefore advertising effectiveness. To do this, a factor graph model is proposed, which considers two advertiser-controllable factors to understand user click behaviors: the relevance between a query and an ad, which has been well studied in the previous literature, and the "attractiveness" of the ad, which is a newly proposed concept. The proposed model can be used to predict user clicks and also to mine a set of attractive words that could be leveraged to improve the quality of the ads. We have verified the effectiveness of the proposed approach using real-world datasets, through quantitative evaluations and informative case studies.
Supervised matching of comments with news article segments BIBAFull-Text 2125-2128
  Dyut Kumar Sil; Srinivasan H. Sengamedu; Chiranjib Bhattacharyya
Comments constitute an important part of Web 2.0. In this paper, we consider comments on news articles. To simplify the task of relating the comment content to the article content the comments are about, we propose the idea of showing comments alongside article segments and explore automatic mapping of comments to article segments. This task is challenging because of the vocabulary mismatch between the articles and the comments. We present supervised and unsupervised techniques for aligning comments to the segments of the article the comments are about. More specifically, we provide a novel formulation of the supervised alignment problem using the framework of structured classification. Our experimental results show that the structured classification model performs better than unsupervised matching and the binary classification model.
User action interpretation for personalized content optimization in recommender systems BIBAFull-Text 2129-2132
  Anlei Dong; Jiang Bian; Xiaofeng He; Srihari Reddy; Yi Chang
User interaction plays a vital role in recommender systems. Previous studies on algorithmic recommender systems have mainly focused on modeling techniques and feature development. Traditionally, implicit user feedback or explicit user ratings on the recommended items form the basis for designing and training of recommendation algorithms. But user interactions in real-world Web applications (e.g., a portal website with different recommendation modules in the interface) are unlikely to be as ideal as those assumed by previously proposed models. To address this problem, we build an online learning framework for personalized recommendation. We argue that appropriate user action interpretation is critical for a recommender system. The main contribution in this paper is an approach of interpreting users' actions for the online learning to achieve better item relevance estimation. Our experiments on the large-scale data from a commercial Web recommender system demonstrate significant improvement in terms of a precision metric over the baseline model that does not incorporate user action interpretation. The efficacy of this new algorithm is also proved by the online test results on real user traffic.
A personalized recommendation system on scholarly publications BIBAFull-Text 2133-2136
  Maria Soledad Pera; Yiu-Kai Ng
Researchers, as well as ordinary users who seek information in diverse academic fields, turn to the web to search for publications of interest. Even though scholarly publication recommenders have been developed to facilitate the task of discovering literature pertinent to their users, they (i) are not personalized enough to meet users' expectations, since they provide the same suggestions to users sharing similar profiles/preferences, (ii) generate recommendations pertaining to each user's general interests as opposed to the specific need of the user, and (iii) fail to take full advantage of valuable user-generated data at social websites that can enhance their performance. To address these problems, we propose PubRec, a recommender that suggests closely-related references to a particular publication P tailored to a specific user U, which minimizes the time and effort imposed on U in browsing through general recommended publications. Empirical studies conducted using data extracted from CiteULike (i) verify the efficiency of the recommendation and ranking strategies adopted by PubRec and (ii) show that PubRec significantly outperforms other baseline recommenders.
Collaborative exploratory search in real-world context BIBAFull-Text 2137-2140
  Naoki Tani; Danushka Bollegala; Naiwala Chandrasiri; Keisuke Okamoto; Kazunari Nawa; Shuhei Iitsuka; Yutaka Matsuo
We propose Collaborative Exploratory Search (CES), which is an integration of dialog analysis and web search that involves multiparty collaboration to accomplish an exploratory information retrieval goal. Given a real-time dialog between users on a single topic, we define CES as the task of automatically detecting the topic of the dialog and retrieving task-relevant web pages to support the dialog. To recognize the task of the dialog, we apply the Author-Topic model as a topic model. Then, attribute extraction is applied to the dialog to obtain the attributes of the tasks. Finally, a specific search query is generated to identify the task-relevant information. We implement and evaluate the CES system for a commercial in-vehicle conversation. We also develop an iPad application that listens to conversations among users and continuously retrieves relevant web pages. Our experimental results reveal that the proposed method outperforms existing methods, which demonstrates the potential usefulness of collaborative exploratory search with practically usable accuracy levels.
Beyond precision@10: clustering the long tail of web search results BIBAFull-Text 2141-2144
  Benno Stein; Tim Gollub; Dennis Hoppe
The paper addresses the missing user acceptance of web search result clustering. We report on selected analyses and propose new concepts to improve existing result clustering approaches. Our findings in a nutshell are: 1. Don't compete with a search engine's top hits. In response to a query we presume search engines to return an optimal result list in the sense of the probabilistic ranking principle: documents that are expected by the majority of users are placed on top and form the result list head. We argue that, with respect to the top results, it is not beneficial to replace this established form of result presentation. 2. Improve document access in the result list tail. Documents that address the information need of "minorities" appear at some position in the result list tail. Especially for ambiguous and multi-faceted queries we expect this tail to be long, with many users appreciating different documents. In this situation web search result clustering can improve user satisfaction by reorganizing the long tail into topic-specific clusters. 3. Avoid shadowing when constructing cluster labels. We show that most of the cluster labels that are generated by current clustering technology occur within the snippets of the result list head -- an effect which we call shadowing. The value of such labels for topic organization and navigating within a clustering of the entire result list is limited. We propose and analyze a filtering approach to significantly alleviate the label shadowing effect.

Poster session: knowledge management

Spectral analysis of a blogosphere BIBAFull-Text 2145-2148
  Sang-Wook Kim; Ki-Nam Kim; Christos Faloutsos; Joon-Ho Lee
A blogosphere is a representative example of online social networks. In this paper, we address spectral analysis of a blogosphere. We model a real-world blogosphere as a matrix and a tensor, and then analyze it by using the SVD and PARAFAC decomposition. According to the results, the SVD successfully identified communities, each of which focuses on a specific topic, and also found hub blogs and authoritative posts within each community. The PARAFAC decomposition also succeeded in extracting more communities of finer granularity than the SVD. In addition, the PARAFAC decomposition could identify the dominant keywords in addition to the hub blogs and authoritative posts honored in each community.
Citation chain aggregation: an interaction model to support citation cycling BIBAFull-Text 2149-2152
  Timothy F. Cribbin
Citation chaining is a powerful means of exploring the academic literature. Starting from just one or two known relevant items, a naïve researcher can cycle backwards and forwards through the citation graph to generate a rich overview of key works, authors and journals relating to their topic. Whilst online citation indexes greatly facilitate this process, the size and complexity of the search space can rapidly escalate. In this paper, we propose a novel interaction model called citation chain aggregation (CCA). CCA employs a simple three-list view which highlights the overlaps that occur between the first-generation relations of known relevant items. As more relevant articles are identified, differences in the frequencies of citations made by or to unseen articles provide strong relevance feedback cues. The benefits of this technique are illustrated using a simple case study.
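The frequency cue described in this abstract can be illustrated with a minimal sketch. It is not the CCA interface itself: the function name, the dictionary-based citation graph, and the toy article identifiers are all hypothetical, chosen only to show how citations made by or to unseen articles might be counted against a set of known relevant items.

```python
from collections import Counter

def aggregate_citation_chains(seeds, cites):
    """Count, for each unseen article, how many seed articles cite it
    and how many seed articles it cites.

    seeds: set of article ids already judged relevant (hypothetical input).
    cites: dict mapping an article id to the set of ids it cites.
    """
    cited_by_seeds = Counter()  # unseen article -> number of seeds citing it
    citing_seeds = Counter()    # unseen article -> number of seeds it cites
    for article, references in cites.items():
        for ref in references:
            if article in seeds and ref not in seeds:
                cited_by_seeds[ref] += 1
            elif article not in seeds and ref in seeds:
                citing_seeds[article] += 1
    return cited_by_seeds, citing_seeds

# Hypothetical toy citation graph: A and B are the known relevant items.
cites = {
    "A": {"X", "Y"},
    "B": {"X"},
    "C": {"A", "B"},
    "D": {"A"},
}
refs, citers = aggregate_citation_chains({"A", "B"}, cites)
```

In this toy graph, X is cited by both seeds and C cites both seeds, so under the assumption that overlap signals relevance they would be listed ahead of Y and D.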
Collaborative blacklist generation via searches-and-clicks BIBAFull-Text 2153-2156
  Lung-Hao Lee; Hsin-Hsi Chen
This paper presents an intent conformity model to collaboratively generate blacklists for cyberporn filtering. A novel porn detection framework via searches-and-clicks is proposed to explore collective intelligence embedded in query logs. Firstly, the clicked pages are represented in terms of the weighted queries to reflect the degrees related to pornography. Subsequently, these weighted queries are regarded as discriminative features to calculate the pornography indicator by an inverse chi-square method for candidate determination. Finally, a candidate whose URL contains at least one pornographic keyword is included in our collaborative blacklists. The experiments on an MSN porn data set indicate that the generated blacklist achieves a high precision, while maintaining a favorably low false positive rate. In addition, real-life filtering simulations reveal that our blacklist is more effective than some publicly released blacklists.
Attention prediction on social media brand pages BIBAFull-Text 2157-2160
  Himabindu Lakkaraju; Jitendra Ajmera
In this paper, we deal with the problem of predicting how much attention a newly submitted post would receive from fellow community members of closed communities in social networking sites. Though the concept of attention is subjective, the number of comments received by a post serves as a very good indicator of the same. Unlike previous work which primarily made use of either content features or the network features (friendship links on the network), we exploit both the content features and community level features (for instance, what time of the day is the community more active) for tackling this problem. Further, we focus on dedicated pages of corporate brands on social media websites and accordingly extract important features from the content and community activity of such brand pages. The attention prediction task finds direct application in the listening, monitoring and engaging activities of the businesses that have such brand-pages. In this paper, we formulate the problem of attention prediction on social media brand pages.
   We further propose the Attention Prediction (AP) framework, which integrates the various features that influence the attention received by a post using classification and regression based approaches. Experimental results on real world data extracted from some highly active brand pages on Facebook demonstrate the efficacy of the proposed framework.
Do they belong to the same class: active learning by querying pairwise label homogeneity BIBAFull-Text 2161-2164
  Yifan Fu; Bin Li; Xingquan Zhu; Chengqi Zhang
Traditional active learning methods request experts to provide ground truths to the queried instances, which can be expensive in practice. An alternative solution is to ask nonexpert labelers to do such labeling work, but they cannot tell the definite class labels. In this paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked "whether a pair of instances belong to the same class". To instantiate the proposed paradigm, we adopt the MinCut algorithm as the base classifier. We first construct a graph based on the pairwise distance of all the labeled and unlabeled instances and then repeatedly update the unlabeled edge weights on the max-flow paths in the graph. Finally, we select an unlabeled subset of nodes with the highest prediction confidence as the labeled data, which are included into the labeled data set to learn a new classifier for the next round of active learning. The experimental results and comparisons with state-of-the-art methods demonstrate that our active learning paradigm can result in good performance with nonexpert labelers.
Structured data classification by means of matrix factorization BIBAFull-Text 2165-2168
  Paolo Garza
Singular Value Decomposition (SVD) has been extensively used in the classification context as a preprocessing step aiming to reduce the number of features of the input space. Traditional classification algorithms are then applied on the new space to generate accurate models. In this paper, we propose a different use of SVD. In our approach SVD is the building block of a new classification algorithm, called CMF, and not that of a feature reduction algorithm. In particular, we propose a new classification algorithm where the classification model corresponds to the k largest right singular vectors of the factorization of the training dataset obtained by applying SVD. The selected singular vectors allow representing the main "characteristics" of the training data and can be used to provide accurate predictions. The experiments performed on 15 structured UCI datasets show that CMF is efficient and, despite its simplicity, it is more accurate than many state-of-the-art classification algorithms.
Transfer active learning BIBAFull-Text 2169-2172
  Zhenfeng Zhu; Xingquan Zhu; Yangdong Ye; Yue-Fei Guo; Xiangyang Xue
Active learning traditionally assumes that labeled and unlabeled samples are subject to the same distributions and the goal of an active learner is to label the most informative unlabeled samples. In reality, situations may exist in which we do not have unlabeled samples from the same domain as the labeled samples (i.e. target domain), whereas samples from auxiliary domains might be available. Under such situations, an interesting question is whether an active learner can actively label samples from auxiliary domains to benefit the target domain. In this paper, we propose a transfer active learning method, namely Transfer Active SVM (TrAcSVM), which uses a limited number of target instances to iteratively discover and label informative auxiliary instances. TrAcSVM employs an extended sigmoid function as an instance weight updating approach to adjust the models for prediction of (newly arrived) target data. Experimental results on real-world data sets demonstrate that TrAcSVM obtains better efficiency and prediction accuracy than its peers.
A probabilistic approach to nearest-neighbor classification: naive hubness Bayesian kNN BIBAFull-Text 2173-2176
  Nenad Tomasev; Milos Radovanovic; Dunja Mladenic; Mirjana Ivanovic
Most machine-learning tasks, including classification, involve dealing with high-dimensional data. It was recently shown that the phenomenon of hubness, inherent to high-dimensional data, can be exploited to improve methods based on nearest neighbors (NNs). Hubness refers to the emergence of points (hubs) that appear among the k NNs of many other points in the data, and constitute influential points for kNN classification. In this paper, we present a new probabilistic approach to kNN classification, naive hubness Bayesian k-nearest neighbor (NHBNN), which employs hubness for computing class likelihood estimates. Experiments show that NHBNN compares favorably to different variants of the kNN classifier, including probabilistic kNN (PNN) which is often used as an underlying probabilistic framework for NN classification, signifying that NHBNN is a promising alternative framework for developing probabilistic NN algorithms.
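The notion of hubness used in this abstract can be made concrete by counting k-occurrences, that is, how often each point appears among the k nearest neighbors of the other points. The sketch below illustrates only that counting step. It is not the NHBNN classifier, and the function name, the Euclidean distance, the index-based tie-breaking, and the toy points are assumptions made for illustration.

```python
import math
from collections import Counter

def k_occurrences(points, k):
    """Return, for each point index, how often it appears among the
    k nearest neighbors of the other points (Euclidean distance).
    Ties are broken by index, which is an arbitrary choice."""
    counts = Counter({i: 0 for i in range(len(points))})
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        for j in others[:k]:
            counts[j] += 1
    return counts

# Hypothetical toy data: index 0 sits in the middle of a small group,
# index 4 is far away from everything.
points = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (5.0, 6.0)]
n1 = k_occurrences(points, k=1)
```

Here the central point is the nearest neighbor of three other points, while the distant point is nobody's nearest neighbor. A skewed distribution of these counts is what the abstract refers to as hubness.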
Representing document as dependency graph for document clustering BIBAFull-Text 2177-2180
  Yujing Wang; Xiaochuan Ni; Jian-Tao Sun; Yunhai Tong; Zheng Chen
In traditional clustering methods, a document is often represented as a "bag of words" (in the BOW model) or n-grams (in the suffix tree document model) without considering the natural language relationships between the words. In this paper, we propose a novel approach, DGDC (Dependency Graph-based Document Clustering algorithm), to address this issue. In our algorithm, each document is represented as a dependency graph where the nodes correspond to words which can be seen as meta-descriptions of the document, whereas the edges stand for the relations between pairs of words. A new similarity measure is proposed to compute the pairwise similarity of documents based on their corresponding dependency graphs. By applying the new similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, the final clusters of documents can be obtained. The experiments were carried out on five public document datasets. The empirical results indicate that the DGDC algorithm can achieve better performance in document clustering tasks compared with other approaches based on the BOW model and suffix tree document model.
Finding redundant and complementary communities in multidimensional networks BIBAFull-Text 2181-2184
  Michele Berlingerio; Michele Coscia; Fosca Giannotti
Community Discovery in networks is the problem of detecting, for each node, its membership to one or more groups of nodes, the communities, that are densely connected, or highly interactive. We define the community discovery problem in multidimensional networks, where more than one connection may reside between any two nodes. We also introduce two measures able to characterize the communities found. Our experiments on real world multidimensional networks support the methodology proposed in this paper, and open the way for a new class of algorithms, aimed at capturing the multifaceted complexity of connections among nodes in a network.
Promotional subspace mining with EProbe framework BIBAFull-Text 2185-2188
  Yan Zhang; Yiyu Jia; Wei Jin
In multidimensional data, Promotional Subspace Mining (PSM) aims to find outstanding subspaces for a given object, and to discover meaningful rules from them. In PSM, one major research issue is to produce top subspaces efficiently given a predefined subspace ranking measure. A common approach is to achieve an exact solution, which searches through the entire subspace search space and evaluates the target object's rank in every subspace, assisted with possible pruning strategies. In this paper, we propose EProbe, an Efficient Subspace Probing framework. This novel framework strives to introduce the idea of "early stop" of the top subspace search process. The essential goal is to provide a scalable, cost-effective, and flexible solution where accuracy can be traded for efficiency using adjustable parameters. This framework is especially useful when the computation resources are insufficient and only a limited number of candidate subspaces can be evaluated. As a first attempt to seek solutions under the EProbe framework, we propose two novel algorithms, SRatio and SlidingCluster. In our experiments, we illustrate that these two algorithms could produce a more effective subspace traversal order. Being effective, the top-k subspaces included in the final results are shown to be evaluated in the early stage of the subspace traversal process.
A partitioning method for symbolic interval data based on kernelized metric BIBAFull-Text 2189-2192
  Bruno Pimentel; Anderson Costa; Renata Souza
To handle situations with nonlinearly separable clusters, kernel clustering methods have been proposed. Symbolic Data Analysis (SDA) has emerged to deal with variables that can have intervals, histograms, and even functions as values, in order to consider the variability and/or uncertainty innate to the data. In this paper, we present a K-means clustering method based on kernelized squared L2 distance for symbolic interval-type data. Experiments with real and synthetic symbolic interval-type data sets are considered.
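A kernelized squared L2 distance is commonly obtained from the identity ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y), which lets a K-means-style method measure distances in the kernel-induced feature space without computing phi explicitly. The sketch below illustrates that identity only. The Gaussian kernel, the `sigma` parameter, and the encoding of an interval-valued object as a flat list of lower and upper bounds are assumptions made for illustration, and the abstract does not say which kernel or interval representation the paper uses.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_sq_distance(x, y, kernel=gaussian_kernel):
    """Squared distance in the feature space induced by the kernel:
    K(x, x) - 2 K(x, y) + K(y, y)."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def flatten_intervals(intervals):
    """Assumption: encode an interval-valued object by listing the lower
    and upper bound of each interval variable in order."""
    return [bound for lo, hi in intervals for bound in (lo, hi)]

# Hypothetical objects, each described by two interval variables.
a = flatten_intervals([(1.0, 2.0), (0.0, 0.5)])
b = flatten_intervals([(1.5, 2.5), (0.0, 1.0)])
d_ab = kernel_sq_distance(a, b)
```

With a Gaussian kernel, K(x, x) is 1, so the distance reduces to 2 - 2 K(x, y) and always lies between 0 and 2.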
Hierarchy evolution for improved classification BIBAFull-Text 2193-2196
  Xiaoguang Qi; Brian D. Davison
Hierarchical classification has been shown to have superior performance to flat classification. It is typically performed on hierarchies created by and for humans rather than for classification performance. As a result, classification based on such hierarchies often yields suboptimal results. In this paper, we propose a novel genetic algorithm-based method on hierarchy adaptation for improved classification. Our approach customizes the typical GA to optimize classification hierarchies. In several text classification tasks, our approach produced hierarchies that significantly improved upon the accuracy of the original hierarchy as well as hierarchies generated by state-of-the-art methods.
Using random walks for multi-label classification BIBAFull-Text 2197-2200
  Chaokun Wang; Wei Zheng; Zhang Liu; Yiyuan Bai; Jianmin Wang
The Multi-Label Classification (MLC) problem has attracted wide attention in recent years since multi-labeled data appears in many applications, such as page categorization, tag recommendation, mining of semantic web data, social network analysis, and so forth. In this paper, we propose a novel MLC solution based on the random walk model, called MLRW. MLRW maps the multi-labeled instances to graphs, on which the random walk is applied. When an unlabeled instance is fed, MLRW transforms the original multi-label problem into a set of single-label subproblems. Experimental results on several real-world data sets demonstrate that MLRW is a better solution to the MLC problems than many other existing multi-label classification methods.
Latent feature encoding using dyadic and relational data BIBAFull-Text 2201-2204
  Shin Ando
Learning from dyadic and relational data is a fundamental problem for IR and KDD applications in the web and social media domains. Basic behaviors and characteristics of users and documents are typically described by a collection of dyads, i.e., pairs of entities. Discriminative features extracted from such data are essential in exploratory and discriminatory analyses. Relational properties of the entities reflect pair-wise similarities and their collective community structure, which are also valuable for discriminative learning. A challenging aspect of learning from the relational data in many domains is that the generative process of relational links appears noisy and is not well described by a stochastic model.
   In this paper, we present a principled approach for learning discriminative features from heterogeneous sources of dyadic and relational data. We propose an information-theoretic framework called Latent Feature Encoding (LFE) which projects the entities and the links to a latent feature space in the analogy of lossy-encoding. Projection is formalized as a maximization of the mutual information preserved in the latent features, regularized by the compression rate of encoding. The regularization is emphasized over more probable links to account for the noisiness of the observation. An empirical evaluation of the proposed method using text and social media datasets is presented. Performances in supervised and unsupervised learning tasks are compared with those of conventional latent feature extraction methods.
Learning kernels with upper bounds of leave-one-out error BIBAFull-Text 2205-2208
  Yong Liu; Shizhong Liao; Yuexian Hou
We propose a new learning method for Multiple Kernel Learning (MKL) based on the upper bounds of the leave-one-out error, which is an almost unbiased estimate of the expected generalization error. Specifically, we first present two new formulations for MKL by minimizing the upper bounds of the leave-one-out error. Then, we compute the derivatives of these bounds and design an efficient iterative algorithm for solving these formulations. Experimental results show that the proposed method gives better accuracy than both SVM with the uniform combination of basis kernels and other state-of-the-art kernel learning approaches.
KLEAP: an efficient cleaning method to remove cross-reads in RFID streams BIBAFull-Text 2209-2212
  Guoqiong Liao; Jing Li; Lei Chen; Changxuan Wan
Recently, RFID technology has been widely used in many kinds of applications. However, because of interference from environmental factors and limitations of radio frequency technology, the data streams collected by RFID readers usually contain a lot of cross-reads. To address this issue, we propose a KerneL dEnsity-bAsed Probability cleaning method (KLEAP) to remove cross-reads within a sliding window. The method estimates the density of each tag using a kernel-based function. The reader corresponding to the micro-cluster with the largest density will be regarded as the position where the tagged object is located in the current window, and the readings derived from other readers will be treated as cross-reads. Experiments verify the effectiveness and efficiency of the proposed method.
A diversity measure leveraging domain specific auxiliary information BIBAFull-Text 2213-2216
  Narayan Bhamidipati; Nagaraj Kota
This article deals with the notion of reduction in uncertainty when the probability mass is distributed over similar values rather than dissimilar values. Shannon's entropy is a frequently used information theoretic measure of the uncertainty associated with random variables, but it depends solely on the set of values the probability mass function assumes, and does not take into consideration whether the mass is distributed among extreme values or not. A similarity structure, possibly obtained through domain knowledge, on the values assumed by the random variable may reduce the associated uncertainty. The greater the similarity, the lower the uncertainty. A novel measure named Similarity Adjusted Entropy (or Sim-adjusted Entropy for short), which generalizes Shannon's entropy, is then proposed to capture the effects of this similarity structure. Sim-adjusted entropy provides a mechanism for incorporating domain expertise into an entropy based framework for solving various data mining tasks. Applications highlighted in this manuscript include clustering of categorical data and measuring audience diversity. Experiments performed on a Yahoo! Answers data set demonstrate the ability of the proposed method to obtain more cohesive clusters. Another set of experiments confirms the utility of the proposed measure for measuring audience diversity.
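The limitation of Shannon's entropy noted in this abstract can be shown with a small sketch: the entropy depends only on the probabilities, so it comes out the same whether the mass sits on similar or dissimilar values. The example values below are hypothetical, and the sketch does not implement Sim-adjusted Entropy, whose definition the abstract does not give.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a probability mass function."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical example: the same mass placed on two similar values
# versus two dissimilar values.
similar = {"red": 0.5, "crimson": 0.5}
dissimilar = {"red": 0.5, "blue": 0.5}
h_similar = shannon_entropy(similar.values())
h_dissimilar = shannon_entropy(dissimilar.values())
```

Both distributions have an entropy of exactly one bit, although a domain expert would likely regard the first as less uncertain. That gap is what a similarity structure over the values is meant to capture.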
Mining query structure from click data: a case study of product queries BIBAFull-Text 2217-2220
  Julia Kiseleva; Eugene Agichtein; Daniel Billsus
Most of the information on the Web is inherently structured, product pages of large online shopping sites such as Amazon.com being a typical example. Yet, unstructured keyword queries are still the most common way to search for such structured information, producing ambiguities and poor ranking, and thereby degrading user experience. This problem can be resolved by query segmentation, that is, transformation of unstructured keyword queries into structured queries. The resulting queries can be used to search product databases more accurately, and improve result presentation and query suggestion. The main contribution of our work is a novel approach to query segmentation based on unsupervised machine learning. Its highlight is that query and click-through logs are used for training. Extensive experiments over a large query and click log from a leading shopping engine demonstrate that our approach significantly outperforms the baseline.
Towards expert finding by leveraging relevant categories in authority ranking BIBAFull-Text 2221-2224
  Hengshu Zhu; Huanhuan Cao; Hui Xiong; Enhong Chen; Jilei Tian
How to improve authority ranking is a crucial research problem for expert finding. In this paper, we propose a novel framework for expert finding based on the authority information in the target category as well as the relevant categories. First, we develop a scalable method for measuring the relevancy between categories through topic models. Then, we provide a link analysis approach for ranking user authority by considering the information in both the target category and the relevant categories. Finally, extensive experiments on two large-scale real-world Q&A data sets clearly show that the proposed method outperforms the baseline methods by a significant margin.
Joint inference for cross-document information extraction BIBAFull-Text 2225-2228
  Qi Li; Sam Anzaroot; Wen-Pin Lin; Xiang Li; Heng Ji
Previous information extraction (IE) systems are typically organized as a pipeline architecture of separate stages which make independent local decisions. When the data grows beyond a certain size, the extracted facts become inter-dependent, and thus we can take advantage of information redundancy to conduct reasoning across documents and improve the performance of IE. We describe a joint inference approach based on information network structure to conduct cross-fact reasoning with an integer linear programming framework. Without using any additional labeled data, this new method obtained a 13.7%-24.4% reduction in user browsing cost over a state-of-the-art IE system which extracts various types of facts independently.
Building a generic debugger for information extraction pipelines BIBAFull-Text 2229-2232
  Anish Das Sarma; Alpa Jain; Philip Bohannon
Complex information extraction (IE) pipelines are becoming an integral component of most text processing frameworks. We introduce a first system to help IE users analyze extraction pipeline semantics and operator transformations interactively while debugging. This allows the effort to be proportional to the need, and to focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for running post-execution analysis of any IE pipeline consisting of arbitrary types of operators. For this, we propose an effective provenance model for IE pipelines which captures a variety of operator types, ranging from those for which full to no specifications are available. We have evaluated our proposed algorithms and provenance model on large-scale real-world extraction pipelines.
Fast supervised feature extraction by term discrimination information pooling BIBAFull-Text 2233-2236
  Amara Tariq; Asim Karim
Dimensionality reduction (DR) through feature extraction (FE) is desirable for efficient and effective processing of text documents. Many of the techniques for text FE produce features that are not readily interpretable and require super-linear computation time. In this paper, we present a fast supervised DR/FE technique, named FEDIP, that is motivated by the notion of relatedness of terms to topics or contexts. This relatedness is quantified by using the discrimination information provided by a term for a topic in a labeled document collection. Features are constructed by pooling the discrimination information of highly related terms for each topic. FEDIP's time complexity is linear in the size of the vocabulary and document collection. FEDIP is evaluated for document classification with SVM and naive Bayes classifiers on six text data sets. The results show that FEDIP produces low-dimension feature spaces that yield higher classification accuracy when compared with LDA and LSI. FEDIP is also found to be significantly faster than the other techniques on our evaluation data sets.
Constructing efficient information extraction pipelines BIBAFull-Text 2237-2240
  Henning Wachsmuth; Benno Stein; Gregor Engels
Information Extraction (IE) pipelines analyze text through several stages. The pipeline's algorithms determine both its effectiveness and its run-time efficiency. In real-world tasks, however, IE pipelines often fail to achieve acceptable run-times because they analyze too much task-irrelevant text. This raises two interesting questions: 1) How much "efficiency potential" depends on the scheduling of a pipeline's algorithms? 2) Is it possible to devise a reliable method to construct efficient IE pipelines? Both questions are addressed in this paper. In particular, we show how to optimize the run-time efficiency of IE pipelines under a given set of algorithms. We evaluate pipelines for three algorithm sets on an industrially relevant task: the extraction of market forecasts from news articles. Using a system-independent measure, we demonstrate that efficiency gains of up to one order of magnitude are possible without compromising a pipeline's original effectiveness.
CoRankBayes: Bayesian learning to rank under the co-training framework and its application in keyphrase extraction BIBAFull-Text 2241-2244
  Chen Wang; Sujian Li
Recently, learning to rank algorithms have become a popular and effective tool for ordering objects (e.g. terms) according to their degrees of importance. The contribution of this paper is that we propose a simple and fast learning to rank model, RankBayes, and embed it in the co-training framework. A detailed proof is given that the Naïve Bayes algorithm can be used to implement a learning to rank model. To solve the problem of two-model inconsistency, an ingenious approach is put forward to rank all the phrases by making use of the labeled results of two RankBayes models. Experimental results show that the proposed approach is promising in solving ranking problems.
Discovering trending phrases on information streams BIBAFull-Text 2245-2248
  Krishna Y. Kamath; James Caverlee
We study the problem of efficient discovery of trending phrases from high-volume text streams -- be they sequences of Twitter messages, email messages, news articles, or other time-stamped text documents. Most existing approaches return top-k trending phrases. However, this approach guarantees neither that the top-k phrases returned are all trending, nor that all trending phrases are returned. In addition, the value of k is difficult to set and is indifferent to stream dynamics. Hence, we propose an approach that identifies all the trending phrases in a stream and adapts to changing stream properties.
Review recommendation: personalized prediction of the quality of online reviews BIBAFull-Text 2249-2252
  Samaneh Moghaddam; Mohsen Jamali; Martin Ester
The problem of identifying high quality and helpful reviews automatically has attracted much attention recently. Current methods assume that the helpfulness of a review is independent of the readers of that review. However, we argue that the quality of a review may not be the same for different users. In this paper, we employ latent factor models to address this problem. We evaluate the proposed models using a real life database from Epinions.com. The experiments demonstrate that the latent factor models outperform the state-of-the-art approaches and confirm that the helpfulness of a review is indeed not the same for all users.
Improving k-nearest neighbors algorithms: practical application of dataset analysis BIBAFull-Text 2253-2256
  Fidel Cacheda; Victor Carneiro; Diego Fernández; Vreixo Formoso
In recent years, recommender systems have achieved great popularity. Many different techniques have been developed and applied to this field. However, in many cases the algorithms do not obtain the expected results. In particular, when the applied model does not fit the real data the results are especially bad. This happens because models are often applied directly to a domain without a prior analysis of the data. In this work we study the most popular datasets in the movie recommendation domain, in order to understand how the users behave in this particular context. We have found some remarkable facts that question the utility of the similarity measures traditionally used in k-Nearest Neighbors (kNN) algorithms. These findings can be useful in order to develop new algorithms. In particular, we modify traditional kNN algorithms by introducing a new similarity measure specially suited for sparse contexts, where users have rated very few items. Our experiments show slight improvements in prediction accuracy, which proves the importance of a thorough dataset analysis as a step prior to any algorithm development.
Structured collaborative filtering BIBAFull-Text 2257-2260
  Alejandro Bellogin; Jun Wang; Pablo Castells
In a general collaborative filtering (CF) setting, a user profile contains a set of previously rated items and is used to represent the user's interest. Unfortunately, most CF approaches ignore the underlying structure of user profiles. In this paper, we argue that a certain class of interest is best represented jointly by several items, drawing an analogy to "phrases" in text retrieval, which are not equivalent to the separate meaning of their words. From an alternative stance, we also consider the situation where, analogously to word synonyms, two items might be substitutable when representing a class of interest. We propose an approach integrating these two notions as opposing poles on a continuous spectrum. Upon this, we model the underlying structure in user profiles, drawing an analogy with text retrieval. The approach gives rise to a novel structured Vector Space Model for CF. We show that item-based CF approaches are a special case of the proposed method.
User oriented tweet ranking: a filtering approach to microblogs BIBAFull-Text 2261-2264
  Ibrahim Uysal; W. Bruce Croft
The increasing volume of streaming data on microblogs has re-introduced the necessity of effective filtering mechanisms for such media. Microblog users are overwhelmed with mostly uninteresting pieces of text when trying to access information of value. In this paper, we propose a personalized tweet ranking method, leveraging retweet behavior, to bring more important tweets forward. In addition, we investigate how to determine the audience of tweets more effectively, by ranking the users based on their likelihood of retweeting the tweets. Finally, conducting a pilot user study, we analyze how retweet likelihood correlates with the interestingness of the tweets.
A semi-supervised hybrid system to enhance the recommendation of channels in terms of campaign ROI BIBAFull-Text 2265-2268
  Julie Séguéla; Gilbert Saporta
In domains such as Marketing, Advertising or even Human Resources (sourcing), decision-makers have to choose the most suitable channels according to their objectives when starting a campaign. In this paper, three recommender systems providing channel ("user") ranking for a given campaign ("item") are introduced. This work refers exclusively to the new item problem, which is still a challenging topic in the literature. The first two systems are standard content-based recommendation approaches, with different rating estimation techniques (model-based vs heuristic-based). To overcome the shortcomings of previous approaches, we introduce a new hybrid system using a supervised similarity based on PLS components. Algorithms are compared in a case study: the purpose is to predict the ranking of job boards (job search web sites) in terms of ROI (return on investment) per job posting. In this application, the semi-supervised hybrid system outperforms standard approaches.
YANA: an efficient privacy-preserving recommender system for online social communities BIBAFull-Text 2269-2272
  Dongsheng Li; Qin Lv; Li Shang; Ning Gu
In online social communities, many recommender systems use collaborative filtering, a method that makes recommendations based on what is liked by other users with similar interests. Serious privacy issues may arise in this process, as sensitive personal information (e.g., content interests) may be collected and disclosed to other parties, especially the recommender server. In this paper, we propose YANA (short for "you are not alone"), an efficient group-based privacy-preserving collaborative filtering system for content recommendation in online social communities. We have developed a prototype system on desktop and mobile devices, and evaluated it using real world data. The results demonstrate that YANA can effectively protect users' privacy, while achieving high recommendation quality and energy efficiency.
More influence means less work: fast latent Dirichlet allocation by influence scheduling BIBAFull-Text 2273-2276
  Mirwaes Wahabzada; Kristian Kersting; Anja Pilz; Christian Bauckhage
There have recently been considerable advances in fast inference for (online) latent Dirichlet allocation (LDA). While it is widely recognized that the scheduling of documents in stochastic optimization and in turn in LDA may have significant consequences, this issue remains largely unexplored. Instead, practitioners schedule documents essentially uniformly at random, due perhaps to ease of implementation, and to the lack of clear guidelines on scheduling the documents.
   In this work, we address this issue and propose to schedule documents that exert a disproportionately large influence on the topics of the corpus for an update before less influential ones. More precisely, we justify sampling documents at random, biased towards those with higher norms, to form mini-batches. On several real-world datasets, including 3M articles from Wikipedia and 8M from PubMed, we demonstrate that the resulting influence scheduled LDA can handily analyze massive document collections and find topic models as good as or better than those found with online LDA, often in a fraction of the time.
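As a rough illustration of norm-biased scheduling, the sketch below draws a mini-batch of document indices with probability proportional to each document's norm. The abstract does not say which norm is used or how the sampling is carried out, so the Euclidean norm of a word-count vector, sampling with replacement, and the names `l2_norm` and `sample_minibatch` are all assumptions.

```python
import math
import random

def l2_norm(vec):
    """Euclidean norm of a word-count vector (assumed choice of norm)."""
    return math.sqrt(sum(x * x for x in vec))

def sample_minibatch(docs, batch_size, rng):
    """Draw document indices with replacement, with probability
    proportional to each document's norm."""
    weights = [l2_norm(d) for d in docs]
    return rng.choices(range(len(docs)), weights=weights, k=batch_size)

# Toy corpus: four documents as word-count vectors with norms 1, 2, 5 and 50.
docs = [[1, 0, 0], [0, 2, 0], [3, 4, 0], [30, 40, 0]]
rng = random.Random(0)
batch = sample_minibatch(docs, batch_size=1000, rng=rng)
counts = [batch.count(i) for i in range(len(docs))]
```

In this toy corpus the last document carries most of the total norm, so it dominates the mini-batch, whereas uniform scheduling would pick each document about equally often.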
Utility-driven anonymization in data publishing BIBAFull-Text 2277-2280
  Mingqiang Xue; Panagiotis Karras; Chedy Raïssi; Hung Keng Pung
Privacy-preserving data publication has been studied intensely in recent years. Still, all existing approaches transform data values by random perturbation or generalization. In this paper, we introduce a radically different data anonymization methodology. Our proposal aims to maintain a certain number of patterns, defined in terms of a set of properties of interest that hold for the original data. Such properties are represented as linear relationships among data points. We present an algorithm that generates a set of anonymized data that strictly preserves these properties, thus maintaining specified patterns in the data. Extensive experiments with real and synthetic data show that our algorithm is efficient, and produces anonymized data that affords high utility in several data analysis tasks while safeguarding privacy.
Privacy preserving feature selection for distributed data using virtual dimension BIBAFull-Text 2281-2284
  Madhushri Banerjee; Sumit Chakravarty
Data Mining often suffers from the curse of dimensionality. Huge numbers of dimensions or attributes in the data pose serious problems for data mining tasks. Traditionally, data dimensionality reduction techniques like Principal Component Analysis have been used to address this problem. However, the need might be to remain in the original attribute space and identify the key predictive attributes instead of moving to a transformed space. As a result, feature subset selection has become an important area of research over the last few years.
   With the advent of network technologies, data is sometimes distributed across multiple locations and often among multiple parties. The biggest concern while sharing data is data privacy. In this paper, a secure distributed protocol is proposed that allows feature selection for multiple parties without revealing their own data. The proposed distributed feature selection method has evolved from a method called virtual dimension reduction, used in the field of hyperspectral image processing for selecting a subset of hyperspectral bands for further analysis. The experimental results with real life datasets presented in this paper demonstrate the effectiveness of the proposed method.
Switch detector: an activity spotting system for desktop BIBAFull-Text 2285-2288
  Hamid Turab Mirza; Ling Chen; Gencai Chen; Ibrar Hussain; Xufeng He
An average white-collar worker deals with an enormous amount of digital information on a daily basis. Recently, there has been a growing interest in supporting their work. However, in order to be really supportive, there is a need to know the current activity of the user at all times. In this paper we present a new technique that takes advantage of temporal aspects of user activity behavior to infer when it is most likely that an activity switch is occurring. We then describe "Activity Switch Detector", an interactive switch notification system embodying these ideas, and present the results of an extensive user study with ten participants that tests the validity of the approach.
LSH based outlier detection and its application in distributed setting BIBAFull-Text 2289-2292
  Madhuchand Rushi Pillutla; Nisarg Raval; Piyush Bansal; Kannan Srinathan; C. V. Jawahar
In this paper, we give an approximate algorithm for distance based outlier detection using the Locality Sensitive Hashing (LSH) technique. We propose an algorithm for the centralized case wherein the entire dataset is locally available for processing. However, in the case of very large datasets collected from various input sources, the data is often distributed across the network. Accordingly, we show that our algorithm can be effectively extended to a constant round protocol with low communication costs, in a distributed setting with horizontal partitioning.
Authormagic: an approach to author disambiguation in large-scale digital libraries BIBAFull-Text 2293-2296
  Henning Weiler; Klaus Meyer-Wegener; Salvatore Mele
A collaboration of leading research centers in the field of High Energy Physics (HEP) has built INSPIRE, a novel information infrastructure, which comprises the entire corpus of about one million documents produced within the discipline, including a rich set of metadata, citation information and half a million full-text documents, and offers a unique opportunity for author disambiguation strategies. The presented approach features extended metadata comparison metrics and a three-step unsupervised graph clustering technique. The algorithm aided in identifying 200'000 individuals from 6'500'000 author signatures. Preliminary tests based on knowledge of external experts and a pilot of a crowd-sourcing system show a success rate of more than 96% within the selected test cases. The obtained author clusters serve as a recommendation for INSPIRE users to further clean the publication list in a crowd-sourced approach.
DIGRank: using global degree to facilitate ranking in an incomplete graph BIBAFull-Text 2297-2300
  Xiang Niu; Lusong Li; Ke Xu
PageRank has been broadly applied to get credible rank sequences of nodes in many networks such as the web, citation networks, or online social networks. However, in the real world, it is usually hard to ascertain the complete structure of a network, particularly a large-scale one. Some researchers have begun to explore how to get a relatively accurate rank more efficiently. They have proposed some local approximation methods, which are especially designed for quickly estimating the PageRank value of a new node just after it is added to the network. Yet, these local approximation methods rely too much on the link server, and it is difficult to use them to estimate rank sequences of nodes in a group. So we propose a new method called DIGRank, which uses global Degree to facilitate Ranking in an Incomplete Graph and which takes into account the frequent need for applications to rank users in a community, retrieve pages in a particular area, or mine nodes in a fractional or limited network. Based on experiments in small-world and scale-free networks generated by models, the DIGRank method performs better than other local estimation methods on ranking nodes in a given subgraph. In the models, it tends to perform best in graphs that have low average shortest path length, high average degree, or weak community structure. Besides, compared with a local PageRank and an advanced local approximation method, it significantly reduces the computational cost and error rate.
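For context, the baseline that these local methods approximate is global PageRank on a fully known graph. The sketch below is standard power-iteration PageRank, not the DIGRank method. The function name, the toy graph, the damping value of 0.85, and the uniform handling of dangling nodes are illustrative assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a fully known directed graph.

    out_links maps each node to the list of nodes it links to; every
    target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its damped rank uniformly.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

The computation needs every node's complete out-link list, which is exactly the information that is hard to obtain when only part of a large network is visible.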
On selection of objective functions in multi-objective community detection BIBAFull-Text 2301-2304
  Chuan Shi; Philip S. Yu; Yanan Cai; Zhenyu Yan; Bin Wu
There has been a surge of interest in community detection in complex networks in recent years. Unlike conventional single-objective community detection, this paper formulates community detection as a multi-objective optimization problem and proposes a general algorithm, NSGA-Net, based on evolutionary multi-objective optimization. Interested in the effect of optimization objectives on the performance of multi-objective community detection, we further study the correlations (i.e., positively correlated, independent, or negatively correlated) of 11 objective functions that have been used or can potentially be used for community detection. Our experiments show that NSGA-Net optimizing over a pair of negatively correlated objectives usually performs better than the single-objective algorithm optimizing over either of the original objectives, and even better than other well-established community detection approaches.
Suggesting ghost edges for a smaller world BIBAFull-Text 2305-2308
  Manos Papagelis; Francesco Bonchi; Aristides Gionis
Small changes in the network topology can have dramatic effects on its capacity to disseminate information. In this paper, we consider the problem of adding a small number of ghost edges in the network in order to minimize the average shortest-path distance between nodes, towards a smaller-world network. We formalize the problem of suggesting ghost edges and we propose a novel method for quickly evaluating the importance of ghost edges in sparse graphs. Through experiments on real and synthetic data sets, we demonstrate that our approach performs very well, for a varying range of conditions, and it outperforms sensible baselines.
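The objective described here, the average shortest-path distance after adding an edge, can be made concrete with a brute-force sketch. It tries every missing edge and recomputes all distances by breadth-first search, which is feasible only on tiny graphs and is not the fast evaluation method the paper proposes. All function names and the toy path graph are illustrative assumptions.

```python
from collections import deque
from itertools import combinations

def average_shortest_path(adj):
    """Mean BFS distance over all ordered node pairs of a connected,
    undirected graph given as {node: set_of_neighbours}."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def with_edge(adj, u, v):
    """Return a copy of the graph with the undirected edge (u, v) added."""
    new_adj = {node: set(neigh) for node, neigh in adj.items()}
    new_adj[u].add(v)
    new_adj[v].add(u)
    return new_adj

def best_ghost_edge(adj):
    """Exhaustively pick the missing edge that minimizes the average distance."""
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
                  if v not in adj[u]]
    return min(candidates,
               key=lambda e: average_shortest_path(with_edge(adj, *e)))

# Path graph 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
before = average_shortest_path(path)
edge = best_ghost_edge(path)
after = average_shortest_path(with_edge(path, *edge))
```

On the five-node path, joining the two endpoints turns the path into a cycle and lowers the average distance from 2.0 to 1.5. The exhaustive search costs one all-pairs computation per candidate edge, which is why a quick way to evaluate candidates matters on large sparse graphs.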
Examining the "leftness" property of Wikipedia categories BIBAFull-Text 2309-2312
  Karl Gyllstrom; Marie-Francine Moens
Wikipedia's rich category structure has helped make it one of the largest semantic taxonomies in existence, a property that has been central to much recent research. However, Wikipedia's category representation is simplistic: an article contains a single list of categories, with no data about their relative importance. We investigate the ordering of category lists to determine how a category's position in the list correlates with its relevance to the article and overall significance. We identify a number of interesting connections between a category's position and its persistence within the article, age, popularity, size, and descriptiveness.
Detection of text quality flaws as a one-class classification problem BIBAFull-Text 2313-2316
  Maik Anderka; Benno Stein; Nedim Lipka
For Web applications that are based on user generated content, the detection of text quality flaws is a key concern. Our research contributes to automatic quality flaw detection. In particular, we propose to cast the detection of text quality flaws as a one-class classification problem: we are given only positive examples (= texts containing a particular quality flaw) and decide whether or not an unseen text suffers from this flaw. We argue that common binary or multiclass classification approaches are ineffective here, and we underpin our approach with a real-world application: we employ a dedicated one-class learning approach to determine whether a given Wikipedia article suffers from certain quality flaws. Since in the Wikipedia setting the acquisition of sensible test data is quite intricate, we analyze the effects of a biased sample selection. In addition, we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. Altogether, given test data with little noise, four out of ten important quality flaws in Wikipedia can be detected with a precision close to 1.
Two birds with one stone: learning semantic models for text categorization and word sense disambiguation BIBAFull-Text 2317-2320
  Roberto Navigli; Stefano Faralli; Aitor Soroa; Oier de Lacalle; Eneko Agirre
In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks.
More or better: on trade-offs in compacting textual problem solution repositories BIBAFull-Text 2321-2324
  A Deepak P.; Sutanu Chakraborti; Deepak Khemani
In this paper, we look into the problem of filtering problem solution repositories (from sources such as community-driven question answering systems) to render them more suitable for use in knowledge reuse systems. We explore harnessing the fuzzy nature of the usability of a solution to a problem for such compaction. Fuzzy usabilities lead to several challenges, notably the trade-off between choosing generic or better solutions. We develop an approach that can heed a user specification of the trade-off between these criteria and introduce several quality measures based on fuzzy usability estimates to ascertain the quality of a problem-solution repository for use in a Case Based Reasoning system. We establish, through a detailed empirical analysis, that our approach outperforms state-of-the-art approaches on virtually all quality measures.
Mining frequent patterns across multiple data streams BIBAFull-Text 2325-2328
  Jing Guo; Peng Zhang; Jianlong Tan; Li Guo
Mining frequent patterns from data streams has drawn increasing attention in recent years. However, previous mining algorithms were all focused on a single data stream. In many emerging applications, it is of critical importance to combine multiple data streams for analysis. For example, in real-time news topic analysis, it is necessary to combine multiple news report streams from different media sources to discover collaborative frequent patterns, which are reported frequently in all media, and comparative frequent patterns, which are reported more frequently in one medium than in others. To address this problem, we propose a novel frequent pattern mining algorithm, Hybrid-Streaming, or H-Stream for short. H-Stream builds a new Hybrid-Frequent tree to maintain historical frequent and potential frequent itemsets from all data streams, and incrementally updates these itemsets for efficient collaborative and comparative pattern mining. Theoretical and empirical studies demonstrate the utility of the proposed method.
SILA: a spatial instance learning approach for deep webpages BIBAFull-Text 2329-2332
  Ermelinda Oro; Massimo Ruffolo
Deep Web pages convey very relevant information for different application domains such as e-government, e-commerce, and social networking. For this reason there is a constant high interest in efficiently, effectively and automatically extracting data from Deep Web data sources. In this paper we present SILA, a novel Spatial Instance Learning Approach that allows for extracting data records from Deep Web pages by exploiting both the spatial arrangement and the presentation features of data items/fields produced by layout engines of Web browsers in visualizing Deep Web pages on the screen. SILA is independent of the internal HTML encodings of Web pages, and allows for recognizing data records in pages having multiple data regions in which data items are arranged by many different presentation layouts. Experimental results show that SILA has very high precision and recall and that it works much better than the MDR and ViNTs approaches.
A geographic study of tie strength in social media BIBAFull-Text 2333-2336
  Jeffrey McGee; James A. Caverlee; Zhiyuan Cheng
In this paper, we investigate the interplay of distance and tie strength through an examination of 20 million geo-encoded tweets collected from Twitter and 6 million user profiles. Concretely, we investigate the relationship between the strength of the tie between a pair of users, and the distance between the pair. We identify several factors -- including following, mentioning, and actively engaging in conversations with another user -- that can strongly reveal the distance between a pair of users. We find a bimodal distribution in Twitter, with one peak around 10 miles from people who live nearby, and another peak around 2500 miles, further validating Twitter's use as both a social network (with geographically nearby friends) and as a news distribution network (with very distant relationships).
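Relating tie strength to distance requires a distance between two geo-encoded points. The abstract does not say how distances were computed, so the following is only a sketch using the standard haversine great-circle formula. The function name, the mean Earth radius of 3958.8 miles, and the example coordinates are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an assumed constant

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Approximate coordinates of New York City and Los Angeles.
nyc_to_la = haversine_miles(40.7128, -74.0060, 34.0522, -118.2437)
# A quarter of the equator, used as a sanity check.
quarter_equator = haversine_miles(0.0, 0.0, 0.0, 90.0)
```

A coast-to-coast pair such as New York and Los Angeles comes out a little under 2500 miles, the same order as the far peak reported in the abstract.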
Named entity recognition using a modified Pegasos algorithm BIBAFull-Text 2337-2340
  Changki Lee; Pum-Mo Ryu; HyunKi Kim
In this paper, we describe named entity recognition using a modified Pegasos algorithm for structural SVMs. We show that the modified Pegasos algorithm significantly outperforms CRFs and that its training time is reduced by a factor of 17-26 compared to CRFs.
WikiLabel: an encyclopedic approach to labeling documents en masse BIBAFull-Text 2341-2344
  Tadashi Nomoto
This paper presents a particular approach to collective labeling of multiple documents, which works by associating the documents with Wikipedia pages and labeling them with headings the pages carry. The approach has an obvious advantage over past approaches in that it is able to produce fluent labels, as they are hand-written by human editors. We carried out some experiments on the TDT5 dataset, which found that the approach works rather robustly for an arbitrary set of documents in the news domain. Comparisons were made with some baselines, including the state of the art, with results strongly in favor of our approach.
Towards noise-resilient document modeling BIBAFull-Text 2345-2348
  Tao Yang; Dongwon Lee
We introduce a generative probabilistic document model based on latent Dirichlet allocation (LDA) to deal with textual errors in the document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed TE-LDA, is developed from the traditional LDA by adding a switch variable into the term generation process in order to tackle the issue of noisy text data. Through extensive experiments, the efficacy of our proposed model is validated using both real and synthetic data sets.
Probabilistic model for discovering topic based communities in social networks BIBAFull-Text 2349-2352
  Mrinmaya Sachan; Danish Contractor; Tanveer Faruquie; Venkata Subramaniam
Social graphs have received renewed interest as a research topic with the advent of social networking websites. These online networks provide a rich source of data to study user relationships and interaction patterns on a large scale. In this paper, we propose a generative Bayesian model for extracting latent communities from a social graph. We assume that community memberships depend on topics of interest between users and the link relationships between them in the social graph topology. In addition, we make use of the nature of interaction to gauge user interests. Our model allows communities to be related to multiple topics and each user in the graph can be a member of multiple communities. This gives an insight into user interests and topical distribution in communities. We show the effectiveness of our model using a real world data set and also compare our model with existing community discovery methods.

Poster session: databases

Scalable entity matching computation with materialization BIBAFull-Text 2353-2356
  Sanghoon Lee; Jongwuk Lee; Seung-won Hwang
Entity matching (EM) is the task of identifying records that refer to the same real-world entity from different data sources. While EM is widely used in data integration and data cleaning applications, the naive method for EM incurs quadratic cost with respect to the size of the datasets. To address this problem, this paper proposes a scalable EM algorithm that employs a pre-materialized structure. Specifically, once the structure is built, our proposed algorithm can identify the EM results with sub-linear cost. In addition, as the rules evolve, our algorithm can efficiently adapt to new rules by selectively accessing records using the materialized structure. Our evaluation results show that our proposed EM algorithm is significantly faster than the state-of-the-art method for extensive real-life datasets.
Predicting the optimal ad-hoc index for reachability queries on graph databases BIBAFull-Text 2357-2360
  Jintian Deng; Fei Liu; Yun Peng; Byron Choi; Jianliang Xu
Due to the recent advances in graph databases, a large number of ad-hoc indexes for a fundamental query type, the reachability query, have been proposed. The performance of these indexes on different graphs is known to vary widely. Worse still, deriving an accurate cost model for selecting the optimal index of a graph database appears to be a daunting task. In this paper, we propose a hierarchical prediction framework, based on neural networks, a set of graph features, and a knowledge base of past predictions, to determine the optimal index for a graph database. For ease of presentation, we present our framework with three structurally distinguishable indexes. Our experiments show that our framework is accurate.
Collection-based compression using discovered long matching strings BIBAFull-Text 2361-2364
  Andrew Peel; Anthony Wirth; Justin Zobel
Many collections of data contain items that are inherently similar. For example, archives contain files with incremental changes between releases. Long-range inter-file similarities are not exploited by standard approaches to compression. We investigate compression using similarity from all parts of a collection, collection-based compression (CBC). Input files are delta-encoded by reference to long string matches in a source collection. The expected space requirement of our encoding algorithm is sublinear in the collection size, and the compression time complexity is linear in the input file size. We show that our scheme achieves better compression for large input files than existing differential compression systems, and scales better. Also, we achieve significant compression improvement compared to compressing each file individually using standard utilities: our scheme achieves several times the compression of gzip or 7-zip. The overall result is a dramatic improvement on compression available with existing approaches.
A robust index for regular expression queries BIBAFull-Text 2365-2368
  Dominic Tsang; Sanjay Chawla
The LIKE regular expression predicate has been part of the SQL standard since at least 1989. However, despite its popularity and wide usage, database vendors provide only limited indexing support for regular expression queries, which almost always require a full table scan.
   In this paper we propose a rigorous and robust approach for providing indexing support for regular expression queries. Our approach consists of formulating the indexing problem as a combinatorial optimization problem. We begin with a database, abstracted as a collection of strings. From this data set we generate a query workload. The input to the optimization problem is the database and the workload. The output is a set of multigrams (substrings) which can be used as keys to records which satisfy the query workload. The multigrams can then be integrated with the data structure (like B+ trees) to provide indexing support for the queries. We provide a deterministic and a randomized approximation algorithm (with provable guarantees) to solve the optimization problem. Extensive experiments on synthetic data sets demonstrate that our approach is accurate and efficient.
   We also present a case study on PROSITE patterns -- which are complex regular expression signatures for classes of proteins. Again, we are able to demonstrate the utility of our indexing approach in terms of accuracy and efficiency. Thus, perhaps for the first time, there is a robust and practical indexing mechanism for an important class of database queries.
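A minimal sketch of how a multigram index can avoid a full scan for such queries. It assumes the set of multigrams is already chosen (the paper selects them by optimization, which is not reproduced here) and that the caller supplies the substrings every match must contain. The names `build_gram_index`, `regex_search`, and `required_grams` are hypothetical and not taken from the paper.

```python
import re


def build_gram_index(records, grams):
    """Map each selected multigram to the ids of the records that contain it."""
    index = {g: set() for g in grams}
    for rid, text in enumerate(records):
        for g in grams:
            if g in text:
                index[g].add(rid)
    return index


def regex_search(records, index, pattern, required_grams):
    """Filter candidates through the index, then verify each with the regex.

    required_grams: substrings that any match of `pattern` must contain
    (supplied by the caller in this sketch; deriving them from the
    pattern is out of scope here).
    """
    candidates = set(range(len(records)))
    for g in required_grams:
        if g in index:  # grams that were not indexed cannot prune anything
            candidates &= index[g]
    rx = re.compile(pattern)
    # The regex is still applied, so the index only narrows the scan.
    return sorted(rid for rid in candidates if rx.search(records[rid]))
```

Only the records that survive the posting-list intersection are scanned. With no usable multigram the sketch falls back to checking every record, which corresponds to the full table scan the abstract mentions.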
Integrating and querying web databases and documents BIBAFull-Text 2369-2372
  Carlos Garcia-Alvarado; Carlos Ordonez
There exist many interrelated information sources on the Internet that can be categorized into structured (database) and semistructured (documents). A key challenge is to integrate, query and analyze such heterogeneous collections of information. In this paper, we defend the idea of building web metadata repositories using relational databases as the main source and central data management technology of structured data, enriched by the semistructured data surrounding it. Our proposal rests on the assumption that heterogeneous relational databases can be integrated (i.e. entity resolution is assumed to work well) and thus can serve as references for external data. That is, we tackle the problem of integrating information in the deep web, departing from databases. We discuss a prototype system that can integrate and query metadata and related documents, based on relational database technology. Metadata includes database ER model elements like database name, table, and column (entity, attribute). Web document data include files, documents and web pages. Links between metadata and external documents are built with SQL queries. Once databases and documents are linked, they are managed and queried with SQL. We discuss an interesting scientific application of our solution with a water pollution database.
Processing the signature quadratic form distance on many-core GPU architectures BIBAFull-Text 2373-2376
  Martin Kruliš; Jakub Lokoc; Christian Beecks; Tomáš Skopal; Thomas Seidl
The Signature Quadratic Form Distance on feature signatures represents a flexible distance-based similarity model for effective content-based multimedia retrieval. Although metric indexing approaches are able to speed up query processing by two orders of magnitude, their applicability to large-scale multimedia databases containing billions of images is still a challenging issue. In this paper, we propose the utilization of GPUs for efficient query processing with the Signature Quadratic Form Distance. We show how to process multiple distance computations in parallel and demonstrate efficient query processing by comparing many-core GPU with multi-core CPU implementations.
Top-k most influential locations selection BIBAFull-Text 2377-2380
  Jin Huang; Zeyi Wen; Jianzhong Qi; Rui Zhang; Jian Chen; Zhen He
We propose and study a new type of facility location selection query, the top-k most influential location selection query. Given a set M of customers and a set F of existing facilities, this query finds k locations from a set C of candidate locations with the largest influence values, where the influence of a candidate location c (c in C) is defined as the number of customers in M who are the reverse nearest neighbors of c. We first present a naive algorithm to process the query. However, the algorithm is computationally expensive and not scalable to large datasets. This motivates us to explore more efficient solutions. We propose two branch and bound algorithms, the Estimation Expanding Pruning (EEP) algorithm and the Bounding Influence Pruning (BIP) algorithm. These algorithms exploit various geometric properties to prune the search space, and thus achieve much better performance than that of the naive algorithm. Specifically, the EEP algorithm estimates the distances to the nearest existing facilities for the customers and the numbers of influenced customers for the candidate locations, and then gradually refines the estimation until the answer set is found, during which distance metric based pruning techniques are used to improve the refinement efficiency. BIP only estimates the numbers of influenced customers for the candidate locations. But it uses the existing facilities to limit the space for searching the influenced customers and achieve a better estimation, which results in an even more efficient algorithm. Extensive experiments conducted on both real and synthetic datasets validate the efficiency of the algorithms.
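A minimal sketch of the naive baseline described in this abstract, not of EEP or BIP. It assumes Euclidean distance on coordinate tuples, a non-empty facility set, and that a customer counts as a reverse nearest neighbor of a candidate when the candidate is strictly closer than the customer's nearest existing facility. The function name `top_k_influential` is hypothetical.

```python
import heapq
import math


def top_k_influential(customers, facilities, candidates, k):
    """Score every candidate against every customer and keep the k best."""
    # Distance from each customer to its nearest existing facility.
    nearest = [min(math.dist(m, f) for f in facilities) for m in customers]
    scored = []
    for c in candidates:
        # Influence: customers for whom c would be closer than any facility.
        influence = sum(
            1 for m, d in zip(customers, nearest) if math.dist(m, c) < d
        )
        scored.append((influence, c))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

The cost is proportional to |M| * (|F| + |C|) distance computations, which is why the abstract calls this approach computationally expensive on large datasets.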
Defining isochrones in multimodal spatial networks BIBAFull-Text 2381-2384
  Johann Gamper; Michael Böhlen; Willi Cometti; Markus Innerebner
An isochrone in a spatial network is the minimal, possibly disconnected subgraph that covers all locations from where a query point is reachable within a given time span and by a given arrival time. In this paper we formally define isochrones for multimodal spatial networks with different transportation modes that can be discrete or continuous in, respectively, space and time. For the computation of isochrones we propose the multimodal incremental network expansion (MINE) algorithm, which is independent of the actual network size and depends only on the size of the isochrone. An empirical study using real-world data confirms the analytical results.
On the elasticity of NoSQL databases over cloud management platforms BIBAFull-Text 2385-2388
  Ioannis Konstantinou; Evangelos Angelou; Christina Boumpouka; Dimitrios Tsoumakos; Nectarios Koziris
NoSQL databases focus on analytical processing of large scale datasets, offering increased scalability over commodity hardware. One of their strongest features is elasticity, which allows for fairly portioned premiums and high-quality performance and directly applies to the philosophy of a cloud-based platform. Yet, the process of adaptive expansion and contraction of resources usually involves a lot of manual effort during cluster configuration. To date, there exists no comparative study to quantify this cost and measure the efficacy of NoSQL engines that offer this feature over a cloud provider. In this work, we present a cloud-enabled framework for adaptive monitoring of NoSQL systems. We perform a study of the elasticity feature on some of the most popular NoSQL databases over an open-source cloud platform. Based on these measurements, we finally present a prototype implementation of a decision making system that enables automatic elastic operations of any NoSQL engine based on administrator or application-specified constraints.
Continuous data stream query in the cloud BIBAFull-Text 2389-2392
  Jun Li; Peng Zhang; Jianlong Tan; Ping Liu; Li Guo
Cloud computing represents one of the most important research directions for modern computing systems. Existing research efforts on Cloud computing have all focused on designing advanced storage and query techniques for static data. None of them considers the problem that data in a Cloud may appear as continuous and rapid data streams. To address this problem, in this paper we propose a new LCN-Index framework to handle continuous data stream queries in the Cloud. LCN-Index uses the Map-Reduce computing paradigm to process all the queries. In the mapping stage, it divides all the queries into a batch of predicate sets, which are then deployed onto mapping nodes using an interval predicate index. In the reducing stage, it merges results from the mapping nodes using a multi-attribute hash index. In so doing, a data stream can be efficiently evaluated by traversing the LCN-Index framework. Experiments demonstrate the utility of the proposed method.
A cluster based mobile peer to peer architecture in wireless ad hoc networks BIBAFull-Text 2393-2396
  He Li; KyoungSoo Bok; JaeSoo Yoo
With the rapid development of wireless communication technologies and mobile devices, the mobile peer to peer (MP2P) network has emerged. Since the existing MP2P architectures have high management cost, in this paper, we propose a hierarchical MP2P architecture that clusters mobile peers. The proposed method clusters the mobile peers by considering three aspects: the maximum connection time, the minimum hop count and the number of connected peers. The connection times between the connected peers can be determined by the location, velocity vector and communication range of the mobile peers. Since the maximum connection time of the connected peers is considered, the network topology is relatively stable. Therefore, the management cost of the network is decreased and the success rate of content search is increased. Experiments have shown that our proposed method outperforms the existing schemes.
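The abstract states that connection time follows from location, velocity vector and communication range but does not give a formula. One common way to compute it, shown here as a sketch under stated assumptions: both peers move in a plane at constant velocity and share the same communication range. The function name `connection_time` is hypothetical and the paper's own definition may differ.

```python
import math


def connection_time(p1, v1, p2, v2, comm_range):
    """Time until two constant-velocity peers drift out of range.

    Solves |dp + dv * t| = comm_range for the largest t >= 0, where dp and
    dv are the relative position and relative velocity of the two peers.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dvx, dvy = v2[0] - v1[0], v2[1] - v1[1]
    a = dvx * dvx + dvy * dvy
    b = 2 * (dx * dvx + dy * dvy)
    c = dx * dx + dy * dy - comm_range ** 2
    if c > 0:
        return 0.0  # already out of range
    if a == 0:
        return math.inf  # identical velocities: the link never breaks
    # c <= 0 guarantees a non-negative discriminant and a root t >= 0.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
```

A clustering step could then prefer as cluster head the peer whose links to its neighbors have the longest connection times, which matches the stability argument in the abstract.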
Block-based load balancing for entity resolution with MapReduce BIBAFull-Text 2397-2400
  Lars Kolb; Andreas Thor; Erhard Rahm
The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution with blocking, we propose BlockSplit, a load balancing approach that supports blocking techniques to reduce the search space of entity resolution. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed approach.
PCMLogging: reducing transaction logging overhead with PCM BIBAFull-Text 2401-2404
  Shen Gao; Jianliang Xu; Bingsheng He; Byron Choi; Haibo Hu
Phase Changing Memory (PCM), as one of the most promising next-generation memory technologies, offers various attractive properties such as non-volatility, bit-alterability, and low idle energy consumption. In this paper, we present PCMLogging, a novel logging scheme that exploits PCM devices for both data buffering and transaction logging in disk-based databases. Different from the traditional approach where buffered updates and transaction logs are completely separated, they are integrated in the new logging scheme. Our preliminary experiments show an up to 40% improvement of PCMLogging in disk I/O performance in comparison with a basic buffering and logging scheme.
A continuous query evaluation scheme for a detection-only query over data streams BIBAFull-Text 2405-2408
  Hong Kyu Park; Won Suk Lee
In a data stream environment, a multi-way join continuous query is employed to monitor a considerable number of source data streams from various remote sites in real-time. One key role of a continuous query is detecting only the invocation of a particular event corresponding to the specifications of the query. The evaluation of such a detection-only query does not require producing either an intermediate tuple or a final result tuple, which not only shortens the processing time of a query but also reduces the usage of memory space. However, there has been no special effort to deal with a query of this type. This paper proposes a new evaluation framework which efficiently processes a multi-way detection-only query without generating any intermediate result tuple explicitly.
Subject-oriented top-k hot region queries in spatial dataset BIBAFull-Text 2409-2412
  Junling Liu; Ge Yu; Huanliang Sun
This paper proposes and solves a novel type of spatial query named the Subject-oriented Top-k hot Region (STR) query. Given a subject S defined by a feature set R and feature importance denoted by weights, an STR query retrieves k non-overlapping regions that have the highest scores computed from the number of feature objects and their weights. As an example, the culture subject is defined by exhibition halls, libraries and museums. For this subject, an STR query finds cultural centers where such feature objects are densely distributed. In this paper, we propose two efficient algorithms, the single-partition (SP) algorithm and the dual-partition (DP) algorithm, to process STR queries. Extensive experiments evaluate the proposed solutions under a wide range of parameter settings.
k-Nearest neighbor query processing method based on distance relation pattern BIBAFull-Text 2413-2416
  Yonghun Park; Dongmin Seo; Kyoungsoo Bok; Jaesoo Yoo
The k-nearest neighbor (k-NN) query is one of the most important query types for location based services (LBS). Various methods have been proposed to efficiently process the k-NN query. However, most of the existing methods suffer from high computation time and large memory requirements because they unnecessarily access cells to find the nearest cells on a grid index. In this paper, we propose a new efficient method, called Pattern Based k-NN (PB-kNN), to process the k-NN query. The proposed method uses the patterns of the distance relationships among the cells in a grid index. The basic idea is to normalize the distance relationships as certain patterns. Using this approach, PB-kNN significantly improves the overall performance of the query processing. It is shown through various experiments that our proposed method outperforms the existing methods in terms of query processing time and storage overhead.
Efficient query rewrite for structured web queries BIBAFull-Text 2417-2420
  Sreenivas Gollapudi; Samuel Ieong; Alexandros Ntoulas; Stelios Paparizos
Web search engines incorporate results from structured data sources to answer semantically rich user queries, e.g., the query "Samsung 50 inch led tv" can be answered from a table of television data. However, users are not domain experts and quite often enter values that do not match precisely the underlying data, so a literal execution will return zero results. A search engine would prefer to return at least a minimum number of results as close to the original query as possible while providing a time-bound execution guarantee. In this paper, we formalize these requirements, show the problem is NP-Hard and present approximation algorithms that produce rewrites that work in practice. We empirically validate our algorithms on large-scale data from a major search engine.
Rule-based construction of matching processes BIBAFull-Text 2421-2424
  Eric Peukert; Julian Eberius; Erhard Rahm
Semi-automatic schema matching systems have been developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the given mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate match results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. The evaluation shows that our system is able to robustly return good quality mappings across different mapping problems and domains.
A taxonomy of local search: semi-supervised query classification driven by information needs BIBAFull-Text 2425-2428
  Jiang Bian; Yi Chang
Local search services (e.g. Yelp, Yahoo! Local) have emerged as a popular and effective paradigm for a wide range of information needs for local businesses; they now provide a viable and even more effective alternative to general purpose web search for queries on local businesses. However, due to the diversity of information needs behind local search, it is necessary to use different information retrieval strategies for different query types in local search. In this paper, we explore a taxonomy of local search driven by users' information needs, which categorizes local search queries into three types: business category, chain business, and non-chain business. To decide which search strategy to use for each category in this taxonomy without placing the burden on the web users, it is indispensable to build an automatic local query classifier. However, since local search queries yield few online features and it is expensive to obtain editorial labels, it is insufficient to use only a supervised learning approach. In this paper, we address these problems by developing a semi-supervised approach for mining information needs from a vast amount of unlabeled data from local query logs to boost local query classification. Results of a large scale evaluation over queries from a commercial local search site illustrate that the proposed semi-supervised method allows us to accurately classify a substantially larger proportion of local queries than the supervised learning approach.
ONTOCUBE: efficient ontology extraction using OLAP cubes BIBAFull-Text 2429-2432
  Carlos Garcia-Alvarado; Zhibo Chen; Carlos Ordonez
Ontologies are knowledge conceptualizations of a particular domain and are commonly represented with hierarchies. While final ontologies appear deceptively simple on paper, building ontologies is a time-consuming task that is normally performed by natural language processing techniques or schema matching. On the other hand, OLAP cubes are most commonly used during decision-making processes via the analysis of data summarizations. In this paper, we present a novel approach based on using OLAP cubes for ontology extraction. The resulting ontology is obtained through an analytical process of the summarized frequencies of keywords within a corpus. The solution was implemented within a relational database management system (DBMS). In our experiments, we show how all the proposed discrimination measures (frequency, correlation, lift) affect the resulting classes. We also show a sample ontology result and the accuracy of finding true classes. Finally, we show the performance breakdown of our algorithm.
An algorithm for axiom pinpointing in EL+ and its incremental variant BIBAFull-Text 2433-2436
  Xiaojun Cheng; Guilin Qi
Axiom pinpointing plays an important role in the development and maintenance of ontologies. It helps the user to comprehend an unwanted entailment of an ontology by presenting all minimal subsets of the ontology which are responsible for the entailment (called MinAs). In this paper, we consider the problem of axiom pinpointing in description logic EL+, which underpins OWL 2 EL, a profile of the latest version of Web Ontology Language (OWL). We propose a novel method to compute all MinAs that utilizes the hierarchy information obtained from the classification of an EL+ ontology. The advantage of our method over an existing labeled classification based method is that we do not attach labels to entailed subsumptions, which can exhaust memory for large-scale ontologies. We further consider axiom pinpointing in EL+ when ontologies change. An incremental algorithm is given to compute all MinAs by reusing MinAs previously computed.
Folksonomy-based term extraction for word cloud generation BIBAFull-Text 2437-2440
  David Carmel; Erel Uziel; Ido Guy; Yosi Mass; Haggai Roitman
In this work we study the task of term extraction for word cloud generation. We present a folksonomy-based term extraction method, called tag-boost, which boosts terms that are frequently used by the public to tag content. Our experiments with tag-boost-based term extraction over different domains demonstrate tremendous improvement in word cloud quality, as reflected by the agreement between extracted terms and manually assigned tags of the testing items. Additionally, we show that tag-boost can be effectively applied even in non-tagged domains, by using an external rich folksonomy borrowed from a well-tagged domain.
Efficient association discovery with keyword-based constraints on large graph data BIBAFull-Text 2441-2444
  Mo Zhou; Yifan Pan; Yuqing Wu
In many domains, such as social networks and chem-informatics, data can be represented naturally in a graph model, with nodes being data entries and edges the relationships between them. We study the application requirements in these domains and find that discovering Constrained Acyclic Paths (CAP) is in high demand. In this paper, we define the CAP search problem and introduce a set of quantitative metrics for describing keyword-based constraints. We propose a series of algorithms to efficiently evaluate CAP queries on large-scale graph data. Extensive experiments illustrate that our algorithms are both efficient and scalable.
AWETO: efficient incremental update and querying in rdf storage system BIBAFull-Text 2445-2448
  Xu Pu; Jianyong Wang; Ping Luo; Min Wang
With the fast growth of the knowledge bases built over the Internet, storing and querying millions or billions of RDF triples in a knowledge base has attracted increasing research interest. Although the latest RDF storage systems achieve good querying performance, few of them pay much attention to the dynamic growth of the knowledge base. In this paper, to address the efficiency of both querying and incremental update in RDF data, we propose a hAsh-based tWo-tiEr rdf sTOrage system (abbreviated as AWETO) with a new index architecture and query execution engine. The performance of our system is systematically measured over two large-scale datasets. Compared with three other state-of-the-art RDF storage systems, our system achieves the best incremental update efficiency, while its query efficiency remains competitive.
Insert-friendly XML containment labeling scheme BIBAFull-Text 2449-2452
  Canwei Zhuang; Ziyu Lin; Shaorong Feng
The labeling scheme is designed to label the XML nodes so that both ordered and un-ordered queries can be processed without accessing the original XML file. When XML data become dynamic, it is important to design a labeling scheme that can facilitate updates and support query processing efficiently. In this paper, we propose a novel containment labeling scheme called DXCL (Dynamic XML Containment Labeling) to effectively process updates in dynamic XML data. Compared with the existing dynamic labeling schemes, a distinguishing feature of DXCL is that it is compact and efficient regardless of whether the documents are updated or not. DXCL uses fixed length integer numbers to label initial XML documents and hence yields compact label size and high query performance. When updates take place, DXCL also has high performance on both label updates and query processing, especially in the case of skewed insertions. Experimental results confirm the benefits of our approach over the previous dynamic schemes.
A pretopological framework for the automatic construction of lexical-semantic structures from texts BIBAFull-Text 2453-2456
  Guillaume Cleuziou; Davide Buscaldi; Vincent Levorato; Gaël Dias
We present in this paper a new approach for the automatic generation of lexical structures from texts. This tedious task is based on the strong hypothesis that simple statistical observations on textual usages can provide pieces of semantics about the lexicon. Using such "naive" observations only, we propose a (pre)-topological framework to formalize and combine various hypotheses on textual data usages and then to derive a structure similar to usual lexical knowledge bases such as WordNet. In addition, we also consider the evaluation problem for the obtained lexical structures; a multi-level evaluation strategy is proposed that measures the fit between a given reference structure and automatically generated structures from different points of view: intrinsic/structural and application-based. The evaluation strategy is then used to quantify the contribution of the new structuring approach with respect to the corresponding solution proposed by (Sanderson et al. 2000) on two case studies that differ in the domain and the size of the lexicon.
Leveraging web 2.0 data for scalable semi-supervised learning of domain-specific sentiment lexicons BIBAFull-Text 2457-2460
  Raymond Yiu Keung Lau; Chun Lam Lai; Peter B. Bruza; Kam F. Wong
Since manually constructing domain-specific sentiment lexicons is extremely time consuming and may not even be feasible for domains where linguistic expertise is not available, research on automatic construction of domain-specific sentiment lexicons has become a hot topic in recent years. The main contribution of this paper is the illustration of a novel semi-supervised learning method which exploits both term-to-term and document-to-term relations hidden in a corpus for the construction of domain-specific sentiment lexicons. More specifically, the proposed two-pass pseudo labeling method combines shallow linguistic parsing and corpus-based statistical learning to make domain-specific sentiment extraction scalable with respect to the sheer volume of opinionated documents archived on the Internet these days. Our experiments show that the proposed method can generate high quality domain-specific sentiment lexicons according to users' evaluation.
Classifying trending topics: a typology of conversation triggers on Twitter BIBAFull-Text 2461-2464
  Arkaitz Zubiaga; Damiano Spina; Víctor Fresno; Raquel Martínez
Twitter summarizes the large volume of messages posted by users in the form of trending topics that reflect the top conversations being discussed at a given moment. These trending topics tend to be connected to current affairs. Different happenings can give rise to these trending topics, for instance a sports event broadcast on TV or a viral meme introduced by a community of users. Detecting the type of origin can facilitate information filtering, enhance real-time data processing, and improve user experience. In this paper, we introduce a typology to categorize the triggers that leverage trending topics: news, current events, memes, and commemoratives. We define a set of straightforward language-independent features that rely on the social spread of the trends to discriminate among those types of trending topics. Our method provides an efficient way to immediately and accurately categorize trending topics without the need for external data, outperforming a content-based approach.
Enhancing accessibility of microblogging messages using semantic knowledge BIBAFull-Text 2465-2468
  Xia Hu; Lei Tang; Huan Liu
The volume of microblogging messages is increasing exponentially with the popularity of microblogging services. The large number of messages appearing in user interfaces hinders user access to useful information buried in disorganized, incomplete, and unstructured text messages. In order to enhance user accessibility, we propose to aggregate related microblogging messages into clusters and automatically assign them semantically meaningful labels. However, a distinctive feature of microblogging messages is that they are much shorter than conventional text documents. These messages provide inadequate term co-occurrence information for capturing semantic associations. To address this problem, we propose a novel framework for organizing unstructured microblogging messages by transforming them to a semantically structured representation. The proposed framework first captures informative tree fragments by analyzing a parse tree of the message, and then exploits external knowledge bases (Wikipedia and WordNet) to enhance their semantic information. Empirical evaluation on a Twitter dataset shows that our framework significantly outperforms existing state-of-the-art methods.
Imbalanced sentiment classification BIBAFull-Text 2469-2472
  Shoushan Li; Guodong Zhou; Zhongqing Wang; Sophia Yat Mei Lee; Rangyang Wang
Sentiment classification has undergone significant development in recent years. However, most existing studies assume the balance between negative and positive samples, which may not be true in reality. In this paper, we investigate imbalanced sentiment classification instead. In particular, a novel clustering-based stratified under-sampling framework and a centroid-directed smoothing strategy are proposed to address the imbalanced class and feature distribution problems respectively. Evaluation across different datasets shows the effectiveness of both the under-sampling framework and the smoothing strategy in handling the imbalanced problems in real sentiment classification applications.
The where in the tweet BIBAFull-Text 2473-2476
  Wen Li; Pavel Serdyukov; Arjen P. de Vries; Carsten Eickhoff; Martha Larson
Twitter is a widely-used social networking service which enables its users to post text-based messages, so-called tweets. POI tags on tweets can show more human-readable high-level information about a place rather than just a pair of coordinates. In this paper, we attempt to predict the POI tag of a tweet based on its textual content and time of posting. Potential applications include accurate positioning when GPS devices fail and disambiguating places located near each other. We consider this task as a ranking problem, i.e., we try to rank a set of candidate POIs according to a tweet by using language and time models. To tackle the sparsity of tweets tagged with POIs, we use web pages retrieved by search engines as an additional source of evidence. From our experiments, we find that users indeed leak some information about their accurate locations in their tweets.
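The ranking formulation can be illustrated with a minimal query-likelihood sketch: each candidate POI is scored by how likely its text is to generate the tweet. The add-one smoothing, the function name, and the toy data are assumptions made for illustration; the time model mentioned in the abstract is not covered here.

```python
import math
from collections import Counter


def rank_pois(tweet_tokens, poi_texts):
    """Rank POIs by the add-one-smoothed unigram log-likelihood of the tweet.

    Hypothetical helper: poi_texts maps a POI name to the list of tokens
    collected for it (e.g. from tagged tweets or retrieved web pages).
    """
    vocab = {tok for toks in poi_texts.values() for tok in toks} | set(tweet_tokens)
    scores = {}
    for poi, toks in poi_texts.items():
        counts = Counter(toks)
        denom = len(toks) + len(vocab)  # add-one smoothing denominator
        scores[poi] = sum(math.log((counts[t] + 1) / denom) for t in tweet_tokens)
    # Highest log-likelihood first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy usage with made-up data.
poi_texts = {
    "cafe": ["coffee", "latte", "coffee"],
    "stadium": ["match", "goal", "fans"],
}
ranking = rank_pois(["coffee", "latte"], poi_texts)
```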
Question identification on Twitter BIBAFull-Text 2477-2480
  Baichuan Li; Xiance Si; Michael R. Lyu; Irwin King; Edward Y. Chang
In this paper, we investigate the novel problem of automatic question identification in the microblog environment. It contains two steps: detecting tweets that contain questions (we call them "interrogative tweets") and extracting the tweets which really seek information or ask for help (so-called "qweets") from interrogative tweets. To detect interrogative tweets, both a traditional rule-based approach and a state-of-the-art learning-based method are employed. To extract qweets, context features like short URLs and Tweet-specific features like Retweets are carefully selected for classification. We conduct an empirical study on a one-hour sample of English tweets and report our experimental results for question identification on Twitter.
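A rule-based detector for interrogative tweets might look like the sketch below. The two rules (a question mark outside URLs, or a leading question word after optional @mentions) and the word list are assumptions chosen for illustration, not the rules used in the paper.

```python
import re

# Hypothetical rule set, for illustration only.
QUESTION_WORDS = (
    r"(?:who|what|when|where|why|how|which|is|are|do|does|can|could|would|should)"
)
_URL = re.compile(r"https?://\S+")
_LEADING = re.compile(r"^(?:@\w+\s+)*" + QUESTION_WORDS + r"\b", re.IGNORECASE)


def is_interrogative(tweet):
    """Return True if the tweet looks like it contains a question."""
    # Drop URLs first so that '?' in a query string does not count.
    text = _URL.sub("", tweet).strip()
    return "?" in text or bool(_LEADING.match(text))
```

Such rules only cover the first step; deciding whether an interrogative tweet really seeks information (a "qweet") is the classification step the abstract describes.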
OpinioNetIt: understanding the opinions-people network for politically controversial topics BIBAFull-Text 2481-2484
  Rawia Awadallah; Maya Ramanath; Gerhard Weikum
The WikiLeaks documents and the economic crises in Ireland and Portugal are some of the controversial topics being played out in the news every day. Each of these topics has many different aspects, and there is no absolute, simple truth in answering questions such as: should the EU guarantee the financial stability of each member country, or should the countries themselves be solely responsible? To understand the landscape of opinions, it would be helpful to know which politician or other stakeholder takes which position -- support or opposition -- on these aspects of controversial topics. In this paper, we describe our system, named OpinioNetIt (pronounced similar to "opinionated"), which aims to automatically derive a map of the opinions-people network from news and other Web documents.
   We build this network as follows. First, we make use of a small number of generic seeds to identify controversial phrases from text. These phrases are then clustered and organized into a hierarchy of topics. Second, opinion holders are identified for each topic and their opinions (either supporting or opposing the topic) are extracted. Third, the known topics and people are used to construct a lexicon of phrases indicating support or opposition. Finally, the lexicon is used to identify more opinion holders, opinions and topics. Our system currently consists of approximately 30000 person-opinion-topic triples. Our evaluation shows that OpinioNetIt has high accuracy.
Predicting the uncertainty of sentiment adjectives in indirect answers BIBAFull-Text 2485-2488
  Mitra Mohtarami; Hadi Amiri; Man Lan; Chew Lim Tan
Opinion question answering (QA) requires automatic and correct interpretation of an answer relative to its question. However, the ambiguity that often exists in the question-answer pairs causes complexity in interpreting the answers. This paper aims to infer yes/no answers from indirect yes/no question-answer pairs (IQAPs) that are ambiguous due to the presence of ambiguous sentiment adjectives. We propose a method to measure the uncertainty of the answer in an IQAP relative to its question. In particular, to infer the yes or no response from an IQAP, our method employs antonyms, synonyms, word sense disambiguation as well as the semantic association between the sentiment adjectives that appear in the IQAP. Extensive experiments demonstrate the effectiveness of our method over the baseline.
Sentiment classification via l2-norm deep belief network BIBAFull-Text 2489-2492
  Tao Liu; Minghui Li; Shusen Zhou; Xiaoyong Du
Automatic analysis of sentiments expressed in large-scale online reviews is very important for intelligent business applications. Sentiment classification is the most popular task of sentiment analysis, and it is more challenging than traditional topic-based text classification. Basic features, such as vocabulary words, are not enough to classify sentiments well. A Deep Belief Network (DBN) is introduced to discover more abstract features of sentiments. To capture the full information of the features, a large network can be constructed, but at the same time a large network tends to overfit the training data and even noise, which reduces the generalization ability of the network. In this paper, an L2-norm Deep Belief Network (L2DBN) is proposed, which uses L2-norm regularization to optimize the network parameters of the DBN. L2DBN is first initialized by an unsupervised layer-wise training algorithm, and then fine-tuned by a supervised procedure. Network parameters are optimized using both classification loss and network complexity. Experimental results show that the proposed L2DBN outperforms the state-of-the-art method and the basic DBN on golden, noisy and heterogeneous datasets.
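For orientation, a generic L2-regularized objective of the kind the abstract alludes to (classification loss plus a complexity penalty) can be written as follows. The symbols are assumptions: \theta stands for the network parameters, \mathcal{L}_{\mathrm{cls}} for the classification loss, and \lambda for the regularization strength. The exact objective used in the paper may differ.

```latex
J(\theta) = \mathcal{L}_{\mathrm{cls}}(\theta) + \frac{\lambda}{2}\,\lVert \theta \rVert_2^2,
\qquad
\nabla_\theta J(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{cls}}(\theta) + \lambda\,\theta
```

The extra gradient term \lambda\theta shrinks every weight toward zero at each update, which is what discourages a large network from fitting noise.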
Domain customization for aspect-oriented opinion analysis with multi-level latent sentiment clues BIBAFull-Text 2493-2496
  Honglei Guo; Huijia Zhu; Zhili Guo; Zhong Su
Aspect-oriented opinion mining detects the reviewers' sentiment orientation (e.g. positive, negative or neutral) towards different product-features. Domain customization is a big challenge for opinion mining due to the accuracy loss across domains. In this paper, we show our experiences and lessons learned in the domain customization for the aspect-oriented opinion analysis system OpinionIt. We present a customization method for sentiment classification with multi-level latent sentiment clues. We first construct a Latent Semantic Association model to capture latent associations among product-features from the unlabeled corpus. Meanwhile, we present an unsupervised method to effectively extract various domain-specific sentiment clues from the unlabeled corpus. In the customization, we tune the sentiment classifier on the labeled source domain data by incorporating the multi-level latent sentiment clues (e.g. latent association among product-features, domain-specific and generic sentiment clues). Experimental results show that the proposed method significantly reduces the accuracy loss of sentiment classification without any labeled target domain data.
Accurate information extraction for quantitative financial events BIBAFull-Text 2497-2500
  Hassan H. Malik; Vikas S. Bhardwaj; Huascar Fiorletta
In this paper, we present a novel financial event extraction system that achieves very high extraction quality by combining the outcome of statistical classifiers with a set of rules. Using expert-annotated press releases as training data, and novel feature generation schemes, our system learns multiple binary classifiers for each "slot" in a financial event. At runtime, common parsing and search indexing methods are used to normalize incoming press releases and to identify candidate event "slots". Rules are applied on candidates that satisfy a combination of classifiers, and the system confidence on extracted events is estimated using a unique confidence model learned from training data. We present results of experiments performed on European corporate press releases for extracting dividend events, and show that our system achieves a precision of 96% and a recall of 79%.
A machine-learned proactive moderation system for auction fraud detection BIBAFull-Text 2501-2504
  Liang Zhang; Jie Yang; Wei Chu; Belle Tseng
Online auction and shopping are gaining popularity with the growth of web-based eCommerce. Criminals are also taking advantage of these opportunities to conduct fraudulent activities against honest parties with the purpose of deception and illegal profit. In practice, proactive moderation systems are deployed to detect suspicious events for further inspection by human experts. Motivated by real-world applications in commercial auction sites in Asia, we develop various advanced machine learning techniques in the proactive moderation system. Our proposed system is formulated as optimizing bounded generalized linear models in multi-instance learning problems, with intrinsic bias in selective labeling and massive unlabeled samples. In both offline evaluations and online bucket tests, the proposed system significantly outperforms the rule-based system on various metrics, including area under ROC (AUC), loss rate of labeled frauds and customer complaints. We also show that the metrics of loss rates are more effective than AUC in our cases.
Simultaneously improving CSAT and profit in a retail banking organization BIBAFull-Text 2505-2508
  Sameep Mehta; Ullas Nambiar; Vishal Batra; Sumit Negi; Prasad Deshpande; Gyana Praija
Customer satisfaction (CSAT) is the key driver for retention and growth in retail banking and several techniques have been applied by banks to achieve this. For instance, banks in emerging markets with high footfall in branches have gone beyond the traditional approach of segmenting customers and services to optimizing the wait time for customers visiting the bank's branch. While this approach has significantly improved service quality, it has also added a new dimension to the service quality metric: proactively identifying and addressing customer needs for (i) an efficient banking experience and (ii) enhancing profit by selling additional services to existing customers. In this paper we present a system that addresses the challenge involved in providing better service to retail banking customers while ensuring that a larger share of each customer's wallet comes to the branch. We do this by combining predictive analytics, scheduling and process optimization techniques.
Coarse-to-fine classification via parametric and nonparametric models for computer-aided diagnosis BIBAFull-Text 2509-2512
  Le Lu; Meizhu Liu; Xiaojing Ye; Shipeng Yu; Heng Huang
Classification is one of the core problems in Computer-Aided Diagnosis (CAD), targeting early cancer detection through 3D medical imaging interpretation. High detection sensitivity with a desirably low false positive (FP) rate is critical for a CAD system to be accepted as a valuable or even indispensable tool in radiologists' workflow. Given the various spurious imagery noises which cause observation uncertainties, this remains a very challenging task. In this paper, we propose a novel, two-tiered coarse-to-fine (CTF) classification cascade framework to tackle this problem. We first obtain classification-critical data samples (e.g., implicit samples on the decision boundary) extracted from the holistic data distributions using a robust parametric model (e.g., [13]); then we build a graph-embedding based nonparametric classifier on sampled data, which can more accurately preserve or formulate the complex classification boundary. These two steps can also be considered as effective "sample pruning" and "feature pursuing + kNN/template matching", respectively. Our approach is validated comprehensively in colorectal polyp detection and lung nodule detection CAD systems, addressing the top two deadly cancers, using hospital-scale, multi-site clinical datasets. The results show that our method achieves overall better classification/detection performance than existing state-of-the-art algorithms using single-layer classifiers, such as the support vector machine variants [17], boosting [15], logistic regression [11], relevance vector machine [13], k-nearest neighbor [9] or spectral projections on graph [2].

Demonstration session 1

Exploratory search over social-medical data BIBAFull-Text 2513-2516
  Haggai Roitman; Sivan Yogev; Yevgenia Tsimerman; Dae Won Kim; Yossi Mesika
In this demo we shall present the IBM Patient Empowerment System (PES), and more specifically, its social-medical discovery sub-system. Social and medical data are represented using entities and relationships and are explored using a combination of an expressive, yet intuitive, query language, faceted search, and ER graph navigation. While this demonstration focuses on the healthcare domain, the underlying search technology is generic and can be utilized in many other domains. Therefore, this demo has two main contributions. First, we present a novel entity-relationship indexing and retrieval solution, and discuss its implementation challenges. Second, the demonstration depicts a practical entity-relationship discovery technology in a real domain setting within a real IBM system.
Black swan: augmenting statistics with event data BIBAFull-Text 2517-2520
  Johannes Lorey; Felix Naumann; Benedikt Forchhammer; Andrina Mascher; Peter Retzlaff; Armin ZamaniFarahani; Soeren Discher; Cindy Faehnrich; Stefan Lemme; Thorsten Papenbrock; Robert Christoph Peschel; Stephan Richter; Thomas Stening; Sven Viehmeier
A large number of statistical indicators (GDP, life expectancy, income, etc.) collected over long periods of time as well as data on historical events (wars, earthquakes, elections, etc.) are published on the World Wide Web. By augmenting statistical outliers with relevant historical occurrences, we provide a means to observe (and predict) the influence and impact of events. The vast amount and size of available data sets enable the detection of recurring connections between classes of events and statistical outliers with the help of association rule mining. The results of this analysis are published at http://www.blackswanevents.org and can be explored interactively.
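Association rule mining rests on two standard measures, support and confidence. The sketch below computes both for a single candidate rule; the function name and the toy observations are made up for illustration and do not come from the Black Swan data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Hypothetical helper: each transaction is a set of item labels, e.g. the
    event classes and outlier types observed for one country and year.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Made-up observations, for illustration only.
observations = [
    {"earthquake", "gdp_drop"},
    {"earthquake", "gdp_drop"},
    {"earthquake"},
    {"election"},
]
support, confidence = rule_metrics(observations, {"earthquake"}, {"gdp_drop"})
```

In this toy data the rule holds in 2 of 4 observations (support 0.5) and in 2 of the 3 observations containing the antecedent (confidence 2/3).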
A data mining system based on SQL queries and UDFs for relational databases BIBAFull-Text 2521-2524
  Carlos Ordonez; Carlos Garcia-Alvarado
Most research on data mining has proposed algorithms and optimizations that work on flat files, outside a DBMS, mainly due to the following reasons. It is easier to develop efficient algorithms in a traditional programming language. The integration of data mining algorithms into a DBMS is difficult given its relational model foundation and system architecture. Moreover, SQL may be slow and cumbersome for numerical analysis computations. Therefore, data mining users commonly export data sets outside the DBMS for data mining processing, which creates a performance bottleneck and eliminates important data management capabilities such as query processing and security, among others (e.g. concurrency control and fault tolerance). With that motivation in mind, we developed a novel system based on SQL queries and User-Defined Functions (UDFs) that can directly analyze relational tables to compute statistical models, storing such models as relational tables as well. Most algorithms have been optimized to reduce the number of passes on the data set. Our system can analyze large and high dimensional data sets faster than external data mining tools.
Data-thirsty business analysts need SODA: search over data warehouse BIBAFull-Text 2525-2528
  Lukas Blunschi; Claudio Jossen; Donald Kossmann; Magdalini Mori; Kurt Stockinger
Querying large data warehouses is very hard for non-tech-savvy business users. Deep technical knowledge of both SQL and the schema of the database is required in order to build correct queries and to come up with new business insights. In this paper we introduce a novel system called SODA (Search Over DAta Warehouse) that bridges the gap between the business world and the IT world by enabling extended keyword search in a data warehouse. SODA uses metadata information, DBpedia entries as well as base data to generate SQL to allow intuitive exploration of the data. The process of query classification, query graph generation and SQL generation is visualized to provide the analysts with information on how the query results are produced. Experiments with real data of a global financial institution comprising around 300 tables showed promising results.
An integrated environment for semantic knowledge work BIBAFull-Text 2529-2532
  Aba-Sah Dadzie; Victoria Uren; Ziqi Zhang; Philip Webster
In this demonstration, we will present a semantic environment called the K-Box. The K-Box supports the lightweight integration of knowledge tools, with a focus on semantic tools, but with the flexibility to integrate natural language and conventional tools. We discuss the implementation of the framework, and two existing applications, including details of a new application for developers of semantic workflows. The demonstration will be of interest to developers and researchers of ontology-based knowledge management systems, and semantic desktops, and to analysts working with cross-media information.
Editing knowledge resources: the wiki way BIBAFull-Text 2533-2536
  Francesco Ronzano; Andrea Marchetti; Maurizio Tesconi
The creation, customization, and maintenance of knowledge resources are essential for fostering the full deployment of Language Technologies. The definition and refinement of knowledge resources are time- and resource-consuming activities. In this paper we explore how the Wiki paradigm for online collaborative content editing can be exploited to gather massive social contributions from common Web users in editing knowledge resources. We discuss the Wikyoto Knowledge Editor, also called Wikyoto. Wikyoto is a collaborative Web environment that enables users with no knowledge engineering background to edit the multilingual network of knowledge resources exploited by KYOTO, a cross-lingual text mining system developed in the context of the KYOTO European Project.
Marco Polo: a system for brand-based shopping and exploration BIBAFull-Text 2537-2540
  Nish Parikh; Neel Sundaresan
In today's world, brand-based shopping is popular, especially in product lines like clothing and shoes, appliances, and electronics. Because of the importance of brands while shopping, it has become important for online shopping portals to consider brand loyalty and brand preferences of users. In this paper, we describe a system designed for brand-based shopping and exploration. The system is built by analyzing a large query set consisting of 115M queries from eBay.com -- a vibrant marketplace with more than 95M active users. The system allows brand-pivoted exploration of inventory. It allows exploration and purchase of substitute branded goods (e.g. Sony camcorder for Canon camcorder) and complementary branded merchandise (e.g. Lego castle set for Lego train station set).

Demonstration session 2

Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs BIBAFull-Text 2541-2544
  Kazufumi Watanabe; Masanao Ochi; Makoto Okabe; Rikio Onai
We propose a system for detecting local events in the real world using geolocation information from microblog documents. A local event happens when people with a common purpose gather at the same time and place. To detect such an event, we identify a group of Twitter documents describing the same theme that were generated within a short time and a small geographic area. Timestamps and geotags are useful for finding such documents, but only 0.7% of documents are geotagged, which is not sufficient for this purpose. Therefore, we propose an automatic geotagging method that identifies the location of non-geotagged documents. Our geotagging method successfully increased the number of geographic groups by about 115 times. For each group of documents, we extract co-occurring terms to identify its theme and determine whether it is about an event. We subjectively evaluated the precision of our detected local events and found that it had 25.5% accuracy. These results demonstrate that our system can detect local events that are difficult to identify using existing event detection methods. A user can interactively specify the size of a desired event by manipulating the parameters of date, area size, and the minimum number of Twitter users associated with the location. Our system allows users to enjoy the novel experience of finding a local event happening near their current location in real time.
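The idea of grouping documents "generated within a short time and a small geographic area" can be sketched by bucketing geotagged documents into fixed time windows and grid cells. The grid-based bucketing, the function name, and the default cell and window sizes are assumptions for illustration; the abstract does not specify how Jasmine forms its groups.

```python
import math
from collections import defaultdict


def group_by_cell(docs, cell_deg=0.01, window_s=3600):
    """Bucket (doc_id, unix_ts, lat, lon) tuples by time window and grid cell.

    Hypothetical helper: cell_deg is the grid cell size in degrees and
    window_s the time window length in seconds.
    """
    groups = defaultdict(list)
    for doc_id, ts, lat, lon in docs:
        key = (
            int(ts // window_s),          # time window index
            math.floor(lat / cell_deg),   # latitude cell index
            math.floor(lon / cell_deg),   # longitude cell index
        )
        groups[key].append(doc_id)
    return dict(groups)


# Made-up documents: two close together in Glasgow, one in Paris.
docs = [
    ("t1", 100, 55.8612, -4.2501),
    ("t2", 900, 55.8618, -4.2507),
    ("t3", 100, 48.8566, 2.3522),
]
groups = group_by_cell(docs)
```

Each resulting group could then be filtered by a minimum number of distinct users and inspected for co-occurring terms, as the abstract describes.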
Scalable similarity search of timeseries with variable dimensionality BIBAFull-Text 2545-2548
  Omar U. Florez; Curtis Dyreson
Timeseries can be similar in shape but differ in length. For example, the sound waves produced by the same word spoken twice have roughly the same shape, but one may be shorter in duration. Stream data mining, approximate querying of image and video databases, data compression, and near duplicate detection are applications that need to be able to classify or cluster such timeseries, and to search for and rank timeseries that are similar to a chosen timeseries. We demonstrate software for clustering and performing similarity search in databases of timeseries data, where the timeseries have high and variable dimensionality. Our demonstration uses Timeseries Sensitive Hashing (TSH)[3] to index the timeseries. TSH adapts Locality Sensitive Hashing (LSH), which is an approximate algorithm to index data points in a d-dimensional space under some (e.g., Euclidean) distance function. TSH, unlike LSH, can index points that do not have the same dimensionality. As examples of the potential of TSH, the demonstration will index and classify timeseries from an image database and timeseries describing human motion extracted from a video stream and a motion capture system.
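For background, the sketch below shows plain Euclidean LSH with Gaussian random projections, the baseline that TSH is said to adapt. It is not an implementation of TSH: note that every indexed vector must have the same length `dim`, which is the restriction the abstract says TSH removes. The class name and default parameters are illustrative.

```python
import math
import random


class EuclideanLSH:
    """One hash table keyed by k concatenated Gaussian projection hashes.

    Each hash is floor((a . v + b) / width), with a drawn from a standard
    normal distribution and b drawn uniformly from [0, width).
    """

    def __init__(self, dim, k=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
        self.offsets = [rng.uniform(0.0, width) for _ in range(k)]
        self.buckets = {}

    def key(self, vec):
        """Hash a fixed-length vector to a tuple of k bucket indices."""
        return tuple(
            math.floor((sum(a * x for a, x in zip(plane, vec)) + b) / self.width)
            for plane, b in zip(self.planes, self.offsets)
        )

    def add(self, name, vec):
        self.buckets.setdefault(self.key(vec), []).append(name)

    def candidates(self, vec):
        """Return the names stored in the same bucket as vec."""
        return list(self.buckets.get(self.key(vec), []))


index = EuclideanLSH(dim=3)
index.add("a", [1.0, 2.0, 3.0])
hits = index.candidates([1.0, 2.0, 3.0])
```

Nearby points are likely, but not guaranteed, to share a bucket, so practical systems use several such tables and verify candidates with the true distance.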
RoSeS: a continuous query processor for large-scale RSS filtering and aggregation BIBAFull-Text 2549-2552
  Jordi Creus; Bernd Amann; Nicolas Travers; Dan Vodislav
We present RoSeS, a running system for large-scale content-based RSS feed filtering and aggregation. The implementation of RoSeS is based on standard database concepts like declarative query languages, views and multi-query optimization. Users create personalized feeds by defining and composing content-based filtering and aggregation queries on collections of RSS feeds. These queries are translated into continuous multi-query execution plans which are optimized using a new cost-based multi-query optimization strategy.
Conkar: constraint keyword-based association discovery BIBAFull-Text 2553-2556
  Mo Zhou; Yifan Pan; Yuqing Wu
In many domains, such as bioinformatics, cheminformatics, health informatics and social networks, data can be represented naturally as labeled graphs. To address the increasing needs in discovering interesting associations between entities in such data graphs, especially under complicated keyword-based and structural constraints, we introduce the Conkar (Constrained Keyword-based Association DiscoveRy) System. Conkar is the first system for discovering constrained acyclic paths (CAP) in graph data under keyword-based constraints, with the highlight being the set of quantitative constraint metrics that we proposed, including coverage and relevance. We will demonstrate the key features of Conkar: powerful and user-friendly query specification, efficient query evaluation, flexible and on-demand result ranking, visual result display, as well as an insight tour on our novel CAP query evaluation algorithms.
Interactive reasoning in uncertain RDF knowledge bases BIBAFull-Text 2557-2560
  Timm Meiser; Maximilian Dylla; Martin Theobald
Recent advances in Web-based information extraction have allowed for the automatic construction of large, semantic knowledge bases, which are typically captured in RDF format. The very nature of the applied extraction techniques however entails that the resulting RDF knowledge bases may face a significant amount of incorrect, incomplete, or even inconsistent (i.e., uncertain) factual knowledge, which makes query answering over this kind of data a challenge. Our reasoner, coined URDF, supports SPARQL queries along with rule-based, first-order predicate logic to infer new facts and to resolve data uncertainty over millions of RDF triplets directly at query time. We demonstrate a fully interactive reasoning engine, combining a Java-based reasoning backend and a Flash-based visualization frontend in a dynamic client-server architecture. Our visualization frontend provides interactive access to the reasoning backend, including tasks like exploring the knowledge base, rule-based and statistical reasoning, faceted browsing of large query graphs, and explaining answers through lineage.
Fu-Finder: a game for studying querying behaviours BIBAFull-Text 2561-2564
  Carly O'Neil; James Purvis; Leif Azzopardi
Usually the focus of evaluation within Information Retrieval has been placed largely upon the system. However, the individual user and their submitted queries are typically the greatest source of variation in the search process. This demonstration paper presents Fu-Finder, a fun and enjoyable game that measures the user's querying abilities (or search-fu). This game provides useful data for the study of user querying behaviour and assesses how well users can find specific web pages using different search engines.
PDFMeat: managing publications on the semantic desktop BIBAFull-Text 2565-2568
  David Aumüller; Erhard Rahm
Researchers maintain bibliographies and extensive sets of PDF files of scholarly publications on their desktop. The lack of proper metadata of downloaded PDFs makes this task a tedious one. With PDFMeat we present a solution to automatically determine publication metadata for scholarly papers within the user's desktop environment and link the metadata to the files. PDFMeat effectively matches local full texts to an online repository. In an evaluation on more than 2,000 diverse PDF files it worked highly reliably and showed excellent accuracy of up to 98 percent. We demonstrate PDFMeat for different sets of papers, highlighting the semantic integration and use of the retrieved metadata within the file browser of the desktop environment.

Demonstration session 3

MEMSCALE: in-cluster-memory databases BIBAFull-Text 2569-2572
  Héctor Montaner; Federico Silla; Holger Fröning; José Duato
We have developed a new memory architecture for clusters that allows automatic access from any processor to any memory module in the cluster completely by hardware. Thus, with a single assembly instruction a processor can retrieve (or update) a memory location in a remote node. The efficiency of this new paradigm makes it possible to speed up the execution of shared-memory applications with very large memory footprints by running them across the entire cluster, thus providing them with a true shared-memory environment (contrary to the emulation typically carried out by software-based distributed shared memory).
   This new memory architecture, referred to as MEMSCALE, opens up a new frontier for memory-hungry applications. In this paper we focus on in-memory databases and show how this target application can be boosted by our memory architecture, which can virtually provide unlimited memory resources to it.
   In the demo presented in this paper we show the advantages of our architecture by means of a prototype cluster. We configure two cluster sizes, 16 and 32 nodes, to analyze throughput scalability and latency degradation, to extrapolate these metrics to bigger clusters, and to show the benefits of our technology compared to other alternatives like SSD-based databases. Moreover, we also show the ease of use of our architecture by explaining how we ported MySQL Server to our prototype cluster. Finally, the possibility of executing queries in any processor of the cluster during the live demo will show the audience how our system aggregates the advantages of the scale-out and scale-up approaches to database server growth.
H-DB: a hybrid quantitative-structural sql optimizer BIBAFull-Text 2573-2576
  Lucantonio