HCI Bibliography Home | HCI Conferences | CIKM Archive | Detailed Records | RefWorks | EndNote | Hide Abstracts
CIKM Tables of Contents: 0809101112131415

Proceedings of the 2010 ACM Conference on Information and Knowledge Management

Fullname:Proceedings of the 19th ACM international conference on Information and knowledge management
Editors:Jimmy Huang; Nick Koudas; Gareth Jones; Xindong Wu; Kevyn Collins-Thompson; Aijun An
Location:Toronto, Ontario, Canada
Dates:2010-Oct-26 to 2010-Oct-30
Standard No:ISBN: 978-1-4503-0099-5; ACM DL: Table of Contents; hcibib: CIKM10
Links:Conference Website
Summary:On behalf of the organizing committee, I wholeheartedly welcome you to the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010). I hope this conference proves to be interesting and beneficial.
    CIKM is a well-known top tier and premier ACM conference in the areas of information retrieval, knowledge management and database. Since its inception, the CIKM conference has provided a unique international forum for the presentation, discussion, and dissemination of research findings in data management, information retrieval, and knowledge management. The purpose of the conference is to identify challenging problems facing the development of future knowledge and information systems, and to shape future research directions through the publication of high quality, applied and theoretical research findings. The conference has been a leading forum in which experts from academia, industry, and the government gather to exchange ideas, research achievements, and technical developments in multidisciplinary research areas.
    CIKM has rapidly grown to become one of the world's most recognized conferences in the field. This year CIKM has received a record high number of submissions in the history of CIKM, as can be seen from the following statistics:
* 1382 abstracts submitted
* 945 full papers plus 38 demo papers submitted
* 126 papers accepted for presentation as full papers (13.3% acceptance
    rate) and an additional 165 were accepted for short papers (17.5%). In addition to regular research tracks, CIKM 2010 features 4 keynote speakers, 4 pre-conference tutorials, 9 workshops, 12 industrial full papers and 20 demo papers. I am proud of our program and acknowledge the tireless efforts of people who materialized this program.
    First of all, I am honored to have 4 distinguished keynote speakers: Jamie Callan, Susan Dumais, Gregory Grefenstette, and Divesh Srivastava. I deeply appreciate their time and commitment to deliver their speeches and share their cutting-edge research experiences and insightful comments in their research topics.
  1. Keynote addresses
  2. KM track: information extraction
  3. IR track: ranking and retrieval model
  4. DB track: indexes and query optimization
  5. IR track: domain-specific and multimedia IR
  6. KM track: link and graph mining
  7. IR track: machine learning for IR (I)
  8. DB track: mobile and distributed data management
  9. KM track: classification and clustering
  10. KM track: large-scale statistical techniques
  11. IR track: machine learning for IR (II)
  12. DB track: top-K and shortest path processing
  13. IR track: IR evaluation
  14. KM track: information filtering and recommender systems (I)
  15. IR track: social networks and text mining
  16. Industry track: IR applications
  17. DB track: information retrieval in databases
  18. KM track: temporal, spatial and stream data mining
  19. IR track: filtering and recommendation
  20. Industry track: databases and OLAP
  21. KM track: data pre- and post-processing
  22. KM track: information filtering and recommender systems (II)
  23. IR track: user modeling and search personalization
  24. Industry track: web and social networks
  25. IR track: query analysis and feedback
  26. KM track: semantic techniques
  27. IR track: web search
  28. Industry track: text mining and analytics
  29. IR track: scalability and efficiency
  30. Poster session 1: DB track
  31. Poster session 2: IR track
  32. Poster session 3: KM track
  33. Poster session 4: Industry track
  34. Demo session 1: IR
  35. Demo session 2: KM and DB
  36. Co-located workshop summaries

Keynote addresses

Search engine support for software applications BIBAFull-Text 1-2
  Jamie Callan
Question-answering, computer-assisted language learning, text mining, and other software applications that use a full-search engine to find information in a large text corpus are becoming common. A software application may use metadata and text annotations to reduce the mismatch between the concept-based representations convenient for inference and the word-based representations typically used for text retrieval. Software applications may also be able to specify detailed requirements that retrieved passages must satisfy. This use of text search is very different than the ad-hoc, interactive search that information retrieval research typically studies.
   Search engine developers are beginning to respond by extending indexing and retrieval models developed for structured (e.g., XML) documents to support multiple representations of document content, text annotations, metadata, and relationships. These new requirements force developers to reconsider basic assumptions about index data structures and ranked retrieval models.
   How best to use these new capabilities is an open problem. Straightforward transformation of a detailed information need into a complex structured query can produce a query that is effective for exact-match retrieval, but a challenge for the retrieval model to use effectively for best-match retrieval. Bag-of-words retrieval is often disparaged, but its advantage is that it is robust: It works well even when desired documents do not exactly meet expectations.
   This talk discusses some of the problems encountered when extending a search engine to support queries posed by other software applications and structured documents with derived annotations.
Schema extraction BIBAFull-Text 3-4
  Divesh Srivastava
Understanding the schema of a complex database is a crucial step in exploratory data analysis. However, gaining such an understanding is challenging for new users for many reasons. First, complex databases often have thousands of inter-linked tables, with little indication of the important tables or the main concepts in the database schema. Second, schemas can be inaccurate, e.g., some foreign/primary key relationships are not known to designers but are inherent in the data, while others become invalid due to data inconsistencies. In this talk, we present an approach to effectively address these challenges and automatically extract an understandable schema from a complex database.
   The first step in our approach is a robust algorithm to discover foreign/primary key relationships between tables. We present a general rule, termed Randomness, that subsumes a variety of other rules proposed in previous work, and develop efficient approximation algorithms for evaluating randomness, using only two passes over the data. The second step is a principled approach to summarize the schema consisting of tables linked using foreign/primary keys, so that a user can easily identify the main concepts and important tables. We present an information theoretic approach to identify important tables, and an intuitive notion of table similarity that can be used to cluster tables into the main concepts of the schema. We validate our approach using real and synthetic datasets.
   This is based on joint work [1, 2] with Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, Xiaoyan Yang and Meihui Zhang.
Use of semantics in real life applications BIBAFull-Text 5-6
  Gregory Grefenstette
Semantics has many different definitions in science. In natural language processing, there has been much research over the past three decades involving extracting the semantics, the meaning, of natural texts. This has led to entity recognition (people, places, companies, prices, dates, and events), and more recently into sentiment analysis, exploring another level of meaning in a text. These techniques are now well understood and robust. Results of this research are beginning to appear in products and online sites, finding their way into practical applications. The stage is set for an explosion of semantically savvy applications, from 3D design, to enhanced web browsing, to social network aware yellowpages. This talk will explore these paths from research to industry, illustrated by current products on the market.
Temporal dynamics and information retrieval BIBAFull-Text 7-8
  Susan T. Dumais
Many digital resources, like the Web, are dynamic and ever-changing collections of information. However, most of the tools information retrieval and management that have been developed for interacting with Web content, such as browsers and search engines, focus on a single static snapshot of the information. In this talk, I will present analyses of how Web content changes over time, how people re-visit Web pages over time, and how re-visitation patterns are influenced by changes in user intent and content. These results have implications for many aspects of information retrieval and management including crawling, ranking and information extraction algorithms, result presentation, and evaluation. I will describe a prototype system that supports people in understanding how the information they interact with changes over time, and a new retrieval model that incorporates features about the temporal evolution of content to improve core ranking. Finally, I will conclude with an overview of some general challenges that need to be addressed to fully incorporate temporal dynamics in information retrieval and information management systems.

KM track: information extraction

Components for information extraction: ontology-based information extractors and generic platforms BIBAFull-Text 9-18
  Daya C. Wimalasuriya; Dejing Dou
Information Extraction (IE) has existed as a field for several decades and has produced some impressive systems in the recent past. Despite its success, widespread usage and commercialization remain elusive goals for this field. We identify the lack of effective mechanisms for reuse as one major reason behind this situation. Here, we mean not only the reuse of the same IE technique in different situations but also the reuse of information related to the application of IE techniques (e.g., features used for classification).
   We have developed a comprehensive component-based approach for information extraction that promotes reuse to address this situation. We designed this approach starting from our previous work on the use of multiple ontologies in information extraction [24]. The key ideas of our approach are "information extractors," which are components of an IE system that make extractions with respect to particular components of an ontology and "platforms for IE," which are domain and corpus independent implementations of IE techniques. A case study has shown that this component-based approach can be successfully applied in practical situations.
Automatically acquiring a semantic network of related concepts BIBAFull-Text 19-28
  Sean Szumlanski; Fernando Gomez
We describe the automatic construction of a semantic network1, in which over 3000 of the most frequently occurring monosemous nouns2 in Wikipedia (each appearing between 1,500 and 100,000 times) are linked to their semantically related concepts in the WordNet noun ontology. Relatedness between nouns is discovered automatically from co-occurrence in Wikipedia texts using an information theoretic inspired measure. Our algorithm then capitalizes on salient sense clustering among related nouns to automatically disambiguate them to their appropriate senses (i.e., concepts). Through the act of disambiguation, we begin to accumulate relatedness data for concepts denoted by polysemous nouns, as well. The resultant concept-to-concept associations, covering 17,543 nouns, and 27,312 distinct senses among them, constitute a large-scale semantic network of related concepts that can be conceived of as augmenting the WordNet noun ontology with related-to links.
Online annotation of text streams with structured entities BIBAFull-Text 29-38
  Ken Q. Pu; Oktie Hassanzadeh; Richard Drake; Renée J. Miller
We propose a framework and algorithm for annotating unbounded text streams with entities of a structured database. The algorithm allows one to correlate unstructured and dirty text streams from sources such as emails, chats and blogs, to entities stored in structured databases. In contrast to previous work on entity extraction, our emphasis is on performing entity annotation in a completely online fashion. The algorithm continuously extracts important phrases and assigns to them top-k relevant entities. Our algorithm does so with a guarantee of constant time and space complexity for each additional word in the text stream, thus infinite text streams can be annotated. Our framework allows the online annotation algorithm to adapt to changing stream rate by self-adjusting multiple run-time parameters to reduce or improve the quality of annotation for fast or slow streams, respectively. The framework also allows the online annotation algorithm to incorporate query feedback to learn the user preference and personalize the annotation for individual users.
Automatic extraction of web data records containing user-generated content BIBAFull-Text 39-48
  Xinying Song; Jing Liu; Yunbo Cao; Chin-Yew Lin; Hsiao-Wuen Hon
In this paper, we are concerned with the problem of automatically extracting web data records that contain user-generated content (UGC). In previous work, web data records are usually assumed to be well-formed with a limited amount of UGC, and thus can be extracted by testing repetitive structure similarity. However, when a web data record includes a large portion of free-format UGC, the similarity test between records may fail, which in turn results in lower performance. In our work, we find that certain domain constraints (e.g., post-date) can be used to design better similarity measures capable of circumventing the influence of UGC. In addition, we also use anchor points provided by the domain constraints to improve the extraction process, which ends in an algorithm called MiBAT (Mining data records Based on Anchor Trees). We conduct extensive experiments on a dataset consisting of forum thread pages which are collected from 307 sites that cover 219 different forum software packages. Our approach achieves a precision of 98.9% and a recall of 97.3% with respect to post record extraction. On page level, it perfectly handles 91.7% of pages without extracting any wrong posts or missing any golden posts. We also apply our approach to comment extraction and achieve good results as well.
An efficient algorithm for mining time interval-based patterns in large database BIBAFull-Text 49-58
  Yi-Cheng Chen; Ji-Chiang Jiang; Wen-Chih Peng; Suh-Yin Lee
Most studies on sequential pattern mining are mainly focused on time point-based event data. Few research efforts have elaborated on mining patterns from time interval-based event data. However, in many real applications, event usually persists for an interval of time. Since the relationships among event time intervals are intrinsically complex, mining time interval-based patterns in large database is really a challenging problem. In this paper, a novel approach, named as incision strategy and a new representation, called coincidence representation are proposed to simplify the processing of complex relations among event intervals. Then, an efficient algorithm, CTMiner (Coincidence Temporal Miner) is developed to discover frequent time-interval based patterns. The algorithm also employs two pruning techniques to reduce the search space effectively. Furthermore, experimental results show that CTMiner is not only efficient and scalable but also outperforms state-of-the-art algorithms.

IR track: ranking and retrieval model

What can quantum theory bring to information retrieval BIBAFull-Text 59-68
  Benjamin Piwowarski; Ingo Frommholz; Mounia Lalmas; Keith van Rijsbergen
The probabilistic formalism of quantum physics is said to provide a sound basis for building a principled information retrieval framework. Such a framework can be based on the notion of information need vector spaces where events, such as document relevance or observed user interactions, correspond to subspaces. As in quantum theory, a probability distribution over these subspaces is defined through weighted sets of state vectors (density operators), and used to represent the current view of the retrieval system on the user information need. Tensor spaces can be used to capture different aspects of information needs. Our evaluation shows that the framework can lead to acceptable performance in an ad-hoc retrieval task. Going beyond this, we discuss the potential of the framework for three active challenges in information retrieval, namely, interaction, novelty and diversity.
Entity ranking using Wikipedia as a pivot BIBAFull-Text 69-78
  Rianne Kaptein; Pavel Serdyukov; Arjen De Vries; Jaap Kamps
In this paper we investigate the task of Entity Ranking on the Web. Searchers looking for entities are arguably better served by presenting a ranked list of entities directly, rather than a list of web pages with relevant but also potentially redundant information about these entities. Since entities are represented by their web homepages, a naive approach to entity ranking is to use standard text retrieval. Our experimental results clearly demonstrate that text retrieval is effective at finding relevant pages, but performs poorly at finding entities. Our proposal is to use Wikipedia as a pivot for finding entities on the Web, allowing us to reduce the hard web entity ranking problem to easier problem of Wikipedia entity ranking. Wikipedia allows us to properly identify entities and some of their characteristics, and Wikipedia's elaborate category structure allows us to get a handle on the entity's type.
   Our main findings are the following. Our first finding is that, in principle, the problem of web entity ranking can be reduced to Wikipedia entity ranking. We found that the majority of entity ranking topics in our test collections can be answered using Wikipedia, and that with high precision relevant web entities corresponding to the Wikipedia entities can be found using Wikipedia's 'external links'. Our second finding is that we can exploit the structure of Wikipedia to improve entity ranking effectiveness. Entity types are valuable retrieval cues in Wikipedia. Automatically assigned entity types are effective, and almost as good as manually assigned types. Our third finding is that web entity retrieval can be significantly improved by using Wikipedia as a pivot. Both Wikipedia's external links and the enriched Wikipedia entities with additional links to homepages are significantly better at finding primary web homepages than anchor text retrieval, which in turn significantly improved over standard text retrieval.
Ranking under temporal constraints BIBAFull-Text 79-88
  Lidan Wang; Donald Metzler; Jimmy Lin
This paper introduces the notion of temporally constrained ranked retrieval, which, given a query and a time constraint, produces the best possible ranked list within the specified time limit. Naturally, more time should translate into better results, but the ranking algorithm should always produce some results. This property is desirable from a number of perspectives: to cope with diverse users and information needs, as well as to better manage system load and variance in query execution times. We propose two temporally constrained ranking algorithms based on a class of probabilistic prediction models that can naturally incorporate efficiency constraints: one that makes independent feature selection decisions, and the other that makes joint feature selection decisions. Experiments on three different test collections show that both ranking algorithms are able to satisfy imposed time constraints, although the joint model outperforms the independent model in being able to deliver more effective results, especially under tight time constraints, due to its ability to capture feature dependencies.
Examining the information retrieval process from an inductive perspective BIBAFull-Text 89-98
  Ronan Cummins; Mounia Lalmas; Colm O'Riordan
Term-weighting functions derived from various models of retrieval aim to model human notions of relevance more accurately. However, there is a lack of analysis of the sources of evidence from which important features of these term weighting schemes originate. In general, features pertaining to these term-weighting schemes can be collected from (1) the document, (2) the entire collection and (3) the query. In this work, we perform an empirical analysis to determine the increase in effectiveness as information from these three different sources becomes more accurate.
   First, we determine the number of documents to be indexed to accurately estimate collection-wide features to obtain near optimal effectiveness for a range of a term-weighting functions. Similarly, we determine the amount of a document and query that must be sampled to achieve near-peak effectiveness. This analysis also allows us to determine the factors that contribute most to the performance of a term-weighting function (i.e. the document, the collection or the query).
   We use our framework to construct a new model of weighting where we discard the 'bag of words' model and aim to retrieve documents based on the initial physical representation of a document using some basic axioms of retrieval. We show that this is a good first step towards incorporating some more interesting features into a term-weighting function.
On identifying representative relevant documents BIBAFull-Text 99-108
  Fiana Raiber; Oren Kurland
Using relevance feedback can significantly improve the effectiveness of ad hoc (query-based) retrieval. However, retrieval performance can significantly vary with respect to the given set of relevant documents. Our goal is to establish a quantitative analysis of what makes a relevant document a good representative of the relevant-documents set regardless of the retrieval approach employed. That is, we would like to estimate the extent to which a relevant document can effectively help in finding (other) relevant documents using some relevance-feedback method employed over the corpus. We present various representativeness estimates; some of which treat documents independently and some utilize inter-document similarities. Empirical evaluation shows that relevant documents that are centrally located within the similarity space of the relevant-documents set tend to be good representatives. In addition, we show that there exist highly representative clusters of similar relevant documents, and devise methods for ranking clusters based on their presumed representativeness. Finally, we study the connection between representativeness and TREC's gradual relevance judgments.

DB track: indexes and query optimization

On the selectivity of multidimensional routing indices BIBAFull-Text 109-118
  Christos Doulkeridis; Akrivi Vlachou; Kjetil Nørvåg; Yannis Kotidis; Michalis Vazirgiannis
Recently, the problem of efficiently supporting advanced query operators, such as nearest neighbor or range queries, over multidimensional data in widely distributed environments has attracted much attention. In unstructured peer-to-peer (P2P) networks, peers store data in an autonomous manner, thus multidimensional routing indices (MRI) are required, in order to route user queries efficiently to only those peers that may contribute to the query result set. Focusing on a hybrid unstructured P2P network, in this paper, we analyze the parameters for building MRI of high selectivity. In the case where similar data are located at different parts of the network, MRI exhibit extremely poor performance, which renders them ineffective. We present algorithms that boost the query routing performance by detecting similar peers and reassigning these peers to other parts of the hybrid network in a distributed and scalable way. The resulting MRI are able to eagerly discard routing paths during query processing. We demonstrate the advantages of our approach experimentally and show that our framework enhances a state-of-the-art approach for similarity search in terms of reduced network traffic and number of contacted peers.
Path-hop: efficiently indexing large graphs for reachability queries BIBAFull-Text 119-128
  Jing Cai; Chung Keung Poon
Graph reachability is a fundamental research problem that finds its use in many applications such as geographic navigation, bioinformatics, web ontologies and XML databases, etc. Given two vertices, u and v, in a directed graph, a reachability query asks if there is a directed path from u to v. Over the last two decades, many indexing schemes have been proposed to support reachability queries on large graphs. Typically, those schemes based on chain or tree covers work well when the graph is sparse. For dense graphs, they still have fast query time but require large storage for their indices. In contrast, the 2-Hop cover and its variations/extensions produce compact indices even for dense graphs but have slower query time than those chain/tree covers. In this paper, we propose a new indexing scheme, called Path-Hop, which is even more space-efficient than those schemes based on 2-Hop cover and yet has query processing speed comparable to those chain/tree covers. We conduct extensive experiments to illustrate the effectiveness of our approach relative to other state-of-the-art methods.
On wavelet decomposition of uncertain time series data sets BIBAFull-Text 129-138
  Yuchen Zhao; Charu Aggarwal; Philip Yu
In this paper, we will explore the construction of wavelet decompositions of uncertain data. Uncertain representations of data sets require significantly more space, and it is therefore even more important to construct compressed representations for such cases. We will use a hierarchical optimization technique in order to construct the most effective partitioning for our wavelet representation. We explore two different schemes which optimize the uncertainty in the resulting representation. We will show that the incorporation of uncertainty into the design of the wavelet representations significantly improves the compression rate of the representation. We present experimental results illustrating the effectiveness of our approach.
Efficient set-correlation operator inside databases BIBAFull-Text 139-148
  Shaoxu Song; Lei Chen
Large scale of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, which are often stored as set records in (hidden) databases. Many interesting information retrieval tasks are correspondingly raised on the correlation query over these short text records, such as finding hot topics over news highlights and searching related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly provide support on set correlation query. Thus, in this paper, we address both the effectiveness and efficiency issues of set correlation query over set records in databases. First, we present a framework of set correlation query inside databases. To our best knowledge, only the Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Thereby, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies to set up correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of filtering threshold, we can improve the query efficiency with a little effectiveness loss. Finally, we conduct extensive experiments to show the effectiveness and efficiency of proposed correlation query and optimization strategies.
Online update of b-trees BIBAFull-Text 149-158
  Marina Barsky; Alex Thomo; Zoltan Toth; Calisto Zuzarte
Many scenarios impose a heavy update load on B-tree indexes in modern databases. A typical case is when B-trees are used for indexing all the keywords of a text field. For example upon the insertion of a new text record (e.g. a new document arrives), a barrage of new keywords has to be inserted into the index causing many random disk I/Os and interrupting the normal operation of the database. The common approach has been to collect the updates in a separate structure and then perform a batch update of the index. This update "freezes" the database. Many applications, however, require the immediate availability of the new updates without any interruption of the normal database operation. In this paper we present a novel online B-tree update method based on a new buffering data structure we introduce -- Dynamic Bucket Tree (DBT). The DBT-buffer serves as a differential index for new updates. The grouping of keys in DBT-buffer is based on the longest common prefixes (LCP) of their binary representations. The LCP is used as a measure of the locality of keys to be transferred to the main B-tree. Our online update system does not slow down concurrent user transactions or lead to degradation of search performance. Experiments confirm that our DBT buffer can be efficiently used for online updates of text fields. As such it represents an effective solution to the notorious problem of handling updates to an Inverted Index.

IR track: domain-specific and multimedia IR

Wisdom of the ages: toward delivering the children's web with the link-based agerank algorithm BIBAFull-Text 159-168
  Karl Gyllstrom; Marie-Francine Moens
Though children frequently use web search engines to learn, interact, and be entertained, modern web search engines are poorly suited to children's needs, requiring relatively complex querying and filtering of results in order to find pages oriented to young audiences. To address this limitation, we designed AgeRank, a link-based algorithm that ranks web pages according their appropriateness for young audiences. We show its effectiveness through a multipart evaluation that demonstrates AgeRank to be accurate in page-labeling, widely-spanning in page coverage, and with high potential to improve children's search. As a fast, scalable, and effective algorithm, AgeRank can be adopted by search engines seeking to more effectively address the needs of young users, or easily fitted to complementary machine-learning based classification approaches.
A cross-lingual framework for monolingual biomedical information retrieval BIBAFull-Text 169-178
  Dolf Trieschnigg; Djoerd Hiemstra; Franciska de Jong; Wessel Kraaij
An important challenge for biomedical information retrieval (IR) is dealing with the complex, inconsistent and ambiguous biomedical terminology. Frequently, a concept-based representation defined in terms of a domain-specific terminological resource is employed to deal with this challenge. In this paper, we approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. The approach allows for deployment of a rich set of techniques proposed and evaluated in traditional cross-lingual IR. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a CLIR framework for monolingual biomedical IR if basic translations models are combined.
Multi-modal multi-correlation person-centric news retrieval BIBAFull-Text 179-188
  Zechao Li; Jing Liu; Xiaobin Zhu; Hanqing Lu
In this paper, we propose a framework of multi-modal multi-correlation person-centric news retrieval, which integrates news event correlations, news entity correlations, and event-entity correlations simultaneously by exploring both text and image information. The proposed framework is confined to a person-name query and enables a more vivid and informative person-centric news retrieval by providing two views of result presentation, namely a query-oriented multi-correlation map and a ranking list of news items with necessary descriptions including news image, news title and summary, central entities and relevant news events. First, we pre-process news articles using natural language techniques, and initialize the three correlations by statistical analysis about events and entities in news articles and face images. Second, a Multi-correlation Probabilistic Matrix Factorization (MPMF) algorithm is proposed to complete and refine the three correlations. Different from traditional Probabilistic Matrix Factorization (PMF), the proposed MPFM additionally considers the event correlations and the entity correlations as well as the event-entity correlations during the factor analysis. Third, the result ranking and visualization are conducted to present search results relevant to a target news topic. Experimental results on a news dataset collected from multiple news websites demonstrate the attractive performance of the proposed solution for news retrieval.
Bringing order to your photos: event-driven classification of flickr images based on social knowledge BIBAFull-Text 189-198
  Claudiu S. Firan; Mihai Georgescu; Wolfgang Nejdl; Raluca Paiu
With the rapidly increasing popularity of Social Media sites, a lot of user generated content has been injected in the Web, thus resulting in a large amount of both multimedia items (music -- Last.fm, MySpace.com, pictures -- Flickr, Picasa, videos -- YouTube) and textual data (tags and other text-based documents). As a consequence, especially for multimedia content it has become more and more difficult to find exactly the objects that best match the users' information needs. The methods we propose in this paper try to alleviate this problem and we focus on the domain of pictures, in particular on a subset of Flickr data. Many of the photos posted by users on Flickr have been shot during events and our methods aim to allow browsing and organization of picture collections in a natural way, by events. The algorithms we introduce in this paper exploit the social information produced by users in form of tags, titles and photo descriptions, for classifying pictures into different event categories. The extensive automated experiments demonstrate that our approach is very effective and opens new possibilities for multimedia retrieval, in particular image search. Moreover, the direct comparison with previous event detection algorithms confirm once more the quality of our methods.

KM track: link and graph mining

Mining topic-level influence in heterogeneous networks BIBAFull-Text 199-208
  Lu Liu; Jie Tang; Jiawei Han; Meng Jiang; Shiqiang Yang
Influence is a complex and subtle force that governs the dynamics of social networks as well as the behaviors of involved users. Understanding influence can benefit various applications such as viral marketing, recommendation, and information retrieval. However, most existing works on social influence analysis have focused on verifying the existence of social influence. Few works systematically investigate how to mine the strength of direct and indirect influence between nodes in heterogeneous networks.
   To address the problem, we propose a generative graphical model which utilizes the heterogeneous link information and the textual content associated with each node in the network to mine topic-level direct influence. Based on the learned direct influence, a topic-level influence propagation and aggregation algorithm is proposed to derive the indirect influence between nodes. We further study how the discovered topic-level influence can help the prediction of user behaviors. We validate the approach on three different genres of data sets: Twitter, Digg, and citation networks. Qualitatively, our approach can discover interesting influence patterns in heterogeneous networks. Quantitatively, the learned topic-level influence can greatly improve the accuracy of user behavior prediction.
Mining interesting link formation rules in social networks BIBAFull-Text 209-218
  Cane Wing-ki Leung; Ee-Peng Lim; David Lo; Jianshu Weng
Link structures are important patterns one looks out for when modeling and analyzing social networks. In this paper, we propose the task of mining interesting Link Formation rules (LF-rules) containing link structures known as Link Formation patterns (LF-patterns). LF-patterns capture various dyadic and/or triadic structures among groups of nodes, while LF-rules capture the formation of a new link from a focal node to another node as a postcondition of existing connections between the two nodes. We devise a novel LF-rule mining algorithm, known as LFR-Miner, based on frequent subgraph mining for our task. In addition to using a support-confidence framework for measuring the frequency and significance of LF-rules, we introduce the notion of expected support to account for the extent to which LF-rules exist in a social network by chance. Specifically, only LF-rules with higher-than-expected support are considered interesting. We conduct empirical studies on two real-world social networks, namely Epinions and myGamma. We report interesting LF-rules mined from the two networks, and compare our findings with earlier findings in social network analysis.
SHRINK: a structural clustering algorithm for detecting hierarchical communities in networks BIBAFull-Text 219-228
  Jianbin Huang; Heli Sun; Jiawei Han; Hongbo Deng; Yizhou Sun; Yaguang Liu
Community detection is an important task for mining the structure and function of complex networks. Generally, there are several different kinds of nodes in a network which are cluster nodes densely connected within communities, as well as some special nodes like hubs bridging multiple communities and outliers marginally connected with a community. In addition, it has been shown that there is a hierarchical structure in complex networks with communities embedded within other communities. Therefore, a good algorithm is desirable to be able to not only detect hierarchical communities, but also identify hubs and outliers. In this paper, we propose a parameter-free hierarchical network clustering algorithm SHRINK by combining the advantages of density-based clustering and modularity optimization methods. Based on the structural connectivity information, the proposed algorithm can effectively reveal the embedded hierarchical community structure with multiresolution in large-scale weighted undirected networks, and identify hubs and outliers as well. Moreover, it overcomes the sensitive threshold problem of density-based clustering algorithms and the resolution limit possessed by other modularity-based methods. To illustrate our methodology, we conduct experiments with both real-world and synthetic datasets for community detection, and compare with many other baseline methods. Experimental results demonstrate that SHRINK achieves the best performance with consistent improvements.
Outcome aware ranking in interaction networks BIBAFull-Text 229-238
  Sampath Kameshwaran; Vinayaka Pandit; Sameep Mehta; Nukala Viswanadham; Kashyap Dixit
In this paper, we present a novel ranking technique that we developed in the context of an application that arose in a Service Delivery setting. We consider the problem of ranking agents of a service organization. The service agents typically need to interact with other service agents to accomplish the end goal of resolving customer requests. Their ranking needs to take into account two aspects: firstly, their importance in the network structure that arises as a result of their interactions, and secondly, the value generated by the interactions involving them. We highlight several other applications which have the common theme of ranking the participants of a value creation process based on the network structure of their interactions and the value generated by their interactions. We formally present the problem and describe the modeling technique which enables us to encode the value of interaction in the graph. Our ranking algorithm is based on extension of eigen value methods. We present experimental results on real-life, public domain datasets from the Internet Movie DataBase. This makes our experiments replicable and verifiable.
Expansion and search in networks BIBAFull-Text 239-248
  Arun S. Maiya; Tanya Y. Berger-Wolf
Borrowing from concepts in expander graphs, we study the expansion properties of real-world, complex networks (e.g. social networks, unstructured peer-to-peer or P2P networks) and the extent to which these properties can be exploited to understand and address the problem of decentralized search. We first produce samples that concisely capture the overall expansion properties of an entire network, which we collectively refer to as the expansion signature. Using these signatures, we find a correspondence between the magnitude of maximum expansion and the extent to which a network can be efficiently searched. We further find evidence that standard graph-theoretic measures, such as average path length, fail to fully explain the level of "searchability" or ease of information diffusion and dissemination in a network. Finally, we demonstrate that this high expansion can be leveraged to facilitate decentralized search in networks and show that an expansion-based search strategy outperforms typical search methods.

IR track: machine learning for IR (I)

Improved latent concept expansion using hierarchical Markov random fields BIBAFull-Text 249-258
  Hao Lang; Donald Metzler; Bin Wang; Jin-Tao Li
Most existing query expansion approaches for ad-hoc retrieval adopt overly simplistic textual representations that treat documents as bags of words and ignore inherent document structure. These simple representations often lead to incorrect independence assumptions in the proposed approaches and result in limited retrieval effectiveness. In this paper, we propose a novel query expansion technique that models the various types of dependencies that exist between original query terms and expansion terms within a robust, unified framework. The proposed model is called Hierarchical Markov random fields (HMRFs), based on Latent Concept Expansion (LCE). By exploiting implicit (or explicit) hierarchical structure within documents, HMRFs can incorporate hierarchical interactions which are important for modeling term dependencies in an efficient manner. Our rigorous experimental evaluation carried out using several TREC data sets shows that our proposed query expansion technique consistently and significantly outperforms the current state-of-the-art query expansion approaches, including relevance-based language models and LCE.
Term necessity prediction BIBAFull-Text 259-268
  Le Zhao; Jamie Callan
The probability that a term appears in relevant documents (P(t|R)) is a fundamental quantity in several probabilistic retrieval models, however it is difficult to estimate without relevance judgments or a relevance model. We call this value term necessity because it measures the percentage of relevant documents retrieved by the term -- how necessary a term's occurrence is to document relevance. Prior research typically either set this probability to a constant, or estimated it based on the term's inverse document frequency, neither of which was very effective.
   This paper identifies several factors that affect term necessity, for example, a term's topic centrality, synonymy and abstractness. It develops term- and query-dependent features for each factor that enable supervised learning of a predictive model of term necessity from training data. Experiments with two popular retrieval models and 6 standard datasets demonstrate that using predicted term necessity estimates as user term weights of the original query terms leads to significant improvements in retrieval accuracy.
Decomposing background topics from keywords by principal component pursuit BIBAFull-Text 269-278
  Kerui Min; Zhengdong Zhang; John Wright; Yi Ma
Low-dimensional topic models have been proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model for a document corpus as the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate to analyze the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose the low-dimensional and the sparse components of such corpus data. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) mode to validate the new model. We then show that for real document data analysis, the new tool significantly reduces the perplexity and improves retrieval performance compared to classical baselines.
Document update summarization using incremental hierarchical clustering BIBAFull-Text 279-288
  Dingding Wang; Tao Li
Document summarization has become a hot topic in recent years. However, most of existing summarization methods work on a batch of documents and do not consider that documents may arrive in a sequence and the corresponding summaries need to be updated in real time. In this paper, we propose a new summarization method based on an incremental hierarchical clustering framework to update summaries as soon as a new document arrives. Extensive experimental results demonstrate the effectiveness and efficiency of our proposed method.
How about utilizing ordinal information from the distribution of unlabeled data BIBAFull-Text 289-298
  Mingjie Qian; Bo Chen; Hongzhi Xu; Hongwei Qi
Problems of ordinal regression arise in many fields such as information retrieval, data mining and knowledge management. In this paper, we consider ordinal regression in a semi-supervised scenario, i.e., we try to utilize the ordinal information from the distribution of unlabeled data. Semi-supervised ordinal regression is more applicable than traditional supervised ordinal regression, because nowadays labeled data is expensive and time-consuming as it needs human labor, whereas a large amount of unlabeled data are far accessible with the development of internet technology. We construct a general semi-supervised ordinal regression framework to formulate this problem. Based on the framework, we then propose a semi-supervised ordinal regression method called Semi-supervised Ordinal SVM (SOSVM). Additionally, in order to make our proposed method more applicable to problems with large scaled labeled data, we put forward a kernel based dual coordinate descent algorithm to efficiently solve SOSVM. Both rigorous theoretical analysis and promising experimental evaluations on real world datasets show the great performance and remarkable efficiency of SOSVM.

DB track: mobile and distributed data management

Automatic schema merging using mapping constraints among incomplete sources BIBAFull-Text 299-308
  Xiang Li; Christoph Quix; David Kensche; Sandra Geisler
Schema merging is the process of consolidating multiple schemas into a unified view. The task becomes particularly challenging when the schemas are highly heterogeneous and autonomous. Classical data integration systems rely on a mediated schema created by human experts through an intensive design process.
   In this paper, we present a novel approach for merging multiple relational data sources related by a collection of mapping constraints in the form of P2P style tuple-generating dependencies (tgds). In the scenario of data integration, we opt for minimal mediated schemas that are complete regarding certain answers of conjunctive queries. Under Open World Assumption (OWA), we characterize the semantics of schema merging by properties of the output mapping system between the source schemas and the mediated schema. We propose a merging algorithm following a redundancy reduction paradigm and prove that the output satisfies the desired logical properties. Recognizing the fact that multiple plausible mediated schemas may co-exist, a variant of the a priori algorithm is employed to enumerate alternative mediated schemas. Output mappings in the form of data dependencies are generated to support the mediated schemas, which enables query processing. We have evaluated our merging approach over a collection of real world data sets, which demonstrate the applicability and effectiveness of our approach in practice.
Preserving location and absence privacy in geo-social networks BIBAFull-Text 309-318
  Dario Freni; Carmen Ruiz Vicente; Sergio Mascetti; Claudio Bettini; Christian S. Jensen
Online social networks often involve very large numbers of users who share very large volumes of content. This content is increasingly being tagged with geo-spatial and temporal coordinates that may then be used in services. For example, a service may retrieve photos taken in a certain region. The resulting geo-aware social networks (GeoSNs) pose privacy threats beyond those found in location-based services. Content published in a GeoSN is often associated with references to multiple users, without the publisher being aware of the privacy preferences of those users. Moreover, this content is often accessible to multiple users. This renders it difficult for GeoSN users to control which information about them is available and to whom it is available. This paper addresses two privacy threats that occur in GeoSNs: location privacy and absence privacy. The former concerns the availability of information about the presence of users in specific locations at given times, while the latter concerns the availability of information about the absence of an individual from specific locations during given periods of time. The challenge addressed is that of supporting privacy while still enabling useful services. We believe this is the first paper to formalize these two notions of privacy and to propose techniques for enforcing them. The techniques offer privacy guarantees, and the paper reports on empirical performance studies of the techniques.
Preference query evaluation over expensive attributes BIBAFull-Text 319-328
  Justin J. Levandoski; Mohamed F. Mokbel; Mohamed E. Khalefa
Most database systems allow query processing over attributes that are derived at query runtime (e.g., user-defined functions and remote data calls to web services), making them expensive to compute relative to relational data stored in a heap or index. In addition, core support for efficient preference query processing has become an important objective in database systems. This paper addresses an important problem at the intersection of these two query processing objectives: efficient preference query evaluation involving expensive attributes. We explore an efficient framework for processing skyline and multi-objective queries in a database when the data involves a mix of "cheap" and "expensive" attributes. Our solution involves a three-phase approach that evaluates a correct final preference answer while aiming to minimizing the number of expensive attributes computations. Unlike previous works for distributed preference algorithms that assume sorted access over each attribute, our framework assumes expensive attribute requests are stateless, i.e., know nothing previous requests. Thus, the proposed approach is more in line with realistic system architectures. Our framework is implemented inside the query processor of PostgreSQL, and evaluated over both synthetic and real data sets involving computation of expensive attributes over real web-service data (e.g., Microsoft MapPoint).
Energy-efficient top-k query processing in wireless sensor networks BIBAFull-Text 329-338
  Baichen Chen; Weifa Liang; Rui Zhou; Jeffrey Xu Yu
Technological advances have enabled the deployment of large-scale sensor networks for environmental monitoring and surveillance purposes. The large volume of data generated by sensors needs to be processed to respond to the users queries. However, efficient processing of queries in sensor networks poses great challenges due to the unique characteristics imposed on sensor networks including slow processing capability, limited storage, and energy-limited batteries, etc. Among various queries, top-k query is one of the fundamental operators in many applications of wireless sensor networks for phenomenon monitoring. In this paper we focus on evaluating top-k queries in an energy-efficient manner such that the network lifetime is maximized. To achieve that, we devise a scalable, filter-based localized evaluation algorithm for top-k query evaluation, which is able to filter out as many unlikely top-k results as possible within the network from transmission. We also conduct extensive experiments by simulations to evaluate the performance of the proposed algorithm on real datasets. The experimental results show that the proposed algorithm outperforms existing algorithms significantly in network lifetime prolongation.
StableBuffer: optimizing write performance for DBMS applications on flash devices BIBAFull-Text 339-348
  Yu Li; Jianliang Xu; Byron Choi; Haibo Hu
Flash devices have been widely used in embedded systems, laptop computers, and enterprise servers. However, the poor random writes have been an obstacle to running write-intensive DBMS applications on flash devices. In this paper, we exploit the recently discovered, efficient write patterns of flash devices to optimize the performance of DBMS applications. Specifically, motivated by a focused write pattern, we propose to write pages temporarily to a small, pre-allocated storage space on the flash device, called StableBuffer, instead of directly writing to their actual destinations. We then recognize and flush efficient write patterns of the buffer to achieve a better write performance. In contrast to prior log-based techniques, our StableBuffer solution does not require modifying the driver of flash devices and hence works well for commodity flash devices. We discuss the detailed design and implementation of the StableBuffer solution. Performance evaluation based on a TPC-C benchmark trace shows that StableBuffer improves the response time and throughput of write operations by a factor of 1.5-12, in comparison with a direct write-through strategy.

KM track: classification and clustering

Mr.KNN: soft relevance for multi-label classification BIBAFull-Text 349-358
  Xiaotong Lin; Xue-wen Chen
Multi-label classification refers to learning tasks with each instance belonging to one or more classes simultaneously. It arose from real-world applications such as information retrieval, text categorization and functional genomics. Currently, most of the multi-label learning methods use the strategy called binary relevance, which constructs a classifier for each unique label by grouping data into positives (examples with this label) and negatives (examples without this label). With binary relevance, an example with multiple labels is considered as a positive data for each label it belongs to. For some classes, this data point may behave like an outlier confusing classifiers, especially in the cases of well-separated classes. In this paper, we first introduce a new strategy called soft relevance, where each multi-label example is assigned a relevance score to the labels it belongs to. This soft relevance is then employed in a voting function used in a k nearest neighbor classifier. Furthermore, a voting-margin ratio is introduced to the k nearest neighbor classifier for better performance. We compare the proposed method to other multi-label learning methods over three multi-label datasets and demonstrate that the proposed method provides an effective way to multi-label learning.
Collaborative Dual-PLSA: mining distinction and commonality across multiple domains for text classification BIBAFull-Text 359-368
  Fuzhen Zhuang; Ping Luo; Zhiyong Shen; Qing He; Yuhong Xiong; Zhongzhi Shi; Hui Xiong
The distribution difference among multiple data domains has been considered for the cross-domain text classification problem. In this study, we show two new observations along this line. First, the data distribution difference may come from the fact that different domains use different key words to express the same concept. Second, the association between this conceptual feature and the document class may be stable across domains. These two issues are actually the distinction and commonality across data domains.
   Inspired by the above observations, we propose a generative statistical model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both the domain distinction and commonality among multiple domains. Different from Probabilistic Latent Semantic Analysis (PLSA) with only one latent variable, the proposed model has two latent factors y and z, corresponding to word concept and document class respectively. The shared commonality intertwines with the distinctions over multiple domains, and is also used as the bridge for knowledge transformation. We exploit an Expectation Maximization (EM) algorithm to learn this model, and also propose its distributed version to handle the situation where the data domains are geographically separated from each other. Finally, we conduct extensive experiments over hundreds of classification tasks with multiple source domains and multiple target domains to validate the superiority of the proposed CD-PLSA model over existing state-of-the-art methods of supervised and transfer learning. In particular, we show that CD-PLSA is more tolerant of distribution differences.
Inferring gender of movie reviewers: exploiting writing style, content and metadata BIBAFull-Text 369-378
  Jahna Otterbacher
Despite differences in the way that men and women experience goods and communicate their perspectives, online review communities typically do not provide participants' gender. We propose to infer author gender, given a set of reviews of a particular item, and experiment on reviews posted at the Internet Movie Database (IMDb). Using logistic regression, we explore the contribution of three types of information: 1) style, 2) content, and 3) metadata (e.g. review age, social feedback). Our results concur with previous research, in that there are salient differences in writing style and content between reviews authored by men versus women. However, in comparison to literary or scientific texts, to which classification tasks are often applied, reviews are brief and occur within the context of an ongoing discourse. Therefore, to compensative for the brevity of reviews, content and stylistic features can be augmented with metadata. We find in particular that the perceived utility of a review is an important correlate of gender. The model incorporating all features has a classification accuracy of 73.7% and is not as sensitive to review length as are those based only on stylistic or content features.
A robust semi-supervised classification method for transfer learning BIBAFull-Text 379-388
  Akinori Fujino; Naonori Ueda; Masaaki Nagata
The transfer learning problem of designing good classifiers with a high generalization ability by using labeled samples whose distribution is different from that of test samples is an important and challenging research issue in the fields of machine learning and data mining. This paper focuses on designing a semi-supervised classifier trained by using unlabeled samples drawn by the same distribution as test samples, and presents a semi-supervised classification method to deal with the transfer learning problem, based on a hybrid discriminative and generative model. Although JESS-CM is one of the most successful semi-supervised classifier design frameworks and has achieved the best published results in NLP tasks, it has an overfitting problem in transfer learning settings that we consider in this paper. We expect the overfitting problem to be mitigated with the proposed method, which utilizes both labeled and unlabeled samples for the discriminative training of classifiers. We also present a refined objective that formalizes the training algorithm and classifier form. Our experimental results for text classification using three typical benchmark test collections confirmed that the proposed method outperformed the JESS-CM framework with most transfer learning settings.
Multi-view clustering with constraint propagation for learning with an incomplete mapping between views BIBAFull-Text 389-398
  Eric Eaton; Marie desJardins; Sara Jacob
Multi-view learning algorithms typically assume a complete bipartite mapping between the different views in order to exchange information during the learning process. However, many applications provide only a partial mapping between the views, creating a challenge for current methods. To address this problem, we propose a multi-view algorithm based on constrained clustering that can operate with an incomplete mapping. Given a set of pairwise constraints in each view, our approach propagates these constraints using a local similarity measure to those instances that can be mapped to the other views, allowing the propagated constraints to be transferred across views via the partial mapping. It uses co-EM to iteratively estimate the propagation within each view based on the current clustering model, transfer the constraints across views, and update the clustering model, thereby learning a unified model for all views. We show that this approach significantly improves clustering performance over several other methods for transferring constraints and allows multi-view clustering to be reliably applied when given a limited mapping between the views.

KM track: large-scale statistical techniques

Pricing guaranteed contracts in online display advertising BIBAFull-Text 399-408
  Vijay Bharadwaj; Wenjing Ma; Michael Schwarz; Jayavel Shanmugasundaram; Erik Vee; Jack Xie; Jian Yang
We consider the problem of pricing guaranteed contracts in online display advertising. This problem has two key characteristics that when taken together distinguish it from related offline and online pricing problems: (1) the guaranteed contracts are sold months in advance, and at various points in time, and (2) the inventory that is sold to guaranteed contracts -- user visits -- is very high-dimensional, having hundreds of possible attributes, and advertisers can potentially buy any of the very large number (many trillions) of combinations of these attributes. Consequently, traditional pricing methods such as real-time or combinatorial auctions, or optimization-based pricing based on self- and cross-elasticities are not directly applicable to this problem. We hence propose a new pricing method, whereby the price of a guaranteed contract is computed based on the prices of the individual user visits that the contract is expected to get. The price of each individual user visit is in turn computed using historical sales prices that are negotiated between a sales person and an advertiser, and we propose two different variants in this context. Our evaluation using real guaranteed contracts shows that the proposed pricing method is accurate in the sense that it can effectively predict the prices of other (out-of-sample) historical contracts.
Maximum normalized spacing for efficient visual clustering BIBAFull-Text 409-418
  Zhi-Gang Fan; Yadong Wu; Bo Wu
In this paper, for efficient clustering of visual image data that have arbitrary mixture distributions, we propose a simple distance metric learning method called Maximum Normalized Spacing (MNS) which is a generalized principle based on Maximum Spacing [12] and Minimum Spanning Tree (MST). The proposed Normalized Spacing (NS) can be viewed as a kind of adaptive distance metric for contextual dissimilarity measure which takes into account the local distribution of the data vectors. Image clustering is a difficult task because there are multiple nonlinear manifolds embedded in the data space. Many of the existing clustering methods often fail to learn the whole structure of the multiple manifolds and they are usually not very effective. Combining both the internal and external statistics of clusters to capture the density structure of manifolds, MNS is capable of efficient and effective solving the clustering problem for the complex multi-manifold datasets in arbitrary metric spaces. We apply this MNS method into the practical problem of multi-view image clustering and obtain good results which are helpful for image browsing systems. Using the COIL-20 [19] and COIL-100 [18] multi-view image databases, our experimental results demonstrate the effectiveness of the proposed MNS clustering method and this clustering method is more efficient than the traditional clustering methods.
Multilevel manifold learning with application to spectral clustering BIBAFull-Text 419-428
  Haw-ren Fang; Sophia Sakellaridi; Yousef Saad
In the past decade, a number of nonlinear dimensionality reduction methods using an affinity graph have been developed for manifold learning. This paper explores a multilevel framework with the goal of reducing the cost of unsupervised manifold learning and preserving the embedding quality at the same time. An application to spectral clustering is also presented. Experimental results indicate that our multilevel approach is an appealing alternative to standard techniques.
Accelerating probabilistic frequent itemset mining: a model-based approach BIBAFull-Text 429-438
  Liang Wang; Reynold Cheng; Sau Dan Lee; David Cheung
Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel method to capture the itemset mining process as a Poisson binomial distribution. This model-based approach extracts frequent itemsets with a high degree of accuracy, and supports large databases. We apply our techniques to improve the performance of the algorithms for: (1) finding itemsets whose frequentness probabilities are larger than some threshold; and (2) mining itemsets with the k highest frequentness probabilities. Our approaches support both tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate. Moreover, they are orders of magnitudes faster than previous approaches.

IR track: machine learning for IR (II)

Learning click models via probit Bayesian inference BIBAFull-Text 439-448
  Yuchen Zhang; Dong Wang; Gang Wang; Weizhu Chen; Zhihua Zhang; Botao Hu; Li Zhang
Recent advances in click models have positioned them as an effective approach to the improvement of interpreting click data, and some typical works include UBM, DBN, CCM, etc. After formulating the knowledge of user search behavior into a set of model assumptions, each click model developed an inference method to estimate its parameters. The inference method plays a critical role in terms of accuracy in interpreting clicks, and we observe that different inference methods for a click model can lead to significant accuracy differences. In this paper, we propose a novel Bayesian inference approach for click models. This approach regards click model under a unified framework, which has the following characteristics and advantages:
   1. This approach can be widely applied to existing click models, and we demonstrate how to infer DBN, CCM and UBM through it. This novel inference method is based on the Bayesian framework which is more flexible in characterizing the uncertainty in clicks and brings higher generalization abilities. As a result, it not only excels in the inference methods originally developed in click models, but also provides a valid comparison among different models;
   2. In contrast to the previous click models, which are exclusively designed for the position-bias, this approach is capable of capturing more sophisticated information such as BM25 and PageRank score into click models. This makes these models interpret click-through data more accurately. Experimental results illustrate that the click models integrated with more information can achieve significantly better performance on click perplexity and search ranking;
   3. Because of the incremental nature of the Bayesian learning, this approach is scalable to process large scale and constantly growing log data.
Document allocation policies for selective searching of distributed indexes BIBAFull-Text 449-458
  Anagha Kulkarni; Jamie Callan
Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. For organizations with modest computational resources the high query processing cost incurred in this exhaustive search setup can be a deterrent to working with large collections. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search accuracy. Random, source-based and topic-based document-to-shard allocation policies are studied in the context of selective search.
   A thorough study of the tradeoff between search cost and search accuracy in a sharded index environment is performed using three large TREC collections. The experimental results demonstrate that selective search using topic-based shards cuts the search cost to less than 1/5th of that of the exhaustive search without reducing search accuracy across all the three datasets. Stability analysis shows that 90% of the queries do as well or improve with selective search. An overlap-based evaluation with an additional 1000 queries for each dataset tests and confirms the conclusions drawn using the smaller TREC query sets.
Rank learning for factoid question answering with linguistic and semantic constraints BIBAFull-Text 459-468
  Matthew W. Bilotti; Jonathan Elsas; Jaime Carbonell; Eric Nyberg
This work presents a general rank-learning framework for passage ranking within Question Answering (QA) systems using linguistic and semantic features. The framework enables query-time checking of complex linguistic and semantic constraints over keywords. Constraints are composed of a mixture of keyword and named entity features, as well as features derived from semantic role labeling. The framework supports the checking of constraints of arbitrary length relating any number of keywords. We show that a trained ranking model using this rich feature set achieves greater than a 20% improvement in Mean Average Precision over baseline keyword retrieval models. We also show that constraints based on semantic role labeling features are particularly effective for passage retrieval; when they can be leveraged, an 40% improvement in MAP over the baseline can be realized.
Learning to rank relevant and novel documents through user feedback BIBAFull-Text 469-478
  Abhimanyu Lad; Yiming Yang
We consider the problem of learning to rank relevant and novel documents so as to directly maximize a performance metric called Expected Global Utility (EGU), which has several desirable properties: (i) It measures retrieval performance in terms of relevant as well as novel information, (ii) gives more importance to top ranks to reflect common browsing behavior of users, as opposed to existing objective functions based on set-coverage, (iii) accommodates different levels of tolerance towards redundancy, which is not taken into account by existing evaluation measures, and (iv) extends naturally to the evaluation of session-based retrieval comprising multiple ranked lists. Our ground truth is defined in terms of "information nuggets", which are obviously not known to the retrieval system when processing a new user query. Therefore, our approach uses observable query and document features (words and named entities) as surrogates for nuggets, whose weights are learned based on user feedback in an iterative search session. The ranked list is produced to maximize the weighted coverage of these surrogate nuggets. The optimization of such coverage-based metrics is known to be NP-hard. Therefore, we use a greedy algorithm and show that it guarantees good performance due to the submodularity of the objective function. Our experiments on Topic Detection and Tracking data show that the proposed approach represents an efficient and effective retrieval strategy for maximizing EGU, as compared to a purely-relevance based ranking approach that uses Indri, as well as a MMR-based approach for non-redundant ranking.

DB track: top-K and shortest path processing

Set cover algorithms for very large datasets BIBAFull-Text 479-488
  Graham Cormode; Howard Karloff; Anthony Wirth
The problem of Set Cover -- to find the smallest subcollection of sets that covers some universe -- is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used, and typically finds solutions that are close to optimal.
   However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident. The greedy algorithm must make many random accesses to disk, which are unpredictable and costly in comparison to linear scans. In order to scale Set Cover to large datasets, we provide a new algorithm which finds a solution that is provably close to that of greedy, but which is much more efficient to implement using modern disk technology. Our experiments show a ten-fold improvement in speed on moderately-sized datasets, and an even greater improvement on larger datasets.
The gist of everything new: personalized top-k processing over web 2.0 streams BIBAFull-Text 489-498
  Parisa Haghani; Sebastian Michel; Karl Aberer
Web 2.0 portals have made content generation easier than ever with millions of users contributing news stories in form of posts in weblogs or short textual snippets as in Twitter. Efficient and effective filtering solutions are key to allow users stay tuned to this ever-growing ocean of information, releasing only relevant trickles of personal interest. In classical information filtering systems, user interests are formulated using standard IR techniques and data from all available information sources is filtered based on a predefined absolute quality-based threshold. In contrast to this restrictive approach which may still overwhelm the user with the returned stream of data, we envision a system which continuously keeps the user updated with only the top-k relevant new information. Freshness of data is guaranteed by considering it valid for a particular time interval, controlled by a sliding window. Considering relevance as relative to the existing pool of new information creates a highly dynamic setting. We present POL-filter which together with our maintenance module constitute an efficient solution to this kind of problem. We show by comprehensive performance evaluations using real world data, obtained from a weblog crawl, that our approach brings performance gains compared to state-of-the-art.
Fast and accurate estimation of shortest paths in large graphs BIBAFull-Text 499-508
  Andrey Gubichev; Srikanta Bedathur; Stephan Seufert; Gerhard Weikum
Computing shortest paths between two given nodes is a fundamental operation over graphs, but known to be nontrivial over large disk-resident instances of graph data. While a number of techniques exist for answering reachability queries and approximating node distances efficiently, determining actual shortest paths (i.e. the sequence of nodes involved) is often neglected. However, in applications arising in massive online social networks, biological networks, and knowledge graphs it is often essential to find out many, if not all, shortest paths between two given nodes.
   In this paper, we address this problem and present a scalable sketch-based index structure that not only supports estimation of node distances, but also computes corresponding shortest paths themselves. Generating the actual path information allows for further improvements to the estimation accuracy of distances (and paths), leading to near-exact shortest-path approximations in real world graphs.
   We evaluate our techniques -- implemented within a fully functional RDF graph database system -- over large real-world social and biological networks of sizes ranging from tens of thousand to millions of nodes and edges. Experiments on several datasets show that we can achieve query response times providing several orders of magnitude speedup over traditional path computations while keeping the estimation errors between 0% and 1% on average.
Fast top-k simple shortest paths discovery in graphs BIBAFull-Text 509-518
  Jun Gao; Huida Qiu; Xiao Jiang; Tengjiao Wang; Dongqing Yang
With the wide applications of large scale graph data such as social networks, the problem of finding the top-k shortest paths attracts increasing attention. This paper focuses on the discovery of the top-k simple shortest paths (paths without loops). The well known algorithm for this problem is due to Yen, and the provided worst-case bound O(kn(m + nlogn)), which comes from O(n) times single-source shortest path discovery for each of k shortest paths, remains unbeaten for 30 years, where n is the number of nodes and m is the number of edges. In this paper, we observe that there are shared sub-paths among O(kn) single-source shortest paths. The basic idea behind our method is to pre-compute the shortest paths to the target node, and utilize them to reduce the discovery cost at running time. Specifically, we transform the original graph by encoding the pre-computed paths, and prove that the shortest path discovered over the transformed graph is equivalent to that in the original graph. Most importantly, the path discovery over the transformed graph can be terminated much earlier than before. In addition, two optimization strategies are presented. One is to reduce the total iteration times for shortest path discovery, and the other is to prune the search space in each iteration with an adaptively-determined threshold. Although the worst-case complexity cannot be lowered, our method is proven to be much more efficient in a general case. The final extensive experimental results (on both real and synthetic graphs) also show that our method offers a significant performance improvement over the existing ones.

IR track: IR evaluation

Factors affecting click-through behavior in aggregated search interfaces BIBAFull-Text 519-528
  Shanu Sushmita; Hideo Joho; Mounia Lalmas; Robert Villa
An aggregated search interface is designed to integrate search results from different sources (web, image, video, blog, etc) into a single result page. This paper presents two user studies investigating factors affecting users click-through behavior on aggregated search interfaces. We tested two aggregated search interfaces: one where results from the different sources are blended into a single list (called blended), and another, where results from each source are presented in a separate panel (called non-blended). A total of 1,296 search sessions performed by 48 participants were analysed in our study. Our results suggest that 1) the position of search results is significant only in the blended and not in the non-blended design; 2) participants' click-through behavior on videos is different from other sources; and finally 3) capturing a task's orientation towards particular sources is an important factor for further investigation and research.
Web search solved?: all result rankings the same? BIBAFull-Text 529-538
  Hugo Zaragoza; B. Barla Cambazoglu; Ricardo Baeza-Yates
The objective of this work is to derive quantitative statements about what fraction of web search queries issued to the state-of-the-art commercial search engines lead to excellent results or, on the contrary, poor results. To be able to make such statements in an automated way, we propose a new measure that is based on lower and upper bound analysis over the standard relevance measures. Moreover, we extend this measure to carry out comparisons between competing search engines by introducing the concept of disruptive sets, which we use to estimate the degree to which a search engine solves queries that are not solved by its competitors. We report empirical results on a large editorial evaluation of the three largest search engines in the US market.
Assessor error in stratified evaluation BIBAFull-Text 539-548
  William Webber; Douglas W. Oard; Falk Scholer; Bruce Hedin
Several important information retrieval tasks, including those in medicine, law, and patent review, have an authoritative standard of relevance, and are concerned about retrieval completeness. During the evaluation of retrieval effectiveness in these domains, assessors make errors in applying the standard of relevance, and the impact of these errors, particularly on estimates of recall, is of crucial concern. Using data from the interactive task of the TREC Legal Track, this paper investigates how reliably the yield of relevant documents can be estimated from sampled assessments in the presence of assessor error, particularly where sampling is stratified based upon the results of participating retrieval systems. We show that assessor error is in general a greater source of inaccuracy than sampling error. A process of appeal and adjudication, such as used in the interactive task, is found to be effective at locating many assessment errors; but the process is expensive if complete, and biased if incomplete. An unbiased double-sampling method for resolving assessment error is proposed, and shown on representative data to be more efficient and accurate than appeal-based adjudication.
CiteData: a new multi-faceted dataset for evaluating personalized search performance BIBAFull-Text 549-558
  Abhay Harpale; Yiming Yang; Siddharth Gopal; Daqing He; Zhen Yue
Personalized search systems have evolved to utilize heterogeneous features including document hyperlinks, category labels in various taxonomies and social tags in addition to free-text of the documents. Consequently, classifiers, PageRank algorithms and Collaborative Filtering methods are often used as intermediate steps in such personalized retrieval systems. Thorough comparative evaluation of such complex systems has been difficult due to the lack of appropriate publicly available datasets that provide such diverse feature sets. To remedy the situation, we have created CiteData, a new dataset for benchmark evaluations of personalized search performance, that will be made publicly accessible. CiteData is a collection of academic articles extracted from CiteULike and CiteSeer repositories, with rich feature sets such as authors, author-affiliations, topic labels, social tags and citation information. We further supplement it with personalized queries and relevance judgments which were obtained from volunteer users. This paper starts with a discussion of the design criteria and characteristics of the CiteData dataset in comparison with current benchmark datasets, followed by a set of task-oriented empirical evaluations of popular algorithms in statistical classification, collaborative filtering and link analysis as intermediate steps for personalized search. Our results show significant performance improvement of personalized approaches, over that of unpersonalized approaches. We also observe that a meta personalized search engine that leverages information from multiple sources of features performs better than algorithms that use only one of the constituent source of features.

KM track: information filtering and recommender systems (I)

Learning a user-thread alignment manifold for thread recommendation in online forum BIBAFull-Text 559-568
  Jun Zhao; Jiajun Bu; Chun Chen; Ziyu Guan; Can Wang; Cheng Zhang
People are more and more willing to participate in online forums to share their knowledge and experience. However, it may not be easy for them to find their desired threads in online forums due to the information overload problem. Traditional recommendation approaches can not be directly applied to online forums due to two reasons. First, unlike traditional movie or music recommendation problem, there is no rating information in online forums. Second, the sparsity problem is more severe since the users may only read threads but take no actions. To address these limitations, in this paper we propose to make use of the reply relationships among users, as well as thread contents. A learning algorithm is introduced to infer a user-thread alignment manifold in which both users and thread contents can be well represented. Thus, the relatedness between users and threads can be measured on this alignment manifold, and the closest threads which can best meet the corresponding user's information needs are recommended. Experiments on a dataset crawled from digg.com have demonstrated the superiority of our algorithm over traditional recommendation algorithms.
FacetCube: a framework of incorporating prior knowledge into non-negative tensor factorization BIBAFull-Text 569-578
  Yun Chi; Shenghuo Zhu
Non-negative tensor factorization (NTF) is a relatively new technique that has been successfully used to extract significant characteristics from polyadic data, such as data in social networks. Because these polyadic data have multiple dimensions (e.g., the author, content, and timestamp of a blog post), NTF fits in naturally and extracts data characteristics jointly from different data dimensions. In the standard NTF, all information comes from the observed data and end users have no control over the outcomes. However, in many applications very often the end users have certain prior knowledge, such as the demographic information about individuals in a social network or a pre-constructed ontology on the contents, and therefore prefer the extracted data characteristics being consistent with such prior knowledge. To allow users' prior knowledge to be naturally incorporated into NTF, in this paper we present a novel framework -- FacetCube -- that extends the standard non-negative tensor factorization. The new framework allows the end users to control the factorization outputs at three different levels for each of the data dimensions. The proposed framework is intuitively appealing in that it has a close connection to the probabilistic generative models. In addition to introducing the framework, we provide an iterative algorithm for computing the optimal solution to the framework. We also develop an efficient implementation of the algorithm that consists of a series of techniques to make our framework scalable to large data sets. Extensive experimental studies on a paper citation data set and a blog data set demonstrate that our new framework is able to effectively incorporate users' prior knowledge, improves performance over the standard NTF on the task of personalized recommendation, and is scalable to large data sets from real-life applications.
Travel route recommendation using geotags in photo sharing sites BIBAFull-Text 579-588
  Takeshi Kurashima; Tomoharu Iwata; Go Irie; Ko Fujimura
The ability to create geotagged photos enables people to share their personal experiences as tourists at specific locations and times. Assuming that the collection of each photographer's geotagged photos is a sequence of visited locations, photo-sharing sites are important sources for gathering the location histories of tourists. By following their location sequences, we can find representative and diverse travel routes that link key landmarks. In this paper, we propose a travel route recommendation method that makes use of the photographers' histories as held by Flickr. Recommendations are performed by our photographer behavior model, which estimates the probability of a photographer visiting a landmark. We incorporate user preference and present location information into the probabilistic behavior model by combining topic models and Markov models. We demonstrate the effectiveness of the proposed method using a real-life dataset holding information from 71,718 photographers taken in the United States in terms of the prediction accuracy of travel behavior.
Boosting social network connectivity with link revival BIBAFull-Text 589-598
  Yuan Tian; Qi He; Qiankun Zhao; Xingjie Liu; Wang-chien Lee
Online social networking platforms have become a popular channel of communications among people. However, most people can only keep in touch with a limited number of friends. This phenomenon results in a low-connectivity social network in terms of communications, which is inefficient for information propagation and social engagement. In this paper, we introduce a new recommendation service, called link revival, that suggests users to re-connect with their old friends, such that the resulted connection will improve the social network connectivity. To achieve high connectivity improvement under the dynamic social network evolvement, we propose a graph prediction-based recommendation strategy, which selects proper candidates based on the prediction of their future behaviors. We then develop an effective model that exploits non-homogeneous Poisson process and second-order self-similarity in prediction. Through comprehensive experimental studies on two real datasets (Phone Call Network and Facebook Wall-posts), we demonstrate that our proposed approach can significantly increase the social network connectivity, and that the approach outperforms other baseline solutions. The results also show that our solution is more suitable for online social networks like Facebook, partially due to the stronger long range dependency and lower communication costs in the interactions.
Power in unity: forming teams in large-scale community systems BIBAFull-Text 599-608
  Aris Anagnostopoulos; Luca Becchetti; Carlos Castillo; Aristides Gionis; Stefano Leonardi
The internet has enabled the collaboration of groups at a scale that was unseen before. A key problem for large collaboration groups is to be able to allocate tasks effectively. An effective task assignment method should consider both how fit teams are for each job as well as how fair the assignment is to team members, in terms that no one should be overloaded or unfairly singled out. The assignment has to be done automatically or semi-automatically given that it is difficult and time-consuming to keep track of the skills and the workload of each person. Obviously the method to do this assignment must also be computationally efficient.
   In this paper we present a general framework for task assignment problems. We provide a formal treatment on how to represent teams and tasks. We propose alternative functions for measuring the fitness of a team performing a task and we discuss desirable properties of those functions. Then we focus on one class of task-assignment problems, we characterize the complexity of the problem, and we provide algorithms with provable approximation guarantees, as well as lower bounds. We also present experimental results that show that our methods are useful in practice in several application scenarios.

IR track: social networks and text mining

Who should I cite: learning literature search models from citation behavior BIBAFull-Text 609-618
  Steven Bethard; Dan Jurafsky
Scientists depend on literature search to find prior work that is relevant to their research ideas. We introduce a retrieval model for literature search that incorporates a wide variety of factors important to researchers, and learns the weights of each of these factors by observing citation patterns. We introduce features like topical similarity and author behavioral patterns, and combine these with features from related work like citation count and recency of publication. We present an iterative process for learning weights for these features that alternates between retrieving articles with the current retrieval model, and updating model weights by training a supervised classifier on these articles. We propose a new task for evaluating the resulting retrieval models, where the retrieval system takes only an abstract as its input and must produce as output the list of references at the end of the abstract's article. We evaluate our model on a collection of journal, conference and workshop articles from the ACL Anthology Reference Corpus. Our model achieves a mean average precision of 28.7, a 12.8 point improvement over a term similarity baseline, and a significant improvement both over models using only features from related work and over models without our iterative learning.
A structured approach to query recommendation with social annotation data BIBAFull-Text 619-628
  Jiafeng Guo; Xueqi Cheng; Gu Xu; Huawei Shen
Query recommendation has been recognized as an important mean to help users search and also improve the usability of search engines. Existing approaches mainly focus on helping users refine their search queries and the recommendations typically stick to users' search intent, named search interests in this paper. However, users may also have some vague or delitescent interests which they are unaware of until they are faced with one, named exploratory interests. These interests may be provoked within a search session when users read a web page from search results or even follow links on the page. By considering exploratory interests in query recommendation, we attract more user clicks on recommendations. This type of query recommendation has not been explicitly addressed in previous work. In this paper, we propose to recommend queries in a structured way for better satisfying both search and exploratory interests of users. Specifically, we construct a query relation graph from query logs and social annotation data which capture two types of interests respectively. Based on the query relation graph, we employ hitting time to rank possible recommendations, leverage a modularity based approach to group top recommendations into clusters, and label each cluster with social tags. Empirical experimental results indicate that our structured approach to query recommendation with social annotation data can better satisfy users' interests and significantly enhance users' click behavior on recommendations.
Using the past to score the present: extending term weighting models through revision history analysis BIBAFull-Text 629-638
  Ablimit Aji; Yu Wang; Eugene Agichtein; Evgeniy Gabrilovich
The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency -- a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
Language pyramid and multi-scale text analysis BIBAFull-Text 639-648
  Shuang-Hong Yang; Hongyuan Zha
The classical Bag-of-Word (BOW) model represents a document as a histogram of word occurrence, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probabilistic distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both semantic and spatial contents of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale imagic view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, among which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results illustrate that: for classification, LaP outperforms BOW by (up to) 4% on moderate-length texts (RCV1 text benchmark) and 15% on short texts (Yahoo! queries); and for retrieval, LaP gains 12% MAP improvement over uni-gram language models on the OHSUMED data set.
Latent interest-topic model: finding the causal relationships behind dyadic data BIBAFull-Text 649-658
  Noriaki Kawamae
This paper presents a hierarchical generative model that captures the latent relation of cause and effect underlying user behavioral-originated data such as papers, twitter and purchase history. Our proposal, the Latent Interest Topic model (LIT), introduces a latent variable into each document and each author layor in a coherent generative model. We call the former variable the document class, and the latter variable the author class, where these classes are indicator variables that allow the inclusion of different types of probability, and can be shared over documents with similar content and authors with similar interests, respectively. Significantly, unlike other works, LIT differentiates, respectively, document topics and user interests by using these classes. Consequently, LIT is superior to previous models in explaining the causal relationships behind the data by merging similar distributions; it also makes the computation process easier. Experiments on a research paper corpus show that the proposed model can well capture document and author classes, and reduce the dimensionality of documents to a low-dimensional author-document space, making it useful as a generative model.

Industry track: IR applications

PROSPECT: a system for screening candidates for recruitment BIBAFull-Text 659-668
  Amit Singh; Catherine Rose; Karthik Visweswariah; Vijil Chenthamarakshan; Nandakishore Kambhatla
Companies often receive thousands of resumes for each job posting and employ dedicated screeners to short list qualified applicants. In this paper, we present PROSPECT, a decision support tool to help these screeners shortlist resumes efficiently. Prospect mines resumes to extract salient aspects of candidate profiles like skills, experience in each skill, education details and past experience. Extracted information is presented in the form of facets to aid recruiters in the task of screening. We also employ Information Retrieval techniques to rank all applicants for a given job opening. In our experiments we show that extracted information improves our ranking by 30% there by making screening task simpler and more efficient.
Active caching for similarity queries based on shared-neighbor information BIBAFull-Text 669-678
  Michael E. Houle; Vincent Oria; Umar Qasim
Novel applications such as recommender systems, uncertain databases, and multimedia databases are designed to process similarity queries that produce ranked lists of objects as their results. Similarity queries typically result in disk access latency and incur a substantial computational cost. In this paper, we propose an 'active caching' technique for similarity queries that is capable of synthesizing query results from cached information even when the required result list is not explicitly stored in the cache. Our solution, the Cache Estimated Significance (CES) model, is based on shared-neighbor similarity measures, which assess the strength of the relationship between two objects as a function of the number of other objects in the common intersection of their neighborhoods. The proposed method is general in that it does not require that the features be drawn from a metric space, nor does it require that the partial orders induced by the similarity measure be monotonic. Experimental results on real data sets show a substantial cache hit rate when compared with traditional caching approaches.
Searching consumer image collections using web-based concept expansion BIBAFull-Text 679-688
  Mark D. Wood; Alexander Loui; Stacie Hibino
As consumers accumulate more and more personal imagery, searching for specific images has become increasingly difficult. Consumers typically provide little or no annotations, and automated classifiers and concept tagging tools are limited in their scope and vocabulary. This work addresses this sparsity of semantic information by leveraging domain-specific information provided by online photo-sharing communities. Such information enables improved search by allowing user-provided search terms to be expanded into a set of semantically related concepts, using relevant semantic relationships provided by millions of users. Our system first extracts metadata using a modest number of image and event-based semantic classifiers, as well as any meaningful file or folder names. When users pose text-based queries, our system retrieves images from their personal image collections by leveraging Flickr's tag dataset for concept expansion. This approach enables users to search their collections without having to manually annotate their pictures. We compare the retrieval performance of using a Flickr-based concept expander with the performance obtained without concept expansion and with using a WordNet-based concept expander. The results demonstrate that common sense knowledge gleaned from online photo sharing communities can enable meaningful image search on consumer image collections, searches that would be impossible using only the available image metadata.

DB track: information retrieval in databases

Index structures for efficiently searching natural language text BIBAFull-Text 689-698
  Pirooz Chubak; Davood Rafiei
Many existing indexes on text work at the document granularity and are not effective in answering the class of queries where the desired answer is only a term or a phrase. In this paper, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We then present Word Permuterm Index (WPI) which is an adaptation of the permuterm index for natural language text applications and show that this index supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental resultS comparing WPI to alternative methods on a wide range oF wild card queries show a few orders of magnitude performancE improvements for WPI while the memory usage is kept the same for all compared systems.
Efficient temporal keyword search over versioned text BIBAFull-Text 699-708
  Avishek Anand; Srikanta Bedathur; Klaus Berberich; Ralf Schenkel
Modern text analytics applications operate on large volumes of temporal text data such as Web archives, newspaper archives, blogs, wikis, and micro-blogs. In these settings, searching and mining needs to use constraints on the time dimension in addition to keyword constraints. A natural approach to address such queries is using an inverted index whose entries are enriched with valid-time intervals. It has been shown that these indexes have to be partitioned along time in order to achieve efficiency. However, when the temporal predicate corresponds to a long time range, requiring the processing of multiple partitions, naive query processing incurs high cost of reading of redundant entries across partitions.
   We present a framework for efficient approximate processing of keyword queries over a temporally partitioned inverted index which minimizes this overhead, thus speeding up query processing. By using a small synopsis for each partition we identify partitions that maximize the number of final non-redundant results, and schedule them for processing early on. Our approach aims to balance the estimated gains in the final result recall against the cost of index reading required. We present practical algorithms for the resulting optimization problem of index partition selection. Our experiments with three diverse, large-scale text archives reveal that our proposed approach can provide close to 80% result recall even when only about half the index is allowed to be read.
Result-size estimation for information-retrieval subqueries BIBAFull-Text 709-718
  Guido Sautter; Klemens Böhm; Andranik Khachatryan
Estimating the approximate result size of a query before its execution based on small summary statistics is important for query optimization in database systems and for other facets of query processing. This also holds for queries over text databases. Research on selectivity estimation for such queries has focused on Boolean retrieval, i.e., a document may be relevant for the query or not. But with the coalescence of database and information retrieval (IR) technology, selectivity estimation for other, more sophisticated relevance functions is gaining importance as well. These models generate a query-specific distribution of the documents over the [0,1]-interval. With document distributions, selectivity estimation means estimating how many documents are how similar to a given query. The problem is much more complex than selectivity estimation in the Boolean context: Beside document frequency, query results also depend on other characteristics such as term frequencies and document lengths. Selectivity estimation must take them into account as well. This paper proposes and evaluates a technique for estimating the result of retrieval queries with non-Boolean relevance functions. It estimates discretized document distributions over the range of the relevance function. Despite the complexity, compared to Boolean selectivity estimation, it requires little additional data, and the additional data can be stored in existing data structures with little extensions. Our evaluation demonstrates the effectiveness of our technique.
FACeTOR: cost-driven exploration of faceted query results BIBAFull-Text 719-728
  Abhijith Kashyap; Vagelis Hristidis; Michalis Petropoulos
Faceted navigation is being increasingly employed as an effective technique for exploring large query results on structured databases. This technique of mitigating information-overload leverages metadata of the query results to provide users with facet conditions that can be used to progressively refine the user's query and filter the query results. However, the number of facet conditions can be quite large, thereby increasing the burden on the user. We present the FACeTOR system that proposes a cost-based approach to faceted navigation. At each step of the navigation, the user is presented with a subset of all possible facet conditions that are selected such that the overall expected navigation cost is minimized and every result is guaranteed to be reachable by a facet condition. We prove that the problem of selecting the optimal facet conditions at each navigation step is NP-Hard, and subsequently present two intuitive heuristics employed by FACeTOR. Our user study at Amazon Mechanical Turk shows that FACeTOR reduces the user navigation time compared to the cutting edge commercial and academic faceted search algorithms. The user study also confirms the validity of our cost model. We also present the results of an extensive experimental evaluation on the performance of the proposed approach using two real datasets. FACeTOR is available at http://db.cse.buffalo.edu/facetor/.
A framework for evaluating database keyword search strategies BIBAFull-Text 729-738
  Joel Coffman; Alfred C. Weaver
With regard to keyword search systems for structured data, research during the past decade has largely focused on performance. Researchers have validated their work using ad hoc experiments that may not reflect real-world workloads. We illustrate the wide deviation in existing evaluations and present an evaluation framework designed to validate the next decade of research in this field. Our comparison of 9 state-of-the-art keyword search systems contradicts the retrieval effectiveness purported by existing evaluations and reinforces the need for standardized evaluation. Our results also suggest that there remains considerable room for improvement in this field. We found that many techniques cannot scale to even moderately-sized datasets that contain roughly a million tuples. Given that existing databases are considerably larger than this threshold, our results motivate the creation of new algorithms and indexing techniques that scale to meet both current and future workloads.

KM track: temporal, spatial and stream data mining

Network growth and the spectral evolution model BIBAFull-Text 739-748
  Jérôme Kunegis; Damien Fay; Christian Bauckhage
We introduce and study the spectral evolution model, which characterizes the growth of large networks in terms of the eigenvalue decomposition of their adjacency matrices: In large networks, changes over time result in a change of a graph's spectrum, leaving the eigenvectors unchanged. We validate this hypothesis for several large social, collaboration, authorship, rating, citation, communication and tagging networks, covering unipartite, bipartite, signed and unsigned graphs. Following these observations, we introduce a link prediction algorithm based on the extrapolation of a network's spectral evolution. This new link prediction method generalizes several common graph kernels that can be expressed as spectral transformations. In contrast to these graph kernels, the spectral extrapolation algorithm does not make assumptions about specific growth patterns beyond the spectral evolution model. We thus show that it performs particularly well for networks with irregular, but spectral, growth patterns.
Automatic detection of craters in planetary images: an embedded framework using feature selection and boosting BIBAFull-Text 749-758
  Wei Ding; Tomasz F. Stepinski; Lourenco Bandeira; Ricardo Vilalta; Youxi Wu; Zhenyu Lu; Tianyu Cao
Identifying impact craters on planetary surfaces is one fundamental task in planetary science. In this paper, we present an embedded framework on auto-detection of craters, using feature selection and boosting strategies. The paradigm aims at building a universal and practical crater detector. This methodology addresses three issues that such a tool must possess: (i) it utilizes mathematical morphology to efficiently identify the regions of an image that can potentially contain craters; only those regions, defined as crater candidates, are the subjects of further processing; (ii) it selects Haar-like image texture features in combination with boosting ensemble supervised learning algorithms to accurately classify candidates into craters and non-craters; (iii) it uses transfer learning, at a minimum additional cost, to enable maintaining an accurate auto-detection of craters on new images, having morphology different from what has been captured by the original training set. All three aforementioned components of the detection methodology are discussed, and the entire framework is evaluated on a large test image of 37,500 x 56,250$ m2 on Mars, showing heavily cratered Martian terrain characterized by nonuniform surface morphology. Our study demonstrates that this methodology provides a robust and practical tool for planetary science, in terms of both detection accuracy and efficiency.
You are where you tweet: a content-based approach to geo-locating Twitter users BIBAFull-Text 759-768
  Zhiyuan Cheng; James Caverlee; Kyumin Lee
We propose and evaluate a probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, even in the absence of any other geospatial cues. By augmenting the massive human-powered sensing capabilities of Twitter and related microblogging services with content-derived location information, this framework can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on. Three of the key features of the proposed approach are: (i) its reliance purely on tweet content, meaning no need for user IP information, private login information, or external knowledge bases; (ii) a classification component for automatically identifying words in tweets with a strong local geo-scope; and (iii) a lattice-based neighborhood smoothing model for refining a user's location estimate. The system estimates k possible locations for each user in descending order of confidence. On average we find that the location estimates converge quickly (needing just 100s of tweets), placing 51% of Twitter users within 100 miles of their actual location.
Partial drift detection using a rule induction framework BIBAFull-Text 769-778
  Damon Sotoudeh; Aijun An
The major challenge in mining data streams is the issue of concept drift, the tendency of the underlying data generation process to change over time. In this paper, we propose a general rule learning framework that can efficiently handle concept-drifting data streams and maintain a highly accurate classification model. The main idea is to focus on partial drifts by allowing individual rules to monitor the stream and detect if there is a drift in the regions they cover. A rule quality measure then decides whether the affected rules are inconsistent with the concept drift. The model is accordingly updated to only include rules that are consistent with the newly arrived concept. A dynamically maintained set of instances deemed relevant to the most recent concept is also kept at memory. Learning a new concept from a larger set of instances reduces the variance of data distribution and allows for a more accurate, stable classification model. Our experiments show that this approach not only handles the drift efficiently, but it also can provide higher classification accuracy compared to other competitive approaches on a variety of real and synthetic data sets.
A method for discovering components of human rituals from streams of sensor data BIBAFull-Text 779-788
  Athanasios Bamis; Jia Fang; Andreas Savvides
This paper describes an algorithm for determining if an event occurs persistently within an interval where the interval is periodic but the event is not. The goal of the algorithm is to identify events with this property and also determine the minimum interval in which they occur. This solution is geared towards discovering human routines by considering the triggering of simple sensors over a diverse set of spatial and temporal scales. After describing the problem and the proposed solution, in this paper we demonstrate using testbed data and simulations that this approach uncovers components of routines by identifying which events are parts of the same routine through their temporal properties.

IR track: filtering and recommendation

Two-tier similarity model for story link detection BIBAFull-Text 789-798
  Tadashi Nomoto
The paper presents a novel approach to story link detection, where the goal is to determine whether a pair of news stories are linked, i.e., talk about the same event. The present work marks a departure from the prior work in that we measure similarity at two distinct levels of textual organization, the document and its collection, and combine scores at both levels to determine how well stories are linked. Experiments on the TDT-5 corpus show that the present approach, which we call a 'two-tier similarity model,' comfortably beats conventional approaches such as Clarity enhanced KL divergence, while performing robustly across diverse languages.
Selected new training documents to update user profile BIBAFull-Text 799-808
  Abdulmohsen Algarni; Yuefeng Li; Yue Xu
Relevance Feedback (RF) has been proven very effective for improving retrieval accuracy. Adaptive information filtering (AIF) technology has benefited from the improvements achieved in all the tasks involved over the last decades. A difficult problem in AIF has been how to update the system with new feedback efficiently and effectively. In current feedback methods, the updating processes focus on updating system parameters. In this paper, we developed a new approach, the Adaptive Relevance Features Discovery (ARFD). It automatically updates the system's knowledge based on a sliding window over positive and negative feedback to solve a nonmonotonic problem efficiently. Some of the new training documents will be selected using the knowledge that the system currently obtained. Then, specific features will be extracted from selected training documents. Different methods have been used to merge and revise the weights of features in a vector space. The new model is designed for Relevance Features Discovery (RFD), a pattern mining based approach, which uses negative relevance feedback to improve the quality of extracted features from positive feedback. Learning algorithms are also proposed to implement this approach on Reuters Corpus Volume 1 and TREC topics. Experiments show that the proposed approach can work efficiently and achieves the encouragement performance.
Collaborative filtering in social tagging systems based on joint item-tag recommendations BIBAFull-Text 809-818
  Jing Peng; Daniel Dajun Zeng; Huimin Zhao; Fei-yue Wang
Tapping into the wisdom of the crowd, social tagging can be considered an alternative mechanism -- as opposed to Web search -- for organizing and discovering information on the Web. Effective tag-based recommendation of information items, such as Web resources, is a critical aspect of this social information discovery mechanism. A precise understanding of the information structure of social tagging systems lies at the core of an effective tag-based recommendation method. While most of the existing research either implicitly or explicitly assumes a simple tripartite graph structure for this purpose, we propose a comprehensive information structure to capture all types of co-occurrence information in the tagging data. Based on the proposed information structure, we further propose a unified user profiling scheme to make full use of all available information. Finally, supported by our proposed user profile, we propose a novel framework for collaborative filtering in social tagging systems. In our proposed framework, we first generate joint item-tag recommendations, with tags indicating topical interests of users in target items. These joint recommendations are then refined by the wisdom from the crowd and projected to the item space for final item recommendations. Evaluation using three real-world datasets shows that our proposed recommendation approach significantly outperformed state-of-the-art approaches.
Collaborative future event recommendation BIBAFull-Text 819-828
  Einat Minkov; Ben Charrow; Jonathan Ledlie; Seth Teller; Tommi Jaakkola
We demonstrate a method for collaborative ranking of future events. Previous work on recommender systems typically relies on feedback on a particular item, such as a movie, and generalizes this to other items or other people. In contrast, we examine a setting where no feedback exists on the particular item. Because direct feedback does not exist for events that have not taken place, we recommend them based on individuals' preferences for past events, combined collaboratively with other peoples' likes and dislikes. We examine the topic of unseen item recommendation through a user study of academic (scientific) talk recommendation, where we aim to correctly estimate a ranking function for each user, predicting which talks would be of most interest to them. Then by decomposing user parameters into shared and individual dimensions, we induce a similarity metric between users based on the degree to which they share these dimensions. We show that the collaborative ranking predictions of future events are more effective than pure content-based recommendation. Finally, to further reduce the need for explicit user feedback, we suggest an active learning approach for eliciting feedback and a method for incorporating available implicit user cues.
Hybrid tag recommendation for social annotation systems BIBAFull-Text 829-838
  Jonathan Gemmell; Thomas Schimoler; Bamshad Mobasher; Robin Burke
Social annotation systems allow users to annotate resources with personalized tags and to navigate large and complex information spaces without the need to rely on predefined hierarchies. These systems help users organize and share their own resources, as well as discover new ones annotated by other users. Tag recommenders in such systems assist users in finding appropriate tags for resources and help consolidate annotations across all users and resources. But the size and complexity of the data, as well as the inherent noise and inconsistencies in the underlying tag vocabularies, have made the design of effective tag recommenders a challenge. Recent efforts have demonstrated the advantages of integrative models that leverage all three dimensions of a social annotation system: users, resources and tags. Among these approaches are recommendation models based on matrix factorization. But, these models tend to lack scalability and often hide the underlying characteristics, or "information channels" of the data that affect recommendation effectiveness. In this paper we propose a weighted hybrid tag recommender that blends multiple recommendation components drawing separately on complementary dimensions, and evaluate it on six large real-world datasets. In addition, we attempt to quantify the strength of the information channels in these datasets and use these results to explain the performance of the hybrid. We find our approach is not only competitive with the state-of-the-art techniques in terms of accuracy, but also has the added benefits of being scalable to large real world applications, extensible to incorporate a wide range of recommendation techniques, easily updateable, and more scrutable than other leading methods.

Industry track: databases and OLAP

XML schema computations: schema compatibility testing and subschema extraction BIBAFull-Text 839-848
  Thomas Yau-tat Lee; David Wai-lok Cheung
In this paper, we propose new models and algorithms to perform practical computations on W3C XML Schemas, which are schema minimization, schema equivalence testing, subschema testing and subschema extraction. We have conducted experiments on an e-commerce standard XSD called xCBL to demonstrate the effectiveness of our algorithms. One experiment has refuted the claim that the xCBL 3.5 XSD is compatible with the xCBL 3.0 XSD. Another experiment has shown that the xCBL XSDs can be effectively trimmed into small subschemas for specific applications, which has significantly reduced schema processing time.
Visual cube and on-line analytical processing of images BIBAFull-Text 849-858
  Xin Jin; Jiawei Han; Liangliang Cao; Jiebo Luo; Bolin Ding; Cindy Xide Lin
On-Line Analytical Processing (OLAP) has shown great success in many industry applications, including sales, marketing, management, financial data analysis, etc. In this paper, we propose Visual Cube and multi-dimensional OLAP of image collections, such as web images indexed in search engines (e.g., Google and Bing), product images (e.g. Amazon) and photos shared on social networks (e.g., Facebook and Flickr). It provides online responses to user requests with summarized statistics of image information and handles rich semantics related to image visual features. A clustering structure measure is proposed to help users freely navigate and explore images. Efficient algorithms are developed to construct Visual Cube. In addition, we introduce the new issue of Cell Overlapping in data cube and present efficient solutions for Visual Cube computation and OLAP operations. Extensive experiments are conducted and the results show good performance of our algorithms.
Pattern discovery for large mixed-mode database BIBAFull-Text 859-868
  Andrew K. C. Wong; Bin Wu; Gene P. K. Wu; Keith C. C. Chan
In business and industry today, large databases with mixed data types (continuous and categorical) are very common. There are great needs to discover patterns from them for knowledge interpretation and understanding. In the past, for classification, this problem is solved as a discrete data problem by first discretizing the continuous data based on the class-attribute interdependence relationship. However, so far no proper solution exists when class information is unavailable. Hence, important pattern post-processing tasks such as pattern clustering and summarization cannot be applied to mixed-mode data. This paper presents a new method for solving the problem. It is based on two essential concepts. (1) Though class information is absent, yet for a correlated dataset, the attribute with the strongest interdependence with others in the group can be used to drive the discretization of the continuous data. (2) For a large database, correlated attribute groups must first be obtained by attribute clustering before (1) can be applied. Based on (1) and (2), pattern discovery methods are developed for mixed-mode data. Extensive experiments using synthetic and real world data were conducted to validate the usefulness and effectiveness of the proposed method.

KM track: data pre- and post-processing

Constructing classification features using minimal predictive patterns BIBAFull-Text 869-878
  Iyad Batal; Milos Hauskrecht
Choosing good features to represent objects can be crucial to the success of supervised machine learning methods. Recently, there has been a great interest in applying data mining techniques to construct new classification features. The rationale behind this approach is that patterns (feature-value combinations) could capture more underlying semantics than single features. Hence the inclusion of some patterns can improve the classification performance. Currently, most methods adopt a two-phases approach by generating all frequent patterns in the first phase and selecting the discriminative patterns in the second phase. However, this approach has limited success because it is usually very difficult to correctly identify important predictive patterns in a large set of highly correlated frequent patterns.
   In this paper, we introduce the minimal predictive patterns framework to directly mine a compact set of highly predictive patterns. The idea is to integrate pattern mining and feature selection in order to filter out non-informative and redundant patterns while being generated. We propose some pruning techniques to speed up the mining process. Our extensive experimental evaluation on many datasets demonstrates the advantage of our method by outperforming many well known classifiers.
RankSVR: can preference data help regression? BIBAFull-Text 879-888
  Hwanjo Yu; Sungchul Kim; Seunghoon Na
In some regression applications (e.g., an automatic movie scoring system), a large number of ranking data is available in addition to the original regression data. This paper studies whether and how the ranking data can improve the accuracy of regression task. In particular, this paper first proposes an extension of SVR (Support Vector Regression), RankSVR, which incorporates ranking constraints in the learning of regression function. Second, this paper proposes novel sampling methods for RankSVR, which selectively choose samples of ranking data for training of regression functions in order to maximize the performance of RankSVR. While it is relatively easier to acquire ranking data than regression data, incorporating all the ranking data in the learning of regression doest not always generate the best output. Moreover, adding too many ranking constraints into the regression problem substantially lengthens the training time. Our proposed sampling methods find the ranking samples that maximize the regression performance. Experimental results on synthetic and real data sets show that, when the ranking data is additionally available, RankSVR significantly performs better than SVR by utilizing ranking constraints in the learning of regression, and also show that our sampling methods improve the RankSVR performance better than the random sampling.
Estimating accuracy for text classification tasks on large unlabeled data BIBAFull-Text 889-898
  Snigdha Chaturvedi; Tanveer A. Faruquie; L. Venkata Subramaniam; Mukesh K. Mohania
Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.
A probabilistic topic-connection model for automatic image annotation BIBAFull-Text 899-908
  Xin Chen; Xiaohua Hu; Zhongna Zhou; Caimei Lu; Gail Rosen; Tingting He; E. K. Park
The explosive increase of image data on Internet has made it an important, yet very challenging task to index and automatically annotate image data. To achieve that end, sophisticated algorithms and models have been proposed to study the correlation between image content and corresponding text description. Despite the success of previous works, however, researchers are still facing two major difficulties that may undermine their effort of providing reliable and accurate annotations for images. The first difficulty is lacking of comprehensive benchmark image dataset with high quality text descriptions. The second difficulty is lacking of effective way to represent the image content and make it associate with the text descriptions. In our paper, we aim to deal with both problems. To deal with the first problem, we utilize Wikipedia as external knowledge source and enrich the ontology structure of ImageNet database with comprehensive and highly-reliable text descriptions from Wikipedia articles. To address the second problem, we develop a Probabilistic Topic-Connection (PTC) model to represent the connection between latent semantic topic in text description and latent patterns from image feature space. We compare the performance of our model with the currently popular Correspondence LDA (Corr-LDA) model under the same automatic image annotation scenario using cross-validation. Experimental results demonstrate that our model is able to well represent the connection between latent semantic topics and latent patterns in image feature space, thus facilitates knowledge organization and understanding of both image and text descriptions.
Orientation distance-based discriminative feature extraction for multi-class classification BIBAFull-Text 909-918
  Bo Liu; Yanshan Xiao; Longbing Cao; Philip S. Yu
Feature extraction is an effective step in data mining and machine learning. While many feature extraction methods have been proposed for clustering, classification and regression, very limited work has been done on multi-class classification problems. In fact, the accuracy of multi-class classification problems relies on well-extracted features, the modeling part aside. This paper proposes a new feature extraction method, namely extracting orientation distance-based discriminative (ODD) features, which is particularly designed for multi-class classification problems. The proposed method works in two steps. In the first step, we extend the Fisher Discriminant idea to determine more appropriate kernel function and map the input data with all classes into a feature space. In the second step, the ODD features are extracted based on the one-vs-all scheme to generate discriminative features between a pattern and each hyperplane. These newly extracted features are treated as the representative features and are further used in the subsequent classification procedure. Substantial experiments on both UCI and real-world datasets have been conducted to investigate the performance of ODD features based multi-class classification. The statistical results show that the classification accuracy based on ODD features outperforms that of the state-of-the-art feature extraction methods.

KM track: information filtering and recommender systems (II)

Evaluating, combining and generalizing recommendations with prerequisites BIBAFull-Text 919-928
  Aditya G. Parameswaran; Hector Garcia-Molina; Jeffrey D. Ullman
We consider the problem of recommending the best set of k items when there is an inherent ordering between items, expressed as a set of prerequisites (e.g., the movie 'Godfather I' is a prerequisite of 'Godfather II'). Since this general problem is computationally intractable, we develop 3 approximation algorithms to solve this problem for various prerequisite structures (e.g., chain graphs, AND graphs, AND-OR graphs). We derive worst-case bounds for these algorithms for these structures, and experimentally evaluate these algorithms on synthetic data. We also develop an algorithm to combine solutions in order to generate even better solutions, and compare the performance of this algorithm with the other three.
Automatically suggesting topics for augmenting text documents BIBAFull-Text 929-938
  Robert West; Doina Precup; Joelle Pineau
We present a method for automated topic suggestion. Given a plain-text input document, our algorithm produces a ranking of novel topics that could enrich the input document in a meaningful way. It can thus be used to assist human authors, who often fail to identify important topics relevant to the context of the documents they are writing. Our approach marries two algorithms originally designed for linking documents to Wikipedia articles, proposed by Milne and Witten [15] and West et al. [22]. While neither of them can suggest novel topics by itself, their combination does have this capability. The key step towards finding missing topics consists in generalizing from a large background corpus using principal component analysis. In a quantitative evaluation we conclude that our method achieves the precision of human editors when input documents are Wikipedia articles, and we complement this result with a qualitative analysis showing that the approach also works well on other types of input documents.
Detecting product review spammers using rating behaviors BIBAFull-Text 939-948
  Ee-Peng Lim; Viet-An Nguyen; Nitin Jindal; Bing Liu; Hady Wirawan Lauw
This paper aims to detect users generating spam reviews or review spammers. We identify several characteristic behaviors of review spammers and model these behaviors so as to detect the spammers. In particular, we seek to model the following behaviors. First, spammers may target specific products or product groups in order to maximize their impact. Second, they tend to deviate from the other reviewers in their ratings of products. We propose scoring methods to measure the degree of spam for each reviewer and apply them on an Amazon review dataset. We then select a subset of highly suspicious reviewers for further scrutiny by our user evaluators with the help of a web based spammer evaluation software specially developed for user evaluation experiments. Our results show that our proposed ranking and supervised methods are effective in discovering spammers and outperform other baseline method based on helpfulness votes alone. We finally show that the detected spammers have more significant impact on ratings compared with the unhelpful reviewers.
Classical music for rock fans?: novel recommendations for expanding user interests BIBAFull-Text 949-958
  Makoto Nakatsuji; Yasuhiro Fujiwara; Akimichi Tanaka; Toshio Uchiyama; Ko Fujimura; Toru Ishida
Most recommender algorithms produce types similar to those the active user has accessed before. This is because they measure user similarity only from the co-rating behaviors against items and compute recommendations by analyzing the items possessed by the users most similar to the active user. In this paper, we define item novelty as the smallest distance from the class the user accessed before to the class that includes target items over the taxonomy. Then, we try to accurately recommend highly novel items to the user. First, our method measures user similarity by employing items rated by users and a taxonomy of items. It can accurately identify many items that may suit the user. Second, it creates a graph whose nodes are users; weighted edges are set between users according to their similarity. It analyzes the user graph and extracts users that are related on the graph though the similarity between the active user and each of those users is not high. The users so extracted are likely to have highly novel items for the active user. An evaluation conducted on several datasets finds that our method accurately identifies items with higher novelty than previous methods.
Improving one-class collaborative filtering by incorporating rich user information BIBAFull-Text 959-968
  Yanen Li; Jia Hu; ChengXiang Zhai; Ye Chen
One-Class Collaborative Filtering (OCCF) is an emerging setup in collaborative filtering in which only positive examples or implicit feedback can be observed. Compared with the traditional collaborative filtering setting where the data has ratings, OCCF is more realistic in many scenarios when no ratings are available. In this paper, we propose to improve OCCF accuracy by exploiting the rich user information that is often naturally available in community-based interactive information systems, including a user's search query history, purchasing and browsing activities. We propose two ways to incorporate such user information into the OCCF models: one is to linearly combine scores from different sources and the other is to embed user information into collaborative filtering. Experimental results on a large-scale retail data set from a major e-commerce company show that the proposed methods are effective and can improve the performance of the One-Class Collaborative Filtering over baseline methods through leveraging rich user information.

IR track: user modeling and search personalization

Personalized search by tag-based user profile and resource profile in collaborative tagging systems BIBAFull-Text 969-978
  Yi Cai; Qing Li
With the increase of resource-sharing web sites such as YouTube1 and Flickr2, personalized search becomes more important and challenging, as users demand higher retrieval quality. To achieve this goal, personalized search needs to take users' personalized profiles and information needs into consideration. Collaborative tagging (also known as folksonomy [11]) systems allow users to annotate resources with their own tags, which provide a simple but powerful way for organizing, retrieving and sharing different types of social resources. In this paper, we examine the limitations of previous tag-based personalized search. To handle these limitations, we propose a new method to model user profiles and resource profiles in a collaborative tagging environment. A novel search method using such users' and resources' profiles is proposed to facilitate the desired personalization in resource search. We implement a prototype system named as FMRS. Experiments using FMRS data set and MovieLens data set show that our proposed method outperforms baseline methods.
A comparison of user and system query performance predictions BIBAFull-Text 979-988
  Claudia Hauff; Diane Kelly; Leif Azzopardi
Query performance prediction methods are usually applied to estimate the retrieval effectiveness of queries, where the evaluation is largely system sided. However, little work has been conducted to understand query performance prediction from the user's perspective. The question we consider is, whether the predictions of query performance that systems make are in line with the predictions that users make. To this aim, we compare the performance ratings users assign to queries with the performance scores estimated by a range of pre-retrieval and post-retrieval query performance predictors. Two studies are presented that explore the relationship between user ratings and system predictions on two levels: (i) the topic level, and, (ii) the query suggestions level. It is shown that when predicting the performance of query suggestions, user ratings were mostly uncorrelated with system predictions. At the topic level though, where a single query is judged for each information need, we observed moderate correlations between user ratings and a subset of system predictions. As query performance prediction methods are often based on intuitions of how users might rate queries, these findings suggest that such methods are not representative of how users actually rate query suggestions and topics. This motivates further research into understanding the rating process engaged by users, and developing models of query performance prediction in order to bridge the divide between systems and users.
The anatomy of a click: modeling user behavior on web information systems BIBAFull-Text 989-998
  Kunal Punera; Srujana Merugu
The ultimate goal of information retrieval science continues to be providing relevant information to users while placing minimal cognitive load on them. The retrieval and presentation of relevant information (say, search results) as well as any dynamic system behavior (e.g., search engine re-ranking) depends acutely on estimating user intent. Hence, it is critical to use all the available information about user behavior at any stage of a search-session to accurately infer the user intent. However, the simplistic interfaces provided by search engines in order to minimize the user cognitive effort, and intrinsic limits imposed by privacy concerns, latency requirements, and other web instrumentation challenges, result in only a subset of user actions that are predictive of the search intent being captured.
   In this paper, we present a dynamic Bayesian network (DBN) that models user interaction with general web information systems, taking into account both observed (clicks etc.) as well as hidden (result examinations etc.) user actions. Our model goes beyond the ranked list information access paradigm and gives a solution where arbitrary context information can be incorporated in a principled fashion. To account for heterogeneity in user behavior as well as information access tasks, we further propose a bi-clustering algorithm that partitions users and tasks, and learns separate models for each bicluster. We instantiate this general DBN model for a typical static search interface comprising of a single query box and a ranked list of search results using a set of seven common user actions and various predictive state attributes. Experimental results on real-world web search log data indicate that one can obtain superior predictive performance on various session properties (such as click positions and reformulations) compared to simpler instantiations of the DBN.
Exploring online social activities for adaptive search personalization BIBAFull-Text 999-1008
  Qihua Wang; Hongxia Jin
The web has largely become a very social environment and will continue to become even more so. People are not only enjoying their social visibility on the Web but also increasingly participating in various social activities delivered through the Web. In this paper, we propose to explore a user's public social activities, such as blogging and social bookmarking, to personalize Internet services. We believe that public social data provides a more acceptable way to derive user interests than more private data such as search histories and desktop data. We propose a framework that learns about users' preferences from their activities on a variety of online social systems. As an example, we illustrate how to apply the user interests derived by our system to personalize search results. Furthermore, our system is adaptive; it observes users' choices on search results and automatically adjusts the weights of different social systems during the information integration process, so as to refine its interest profile for each user. We have implemented our approach and performed experiments on real-world data collected from three large-scale online social systems. Over two hundred users from worldwide who are active on the three social systems have been tested. Our experimental results demonstrate the effectiveness of our personalized search approach. Our results also show that integrating information from multiple social systems usually leads to better personalized results than relying on the information from a single social system, and our adaptive approach further improves the performance of the personalization solution.
Predicting short-term interests using activity-based search context BIBAFull-Text 1009-1018
  Ryen W. White; Paul N. Bennett; Susan T. Dumais
A query considered in isolation offers limited information about a searcher's intent. Query context that considers pre-query activity (e.g., previous queries and page visits), can provide richer information about search intentions. In this paper, we describe a study in which we developed and evaluated user interest models for the current query, its context (from pre-query session activity), and their combination, which we refer to as intent. Using large-scale logs, we evaluate how accurately each model predicts the user's short-term interests under various experimental conditions. In our study we: (i) determine the extent of opportunity for using context to model intent; (ii) compare the utility of different sources of behavioral evidence (queries, search result clicks, and Web page visits) for building predictive interest models, and; (iii) investigate optimally combining the query and its context by learning a model that predicts the context weight for each query. Our findings demonstrate significant opportunity in leveraging contextual information, show that context and source influence predictive accuracy, and show that we can learn a near-optimal combination of the query and context for each query. The findings can inform the design of search systems that leverage contextual information to better understand, model, and serve searchers' information needs.

Industry track: web and social networks

Probabilistic first pass retrieval for search advertising: from theory to practice BIBAFull-Text 1019-1028
  Hema Raghavan; Rukmini Iyer
Information retrieval in search advertising, as in other ad-hoc retrieval tasks, aims to find the most appropriate ranking of the ad documents of a corpus for a given query. In addition to ranking the ad documents, we also need to filter or threshold irrelevant ads from participating in the auction to be displayed alongside search results. In this work, we describe our experience in implementing a successful ad retrieval system for a commercial search engine based on the Language Modeling (LM) framework for retrieval. The LM demonstrates significant performance improvements over the baseline vector space model (TF-IDF) system that was in production at the time. From a modeling perspective, we propose a novel approach to incorporate query segmentation and phrases in the LM framework, discuss impact of score normalization for relevance filtering, and present preliminary results of incorporating query expansions using query rewriting techniques. From an implementation perspective, we also discuss real-time latency constraints of a production search engine and how we overcome them by adapting the WAND algorithm to work with language models. In sum, our LM formulation is considerably better in terms of accuracy metrics such as Precision-Recall (10% improvement in AUC) and nDCG (8% improvement in nDCG@5) on editorial data and also demonstrates significant improvements in clicks in live user tests (0.787% improvement in Click Yield, with 8% coverage increase). Finally, we hope that this paper provides the reader with adequate insights into the challenges of building a system that serves millions of users every day.
Faceted search and browsing of audio content on spoken web BIBAFull-Text 1029-1038
  Mamadou Diao; Sougata Mukherjea; Nitendra Rajput; Kundan Srivastava
Spoken Web is a web of VoiceSites that can be accessed by a phone. The content in a VoiceSite is audio. Therefore Spoken Web provides an alternate to the World Wide Web (WWW) in developing regions where low Internet penetration and low literacy are barriers to accessing the conventional WWW. Searching of audio content in Spoken Web through an audio query-result interface presents two key challenges: indexing of audio content is not accurate, and the presentation of results in audio is sequential, and therefore cumbersome. In this paper, we apply the concepts of faceted search and browsing to the SpokenWeb search problem. We use the concepts of facets to index the meta-data associated with the audio content. We provide a mechanism to rank the facets based on the search results. We develop an interactive query interface that enables easy browsing of search results through the top ranked facets. To our knowledge, this is the first system to use the concepts of facets in audio search, and the first solution that provides an audio search for the rural population.
   We present quantitative results to illustrate the accuracy and effectiveness of the faceted search and qualitative results to highlight the usability of the interactive browsing system. The experiments have been conducted on more than 4000 audio documents collected from a live SpokenWeb VoiceSite and evaluations were carried out with 40 farmers who are the target users of the VoiceSite.
Predicting product adoption in large-scale social networks BIBAFull-Text 1039-1048
  Rushi Bhatt; Vineet Chaoji; Rajesh Parekh
Online social networks offer opportunities to analyze user behavior and social connectivity and leverage resulting insights for effective online advertising. We study the adoption of a paid product by members of a large and well-connected Instant Messenger (IM) network. This product is important to the business and poses unique challenges to advertising due to its low baseline adoption rate. We find that adoption by highly connected individuals is correlated with their social connections (friends) adopting after them. However, there is little evidence of social influence by these high degree individuals. Further, the spread of adoption remains mostly local to first-adopters and their immediate friends. We observe strong evidence of peer pressure wherein future adoption by an individual is more likely if the product has been widely adopted by the individual's friends. Social neighborhoods rich in adoptions also continue to add more new adoptions compared to those neighborhoods that are poor in adoption.
   Using these insights we build predictive models to identify individuals most suited for two types of marketing campaigns -- direct marketing where individuals with highest propensity for future adoption are targeted with suitable ads and social neighborhood marketing which involves messaging to members of the social network who are most effective in using the power of their network to convince their friends to adopt. We identify the most desirable features for predicting future adoption of the PC To Phone product which can in turn be leveraged to effectively promote its adoption. Offline analysis shows that building predictive models for direct marketing and social neighborhood marketing outperforms several widely accepted marketing heuristics. Further, these models are able to effectively combine user features and social features to predict adoption better than using either user features or social features in isolation.

IR track: query analysis and feedback

Reverted indexing for feedback and expansion BIBAFull-Text 1049-1058
  Jeremy Pickens; Matthew Cooper; Gene Golovchinsky
Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its relative frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify the matching documents for a query. In this paper, we turn the process around: instead of indexing documents, we index query result sets. First, queries are run through a chosen retrieval system. For each query, the resulting document IDs are treated as terms and the score or rank of the document is used as the frequency statistic. An index of documents retrieved by basis queries is created. We call this index a reverted index. With reverted indexes, standard retrieval algorithms can retrieve the matching queries (as results) for a set of documents (used as queries). These recovered queries can then be used to identify additional documents, or to aid the user in query formulation, selection, and feedback.
Improving verbose queries using subset distribution BIBAFull-Text 1059-1068
  Xiaobing Xue; Samuel Huston; W. Bruce Croft
Dealing with verbose (or long) queries poses a new challenge for information retrieval. Selecting a subset of the original query (a "sub-query") has been shown to be an effective method for improving these queries. In this paper, the distribution of sub-queries ("subset distribution") is formally modeled within a well-grounded framework. Specifically, sub-query selection is considered as a sequential labeling problem, where each query word in a verbose query is assigned a label of "keep" or "don't keep". A novel Conditional Random Field model is proposed to generate the distribution of sub-queries. This model captures the local and global dependencies between query words and directly optimizes the expected retrieval performance on a training set. The experiments, based on different retrieval models and performance measures, show that the proposed model can generate high-quality sub-query distributions and can significantly outperform state-of-the-art techniques.
A unified optimization framework for robust pseudo-relevance feedback algorithms BIBAFull-Text 1069-1078
  Joshua V. Dillon; Kevyn Collins-Thompson
We present a flexible new optimization framework for finding effective, reliable pseudo-relevance feedback models that unifies existing complementary approaches in a principled way. The result is an algorithmic approach that not only brings together different benefits of previous methods, such as parameter self-tuning and risk reduction from term dependency modeling, but also allows a rich new space of model search strategies to be investigated. We compare the effectiveness of a unified algorithm to existing methods by examining iterative performance and risk-reward tradeoffs. We also discuss extensions for generating new algorithms within our framework.
Ranking related entities: components and analyses BIBAFull-Text 1079-1088
  Marc Bron; Krisztian Balog; Maarten de Rijke
Related entity finding is the task of returning a ranked list of homepages of relevant entities of a specified type that need to engage in a given relationship with a given source entity. We propose a framework for addressing this task and perform a detailed analysis of four core components; co-occurrence models, type filtering, context modeling and homepage finding. Our initial focus is on recall. We analyze the performance of a model that only uses co-occurrence statistics. While this method identifies the potential set of related entities, it fails to rank them effectively. Two types of error emerge: (1) entities of the wrong type pollute the ranking and (2) while somehow associated to the source entity, some retrieved entities do not engage in the right relation with it. To address (1), we add type filtering based on category information available in Wikipedia. To correct for (2), we complement our related entity finding method with contextual information, represented as language models derived from documents in which source and target entities co-occur. To complete the pipeline, we find homepages of top ranked entities by combining a language modeling approach with heuristics based on Wikipedia's external links. Our method achieves very high recall scores on the end-to-end task, providing a solid starting point for expanding our focus to improve precision. Our framework can effectively incorporate additional heuristics and these extensions lead to state-of-the-art performance.

KM track: semantic techniques

Semantic tags generation and retrieval for online advertising BIBAFull-Text 1089-1098
  Roberto Mirizzi; Azzurra Ragone; Tommaso Di Noia; Eugenio Di Sciascio
One of the main problems in online advertising is to display ads which are relevant and appropriate w.r.t. what the user is looking for. Often search engines fail to reach this goal as they do not consider semantics attached to keywords. In this paper we propose a system that tackles the problem by two different angles: help (i) advertisers to create more efficient ads campaigns and (ii) ads providers to properly match ads content to keywords in search engines. We exploit semantic relations stored in the DBpedia dataset and use an hybrid ranking system to rank keywords and to expand queries formulated by the user. Inputs of our ranking system are (i) the DBpedia dataset; (ii) external information sources such as classical search engine results and social tagging systems. We compare our approach with other RDF similarity measures, proving the validity of our algorithm with an extensive evaluation involving real users.
MENTA: inducing multilingual taxonomies from wikipedia BIBAFull-Text 1099-1108
  Gerard de Melo; Gerhard Weikum
In recent years, a number of projects have turned to Wikipedia to establish large-scale taxonomies that describe orders of magnitude more entities than traditional manually built knowledge bases. So far, however, the multilingual nature of Wikipedia has largely been neglected. This paper investigates how entities from all editions of Wikipedia as well as WordNet can be integrated into a single coherent taxonomic class hierarchy. We rely on linking heuristics to discover potential taxonomic relationships, graph partitioning to form consistent equivalence classes of entities, and a Markov chain-based ranking approach to construct the final taxonomy. This results in MENTA (Multilingual Entity Taxonomy), a resource that describes 5.4 million entities and is presumably the largest multilingual lexical knowledge base currently available.
Ontology emergence from folksonomies BIBAFull-Text 1109-1118
  Kaipeng Liu; Binxing Fang; Weizhe Zhang
The folksonomies built from the large-scale social annotations made by collaborating users are perfect data sources for bootstrapping Semantic Web applications. In this paper, we develop an ontology induction approach to harvest the emergent semantics from the folksonomies. We propose a latent subsumption hierarchy model to uncover the implicit structure of tag space and develop our ontology induction approach on basis of this model. We identify tag subsumptions with a set-theoretical approach and model the tag space as a tag subsumption graph. While turning this graph into a concept hierarchy, we address the problem of inconsistent subsumptions and propose a random walk based tag generality ranking procedure to settle it. We propose an agglomerative hierarchical clustering algorithm utilizing the result of tag generality ranking to generate the concept hierarchy. We conduct experiments on the Delicious dataset. The results of both qualitative and quantitative evaluation demonstrate the effectiveness of the proposed approach.
Multi-document topic segmentation BIBAFull-Text 1119-1128
  Minwoo Jeong; Ivan Titov
Multiple documents describing the same or closely related sets of events are common and often easy to obtain: for example, consider document clusters on a news aggregator site or multiple reviews of the same product or service. Even though each such document discusses a similar set of topics, they provide alternative views or complimentary information on each of these topics. We argue that revealing hidden relations by jointly segmenting the documents, or, equivalently, predicting links between topically related segments in different documents would help to visualize documents of interest and construct friendlier user interfaces. In this paper, we refer to this problem as multi-document topic segmentation. We propose an unsupervised Bayesian model for the considered problem that models both shared and document-specific topics, and utilizes Dirichlet process priors to determine the effective number of topics. We show that topic segmentation can be inferred efficiently using a simple split-merge sampling algorithm. The resulting method outperforms baseline models on four datasets for multi-document topic segmentation.
Meta-metadata: a metadata semantics language for collection representation applications BIBAFull-Text 1129-1138
  Andruid Kerne; Yin Qu; Andrew M. Webb; Sashikanth Damaraju; Nic Lupfer; Abhinav Mathur
Collecting, organizing, and thinking about diverse information resources is the keystone of meaningful digital information experiences, from research to education to leisure. Metadata semantics are crucial for organizing collections, yet their structural diversity exacerbates problems of obtaining and manipulating them, strewing end users and application developers amidst the shadows of a proverbial tower of Babel. We introduce meta-metadata, a language and software architecture addressing a metadata semantics lifecycle: (1) data structures for representation of metadata in programs; (2) metadata extraction from information resources; (3) semantic actions that connect metadata to collection representation applications; and (4) rules for presentation to users. The language enables power users to author metadata semantics wrappers that generalize template-based information sources. The architecture supports development of independent collection representation applications that reuse wrappers. The initial meta-metadata repository of information source wrappers includes Google, Flickr, Yahoo, IMDb, Wikipedia, and the ACM Portal. Case studies validate the approach.

IR track: web search

Clickthrough-based translation models for web search: from word models to phrase models BIBAFull-Text 1139-1148
  Jianfeng Gao; Xiaodong He; Jian-Yun Nie
Web search is challenging partly due to the fact that search queries and Web documents use different language styles and vocabularies. This paper provides a quantitative analysis of the language discrepancy issue, and explores the use of clickthrough data to bridge documents and queries. We assume that a query is parallel to the titles of documents clicked on for that query. Two translation models are trained and integrated into retrieval models: A word-based translation model that learns the translation probability between single words, and a phrase-based translation model that learns the translation probability between multi-term phrases. Experiments are carried out on a real world data set. The results show that the retrieval systems that use the translation models outperform significantly the systems that do not. The paper also demonstrates that standard statistical machine translation techniques such as word alignment, bilingual phrase extraction, and phrase-based decoding, can be adapted for building a better Web document retrieval system.
Temporal query log profiling to improve web search ranking BIBAFull-Text 1149-1158
  Alexander Kotov; Pranam Kolari; Lei Duan; Yi Chang
Temporal information can be leveraged and incorporated to improve web search ranking. In this work, we propose a method to improve the ranking of search results by identifying the fundamental properties of temporal behavior of low-quality hosts and spam-prone queries in search logs and modeling those properties as quantifiable features. In particular, we introduce the concepts of host churn, a measure of changes in host visibility for user queries, and query volatility, a measure of semantic instability of query results, and propose the methods for construction of temporal profiles from search query logs that can be used for estimation of a set of features based on the introduced concepts. The utility of the proposed concepts has been experimentally demonstrated for two language-independent search tasks: the regression-based ranking of search results and a novel classification problem of detecting spam-prone queries introduced in this work.
Improving web search relevance and freshness with content previews BIBAFull-Text 1159-1168
  Siva Gurumurthy; Hang Su; Vasileios Kandylas; Vidhyashankar Venkataraman
Traditional web search engines find it challenging to achieve good search quality for recency-sensitive queries, as they are prone to delays in discovering, indexing and ranking new web pages. In this paper we introduce PreGen, an adaptive preview generation system, which is run as part of a web search engine to improve search result quality for recency-sensitive queries. PreGen uses a machine learning algorithm to classify and select live web feeds, and generates "previews" of new web pages based on the link descriptions available in these feeds. The search engine can then index and present relevant page previews as part of its search results before the pages are fetched from the web, thereby reducing end-to-end delays. Our experiments show that PreGen improves the search relevance of a state-of-the-art search engine for recency-sensitive queries by 3% and reduces the average latencies of affected documents by 50%.
Organizing query completions for web search BIBAFull-Text 1169-1178
  Alpa Jain; Gilad Mishne
All state-of-the-art web search engines implement an auto-completion mechanism -- an assistive technology enabling users to effectively formulate their search queries by predicting the next characters or words that they are likely to type. Query completions (or suggestions) are typically mined from past user interactions with the search engine, e.g., from query logs, clickthrough patterns, or query reformulations; they are ranked by some measure of query popularity, e.g., query frequency or clickthrough rate. Current query suggestion tools largely assume that the set of suggestions provided to the users is homogeneous, corresponding to a single real-world interpretation of the query. In this paper, we hypothesize that, in some cases, users would benefit from an alternative presentation of the suggestions, one where suggestions are not only ordered by likelihood but also organized by high-level user intent. Rich search suggestion interaction frameworks that reduce the user effort in identifying the set of relevant suggestions open new and promising directions towards improving user experience. Along these lines, we propose clustering the set of suggestions presented to a search engine user, and assigning an appropriate label to each subset of suggestions to help users quickly identify useful ones. For this, we present a variety of unsupervised clustering techniques for search suggestions, based on the information available to a large-scale web search engine. We evaluate our novel search suggestion presentation techniques on a real-world dataset of query logs. Based on a set of user studies, we show that by extending the existing assistance layer to effectively group suggestions and label them -- while accounting for the query popularity -- we substantially increase the user's satisfaction.
Selectively diversifying web search results BIBAFull-Text 1179-1188
  Rodrygo L. T. Santos; Craig Macdonald; Iadh Ounis
Search result diversification is a natural approach for tackling ambiguous queries. Nevertheless, not all queries are equally ambiguous, and hence different queries could benefit from different diversification strategies. A more lenient or more aggressive diversification strategy is typically encoded by existing approaches as a trade-off between promoting relevance or diversity in the search results. In this paper, we propose to learn such a trade-off on a per-query basis. In particular, we examine how the need for diversification can be learnt for each query -- given a diversification approach and an unseen query, we predict an effective trade-off between relevance and diversity based on similar previously seen queries. Thorough experiments using the TREC ClueWeb09 collection show that our selective approach can significantly outperform a uniform diversification for both classical and state-of-the-art diversification approaches.

Industry track: text mining and analytics

Building re-usable dictionary repositories for real-world text mining BIBAFull-Text 1189-1198
  Shantanu Godbole; Indrajit Bhattacharya; Ajay Gupta; Ashish Verma
Text mining, though still a nascent industry, has been growing quickly along with the awareness of the importance of unstructured data in business analytics, customer retention and extension, social media, and legal applications. There has been a recent increase in the number of commercial text mining product and service offerings, but successful or wide-spread deployments are rare, mainly due to a dependence on the expertise and skill of practitioners. Accordingly, there is a growing need for re-usable repositories for text mining. In this paper, we focus on dictionary-based text mining and its role in enabling practitioners in understanding and analyzing large text datasets. We motivate and define the problem of exploratory dictionary construction for capturing concepts of interest, and propose a framework for efficient construction, tuning, and re-use of these dictionaries across datasets. The construction framework offers a range of interaction modes to the user to quickly build concept dictionaries over large datasets. We also show how to adapt one or more dictionaries across domains and tasks, thereby enabling reuse of knowledge and effort in industrial practice. We present results and case studies on real-life CRM analytics datasets, where such repositories and tooling significantly cut down practitioner time and effort for dictionary-based text mining.
OpinionIt: a text mining system for cross-lingual opinion analysis BIBAFull-Text 1199-1208
  Honglei Guo; Huijia Zhu; Zhili Guo; Xiaoxun Zhang; Zhong Su
Opinion mining focuses on extracting customers' opinions from the reviews and predicting their sentiment orientation. Reviewers usually praise a product in some aspects and bemoan it in other aspects. With the business globalization, it is very important for enterprises to extract the opinions toward different aspects and find out cross-lingual/cross-culture difference in opinions. Cross-lingual opinion mining is a very challenging task as amounts of opinions are written in different languages, and not well structured. Since people usually use different words to describe the same aspect in the reviews, product-feature (PF) categorization becomes very critical in cross-lingual opinion mining. Manual cross-lingual PF categorization is time consuming, and practically infeasible for the massive amount of data written in different languages. In order to effectively find out cross-lingual difference in opinions, we present an aspect-oriented opinion mining method with Cross-lingual Latent Semantic Association (CLaSA). We first construct CLaSA model to learn the cross-lingual latent semantic association among all the PFs from multi-dimension semantic clues in the review corpus. Then we employ CLaSA model to categorize all the multilingual PFs into semantic aspects, and summarize cross-lingual difference in opinions towards different aspects. Experimental results show that our method achieves better performance compared with the existing approaches. With CLaSA model, our text mining system OpinionIt can effectively discover cross-lingual difference in opinions.
Hierarchical service analytics for improving productivity in an enterprise service center BIBAFull-Text 1209-1218
  Chunye Wang; Ram Akella; Srikant Ramachandran
Modern day service centers are the building blocks for highly efficient and productive business systems in a knowledge economy. In these service systems, accurate and timely delivery of pertinent information to service representatives becomes the cornerstone for delivering efficient customer service. There are two main steps in achieving this objective. The first step concerns efficient text mining to extract critical and pertinent information from the very long service request (SR) documents in the historical database. The second step concerns matching new service requests with previously stored service requests. Both lead to efficiencies by minimizing time spent by service personnel in extracting Intellectual Capital (IC). In this paper we present our text analytics system, the Service Request Analyzer and Recommender (SRAR), which is designed to improve the productivity in an enterprise service center for computer network diagnostics and support. SRAR unifies a text preprocessor, a hierarchical classifier, and a service request recommender, to deliver critical, pertinent, and categorized knowledge for improved service efficiency. The novel feature we report here is identifying the components of the diagnostic process underlying the creation of the original text documents. This identification is crucial in the successful design and prototyping of SRAR and its hierarchical classifier element. Equally, the use of domain knowledge and human expertise to generate features are indispensable synergistic elements in improving the accuracy of the text analysis toward identifying the components of the diagnostic process. The evaluation and comparison of SRAR with other benchmark approaches in the literature demonstrate the effectiveness of our framework and algorithms. This framework can be generalized to be applicable in many service industries and business functions that mine textual data to achieve increased efficiency in their service delivery. We observe significant service time responsiveness improvements during the first step of IC extraction in network service center context at Cisco.

IR track: scalability and efficiency

VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming BIBAFull-Text 1219-1228
  Fabrizio Silvestri; Rossano Venturini
Encoding lists of integers efficiently is important for many applications in different fields. Adjacency lists of large graphs are usually encoded to save space and to improve decoding speed. Inverted indexes of Information Retrieval systems keep the lists of postings compressed in order to exploit the memory hierarchy. Secondary indexes of DBMSs are stored similarly to inverted indexes in IR systems. In this paper we propose Vector of Splits Encoding (VSEncoding), a novel class of encoders that work by optimally partitioning a list of integers into blocks which are efficiently compressed by using simple encoders. In previous works heuristics were applied during the partitioning step. Instead, we find the optimal solution by using a dynamic programming approach. Experiments show that our class of encoders outperform all the existing methods in literature by more than 10% (with the exception of Binary Interpolative Coding with which they, roughly, tie) still retaining a very fast decompression algorithm.
Efficient term proximity search with term-pair indexes BIBAFull-Text 1229-1238
  Hao Yan; Shuming Shi; Fan Zhang; Torsten Suel; Ji-Rong Wen
There has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can greatly improve query processing efficiency. However, only a limited amount of efficient top-k processing work considers the impact of term proximity, i.e., the distance between term occurrences in a document, which has recently been integrated into a number of retrieval models to improve effectiveness.
   In this paper, we propose new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model. We propose new index structures based on a term-pair index, and study new document retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation on our new techniques and compare them with the existing approaches. Experimental results on large-scale data sets show that our techniques can significantly improve the efficiency of query processing.
Improved index compression techniques for versioned document collections BIBAFull-Text 1239-1248
  Jinru He; Junyuan Zeng; Torsten Suel
Current Information Retrieval systems use inverted index structures for efficient query processing. Due to the extremely large size of many data sets, these index structures are usually kept in compressed form, and many techniques for optimizing compressed size and query processing speed have been proposed. In this paper, we focus on versioned document collections, that is, collections where each document is modified over time, resulting in multiple versions of the document. Consecutive versions of the same document are often similar, and several researchers have explored ideas for exploiting this similarity to decrease index size.
   We propose new index compression techniques for versioned document collections that achieve reductions in index size over previous methods. In particular, we first propose several bitwise compression techniques that achieve a compact index structure but that are too slow for most applications. Based on the lessons learned, we then propose additional techniques that come close to the sizes of the bitwise technique while also improving on the speed of the best previous methods.
Building efficient multi-threaded search nodes BIBAFull-Text 1249-1258
  Carolina Bonacic; Carlos García; Mauricio Marin; Manuel Prieto-Matias; Francisco Tirado
Search nodes are single-purpose components of large Web search engines and their efficient implementation is critical to sustain thousands of queries per second and guarantee individual query response times within a fraction of a second. Current technology trends indicate that search nodes ought to be implemented as multi-threaded multi-core systems. The straightforward solution that system designers can apply in this case is simply to follow standard practice by deploying one asynchronous thread per active query in the node and attaching each thread to a different core. Each concurrent thread is responsible for sequentially processing a single query at a time. The only potential source of read/write conflicts among threads are the accesses to the different application caches present in the search node. However, new Web applications pose much more demanding requirements in terms of read/write conflicts than recent past applications since now data updates must take place concurrently with query processing. Insisting on the same paradigm of concurrent threads now augmented with a transaction concurrency control protocol is a feasible solution. In this paper we propose a more efficient and much simpler solution which has the additional advantage of enabling a very efficient administration of application caches. We propose performing relaxed bulk-synchronous parallelism at multi-core level.
Real-time memory efficient data redundancy removal algorithm BIBAFull-Text 1259-1268
  Vikas K. Garg; Ankur Narang; Souvik Bhattacherjee
Data intensive computing has become a central theme in research community and industry. There is an ever growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc. Removing redundancy in the data is an important problem as it helps in resource and compute efficiency for downstream processing of the massive (1 billion to 10 billion records) datasets. In application domains such as IR, stock markets, telecom and others, there is a strong need for real-time data redundancy removal (referred to as DRR) of enormous amounts of data flowing at the rate of 1 GB/s or more. Real-time scalable data redundancy removal on massive datasets is a challenging problem. We present the design of a novel parallel data redundancy removal algorithm for both in-memory and disk-based execution. We also develop queueing theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500 million records, our parallel algorithm can perform complete de-duplication in 255s, on 16 core Intel Xeon 5570 architecture, with in-memory execution. This gives a throughput of 2M records/s. For 6 billion records, our parallel algorithm can perform complete de-duplication in less than 4.5 hours, using 6 cores of Intel Xeon 5570, with disk-based execution. This gives a throughput of around 370K records/s. To the best of our knowledge, this is the highest real-time throughput for data redundancy removal on such massive datasets. We also demonstrate the scalability of our algorithm with increasing number of cores and data.

Poster session 1: DB track

Search-log anonymization and advertisement: are they mutually exclusive? BIBAFull-Text 1269-1272
  Thorben Burghardt; Klemens Böhm; Achim Guttmann; Chris Clifton
The revenue of search-engine providers strongly depends on targeted advertisement. Targeted advertisement is becoming more reliant on personal data. This puts user privacy at risk. One way to improve privacy is to anonymize search logs, but this reduces usefulness for ad placement. Further, the usefulness depends on the target function used for the anonymization. This paper is the first to study this tradeoff systematically. We quantify the usefulness of an anonymized search log for advertisement purposes, by estimating outcomes such as the number of clicks on ads or the number of ad impressions possible after anonymization. A main result is that anonymized search logs are still useful for advertisement purposes, but the extent strongly depends on the target function.
Automated interaction in social networks with datalog BIBAFull-Text 1273-1276
  Royi Ronen; Oded Shmueli
The Query Network [12] is a model for query-based social networks automation features, motivated by the rise of social networks as a central internet application. This work generalizes the model to consist of a proposal query and an acceptance query for each participant. As a result, addition of edges is done by coordination between participants, simulating interactions between participants. We designed, implemented and experimented with evaluation algorithms for this new model. Experiments with both synthetic and real datasets show the high effectiveness of our methods.
Towards a provenance framework for sub-image processing for astronomical data BIBAFull-Text 1277-1280
  Johnson Mwebaze; John McFarland; Danny Booxhorn; Edwin Valentijn
While there has been advances in observational equipment that generate huge high quality images, the processing of these images remains a major bottleneck. We show that provenance data collected during the processing of data can be reused to perform selective processing of data and support network collaboration without clogging distribution networks. We introduce the idea of sub-image processing (SIMP) in the context of processing a subset of pixels of an image and the use of provenance data to assemble pipelines and to select processing metadata for SIMP. We describe an implementation of SIMP in Astro-WISE.
Open user schema guided evaluation of streaming RDF queries BIBAFull-Text 1281-1284
  Zhihong Chong; Guilin Qi; Hu Shu; Jiajia Bao; Weiwei Ni; Aoying Zhou
Performance and scalability are two issues that are becoming increasingly pressing as RDF data model is applied to real-world applications. Because neither vertical nor flat structures of RDF storage can handle frequent schema updates and meanwhile avoid possible long-chain joins, there is no clear winner between these two typical structures. In this paper, we propose an alternative storage schema called open user schema. The open user schema consists of flat tables automatically extracted from RDF query streams. A query is divided into two parts and conquered, respectively, on the flat tables in the open user schema and on the vertical table stored in a backend storage. At the core of this divide and conquer architecture with open user schema, an efficient isomorphism decision algorithm is given to guide a query to related flat tables in the open user schema. Our proposal in essence departs from existing methods in that it can accommodate schema updates without possible long-chain joins. We implement our approach and provide empirical evaluations to demonstrate both efficiency and effectiveness of our approach in evaluating complex RDF queries.
SUMMA: subgraph matching in massive graphs BIBAFull-Text 1285-1288
  Shijie Zhang; Shirong Li; Jiong Yang
Graphs can represent a large number of data types, e.g., online social networks, internet links, procedure dependency graphs, etc. The need for indexing massive graphs is an urgent research problem of great practical importance. The main challenge is the size. Each graph may contain at least tens of millions vertices. The working memory may not be able to store the database graph due to its large size, which increases the processing time significantly.
   We propose a novel index based subgraph matching scheme, namely SUMMA, for graph querying in massive graphs. We devise two novel indices which capture both local and global information of the database graph. SUMMA is further optimized by the use of a matching scheme to reduce redundant calculations and disk accesses. Last but not least, a number of synthetic datasets are used to evaluate the efficiency and scalability of our proposed method.
Automatically weighting tags in XML collection BIBAFull-Text 1289-1292
  Dexi Liu; Changxuan Wan; Lei Chen; Xiping Liu
In XML retrieval, nodes with different tags play different roles in XML documents and then tags should be reflected in the relevance ranking. An automatic method is proposed in this paper to infer the weights of tags. We first investigate 15 features about tags, and then select five of them based on the correlations between these features and manual tag weights. Using these features, a tag weight assignment model, ATG, is designed. We evaluate the performance of ATG on two real data sets, IEEECS and Wikipedia from two different perspectives. One is to evaluate the quality of the model by measuring the correlation between weights generated by our model and those given by experts. The other is to test the effectiveness of the model in improving retrieval performance. Experimental results show that the tag weights generated by ATG are highly correlated with the manually assigned weights and the ATG model improves retrieval effectiveness significantly.
Skyline query processing for uncertain data BIBAFull-Text 1293-1296
  Mohamed E. Khalefa; Mohamed F. Mokbel; Justin J. Levandoski
Recently, several research efforts have addressed answering skyline queries efficiently over large datasets. However, this research lacks methods to compute these queries over uncertain data, where uncertain values are represented as a range. In this paper, we define skyline queries over continuous uncertain data, and propose a novel, efficient framework to answer these queries. Query answers are probabilistic, where each object is associated with a probability value of being a query answer. Typically, users specify a probability threshold, that each returned object must exceed, and a tolerance value that defines the allowed error margin in probability calculation to reduce the computational overhead. Our framework employs an efficient two-phase query processing algorithm.
FD-buffer: a buffer manager for databases on flash disks BIBAFull-Text 1297-1300
  Sai Tung On; Yinan Li; Bingsheng He; Ming Wu; Qiong Luo; Jianliang Xu
We design and implement FD-Buffer, a buffer manager for database systems running on flash-based disks. Unlike magnetic disks, flash media has an inherent read-write asymmetry: writes involve expensive erase operations and as a result are usually much slower than reads. Therefore, we address this asymmetry in FD-Buffer. Specifically, we use the average I/O cost per page access as opposed to the traditional miss rate as the performance metric for a buffer. We develop a new replacement policy in which we separate clean and dirty pages into two pools. The size ratio of the two pools is automatically adapted to the read-write asymmetry and the runtime workload. We evaluate FD-Buffer with trace-driven experiments on real flash disks. Our evaluation results show that our algorithm achieves up to 33% improvement on the overall performance on commodity flash disks, in comparison with the state-of-the-art flash-aware replacement policy.
Efficiently querying archived data using Hadoop BIBAFull-Text 1301-1304
  Rajeev Gupta; Himanshu Gupta; Ullas Nambiar; Mukesh Mohania
The need to analyze structured data for various business intelligence applications such as customer churn analysis, social network analysis, telecom network monitoring etc., is well known. However, the potential size to which such data will scale in future will make solutions that revolve around data warehouses hard to scale. As data sizes grow the movement of data from the warehouse to archives becomes more frequent. Current file based archive models make the archived data unusable for any type of insight extraction. In this paper, we present an active archival solution for data warehouses that makes use of Hadoop distributed file system (HDFS) to store the data in an always available and cost-effective manner. We investigate various structured data storage schemes within HDFS and empirical evaluations show that a combination of Universal scheme model and column store is best suited for the active archival solution.
Evaluation of top-k queries in peer-to-peer networks using threshold algorithms BIBAFull-Text 1305-1308
  Ioannis Chrysakis; Constantinos Chalkidis; Dimitris Plexousakis
In p2p networks, top-k query processing can provide a lot of advantages both in time and bandwidth consumption. Several algorithms have been proposed for the evaluation of top-k queries. A large percentage of them follow the Threshold Approach. We focus on the main adaptations of threshold algorithms fulfilling the requirements of modern p2p applications. We introduce two algorithms optimized for ranking queries in p2p networks and present their characteristics. In the setting of a simulation of large p2p networks, we evaluate the algorithms' performance. Our experiments demonstrate that in some cases a threshold algorithm can improve top-k query processing, while in others it is far more costly. The results show that distributed query processing can be more effective than a simple threshold algorithm in a p2p network.
A metamodel approach to flexible semantic web service discovery BIBAFull-Text 1309-1312
  Roberto De Virgilio; Devis Bianchini
In this paper we describe an approach for service discovery supported by semantic annotations. We propose a metamodel representation of both the WSDL documents and the associated semantic annotations. Based on this metamodel, effective service discovery is achieved by a Datalog engine implementing flexible matchmaking techniques that allow both exact and partial matches among search results. The metamodel is supported by a storage system that ensures scalability of the entire process. Finally we illustrate experiments on a public dataset of semantic service descriptions.
On top-k social web search BIBAFull-Text 1313-1316
  Peifeng Yin; Wang-Chien Lee; Ken C. K. Lee
To enhance the quality of document search, recent research studies have started to exploit the social networks of users by considering social influence (SI), measurement of the affinity between a query user and the publisher of a retrieved document, in addition to the commonly used textual relevance (TR). We refer to such document search that considers social networks as social web search. In this paper, we focus on efficient top-k social web search and propose two search strategies: (i) TR-based search and (ii) SI-based search that tailor document examination orders upon TR and SI, respectively. We evaluate the proposed strategies through experimentation.
Quantifying uncertainty in multi-dimensional cardinality estimations BIBAFull-Text 1317-1320
  Andranik Khachatryan; Klemens Boehm
We propose a method for predicting the cardinality distribution of a multi-dimensional query. Compared to conventional 'point-based' estimates, distribution-based estimates enable the query optimizer to predict the cost of a query plan more accurately, as we show experimentally. Our method is computationally efficient and works on top of a histogram already in place. It does not store any information additional to the histogram. Our experiments show that the quality of the predictions with the new method is high.
Approximate membership localization (AML) for web-based join BIBAFull-Text 1321-1324
  Zhixu Li; Laurianne Sitbon; Liwei Wang; Xiaofang Zhou; Xiaoyong Du
In this paper, we propose a search-based approach to join two tables in the absence of clean join attributes. Non-structured documents from the web are used to express the correlations between a given query and a reference list. To implement this approach, a major challenge we meet is how to efficiently determine the number of times and the locations of each clean reference from the reference list that is approximately mentioned in the retrieved documents. We formalize the Approximate Membership Localization (AML) problem and propose an efficient partial pruning algorithm to solve it. A study using real-word data sets demonstrates the effectiveness of our search-based approach, and the efficiency of our AML algorithm.
An efficient data-centric storage scheme considering storage and query hot-spots in sensor networks BIBAFull-Text 1325-1328
  Yonghun Park; Dongmin Seo; Jonghyeon Yun; Christopher T. Ryu; Jaesoo Yoo
In wireless sensor networks, various schemes have been proposed to efficiently store and process sensed data. Among them, the Data-Centric Storage (DCS) scheme is one of the most well-known. The DCS scheme distributes data regions and stores the data in the sensor that is responsible for the region. In this paper, we propose a new DCS based scheme, called Time-Parameterized Data-Centric Storage (TPDCS), that avoids the problems of storage hot-spots and query hotspots. To decentralize the skewed data and queries, the data regions are assigned by a time dimension as well as data dimensions in our proposed scheme. Therefore, TPDCS extends the lifetime of sensor networks. It is shown through various experiments that our scheme outperforms the existing schemes.
Formal approach and automated tool for constructing ontology from object-oriented database model BIBAFull-Text 1329-1332
  Fu Zhang; Z. M. Ma; Xing Wang; Yu Wang
Extracting domain knowledge from databases can facilitate the development of Web ontologies. In this paper, a formal approach and an automated tool for constructing ontologies from Object-oriented database models (OODMs) are developed. The approach and tool can automatically translate an OODM and its corresponding database instances into the ontology structure and ontology instances, respectively. Case studies show that the approach is feasible and the automated construction tool is efficient.
(k,P)-anonymity: towards pattern-preserving anonymity of time-series data BIBAFull-Text 1333-1336
  Xuan Shang; Ke Chen; Lidan Shou; Gang Chen; Tianlei Hu
The challenges with privacy protection of time series are mainly due to the complex nature of the data and the queries performed on them. We study the anonymization of time series while trying to support complex queries, such as range and pattern similarity queries, on the published data. The conventional k-anonymity cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel anonymization model called (k,P)-anonymity for pattern-rich time series. This model publishes both the attribute values and the patterns of time series in separate data forms. We demonstrate that our model can prevent linkage attacks on the published data while effectively support a wide variety of queries on the anonymized data. We also design an efficient algorithm for enforcing (k,P)-anonymity on time series data.
Concurrent atomic protocols for making and changing decisions in social networks BIBAFull-Text 1337-1340
  Royi Ronen; Oded Shmueli
We study a novel data management scenario, in which social networks participants use protocols in order to manage their activities and the ever-growing data available to them in the network. In particular, we study protocols which operate on a consistent network (that we define), and transform it into another consistent state by atomically performing a set of changes. Multiple protocol instances, which work on intersecting parts of the network graphs are able to operate concurrently.
Extending dictionary-based entity extraction to tolerate errors BIBAFull-Text 1341-1344
  Guoliang Li; Dong Deng; Jianhua Feng
Entity extraction (also known as entity recognition) extracts entities (e.g., person names, locations, companies) from text. Approximate (dictionary-based) entity extraction is a recent trend to improve extraction quality, which extracts substrings in text that approximately match predefined entities in a given dictionary. In this paper, we study the problem of approximate entity extraction with edit-distance constraints. A straightforward method first extracts all substrings from the text and then for each substring identifies its similar entities from the dictionary using existing methods for approximate string search. However many substrings of the text have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary duplicate computations. To this end, we propose a heap-based framework to efficiently extract entities. We have implemented our techniques, and the experimental results show that our method achieves high performance and outperforms existing studies significantly.
Yet another write-optimized DBMS layer for flash-based solid state storage BIBAFull-Text 1345-1348
  Hongchan Roh; Daewook Lee; Sanghyun Park
Flash-based Solid State Storage (flashSSS) has write-oriented problems such as low write throughput, and limited life-time. Especially, flashSSDs have a characteristic vulnerable to random-writes, due to its control logic utilizing parallelism between the flash memory chips. In this paper, we present a write-optimized layer of DBMSs to address the write-oriented problems of flashSSS in on-line transaction processing environments. The layer consists of a write-optimized buffer, a corresponding log space, and an in-memory mapping table, closely associated with a novel logging scheme called InCremental Logging (ICL). The ICL scheme enables DBMSs to reduce page-writes at the least expense of additional page-reads, while replacing random-writes into sequential-writes. Through experiments, our approach demonstrated up-to an order of magnitude performance enhancement in I/O processing time compared to the original DBMS, increasing the longevity of flashSSS by approximately a factor of two.
Print: a provenance model to support integration processes BIBAFull-Text 1349-1352
  Bruno Tomazela; Carmem S. Hara; Ricardo R. Ciferri; Cristina D. A. Ciferri
In some integration applications, users are allowed to import data from heterogeneous sources, but are not allowed to update source data directly. Imported data may be inconsistent, and even when inconsistencies are detected and solved, these changes may not be propagated to the sources due to their update policies. Therefore, they continue to provide the same inconsistent data in the future until the proper authority updates them. In this paper, we propose PrInt, a model that supports user's decisions on cleaning data to be automatically reapplied in subsequent integration processes. By reproducing previous decisions, the user may focus only on new inconsistencies originated from source modified data. The reproducibility provided by PrInt is based on logging, and by incorporating data provenance in the integration process.
OLAP-based query recommendation BIBAFull-Text 1353-1356
  Carlos Garcia-Alvarado; Zhibo Chen; Carlos Ordonez
Query recommendation is an invaluable tool for enabling users to speed up their searches. In this paper, we present algorithms for generating query suggestions, assuming no previous knowledge of the collection. We developed an online OLAP algorithm to generate query suggestions for the users based on the frequency of the keywords in the selected documents and the correlation between the keywords in the collection. In addition, performance and scalability experiments of these algorithms are presented as proof of their feasibility. We also present sampling as an additional approach for improving performance by using approximate results. We show valid recommendations as a result of combinations generated using the correlations between the keywords. The online OLAP algorithm is also compared with the well-known Apriori algorithm and found to be faster only when simple computations were performed in smaller collections with a few keywords. On the other hand, OLAP showed a more stable behavior between collections, and allows us to have more complex policies during the aggregation and term combinations. Additionally, sampling showed improvement in the time without a significant change on the suggested queries, and proved to be an accurate alternative with a few small samples.
Selective data acquisition for probabilistic K-NN query BIBAFull-Text 1357-1360
  Yu-Chieh Lin; De-Nian Yang; Ming-Syan Chen
Recently, management of uncertain data draws lots of attention to consider the granularity of devices and noises in collection and delivery of data. Previous works directly model and handle uncertain data to find the required results. However, when data uncertainty is not small or limited, users are not able to obtain useful insights and thereby tend to provide more resources to improve the solution, by reducing the uncertainty of data. In light of this issue, this paper formulates a new problem of choosing a given number of uncertain data objects for acquiring their attribute values to improve the solutions of Probabilistic k-Nearest-Neighbor (k-PNN) query. We prove that solutions must be better after data acquisition, and we devise algorithms to maximize expected improvement. Our experiment results demonstrate that the probability can be significantly improved with only a small number of data acquisitions.
Support elements in graph structured schema reintegration BIBAFull-Text 1361-1364
  Xun Sun; Rachel A. Pottinger; Michael K. Lawrence
Manipulating graph-structured schemas (ontologies, models, etc.) requires the result to remain fully connected. In certain cases, e.g., calculating the difference of two schemas, support structures may be needed in the result. We describe our engine to process support structures in the context of a schema management system and describe schema reintegration experiments which validate the performance and correctness of our system.
BP-tree: an efficient index for similarity search in high-dimensional metric spaces BIBAFull-Text 1365-1368
  Jurandy Almeida; Ricardo da S. Torres; Neucimar J. Leite
Similarity search in high-dimensional metric spaces is a key operation in many applications, such as multimedia databases, image retrieval, object recognition, and others. The high dimensionality of the data requires special index structures to facilitate the search. Most of existing indexes are constructed by partitioning the data set using distance-based criteria. However, those methods either produce disjoint partitions, but ignore the distribution properties of the data; or produce non-disjoint groups, which greatly affect the search performance. In this paper, we study the performance of a new index structure, called Ball-and-Plane tree (BP-tree), which overcomes the above disadvantages. BP-tree is constructed by recursively dividing the data set into compact clusters. Distinctive from other techniques, it integrates the advantages of both disjoint and non-disjoint paradigms in order to achieve a structure of tight and low overlapping clusters, yielding significantly improved performance. Results obtained from an extensive experimental evaluation with real-world data sets show that BP-tree consistently outperforms state-of-the-art solutions.
Query optimization for ontology-based information integration BIBAFull-Text 1369-1372
  Yingjie Li; Jeff Heflin
In recent years, there has been an explosion of publicly available RDF and OWL data sources. In order to effectively and quickly answer queries in such an environment, we present an approach to identifying the potentially relevant Semantic Web data sources using query rewritings and a term index. We demonstrate that such an approach must carefully handle query goals that lack constants; otherwise the algorithm may identify many sources that do not contribute to eventual answers. This is because the term index only indicates if URIs are present in a document, and specific answers to a subgoal cannot be calculated until the source is physically accessed -- an expensive operation given disk/network latency. We present an algorithm that, given a set of query rewritings that accounts for ontology heterogeneity, incrementally selects and processes sources in order to maintain selectivity. Once sources are selected, we use an OWL reasoner to answer queries over these sources and their corresponding ontologies. We present the results of experiments using both a synthetic data set and a subset of the real-world Billion Triple Challenge data.
Data aspects in a relational database BIBAFull-Text 1373-1376
  Curtis Dyreson; Omar U. Florez
Data has cross-cutting concerns such as versioning, privacy, and reliability. In this paper we sketch support such concerns by adapting the aspect-oriented programming (AOP) paradigm to data. Our goal, shared by AOP, is to re-engineer applications to support cross-cutting concerns without directly modifying the application's data or queries. We propose modeling a cross-cutting data concern as a data aspect. A data aspect weaves metadata around an application's data and queries, imbuing them with additional semantics for constraint and query processing.
A hierarchical approach to reachability query answering in very large graph databases BIBAFull-Text 1377-1380
  Saikat K. Dey; Hasan Jamil
The cost of reachability query computation using traditional algorithms such as depth first search or transitive closure has been found to be prohibitive and unacceptable in massive graphs such as biological interaction networks, or pathways. Contemporary solutions mainly take two distinct approaches -- precompute reachability in the form of transitive closure (trade space for time) or use state space search (trade time for space). A middle ground among the two approaches has recently gained popularity. It precomputes part of the reachability information as a complex index so that most queries can be answered within a reasonable time. In this approach, the main cost now is creation of the index, and response generation using it as well as the space needed to materialize the structure. Most contemporary solutions favor a combination of these costs to be efficient for a class of applications. In this paper, we propose a hierarchical index based on graph segmentation to reduce index size without sacrificing query efficiency. We present experimental evidence to show that our approach can achieve significant space savings, and improve efficiency. We also show that our index need not be rebuilt for a large class of updates, a feature missing in all other contemporary systems.
Computing the top-k maximal answers in a join of ranked lists BIBAFull-Text 1381-1384
  Mirit Shalem; Yaron Kanza
Complex search tasks that utilize information from several data sources, are answered by integrating the results of distinct basic search queries. In such integration, each basic query returns a ranked list of items, and the main task is to compute the join of these lists, returning the top-k combinations. Computing the top-k join of ranked lists has been studied extensively for the case where the answer comprises merely complete combinations. However, a join is a lossy operation, and over heterogeneous data sources some highly-ranked items, from the results of the basic queries, may not appear in any combination. Yet, such items and the partial combinations in which they appear may still be relevant answers and should not be discarded categorically.
   In this paper we consider a join where combinations are padded by nulls for missing items. A combination is maximal if it cannot be extended by replacing a null by an item. We present algorithms for computing the top-k maximal combinations and provide an experimental evaluation.
PruSM: a prudent schema matching approach for web forms BIBAFull-Text 1385-1388
  Thanh Hoang Nguyen; Hoa Nguyen; Juliana Freire
There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify the access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded with the Web-scale and noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy the determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. A experimental evaluation of PruSM using widely available data sets shows that the approach effective and able to accurately match a large number of form schemata and without requiring any manual pre-processing.
Anonymizing data with quasi-sensitive attribute values BIBAFull-Text 1389-1392
  Pu Shi; Li Xiong; Benjamin C. M. Fung
We study the problem of anonymizing data with quasi-sensitive attributes. Quasi-sensitive attributes are not sensitive by themselves, but certain values or their combinations may be linked to external knowledge to reveal indirect sensitive information of an individual. We formalize the notion of l-diversity and t-closeness for quasi-sensitive attributes, which we call QS l-diversity and QS t-closeness, to prevent indirect sensitive attribute disclosure. We propose a two-phase anonymization algorithm that combines quasi-identifying value generalization and quasi-sensitive value suppression to achieve QS l-diversity and QS t-closeness.

Poster session 2: IR track

Exploiting site-level information to improve web search BIBAFull-Text 1393-1396
  Andrei Broder; Evgeniy Gabrilovich; Vanja Josifovski; George Mavromatis; Donald Metzler; Jane Wang
Ranking Web search results has long evolved beyond simple bag-of-words retrieval models. Modern search engines routinely employ machine learning ranking that relies on exogenous relevance signals. Yet the majority of current methods still evaluate each Web page out of context. In this work, we introduce a novel source of relevance information for Web search by evaluating each page in the context of its host Web site. For this purpose, we devise two strategies for compactly representing entire Web sites. We formalize our approach by building two indices, a traditional page index and a new site index, where each "document" represents the an entire Web site. At runtime, a query is first executed against both indices, and then the final page score for a given query is produced by combining the scores of the page and its site. Experimental results carried out on a large-scale Web search test collection from a major commercial search engine confirm the proposed approach leads to consistent and significant improvements in retrieval effectiveness.
Exploration-exploitation tradeoff in interactive relevance feedback BIBAFull-Text 1397-1400
  Maryam Karimzadehgan; ChengXiang Zhai
We study an interesting optimization problem in interactive feedback that aims at optimizing the tradeoff between presenting search results with the highest immediate utility to a user (but not necessarily most useful for collecting feedback information) and presenting search results with the best potential for collecting useful feedback information (but not necessarily the most useful documents from a user's perspective). Optimizing such an exploration-exploitation tradeoff is key to the optimization of the overall utility of relevance feedback to a user in the entire session of relevance feedback. We frame this tradeoff as a problem of optimizing the diversification of search results. We propose a machine learning approach to adaptively optimizing the diversification of search results for each query so as to optimize the overall utility in an entire session. Experiment results show that the proposed learning approach can effectively optimize the exploration-exploitation tradeoff and outperforms the traditional relevance feedback approach which only does exploitation without exploration.
Ranking social bookmarks using topic models BIBAFull-Text 1401-1404
  Morgan Harvey; Ian Ruthven; Mark Carman
Ranking of resources in social tagging systems is a difficult problem due to the inherent sparsity of the data and the vocabulary problems introduced by having a completely unrestricted lexicon. In this paper we propose to use hidden topic models as a principled way of reducing the dimensionality of this data to provide more accurate resource rankings with higher recall. We first describe Latent Dirichlet Allocation (LDA) and then show how it can be used to rank resources in a social bookmarking system. We test the LDA tagging model and compare it with 3 non-topic model baselines on a large data sample obtained from the Delicious social bookmarking site. Our evaluations show that our LDA-based method significantly outperforms all of the baselines.
A fine-grained taxonomy of tables on the web BIBAFull-Text 1405-1408
  Eric Crestan; Patrick Pantel
We propose a classification taxonomy over a large crawl of HTML tables on the Web, focusing primarily on those tables that express structured knowledge. The taxonomy separates tables into two top-level classes: a) those used for layout purposes, including navigational and formatting; and b) those containing relational knowledge, including listings, attribute/value, matrix, enumeration, and form. We then propose a classification algorithm for automatically detecting a subset of the classes in our taxonomy, namely layout tables and attribute/value tables. We report on the performance of our system over a large sample of manually annotated HTML tables on the Web.
Challenges in personalized authority flow based ranking of social media BIBAFull-Text 1409-1412
  Hassan Sayyadi; John Edmonds; Vagelis Hristidis; Louiqa Raschid
As the social interaction of Internet users increases, so does the need to effectively rank social media. We study the challenges of personalized ranking of blog posts. Web search techniques are inadequate since social media lack many of the characteristics of the Web such as rich document content and an extensive hyperlink graph. Further, user behavior in social media has moved beyond keyword based search and must support users who follow a particular blog or theme. In this research, we extend a social media dataset to exploit the associations between authors, blog posts, and categories (topics) of the posts. We then apply personalized authority flow based ranking algorithms based on the random surfer model. We evaluate our personalization approaches through an extensive study on a range of virtual users whose preferences are defined based on intuitive criteria. Our evaluation shows that the accuracy of our personalized recommendations ranges from good to very good for a majority of users, and outperforms reasonable baseline approaches.
A new mathematics retrieval system BIBAFull-Text 1413-1416
  Shahab Kamali; Frank Wm. Tompa
The Web contains a large collection of documents, some with mathematical expressions. Because mathematical expressions are objects with complex structures and rather few distinct symbols, conventional text retrieval systems are not very successful in mathematics retrieval. The lack of a definition for similarity between mathematical expressions, and the inadequacy of searching for exact matches only, makes the problem of mathematics retrieval even harder. As a result, the few existing mathematics retrieval systems are not very helpful in addressing users' needs.
   We propose a powerful query language for mathematical expressions that augments exact matching with approximate matching, but in a way that is controlled by the user. We also introduce a novel indexing scheme that scales well for large collections of expressions. Based on this indexing scheme, an efficient lookup algorithm is proposed.
Explore click models for search ranking BIBAFull-Text 1417-1420
  Dong Wang; Weizhu Chen; Gang Wang; Yuchen Zhang; Botao Hu
Recent advances in click model have positioned it as an effective approach to estimate document relevance based on user behavior in web search. Yet, few works have been conducted to explore the use of click model to help web search ranking. In this paper, we focus on learning a ranking function by taking the results from a click model into account. Thus, besides the editorial relevance data arising from the explicit manually labeled search result by experts, we also have the estimated relevance data that is automatically inferred from click models based on user search behavior. We carry out extensive experiments on large-scale commercial datasets and demonstrate the effectiveness of the proposed methods.
Generating advertising keywords from video content BIBAFull-Text 1421-1424
  Michael J. Welch; Junghoo Cho; Walter Chang
With the proliferation of online distribution methods for videos, content owners require easier and more effective methods for monetization through advertising. Matching advertisements with related content has a significant impact on the effectiveness of the ads, but current methods for selecting relevant advertising keywords for videos are limited by reliance on manually supplied metadata. In this paper we study the feasibility of using text available from video content to obtain high quality keywords suitable for matching advertisements. In particular, we tap into three sources of text for ad keyword generation: production scripts, closed captioning tracks, and speech-to-text transcripts. We address several challenges associated with using such data. To overcome the high error rates prevalent in automatic speech recognition and the lack of an explicit structure to provide hints about which keywords are most relevant, we use statistical and generative methods to identify dominant terms in the source text. To overcome the sparsity of the data and resulting vocabulary mismatches between source text and the advertiser's chosen keywords, these terms are then expanded into a set of related keywords using related term mining methods. Our evaluations present a comprehensive analysis of the relative performance for these methods across a range of videos, including professionally produced films and popular videos from YouTube.
Web page classification on child suitability BIBAFull-Text 1425-1428
  Carsten Eickhoff; Pavel Serdyukov; Arjen P. de Vries
Children spend significant amounts of time on the Internet. Recent studies showed, that during these periods they are often not under adult supervision. This work presents an automatic approach to identifying suitable web pages for children based on topical and non-topical web page aspects. We discuss the characteristics of children's web sites with respect to recent findings in children's psychology and cognitive sciences. We finally evaluate our approach in a large-scale user study, finding, that it compares favourably to state of the art methods while approximating human performance.
Rough sets based reasoning and pattern mining for a two-stage information filtering system BIBAFull-Text 1429-1432
  Xujuan Zhou; Yuefeng Li; Peter David Bruza; Yue Xu; Raymond Y. K. Lau
This paper presents a novel two-stage information filtering model which combines the merits of term-based and pattern-based approaches to effectively filter sheer volume of information. In particular, the first filtering stage is supported by a novel rough analysis model which efficiently removes a large number of irrelevant documents, thereby addressing the overload problem. The second filtering stage is empowered by a semantically rich pattern taxonomy mining model which effectively fetches incoming documents according to the specific information needs of a user, thereby addressing the mismatch problem. The experiments have been conducted to compare the proposed two-stage filtering (T-SM) model with other possible "term-based + pattern-based" or "term-based + term-based" IF models. The results based on the RCV1 corpus show that the T-SM model significantly outperforms other types of "two-stage" IF models.
A late fusion approach to cross-lingual document re-ranking BIBAFull-Text 1433-1436
  Dong Zhou; Séamus Lawless; Jinming Min; Vincent Wade
The field of information retrieval still strives to develop models which allow semantic information to be integrated in the ranking process to improve performance in comparison to standard bag-of-words based models. Cross-lingual information retrieval is an example of where such a model is required, as content or concepts often need to be matched across languages. To overcome this problem, a conceptual model has been adopted in ranking an entire corpus which normally exploits latent/implicit features of the text. One of the drawbacks of this model is that the computational cost is significant and often intractable in modern test collections. Therefore, approaches utilizing concept-based models for re-ranking initial retrieval results have attracted a considerable amount of study, in particular the latent concept model. However, fitting such a model to a smaller collection is less meaningful than fitting it into the whole corpus. This paper proposes a late fusion method which incorporates scores generated by using external knowledge to enhance the space produced by the latent concept method. This method is further demonstrated to be suitable for multilingual re-ranking purposes. To illustrate the effectiveness of the proposed method, experiments were conducted over test collections across three languages. The results demonstrate that the method can comfortably achieve improvements in retrieval performance over several re-ranking methods.
Learning to generate summary as structured output BIBAFull-Text 1437-1440
  Hiroya Takamura; Manabu Okumura
We propose to use a structured output learning for summary generation based on the maximum coverage problem. Our method learns a function that outputs the benefit of each conceptual unit in the document cluster for this summarization model. In the training, we iteratively run a greedy algorithm that accepts items (sentences) with different costs (length) in order to generate a summary within the given maximum length limit. We empirically show that the structured output learning works well for this task and also examine its behavior in several different settings.
Group ranking with application to image retrieval BIBAFull-Text 1441-1444
  Ou Wu; Weiming Hu; Bing Li
Many existing ranking-related information processing applications can be summarized into one theoretical problem called group ranking (GR). A simple average-ranking approach is usually applied to GR. Although the approach seems reasonable, no theoretical analysis about its intrinsic mechanism has been presented, increasing the difficulty of evaluating the ranking results. This study provides a formal analysis for GR. We first construct an objective function for the GR problem, and discover that each GR problem can be transformed into a rank aggregation problem whose objective function is proved to be equal to the objective function of GR. As a consequence, the average-ranking approach can be explained by two well-known rank aggregation techniques. We incorporate two other effective rank aggregation methods into the GR problem and obtain two new GR algorithms. We apply the GR algorithms into image retrieval to diversify the image search results returned by search engines. Experimental results show the effectiveness of the proposed GR algorithms.
Unifying explicit and implicit feedback for collaborative filtering BIBAFull-Text 1445-1448
  Nathan N. Liu; Evan W. Xiang; Min Zhao; Qiang Yang
Most collaborative filtering algorithms are based on certain statistical models of user interests built from either explicit feedback (eg: ratings, votes) or implicit feedback (eg: clicks, purchases). Explicit feedbacks are more precise but more difficult to collect from users while implicit feedbacks are much easier to collect though less accurate in reflecting user preferences. In the existing literature, separate models have been developed for either of these two forms of user feedbacks due to their heterogeneous representation. However in most real world recommended systems both explicit and implicit user feedback are abundant and could potentially complement each other. It is desirable to be able to unify these two heterogeneous forms of user feedback in order to generate more accurate recommendations. In this work, we developed matrix factorization models that can be trained from explicit and implicit feedback simultaneously. Experimental results of multiple datasets showed that our algorithm could effectively combine these two forms of heterogeneous user feedback to improve recommendation quality.
Directly optimizing evaluation measures in learning to rank based on the clonal selection algorithm BIBAFull-Text 1449-1452
  Qiang He; Jun Ma; Shuaiqiang Wang
One fundamental issue of learning to rank is the choice of loss function to be optimized. Although the evaluation measures used in Information Retrieval (IR) are ideal ones, in many cases they can't be used directly because they do not satisfy the smooth property needed in conventional machine learning algorithms. In this paper a new method named RankCSA is proposed, which tries to use IR evaluation measure directly. It employs the clonal selection algorithm to learn an effective ranking function by combining various evidences in IR. Experimental results on the LETOR benchmark datasets demonstrate that RankCSA outperforms the baseline methods in terms of P@n, MAP and NDCG@n.
Query model refinement using word graphs BIBAFull-Text 1453-1456
  Yunping Huang; Le Sun; Jian-Yun Nie
Pseudo relevance feedback method is an effective method for query model refinement. Most existing pseudo relevance feedback methods only take into consideration the term distribution of the feedback documents, but omit the term's context information. This paper presents a graph-based method to improve query models, in which a word graph is constructed to encode terms and their co-occurrence dependencies within the feedback documents. Using a random walk, the weight of each term in the graph can be determined in a context-dependent manner, i.e. the weight of a term is strongly dependent on the weights of the connected context terms. Our experimental results on four TREC collections show that our proposed approach is more effective than the existing state-of-the-art approaches.
Building recommendation systems using peer-to-peer shared content BIBAFull-Text 1457-1460
  Yuval Shavitt; Ela Weinsberg; Udi Weinsberg
Peer-to-Peer (p2p) networks are used for sharing content by millions of users. Often, meta-data used for searching is missing or wrong, making it difficult for users to find content. Moreover, searching for new content is almost impossible. Recommender systems are unable to handle p2p data due to inherent difficulties, such as implicit ranking, noise and the extreme dimensions and sparseness of the network.
   This paper introduces methods for using p2p data in recommender systems. We present a method for creating content-similarity graph while overcoming inherent noise. Using this graph, a clustering method is presented for detecting proximity between files using the "wisdom-of-the-crowds". Evaluation using songs shared by over 1.2 million users in the Gnutella network, shows that these techniques can leverage p2p data for building efficient recommender systems.
Threshold behavior of incentives in social networks BIBAFull-Text 1461-1464
  Nagaraj Kota; Y. Narahari
The advent of large scale online social networks has resulted in a spurt of studies on the user participation in the networks. We consider a query incentive model on social networks, where user's queries are answered through her friendship network and there are 'rewards' or 'incentives' in the system to answer the queries utilizing ones community. We model the friendship network as a random graph with power-law degree distribution, and show that the reward function exhibits a theoretic threshold behavior on the scaling exponent α, a network parameter. Specifically, there exists a threshold on α above which the reward is exponential in the average path length in the network and below the threshold, the reward is proportional to the average path length. We demonstrate this finding on simulated power-law networks and real world data gathered from online social media such as Flickr, Digg, YouTube and Yahoo! Answers.
Image retrieval at memory's edge: known image search based on user-drawn sketches BIBAFull-Text 1465-1468
  Michael Springmann; Ihab Al Kabary; Heiko Schuldt
With the increasingly growing size of digital image collections, known image search is gaining more and more importance. Especially in collections where individual objects are not tagged with metadata describing their content, content-based image retrieval (CBIR) is a promising approach, but usually suffers from the unavailability of query images that are good enough to express the user's information need. In this paper, we present the QbS system that provides CBIR based on user-drawn sketches. The QbS system combines angular radial partitioning for the extraction of features in the user-provided sketch, taking into account the spatial distribution of edges, and the image distortion model. This combination offers several highly relevant invariances that allow the query sketch to slightly deviate from the searched image in terms of rotation, translation, relative size, and/or unknown objects in the background. To illustrate the benefits of the approach, we present search results from the evaluation of the QbS system on the basis of the MIRFLICKR collection with 25,000 objects and compare the retrieval results of pure metadata-driven approaches, pure content-based retrieval using different sketches, and combinations thereof.
Utilizing re-finding for personalized information retrieval BIBAFull-Text 1469-1472
  Sarah K. Tyler; Jian Wang; Yi Zhang
Individuals often use search engines to return to web pages they have previously visited. This behaviour, called re-finding, accounts for about 38% of all queries. While researchers have shown how re-finding differs from traditionally studied new-findings, research on how to predict and utilize re-finding is limited. In this paper we explore re-finding for personalized search. We compared three machine learning algorithms (decision trees, Bayesian multinomial regression and support vector machines) to identify re-findings. We then propose several re-ranking methods to utilize the prediction, including promoting predicted re-finding URLs and combining re-finding prediction with relevance estimation. The experimental results demonstrate that using re-finding predictions can improve retrieval performance for personalized search.
User behavior driven ranking without editorial judgments BIBAFull-Text 1473-1476
  Taesup Moon; Georges Dupret; Shihao Ji; Ciya Liao; Zhaohui Zheng
We explore the potential of using users click-through logs where no editorial judgment is available to improve the ranking function of a vertical search engine. We base our analysis on the Cumulate Relevance Model, a user behavior model recently proposed as a way to extract relevance signal from click-through logs. We propose a novel way of directly learning the ranking function, effectively by-passing the need to have explicit editorial relevance label for each query-document pair. This approach potentially adjusts more closely the ranking function to a variety of user behaviors both at the individual and at the aggregate levels. We investigate two ways of using behavioral model; First, we consider the parametric approach where we learn the estimates of document relevance and use them as targets for the machine learned ranking schemes. In the second, functional approach, we learn a function that maximizes the behavioral model likelihood, effectively by-passing the need to estimate a substitute for document labels. Experiments using user session data collected from a commercial vertical search engine demonstrate the potential of our approach. While in terms of DCG, the editorial model out-perform the behavioral one, online experiments show that the behavioral model is on par -- if not superior -- to the editorial model. To our knowledge, this is the first report in the Literature of a competitive behavioral model in a commercial setting.
Alignment of short length parallel corpora with an application to web search BIBAFull-Text 1477-1480
  Jitendra Ajmera; Hema Swetha Koppula; Krishna P. Leela; Shibnath Mukherjee; Mehul Parsana
With evolving Web, short length parallel corpora is becoming very common and some of these include user queries, web snippets etc. This paper concerns situations where short length parallel corpora has to be analyzed in order to find meaningful unit-alignment. This is similar to dealing with parallel corpora where a sentence level alignment of translations is required, but differs in that the alignment is to be inferred at unit (word or phrase) level. A Conditional Random Field (CRF) based approach is proposed to discover this unit alignment. Given pairs of semantically or syntactically similar entities, the problem is formulated as that of mutual segmentation and sequence alignment problem. The mutual segmentation refers to the process of segmenting the first entity based on units (or labels) in the second entity and vice-versa. The process of optimizing this mutual segmentation also results in optimal unit alignment. Since our training data is not segmented and unit-aligned, we modify the CRF objective function to accommodate unsupervised data and iterative learning. We have applied this framework to Web Search domain and specifically for query reformulation task. Finally, our experiments suggest that the proposed approach indeed results in meaningful alternatives of the original query.
A feature-word-topic model for image annotation BIBAFull-Text 1481-1484
  Cam-Tu Nguyen; Natsuda Kaothanthong; Xuan-Hieu Phan; Takeshi Tokuyama
Image annotation is to automatically associate semantic labels with images in order to obtain a more convenient way for indexing and searching images on the Web. This paper proposes a novel method for image annotation based on feature-word and word-topic distributions. The introduction of topics enables us to efficiently take word associations, such as {ocean, fish, coral}, into image annotation. Feature-word distributions are utilized to define weights in computation of topic distributions for annotation. By doing so, topic models in text mining can be applied directly in our method. Experiments show that our method is able to obtain promising improvements over the state-of-the-art method -- Supervised Multiclass Labeling (SML).
Weighting common syntactic structures for natural language based information retrieval BIBAFull-Text 1485-1488
  Chang Liu; Hui Wang; Sally McClean; Epaminondas Kapetanios; Denis Carroll
Natural Language Processing (NLP) techniques are believed to hold the potential to assist "bag-of-words" Information Retrieval (IR) in terms of retrieval accuracy. In this paper, we report a natural language based IR approach where the common syntactic structures between documents and the query is regarded to as a query-dependent feature for documents. Specifically, a "structural weight" is proposed for query terms, which can be seen as a weight to model the degree of term's involvement in the common syntactic structures. This structural weight is used together with the TF-IDF weighting scheme, which results in a new ranking function. The accumulation of this structural weight of all the query terms in the new ranking function will be seen as a measure of how much a document and a query share the common syntactic structures. The experimental results show that by using this ranking function, significant improvements in the retrieval performance are achieved.
Ranking with auxiliary data BIBAFull-Text 1489-1492
  Bo Long; Yi Chang; Srinivas Vadrevu; Shuang Hong Yang; Zhaohui Zheng
Learning to rank arises in many information retrieval applications, ranging from Web search engine, online advertising to recommendation system. In learning to rank, the performance of a ranking function heavily depends on the number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data is very expensive and time-consuming. This presents a great need for making use of available auxiliary data, i.e., the within-domain unlabeled data and the out-of-domain labeled data. In this paper, we propose a general framework for ranking with auxiliary data, which is applicable to various ranking applications. Under this framework, we derive a generic ranking algorithm to effectively make use of both the within-domain unlabeled data and the out-of-domain labeled data. The proposed algorithm iteratively learns ranking functions for target domain and source domains and enforces their consensus on the unlabeled data in the target domain.
Using various term dependencies according to their utilities BIBAFull-Text 1493-1496
  Lixin Shi; Jian-Yun Nie
In this paper, we propose a model to integrate term dependencies. Different from previous studies, each pair of terms is assigned a different weight of dependency according to their utility to IR. The experiments show that our model can significantly outperform the previous dependency models using fixed weights.
Modeling reformulation using passage analysis BIBAFull-Text 1497-1500
  Xiaobing Xue; W. Bruce Croft; David A. Smith
Query reformulation modifies the original query with the aim of better matching the vocabulary of the relevant documents, and consequently improving ranking effectiveness. Previous techniques typically generate words and phrases related to the original query, but do not consider how these words and phrases would fit together in new queries. In this paper, we focus on an implementation of an approach that models reformulation as a distribution of queries, where each query is a variation of the original query. This approach considers a query as a basic unit and can capture important dependencies between words and phrases in the query. The implementation discussed here is based on passage analysis of the target corpus. Experiments on the TREC collection show that the proposed model for query reformulation significantly outperforms state-of-the-art methods.
Online learning for recency search ranking using real-time user feedback BIBAFull-Text 1501-1504
  Taesup Moon; Lihong Li; Wei Chu; Ciya Liao; Zhaohui Zheng; Yi Chang
Traditional machine-learned ranking algorithms for web search are trained in batch mode, which assume static relevance of documents for a given query. Although such a batch-learning framework has been tremendously successful in commercial search engines, in scenarios where relevance of documents to a query changes over time, such as ranking recent documents for a breaking news query, the batch-learned ranking functions do have limitations. Users' real-time click feedback becomes a better and timely proxy for the varying relevance of documents rather than the editorial judgments provided by human editors. In this paper, we propose an online learning algorithm that can quickly learn the best re-ranking of the top portion of the original ranked list based on real-time users' click feedback. In order to devise our algorithm and evaluate it accurately, we collected exploration bucket data that removes positional biases on clicks on the documents for recency-classified queries. Our initial experimental result shows that our scheme is more capable of quickly adjusting the ranking to track the varying relevance of documents reflected in the click feedback, compared to batch-trained ranking functions.
Expert identification in community question answering: exploring question selection bias BIBAFull-Text 1505-1508
  Aditya Pal; Joseph A. Konstan
Community Question Answering (CQA) services enables users to ask and answer questions. In these communities, there are typically a small number of experts amongst the large population of users. We study which questions a user select for answering and show that experts prefer answering questions where they have a higher chance of making a valuable contribution. We term this preferential selection as question selection bias and propose a mathematical model to estimate it. Our results show that using Gaussian classification models we can effectively distinguish experts from ordinary users over their selection biases. In order to estimate these biases, only a small amount of data per user is required, which makes an early identification of expertise a possibility. Further, our study of bias evolution reveals that they do not show significant changes over time indicating that they emanates from the intrinsic characteristics of users.
On the relationship between novelty and popularity of user-generated content BIBAFull-Text 1509-1512
  David Carmel; Haggai Roitman; Elad Yom-Tov
This work deals with the task of predicting the popularity of user-generated content. We demonstrate how the novelty of newly published content plays an important role in affecting its popularity. We study three dimensions of novelty: contemporaneous novelty, self novelty, and discussion novelty. We demonstrate the contribution of the new novelty measures to estimating blog-post popularity by predicting the number of comments expected for a fresh post. We further demonstrate how novelty based measures can be utilized for predicting the citation volume of academic papers.
Focused crawling using navigational rank BIBAFull-Text 1513-1516
  Shicong Feng; Li Zhang; Yuhong Xiong; Conglei Yao
The goal of focused crawling is to use limited resources to effectively discover web pages related to a specific topic rather than downloading all accessible web documents. The major challenge in focused crawling is how to effectively determine each hyperlink's capability of leading to target pages. To compute this capability, we present a novel approach, called Navigational Rank (NR). In general, NR is a kind of two-step and two-direction credit propagation approach. Compared to existing methods, NR mainly has three advantages. First, NR is dynamically updated during the crawling progress, which can adapt to different website structures very well. Second, when the crawling seed is far away from the target pages, and the target pages only constitute a small portion of the whole website, NR shows a significant performance advantage. Third, NR computes each link's capability of leading to target pages by considering both the target and non-target pages it leads to. This global knowledge causes a better performance than only using target pages. We have performed extensive experiments for performance evaluation of the proposed approach using two groups of large-scale, real-world datasets from two different domains. The experimental results show that our approach is domain-independent and significantly outperforms the state-of-arts.
TAER: time-aware entity retrieval-exploiting the past to find relevant entities in news articles BIBAFull-Text 1517-1520
  Gianluca Demartini; Malik Muhammad Saad Missen; Roi Blanco; Hugo Zaragoza
Retrieving entities instead of just documents has become an important task for search engines. In this paper we study entity retrieval for news applications, and in particular the importance of the news trail history (i.e., past related articles) in determining the relevant entities in current articles. This is an important problem in applications that display retrieved entities to the user, together with the news article.
   We analyze and discuss some statistics about entities in news trails, unveiling some unknown findings such as the persistence of relevance over time. We focus on the task of query dependent entity retrieval over time. For this task we evaluate several features, and show that their combinations significantly improves performance.
Demographic information flows BIBAFull-Text 1521-1524
  Ingmar Weber; Alejandro Jaimes
In advertising and content relevancy prediction it is important to understand whether, over time, information that reaches one demographic group spreads to others. In this paper we analyze the query log of a large U.S. web search engine to determine whether the same queries are performed by different demographic groups at different times, particularly when there are query bursts. We obtain aggregate demographic features from user-provided registration information (gender, birth year, ZIP code), U.S. census data, and election results. Given certain queries, we examine trends (from high to low and vice versa) and changes in the statistical spread of the demographic features of users that issue the queries over time periods that include query bursts. Our analysis shows that for certain types of queries (movies and news) distinct demographic groups perform searches at different times, suggesting that information related to such queries flows between them. Queries of movie titles, for instance, tend to be issued first by young and then by older users, where a sudden jump in age occurs upon the movie's release. To the best of our knowledge, this is the first time this problem has been studied using search query logs.
Detecting periodic changes in search intentions in a search engine BIBAFull-Text 1525-1528
  Masaya Murata; Hiroyuki Toda; Yumiko Matsuura; Ryoji Kataoka; Takayoshi Mochizuki
Information needs expressed by using the same query for a search engine might be totally different, whether on week days or weekends, or during the day or at night. For queries having no temporal changes in search intentions, the same search results ranking may be returned regardless of the time, but for those with temporal changes the ranking must be suitably altered depending on the time of input. To achieve time-dependent search results rankings, we focus on the temporal changes in the search intentions. We present the results obtained by analyzing a commercial search engine log and propose a method of detecting queries showing periodic changes in the search intentions.
Recommendation based on object typicality BIBAFull-Text 1529-1532
  Yi Cai; Ho-fung Leung; Qing Li; Jie Tang; Juanzi Li
Current recommendation methods are mainly classified into content-based, collaborative filtering and hybrid methods. These methods are based on similarity measurements among items or users. In this paper, we investigate recommendation systems from a new perspective based on object typicality and propose a novel typicality-based recommendation approach. Experiments show that our method outperforms compared methods on recommendation quality.
Selecting keywords for content based recommendation BIBAFull-Text 1533-1536
  Christian Wartena; Wout Slakhorst; Martin Wibbels
The continued growth of online content makes personalized recommendation an increasingly important tool for media consumption. While collaborative filtering techniques have shown to be very successful in stable collections, content based approaches are necessary for recommending new items. Content based recommendation uses the similarity between new items and consumed items to predict whether a new item is interesting for the user. The similarity is computed by comparing the content or the meta-data of the items. In this paper we consider recommendation of TV-broadcasts for which meta-data and synopses are available. We thereby concentrate on the new item problem. We investigate the value of different types of meta-data provided by the broadcaster or extracted from synopsis. We show that extracted keywords are better suited for recommendation than manually assigned keywords. Furthermore we show that the number of keywords used is of great importance. Using a rather small number of keywords to present an item yields the best results for recommendation.
Structural annotation of search queries using pseudo-relevance feedback BIBAFull-Text 1537-1540
  Michael Bendersky; W. Bruce Croft; David A. Smith
Marking up queries with annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of many approaches to query processing and understanding. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing annotation tools that are commonly trained on full-length documents. To address this challenge, we view the query as an explicit representation of a latent information need, which allows us to use pseudo-relevance feedback, and to leverage additional information from the document corpus, in order to improve the quality of query annotation.
Search as if you were in your home town: geographic search by regional context and dynamic feature-space selection BIBAFull-Text 1541-1544
  Makoto P. Kato; Hiroaki Ohshima; Satoshi Oyama; Katsumi Tanaka
We propose a query-by-example geographic object search method for users that do not know well about the place they are in. Geographic objects, such as restaurants, are often retrieved using an attribute-based or keyword query. These queries, however, are difficult to use for users that have little knowledge on the place where they want to search. The proposed query-by-example method allows users to query by selecting examples in familiar places for retrieving objects in unfamiliar places. One of the challenges is to predict an effective distance metric, which varies for individuals. Another challenge is to calculate the distance between objects in heterogeneous domains considering the feature gap between them, for example, restaurants in Japan and China. Our proposed method is used to robustly estimate the distance metric by amplifying the difference between selected and non-selected examples. By using the distance metric, each object in a familiar domain is evenly assigned to one in an unfamiliar domain to eliminate the difference between those domains. We developed a restaurant search using data obtained from a Japanese restaurant Web guide to evaluate our method.
Topic aspect analysis for multi-document summarization BIBAFull-Text 1545-1548
  Chao Shen; Dingding Wang; Tao Li
Query-based multi-document summarization aims to create a short summary given a collection of documents and a query. Most of the existing methods treat the query as one single sentence and rank the sentences in the documents based on their similarities with the query sentence. However, these methods lack of intensive analysis on the given query which typically consist of several topic aspects. In this paper, we propose a topic aspect extraction method to discover the aspect words and sentences contained in the query narrative texts and the input documents, and then incorporate these aspect words and sentences into a cross propagation model based on the sentence-term bipartite graph for document summarization. Experiments on DUC benchmark data show the effectiveness of our proposed approach on the topic-driven document summarization task.
Finding unusual review patterns using unexpected rules BIBAFull-Text 1549-1552
  Nitin Jindal; Bing Liu; Ee-Peng Lim
In recent years, opinion mining attracted a great deal of research attention. However, limited work has been done on detecting opinion spam (or fake reviews). The problem is analogous to spam in Web search [1, 9 11]. However, review spam is harder to detect because it is very hard, if not impossible, to recognize fake reviews by manually reading them [2]. This paper deals with a restricted problem, i.e., identifying unusual review patterns which can represent suspicious behaviors of reviewers. We formulate the problem as finding unexpected rules. The technique is domain independent. Using the technique, we analyzed an Amazon.com review dataset and found many unexpected rules and rule groups which indicate spam activities.
Visual-semantic graphs: using queries to reduce the semantic gap in web image retrieval BIBAFull-Text 1553-1556
  Barbara Poblete; Benjamin Bustos; Marcelo Mendoza; Juan Manuel Barrios
We explore the application of a graph representation to model similarity relationships that exist among images found on the Web. The resulting similarity-induced graph allows us to model in a unified way different types of content-based similarities, as well as semantic relationships. Content-based similarities include different image descriptors, and semantic similarities can include relevance user feedback from search engines. The goal of our representation is to provide an experimental framework for combining apparently unrelated metrics into a unique graph structure, which allows us to enhance the results of Web image retrieval. We evaluate our approach by re-ranking Web image search results.
Novel local features with hybrid sampling technique for image retrieval BIBAFull-Text 1557-1560
  Leszek Kaliciak; Dawei Song; Nirmalie Wiratunga; Jeff Pan
In image retrieval, most existing approaches that incorporate local features produce high dimensional vectors, which lead to a high computational and data storage cost. Moreover, when it comes to the retrieval of generic real-life images, randomly generated patches are often more discriminant than the ones produced by corner/blob detectors. In order to tackle these problems, we propose a novel method incorporating local features with a hybrid sampling (a combination of detector-based and random sampling). We take three large data collections for the evaluation: MIRFlickr, ImageCLEF, and a collection from British National Geological Survey. The overall performance of the proposed approach is better than the performance of global features and comparable with the current state-of-the-art methods in content-based image retrieval. One of the advantages of our method when compared with others is its easy implementation and low computational cost. Another is that hybrid sampling can improve the performance of other methods based on the "bag of visual words" approach.
Expected browsing utility for web search evaluation BIBAFull-Text 1561-1564
  Emine Yilmaz; Milad Shokouhi; Nick Craswell; Stephen Robertson
Most information retrieval evaluation metrics are designed to measure the satisfaction of the user given the results returned by a search engine. In order to evaluate user satisfaction, most of these metrics have underlying user models, which aim at modeling how users interact with search engine results. Hence, the quality of an evaluation metric is a direct function of the quality of its underlying user model. This paper proposes EBU, a new evaluation metric that uses a sophisticated user model tuned by observations over many thousands of real search sessions. We compare EBU with a number of state of the art evaluation metrics and show that it is more correlated with real user behavior captured by clicks.
Community-based topic modeling for social tagging BIBAFull-Text 1565-1568
  Daifeng Li; Bing He; Ying Ding; Jie Tang; Cassidy Sugimoto; Zheng Qin; Erjia Yan; Juanzi Li; Tianxi Dong
Exploring community is fundamental for uncovering the connections between structure and function of complex networks and for practical applications in many disciplines such as biology and sociology. In this paper, we propose a TTR-LDA-Community model which combines the Latent Dirichlet Allocation model (LDA) and the Girvan-Newman community detection algorithm with an inference mechanism. The model is then applied to data from Delicious, a popular social tagging system, over the time period of 2005-2008. Our results show that 1) users in the same community tend to be interested in similar set of topics in all time periods; and 2) topics may divide into several sub-topics and scatter into different communities over time. We evaluate the effectiveness of our model and show that the TTR-LDA-Community model is meaningful for understanding communities and outperforms TTR-LDA and LDA models in tag prediction.
Discriminative factored prior models for personalized content-based recommendation BIBAFull-Text 1569-1572
  Lanbo Zhang; Yi Zhang
Most existing content-based filtering approaches including Rocchio, Language Models, SVM, Logistic Regression, Neural Networks, etc. learn user profiles independently without capturing the similarity among users. The Bayesian hierarchical models learn user profiles jointly and have the advantage of being able to borrow information from other users through a Bayesian prior. The standard Bayesian hierarchical model assumes all user profiles are generated from the same prior. However, considering the diversity of user interests, this assumption might not be optimal. Besides, most existing content-based filtering approaches implicitly assume that each user profile corresponds to exactly one user interest and fail to capture a user's multiple interests (information needs).
   In this paper, we present a flexible Bayesian hierarchical modeling approach to model both commonality and diversity among users as well as individual users' multiple interests. We propose two models each with different assumptions, and the proposed models are called Discriminative Factored Prior Models (DFPM). In our models, each user profile is modeled as a discriminative classifier with a factored model as its prior, and different factors contribute in different levels to each user profile. Compared with existing content-based filtering models, DFPM are interesting because they can 1) borrow discriminative criteria of other users while learning a particular user profile through the factored prior; 2) trade off well between diversity and commonality among users; and 3) handle the challenging classification situation where each class contains multiple concepts. The experimental results on a dataset collected from real users on digg.com show that our models significantly outperform the baseline models of L-2 regularized logistic regression and the standard Bayesian hierarchical model with logistic regression.
Fast query expansion using approximations of relevance models BIBAFull-Text 1573-1576
  Marc-Allen Cartright; James Allan; Victor Lavrenko; Andrew McGregor
Pseudo-relevance feedback (PRF) improves search quality by expanding the query using terms from high-ranking documents from an initial retrieval. Although PRF can often result in large gains in effectiveness, running two queries is time consuming, limiting its applicability. We describe a PRF method that uses corpus pre-processing to achieve query-time speeds that are near those of the original queries. Specifically, Relevance Modeling, a language modeling based PRF method, can be recast to benefit substantially from finding pairwise document relationships in advance. Using the resulting Fast Relevance Model (fastRM), we substantially reduce the online retrieval time and still benefit from expansion. We further explore methods for reducing the preprocessing time and storage requirements of the approach, allowing us to achieve up to a 10% increase in MAP over unexpanded retrieval, while only requiring 1% of the time of standard expansion.
Mining rules to explain activities in videos BIBAFull-Text 1577-1580
  Omar U. Florez; Curtis Dyreson
We present a novel approach to mining dependency rules that explain the scenes present during a video sequence. The approach first characterizes activities based on their most important events. Next, an HMM-based approach finds the mixture components that best describe the clustering dependencies between events and activities in video data. The dependencies among activities are taken as association patterns with temporal precedence and analyzed using their co-occurrence relationships in time windows. This technique is meant to understand the multiple actions taken in a video or to predict future occurrences of certain activities.
Online stratified sampling: evaluating classifiers at web-scale BIBAFull-Text 1581-1584
  Paul N. Bennett; Vitor R. Carvalho
Deploying a classifier to large-scale systems such as the web requires careful feature design and performance evaluation. Evaluation is particularly challenging because these large collections frequently change. In this paper we adapt stratified sampling techniques to evaluate the precision of classifiers deployed in large-scale systems. We investigate different types of stratification strategies, and then we derive a new online sampling algorithm that incrementally approximates the theoretical optimal disproportionate sampling strategy. In experiments, the proposed algorithm significantly outperforms both simple random sampling as well as other types of stratified sampling, with an average reduction of about 20% in labeling effort to reach the same confidence and interval-bounds on precision.
Routing questions to appropriate answerers in community question answering services BIBAFull-Text 1585-1588
  Baichuan Li; Irwin King
Community Question Answering (CQA) service provides a platform for increasing number of users to ask and answer for their own needs but unanswered questions still exist within a fixed period. To address this, the paper aims to route questions to the right answerers who have a top rank in accordance of their previous answering performance. In order to rank the answerers, we propose a framework called Question Routing (QR) which consists of four phases: (1) performance profiling, (2) expertise estimation, (3) availability estimation, and (4) answerer ranking. Applying the framework, we conduct experiments with Yahoo! Answers dataset and the results demonstrate that on average each of 1,713 testing questions obtains at least one answer if it is routed to the top 20 ranked answerers.
Learning to rank with groups BIBAFull-Text 1589-1592
  Yuan Lin; Hongfei Lin; Zheng Ye; Song Jin; Xiaoling Sun
An essential issue in document retrieval is ranking, and the documents are ranked by their expected relevance to a given query. Multiple labels are used to represent different level of relevance for documents to a given query, and the corresponding label values are used to quantify the relevance of the documents. According to the training set for a given query, the documents can be divided into several groups. Specifically, the documents with the same label are assigned to the same group. If the documents in the group with higher relevance label can always be ranked higher over the ones in groups with lower relevance label by a ranking model, it is reasonable to expect perfect ranking performance. Inspired by this idea, we propose a novel framework for learning to rank, which depends on two new samples. The first one is one-group constituted by one document with higher level label and a group of documents with lower level label; the second one is group-group constituted by a group of documents with higher level label and a group of documents with lower level label. A novel loss function is proposed based on the likelihood loss similar to ListMLE. We demonstrate the advantages of our approaches on the Letor 3.0 data set. Experimental results show that our approaches are effective in improving the ranking performance.
Optimizing unified loss for web ranking specialization BIBAFull-Text 1593-1596
  Fan Li; Xin Li; Jiang Bian; Zhaohui Zheng
In this paper, we proposed a novel divide-and-conquer approach to optimize the overall relevance in an unified framework for query clustering and query-based ranking. In our model, latent topics and specialized ranking models are learned iteratively so that an unified objective function, which lower-bounds the conditional probability of observed grades annotated by human editors on training data, is maximized. We conducted experiments comparing the proposed method with several baseline approaches on two data-sets. Experimental results illustrate that our method can significantly improve the ranking relevance over these baselines.
Hypergraph-based multilevel matrix approximation for text information retrieval BIBAFull-Text 1597-1600
  Haw-ren Fang; Yousef Saad
In Latent Semantic Indexing (LSI), a collection of documents is often pre-processed to form a sparse term-document matrix, followed by a computation of a low-rank approximation to the data matrix. A multilevel framework based on hypergraph coarsening is presented which exploits the hypergraph that is canonically associated with the sparse term-document matrix representing the data. The main goal is to reduce the cost of the matrix approximation without sacrificing accuracy. Because coarsening by multilevel hypergraph techniques is a form of clustering, the proposed approach can be regarded as a hybrid of factorization-based LSI and clustering-based LSI. Experimental results indicate that our method achieves good improvement of the retrieval performance at a reduced cost.
A peer-selection algorithm for information retrieval BIBAFull-Text 1601-1604
  Yosi Mass; Yehoshua Sagiv; Michal Shmueli-Scheuer
A novel method for creating collection summaries is developed, and a fully decentralized peer-selection algorithm is described. This algorithm finds the most promising peers for answering a given query. Specifically, peers publish per-term synopses of their documents. The synopses of a peer for a given term are divided into score intervals and for each interval, a KMV (K Minimal Values) synopsis of its documents is created. The synopses are used to effectively rank peers by their relevance to a multi-term query.
   The proposed approach is verified by experiments on a large real-world dataset. In particular, two collections were created from this dataset, each with a different number of peers. Compared to the state-of-the-art approaches, the proposed method is effective and efficient even when documents are randomly distributed among peers.
Exploring domain-specific term weight in archived question search BIBAFull-Text 1605-1608
  Zhao-Yan Ming; Tat-Seng Chua; Gao Cong
Community Question Answering services, e.g., Yahoo! Answers, have accumulated large archives of question answer (QA) pairs for information and answer retrieval. An effective question retrieval model is essential to increase the accessibility of the QA archives. QA archives are usually organized into categories and question search can be performed within the whole collection or within a certain category..
   In this paper, we explore domain-specific term weight for archived question search. Specifically, we propose a novel light-weighted term weighting scheme that exploits multiple aspects of the domain information. We also introduce a framework to seamlessly integrate domain-specific term weight into the existing retrieval models. Extensive experiments conducted on real Archived QA data demonstrate the utility of the proposed techniques.
Multi-information fusion for uncertain semantic representations of videos BIBAFull-Text 1609-1612
  Bo Lu; Guoren Wang; Xiaofeng Gong
Concept-Based Semantic Video Retrieval (CBSVR) usually uses semantic representations of videos to handle user's retrieval requests. It is obvious that the accuracy of semantic video retrieval depends on results of concept detectors, but the detection results are usually imprecise and uncertain. In this paper, we propose a multi-information fusion approach (MIF) which is dedicated to solving the problem of uncertain semantic representations of videos for improving retrieval accuracy. This approach is based on a novel two-phase framework that involves the inferring phase and the fusing phase. In the inferring phase, the most relevant concepts to the user's query are chosen by exploring both contextual correlation among concepts and temporal correlation among shots. In the fusing phase, the inferred probabilities of the related concepts are fused together with the detection results via minimization of potential function to refine the detector prediction. Experiments on the widely used TRECVID datasets demonstrate that our approach can effectively improve the accuracy of semantic concept detection.

Poster session 3: KM track

A topical link model for community discovery in textual interaction graph BIBAFull-Text 1613-1616
  Guoqing Zheng; Jinwen Guo; Lichun Yang; Shengliang Xu; Shenghua Bao; Zhong Su; Dingyi Han; Yong Yu
This paper is concerned with community discovery in textual interaction graph, where the links between entities are indicated by textual documents. Specifically, we propose a Topical Link Model (TLM), which leverages Hierarchical Dirichlet Process (HDP) to introduce hidden topical variable of the links. Other than the use of links, TLM can look into the documents on the links in detail to recover sound communities. Moreover, TLM is a nonparametric model, which is able to learn the number of communities from the data. Extensive experiments on two real world corpora show TLM outperforms two state-of-the-art baseline models, which verify the effectiveness of TLM in determining the proper number of communities and generating sound communities.
Taxonomic clustering of web service for efficient discovery BIBAFull-Text 1617-1620
  Sourish Dasgupta; Satish Bhat; Yugyung Lee
The World Wide Web (WWW) has become a major platform for hosting, discovering, and composing web services. Web service clustering is a technique for efficiently facilitating web service discovery. Most web service clustering approaches are based on suitable semantic similarity distance measure and a threshold. Threshold selection is essentially difficult and often leads to unsatisfactory accuracy. In this paper we propose a taxonomic clustering algorithm for grouping functionally similar web services. We have tested the algorithm on both simulation based randomly generated test data and the standard OWL-S TC test data set. We have observed promising results both in terms of accuracy and performance.
Active learning in parallel universes BIBAFull-Text 1621-1624
  Nicolas Cebron; Michael R. Berthold
This work addresses two challenges in combination: learning with a very limited number of labeled training examples (active learning) and learning in the presence of multiple views for each object where the global model to be learned is spread out over some or all of these views (learning in parallel universes). We propose a new active learning approach which selects the best samples to query the label with the goal of improving overall model accuracy and determining which universe contributes most to the local model. The resulting combination and class-specific weighting of universes provides a significantly better classification accuracy than traditional active learning methods.
TAGME: on-the-fly annotation of short text fragments (by wikipedia entities) BIBAFull-Text 1625-1628
  Paolo Ferragina; Ugo Scaiella
We designed and implemented TAGME, a system that is able to efficiently and judiciously augment a plain-text with pertinent hyperlinks to Wikipedia pages. The specialty of TAGME with respect to known systems [5,8] is that it may annotate texts which are short and poorly composed, such as snippets of search-engine results, tweets, news, etc.. This annotation is extremely informative, so any task that is currently addressed using the bag-of-words paradigm could benefit from using this annotation to draw upon (the millions of) Wikipedia pages and their inter-relations.
Adaptive outlierness for subspace outlier ranking BIBAFull-Text 1629-1632
  Emmanuel Müller; Matthias Schiffer; Thomas Seidl
Outlier mining is an important data analysis task to distinguish exceptional outliers from regular objects. However, in recent applications traditional outlier mining approaches miss outliers as they are hidden in subspace projections.
   In this work, we propose a novel outlier ranking based on the degree of deviation in subspaces. Object deviation is measured only in a selection of relevant subspaces and is based on adaptive neighborhoods in these subspaces. We show that our approach outperforms competing outlier ranking approaches by detecting outliers in arbitrary subspaces.
Understanding retweeting behaviors in social networks BIBAFull-Text 1633-1636
  Zi Yang; Jingyi Guo; Keke Cai; Jie Tang; Juanzi Li; Li Zhang; Zhong Su
Retweeting is an important action (behavior) on Twitter, indicating the behavior that users re-post microblogs of their friends. While much work has been conducted for mining textual content that users generate or analyzing the social network structure, few publications systematically study the underlying mechanism of the retweeting behaviors. In this paper, we perform an interesting analysis for the problem on Twitter. We have found that almost 25.5% of the tweets posted by users are actually retweeted from friends' blog spaces. Our investigation unveils that for the retweet behaviors, some statistics still follows the power law distribution, while some others violate the state-of-the-art distribution for Web. Based on these important observations, we propose a factor graph model to predict users' retweeting behaviors. Experimental results on the Twitter data set show that our method can achieve a precision of 28.81% and recall of 37.33% for prediction of the retweet behaviors.
Mapping web pages to database records via link paths BIBAFull-Text 1637-1640
  Tim Weninger; Fabio Fumarola; Jiawei Han; Donato Malerba
In this paper we propose a new knowledge management task which aims to map Web pages to their corresponding records in a structured database. For example, the DBLP database contains records for many computer scientists, and most of these persons have public Web pages; if we can map the database record with the appropriate Web page then the new information could be used to further describe the person's database record. To accomplish this goal we employ link paths which contain anchor texts from multiple paths through the Web ending at the Web page in question. We hypothesize that the information from these link paths can be used to generate an accurate Web page to database record mapping. Experiments on two large, real world data sets, DBLP and IMDB for the structured data and computer science faculty members' Web pages and official movie homepages for the Web page data, show that our method does provide an accurate mapping. Finally, we conclude by issuing a call for further research on this promising new task.
Personalized recommender system based on item taxonomy and folksonomy BIBAFull-Text 1641-1644
  Huizhi Liang; Yue Xu; Yuefeng Li; Richi Nayak
Item folksonomy or tag information is popularly available on the web now. However, since tags are arbitrary words given by users, they contain a lot of noise such as tag synonyms, semantic ambiguities and personal tags. Such noise brings difficulties to improve the accuracy of item recommendations. In this paper, we propose to combine item taxonomy and folksonomy to reduce the noise of tags and make personalized item recommendations. The experiments conducted on the dataset collected from Amazon.com demonstrated the effectiveness of the proposed approaches. The results suggested that the recommendation accuracy can be further improved if we consider the viewpoints and the vocabularies of both experts and users.
Communication motifs: a tool to characterize social communications BIBAFull-Text 1645-1648
  Qiankun Zhao; Yuan Tian; Qi He; Nuria Oliver; Ruoming Jin; Wang-Chien Lee
Social networks mediate not only the relations between entities, but also the patterns of information propagation among them and their communication behavior. In this paper, we extensively study the temporal annotations (e.g., time stamps and duration) of historical communications in social networks and propose two novel tools -- communication motifs and maximum-flow communication motifs -- for characterizations of the patterns of information propagation in social networks. Using these motifs, we verify the following hypothesis in social communication network: 1) the functional behavioral patterns of information propagation within both social networks are stable over time; 2) the patterns of information propagation in synchronous and asynchronous social networks are different and sensitive to the cost of communication; and 3) the speed and the amount of information that is propagated through a network are correlated and dependent on individual profiles.
Improving taxonomies for large-scale hierarchical classifiers of web documents BIBAFull-Text 1649-1652
  Kiyoshi Nitta
We focused on taxonomy modification algorithms for gradually improving the relevance performances of large-scale hierarchical classifiers of web documents. Considering the research results of Tang et al. [5,4], who took the same approach, we investigated and implemented two heuristic taxonomy modification algorithms for performing practical classification processes for large-scale taxonomies. Although a taxonomy modification algorithm continuously improves the relevance performances of hierarchical classifiers, it increases the computational costs of those classifiers for training and predicting processes. We developed an improved taxonomy modification algorithm for reducing computational costs by preventing child node concentration. Although the relevance performances of the algorithm-modified taxonomy classifiers improved without increasing computational costs until the fourth generation by spreading the set of predicted classes, their relevance performances and behaviors went in opposite directions from the fifth generation.
PTM: probabilistic topic mapping model for mining parallel document collections BIBAFull-Text 1653-1656
  Duo Zhang; Jimeng Sun; ChengXiang Zhai; Abhijit Bose; Nikos Anerousis
Many applications generate a large volume of parallel document collections. A parallel document collection consists of two sets of documents where the documents in each set correspond to each other and form semantic pairs (e.g., pairs of problem and solution descriptions in a help-desk setting). Although much work has been done on text mining, little previous work has attempted to mine such a novel kind of text data. In this paper, we propose a new probabilistic topic model, called Probabilistic Topic Mapping (PTM) model, to mine parallel document collections to simultaneously discover latent topics in both sets of documents as well as the mapping of topics in one set to those in the other. We evaluate the PTM model on one real parallel document collection in IT service domain. We show that PTM can effectively discover meaningful topics, as well as their mappings, and it's also useful for improving text matching and retrieval when there's a vocabulary gap.
Hierarchical auto-tagging: organizing Q&A knowledge for everyone BIBAFull-Text 1657-1660
  Kyosuke Nishida; Ko Fujimura
We propose a hierarchical auto-tagging system, TagHats, to improve users' knowledge sharing. Our system assigns three different levels of tags to Q&A documents: category, theme, and keyword. Multiple category tags can organize a document according to multiple viewpoints, and multiple theme and keyword tags can identify what the document is about clearly. Moreover, these hierarchical tags will be helpful in organizing documents to support everyone because different users have different demands in terms of tag specificity. Our system consists of a hierarchical classification method for assigning category and theme tags, a new keyword extraction method that considers the structure of Q&A documents, and a new method for selecting theme tag candidates from each category. Experiments with the documents of Oshiete! goo demonstrate that our system is able to assign hierarchical tags to the documents appropriately and is capable of outperforming baseline methods significantly.
Extracting structured information from Wikipedia articles to populate infoboxes BIBAFull-Text 1661-1664
  Dustin Lange; Christoph Böhm; Felix Naumann
Roughly every third Wikipedia article contains an infobox -- a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes.
   With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values to independently extract value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.
Automatic metadata extraction from multilingual enterprise content BIBAFull-Text 1665-1668
  Melike Sah; Vincent Wade
Enterprises provide professionally authored content about their products/services in different languages for use in web sites and customer care. For customer care, personalization/personalized information delivery is becoming important since it re-encourages users to return to the service provider. Personalization usually requires both contextual and descriptive metadata. But current metadata authored by content developers is usually quite simple. In this paper, we introduce an automatic metadata extraction framework, which can extract multilingual metadata from the enterprise content, for a personalized information retrieval system. We introduce two new ontologies for metadata creation and a novel semi-automatic topic vocabulary extraction algorithm. We demonstrate and evaluate our approach on the English and German Symantec Norton 360 technical content. Evaluations indicate that the proposed approach produces rich and high quality metadata for a personalized information retrieval system.
Intelligent sales forecasting engine using genetic algorithms BIBAFull-Text 1669-1672
  M. Vijayalakshmi; Bernard Menezes; Rohit Menon; Aniket Divecha; Rajesh Ravindran; Kamal Mehta
Times series techniques have been extensively used for Sales forecasting. Research has established that a combination forecast works better than a single forecast. Our research attempts to design an Intelligent Forecasting Engine which will use a combination forecasting technique. This design is based on use of Genetic Algorithms, for selecting the best methods to combine for forecasting. Early results demonstrate that Genetic Algorithms have the potential to become a powerful tool for time series sales forecasting.
Identifying new categories in community question answering archives: a topic modeling approach BIBAFull-Text 1673-1676
  Yajie Miao; Chunping Li; Jie Tang; Lili Zhao
Community Question Answering (CQA) services have evolved into a popular way of information seeking and providing. User-posted questions in CQA are generally organized into hierarchical categories. In this paper, we define and study a novel problem which is referred to as New Category Identification (NCI) in CQA question archives. New Category Identification is primarily concerned with detecting and characterizing new or emerging categories which are not included in the existing category hierarchy. We define this problem formally, and propose both unsupervised and semi-supervised topic modeling methods to solve it. Experiments with a ground-truth set built from Yahoo! Answers show that our methods identify and interpret new categories effectively.
An effective approach for mining mobile user habits BIBAFull-Text 1677-1680
  Huanhuan Cao; Tengfei Bao; Qiang Yang; Enhong Chen; Jilei Tian
The user interaction with the mobile device plays an important role in user habit understanding, which is crucial for improving context-aware services. In this paper, we propose to mine the associations between user interactions and contexts captured by mobile devices, or behavior patterns for short, from context logs to characterize the habits of mobile users. Though several state-of-the-art studies have been reported for association mining, they cannot apply to behavior pattern mining due to the unbalanced occurrences of contexts and user interaction records. To this end, we propose a novel approach for behavior pattern mining which takes context logs as time ordered sequences of context records and takes into account the co-occurrences of contexts and interaction records in the whole time ranges of contexts. Moreover, we develop an Apriori-like algorithm for behavior pattern mining and improve the original algorithm in terms of efficiency by introducing the context hash tree. Last, we build a data collection system and collect the rich context data and interaction records of 50 recruited volunteers from their mobile devices. The extensive experiments on the collected real life data clearly validate the ability of our approach for mining effective behavior patterns.
Mining networks with shared items BIBAFull-Text 1681-1684
  Jun Sese; Mio Seki; Mutsumi Fukuzaki
Recent advances in data processing have enabled the generation of large and complex graphs. Many researchers have developed techniques to investigate informative structures within these graphs. However, the vertices and edges of most real-world graphs are associated with its features, and only a few studies have considered their combination. In this paper, we specifically examine a large graph in which each vertex has associated items. From the graph, we extract subgraphs with common itemsets, which we call itemset-sharing subgraphs (ISSes). The problem has various potential applications such as the detection of gene networks affected by drugs or the findings of popular research areas of contributing researchers. We propose an efficient algorithm to enumerate ISSes in large graphs. This algorithm enumerates ISSes with two efficient data structures: a DFS itemset tree and a visited itemset table. In practice, the combination of these two structures enables us to compute optimal solutions efficiently. We demonstrate the efficiency of our algorithm in mining ISSes from synthetic graphs with more than one million edges. We also present experiments performed using two real biological networks and a citation network. The experiments show that our algorithm can find interesting patterns in real datasets.
Learning sentiment classification model from labeled features BIBAFull-Text 1685-1688
  Yulan He
We propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon. Preferences on expectations of sentiment labels of those lexicon words are expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatical domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than exiting weakly-supervised sentiment classification methods despite using no labeled documents.
Embedding tolerance relations in formal concept analysis: an application in information fusion BIBAFull-Text 1689-1692
  Mehdi Kaytoue; Zainab Assaghir; Amedeo Napoli; Sergei O. Kuznetsov
This paper shows how to embed a similarity relation between complex descriptions in concept lattices. We formalize similarity by a tolerance relation: objects are grouped within a same concept when having similar descriptions, extending the ability of FCA to deal with complex data. We propose two different approaches. A first classical manner defines a discretization procedure. A second way consists in representing data by pattern structures, from which a pattern concept lattice can be constructed directly. In this case, considering a tolerance relation can be mathematically defined by a projection in a meet-semi-lattice. This allows to use concept lattices for their knowledge representation and reasoning abilities without transforming data. We show finally that resulting lattices are useful for solving information fusion problems.
Online learning for multi-task feature selection BIBAFull-Text 1693-1696
  Haiqin Yang; Irwin King; Michael R. Lyu
Multi-task feature selection (MTFS) is an important tool to learn the explanatory features across multiple related tasks. Previous MTFS methods fulfill this task in batch-mode training. This makes them inefficient when data come in sequence or when the number of training data is so large that they cannot be loaded into the memory simultaneously. To tackle these problems, we propose the first online learning framework for MTFS. A main advantage of the online algorithms is the efficiency in both time complexity and memory cost due to the closed-form solutions in updating the model weights at each iteration. Experimental results on a real-world dataset attest to the merits of the proposed algorithms.
Exploiting user interests for collaborative filtering: interests expansion via personalized ranking BIBAFull-Text 1697-1700
  Qi Liu; Enhong Chen; Hui Xiong; Chris H. Q. Ding
In real applications, a given user buys or rates an item based on his/her interests. Learning to leverage this interest information is often critical for recommender systems. However, in existing recommender systems, the information about latent user interests are largely under-explored. To that end, in this paper, we propose an interest expansion strategy via personalized ranking based on the topic model, named iExpand, for building an interest-oriented collaborative filtering framework. The iExpand method introduces a three-layer, user-interest-item, representation scheme, which leads to more interpretable recommendation results and helps the understanding of the interactions among users, items, and user interests. Moreover, iExpand strategically deals with many issues, such as the overspecialization and the cold-start problems. Finally, we evaluate iExpand on benchmark data sets, and experimental results show that iExpand outperforms state-of-the-art methods.
K-farthest-neighbors-based concept boundary determination for support vector data description BIBAFull-Text 1701-1704
  Yanshan Xiao; Bo Liu; Longbing Cao
Support vector data description (SVDD) is very useful for one-class classification. However, it incurs high time complexity in handling large scale data. In this paper, we propose a novel and efficient method, named K-Farthest-Neighbors-based Concept Boundary Detection (KFN-CBD for short), to improve the SVDD learning efficiency on large datasets. This work is motivated by the observation that SVDD classifier is determined by support vectors (SVs), and removing the non-support vectors (non-SVs) will not change the classifier but will reduce computational costs. Our approach consists of two steps. In the first step, we propose the K-farthest-neighbors method to identify the samples around the hyper-sphere surface, which are more likely to be SVs. At the same time, a new tree search strategy of M-tree is presented to speed up the K-farthest neighbor query. In the second step, the non-SVs are eliminated from the training set, and only the identified boundary samples are used to train the SVDD classifier. By removing the non-SVs, the training time of SVDD can be substantially reduced. Extensive experiments have shown that KFN-CBD achieves around 6 times speedup compared to the standard SVDD, and obtains the comparable classification quality as the entire dataset used.
Relational feature engineering of natural language processing BIBAFull-Text 1705-1708
  Hamidreza Kobdani; Hinrich Schütze; Andre Burkovski; Wiltrud Kessler; Gunther Heidemann
We present a new framework for feature engineering of natural language processing that is based on a relational data model of text. It includes fast and flexible methods for implementing and extracting new features and thereby reduces the effort of creating an NLP system for a particular task.
   In an instantiation and evaluation of the framework for the problem of coreference resolution in multiple languages, we were able to obtain competitive results in a short implementation period. This demonstrates the potential power of our framework for feature engineering.
Transfer incremental learning for pattern classification BIBAFull-Text 1709-1712
  Zhenfeng Zhu; Xingquan Zhu; Yue-Fei Guo; Xiangyang Xue
Traditional machine learning methods, such as Support Vector Machines (SVMs), usually assume that training and test data share the same distributions. Due to the inherent dynamic data nature, it is often observed that (1) the volumes of the training data may gradually grow; and (2) the existing and the newly arrived samples may be subject to different distributions or learning tasks. In this paper, we propose a Transfer Incremental Support Vector Machine (TrISVM), with the objective of tackling changes in data volumes and learning tasks at the same time. By using new updating rules to calculate the inverse matrix, TrISVM solves the existing incremental learning problem more efficiently, especially for high dimensional data. Furthermore, when using new samples to update the existing models, TrISVM employs sample-based weight adjustment procedures to ensure that the concept transferring between auxiliary and target samples can be leveraged to fulfill the transfer learning goal. Experimental results on real-world data sets demonstrate that TrISVM achieves better efficiency and prediction accuracy than both incremental-learning and transfer-learning based methods. In addition, the results also show that TrISVM is able to achieve bidirectional knowledge transfer between two similar tasks.
Learning ontology resolution for document representation and its applications in text mining BIBAFull-Text 1713-1716
  Lidong Bing; Bai Sun; Shan Jiang; Yan Zhang; Wai Lam
It is well known that synonymous and polysemous terms often bring in some noises when calculating the similarity between documents. Existing ontology-based document representation methods are static, hence, the chosen semantic concept set for representing a document has a fixed resolution and it is not adaptable to the characteristics of a document collection and the text mining problem in hand. We propose an Adaptive Concept Resolution (ACR) model to overcome this issue. ACR can learn a concept border from an ontology taking into consideration of the characteristics of a particular document collection. Then this border can provide a tailor-made semantic concept representation for a document coming from the same domain. Another advantage of ACR is that it is applicable in both classification task where the groups are given in the training document set, and clustering task where no group information is available. Furthermore, the result of this model is not sensitive to the model parameter. The experimental results show that ACR outperforms an existing static method significantly.
Supervised identification and linking of concept mentions to a domain-specific ontology BIBAFull-Text 1717-1720
  Gabor Melli; Martin Ester
We propose a pipelined supervised learning approach named SDOI to the task of interlinking the concepts mentioned within a document to the concepts within an ontology. Concept mention identification is performed by training a sequential tagging model. Each identified concept mention is then associated with a set of candidate ontology concepts along with a feature vector based on features proposed in the literature and novel ones based on new data sources, such as from the training corpus itself. An iterative algorithm is defined for handling collective features. We show a lift in performance over applicable baselines against the ability to identify the concept mentions within the 139 KDD-2009 conference paper abstracts, and to link these concept mentions to a domain-specific ontology for the field of data mining. Additional experiments of 22 ICDM-2009 abstracts suggest that the trained models are portable both in terms of accuracy and in their ability to reduce annotation time.
Relevance-index size tradeoff in contextual advertising BIBAFull-Text 1721-1724
  Pavan Kumar Gm; Krishna P. Leela; Mehul Parsana; Sachin Garg
In Contextual advertising, textual ads relevant to the content in a webpage are embedded in the page. Content keywords are extracted offline by crawling webpages and then stored in an index for fast serving. Given a page, ad selection involves index lookup, computing similarity between the keywords of the page and those of candidate ads and returning the top-k scoring ads. In this approach, there is a tradeoff between relevance and index size where better relevance can be achieved if there are no limits on the index size. However, the assumption of unlimited index size is not practical due to the large number of pages on the Web and stringent requirements on the serving latency. Secondly, page visits on the web follows power-law distribution where a significant proportion of the pages are visited infrequently, also called the tail pages. Indexing tail pages is not efficient given that these pages are accessed very infrequently.
   We propose a novel mechanism to mitigate these problems in the same framework. The basic idea is to index the same keyword vector for a set of similar pages. The scheme involves learning a website specific hierarchy from (page, URL) pairs of the website. Next, keywords are populated on the nodes via bottom-up traversal over the hierarchy. We evaluate our approach on a human labeled dataset where our approach has higher nDCG compared to a recent approach even though the index size of our approach is 7 times less than index size of the recent approach.
CasJoin: a cascade chain for text similarity joins BIBAFull-Text 1725-1728
  Xiaoxun Zhang; Zhili Guo; Honglei Guo; Huijia Zhu; Zhong Su
We are concerned with the problem of similarity joins of text data, where the task is to find all pairs of documents above an expected similarity. Such a problem often serves as an indispensable step in many web applications. A crucial issue is to preclude unnecessary candidate pairs as many as possible ahead of expensive similarity evaluation. In this paper, we initiate an idea of adopting a cascade structure in text joins for a large speedup, where a latter stage can exclude a considerable number of invalid pairs survived in former stages. The proposed algorithm is shortly referred to as CasJoin. We further adopt a prefix filter to build the stage of CasJoin by introducing a novel vision to the dynamic generation of document vector. Specifically, a vector is partitioned into a chain of multiple prefixes that are appended one by one for cascade joining. We evaluate our CasJoin on a typical web corpus, ODP. Experiments indicate that, comparing to the state-of-the-art prefix algorithms, CasJoin can achieve a drastic reduction of candidates by as much as 98.15% and a dramatic speedup of joining by up to 13.34x.
Learning naïve Bayes transfer classifier through class-wise test distribution estimation BIBAFull-Text 1729-1732
  Jeong-Woo Son; Seong-Bae Park; Hyun-Je Song
Text classification is a well-known problem for various applications. For last decades, it is believed that a large corpus is one of the most important aspects for better classification. However, even though a great number of documents is available for training a classifier, it is practically impossible to achieve an ideal performance, since the distributions of labeled and unlabeled documents are often different. To overcome this problem, this paper describes a novel Naïve Bayes classifier for text classification under distribution difference between training and test data. The proposed method approximates test distribution by weighting labeled documents to cope with the distribution difference. Unlike other transfer learning which estimates the weights of labeled documents, the proposed method considered both the documents and their estimated class labels. Therefore, the proposed method naturally combines the advantage of semi-supervised learning with those of transfer learning.
Top-Eye: top-k evolving trajectory outlier detection BIBAFull-Text 1733-1736
  Yong Ge; Hui Xiong; Zhi-hua Zhou; Hasan Ozdemir; Jannite Yu; K. C. Lee
The increasing availability of large-scale location traces creates unprecedent opportunities to change the paradigm for identifying abnormal moving activities. Indeed, various aspects of abnormality of moving patterns have recently been exploited, such as wrong direction and wandering. However, there is no recognized way of combining different aspects into an unified evolving abnormality score which has the ability to capture the evolving nature of abnormal moving trajectories. To that end, in this paper, we provide an evolving trajectory outlier detection method, named TOP-EYE, which continuously computes the outlying score for each trajectory in an accumulating way. Specifically, in TOP-EYE, we introduce a decay function to mitigate the influence of the past trajectories on the evolving outlying score, which is defined based on the evolving moving direction and density of trajectories. This decay function enables the evolving computation of accumulated outlying scores along the trajectories. An advantage of TOP-EYE is to identify evolving outliers at very early stage with relatively low false alarm rate. Finally, experimental results on real-world location traces show that TOP-EYE can effectively capture evolving abnormal trajectories.
Multi task learning on multiple related networks BIBAFull-Text 1737-1740
  Prakash Mandayam Comar; Pang-ning Tan; Anil Kumar Jain
With the rapid proliferation of online social networks, the need for newer class of learning algorithm to simultaneously deal with multiple related networks has become increasingly important. This paper proposes an approach for multi-task learning in multiple related networks, where in we perform different tasks such as classification on one network and clustering on the other. We show that the framework can be extended to incorporate prior information about the correspondences between the clusters and classes in different networks. We have performed experiments on real-world data sets to demonstrate the effectiveness of the proposed framework.
Building a semantic representation for personal information BIBAFull-Text 1741-1744
  Jinyoung Kim; Anton Bakalov; David A. Smith; W. Bruce Croft
A typical collection of personal information contains many documents and mentions many concepts (e.g., person names, events, etc.). In this environment, associative browsing between these concepts and documents can be useful as a complement for search. Previous approaches in the area of semantic desktops aimed at addressing this task. However, they were not practical because they require tedious manual annotation by the user.
   In this work, we suggest a methodology and a prototype system for building a semantic representation of personal information based on click feedback from the user. We employed a feature-based model of associations between the concepts and documents. Our initial evaluation shows that the suggested semantic representation can play an important role in the known-item finding task and that the system can learn to predict such associations with a small amount of click data.
Affinity-driven prediction and ranking of products in online product review sites BIBAFull-Text 1745-1748
  Hui Li; Sourav S. Bhowmick; Aixin Sun
Large online product review websites (e.g., Epinions, Blippr) to various types of products. Typically, each product in these sites is associated with a group of members who have provided ratings and comments on it. These people form a product community. A potential member can join a produce community by giving a new rating to the product. We refer to this phenomenon of a product community's ability to "attract" new members as product affinity.
   The knowledge of a ranked list of products based on product affinity is of much importance to be utilized for implementing policies, marketing research, online advertisement, and other applications. In this paper, we identify and analyze an array of features that exert effect on product affinity and propose a novel model, called AffRank, that utilizes these features to predict the future rank of products according to their affinities. Evaluated on a real-world dataset, we demonstrate the effectiveness and superior prediction quality of AffRank compared to baseline methods. Our experiments show that features such as affinity rank history, affinity evolution distance, and average rating are the most important factors affecting future rank of products.
Topic-driven web search result organization by leveraging wikipedia semantic knowledge BIBAFull-Text 1749-1752
  Xianpei Han; Jun Zhao
Effective organization of web search results can greatly improve the utility of search engine and enhance the quality of search results. However, the organization of search results is difficult because the sub-topics of a query are usually not explicitly given. In this paper, we propose a novel topic-driven search result organization method, which can first detect the sub-topics of a query by finding the coherent Wikipedia concept groups from its search results; then organize these results using a topic-driven clustering algorithm; in the end we score and rank the topics using the support vector regression model. Empirical results show that our method can achieve competitive performance.
Fast dimension reduction for document classification based on imprecise spectrum analysis BIBAFull-Text 1753-1756
  Hu Guan; Bin Xiao; Jingyu Zhou; Minyi Guo; Tao Yang
This paper proposes an algorithm called Imprecise Spectrum Analysis (ISA) to carry out fast dimension reduction for document classification. ISA is designed based on the one-sided Jacobi method for Singular Value Decomposition (SVD). To speedup dimension reduction, it simplifies the orthogonalization process of Jacobi computation and introduces a new mapping formula for transforming original document-term vectors. To improve classification accuracy using ISA, a feature selection method is further developed to make inter-class feature vectors more orthogonal in building the initial weighted term-document matrix. Our experimental results show that ISA is extremely fast in handling large term-document matrices and delivers better or competitive classification accuracy compared to SVD-based LSI.
Manifold ranking with sink points for update summarization BIBAFull-Text 1757-1760
  Pan Du; Jiafeng Guo; Jin Zhang; Xueqi Cheng
Update summarization aims to create a summary over a topic-related multi-document dataset based on the assumption that the user has already read a set of earlier documents of the same topic. Beyond the problems (i.e., topic relevance, salience, and diversity in extracted information) tackled by topic-focused multi-document summarization, the update summarization must address the novelty problem as well. In this paper, we propose a novel extractive approach based on manifold ranking with sink points for update summarization. Specifically, our approach leverages a manifold ranking process over the sentence manifold to find topic relevant and salient sentences. More important, by introducing the sink points into sentence manifold, the ranking process can further capture the novelty and diversity based on the intrinsic sentence manifold. Therefore, we are able to address the four challenging problems above for update summarization in a unified way. Experiments on benchmarks of TAC are performed and the evaluation results show that our approach can achieve comparative performance to the existing best performing systems in TAC tasks.
Construction of a sentimental word dictionary BIBAFull-Text 1761-1764
  Eduard C. Dragut; Clement Yu; Prasad Sistla; Weiyi Meng
The Web has plenty of reviews, comments and reports about products, services, government policies, institutions, etc. The opinions expressed in these reviews influence how people regard these entities. For example, a product with consistently good reviews is likely to sell well, while a product with numerous bad reviews is likely to sell poorly. Our aim is to build a sentimental word dictionary, which is larger than existing sentimental word dictionaries and has high accuracy. We introduce rules for deduction, which take words with known polarities as input and produce synsets (a set of synonyms with a definition) with polarities. The synsets with deduced polarities can then be used to further deduce the polarities of other words. Experimental results show that for a given sentimental word dictionary with D words, approximately an additional 50% of D words with polarities can be deduced. An experiment is conducted to find the accuracy of a random sample of the deduced words. It is found that the accuracy is about the same as that of comparing the judgment of one human with that of another.
Exploiting novelty, coverage and balance for topic-focused multi-document summarization BIBAFull-Text 1765-1768
  Xuan Li; Yi-Dong Shen; Liang Du; Chen-Yan Xiong
Novelty, coverage and balance are important requirements in topic-focused summarization, which to a large extent determine the quality of a summary. In this paper, we propose a novel method that incorporates these requirements into a sentence ranking probability model. It differs from the existing methods in that the novelty, coverage and balance requirements are all modeled w.r.t. a given topic, so that summaries are highly relevant to the topic and at the same time comply with topic-aware novelty, coverage and balance. Experimental results on the DUC 2005, 2006 and 2007 benchmark data sets demonstrate the effectiveness of our method.
Context modeling for ranking and tagging bursty features in text streams BIBAFull-Text 1769-1772
  Wayne Xin Zhao; Jing Jiang; Jing He; Dongdong Shan; Hongfei Yan; Xiaoming Li
Bursty features in text streams are very useful in many text mining applications. Most existing studies detect bursty features based purely on term frequency changes without taking into account the semantic contexts of terms, and as a result the detected bursty features may not always be interesting or easy to interpret. In this paper we propose to model the contexts of bursty features using a language modeling approach. We then propose a novel topic diversity-based metric using the context models to find newsworthy bursty features. We also propose to use the context models to automatically assign meaningful tags to bursty features. Using a large corpus of a stream of news articles, we quantitatively show that the proposed context language models for bursty features can effectively help rank bursty features based on their newsworthiness and to assign meaningful tags to annotate bursty features.
Comparison of six aggregation strategies to compute users' trustworthiness BIBAFull-Text 1773-1776
  Pierpaolo Dondio; Stephen Barrett
The decision to grant trust in virtual societies is often an evidence based process. The evidence for such decision derives from a diverse set, where mutual relationships and contradictions might occur. This paper compares and evaluates six aggregation strategies to compute users' trustworthiness. Our evaluation performed over a large online-community, shows how a rule-based strategy based on an argumentation semantic outperforms strategies where mutual relationships among evidence are ignored.
Visualization and clustering of crowd video content in MPCA subspace BIBAFull-Text 1777-1780
  Haiping Lu; How-Lung Eng; Myo Thida; Konstantinos N. Plataniotis
This paper presents a novel approach for the visualization and clustering of crowd video contents by using multilinear principal component analysis (MPCA). In contrast to feature-point-based approach and frame-based dimensionality reduction approach, the proposed method maps each short video segment to a point in MPCA subspace to take temporal information into account naturally through tensorial representations. Specifically, MPCA projects each short segment of a video to a low-dimensional tensor first. A few MPCA features are then selected according to the variance captured as the final representation. Thus, a video is visualized as a trajectory in MPCA subspace. The trajectory generated enables visual interpretation of video content in a compact space as well as visual clustering of video events. The proposed method is evaluated on the PETS 2009 datasets through comparison with three existing methods for video visualization. The MPCA visualization shows superior performance in clustering segments of the same event as well as identifying the transitions between events.
ANITA: a narrative interpretation of taxonomies for their adaptation to text collections BIBAFull-Text 1781-1784
  Mario Cataldi; K. Selçuk Candan; Maria Luisa Sapino
Taxonomies embody formalized knowledge and define aggregations between concepts/categories in a given domain, facilitating the organization of the data and making the contents easily accessible to the users. Since taxonomies have significant roles in the data annotation, search and navigation, they are often carefully engineered. However, especially in very dynamic content, they do not necessarily reflect the content knowledge. Thus, in this paper, we propose A Narrative Interpretation of Taxonomies for their Adaptation (ANITA) for re-structuring existing taxonomies to varying application contexts and we evaluate the proposed scheme by user studies that show that the proposed algorithm is able to adapt the taxonomy in a new compact and understandable structure from a human point of view.
Yes we can: simplex volume maximization for descriptive web-scale matrix factorization BIBAFull-Text 1785-1788
  Christian Thurau; Kristian Kersting; Christian Bauckhage
Matrix factorization methods are among the most common techniques for detecting latent components in data. Popular examples include the Singular Value Decomposition or Non-negative Matrix Factorization. Unfortunately, most methods suffer from high computational complexity and therefore do not scale to massive data. In this paper, we present a linear time algorithm for the factorization of gigantic matrices that iteratively yields latent components. We consider a constrained matrix factorization s.t. the latent components form a simplex that encloses most of the remaining data. The algorithm maximizes the volume of that simplex and thereby reduces the displacement of data from the space spanned by the latent components. Hence, it also lowers the Frobenius norm, a common criterion for matrix factorization quality. Our algorithm is efficient, well-grounded in distance geometry, and easily applicable to matrices with billions of entries. In addition, the resulting factors allow for an intuitive interpretation of data: every data point can now be expressed as a convex combination of the most extreme and thereby often most descriptive instances in a collection of data. Extensive experimental validations on web-scale data, including 80 million images and 1.5 million twitter tweets, demonstrate superior performance compared to related factorization or clustering techniques.
Incorporating terminology evolution for query translation in text retrieval with association rules BIBAFull-Text 1789-1792
  Amal C. Kaluarachchi; Aparna S. Varde; Srikanta Bedathur; Gerhard Weikum; Jing Peng; Anna Feldman
Time-stamped documents such as newswire articles, blog posts and other web-pages are often archived online. When these archives cover long spans of time, the terminology within them could undergo significant changes. Hence, when users pose queries pertaining to historical information, over such documents, the queries need to be translated, taking into account these temporal changes, to provide accurate responses to users. For example, a query on Sri Lanka should automatically retrieve documents with its former name Ceylon. We call such concepts SITACs, i.e., Semantically Identical Temporally Altering Concepts. In order to discover SITACs, we propose an approach based on a novel framework constituting an integration of natural language processing, association rule mining, and contextual similarity as a learning technique. The proposed approach has been experimented with real data and has been found to yield good results with respect to efficiency and accuracy.
Exploiting co-occurrence and information quality metrics to recommend tags in web 2.0 applications BIBAFull-Text 1793-1796
  Fabiano Muniz Belém; Eder Ferreira Martins; Jussara Marques Almeida; Marcos André Gonçalves; Gisele Lobo Pappa
This work addresses the task of recommending high quality tags by exploiting not only previously assigned tags, but also terms extracted from other textual features (e.g., title and description) associated with the target object. To estimate the quality of a candidate tag recommendation, we use several metrics related to both tag co-occurrence and information quality. We also propose a heuristic function to combine the metrics to produce a final ranking of the recommended tags. We evaluate our heuristic function in various scenarios, for three popular Web 2.0 applications. Our experimental results indicate that our heuristic function significantly outperforms two state-of-the-art tag recommendation algorithms.
Elusive vandalism detection in wikipedia: a text stability-based approach BIBAFull-Text 1797-1800
  Qinyi Wu; Danesh Irani; Calton Pu; Lakshmish Ramaswamy
The open collaborative nature of wikis encourages participation of all users, but at the same time exposes their content to vandalism. The current vandalism-detection techniques, while effective against relatively obvious vandalism edits, prove to be inadequate in detecting increasingly prevalent sophisticated (or elusive) vandal edits. We identify a number of vandal edits that can take hours, even days, to correct and propose a text stability-based approach for detecting them. Our approach is focused on the likelihood of a certain part of an article being modified by a regular edit. In addition to text-stability, our machine learning-based technique also takes into account edit patterns. We evaluate the performance of our approach on a corpus comprising of 15000 manually labeled edits from the Wikipedia Vandalism PAN corpus. The experimental results show that text-stability is able to improve the performance of the selected machine-learning algorithms significantly.
Feature subspace transformations for enhancing k-means clustering BIBAFull-Text 1801-1804
  Anirban Chatterjee; Sanjukta Bhowmick; Padma Raghavan
Unsupervised classification typically concerns identifying clusters of similar entities in an unlabeled dataset. Popular methods include clustering based on (i) distance-based metrics between the entities in the feature space (K-Means), and (ii) combinatorial properties in a weighted graph representation of the dataset (Multilevel K-Means).
   In this paper, we present a force-directed graph layout based feature subspace transformation (FST) scheme to transform the dataset before the application of K-Means. Our FST-K-Means method utilizes both distance-based and combinatorial attributes of the original dataset to seek improvements in the internal and external quality metrics of unsupervised classification. We demonstrate the effectiveness of FST-K-Means in improving classification quality relative to K-Means and Multilevel K-Means (GraClus). The quality of classification is measured by observing internal and external quality metrics on a test suite of datasets. Our results indicate that on average, the internal quality metric (cluster cohesiveness) is 20.2% better than K-Means, and 6.6% better than GraClus. More significantly, FST-K-Means improves the external quality metric (accuracy) of classification on average by 14.9% relative to K-Means and 23.6% relative to GraClus.
On bootstrapping recommender systems BIBAFull-Text 1805-1808
  Nadav Golbandi; Yehuda Koren; Ronny Lempel
Recommender systems perform much better on users for which they have more information. This gives rise to a problem of satisfying users new to a system. The problem is even more acute considering that some of these hard to profile new users judge the unfamiliar system by its ability to immediately provide them with satisfying recommendations, and may be the quickest to abandon the system when disappointed. Rapid profiling of new users is often achieved through a bootstrapping process -- a kind of an initial interview -- that elicits users to provide their opinions on certain carefully chosen items or categories. This work offers a new bootstrapping method, which is based on a concrete optimization goal, thereby handily outperforming known approaches in our tests.
Using Wikipedia categories for compact representations of chemical documents BIBAFull-Text 1809-1812
  Benjamin Köhncke; Wolf-Tilo Balke
Today, Web pages are usually accessed using text search engines, whereas documents stored in the deep Web are accessed through domain-specific Web portals. These portals rely on external knowledge bases, respectively ontologies, mapping documents to more general concepts allowing for suitable classifications and navigational browsing. Since automatically generated ontologies are still not satisfactory for advanced information retrieval tasks, most portals heavily rely on hand-crafted domain-specific ontologies. This, however, also leads to high creation and maintaining costs. On the other hand, a freely available community maintained, if somewhat general, knowledge base is offered by Wikipedia. During the last years the coverage of Wikipedia has reached a large pool of information including articles from almost all domains. In this paper, we investigate the use of Wikipedia categories to describe the content of chemical documents in a compact form. We compare the results to the domain-specific ChEBI ontology and the results show that Wikipedia categories indeed allow useful descriptions for chemical documents that are even better than descriptions from the ChEBI ontology.
Efficient wikipedia-based semantic interpreter by exploiting top-k processing BIBAFull-Text 1813-1816
  Jong Wook Kim; Ashwin Kashyap; Dekai Li; Sandilya Bhamidipati
Proper representation of the meaning of texts is crucial to enhancing many data mining and information retrieval tasks, including clustering, computing semantic relatedness between texts, and searching. Representing of texts in the concept space derived from Wikipedia has received growing attention recently, due to its comprehensiveness and expertise, This concept-based representation is capable of extracting semantic relatedness between texts that cannot be deduced with the bag of words model. A key obstacle, however, for using Wikipedia as a semantic interpreter is that the sheer size of the concepts derived from Wikipedia makes it hard to efficiently map texts into concept-space. In this paper, we develop an efficient algorithm which is able to represent the meaning of a text by using the concepts that best match it. In particular, our approach first computes the approximate top-k concepts that are most relevant to the given text. We then leverage these concepts for representing the meaning of the given text. The experimental results show that the proposed technique provides significant gains in execution time over current solutions to the problem.
A study of rumor control strategies on social networks BIBAFull-Text 1817-1820
  Rudra M. Tripathy; Amitabha Bagchi; Sameep Mehta
In this paper we study and evaluate rumor-like methods for combating the spread of rumors on a social network. We model rumor spread as a diffusion process on a network and suggest the use of an "anti-rumor" process similar to the rumor process. We study two natural models by which these anti-rumors may arise. The main metrics we study are the belief time, i.e., the duration for which a person believes the rumor to be true and point of decline, i.e., point after which anti-rumor process dominates the rumor process. We evaluate our methods by simulating rumor spread and anti-rumor spread on a data set derived from the social networking site Twitter and on a synthetic network generated according to the Watts and Strogatz model. We find that the lifetime of a rumor increases if the delay in detecting it increases, and the relationship is at least linear. Further our findings show that coupling the detection and anti-rumor strategy by embedding agents in the network, we call them beacons, is an effective means of fighting the spread of rumor, even if these beacons do not share information.
Domain-independent entity coreference in RDF graphs BIBAFull-Text 1821-1824
  Dezhao Song; Jeff Heflin
In this paper, we present a novel entity coreference algorithm for Semantic Web instances. The key issues include how to locate context information and how to utilize the context appropriately. To collect context information, we select a neighborhood (consisting of triples) of each instance from the RDF graph. To determine the similarity between two instances, our algorithm computes the similarity between comparable property values in the neighborhood graphs. The similarity of distinct URIs and blank nodes is computed by comparing their outgoing links. To provide the best possible domain-independent matches, we examine an appropriate way to compute the discriminability of triples. To reduce the impact of distant nodes, we explore a distance-based discounting approach. We evaluated our algorithm using different instance categories in two datasets. Our experiments show that the best results are achieved by including both our triple discrimination and discounting approaches.
Opinion digger: an unsupervised opinion miner from unstructured product reviews BIBAFull-Text 1825-1828
  Samaneh Moghaddam; Martin Ester
Mining customer reviews (opinion mining) has emerged as an interesting new research direction. Most of the reviewing websites such as Epinions.com provide some additional information on top of the review text and overall rating, including a set of predefined aspects and their ratings, and a rating guideline which shows the intended interpretation of the numerical ratings. However, the existing methods have ignored this additional information. We claim that using this information, which is freely available, along with the review text can effectively improve the accuracy of opinion mining. We propose an unsupervised method, called Opinion Digger, which extracts important aspects of a product and determines the overall consumer's satisfaction for each, by estimating a rating in the range from 1 to 5. We demonstrate the improved effectiveness of our methods on a real life dataset that we crawled from Epinions.com.
Combining link and content for collective active learning BIBAFull-Text 1829-1832
  Lixin Shi; Yuhang Zhao; Jie Tang
In this paper, we study a novel problem Collective Active Learning, in which we aim to select a batch set of "informative" instances from a networking data set to query the user in order to improve the accuracy of the learned classification model. We perform a theoretical investigation of the problem and present three criteria (i.e., minimum redundancy, maximum uncertainty and maximum impact) to quantify the informativeness of a set of selected instances. We define an objective function based on the three criteria and present an efficient algorithm to optimize the objective function with a bounded approximation rate. Experimental results on a real-world data sets demonstrate the effectiveness of our proposed approach.
Classifying sentiment in microblogs: is brevity an advantage? BIBAFull-Text 1833-1836
  Adam Bermingham; Alan F. Smeaton
Microblogs as a new textual domain offer a unique proposition for sentiment analysis. Their short document length suggests any sentiment they contain is compact and explicit. However, this short length coupled with their noisy nature can pose difficulties for standard machine learning document representations. In this work we examine the hypothesis that it is easier to classify the sentiment in these short form documents than in longer form documents. Surprisingly, we find classifying sentiment in microblogs easier than in blogs and make a number of observations pertaining to the challenge of supervised learning for sentiment analysis in microblogs.
Identifying hotspots on the real-time web BIBAFull-Text 1837-1840
  Krishna Yeswanth Kamath; James Caverlee
We study the problem of automatically identifying "hotspots" on the real-time web. Concretely, we propose to identify highly-dynamic ad-hoc collections of users -- what we refer to as crowds -- in massive social messaging systems like Twitter and Facebook. The proposed approach relies on a message-based communication clustering approach over time-evolving graphs that captures the natural conversational nature of social messaging systems. One of the salient features of the proposed approach is an efficient locality-based clustering approach for identifying crowds of users in near real-time compared to more heavyweight static clustering algorithms. Based on a three month snapshot of Twitter consisting of 711,612 users and 61.3 million messages, we show how the proposed approach can efficiently and effectively identify Twitter-based crowds relative to static graph clustering techniques at a fraction of the computational cost.
Discovery of numerous specific topics via term co-occurrence analysis BIBAFull-Text 1841-1844
  Omid Madani; Jiye Yu
We describe efficient techniques for construction of large term co-occurrence graphs, and investigate an application to the discovery of numerous fine-grained (specific) topics. A topic is a small dense subgraph discovered by a random walk initiated at a term (node) in the graph. We observe that the discovered topics are highly interpretable, and reveal the different meanings of terms in the corpus. We show the information-theoretic utility of the topics when they are used as features in supervised learning. Such features lead to consistent improvements in classification accuracy over the standard bag-of-words representation, even at high training proportions. We explain how a layered pyramidal view of the term distribution helps in understanding the algorithms and in visualizing and interpreting the topics.
Digging for knowledge with information extraction: a case study on human gene-disease associations BIBAFull-Text 1845-1848
  Markus Bundschus; Anna Bauer-Mehren; Volker Tresp; Laura Furlong; Hans-Peter Kriegel
We present the information extraction system Text2SemRel. The system (semi-) automatically constructs knowledge bases from textual data consisting of facts about entities using semantic relations. An integral part of the system is a graph-based interactive visualization and search layer. The second contribution in this paper is the presentation of a case study on the (semi-) automatic construction of a knowledge base consisting of gene-disease associations. The resulting knowledge base, the Literature-derived Human Gene-Disease Network (LHGDN), is now an integral part of the Linked Life Data initiative and represents currently the largest publicly available gene-disease repository. The LHGDN is compared against several curated state of the art databases. A unique feature of the LHGDN is that the semantics of the associations constitute a wide variety of biomolecular conditions.
Towards query log based personalization using topic models BIBAFull-Text 1849-1852
  Mark J. Carman; Fabio Crestani; Morgan Harvey; Mark Baillie
We investigate the utility of topic models for the task of personalizing search results based on information present in a large query log. We define generative models that take both the user and the clicked document into account when estimating the probability of query terms. These models can then be used to rank documents by their likelihood given a particular query and user pair.
Choosing your own adventure: automatic taxonomy generation to permit many paths BIBAFull-Text 1853-1856
  Xiaoguang Qi; Dawei Yin; Zhenzhen Xue; Brian D. Davison
A taxonomy organizes concepts or topics in a hierarchical structure and can be created manually or via automated systems. A major drawback of taxonomies is that they require users to have the same view of the topics as the taxonomy creator. Users who do not share that mental taxonomy are likely to have difficulty in finding the desired topic. In this paper, we propose a new approach to taxonomy expansion which is able to provide more flexible views. Based on an existing taxonomy, our algorithm finds possible alternative paths and generates an expanded taxonomy with flexibility in user browsing choices. In experiments on the dmoz Open Directory Project, the rebuilt taxonomies provide more alternative paths and shorter paths to information. User studies show that our expanded taxonomies are preferred compared to the original.
Robust prediction from multiple heterogeneous data sources with partial information BIBAFull-Text 1857-1860
  Mohammad S. Aziz; Chandan K. Reddy
Significant research efforts for robust integration of information from multiple sources are being pursued at a rapid pace. However, the information in heterogeneous sources is often incomplete and hence making the maximum use of all the available information is a challenging problem. Most of the recent research on data integration have been primarily focused on the cases where the information is available across all the different sources and can not effectively integrate sources in the presence of partial information. We develop an ensemble method that boosts the decisions made from different models on individual sources and obtain robust results for the task of class prediction. We propose a heterogeneous boosting framework that uses all the available information even if some of the sources do not provide any information about some objects. We demonstrate the effectiveness of the proposed framework for the problem of gene function prediction and compare to the state-of-the-art methods using several real-world biological datasets. We also show that the proposed method outperforms any kind of imputation schemes that are widely used while integrating data with partial information.
Collaboration analytics: mining work patterns from collaboration activities BIBAFull-Text 1861-1864
  Qihua Wang; Hongxia Jin; Yan Liu
People are increasingly using more and more social softwares, generating flooding communications. User analytics may be performed to mine a person's activities on different social systems and extract patterns, be it interest patterns, social patterns, or work patterns. Such patterns may benefit both the individuals and the organizations the users associated with, as the information is valuable in numerous tasks, including recommendation, evaluation, management, and so on. In this article, we present an actionable solution of user analytics, namely collaboration analytics, by focusing on mining a person's work patterns from her collaboration activities. Our solution effectively makes use of a user's heterogeneous data collected from various collaboration tools to derive an integrated description of the user's collaborative work. A number of "work areas", each of which contains its work topics and people involved, are generated for every user. The challenges we face include the clustering of items with short texts and prioritizing/weighting data items based on importance/relevance. Our solutions to those issues will be described in this article. In particular, we mine users' background information from various types of data and use such information to enrich the semantics of the short texts contained in the activity instances on collaboration tools before clustering those instances into work areas. Finally, we have developed a prototype of our collaboration analytics solution and evaluated it with real-world data and people.
Adapting cost-sensitive learning for reject option BIBAFull-Text 1865-1868
  Jun Du; Eileen A. Ni; Charles X. Ling
Traditional cost-sensitive learning algorithms always deterministically predict examples as either positive or negative (in binary setting), to minimize the total misclassification cost. However, in more advanced real-world settings, the algorithms can also have another option to reject examples of high uncertainty. In this paper, we assume that cost-sensitive learning algorithms can reject the examples and obtain their true labels by paying reject cost. We therefore analyse three categories of popular cost-sensitive learning approaches, and provide generic methods to adapt them for reject option.
SKIF: a data imputation framework for concept drifting data streams BIBAFull-Text 1869-1872
  Peng Zhang; Xingquan Zhu; Jianlong Tan; Li Guo
Missing data commonly occurs in many applications. While many data imputation methods exist to handle the missing data problem for large scale databases, when applied to concept drifting data streams, these methods face some common difficulties. First, due to large and continuous data volumes, we are unable to maintain all stream records to form a candidate pool and estimate missing values, as most existing methods commonly do. Second, even if we could maintain all complete stream records using a summary structure, the concept drifting problem would make some information obsolete, and thus deteriorate the imputation accuracy. Third, in data streams, it is necessary to develop a fast yet accurate algorithm to find the most similar data for imputation. Fourth, due to the dynamic and sophisticated data collection environments, the missing rate of most stream data may be much higher than that in generic static databases, so the imputation method should be able to accommodate high missing rate in the data. To tackle these challenges, we propose, in this paper, a Streaming k-Nearest-Neighbors Imputation Framework (SKIF) for concept drifting data streams. To handle concept drifting and large volume problems in data streams, SKIF first summarizes historical complete records in some micro-resources (which are high-level statistical data structures), and maintains these micro-resources in a candidate pool as benchmark data. After that, SKIF employs a novel hybrid-kNN imputation procedure, which uses a hybrid similarity search mechanism, to find the most similar micro-resources from the large scale candidate pool efficiently. Experimental results demonstrate the effectiveness of the proposed SKIF framework for data stream imputation tasks.
Detecting controversial events from Twitter BIBAFull-Text 1873-1876
  Ana-Maria Popescu; Marco Pennacchiotti
Social media provides researchers with continuously updated information about developments of interest to large audiences. This paper addresses the task of identifying controversial events using Twitter as a starting point: we propose 3 models for this task and report encouraging initial results.
Topic detection and organization of mobile text messages BIBAFull-Text 1877-1880
  Ye Tian; Wendong Wang; Xueli Wang; Jinghai Rao; Canfeng Chen; Jian Ma
How to organize and visualize big amount of text messages stored on one's mobile phone is a challenging problem, since they can hardly be organized by threads as we do for emails due to lack of necessary metadata such as "subject" and "reply-to". In this paper, we propose an innovative approach based on clustering algorithms and natural language processing methods. We first cluster the text messages into candidate conversations based on their temporal attributes, and then do further analysis using a semantic model based on Latent Dirichlet Allocation (LDA). Considering that the text messages are usually short and sparse, we trained the model using a large scale external data collected from twitter-like web sites, and applied the model to text messages. In the end, the text messages are organized as conversations based on their topics. We evaluated our approach based on 122,359 text messages collected from 50 university students during 6 months.
Unsupervised public health event detection for epidemic intelligence BIBAFull-Text 1881-1884
  Marco Fisichella; Avaré Stewart; Kerstin Denecke; Wolfgang Nejdl
Recent pandemics such as Swine Flu have caused concern for public health officials. Given the ever increasing pace at which infectious diseases can spread globally, officials must be prepared to react sooner and with greater epidemic intelligence gathering capabilities. However, state-of-the-art systems for Epidemic Intelligence have not kept the pace with the growing need for more robust public health event detection. In this paper, we propose a game-changing approach where public health events are detected in an unsupervised manner. We address the problems associated with adapting an unsupervised learner to the medical domain and in doing so, propose an approach which combines aspects from different feature-based event detection methods. We evaluate our approach with a real world dataset with respect to the quality of article clusters. Our results show that we are able to achieve a precision of 66% and a recall of 81% when evaluated using manually annotated, real-world data. This shows promising results for the use of such techniques in this new problem setting.
Pattern based keyword extraction for contextual advertising BIBAFull-Text 1885-1888
  Kushal Dave; Vasudeva Varma
Contextual Advertising (CA) refers to the placement of ads that are contextually related to the web page content. The science of CA deals with the task of finding advertising keywords from web pages. We present a different candidate selection method to extract advertising keywords from a web page. This method makes use of Part-of-Speech (POS) patterns that restrict the number of potential candidates a classifier has to handle. It fetches words/phrases that belong to the selected set of POS patterns. We design four systems based on chunking method and the features they use. These systems are trained on a naive Bayes classifier with a set of web pages annotated with 'advertising' keywords. The systems can then find advertising keywords from previously unseen web pages. Empirical evaluation shows that systems using the proposed chunking method perform better than the systems using N-Gram based chunking. All improvements in the systems are found statistically significant at a 99% confidence interval.
Mixture model label propagation BIBAFull-Text 1889-1892
  Mingmin Chi; Xisheng He; Shipeng Yu
Usually, we can use a classification or clustering machine learning algorithm to manage knowledge and information retrieval. If we have a small size of known information with a large scale of unknown data, a semi-supervised learning (SSL) algorithm is often preferred. Under the cluster or manifold assumption, usually, the larger amount of unlabeled data are used for learning, the bigger gains of the SSL approaches are achieved. In the paper, we adopt the graph-based SSL algorithm to solve the problem. However the graph-based SSL algorithms are unable to be learnt with large-scale unlabeled samples and originally can only work in a transductive setting. In the paper, we propose a scalable graph-based SSL algorithm to attack the problems aforementioned by Gaussian mixture model label propagation. Experiments conducted on the real dataset illustrate the effectiveness of the proposed algorithm.
Regularization and feature selection for networked features BIBAFull-Text 1893-1896
  Hongliang Fei; Brian Quanz; Jun Huan
In the standard formalization of supervised learning problems, a datum is represented as a vector of features without prior knowledge about relationships among features. However, for many real world problems, we have such prior knowledge about structure relationships among features. For instance, in Microarray analysis where the genes are features, the genes form biological pathways. Such prior knowledge should be incorporated to build a more accurate and interpretable model, especially in applications with high dimensionality and low sample sizes. Towards an efficient incorporation of the structure relationships, we have designed a classification model where we use an undirected graph to capture the relationship of features. In our method, we combine both L1 norm and Laplacian based L2 norm regularization with logistic regression. In this approach, we enforce model sparsity and smoothness among features to identify a small subset of grouped features. We have derived efficient optimization algorithms based on coordinate decent for the new formulation. Using comprehensive experimental study, we have demonstrated the effectiveness of the proposed learning methods.

Poster session 4: Industry track

BagBoo: a scalable hybrid bagging-the-boosting model BIBAFull-Text 1897-1900
  Dmitry Yurievich Pavlov; Alexey Gorodilov; Cliff A. Brunk
In this paper, we introduce a novel machine learning approach for regression based on the idea of combining bagging and boosting that we call BagBoo. Our BagBoo model borrows its high accuracy potential from. Friedman's gradient boosting [2], and high efficiency and scalability through parallelism from Breiman's bagging [1]. We run empirical evaluations on large scale Web ranking data, and demonstrate that BagBoo is not only showing superior relevance than standalone bagging or boosting, but also outperforms most previously published results on these data sets. We also emphasize that BagBoo is intrinsically scalable and parallelizable, allowing us to train order of half a million trees on 200 nodes in 2 hours CPU time and beat all of the competitors in the Internet Mathematics relevance competition sponsored by Yandex and be one of the top algorithms in both tracks of Yahoo ICML-2010 challenge. We conclude the paper by stating that while impressive experimental evaluation results are presented here in the context of regression trees, the hybrid BagBoo model is applicable to other domains, such as classification, and base training models.
Exploiting sequential relationships for familial classification BIBAFull-Text 1901-1904
  Lee S. Jensen; James G. Shanahan
The pervasive nature of the internet has caused a significant transformation in the field of genealogical research. This has impacted not only how research is conducted, but has also dramatically increased the number of people discovering their family history. Recent market research (Maritz Marketing 2000, Harris Interactive 2009) indicates that general interest in the United States has increased from 45% in 1996, to 60% in 2000, and 87% in 2009. Increased popularity has caused a dramatic need for improvements in algorithms related to extracting, accessing, and processing genealogical data for use in building family trees.
   This paper presents one approach to algorithmic improvement in the family history domain, where we infer the familial relationships of households found in human transcribed United States census data. By applying advances made in natural language processing, exploiting the sequential nature of the census, and using state of the art machine learning algorithms, we were able to decrease the error by 35% over a hand coded baseline system. The resulting system is immediately applicable to hundreds of millions of other genealogical records where families are represented, but the familial relationships are missing.
Massive structured data management solution BIBAFull-Text 1905-1908
  Ullas Nambiar; Rajeev Gupta; Himanshu Gupta; Mukesh Mohania
The need to analyze structured data for various business intelligence applications such as customer churn analysis, social network analysis, etc. is well known. However, the potential size to which such data will scale in future will make solutions that revolve around data warehouses hard to scale. We begin by presenting a business case that prompted us to look at building a distributed analytics platform that is leveraging the MapReduce framework pioneered by Google. We present the results of the study and highlight issues with the current structured data access techniques for MapReduce platforms. Finally, we present a distributed and scalable data platform that leverages Apache Hadoop to enable business analysts to seamlessly query archived data along with data stored in the warehouse.
Ranking of evolving stories through meta-aggregation BIBAFull-Text 1909-1912
  Juozas Gordevicius; Francisco J. Estrada; Hyun Chul Lee; Periklis Andritsos; Johann Gamper
In this paper we focus on the problem of ranking news stories within their historical context by exploiting their content similarity. We observe that news stories evolve and thus have to be ranked in a time and query dependent manner. We do this in two steps. First, the mining step discovers metastories, which constitute meaningful groups of similar stories that occur at arbitrary points in time. Second, the ranking step uses well known measures of content similarity to construct implicit links among all metastories, and uses them to rank those metastories that overlap the time interval provided in a user query. We use real data from conventional and social media sources (weblogs) to study the impact of different meta-aggregation techniques and similarity measures in the final ranking. We evaluate the framework using both objective and subjective criteria, and discuss the selection of clustering method and similarity measure that lead to the best ranking results.
Injecting domain knowledge into a granular database engine: a position paper BIBAFull-Text 1913-1916
  Dominik Slezak; Graham Toppin
We discuss how to use techniques from such fields as text processing and knowledge management to better handle text attributes in the Infobright's RDBMS engine. Our approach leads to a rich interface for domain experts who wish to share their knowledge about data content and, on the other hand, it remains unnoticeable to data users. It enables to improve data storage, data access, and data compression, with no changes required at the database schema level.
Experiences with using SVM-based learning for multi-objective ranking BIBAFull-Text 1917-1920
  Linh Thai Nguyen; Wai Gen Yee; Roger Liew; Ophir Frieder
We describe our experiences in applying learning-to-rank techniques to improving the quality of search results of an online hotel reservation system. The search result quality factors we use are average booking position and distribution of margin in top-ranked results. (We expect that total revenue will increase with these factors.) Our application of the SVMRank technique improves booking position by up to 25% and margin distribution by up to 14%.
Learning to blend rankings: a monotonic transformation to blend rankings from heterogeneous domains BIBAFull-Text 1921-1924
  Zhenzhen Kou; Yi Chang; Zhaohui Zheng; Hongyuan Zha
There have been great needs to develop effective methods for combining multiple rankings from heterogeneous domains into one single rank list arising from many recent web search applications, such as integrating web search results from multiple engines, facets, or verticals. We define this problem as Learning to blend rankings from multiple domains. We propose a class of learning-to-blend methods that learn a monotonically increasing transformation for each ranking so that the rank order in each domain is preserved and the transformed values are comparable across multiple rankings. The transformation learning can be tackled by solving a quadratic programming problem. The novel machine learning method for blending multiple ranking lists is evaluated with queries sampled from a commercial search engine and a promising improvement of Discounted Cumulative Gain has been observed.

Demo session 1: IR

EntityEngine: answering entity-relationship queries using shallow semantics BIBAFull-Text 1925-1926
  Xiaonan Li; Chengkai Li; Cong Yu
We introduce EntityEngine, a system for answering entity-relationship queries over text. Such queries combine SQL-like structures with IR-style keyword constraints and therefore, can be expressive and flexible in querying about entities and their relationships. EntityEngine consists of various offline and online components, including a position-based ranking model for accurate ranking of query answers and a novel entity-centric index for efficient query evaluation.
Facetedpedia: enabling query-dependent faceted search for wikipedia BIBAFull-Text 1927-1928
  Ning Yan; Chengkai Li; Senjuti B. Roy; Rakesh Ramegowda; Gautam Das
Facetedpedia is a faceted search system that dynamically discovers query-dependent faceted interfaces for Wikipedia search result articles. In this paper, we give an overview of Facetedpedia, present the system architecture and implementation techniques, and elaborate on a demonstration scenario.
Discovering, ranking and annotating cross-document relationships between concepts BIBAFull-Text 1929-1930
  Wei Jin; Xin Wu
This paper presents CDRMiner, a system for automatically discovering, ranking and annotating cross-document links between concepts. Specifically, we focus on detecting hidden associations between two concepts and further generating annotations for each discovered hypothesis. We interpret such a relationship query as finding the most meaningful concept chains and evidence trails across multiple documents that potentially connect them. These functionalities are implemented using an interactive visualization paradigm which assists users for a better understanding and interpretation of discovered hypotheses, and matching their domain knowledge with the algorithmic power of text mining techniques.
WikiPop: personalized event detection system based on Wikipedia page view statistics BIBAFull-Text 1931-1932
  Marek Ciglan; Kjetil Nørvåg
In this paper, we describe WikiPop service, a system designed to detect significant increase of popularity of topics related to users' interests. We exploit Wikipedia page view statistics to identify concepts with significant increase of the interest from the public. Daily, there are thousands of articles with increased popularity; thus, a personalization is in order to provide the user only with results related to his/her interest. The WikiPop system allows a user to define a context by stating a set of Wikipedia articles describing topics of interest. The system is then able to search, for the given date, for popular topics related to the user defined context.
XReal: an interactive XML keyword searching BIBAFull-Text 1933-1934
  Zhifeng Bao; Jiaheng Lu; Tok Wang Ling
Keyword search over XML data usually brings irrelevant results especially when the keywords in a user query have ambiguities. We demonstrate a statistic-based approach to identify the search targets and constraints of a user query in the presence of keyword ambiguities, and come out a relevance oriented result ranking scheme called XML TF*IDF. Since the search intention of a same query may even vary from user to user, we provide an interactive search strategy by allowing user to simply tick their desired search targets from a list of suggestions recommended by the search engine. In this way, we can acquire more precise results and also take the burden of learning the schema of XML data off users.
EUI: an embedded engine for understanding user intents from mobile devices BIBAFull-Text 1935-1936
  JongWoo Ha; Jung-Hyun Lee; Kyu-Sun Shim; SangKeun Lee
We design and implement a novel embedded software engine, called EUI, to understand user intents from usage data within mobile devices. By developing the EUI engine in mobile devices, we expect to move towards "proactive" devices for mobile personalized services. To this end, we seek to embed the Open Directory Project (ODP) into mobile devices, and build a robust classifier with the embedded ODP. Thus, the EUI engine classifies the usage data within mobile devices into some ODP categories. Our implementation handles some challenging issues in embedding the ODP and building a robust classifier. The demonstration shows that our implementation understands the semantics of the usage data effectively.
TC-DCA: a system for text classification based on document's content allocation BIBAFull-Text 1937-1938
  Wenbo Li; Le Sun; Zhenzhong Zhang; Xue Jiang; Weiru Zhang
The text classification methods heavily depend on machine learning algorithms with abstract mathematic metrics, which obstruct the direct observation and intuitive understanding of the text-specific classification. In this paper, we model a document as a Document-Classes-Topics top-down hierarchical structure. Furthermore, by running the document generation procedure, we can obtain each class's content share, which not only can be used to make the classification decision but also can provide a natural visualization approach for text classification. We implement this idea by a new tool named TC-DCA, which provides the visualization of text classification result, where the target document is expressed graphically as its content's allocation on every class. TC-DCA can also perform the drilling down operation to reveal the classification effect of each word of the document.
Crawling the web for structured documents BIBAFull-Text 1939-1940
  Julián Urbano; Juan Loréns; Yorgos Andreadakis; Mónica Marrero
Structured Information Retrieval is gaining a lot of interest in recent years, as this kind of information is becoming an invaluable asset for professional communities such as Software Engineering. Most of the research has focused on XML documents, with initiatives like INEX to bring together and evaluate new techniques focused on structured information. Despite the use of XML documents is the immediate choice, the Web is filled with several other types of structured information, which account for millions of other documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or following specific search patterns in online repositories like SourceForge. This demo describes a distributed and focused web crawler for any kind of structured documents, and we show with it how to exploit general-purpose resources to gather large amounts of real-world structured documents off the Web. This kind of tool could help building large test collections of other types of documents, such as Java source code for software-oriented search engines or RDF for semantic searching.
Connecting the local and the online in information management BIBAFull-Text 1941-1942
  Gabriella Kazai; Natasa Milic-Frayling; Tim Haughton; Natalia Manola; Katerina Iatropoulou; Antonis Lempesis; Paolo Manghi; Marko Mikulicic
With the popularity of social media sites, digital content is increasingly stored and managed online. At the same time, the desktop and local storage continues to provide a personal environment in which users perform their daily tasks. Thus, to accomplish their tasks, users need to continuously switch between local and remote resources and applications, often carrying the burden of coordinating and synchronizing these in a consistent way. In this demonstration, we describe a system, called ScholarLynk, that bridges the local and online worlds and allows users to manage both local and online resources in a uniform way and in collaboration with others.

Demo session 2: KM and DB

LiquidXML: adaptive XML content redistribution BIBAFull-Text 1943-1944
  Jesús Camacho-Rodríguez; Asterios Katsifodimos; Ioana Manolescu; Alexandra Roatis
We propose to demonstrate LiquidXML, a platform for managing large corpora of XML documents in large-scale P2P networks. All LiquidXML peers may publish XML documents to be shared with all the network peers. The challenge then is to efficiently (re-)distribute the published content in the network, possibly in overlapping, redundant fragments, to support efficient processing of queries at each peer. The novelty of LiquidXML relies in its adaptive method of choosing which data fragments are stored where, to improve performance. The "liquid" aspect of XML management is twofold: XML data flows from many sources towards many consumers, and its distribution in the network continuously adapts to improve query performance.
Brown dwarf: a P2P data-warehousing system BIBAFull-Text 1945-1946
  Katerina Doka; Dimitrios Tsoumakos; Nectarios Koziris
In this demonstration we present the Brown Dwarf, a distributed system designed to efficiently store, query and update multidimensional data. Deployed on any number of commodity nodes, our system manages to distribute large volumes of data over network peers on-the-fly and process queries and updates on-line through cooperating nodes that hold parts of a materialized cube. Moreover, it adapts its resources according to demand and hardware failures and is cost-effective both over the required hardware and software components. All the aforementioned functionality will be tested using various datasets and query loads.
RDFViewS: a storage tuning wizard for RDF applications BIBAFull-Text 1947-1948
  François Goasdoué; Konstantinos Karanasos; Julien Leblay; Ioana Manolescu
In recent years, the significant growth of RDF data used in numerous applications has made its efficient and scalable manipulation an important issue. In this paper, we present RDFViewS, a system capable of choosing the most suitable views to materialize, in order to minimize the query response time for a specific SPARQL query workload, while taking into account the view maintenance cost and storage space constraints. Our system employs practical algorithms and heuristics to navigate through the search space of potential view configurations, and exploits the possibly available semantic information -- expressed via an RDF Schema -- to ensure the completeness of the query evaluation.
WS-GraphMatching: a web service tool for graph matching BIBAFull-Text 1949-1950
  Qiong Cheng; Mitsunori Ogihara; Jinpeng Wei; Alexander Zelikovsky
Some emerging applications deal with graph data and rely on graph matching and mining. The service-oriented graph matching and mining tool has been required. In this demo we present the web service tool WS-GraphMatching which supports the efficient and visualized matching of polytrees, series-parallel graphs, and arbitrary graphs with bounded feedback vertex set. Its embedded matching algorithms take in account the similarity of vertex-to-vertex and graph structures, allowing path contraction, vertex deletion, and vertex insertions. It provides one-to-one matching queries as well as queries in batch modes including one-to-many matching mode and many-to-many matching mode. It can be used for predicting unknown structured information, comparing and finding conserved patterns, and resolving ambiguous identification of vertices.
   WS-GraphAligner is available as web-server at: http://odysseus.cs.fiu.edu:8080/MinePW/pages/gmapping/GMMain.html.
FALCON: seamless access to meeting data from the inbox and calendar BIBAFull-Text 1951-1952
  Peter Bjellerup; Karl J. Cama; Mukundan Desikan; Yi Guo; Ajinkya G. Kale; Jennifer C. Lai; Nizar Lethif; Jie Lu; Mercan Topkara; Stephan H. Wissel
We present a system that supports seamless access to information contained in recorded meetings from the cornerstone points of a knowledge worker's daily life: mailbox and calendar. The solution supports granular search of meeting content from an enterprise email system and automatically displays recordings of meetings related to the message the user is currently viewing. Additionally thumbnail summaries of the meetings are added to the user's calendar entries after the meetings have taken place. Lastly our system supports easy sharing of videos associated with recorded meetings through the use of hot-linked thumbnail summaries which can be sent via email.
SPac: a distributed, peer-to-peer, secure and privacy-aware social space BIBAFull-Text 1953-1954
  Angela Bonifati; Hui (Wendy) Wang; Ruilin Liu
To support privacy-aware management of data in social spaces, the user personal data needs to be stored at each user device, and shared only with a trusted subset of other users. To date, social spaces only have fairly limited access control capabilities, that do not protect the possibly sensitive data of the users.
   In this demonstration, we showcase our SPAC system, a distributed, peer-to-peer, secure and privacy-aware social space system. SPAC is equipped with: (i) an SQL-based declarative distributed query language to specify which data to share and whom to share with. Such a language guarantees the fine-grained access to the data, (ii) a fully-decentralized authorization that relies on classic cryptographic protocols to provide robust and resilient key-based encryption for access control enforcement, and (iii) an update-friendly access control mechanism, that also addresses the updates on both the network and the access control policies.
SEQUEL: query completion via pattern mining on multi-column structural data BIBAFull-Text 1955-1956
  Chuancong Gao; Qingyan Yang; Jianyong Wang
In this demonstration, we propose an interactive query completion system on structural data like DBLP, called SEQUEL. It is novel in several aspects: with patterns mined on the structural data using newly devised algorithm, SEQUEL offers high-utility completions composed with not only words but also phrases, and requires no explicit indications of corresponding columns. Instead of using query logs exploited previously for unstructured data, more effective completions are provided based on patterns mined directly from the records. Moreover, an effective index structure helps SEQUEL respond fast at millisecond level for each keystroke.
MI-WDIS: web data integration system for market intelligence BIBAFull-Text 1957-1958
  Zhongmin Yan; Qingzhong Li; Shidong Zhang; Zhaohui Peng; Yongquan Dong; Yanhui Ding; Yongxin Zhang; Xiuxing Xu
As an important supporting technology of Market Intelligence (MI), Web data integration is facing new challenges, such as the integrity of data acquisition, the quality of data extraction and data consolidation. To solve such problems, we propose an MI-oriented web data integration system (MI-WDIS), which achieves excellent performances in integrating Surface Web and Deep Web data with much less manual work. Based on MI-WDIS, we have developed a platform for intelligent analysis of job data. The platform collects tens of thousands of job data daily and provides personalized services for job seekers through diversified channels. Besides, it provides other advanced services, including intelligence analysis, automatic monitoring and alerting, for various organizations, such as enterprises, training institutions and recruitment agencies.
i-SEE: integrated stream execution environment over on-line data streams BIBAFull-Text 1959-1960
  Se Jung Shin; Hong Kyu Park; Ho Jin Woo; Won Suk Lee
The primary objective of various data stream applications is to monitor the on-going variations of data elements newly generated in data streams, so that appropriate services can be delivered to users timely. Basically, such monitoring activities can be divided into three categories: instant event monitoring, frequent behavior monitoring and analytic OLAP measures monitoring. Several prototype systems for data streams have been introduced but they are designated to support only the operations of one or two categories. However, many real-world data stream applications often require mixed operations of the three categories. This demonstration introduces an Integrated Stream Execution Environment (i-SEE) that can support the mixed execution of the operations of the three categories seamlessly, so that an i-SEE system can provide the multifaceted analytic results of on-line data streams continuously for various emerging applications.
Summarizing biological literature with BioSumm BIBAFull-Text 1961-1962
  Elena Baralis; Alessandro Fiori
BioSumm is a summarization environment that supports user queries on online repositories of scientific publications by providing abstract descriptions of focused document groups. The summarization approach is driven by a grading function which evaluates the occurrences of domain dictionary terms.
   The demonstrated system enables users to query and download research papers from online databases (e.g., PubMed) and local repositories. The (possibly large) retrieved document collection is then partitioned into document clusters devoted to homogeneous topics. Finally, documents in a cluster are summarized by extracting sentences relevant for a specific application domain. In the demo the considered domain is the interaction of human genes and proteins.
Exploring and visualizing academic social networks BIBAFull-Text 1963-1964
  Veselin Ganev; Zhaochen Guo; Diego Serrano; Denilson Barbosa; Eleni Stroulia
We demonstrate the ReaSoN portal, consisting of interactive web-based tools for visualizing, exploring, querying, and integrating academic social networks. We describe how these networks are automatically extracted from bibliographic and citation databases, discuss notions of visibility in such networks which enable a rich set of social network analysis, and demonstrate our novel tools for the visualization and exploration of social networks.

Co-located workshop summaries

Summary of the 4th workshop on analytics for noisy unstructured text data (AND) BIBFull-Text 1965-1966
  Roberto Basili; Daniel Lopresti; Christoph Ringlestetter; Shourya Roy; Klaus U. Schulz; L. Venkata Subramaniam
3rd BooksOnline workshop: research advances in large digital book repositories and complementary media BIBAFull-Text 1967-1968
  Gabriella Kazai; Peter Brusilovsky
The goal of the 3rd BooksOnline Workshop is to bring together researchers and industry practitioners in information retrieval, digital libraries, e-books, human computer interaction, publishing industry, and online book services to foster progress on addressing challenges and exploring opportunities around large collections of digital books and complementary media. Towards this goal, the workshop programme consists of contributions both from academia and industry, including two keynote talks: James Crawford from Google Books and John Mark Ockerbloom from the University of Pennsylvania.
Report on the second international workshop on cloud data management (CloudDB 2010) BIBFull-Text 1969-1970
  Xiaofeng Meng; Ying Chen; Jiaheng Lu; Jianliang Xu
DTMBIO workshop summary BIBFull-Text 1971-1972
  Hagit Shatkay; Doheon Lee; Min Song; Shamkant Navathe
DOLAP 2010 workshop summary BIBAFull-Text 1973-1974
  Carlos Ordonez; Il-Yeol Song
The ACM DOLAP workshop presents research on data warehousing and On-Line Analytical Processing (OLAP). The program has three interesting sessions on modeling, query processing and new trends, as well as a keynote talk on OLAP query processing and a panel comparing relational and non-relational technology for data warehousing.
Third workshop on exploiting semantic annotations in information retrieval (ESAIR): CIKM 2010 workshop BIBAFull-Text 1975-1976
  Jaap Kamps; Jussi Karlgren; Ralf Schenkel
There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, and emerging robust NLP tools. These meaningful, semantic, annotations hold the promise to significantly enhance information access, by enhancing the depth of analysis of today's systems. Currently, we have only started exploring the possibilities and only begin to understand how these valuable semantic cues can be put to fruitful use. Unleashing the potential of semantic annotations requires us to think outside the box, by combining the insights of natural language processing (NLP) to go beyond bags of words, the insights of databases (DB) to use structure efficiently even when aggregating over millions of records, the insights of information retrieval (IR) in effective goal-directed search and evaluation, and the insights of knowledge management (KM) to get grips on the greater whole.
   The Workshop aims to bring together researchers from these different disciplines and work together on one of the greatest challenges in the years to come. The desired result of the workshop will be concrete insight into the potential of semantic annotations, and in concrete steps to take this research forward; synchronize related research happening in NLP, DB, IR, and KM, in ways that combine the strengths of each discipline; and have a lively, interactive workshop were everyone contributes and that inspires attendees to think "outside the box".
3rd international workshop on patent information retrieval (PaIR'10) BIBAFull-Text 1977-1978
  Mihai Lupu; John Tait; Katja Mayer; Christopher Harris
The 3rd International Workshop on Patent Information Retrieval builds on the experiences of the first two workshops, to provide its participants an exciting, scientifically challenging and interactive event, where the specific issues of patent retrieval may be put into the general context of Information Retrieval and Knowledge Management, in order to explore innovative solutions to new and old problems, but also to evaluate and adapt traditional or classic approaches to new problems. Between the scientific presentations and posters, distinguished keynote speakers and a panel discussion, PaIR 2010 shapes itself into a significant landmark in the field of domain specific information retrieval.
PIKM 2010: ACM workshop for ph.d. students in information and knowledge management BIBAFull-Text 1979-1980
  Anisoara Nica; Aparna S. Varde
The PIKM workshop focuses on papers consisting mainly of the Ph.D. dissertation proposals of doctoral students. A wide range of topics on any area in databases, information retrieval and knowledge management are presented at this workshop. The areas of interest are similar to those at the CIKM main conference in the three respective tracks. Interdisciplinary work across these tracks is encouraged.
Overview of the 2nd international workshop on search and mining user-generated contents BIBAFull-Text 1981-1982
  Jose Carlos Cortizo; Francisco M. Carrero; Ivan Cantador; Jose Antonio Troyano; Paolo Rosso
This overview introduces the aim of the SMUC 2010 workshop, as well as the list of papers presented in the workshop.