
Proceedings of the 2008 ACM Conference on Information and Knowledge Management

Fullname: Proceedings of the 17th ACM Conference on Information and Knowledge Management
Editors: James G. Shanahan; Sihem Amer-Yahia; Ioana Manolescu; Yi Zhang; David A. Evans; Alek Kolcz; Key-Sun Choi; Abdur Chowdury
Location: Napa Valley, California
Dates: 2008-Oct-26 to 2008-Oct-30
Publisher: ACM
Standard No: ISBN 978-1-59593-991-3; hcibib: CIKM08
Papers: 240
Pages: 1526
  1. DB: faceted search, web query results presentation
  2. IR: web search 1
  3. KM: classification
  4. Industry research track
  5. DB: efficient maintenance and query optimization
  6. IR: social search
  7. IR/KM: machine learning
  8. KM: link and graph mining
  9. KM: information filtering
  10. DB: stream processing
  11. IR: theory
  12. IR: query analysis
  13. KM: web mining
  14. DB/industry: XML data integration and XML query optimization
  15. IR: evaluation
  16. KM: statistical techniques
  17. Panel discussion
  18. DB: indexing and physical query optimization
  19. IR: web search 2
  20. IR: multilingual & multimedia
  21. KM: data mining
  22. KM: semantic techniques
  23. DB: security and privacy
  24. IR: medley
  25. IR: recommender systems
  26. KM: feature selection
  27. Panel discussion 2
  28. IR: advertising & filtering
  29. IR: blog
  30. KM: clustering
  31. IR: enterprise search
  32. IR: structured documents
  33. KM: text mining
  34. DB: mobile and distributed data management
  35. IR: QA
  36. KM: information extraction
  37. Poster session 1: database
  38. Poster session 1: industry
  39. Poster session 1: information retrieval
  40. Poster session 1: knowledge management
  41. Poster session 2: database
  42. Poster session 2: information retrieval
  43. Poster session 2: knowledge management
  44. Poster session 3: database
  45. Poster session 3: information retrieval
  46. Poster session 3: knowledge management
Humane data mining BIBAFull-Text 1-2
  Rakesh Agrawal
Data mining has made tremendous strides in the last decade. It is time to take data mining to the next level of contributions, while continuing to innovate for the current mainstream market. We postulate that a fruitful future direction could be humane data mining: applications that benefit individuals. Potential applications include personal data mining (e.g., personal health), enabling people to get a grip on their world (e.g., dealing with the long tail of search), enabling people to become creative (e.g., inventions arising from linking non-interacting scientific literature), enabling people to make contributions to society (e.g., education collaboration networks), and data-driven science (e.g., the study of ecological disasters and brain disorders). Rooting our future work in these (and similar) applications will lead to new data mining abstractions, algorithms, and systems.

DB: faceted search, web query results presentation

Dynamic faceted search for discovery-driven analysis BIBAFull-Text 3-12
  Debabrata Dash; Jun Rao; Nimrod Megiddo; Anastasia Ailamaki; Guy Lohman
We propose a dynamic faceted search system for discovery-driven analysis on data with both textual content and structured attributes. From a keyword query, we want to dynamically select a small set of "interesting" attributes and present aggregates on them to a user. Similar to work in OLAP exploration, we define "interestingness" as how surprising an aggregated value is, based on a given expectation. We make two new contributions by proposing a novel "navigational" expectation that's particularly useful in the context of faceted search, and a novel interestingness measure through judicious application of p-values. Through a user survey, we find the new expectation and interestingness metric quite effective. We develop an efficient dynamic faceted search system by improving a popular open source engine, Solr. Our system exploits compressed bitmaps for caching the posting lists in an inverted index, and a novel directory structure called a bitset tree for fast bitset intersection. We conduct a comprehensive experimental study on large real data sets and show that our engine performs 2 to 3 times faster than Solr.
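   As a rough illustration of the p-value idea (a sketch under simplifying assumptions of our own, not the authors' exact measure): a facet value is surprising when its observed count deviates from what an expectation model predicts, scored as a binomial tail probability.
     import math

     def surprise(observed, total, expected_rate):
         # Normal approximation to a two-sided binomial tail; a small p-value
         # marks a surprising facet. 'expected_rate' would come from an
         # expectation model such as the paper's "navigational" expectation.
         mu = total * expected_rate
         sigma = math.sqrt(total * expected_rate * (1 - expected_rate)) or 1.0
         z = abs(observed - mu) / sigma
         return math.erfc(z / math.sqrt(2))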
Minimum-effort driven dynamic faceted search in structured databases BIBAFull-Text 13-22
  Senjuti Basu Roy; Haidong Wang; Gautam Das; Ullas Nambiar; Mukesh Mohania
In this paper, we propose minimum-effort driven navigational techniques for enterprise database systems based on the faceted search paradigm. Our proposed techniques dynamically suggest facets for drilling down into the database such that the cost of navigation is minimized. At every step, the system asks the user a question or a set of questions on different facets and depending on the user response, dynamically fetches the next most promising set of facets, and the process repeats. Facets are selected based on their ability to rapidly drill down to the most promising tuples, as well as on the ability of the user to provide desired values for them. Our facet selection algorithms also work in conjunction with any ranked retrieval model where a ranking function imposes a bias over the user preferences for the selected tuples. Our methods are principled as well as efficient, and our experimental study validates their effectiveness on several application scenarios.
A language for manipulating clustered web documents results BIBAFull-Text 23-32
  Gloria Bordogna; Alessandro Campi; Giuseppe Psaila; Stefania Ronchi
We propose a novel language for exploring the results retrieved by several internet search services (such as search engines) that cluster the retrieved documents. The goal is to offer users a tool to discover relevant hidden relationships between clustered documents.
   The proposal is motivated by the observation that visualization paradigms, based on either the ranked list or clustered results, do not allow users to fully exploit the combined use of several search services to answer a request.
   When the same query is submitted to distinct search services, they may produce partially overlapping clustered results, where clusters identified by distinct labels collect some common documents. Moreover, clusters with similar labels, but containing distinct documents, may be produced as well. In such a situation, it may be useful to compare, combine, and rank the cluster contents to single out relevant documents. In the proposed language, we define several operators (inspired by relational algebra) that work on groups of clusters. New clusters (and groups) can be generated by combining (i.e., overlapping, refining, and intersecting) clusters (and groups) in a set-oriented fashion. Furthermore, several ranking functions are proposed to model distinct semantics of the combination.
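   A minimal sketch of one such set-oriented operator (illustrative only; the paper defines a full algebra with ranking semantics): intersecting two groups of clusters returned by different search services.
     def intersect_groups(group_a, group_b):
         # Each group maps a cluster label to the set of document URLs it holds.
         out = {}
         for label_a, docs_a in group_a.items():
             for label_b, docs_b in group_b.items():
                 common = docs_a & docs_b
                 if common:
                     out[(label_a, label_b)] = common
         return out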
Integrating web query results: holistic schema matching BIBAFull-Text 33-42
  Shui-Lung Chuang; Kevin Chen-Chuan Chang
The emergence of numerous data sources online has created a pressing need for more automatic yet accurate data integration techniques. For the data returned from querying such sources, most prior work focuses on how to extract the embedded structured data more accurately. However, to eventually provide integrated access to these query results, a final, essential step is to combine the extracted data coming from different sources. A critical task is finding the correspondence of the data fields between the sources -- a problem well known as schema matching. Query results are a small and biased sample of instances obtained from the sources; the obtained schema information is thus very implicit and incomplete, which often prevents existing schema matching approaches from performing effectively. In this paper, we develop a novel framework for understanding and effectively supporting schema matching on such instance-based data, especially for integrating multiple sources. We view discovering matchings as constructing a more complete domain schema that best describes the input data. With this conceptual view, we can leverage various data instances and observed regularities seamlessly with holistic, multiple-source schema matching to achieve more accurate matching results. Our experiments show that our framework consistently outperforms baseline pairwise and clustering-based approaches (raising F-measure from 50-89% to 89-94%) and works uniformly well across the surveyed domains.

IR: web search 1

How does clickthrough data reflect retrieval quality? BIBAFull-Text 43-52
  Filip Radlinski; Madhu Kurup; Thorsten Joachims
Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a sequence of studies investigating this relationship for an operational search engine on the arXiv.org e-print archive. We find that none of the eight absolute usage metrics we explore (e.g., number of clicks, frequency of query reformulations, abandonment) reliably reflect retrieval quality for the sample sizes we consider. However, we find that paired experiment designs adapted from sensory analysis produce accurate and reliable statements about the relative quality of two retrieval functions. In particular, we investigate two paired comparison tests that analyze clickthrough data from an interleaved presentation of ranking pairs, and we find that both give accurate and consistent results. We conclude that both paired comparison tests give substantially more accurate and sensitive evaluation results than absolute usage metrics in our domain.
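   A minimal sketch of the interleaving idea behind such paired tests (in the spirit of the team-draft variant; details simplified): merge two rankings so each contributes alternately, then credit clicks to the ranker that contributed the clicked result.
     import random

     def team_draft_interleave(run_a, run_b):
         # 'team' records which ranker supplied each shown document.
         def next_unseen(run, seen):
             return next((d for d in run if d not in seen), None)
         merged, team, count = [], {}, {'A': 0, 'B': 0}
         while True:
             cand = {'A': next_unseen(run_a, team), 'B': next_unseen(run_b, team)}
             avail = [t for t in 'AB' if cand[t] is not None]
             if not avail:
                 return merged, team
             # The team with fewer picks goes next; ties broken by coin flip.
             turn = min(avail, key=lambda t: (count[t], random.random()))
             merged.append(cand[turn])
             team[cand[turn]] = turn
             count[turn] += 1

     def credited_clicks(clicks, team):
         # The ranker whose contributed documents attract more clicks wins
         # this impression; aggregating wins over many queries compares rankers.
         wins = {'A': 0, 'B': 0}
         for doc in clicks:
             if doc in team:
                 wins[team[doc]] += 1
         return wins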
Efficient and effective link analysis with precomputed SALSA maps BIBAFull-Text 53-62
  Marc Najork; Nick Craswell
SALSA is a link-based ranking algorithm that takes the result set of a query as input, extends the set to include additional neighboring documents in the web graph, and performs a random walk on the induced subgraph. The stationary probability distribution of this random walk, used as a relevance score, is significantly more effective for ranking purposes than popular query-independent link-based ranking algorithms such as PageRank. Unfortunately, this requires significant effort at query-time, to access the link graph and compute the stationary probability distribution. In this paper, we explore whether it is possible to perform most of the computation off-line, prior to the arrival of any queries. The off-line phase of our approach computes a "score map" for each node in the web graph by performing a SALSA-like algorithm on the neighborhood of that node and retaining the scores of the most promising nodes in the neighborhood graph. The on-line phase takes the results to a query, retrieves the score map of each result, and returns for each result a score that is the sum of the matching scores from each score map. We evaluated this algorithm on a collection of about 28,000 queries with partially labeled results, and found that it is significantly more effective than PageRank, although not quite as effective as SALSA. We also studied the trade-off between ranking effectiveness and space requirements.
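   The described on-line phase amounts to a few dictionary lookups per result (a sketch, assuming the precomputed score maps are in memory as plain dicts):
     def rank_with_score_maps(results, score_maps):
         # A result's score is the sum, over every result's precomputed map,
         # of the entries that mention it; the maps were built offline,
         # SALSA-style, over each node's neighborhood.
         return {r: sum(score_maps.get(other, {}).get(r, 0.0) for other in results)
                 for r in results}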
Achieving both high precision and high recall in near-duplicate detection BIBAFull-Text 63-72
  Lian'en Huang; Lei Wang; Xiaoming Li
To find near-duplicate documents, fingerprint-based paradigms such as Broder's shingling and Charikar's simhash algorithms have been recognized as effective approaches and are considered the state-of-the-art. Nevertheless, we see two aspects of these approaches which may be improved. First, a high score under these algorithms' similarity measures implies a high probability that two documents are similar, which is not the same as the degree of similarity itself; yet how similar two documents are is what we really need to know. Second, fingerprint paradigms must trade off hash-code length against hash-code multiplicity, which makes it hard to maintain a satisfactory recall level while improving precision.
   In this paper our contributions are twofold. First, we propose a framework for implementing the longest common subsequence (LCS) as a similarity measure in reasonable computing time, which leads to both high precision and high recall. Second, we present an algorithm that extracts a trustable partition from the LCS to reduce the negative impact of templates used in web page design. A comprehensive experiment was conducted to evaluate our method in terms of effectiveness, efficiency, and result quality. The method has been successfully used to partition a set of 430 million web pages into 68 million subsets of similar pages, which demonstrates its effectiveness. For quality, we compared our method with simhash and a Cosine-based method through a sampling process (Cosine is compared to LCS as an alternative similarity measure). Our algorithm reached an overall precision of 0.95, while simhash reached 0.71 and Cosine 0.82. At the same time, our method obtains 1.86 times the recall of simhash and 1.56 times the recall of Cosine. A comparison experiment was also done for documents within the same web sites; there, our algorithm, simhash, and Cosine find almost the same number of true positives, at precisions of 0.91, 0.50, and 0.63, respectively. In terms of efficiency, our algorithm takes 118 hours to process the whole archive of 430 million topic-type pages on a cluster of six Linux boxes, while simhash and Cosine take 94 and 68 hours, respectively. When the need for word segmentation in languages such as Chinese is considered, the processing time of Cosine grows accordingly; in our experiment it was 602 hours.
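   For reference, the quadratic LCS dynamic program on word sequences looks as follows (a sketch; the paper's framework is precisely about avoiding this naive cost at web scale, and its normalization may differ):
     def lcs_length(a, b):
         # Standard O(len(a)*len(b)) dynamic program, one row at a time.
         prev = [0] * (len(b) + 1)
         for x in a:
             cur = [0]
             for j, y in enumerate(b, 1):
                 cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
             prev = cur
         return prev[-1]

     def lcs_similarity(doc_a, doc_b):
         # One plausible normalization: LCS length over the longer document.
         a, b = doc_a.split(), doc_b.split()
         return lcs_length(a, b) / max(len(a), len(b)) if a and b else 0.0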
Are click-through data adequate for learning web search rankings? BIBAFull-Text 73-82
  Zhicheng Dou; Ruihua Song; Xiaojie Yuan; Ji-Rong Wen
Learning-to-rank algorithms, which can automatically adapt ranking functions in web search, require a large volume of training data. A traditional way of generating training examples is to employ human experts to judge the relevance of documents, which is difficult, time-consuming, and costly. In this paper, we study the problem of exploiting click-through data, which can be collected at much lower cost, for learning web search rankings. We extract pairwise relevance preferences from a large-scale aggregated click-through dataset, compare these preferences with explicit human judgments, and use them as training examples to learn ranking functions. We find click-through data are useful and effective in learning ranking functions. A straightforward use of aggregated click-through data can outperform human judgments. We demonstrate that the strategies are only slightly affected by fraudulent clicks. We also reveal that the most reliable pairs, e.g., pairs consisting of documents with large click-frequency differences, are not by themselves sufficient for learning.
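   One widely used way to turn clicks into training pairs (a sketch of the general idea; the paper derives its preferences from aggregated click frequencies rather than single sessions):
     def skip_above_pairs(ranking, clicked):
         # Skip-above heuristic: a clicked document is preferred over every
         # unclicked document ranked above it.
         clicked = set(clicked)
         pairs = []
         for i, doc in enumerate(ranking):
             if doc in clicked:
                 pairs += [(doc, above) for above in ranking[:i]
                           if above not in clicked]
         return pairs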

KM: classification

Error-driven generalist+experts (edge): a multi-stage ensemble framework for text categorization BIBAFull-Text 83-92
  Jian Huang; Omid Madani; C. Lee Giles
We introduce a multi-stage ensemble framework, Error-Driven Generalist+Expert or Edge, for improved classification on large-scale text categorization problems. Edge first trains a generalist, capable of classifying under all classes, to deliver a reasonably accurate initial category ranking given an instance. Edge then computes a confusion graph for the generalist and allocates the learning resources to train experts on relatively small groups of classes that tend to be systematically confused with one another by the generalist. The experts' votes, when invoked on a given instance, yield a reranking of the classes, thereby correcting the errors of the generalist. Our evaluations showcase the improved classification and ranking performance on several large-scale text categorization datasets. Edge is particularly efficient when the underlying learners are efficient. Our study of confusion graphs is also of independent interest.
A sparse Gaussian processes classification framework for fast tag suggestions BIBAFull-Text 93-102
  Yang Song; Lu Zhang; C. Lee Giles
Tagged data is rapidly becoming more available on the World Wide Web. Web sites that provide tagging services offer a good way for Internet users to share their knowledge. An interesting problem is how to make tag suggestions when a new resource becomes available. In this paper, we address the issue of efficient tag suggestion. We first propose a multi-class sparse Gaussian process classification framework (SGPS) which is capable of classifying data with very few training instances. We suggest a novel prototype selection algorithm to select the best subset of points for model learning. The framework is then extended to a novel multi-class multi-label classification algorithm (MMSG) that transforms tag suggestion into the problem of multi-label ranking. Experiments on benchmark data sets and real-world data from Del.icio.us and BibSonomy suggest that our model can greatly improve the performance of tag suggestions when compared to the state-of-the-art. Overall, our model requires linear time to train and constant time to predict per case. Its memory consumption is also significantly lower than that of traditional batch learning algorithms such as SVMs. In addition, results on tagging digital data also demonstrate that our model is capable of recommending relevant tags to images and videos by using their surrounding textual information.
Transfer learning from multiple source domains via consensus regularization BIBAFull-Text 103-112
  Ping Luo; Fuzhen Zhuang; Hui Xiong; Yuhong Xiong; Qing He
Recent years have witnessed an increased interest in transfer learning. Despite the vast amount of research performed in this field, challenges remain in applying knowledge learnt from multiple source domains to a target domain. First, data from multiple source domains can be semantically related but have different distributions, and it is not clear how to exploit the distribution differences among multiple source domains to boost the learning performance in a target domain. Second, many real-world applications demand that this transfer learning be performed in a distributed manner. To meet these challenges, we propose a consensus regularization framework for transfer learning from multiple source domains to a target domain. In this framework, a local classifier is trained by considering both the local data available in a source domain and the prediction consensus with the classifiers from other source domains. In addition, the training algorithm can be implemented in a distributed manner, in which the source domains are treated as slave nodes and the target domain as the master node. To combine the training results from multiple source domains, the nodes need only share some statistical data rather than the full contents of their labeled data. This modestly relieves privacy concerns and avoids the need to upload all data to a central location. Finally, our experimental results show the effectiveness of our consensus regularization approach.
Classifying networked entities with modularity kernels BIBAFull-Text 113-122
  Dell Zhang; Robert Mao
Statistical machine learning techniques for data classification usually assume that all entities are i.i.d. (independent and identically distributed). However, real-world entities often interconnect with each other through explicit or implicit relationships to form a complex network. Although some graph-based classification methods have emerged in recent years, they are not well suited to complex networks because they do not take the degree distribution of the network into consideration. In this paper, we propose a new technique, the Modularity Kernel, that can effectively exploit the latent community structure of networked entities for their classification. A number of experiments on hypertext datasets show that our proposed approach leads to excellent classification performance in comparison with the state-of-the-art methods.
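   A sketch of the modularity machinery such a kernel can build on (the paper's exact construction may differ): the modularity matrix contrasts observed edges with a degree-preserving random graph, and a positive semidefinite kernel can be formed from its leading eigenvectors.
     import numpy as np

     def modularity_kernel(A, k=10):
         # B = A - d d^T / 2m measures how much more connected node pairs
         # are than chance; A is a symmetric 0/1 adjacency matrix.
         d = A.sum(axis=1)
         B = A - np.outer(d, d) / d.sum()
         w, V = np.linalg.eigh(B)             # eigh: B is symmetric
         top = np.argsort(w)[-k:]
         w_top = np.clip(w[top], 0.0, None)   # drop negative eigenvalues (PSD)
         return (V[:, top] * w_top) @ V[:, top].T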

Industry research track

Web-scale named entity recognition BIBAFull-Text 123-132
  Casey Whitelaw; Alex Kehlenbeck; Nemanja Petrovic; Lyle Ungar
Automatic recognition of named entities such as people, places, organizations, books, and movies across the entire web presents a number of challenges, both of scale and scope. Data for training general named entity recognizers is difficult to come by, and efficient machine learning methods are required once we have found hundreds of millions of labeled observations. We present an implemented system that addresses these issues, including a method for automatically generating training data, and a multi-class online classification training method that learns to recognize not only high level categories such as place and person, but also more fine-grained categories such as soccer players, birds, and universities. The resulting system gives precision and recall performance comparable to that obtained for more limited entity types in much more structured domains such as company recognition in newswire, even though web documents often lack consistent capitalization and grammatical sentence construction.
Semi-automated logging of contact center telephone calls BIBAFull-Text 133-142
  Roy J. Byrd; Mary S. Neff; Wilfried Teiken; Youngja Park; Keh-Shin F. Cheng; Stephen C. Gates; Karthik Visweswariah
Modern businesses use contact centers as a communication channel with users of their products and services. The largest factor in the expense of running a telephone contact center is the labor cost of its agents. IBM Research has built a new system, Contact-Center Agent Buddies (CAB), which is designed to help reduce the average handle time (AHT) for customer calls, thereby also reducing their cost. In this paper, we focus on the call logging subsystem, which helps agents reduce the time they spend documenting those calls. We built a Template CAB and a Call Logging CAB, using a pipeline consisting of audio capture of a telephone conversation, automatic speech recognition, text analysis, and log generation. We developed techniques for ASR text cleansing, including normalization of expressions and acronyms, domain terms, capitalization, and boundaries for sentences, paragraphs, and call segments. We found that simple heuristics suffice to generate high-quality logs from the normalized sentences. The pipeline yields a candidate call log which the agents can edit in less time than it takes them to generate call logs manually. Evaluation of the Call Logging CAB in an industrial contact center environment shows that it reduces the amount of time agents spend logging calls by at least 50% without compromising the quality of the resulting call documentation.
MedSearch: a specialized search engine for medical information retrieval BIBAFull-Text 143-152
  Gang Luo; Chunqiang Tang; Hao Yang; Xing Wei
People are thirsty for medical information. Existing Web search engines often cannot handle medical search well because they do not consider its special requirements. Often a medical information searcher is uncertain about his exact questions and unfamiliar with medical terminology. Therefore, he sometimes prefers to pose long queries, describing his symptoms and situation in plain English, and to receive comprehensive, relevant information from the search results. This paper presents MedSearch, a specialized medical Web search engine, to address these challenges. MedSearch uses several key techniques to improve its usability and the quality of its search results. First, it accepts queries of extended length and reformulates long queries into shorter ones by extracting a subset of important and representative words. This not only significantly increases query processing speed but also improves the quality of search results. Second, it provides diversified search results. Lastly, it suggests related medical phrases to help the user quickly digest search results and refine the query. We evaluated MedSearch using medical questions posted on medical discussion forums. The results show that MedSearch can handle various medical queries effectively and efficiently.
An empirical study of required dimensionality for large-scale latent semantic indexing applications BIBAFull-Text 153-162
  Roger B. Bradford
The technique of latent semantic indexing is used in a wide variety of commercial applications. In these applications, the processing time and RAM required for SVD computation, and the processing time and RAM required during LSI retrieval operations are all roughly linear in the number of dimensions, k, chosen for the LSI representation space. In large-scale commercial LSI applications, reducing k values could be of significant value in reducing server costs. This paper explores the effects of varying dimensionality.
   The approach taken here focuses on term comparisons. Pairs of terms are considered which have strong real-world associations. The proximities of members of these pairs in the LSI space are compared at multiple values of k. The testing is carried out for collections ranging from one to five million documents. For the five million document collection, a value of k ≈ 400 provides the best performance.
   The results suggest that there is something of an 'island of stability' in the k = 300 to 500 range. The results also indicate that there is relatively little room to employ k values outside of this range without incurring significant distortions in at least some term-term correlations.
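   The measurement behind such a sweep can be sketched as follows (assuming a small dense term-document matrix for illustration; production LSI uses large-scale sparse SVD):
     import numpy as np

     def term_cosines_at_k(td_matrix, pair, ks):
         # LSI term vectors are rows of U_k * S_k; track the cosine of one
         # strongly associated term pair while sweeping the dimensionality k.
         U, s, _ = np.linalg.svd(td_matrix, full_matrices=False)
         i, j = pair
         out = {}
         for k in ks:
             a, b = U[i, :k] * s[:k], U[j, :k] * s[:k]
             out[k] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
         return out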

DB: efficient maintenance and query optimization

Content-based filtering for efficient online materialized view maintenance BIBAFull-Text 163-172
  Gang Luo; Philip S. Yu
Real-time materialized view maintenance has become increasingly popular, especially in real-time data warehousing and data streaming environments. Upon updates to base relations, maintaining the corresponding materialized views can place a heavy burden on the RDBMS. A traditional method to mitigate this problem is to use the where-clause condition in the materialized view definition to detect whether an update to a base relation is relevant and can affect the materialized view. However, this detection method does not consider the content of the base relations and hence misses a large number of filtering opportunities. In this paper, we propose a content-based method for detecting irrelevant updates to the base relations of a materialized view. At the cost of using more space, this method increases the probability of catching irrelevant updates by judiciously designing filtering relations to capture the content of the base relations. Based on the content-based method, a prototype real-time data warehouse has been implemented on top of IBM's System S using IBM DB2. Using an analytical model and our prototype, we show that the content-based method can catch most (or all) irrelevant updates to base relations that are missed by the traditional method. Thus, when the fraction of irrelevant updates is non-negligible, the load on the RDBMS due to materialized view maintenance can be significantly reduced.
A step towards incremental maintenance of the composed schema mapping BIBAFull-Text 173-182
  Gang Qian; Yisheng Dong
Schema mapping plays a fundamental role in modern information systems. Mapping composition is an operator that combines a chain of successive schema mappings into a single schema mapping. By pre-computing the composed schema mapping, the system can achieve significant performance benefits. However, when a change occurs to any mapping in the chain, the composed schema mapping has to be maintained correspondingly. In this paper we consider a restricted form of the problem in the XML setting and propose an incremental maintenance approach. Specifically, given a chain of successive mappings, we first transform them into trees that consist of atomic rules and then divide the composition into sub-compositions of the atomic rules. This divided composition provides a fine-grained view of the composition relationships between the mappings, which we capture in an auxiliary data structure called the composition relationship graph (CRG). When changes occur to any mapping in the chain, the corresponding maintenance algorithms, built on the dividing approach and the CRG, compute the changes to the composed mapping and then repair it into the new version, such that the computation involves only the atomic rules relevant to the maintenance. We evaluate our maintenance approach and report first experimental results, which show that it is efficient.
Modeling and exploiting query interactions in database systems BIBAFull-Text 183-192
  Mumtaz Ahmad; Ashraf Aboulnaga; Shivnath Babu; Kamesh Munagala
The typical workload in a database system consists of a mixture of multiple queries of different types, running concurrently and interacting with each other. Hence, optimizing performance requires reasoning about query mixes and their interactions, rather than considering individual queries or query types. In this paper, we show the significant impact that query interactions can have on workload performance. We present a new approach based on planning experiments and statistical modeling to capture the impact of query interactions. This approach requires no prior assumptions about the internal workings of the database system or the nature or cause of query interactions, making it portable across systems. As a concrete demonstration of the potential of capturing, modeling, and exploiting query interactions, we develop a novel interaction-aware query scheduler that targets report-generation workloads in Business Intelligence (BI) settings. Under certain assumptions, the schedule found by this scheduler is within a constant factor of optimal. An experimental evaluation with TPC-H queries on IBM DB2 demonstrates that our scheduler consistently outperforms (up to 4x) conventional schedulers that do not account for query interactions.
A novel optimization approach to efficiently process aggregate similarity queries in metric access methods BIBAFull-Text 193-202
  Humberto L. Razente; Maria Camila N. Barioni; Agma Juci M. Traina; Christos Faloutsos; Caetano Traina, Jr.
A similarity query takes an element as the query center and searches a dataset for either the elements within a bounding radius of that center or its k nearest neighbors. Several algorithms have been developed to efficiently execute similarity queries. However, some queries require more than one center; we call these Aggregate Similarity Queries. Such queries arise when the user gives multiple desirable examples and requests data elements that are similar to all of them, as in the case of relevance feedback. Here we give the first algorithms that can handle aggregate similarity queries on Metric Access Methods (MAMs) such as the M-tree and the Slim-tree. Our method, which we call Metric Aggregate Similarity Search (MASS), has the following properties: (a) it requires only the triangle inequality property; (b) it guarantees no false dismissals, as we prove that it lower-bounds the aggregate distance scores; (c) it can work with any MAM; (d) it can handle any number of query centers, whether scattered all over the space or concentrated in a restricted region. Experiments on both real and synthetic data show that our method scales with the number of elements and, if the dataset is in a spatial domain, also with its dimensionality. Moreover, it achieves better results than previous related methods.
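   Property (b) can be illustrated for a sum aggregate (a sketch; MASS itself supports other aggregate functions and arbitrary MAMs): the triangle inequality bounds every distance to an element inside a covering ball, so whole subtrees can be pruned unread.
     def aggregate_lower_bound(dists_q_to_pivot, radius):
         # For any x in the ball B(pivot, radius): d(q_i, x) >= d(q_i, pivot) - radius,
         # so the sum over all query centers is lower-bounded without reading
         # the ball's contents.
         return sum(max(0.0, d - radius) for d in dists_q_to_pivot)

     def can_prune(dists_q_to_pivot, radius, kth_best_aggregate):
         # Skip the subtree if even its best possible aggregate cannot beat
         # the current k-th best answer.
         return aggregate_lower_bound(dists_q_to_pivot, radius) >= kth_best_aggregate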

IR: social search

Can all tags be used for search? BIBAFull-Text 203-212
  Kerstin Bischoff; Claudiu S. Firan; Wolfgang Nejdl; Raluca Paiu
Collaborative tagging has become an increasingly popular means for sharing and organizing Web resources, leading to a huge amount of user generated metadata. These tags represent quite a few different aspects of the resources they describe and it is not obvious whether and how these tags or subsets of them can be used for search. This paper is the first to present an in-depth study of tagging behavior for very different kinds of resources and systems -- Web pages (Del.icio.us), music (Last.fm), and images (Flickr) -- and compares the results with anchor text characteristics. We analyze and classify sample tags from these systems, to get an insight into what kinds of tags are used for different resources, and provide statistics on tag distributions in all three tagging environments. Since even relevant tags may not add new information to the search procedure, we also check overlap of tags with content, with metadata assigned by experts and from other sources. We discuss the potential of different kinds of tags for improving search, comparing them with user queries posted to search engines as well as through a user survey. The results are promising and provide more insight into both the use of different kinds of tags for improving search and possible extensions of tagging systems to support the creation of potentially search-relevant tags.
Comparing citation contexts for information retrieval BIBAFull-Text 213-222
  Anna Ritchie; Stephen Robertson; Simone Teufel
In previous work, we have shown that using terms from around citations in citing papers to index the cited paper, in addition to the cited paper's own terms, can improve retrieval effectiveness. Now, we investigate how to select text from around the citations in order to extract good index terms. We compare the retrieval effectiveness that results from a range of contexts around the citations, including no context, the entire citing paper, some fixed windows and several variations with linguistic motivations. We conclude with an analysis of the benefits of more complex, linguistically motivated methods for extracting citation index terms, over using a fixed window of terms. We speculate that there might be some advantage to using computational linguistic techniques for this task.
Social tags: meaning and suggestions BIBAFull-Text 223-232
  Fabian M. Suchanek; Milan Vojnovic; Dinan Gunawardena
This paper aims to quantify two common assumptions about social tagging: (1) that tags are "meaningful" and (2) that the tagging process is influenced by tag suggestions. For (1), we analyze the semantic properties of tags and the relationship between the tags and the content of the tagged page. Our analysis is based on a corpus of search keywords, contents, titles, and tags applied to several thousand popular Web pages. Among other results, we find that the more popular tags of a page tend to be the more meaningful ones. For (2), we develop a model of how the influence of tag suggestions can be measured. From a user study with over 4,000 participants, we conclude that roughly one third of the tag applications may be induced by the suggestions. Our results would be of interest for designers of social tagging systems and are a step towards understanding how to best leverage social tags for applications such as search and information extraction.
Mining social networks using heat diffusion processes for marketing candidates selection BIBAFull-Text 233-242
  Hao Ma; Haixuan Yang; Michael R. Lyu; Irwin King
Social network marketing techniques employ pre-existing social networks to increase brand or product awareness through word-of-mouth promotion. A full understanding of social network marketing and of the potential candidates to be marketed to certainly offers lucrative opportunities for prospective sellers. Due to the complexity of social networks, few models exist to interpret social network marketing realistically. We propose to model social network marketing using Heat Diffusion Processes. This paper presents three diffusion models, along with three algorithms for selecting the best individuals to receive marketing samples. These approaches have the following advantages in capturing the properties of real-world social networks: (1) we can plan a marketing strategy sequentially in time, since we include a time factor in the simulation of product adoptions; (2) the algorithm for selecting marketing candidates represents and utilizes the clustering property of real-world social networks; and (3) the model we construct can diffuse both positive and negative comments on products or brands, in order to simulate the complicated communications within social networks. Our work represents a novel approach to the analysis of social network marketing, and is the first to propose how to defend against negative comments within social networks. Complexity analysis shows our model is also scalable to very large social networks.
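   A discrete heat-diffusion step on a graph can be sketched as below (a generic simulation, not the paper's three specific models; alpha and the seed vector are illustrative parameters):
     import numpy as np

     def diffuse(A, f0, alpha=0.5, t=1.0, steps=50):
         # Simulates df/dt = alpha * (P - I) f, where P is the row-normalized
         # adjacency matrix; f0 carries initial heat (positive or negative
         # comments) placed on the selected marketing candidates.
         deg = np.maximum(A.sum(axis=1), 1.0)
         P = A / deg[:, None]
         f, dt = f0.astype(float).copy(), t / steps
         for _ in range(steps):
             f += alpha * dt * (P @ f - f)
         return f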

IR/KM: machine learning

Exploiting temporal contexts in text classification BIBAFull-Text 243-252
  Leonardo Rocha; Fernando Mourão; Adriano Pereira; Marcos André Gonçalves; Wagner Meira, Jr.
Due to the increasing amount of information being stored and accessible through the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use it to classify unseen documents. One major challenge in building classifiers is dealing with the temporal evolution of the characteristics of the documents and of the classes to which they belong. However, most current techniques for ADC do not consider this evolution while building and using the models. Previous results show that the performance of classifiers may be affected by three different temporal effects (class distribution, term distribution, and class similarity). Further, it has been shown that using just portions of the pre-classified documents, which we call contexts, for building the classifiers results in better performance, as a consequence of minimizing the aforementioned effects.
   In this paper we define temporal contexts as the portions of the documents that minimize those effects. We then propose a general algorithm for determining such contexts, discuss its implementation-related issues, and propose a heuristic that is able to determine temporal contexts efficiently. In order to demonstrate the effectiveness of our strategy, we evaluated it using two distinct collections: ACM-DL and MedLine. We initially evaluated the reduction in terms of both the effort to build a classifier and the entropy associated with each context. Further, we evaluated whether these observed reductions translate into better classification performance by employing a very simple classifier, majority voting. The results show that we achieved precision gains of up to 30% compared to a version that is not temporally contextualized, and the same accuracy as a state-of-the-art classifier (SVM), while running up to hundreds of times faster.
Kernel methods, syntax and semantics for relational text categorization BIBAFull-Text 253-262
  Alessandro Moschitti
Previous work on Natural Language Processing for Information Retrieval has shown the inadequacy of semantic and syntactic structures for both document retrieval and categorization. The main reason is the high reliability and effectiveness of language models, which are sufficient to accurately solve such retrieval tasks. However, when the tasks involve the computation of relational semantics between text fragments, simple statistical models may prove ineffective. In this paper, we show that syntactic and semantic structures can be used to greatly improve complex categorization tasks such as determining whether an answer correctly responds to a question. Given the high complexity of representing semantic/syntactic structures in learning algorithms, we applied kernel methods along with Support Vector Machines to better exploit the needed relational information. Our experiments on answer classification on Web and TREC data show that our models greatly improve on bag-of-words models.
BNS feature scaling: an improved representation over tf-idf for SVM text classification BIBAFull-Text 263-270
  George Forman
In the realm of machine learning for text classification, TF-IDF is the most widely used representation for real-valued feature vectors. However, IDF is oblivious to the training class labels and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which has previously been found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) with BNS scaling. A wide variety of other feature representations, as well as binary features with no scaling, were also tested and found inferior. Moreover, BNS scaling yielded better performance without feature selection, obviating the need for feature selection.
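   BNS for a single word is the gap between the inverse Normal CDF of its true-positive and false-positive rates (a sketch; the clipping constant is an implementation detail that may differ from the paper's):
     from scipy.stats import norm

     def bns(tp, fp, pos, neg, eps=0.0005):
         # tp/fp: positive/negative-class documents containing the word;
         # rates are clipped away from 0 and 1 so the inverse CDF stays finite.
         tpr = min(max(tp / pos, eps), 1 - eps)
         fpr = min(max(fp / neg, eps), 1 - eps)
         return abs(norm.ppf(tpr) - norm.ppf(fpr))

     # Feature vectors then scale term frequency by BNS instead of IDF.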
Learning a two-stage SVM/CRF sequence classifier BIBAFull-Text 271-278
  Guilherme Hoefel; Charles Elkan
Learning a sequence classifier means learning to predict a sequence of output tags based on a set of input data items. For example, recognizing that a handwritten word is "cat", based on three images of handwritten letters and on general knowledge of English letter combinations, is a sequence classification task. This paper describes a new two-stage approach to learning a sequence classifier that is (i) highly accurate, (ii) scalable, and (iii) easy to use in data mining applications. The two-stage approach combines support vector machines (SVMs) and conditional random fields (CRFs). It is (i) highly accurate because it benefits from the maximum-margin nature of SVMs and also from the ability of CRFs to model correlations between neighboring output tags. It is (ii) scalable because the input to each SVM is a small training set, and the input to the CRF has a small number of features, namely the SVM outputs. It is (iii) easy to use because it combines existing published software in a straightforward way. In detailed experiments on the task of recognizing handwritten words, we show that the two-stage approach is more accurate, or faster and more scalable, or both, than leading other methods for learning sequence classifiers, including max-margin Markov networks (M3Ns) and standard CRFs.

KM: link and graph mining

Local approximation of PageRank and Reverse PageRank BIBAFull-Text 279-288
  Ziv Bar-Yossef; Li-Tal Mashiach
We consider the problem of approximating the PageRank of a target node using only local information provided by a link server. This problem was originally studied by Chen, Gan, and Suel (CIKM 2004), who presented an algorithm for tackling it. We prove that local approximation of PageRank, even to within modest approximation factors, is infeasible in the worst case, as it requires probing the link server for Ω(n) nodes, where n is the size of the graph. The difficulty emanates from nodes of high in-degree and/or from slow convergence of the PageRank random walk.
   We show that when the graph has bounded in-degree and admits fast PageRank convergence, then local PageRank approximation can be done using a small number of queries. Unfortunately, natural graphs, such as the web graph, are abundant with high in-degree nodes, making this algorithm (or any other local approximation algorithm) too costly. On the other hand, reverse natural graphs tend to have low in-degree while maintaining fast PageRank convergence. It follows that calculating Reverse PageRank locally is frequently more feasible than computing PageRank locally.
   We demonstrate that Reverse PageRank is useful for several applications, including computation of hub scores for web pages, finding influencers in social networks, obtaining good seeds for crawling, and measurement of semantic relatedness between concepts in a taxonomy.
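   Reverse PageRank itself is simply PageRank run on the graph with every edge flipped (a small power-iteration sketch on an adjacency-list dict; assumes every node appears as a key):
     def pagerank(adj, d=0.85, iters=50):
         nodes = list(adj)
         pr = {u: 1.0 / len(nodes) for u in nodes}
         for _ in range(iters):
             nxt = {u: (1 - d) / len(nodes) for u in nodes}
             for u, outs in adj.items():
                 share = d * pr[u] / (len(outs) or len(nodes))
                 for v in (outs or nodes):   # dangling nodes spread uniformly
                     nxt[v] += share
             pr = nxt
         return pr

     def reverse_pagerank(adj):
         # Flip all edges, then run ordinary PageRank.
         rev = {u: [] for u in adj}
         for u, outs in adj.items():
             for v in outs:
                 rev.setdefault(v, []).append(u)
         return pagerank(rev)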
Link privacy in social networks BIBAFull-Text 289-298
  Aleksandra Korolova; Rajeev Motwani; Shubha U. Nabar; Ying Xu
We consider a privacy threat to a social network in which the goal of an attacker is to obtain knowledge of a significant fraction of the links in the network. We formalize the typical social network interface and the information about links that it provides to its users in terms of lookahead. We consider a particular threat where an attacker subverts user accounts to get information about local neighborhoods in the network and pieces them together in order to get a global picture. We analyze, both experimentally and theoretically, the number of user accounts an attacker would need to subvert for a successful attack, as a function of his strategy for choosing users whose accounts to subvert and a function of lookahead provided by the network. We conclude that such an attack is feasible in practice, and thus any social network that wishes to protect the link privacy of its users should take great care in choosing the lookahead of its interface, limiting it to 1 or 2, whenever possible.
On effective presentation of graph patterns: a structural representative approach BIBAFull-Text 299-308
  Chen Chen; Cindy Xide Lin; Xifeng Yan; Jiawei Han
In the past, quite a few fast algorithms have been developed to mine frequent patterns over graph data, with a large spectrum covering many variants of the problem. However, the real bottleneck for knowledge discovery on graphs is neither efficiency nor scalability, but the usability of the patterns that are mined out. Currently, what state-of-the-art techniques give is a lengthy list of exact patterns, which are undesirable in two aspects: (1) on the micro side, due to various inherent noises or data diversity, exact patterns are usually not very useful in many real applications; and (2) on the macro side, the rigid structural requirement often generates an excessive number of patterns that are only slightly different from each other, which easily overwhelm the users.
   In this paper, we study the presentation problem of graph patterns, where structural representatives are the key mechanism that makes the whole strategy effective. As a solution to fill the usability gap, we adopt a two-step smoothing-clustering framework, with the first step adding error tolerance to individual patterns (the micro side), and the second step reducing output cardinality by collapsing multiple structurally similar patterns into one representative (the macro side). This integrative approach has not been tried in previous studies; it essentially rolls up our attention to a more appropriate level that no longer looks into every minute detail. The framework is general: it applies under various settings and admits many extensions. Empirical studies indicate that a compact group of informative delegates can be achieved on real datasets and that the proposed algorithms are both efficient and scalable.
Characterizing and predicting community members from evolutionary and heterogeneous networks BIBAFull-Text 309-318
  Qiankun Zhao; Sourav S. Bhowmick; Xin Zheng; Kai Yi
Mining different types of communities from web data has attracted a lot of research effort in recent years. However, none of the existing community mining techniques takes into account both the dynamic and the heterogeneous nature of web data. In this paper, we propose to characterize and predict community members from the evolution of heterogeneous web data. We first propose a general framework for analyzing the evolution of heterogeneous networks. Then, an academic network, extracted from 1 million computer science papers, is used as an example to illustrate the framework. Finally, two example applications of the academic network are presented. Experimental results with a real and very large heterogeneous academic network show that our proposed framework produces good results in terms of community member recommendation. Also, novel knowledge and insights can be gained by analyzing community evolution patterns.

KM: information filtering

An algorithm to determine peer-reviewers BIBAFull-Text 319-328
  Marko A. Rodriguez; Johan Bollen
The peer-review process is the most widely accepted certification mechanism for officially accepting the written results of researchers within the scientific community. An essential component of peer review is the identification of competent referees to review a submitted manuscript. This article presents an algorithm to automatically determine the most appropriate reviewers for a manuscript by way of a co-authorship network data structure and a relative-rank particle-swarm algorithm. This approach is novel in that it is not limited to a pre-selected set of referees, is computationally efficient, requires no human intervention, and, in some instances, can automatically identify conflict-of-interest situations. A useful application of this algorithm would be open commentary peer-review systems, because it provides a weighting for each referee with respect to their expertise in the domain of a manuscript. The algorithm is validated using referee bid data from the 2005 Joint Conference on Digital Libraries.
Spam characterization and detection in peer-to-peer file-sharing systems BIBAFull-Text 329-338
  Dongmei Jia; Wai Gen Yee; Ophir Frieder
Spam is highly pervasive in P2P file-sharing systems and is difficult to detect automatically before actually downloading a file due to the insufficient and biased description of a file returned to a client as a query result. To alleviate this problem, we first characterize spam and spammers in the P2P file-sharing environment and then describe feature-based techniques for automatically detecting spam in P2P query result sets. Experimental results show that the proposed techniques successfully decrease the amount of spam by 9% in the top-200 results and by 92% in the top-20 results.
Predicting web spam with HTTP session information BIBAFull-Text 339-348
  Steve Webb; James Caverlee; Calton Pu
Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers). Concretely, we built an HTTP session classifier based on our predictive technique, and by incorporating this classifier into HTTP retrieval operations, we are able to detect Web spam pages before the actual content transfer. As a result, our approach protects Web users from Web-propagated malware, and it generates significant bandwidth and storage savings. By applying our predictive technique to a corpus of almost 350,000 Web spam instances and almost 400,000 legitimate instances, we were able to successfully detect 88.2% of the Web spam pages with a false positive rate of only 0.4%. These classification results are superior to previous evaluation results obtained with traditional link-based and content-based techniques. Additionally, our experiments show that our approach saves an average of 15.4 KB of bandwidth and storage resources for every successfully identified Web spam page, while only adding an average of 101 microseconds to each HTTP retrieval operation. Therefore, our predictive technique can be successfully deployed in applications that demand real-time spam detection.
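   The flavor of such lightweight, pre-download features can be sketched as follows (illustrative feature choices, not the paper's exact feature set):
     def session_features(ip, headers):
         # Everything here is available before any page content is transferred:
         # the hosting network and the shape of the HTTP response headers.
         feats = {"ip_prefix=" + ".".join(ip.split(".")[:2]): 1}
         for name, value in headers.items():
             feats["has_header=" + name.lower()] = 1
             if name.lower() in ("server", "x-powered-by", "content-type"):
                 feats[name.lower() + "=" + value.strip().lower()] = 1
         return feats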
Inferring semantic query relations from collective user behavior BIBAFull-Text 349-358
  Nish Parikh; Neel Sundaresan
In this paper we describe how high-quality transaction data, comprising the online searching, product viewing, and product buying activity of a large online community, can be used to infer semantic relationships between queries. We work with a large-scale query log consisting of around 115 million queries from eBay. We discuss various techniques to infer semantic relationships among queries and show how the results from these methods can be combined to measure the strength and depict the kinds of relationships. Further, we show how this extraction of relations can be used to improve search relevance, related query recommendations, and recovery from null results in an eCommerce context.

DB: stream processing

Anomaly-free incremental output in stream processing BIBAFull-Text 359-368
  George A. Mihaila; Ioana Stanoi; Christian A. Lang
Continuous queries enable alerts, predictions, and early warning in various domains such as health care, business process monitoring, financial applications, and environment protection. Currently, the consistency of the result cannot be assessed by the application, since only the query processor has enough internal information to determine when the output has reached a consistent state. To our knowledge, this is the first paper that addresses the problem of consistency under the assumptions and constraints of a continuous query model. In addition to defining an appropriate consistency notion, we propose techniques for guaranteeing consistency. We implemented the proposed techniques in our existing stream engine, and we report on the characteristics of the observed performance. As we show, these methods are practical as they impose only a small overhead on the system.
SNIF TOOL: sniffing for patterns in continuous streams BIBAFull-Text 369-378
  Abhishek Mukherji; Elke A. Rundensteiner; David C. Brown; Venkatesh Raghavan
Continuous time-series sequence matching, specifically, matching a numeric live stream against a set of predefined pattern sequences, is critical for domains ranging from fire spread tracking to network traffic monitoring. While several algorithms exist for similarity matching of static time-series data, matching continuous data poses new, largely unsolved challenges, including online real-time processing requirements and system resource limitations for handling infinite streams. In this work, we propose a novel live stream matching framework, called the n-Snippet Indices Framework (in short, SNIF), to tackle these challenges. SNIF employs snippets as the basic unit for matching streaming time-series. The insight is to perform the matching at two levels of granularity: bag matching of subsets of snippets of the live stream against prefixes of the patterns, and order checking for maintaining successive candidate snippet bag matches. We design a two-level index structure, called the SNIF index, which supports these two modes of matching. We propose a family of online two-level prefix matching algorithms that trade off result accuracy against response time. The effectiveness of SNIF in detecting patterns has been thoroughly tested through experiments using real datasets from the domains of fire monitoring and sensor motes. In this paper, we also present a study of SNIF's performance, accuracy, and tolerance to noise compared against those of the state-of-the-art Continuous Query with Prediction (CQP) approach.
Real-time new event detection for video streams BIBAFull-Text 379-388
  Gang Luo; Rong Yan; Philip S. Yu
Online detection of video clips that present previously unseen events in a video stream is still an open challenge to date. For this online new event detection (ONED) task, existing studies mainly focus on optimizing the detection accuracy instead of the detection efficiency. As a result, it is difficult for existing systems to detect new events in real time, especially for large-scale video collections such as the video content available on the Web. In this paper, we propose several scalable techniques to improve the video processing speed of a baseline ONED system by orders of magnitude without sacrificing much detection accuracy. First, we use text features alone to filter out most of the non-new-event clips and to skip those expensive but unnecessary steps including image feature extraction and image similarity computation. Second, we use a combination of indexing and compression methods to speed up text processing. We implemented a prototype of our optimized ONED system on top of IBM's System S. The effectiveness of our techniques is evaluated on the standard TRECVID 2005 benchmark, which demonstrates that our techniques can achieve a 480-fold speedup with detection accuracy degraded less than 5%.
Linear time membership in a class of regular expressions with interleaving and counting BIBAFull-Text 389-398
  Giorgio Ghelli; Dario Colazzo; Carlo Sartiani
The extension of Regular Expressions (REs) with an interleaving (shuffle) operator has been proposed on many occasions, since it would be crucial for dealing with unordered data. However, interleaving badly affects the complexity of basic operations and, especially, makes membership NP-hard [13], which is unacceptable for most uses of REs.
   REs form the basis of most XML type languages, such as DTDs and XML Schema types, and XDuce types [16, 11]. In this context, the interleaving operator would be a natural addition to the language of REs, as witnessed by the presence of limited forms of interleaving in XSD (the all group), Relax-NG, and SGML, provided that the NP-hardness of membership could be avoided.
   We present here a restricted class of REs with interleaving and counting which admits a linear membership algorithm, and which is expressive enough to cover the vast majority of real-world XML types.
   We first present an algorithm for membership of a list of words into a RE with interleaving and counting, based on the translation of the RE into a set of constraints.
   We generalize the approach in order to check membership of XML trees into a class of EDTDs with interleaving and counting, which models the crucial aspects of DTDs and XSD schemas.

IR: theory

Generalized inverse document frequency BIBAFull-Text 399-408
  Donald Metzler
Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to provide theoretical justifications for IDF. One of the most appealing derivations follows from the Robertson-Sparck Jones relevance weight. However, this derivation, and others related to it, typically make a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more generalized form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we also undertake a rigorous empirical evaluation that shows generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
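   For reference, the classical IDF and the Robertson-Sparck Jones (RSJ) relevance weight from which it is commonly derived can be sketched as follows, with N the collection size, n_t the number of documents containing term t, R the number of known relevant documents, and r_t the number of relevant documents containing t; when no relevance information is available (R = r_t = 0), the RSJ weight reduces to an IDF-like form:

      \mathrm{idf}(t) = \log \frac{N}{n_t},
      \qquad
      w_t^{\mathrm{RSJ}} = \log \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)\,(R - r_t + 0.5)}
      \;\longrightarrow\;
      \log \frac{N - n_t + 0.5}{n_t + 0.5}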
TinyLex: static n-gram index pruning with perfect recall BIBAFull-Text 409-418
  Derrick Coetzee
Inverted indexes using sequences of characters (n-grams) as terms provide an error-resilient and language-independent way to query for arbitrary substrings and perform approximate matching in a text, but present a number of practical problems: they have a very large number of terms, they exhibit pathologically expensive worst-case query times on certain natural inputs, and they cannot cope with very short query strings. In word-based indexes, static index pruning has been successful in reducing index size while maintaining precision, at the expense of recall. Taking advantage of the unique inclusion structure of n-gram terms of different lengths, we show that the lexicon size of an n-gram index can be reduced by 7 to 15 times without any loss of recall, and without any increase in either index size or query time. Because the lexicon is typically stored in main memory, this substantially reduces the memory required for queries. Simultaneously, our construction is also the first overlapping n-gram index to place tunable worst-case bounds on false positives and to permit efficient queries on strings of any length. Using this construction, we also demonstrate the first feasible n-gram index using words rather than characters as units, and its applications to phrase searching.
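   As a toy illustration of the inclusion structure the pruning exploits (a hypothetical example, not TinyLex's actual construction): every occurrence of an n-gram is witnessed by its prefix (n-1)-gram at the same position and its suffix (n-1)-gram one position later, so a longer gram whose postings are recoverable from shorter grams is a candidate for dropping from the lexicon without losing recall.

      def ngram_postings(text, n):
          # Map each character n-gram to the list of positions where it occurs.
          grams = {}
          for i in range(len(text) - n + 1):
              grams.setdefault(text[i:i + n], []).append(i)
          return grams

      lex3 = ngram_postings("abracadabra", 3)
      lex4 = ngram_postings("abracadabra", 4)
      # Each 4-gram posting is reconstructible by intersecting (with an
      # offset check) the postings of its two constituent 3-grams.
      assert all(i in lex3[g[:3]] and i + 1 in lex3[g[1:]]
                 for g, posts in lex4.items() for i in posts)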
Revisiting the relationship between document length and relevance BIBAFull-Text 419-428
  David E. Losada; Leif Azzopardi; Mark Baillie
The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e., all documents are assessed for relevance). As a consequence, the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance, taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness, but are relatively robust when using bpref. The implications of the study are that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.
Relating dependent indexes using Dempster-Shafer theory BIBAFull-Text 429-438
  Lixin Shi; Jian-Yun Nie; Guihong Cao
Traditional information retrieval (IR) approaches assume that the indexing terms are independent, which is not true in reality. Although some previous studies have tried to consider term relationships, strong simplifications had to be made at the very basic indexing step, namely, dependent terms are assigned independent counts or probabilities.
   In this study, we propose to consider dependencies between terms using the Dempster-Shafer theory of evidence. An occurrence of a string in a document is considered to represent the set of all the terms it implies. Probability mass is assigned to such a set of terms instead of to individual terms. During the query evaluation phase, part of the mass of a set can be transferred to related sets of the query, allowing us to integrate language-dependent relations into IR.
   This approach has been tested on several Chinese IR collections. Our experimental results show that our model can outperform existing state-of-the-art approaches. The proposed method can be used as a general way to consider different types of relationships between terms, and it carries over to other languages.
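   A minimal sketch of the underlying Dempster-Shafer machinery (toy masses, not the paper's model): probability mass is assigned to sets of terms implied by a matched string, and the belief in a set accumulates the mass committed to all its subsets.

      def belief(masses, a):
          # Belief in set a = total mass committed to subsets of a.
          return sum(m for s, m in masses.items() if s <= a)

      masses = {frozenset({"hong", "kong"}): 0.6,          # the string "hong kong"
                frozenset({"hong"}): 0.1,
                frozenset({"kong"}): 0.1,
                frozenset({"hong", "kong", "china"}): 0.2}
      print(belief(masses, frozenset({"hong", "kong"})))   # 0.8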

IR: query analysis

Improved query difficulty prediction for the web BIBAFull-Text 439-448
  Claudia Hauff; Vanessa Murdock; Ricardo Baeza-Yates
Query performance prediction aims to predict whether a query will have high or low average precision when retrieval is performed on a particular collection. An accurate estimator of the quality of search engine results can allow the search engine to decide to which queries to apply query expansion, for which queries to suggest alternative search terms, when to adjust the sponsored results, or when to return results from specialized collections. In this paper we present an evaluation of state-of-the-art query prediction algorithms, both post-retrieval and pre-retrieval, and we analyze their sensitivity to the retrieval algorithm. We evaluate query difficulty predictors over three widely different collections and query sets and present an analysis of why prediction algorithms perform significantly worse on Web data. Finally we introduce Improved Clarity, and demonstrate that it outperforms state-of-the-art predictors on three standard collections, including two large Web collections.
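   For context, the Clarity score of Cronen-Townsend et al., on which this line of predictors builds, is the KL divergence between the query language model and the collection model over the vocabulary V:

      \mathrm{clarity}(Q) \;=\; \sum_{w \in V} P(w \mid \theta_Q)\, \log_2 \frac{P(w \mid \theta_Q)}{P(w \mid C)}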
Understanding the relationship between searchers' queries and information goals BIBAFull-Text 449-458
  Doug Downey; Susan Dumais; Dan Liebling; Eric Horvitz
We describe results from Web search log studies aimed at elucidating user behaviors associated with queries and destination URLs that appear with different frequencies. We note the diversity of information goals that searchers have and the differing ways that goals are specified. We examine rare and common information goals that are specified using rare or common queries. We identify several significant differences in user behavior depending on the rarity of the query and the destination URL. We find that searchers are more likely to be successful when the frequencies of the query and destination URL are similar. We also establish that the behavioral differences observed for queries and goals of varying rarity persist even after accounting for potential confounding variables, including query length, search engine ranking, session duration, and task difficulty. Finally, using an information-theoretic measure of search difficulty, we show that the benefits obtained by search and navigation actions depend on the frequency of the information goal.
Active relevance feedback for difficult queries BIBAFull-Text 459-468
  Zuobing Xu; Ram Akella
Relevance feedback has been demonstrated to be an effective strategy for improving retrieval accuracy. The existing relevance feedback algorithms based on language models and vector space models are not effective in learning from negative feedback documents, which are abundant if the initial query is difficult. The probabilistic retrieval model has the advantage of being able to naturally improve the estimation of both the relevant and non-relevant models. The Dirichlet compound multinomial (DCM) distribution, which relies on hierarchical Bayesian modeling techniques, is a more appropriate generative model for the probabilistic retrieval model than the traditional multinomial distribution. We propose a new relevance feedback algorithm, based on a mixture model of the DCM distribution, to effectively model the overlaps between the positive and negative feedback documents. Consequently, the new algorithm improves the retrieval performance substantially for difficult queries. To further reduce human relevance evaluation, we propose a new active learning algorithm in conjunction with the new relevance feedback model. The new active learning algorithm implicitly models the diversity, density and relevance of unlabeled data in a transductive experimental design framework. Experimental results on several TREC datasets show that both the relevance feedback and active learning algorithm significantly improve retrieval accuracy.
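   For reference, the Dirichlet compound multinomial (Polya) distribution mentioned above takes the standard form below for a count vector x with n = \sum_k x_k and \alpha_0 = \sum_k \alpha_k:

      P(\mathbf{x} \mid \boldsymbol{\alpha})
      \;=\; \frac{n!}{\prod_k x_k!}\;
      \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n)}
      \prod_k \frac{\Gamma(\alpha_k + x_k)}{\Gamma(\alpha_k)}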
Query suggestion using hitting time BIBAFull-Text 469-478
  Qiaozhu Mei; Dengyong Zhou; Kenneth Church
Generating alternative queries, also known as query suggestion, has long proved useful in helping users explore and express their information needs. In many scenarios, such suggestions can be generated from a large-scale graph of queries and other accessory information, such as clickthrough data. However, how to generate suggestions while ensuring their semantic consistency with the original query remains a challenging problem.
   In this work, we propose a novel query suggestion algorithm based on ranking queries with the hitting time on a large-scale bipartite graph. Without resorting to convoluted heuristics or heavy parameter tuning, this method cleanly captures the semantic consistency between the suggested query and the original query. Empirical experiments on a large-scale query log of a commercial search engine and on a scientific literature collection show that hitting time is effective for generating semantically consistent query suggestions. The proposed algorithm and its variations can successfully boost long-tail queries, accommodate personalized query suggestion, and find related authors in research literature.
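   A minimal sketch of truncated hitting time on a small graph (the standard definition; the matrix P and horizon T are illustrative, and the paper's bipartite, query-log-scale machinery is not reproduced here):

      import numpy as np

      def truncated_hitting_time(P, targets, T=10):
          # Expected number of steps, capped at horizon T, before a random
          # walk with row-stochastic transition matrix P reaches `targets`.
          n = P.shape[0]
          free = np.ones(n, dtype=bool)
          free[list(targets)] = False      # hitting time is zero at the targets
          h = np.zeros(n)
          for _ in range(T):
              h_new = np.zeros(n)
              h_new[free] = 1.0 + P[free] @ h   # one step + expected remainder
              h = h_new
          return h

   Candidate queries with a small hitting time to the original query's node would then be ranked first.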

KM: web mining

Mining term association patterns from search logs for effective query reformulation BIBAFull-Text 479-488
  Xuanhui Wang; ChengXiang Zhai
Search engine logs are an emerging new type of data that offers interesting opportunities for data mining. Existing work on mining such data has mostly attempted to discover knowledge at the level of queries (e.g., query clusters). In this paper, we propose to mine search engine logs for patterns at the level of terms, through analyzing the relations of terms inside a query. We define two novel term association patterns (i.e., context-sensitive term substitutions and term additions) and propose new methods for mining such patterns from search engine logs. These two patterns can be used to address the mis-specification and under-specification problems of ineffective queries. Experimental results on real search engine logs show that the mined context-sensitive term substitutions can be used to effectively reword queries and improve their accuracy, while the mined context-sensitive term addition patterns can be used to support query refinement in a more effective way.
Non-local evidence for expert finding BIBAFull-Text 489-498
  Krisztian Balog; Maarten de Rijke
The task addressed in this paper, finding experts in an enterprise setting, has gained in importance and interest over the past few years. Commonly, this task is approached as an association finding exercise between people and topics. Existing techniques use either documents (as a whole) or proximity-based techniques to represent candidate experts. Proximity-based techniques have shown clear precision-enhancing benefits. We complement both document and proximity-based approaches to expert finding by importing global evidence of expertise, i.e., evidence obtained using information that is not available in the immediate proximity of a candidate expert's name occurrence or even on the same page on which the name occurs. Examples include candidate priors, query models, as well as other documents a candidate expert is associated with.
   Using the CERC data set created for the TREC 2007 Enterprise track we identify examples of non-local evidence of expertise. We then propose modified expert retrieval models that are capable of incorporating both local (either document or snippet-based) evidence and non-local evidence of expertise. Results show that our refined models significantly outperform existing state-of-the-art approaches.
Discovering leaders from community actions BIBAFull-Text 499-508
  Amit Goyal; Francesco Bonchi; Laks V. S. Lakshmanan
We introduce a novel frequent pattern mining approach to discover leaders and tribes in social networks. In particular, we consider social networks where users perform actions. Actions may be as simple as tagging resources (urls) as in del.icio.us, rating songs as in Yahoo! Music or movies as in Yahoo! Movies, or buying gadgets such as cameras and handhelds and blogging reviews about them. The assumption is that actions performed by a user can be seen by their network friends. Users seeing their friends' actions are sometimes tempted to perform those actions. We are interested in the problem of studying the propagation of such "influence", and on this basis, identifying which users are leaders when it comes to setting the trend for performing various actions. We consider alternative definitions of leaders based on frequent patterns and develop algorithms for their efficient discovery. Our definitions are based on observing the way influence propagates in a time window, as the window is moved in time. Given a social graph and a table of user actions, our algorithms can discover leaders of various flavors by making one pass over the actions table. We run detailed experiments to evaluate the utility and scalability of our algorithms on real-life data. The results of our experiments confirm, on the one hand, the efficiency of the proposed algorithms and, on the other, the effectiveness and relevance of the overall framework. To the best of our knowledge, this is the first frequent-pattern-based approach to social network mining.
Learning to link with wikipedia BIBAFull-Text 509-518
  David Milne; Ian H. Witten
This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents.
   This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words -- indexing, clustering, retrieval, and summarization to name a few -- could use the techniques described here to draw on a vast network of concepts and semantics.
Markov logic: a unifying language for knowledge and information management BIBAFull-Text 519
  Pedro Domingos
Modern information and knowledge management is characterized by high degrees of complexity and uncertainty. Complexity is well handled by first-order logic, and uncertainty by probabilistic graphical models. What has been sorely missing is a seamless combination of the two. Markov logic provides this by attaching weights to logical formulas and treating them as templates for features of Markov random fields. This talk surveys Markov logic representation, inference, learning and applications. Inference algorithms combine ideas from satisfiability testing, resolution, Markov chain Monte Carlo and belief propagation. Learning algorithms involve statistical weight learning and inductive logic programming. Markov logic has been successfully applied to a wide range of information and knowledge management problems, including information extraction, entity resolution, ontology learning, link prediction, heterogeneous knowledge bases, and others. It is the basis of the open-source Alchemy system (http://alchemy.cs.washington.edu).
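   The core definition is compact enough to state here: a Markov logic network with formulas F_i and weights w_i defines a distribution over possible worlds x in which n_i(x) counts the true groundings of F_i and Z is the partition function:

      P(X = x) \;=\; \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big)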

DB/industry: XML data integration and XML query optimization

Rewriting of visibly pushdown languages for xml data integration BIBAFull-Text 521-530
  Alex Thomo; S. Venkatesh
In this paper, we focus on XML data integration by studying rewritings of XML target schemas in terms of source schemas. Rewriting is very important in data integration systems where the system is asked to find and assemble XML documents from the data sources and produce documents which satisfy a target schema.
   As schema representation, we consider Visibly Pushdown Automata (VPAs) which accept Visibly Pushdown Languages (VPLs). The latter have been shown to coincide with the family of (word-encoded) regular tree languages which are the basis of formalisms for specifying XML schemas. Furthermore, practical semi-formal XML schema specifications (defined by simple pattern conditions on XML) compile into VPAs which are exponentially more concise than other representations based on tree automata.
   Notably, VPLs enjoy a "well-behavedness" which facilitates us in addressing rewriting problems for XML data integration. Based on VPAs, we positively solve these problems, and present detailed complexity analyses.
Some rewrite optimizations of DB2 XQuery navigation BIBAFull-Text 531-540
  Guangjun Xie; Qi Cheng; Jarek Gryz; Calisto Zuzarte
IBM® DB2® 9 is a truly hybrid commercial database system that combines XML and relational data. It provides native support for XML storage and indexing, and query evaluation support for XQuery. By building a hybrid system, the designers of DB2 9 were able to use the existing SQL query evaluation and optimization techniques to develop similar methods for XQuery. However, SQL and XQuery are sufficiently different that new optimization techniques can and are being developed in the new XQuery domain. This paper describes a few such techniques, all based on static rewrites of XQuery expressions.
Pruning nested XQuery queries BIBAFull-Text 541-550
  Bilel Gueni; Talel Abdessalem; Bogdan Cautis; Emmanuel Waller
We present in this paper an approach for XQuery optimization that exploits minimization opportunities raised in composition-style nesting of queries. More precisely, we consider the simplification of XQuery queries in which the intermediate result constructed by a subexpression is queried by another subexpression. Based on a large subset of XQuery, we describe a rule-based algorithm that recursively prunes query expressions, eliminating useless intermediate results. Our algorithm takes as input an XQuery expression that may have navigation within its subexpressions and outputs a simplified, equivalent XQuery expression, and is thus readily usable as an optimization module in any existing XQuery processor. We demonstrate by experiments the impact of our rewriting approach on query evaluation costs and we prove formally its correctness.
A heuristic approach for checking containment of generalized tree-pattern queries BIBAFull-Text 551-560
  Pawel Placek; Dimitri Theodoratos; Stefanos Souldatos; Theodore Dalamagas; Timos Sellis
Query processing techniques for XML data have focused mainly on tree-pattern queries (TPQs). However, the need for querying XML data sources whose structure is very complex or not fully known to the user, and the need to integrate multiple XML data sources with different structures have driven, recently, the suggestion of query languages that relax the complete specification of a tree pattern. In order to implement the processing of such languages in current DBMSs, their containment problem has to be efficiently solved.
   In this paper, we consider a query language which generalizes TPQs by allowing the partial specification of a tree pattern. Partial tree-pattern queries (PTPQs) constitute a large fragment of XPath that flexibly permits the specification of a broad range of queries, from keyword queries without structure, to queries with partial specification of the structure, to complete TPQs. We address the containment problem for PTPQs. This problem becomes more complex in the context of PTPQs because the partial specification of the structure allows new, non-trivial structural expressions to be inferred from those explicitly specified in a query. We show that the containment problem cannot be characterized by homomorphisms between PTPQs, even when PTPQs are put in a canonical form that comprises all derived structural expressions. We provide necessary and sufficient conditions for this problem in terms of homomorphisms between PTPQs and (a possibly exponential number of) TPQs. To cope with the high complexity of PTPQ containment, we suggest a heuristic approach for this problem that trades accuracy for speed. An extensive experimental evaluation shows that our heuristic approach can be efficiently implemented in a query optimizer.

IR: evaluation

Retrievability: an evaluation measure for higher order information access tasks BIBAFull-Text 561-570
  Leif Azzopardi; Vishwa Vinay
Evaluation in Information Retrieval (IR) has long focused on effectiveness and efficiency. However, new and emerging access tasks now demand alternative evaluation measures which go beyond this traditional view. A retrieval system provides a means of gaining access to documents, therefore intuitively, our view of the collection is shaped by the retrieval system. In this paper, we outline some emerging information access related scenarios that require knowledge about how the retrieval system affects the users' ability to access information. This provides the motivation for the proposed evaluation measures and methodology where the focus is on capturing the behavior of the system, in terms of how retrievable it makes individual documents within the collection. To demonstrate the utility of the proposed methods, we perform an extensive analysis on two TREC collections showing how the measures can be applied to evaluate different information access questions. For higher order information access tasks that are inherently dependent on retrievability, our novel evaluation methodology emphasizes that effectiveness is an insufficient characterization of a retrieval system. This paper provides the foundations for the evaluation of higher order access related tasks.
Statistical power in retrieval experimentation BIBAFull-Text 571-580
  William Webber; Alistair Moffat; Justin Zobel
The power of a statistical test specifies the sample size required to reliably detect a given true effect. In IR evaluation, the power corresponds to the number of topics that are likely to be sufficient to detect a certain degree of superiority of one system over another. To predict the power of a test, one must estimate the variability of the population being sampled from; here, of between-system score deltas. This paper demonstrates that basing such an estimation either on previous experience or on trial experiments leaves wide margins of error. Iteratively adding more topics to the test set until power is achieved is more efficient; however, we show that it leads to a bias in favour of finding both power and significance. A hybrid methodology is proposed, and the reporting requirements of the experimenter using this methodology are laid out. We also demonstrate that greater statistical power is achieved for the same relevance assessment effort by evaluating a large number of topics shallowly than a small number deeply.
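   As a concrete illustration of the kind of calculation involved (hypothetical numbers; the choice of statsmodels as the tool is ours, not the authors'): the number of topics needed to detect a mean score delta of 0.02 when between-system deltas have standard deviation 0.08, via a one-sample t-test on the per-topic deltas.

      from statsmodels.stats.power import TTestPower

      # Effect size = mean delta / std of deltas = 0.02 / 0.08 = 0.25.
      n_topics = TTestPower().solve_power(effect_size=0.25, alpha=0.05,
                                          power=0.8, alternative='two-sided')
      print(round(n_topics))   # roughly 128 topics under these assumptions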
Comparing metrics across TREC and NTCIR: the robustness to system bias BIBAFull-Text 581-590
  Tetsuya Sakai
Test collections are growing larger, and relevance data constructed through pooling are suspected of becoming more and more incomplete and biased. Several studies have used evaluation metrics specifically designed to handle this problem, but most of them have only examined the metrics under incomplete but unbiased conditions, using random samples of the original relevance data. This paper examines nine metrics in a more realistic setting, by reducing the number of pooled systems. Even though previous work has shown that metrics based on a condensed list, obtained by removing all unjudged documents from the original ranked list, are effective for handling very incomplete but unbiased relevance data, we show that these results do not hold in the presence of system bias. In our experiments using TREC and NTCIR data, we first show that condensed-list metrics overestimate new systems while traditional metrics underestimate them, and that the overestimation tends to be larger than the underestimation. We then show that, when relevance data is heavily biased towards a single team or a few teams, the condensed-list versions of Average Precision (AP), Q-measure (Q) and normalised Discounted Cumulative Gain (nDCG), which we call AP', Q' and nDCG', are not necessarily superior to the original metrics in terms of discriminative power, i.e., the overall ability to detect pairwise statistical significance. Nevertheless, even under system bias, AP' and Q' are generally more discriminative than bpref and the condensed-list version of Rank-Biased Precision (RBP), which we call RBP'.
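   A minimal sketch of the condensed-list transformation behind AP' (the construction described above; helper names are ours): unjudged documents are removed from the ranking before average precision is computed.

      def condensed_ap(ranked, judgments):
          # judgments maps docid -> True (relevant) / False (non-relevant);
          # documents absent from it are unjudged and dropped ("condensed").
          condensed = [d for d in ranked if d in judgments]
          n_rel = sum(judgments.values())
          hits, ap = 0, 0.0
          for rank, d in enumerate(condensed, start=1):
              if judgments[d]:
                  hits += 1
                  ap += hits / rank
          return ap / n_rel if n_rel else 0.0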
How evaluator domain expertise affects search result relevance judgments BIBAFull-Text 591-598
  Kenneth A. Kinney; Scott B. Huffman; Juting Zhai
Traditional search evaluation approaches have often relied on domain experts to evaluate results for each query. Unfortunately, the range of topics present in any representative sample of web queries makes it impractical to have expert evaluators for every topic. In this paper, we investigate the effect of using "generalist" evaluators instead of experts in the domain of queries being evaluated. Empirically, we find that for queries drawn from domains requiring high expertise, (1) generalists tend to give shallow, inaccurate ratings as compared to experts; (2) generalists disagree on the underlying meaning of these queries significantly more often than experts, and often appear to "give up" and fall back on surface features such as keyword matching; and (3) by estimating the percentage of "expertise-requiring" queries in a web query sample, we estimate the impact of using generalists versus the ideal of having domain experts for every such query.

KM: statistical techniques

Clustered subset selection and its applications on it service metrics BIBAFull-Text 599-608
  Christos Boutsidis; Jimeng Sun; Nikos Anerousis
Motivated by the enormous amounts of data collected in a large IT service provider organization, this paper presents a method for quickly and automatically summarizing and extracting meaningful insights from the data. Termed Clustered Subset Selection (CSS), our method enables program-guided data explorations of high-dimensional data matrices. CSS combines clustering and subset selection into a coherent and intuitive method for data analysis. In addition to a general framework, we introduce a family of CSS algorithms with different clustering components such as k-means and Close-to-Rank-One (CRO) clustering, and Subset Selection components such as best rank-one approximation and Rank-Revealing QR (RRQR) decomposition.
   From an empirical perspective, we illustrate that CSS achieves significant improvements over existing Subset Selection methods in terms of approximation errors. Compared to existing Subset Selection techniques, CSS is also able to provide additional insight about clusters and cluster representatives. Finally, we present a case study of program-guided data exploration using CSS on a large IT service delivery data collection.
The query-flow graph: model and applications BIBAFull-Text 609-618
  Paolo Boldi; Francesco Bonchi; Carlos Castillo; Debora Donato; Aristides Gionis; Sebastiano Vigna
Query logs record the queries and the actions of the users of search engines, and as such they contain valuable information about the interests, the preferences, and the behavior of the users, as well as their implicit feedback to search engine results. Mining the wealth of information available in the query logs has many important applications including query-log analysis, user profiling and personalization, advertising, query recommendation, and more.
   In this paper we introduce the query-flow graph, a graph representation of the interesting knowledge about latent querying behavior. Intuitively, in the query-flow graph a directed edge from query qi to query qj means that the two queries are likely to be part of the same "search mission". Any path over the query-flow graph may be seen as a searching behavior, whose likelihood is given by the strength of the edges along the path.
   The query-flow graph is an outcome of query-log mining and, at the same time, a useful tool for it. We propose a methodology that builds such a graph by mining time and textual information as well as aggregating queries from different users. Using this approach we build a real-world query-flow graph from a large-scale query log and we demonstrate its utility in concrete applications, namely, finding logical sessions, and query recommendation. We believe, however, that the usefulness of the query-flow graph goes beyond these two applications.
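   A toy sketch of the graph's raw material (illustrative only; the paper learns chaining probabilities from time and textual features rather than using raw counts): count consecutive query pairs within sessions and normalize per source query.

      from collections import Counter, defaultdict

      def query_flow_graph(sessions):
          pair_counts, out_counts = Counter(), Counter()
          for queries in sessions:
              for qi, qj in zip(queries, queries[1:]):
                  pair_counts[(qi, qj)] += 1
                  out_counts[qi] += 1
          graph = defaultdict(dict)        # qi -> {qj: empirical P(qj | qi)}
          for (qi, qj), c in pair_counts.items():
              graph[qi][qj] = c / out_counts[qi]
          return graph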
A framework for estimating complex probability density structures in data streams BIBAFull-Text 619-628
  Arnold P. Boedihardjo; Chang-Tien Lu; Feng Chen
Probability density function estimation is a fundamental component in several stream mining tasks such as outlier detection and classification. The nonparametric adaptive kernel density estimate (AKDE) provides a robust and asymptotically consistent estimate for an arbitrary distribution. However, its extensive computational requirements make it difficult to apply this technique to the stream environment. This paper tackles the issue of developing efficient and asymptotically consistent AKDE over data streams while heeding the stringent constraints imposed by the stream environment. We propose the concept of local regions to effectively synopsize local density features, design a suite of algorithms to maintain the AKDE under a time-based sliding window, and analyze the estimates' asymptotic consistency and computational costs. In addition, extensive experiments were conducted with real-world and synthetic data sets to demonstrate the effectiveness and efficiency of our approach.
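   For reference, the adaptive kernel density estimate in question takes the standard (Silverman-style) form, in which a pilot estimate \hat f_p sets local bandwidth factors \lambda_i so that the kernel widens in sparse regions, with g the geometric mean of the pilot density values and \alpha a sensitivity parameter (commonly 1/2):

      \hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h\lambda_i}\, K\!\left(\frac{x - x_i}{h\lambda_i}\right),
      \qquad
      \lambda_i = \left( \frac{\hat f_p(x_i)}{g} \right)^{-\alpha}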
Proactive learning: cost-sensitive active learning with multiple imperfect oracles BIBAFull-Text 619-628
  Pinar Donmez; Jaime G. Carbonell
Proactive learning is a generalization of active learning designed to relax unrealistic assumptions and thereby reach practical applications. Active learning seeks to select the most informative unlabeled instances and ask an omniscient oracle for their labels, so as to retrain the learning algorithm maximizing accuracy. However, the oracle is assumed to be infallible (never wrong), indefatigable (always answers), individual (only one oracle), and insensitive to costs (always free or always charges the same). Proactive learning relaxes all four of these assumptions, relying on a decision-theoretic approach to jointly select the optimal oracle and instance, by casting the problem as a utility optimization problem subject to a budget constraint. Results on multi-oracle optimization over several data sets demonstrate the superiority of our approach over the single-imperfect-oracle baselines in most cases.

Panel discussion

E-discovery BIBAFull-Text 1527
  David A. Evans; Jason R. Baron; Chris Buckley; Robert S. Bauer
It is common practice in the U.S. for courts to require that the parties to a legal case make available to one another all material relevant to the case, including electronically held data and documents. For large corporations, such relevant information may encompass terabytes of e-mail and other files spanning many years. The challenge of E-Discovery in response to a court order is (in a relatively short amount of time) to identify, assemble, individuate, access, categorize, and analyze an organization's electronically held material, segregate all "privileged" material (which can be legally withheld), and present to the court all (and only) the required documents. The techniques needed to accomplish such a task necessarily include search, clustering, classification, filtering, social network analysis, extraction, and more -- and no one of these is sufficient. The panel will describe this problem in detail, providing additional background and context on E-Discovery generally, and will explore specific techniques and cases that amply demonstrate why E-Discovery is quintessentially a "CIKM" problem -- multi-disciplinary, multi-technology, and "multi-difficult".

DB: indexing and physical query optimization

Exploiting pipeline interruptions for efficient memory allocation BIBAFull-Text 639-648
  Josep Aguilar-Saborit; Mohammad Jalali; Dave Sharpe; Victor Muntés Mulero
Efficiency of memory-intensive operations is a key factor in obtaining good performance during multi-join query processing. The pipelined execution of these queries forces the operations in the query plan to be processed concurrently. Making a wrong decision regarding the amount of memory allocated for such operations can have a drastic impact on the response time. However, some of the execution algorithms used at run time interrupt the pipelined execution, ensuring that some operations are never executed concurrently. Because of this, it is essential to explore new approaches in order to improve memory exploitation.
   In this paper, we address the problem of memory allocation for multiple concurrent operations in a query execution plan. First, we study the dynamic needs for memory during execution time, and define when two operations coexist. Then, we propose a post-optimization phase, which (i) identifies the operations that are concurrently executed at run time and (ii) estimates the cost of different memory allocation combinations to find a near-optimal solution for any type of query execution plan. We have implemented our techniques in the IBM® DB2® Universal Database™ (DB2 UDB), showing that they achieve significant execution-time improvements compared with previous approaches, especially for multi-join queries accessing large data volumes.
A new method for indexing genomes using on-disk suffix trees BIBAFull-Text 649-658
  Marina Barsky; Ulrike Stege; Alex Thomo; Chris Upton
We propose a new method to build persistent suffix trees for indexing genomic data. Our algorithm DiGeST (Disk-Based Genomic Suffix Tree) improves significantly over previous work by reducing random access to the input string and performing only two passes over disk data. DiGeST is based on the two-phase multi-way merge sort paradigm, using a concise binary representation of the DNA alphabet. Furthermore, our method scales to larger genomic data than managed before.
Supporting sub-document updates and queries in an inverted index BIBAFull-Text 659-668
  Vuk Ercegovac; Vanja Josifovski; Ning Li; Mauricio R. Mediano; Eugene J. Shekita
Inverted indexes have become the standard indexing method for supporting search queries in a variety of content-based applications. Examples of such applications include enterprise document management, e-mail, web search, and social networks. One shortcoming in current inverted index designs is that they support only document-level updates, forcing a full document to be reindexed even if just part of it changes. This paper describes a new inverted index design that enables applications to break a document into semantically meaningful sub-documents or "sections". Each section of a document can be updated separately, but search queries can still work seamlessly across sections. Our index design is motivated by applications where there is metadata associated with each document that tends to be smaller and more frequently updated than the document's content, but at the same time, it is desirable to search the metadata and content with the same index structure. A novel self-optimizing query execution algorithm is described to efficiently join the sections of a document in the inverted index. Experimental results on TREC and patent data are provided, showing that sections can dramatically improve overall system throughput on a mixed workload of updates and queries.
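   A toy sketch of the section-granular posting idea (our simplification; the paper's self-optimizing join is not shown): postings carry (document, section) pairs, so one section can be reindexed alone while queries still combine evidence across a document's sections.

      from collections import defaultdict

      class SectionedIndex:
          def __init__(self):
              self.postings = defaultdict(set)       # term -> {(doc, section)}

          def update_section(self, doc, section, terms):
              for bucket in self.postings.values():  # drop the stale postings
                  bucket.discard((doc, section))
              for t in terms:
                  self.postings[t].add((doc, section))

          def match_all(self, terms):
              # A document matches if every term occurs in some section of it.
              per_term = [{d for d, _ in self.postings.get(t, set())}
                          for t in terms]
              return set.intersection(*per_term) if per_term else set()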
Modeling LSH for performance tuning BIBAFull-Text 669-678
  Wei Dong; Zhe Wang; William Josephson; Moses Charikar; Kai Li
Although Locality-Sensitive Hashing (LSH) is a promising approach to similarity search in high-dimensional spaces, it has not been considered practical, partly because its search quality is sensitive to several parameters that are quite data dependent. Previous research on LSH, though it obtained interesting asymptotic results, provides little guidance on how these parameters should be chosen, and tuning parameters for a given dataset remains a tedious process.
   To address this problem, we present a statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH. Our model can accurately predict the average search quality and latency given a small sample dataset. Apart from automatic parameter tuning with the performance model, we also use the model to devise an adaptive LSH search algorithm that determines the probing parameter dynamically for each query. The adaptive probing method addresses the problem that, even when the average performance is tuned to be optimal, the variance of the performance is extremely high. We experimented with three different datasets, including audio, images and 3D shapes, to evaluate our methods. The results show the accuracy of the proposed model: the predicted recall errors are within 5% of the real values in most cases, and the adaptive search method reduces the standard deviation of recall by about 50% over the existing method.
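   For readers unfamiliar with the parameters being tuned, here is a minimal random-hyperplane LSH sketch (a simpler cousin of the multi-probe, p-stable scheme the paper models; all names and values are illustrative). The number of tables and of bits per signature are exactly the kind of data-dependent knobs such a performance model is meant to set.

      import numpy as np

      def make_tables(dim, n_tables=4, n_bits=12, seed=0):
          rng = np.random.default_rng(seed)
          return [rng.normal(size=(n_bits, dim)) for _ in range(n_tables)]

      def signature(x, planes):
          # One bit per hyperplane: which side of the plane x falls on.
          return tuple((planes @ x > 0).astype(int))

      # Index: vectors sharing a signature land in the same bucket; query
      # candidates are the union of the query's buckets across all tables.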

IR: web search 2

Can phrase indexing help to process non-phrase queries? BIBAFull-Text 679-688
  Mingjie Zhu; Shuming Shi; Nenghai Yu; Ji-Rong Wen
Modern web search engines, while indexing billions of web pages, are expected to process queries and return results in a very short time. Many approaches have been proposed for efficiently computing top-k query results, but most of them ignore one key factor in the ranking functions of commercial search engines -- term proximity, which measures the distance between query terms in a document. When term proximity is included in ranking functions, most of the existing top-k algorithms become inefficient. To address this problem, in this paper we propose to build a compact phrase index to speed up the search process when incorporating the term-proximity factor. The compact phrase index helps to estimate the score upper bounds of unknown documents more accurately. The size of the phrase index is controlled by including only a small portion of phrases that are likely to improve search performance. Phrase indexes have been used to process phrase queries in existing work. To the best of our knowledge, however, this is the first time a phrase index has been used to improve the performance of generic queries. Experimental results show that, compared with the state-of-the-art top-k computation approaches, our approach can reduce average query processing time to 1/5 for typical settings.
Matching task profiles and user needs in personalized web search BIBAFull-Text 689-698
  Julia Luxenburger; Shady Elbassuoni; Gerhard Weikum
Personalization has been deemed one of the major challenges in information retrieval, with significant potential for providing a better search experience to individual users. In particular, the need for enhanced user models that better capture elements such as users' goals, tasks, and contexts has been identified. In this paper, we introduce a statistical language model for user tasks representing different granularity levels of a user profile, ranging from very specific search goals to broad topics. We propose a personalization framework that selectively matches the actual user information need with relevant past user tasks, and allows the course of personalization to switch dynamically between re-finding very precise information and biasing results towards general user interests. In the extreme, our model is able to detect when the user's search and browse history is not appropriate for aiding the user in satisfying her current information quest. Instead of blindly applying personalization to all user queries, our approach refrains from undue actions in these cases, accounting for the user's desire to discover new topics and for interests that change over time. The effectiveness of our method is demonstrated by an empirical user study.
Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs BIBAFull-Text 699-708
  Rosie Jones; Kristina Lisa Klinkner
Most analysis of web search relevance and performance takes a single query as the unit of search engine interaction. When studies attempt to group queries together by task or session, a timeout is typically used to identify the boundary. However, users query search engines in order to accomplish tasks at a variety of granularities, issuing multiple queries as they attempt to accomplish tasks. In this work we study real sessions manually labeled into hierarchical tasks, and show that timeouts, whatever their length, are of limited utility in identifying task boundaries, achieving a maximum precision of only 70%. We report on properties of this search task hierarchy, as seen in a random sample of user interactions from a major web search engine's log, annotated by human editors, learning that 17% of tasks are interleaved, and 20% are hierarchically organized. No previous work has analyzed or addressed automatic identification of interleaved and hierarchically organized search tasks. We propose and evaluate a method for the automated segmentation of users' query streams into hierarchical units. Our classifiers can improve on timeout segmentation, as well as other previously published approaches, bringing the accuracy up to 92% for identifying fine-grained task boundaries, and 89-97% for identifying pairs of queries from the same task when tasks are interleaved hierarchically. This is the first work to identify, measure and automatically segment sequences of user queries into their hierarchical structure. The ability to perform this kind of segmentation paves the way for evaluating search engines in terms of user task completion.
Learning latent semantic relations from clickthrough data for query suggestion BIBAFull-Text 709-718
  Hao Ma; Haixuan Yang; Irwin King; Michael R. Lyu
For a given query raised by a specific user, the Query Suggestion technique aims to recommend relevant queries which potentially suit the information needs of that user. Due to the complexity of the Web structure and the ambiguity of users' inputs, most suggestion algorithms suffer from poor recommendation accuracy. In this paper, aiming at providing semantically relevant queries for users, we develop a novel, effective and efficient two-level query suggestion model by mining clickthrough data, in the form of two bipartite graphs (user-query and query-URL) extracted from the clickthrough data. Based on this, we first propose a joint matrix factorization method which utilizes the two bipartite graphs to learn a low-rank query latent feature space, and then build a query similarity graph based on the features. After that, we design an online ranking algorithm to propagate similarities on the query similarity graph, and finally recommend latent semantically relevant queries to users. Experimental analysis on the clickthrough data of a commercial search engine shows the effectiveness and the efficiency of our method.

IR: multilingual & multimedia

Simultaneous multilingual search for translingual information retrieval BIBAFull-Text 719-728
  Kristen Parton; Kathleen R. McKeown; James Allan; Enrique Henestroza
We consider the problem of translingual information retrieval, where monolingual searchers issue queries in a different language than the document language(s) and the results must be returned in the language they know, the query language. We present a framework for translingual IR that integrates document translation and query translation into the retrieval model. The corpus is represented as an aligned, jointly indexed "pseudo-parallel" corpus, where each document contains the text of the document along with its translation into the query language. The queries are formulated as multilingual structured queries, where each query term and its translations into the document language(s) are treated as synonym sets. This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary-based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone. Simultaneous multilingual search also has other important features suited to translingual search, since it can provide an indication of poor document translation when a match with the source document is found. We show how close integration of CLIR and SMT allows us to improve result translation in addition to IR results.
Translation enhancement: a new relevance feedback method for cross-language information retrieval BIBAFull-Text 729-738
  Daqing He; Dan Wu
As an effective technique for improving retrieval effectiveness, relevance feedback (RF) has been widely studied in both monolingual and cross-language information retrieval (CLIR) settings. The studies of RF in CLIR have been focused on query expansion (QE), in which queries are reformulated before and/or after they are translated. However, RF in CLIR actually not only can help select better query terms, but also can enhance query translation by adjusting translation probabilities and even resolve some out-of-vocabulary terms. In this paper, we propose a novel RF method called translation enhancement (TE), which uses the extracted translation relationships from relevant documents to revise the translation probabilities of query terms and to identify extra translation alternatives if available so that the translated queries are more tuned to the current search. We studied TE using pseudo relevance feedback (PRF) and interactive relevance feedback (IRF). Our results show that TE can significantly improve CLIR with both types of RF methods, and that the improvement is comparable to that of QE. More importantly, the effects of TE and QE are complementary. Their integration can produce further improvement, and makes CLIR more robust for a variety of queries.
High-dimensional descriptor indexing for large multimedia databases BIBAFull-Text 739-748
  Eduardo Valle; Matthieu Cord; Sylvie Philipp-Foliguet
In this paper we address the subject of large multimedia database indexing for content-based retrieval.
   We introduce multicurves, a new scheme for indexing high-dimensional descriptors. This technique, based on the simultaneous use of moderate-dimensional space-filling curves, has three main advantages: it handles high-dimensional data (100 dimensions and over), it allows easy maintenance of the indexes (inclusion and deletion of data), and it adapts well to secondary storage, thus providing scalability to huge databases (millions, or even thousands of millions, of descriptors).
   We use multicurves to perform the approximate k nearest neighbors search with a very good compromise between precision and speed. The evaluation of multicurves, carried out on large databases, demonstrates that the strategy compares well to other up-to-date k nearest neighbor search strategies.
   We also test multicurves on the real-world application of image identification for cultural institutions. In this application, which requires the fast search of a large amount of local descriptors, multicurves allows a dramatic speed-up in comparison to the brute-force strategy of sequential search, without any noticeable precision loss.
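   A toy sketch of a single space-filling-curve key (Z-order bit interleaving, the simplest such curve; multicurves' simultaneous use of several moderate-dimensional curves is not reproduced here): points close in space tend to receive nearby keys, so a sorted structure over keys supports approximate neighbor probes.

      def z_order_key(coords, bits=8):
          # Interleave the bits of integer coordinates in [0, 2**bits)
          # into a single scalar key, most significant bits first.
          key = 0
          for b in range(bits - 1, -1, -1):
              for c in coords:
                  key = (key << 1) | ((c >> b) & 1)
          return key

      print(z_order_key((3, 5), bits=3))   # 27: interleaves x=0b011, y=0b101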
On low dimensional random projections and similarity search BIBAFull-Text 749-758
  Yu-En Lu; Pietro Lió; Steven Hand
Random projection (RP) is a common technique for dimensionality reduction under the L2 norm for which many significant space-embedding results have been demonstrated. However, many similarity search applications require very low-dimensional embeddings in order to reduce overhead and boost performance. Inspired by the use of symmetric probability distributions in previous work, we propose a novel RP algorithm, Beta Random Projection (BRP), and give probabilistic analyses of it based on Beta and Gaussian approximations. We evaluate the algorithm in terms of standard similarity metrics against other RP algorithms as well as the singular value decomposition (SVD). Our experimental results show that BRP preserves both similarity metrics well and, across various dataset types including random point sets, text (TREC5) and images, provides sharper and more consistent performance.

KM: data mining

Fast mining of complex time-stamped events BIBAFull-Text 759-768
  Hanghang Tong; Yasushi Sakurai; Tina Eliassi-Rad; Christos Faloutsos
Given a collection of complex, time-stamped events, how do we find patterns and anomalies? Events could be meetings with one or more persons and one or more agenda items at zero or more locations (e.g., teleconferences), or they could be publications with authors, keywords, publishers, etc. In such settings, we want to find time stamps that look similar to each other and group them; we also want to find anomalies. In addition, we want our approach to provide interpretations of the clusters and anomalies by annotating them. Furthermore, we want our approach to automatically find the right time-granularity in which to do analysis. Lastly, we want fast, scalable algorithms for all these problems.
   We address the above challenges through two main ideas. The first (T3) is to turn the problem into a graph analysis problem, by carefully treating each time stamp as a node in a graph. This viewpoint brings to bear the vast machinery of graph analysis methods (PageRank, graph partitioning, proximity analysis, and CenterPiece Subgraphs, to name a few). Thus, T3 can automatically group the time stamps into meaningful clusters and spot anomalies. Moreover, it can select representative events/persons/locations for each cluster and each anomaly, as their interpretations. The second idea (MT3) is to use temporal multi-resolution analysis (e.g., minutes, hours, days). We show that MT3 can quickly derive results from finer-to-coarser resolutions, achieving up to 2 orders of magnitude speedups. We verify the effectiveness as well as efficiency of T3 and MT3 on several real datasets.
Predicting individual disease risk based on medical history BIBAFull-Text 769-778
  Darcy A. Davis; Nitesh V. Chawla; Nicholas Blumm; Nicholas Christakis; Albert-László Barabasi
The monumental cost of health care, especially for chronic disease treatment, is quickly becoming unmanageable. This crisis has motivated the drive towards preventative medicine, where the primary concern is recognizing disease risk and taking action at the earliest signs. However, universal testing is neither time nor cost efficient. We propose CARE, a Collaborative Assessment and Recommendation Engine, which relies only on a patient's medical history, encoded as ICD-9-CM codes, in order to predict future disease risks. CARE uses collaborative filtering to predict each patient's greatest disease risks based on their own medical history and that of similar patients. We also describe an Iterative version, ICARE, which incorporates ensemble concepts for improved performance. These novel systems require no specialized information and provide predictions for medical conditions of all kinds in a single run. We present experimental results on a Medicare dataset, demonstrating that CARE and ICARE perform well at capturing future disease risks.
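   A toy collaborative-filtering sketch in the spirit of CARE (vastly simplified; the actual system's similarity measure and ICARE's ensemble are not shown, and all names are illustrative): patients are sets of ICD-9-CM codes, neighbors are the patients with the largest code overlap, and codes the target lacks are scored by how often they appear among those neighbors.

      def predict_risks(target, patients, top_k=20):
          # target: set of ICD-9-CM codes; patients: list of such sets.
          neighbors = sorted(patients, key=lambda p: len(target & p),
                             reverse=True)[:top_k]
          scores = {}
          for p in neighbors:
              for code in p - target:                # codes the target lacks
                  scores[code] = scores.get(code, 0) + 1
          return sorted(scores.items(), key=lambda kv: -kv[1])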
Identification of gene function using prediction by partial matching (PPM) language models BIBAFull-Text 779-786
  Malika Mahoui; William John Teahan; Arvind Kumar Thirumalaiswamy Sekhar; Satyasaibabu Chilukuri
In this paper, we describe the use of text encoding and prediction by partial matching (PPM) language modeling to identify gene functions within abstracts of biomedical papers. The National Center for Biotechnology Information maintains "GeneRIF" -- a collection of the best possible functional representations for a subset of abstracts from PubMed. We use GeneRIF to test the efficiency of our technique. We discuss the methodology adopted to construct the models that enable the Text Mining Toolkit to distinguish between gene functions and the rest of the abstract (non-gene functions). We also describe the similarity-based approach we deploy on the list of automatically annotated functions to generate the gene function most likely to be representative of the paper. The results indicate that our combined approach to identifying gene functions in scientific abstracts performs very well on both precision and recall, and therefore presents exciting opportunities for use in extracting other entities embedded in scientific text.
Fast correlation analysis on time series datasets BIBAFull-Text 787-796
  Philon Nguyen; Nematollaah Shiri
There has been increasing interest for efficient techniques for fast correlation analysis of time series data in different application domains. We present three algorithms for (1) bivariate correlation queries, (2) multivariate correlation queries, and (3) correlation queries based on a new correlation measure we introduce using dynamic time warping. To support these algorithms, we use a variant of the Compact Multi-Resolution Index (CMRI). In addition to conventional nearest neighbor and range queries supported by CMRI, the proposed algorithms compute all answers to user-defined, ad hoc and parametric correlation queries. The results of our experiments indicate a speed-up of two orders of magnitude over the brute force algorithm, and an order of magnitude improvement on average, while offering more functionalities than provided by existing techniques such as StatStream and the Spatial Cone Tree.

KM: semantic techniques

Wildcards for lightweight information integration in virtual desktops BIBAFull-Text 797-806
  Rodolfo Stecher; Claudia Niederée; Wolfgang Nejdl
We present a flexible information integration approach which addresses the dynamic integration needs in a personal desktop environment where only partial mappings are defined between the sources to be integrated. Our approach is based on query rewriting using substitution rules. In addition to exploiting defined mappings, we employ substitution strategies, which are inspired by the idea of using wildcards in querying and filtering tasks. Starting from a triple based query language as used for querying RDF data, unmapped ontological elements are substituted in a controlled way with variables, leading to a controlled form of query relaxation. In addition, the approach also provides evidences for refining the existing mapping based on the results of executing the relaxed queries. Different strategies for replacing non-matched ontology elements with variables are presented and evaluated over real-world data sets.
Finding informative commonalities in concept collections BIBAFull-Text 807-817
  Simona Colucci; Eugenio Di Sciascio; Francesco M. Donini; Eufemia Tinelli
The problem of finding commonalities characterizes several Knowledge Management scenarios involving collections of resources. The automatic extraction of features shared by a collection of resource descriptions formalized in a logic language has in fact been widely investigated in the past. In particular, with reference to Description Logics concept descriptions, Least Common Subsumers have been specifically introduced.
   Nevertheless, such studies focused on identifying features shared by the whole collection. The paper proposes instead novel reasoning services in Description Logics, aimed at identifying commonalities in a significant portion of the collection, rather than in the collection as a whole.
   In particular, common subsumers adding informative content to that provided by the Least Common Subsumer are investigated here.
   The new services are useful in all scenarios where features are not required to be fully shared, like the one motivating our research: Core Competence extraction in knowledge intensive companies.
Association thesaurus construction methods based on link co-occurrence analysis for Wikipedia BIBAFull-Text 817-826
  Masahiro Ito; Kotaro Nakayama; Takahiro Hara; Shojiro Nishio
Wikipedia, a huge-scale Web-based encyclopedia, attracts great attention as an invaluable corpus for knowledge extraction because of its various impressive characteristics, such as a huge number of articles, live updates, a dense link structure, brief anchor texts, and URL identification for concepts. We have already shown that Wikipedia can be used to construct a huge-scale, accurate association thesaurus. The association thesaurus we constructed covers almost 1.3 million concepts, and its accuracy was demonstrated in detailed experiments. However, we still need scalable methods to analyze the huge number of Web pages and hyperlinks among articles in the Web-based encyclopedia.
   In this paper, we propose a scalable method for constructing an association thesaurus from Wikipedia based on link co-occurrences. Link co-occurrence analysis is more scalable than link structure analysis because it is a one-pass process. We also propose an integration method combining tf-idf with link co-occurrence analysis. Experimental results show that both of our proposed methods are more accurate and scalable than conventional methods. Furthermore, the integration with tf-idf achieved higher accuracy than using link co-occurrences alone.
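   A minimal sketch of one-pass link co-occurrence counting, assuming each article is given as its list of outgoing links (the paper's scoring and the tf-idf integration are not shown):

      # Two links appearing in the same article count as one co-occurrence.
      from collections import Counter
      from itertools import combinations

      def cooccurrence(articles):
          counts = Counter()
          for links in articles:            # a single pass over the dump
              for a, b in combinations(sorted(set(links)), 2):
                  counts[(a, b)] += 1
          return counts

      pages = [["Jaguar", "Car", "Engine"], ["Jaguar", "Cat", "Animal"],
               ["Car", "Engine"]]
      print(cooccurrence(pages)[("Car", "Engine")])  # 2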
Peer production of structured knowledge: an empirical study of ratings and incentive mechanisms BIBAFull-Text 827-842
  Christian Hütter; Conny Kühne; Klemens Böhm
Creating and maintaining semantic structures such as ontologies on a large scale is a labor-intensive task that a sole individual cannot perform. Established automated solutions for this task do not yet exist. Peer production is a promising approach to creating structured knowledge: members of an online community create and maintain semantic structures collaboratively. To motivate members to participate and to ensure the quality of the data, rating-based incentive mechanisms are promising: members mutually rate the quality of their contributions and are rewarded for good contributions and truthful ratings. Until now, there has been no systematic evaluation of such rating mechanisms in the context of structured knowledge. We have developed a platform for the collaborative creation of semantic structures. To evaluate the effect of ratings and incentive mechanisms on the quality of peer-produced data, we have conducted an extensive empirical study in an online community. By comparing user ratings with an ex post evaluation by experts, we show that ratings are a reliable measure of the quality of contributions. Further experiments show that incentive mechanisms increase the quality of contributions. We conclude that ratings and incentive mechanisms are promising means to foster and improve the peer production of structured knowledge.

DB: security and privacy

Efficient techniques for document sanitization BIBAFull-Text 843-852
  Venkatesan T. Chakaravarthy; Himanshu Gupta; Prasan Roy; Mukesh K. Mohania
Sanitization of a document involves removing sensitive information from the document, so that it may be distributed to a broader audience. Such sanitization is needed while declassifying documents involving sensitive or confidential information such as corporate emails, intelligence reports, medical records, etc. In this paper, we present the ERASE framework for performing document sanitization in an automated manner. ERASE can be used to sanitize a document dynamically, so that different users get different views of the same document based on what they are authorized to know. We formalize the problem and present algorithms used in ERASE for finding the appropriate terms to remove from the document. Our preliminary experimental study demonstrates the efficiency and efficacy of the proposed algorithms.
Vanity fair: privacy in querylog bundles BIBAFull-Text 853-862
  Rosie Jones; Ravi Kumar; Bo Pang; Andrew Tomkins
A recently proposed approach to address privacy concerns in storing web search querylogs is bundling logs of multiple users together. In this work we investigate privacy leaks that are possible even when querylogs from multiple users are bundled together, without any user or session identifiers. We begin by quantifying users' propensity to issue own-name vanity queries and geographically revealing queries. We show that these propensities interact badly with two forms of vulnerabilities in the bundling scheme. First, structural vulnerabilities arise due to properties of the heavy tail of the user search frequency distribution, or the distribution of locations that appear within a user's queries. These heavy tails may cause a user to appear visibly different from other users in the same bundle. Second, we demonstrate analytical vulnerabilities based on the ability to separate the queries in a bundle into threads corresponding to individual users. These vulnerabilities raise privacy issues suggesting that bundling must be handled with great care.
Dual encryption for query integrity assurance BIBAFull-Text 863-872
  Haixun Wang; Jian Yin; Chang-shing Perng; Philip S. Yu
In database outsourcing, an enterprise contracts its database management tasks to an outside database service provider in order to eliminate the in-house hardware, software, and expertise needed to run a DBMS. This is especially attractive for parties with limited ability to manage their own data. Typically, client applications want to obtain quality assurance (e.g., data authenticity and query completeness) of the outsourced database service at a low cost. Previous work on database outsourcing has focused on issues such as communication overhead, secure data access, and data privacy. Recent work has introduced the issue of query integrity assurance, but obtaining such assurance usually incurs a high cost. In this paper, we present a new method called dual encryption to provide low-cost query integrity assurance for outsourced database services. Dual encryption enables "cross examination" of the outsourced data, which consists of the original data stored under a certain encryption scheme, plus a small percentage of the original data stored again under a different encryption scheme. We generate queries against the additional piece of data and analyze their results to obtain integrity assurance. Our scheme is provably secure, that is, it cannot be broken unless some underlying security primitives can be broken. Experiments on commercial workloads show the effectiveness of our approach.
Records retention in relational database systems BIBAFull-Text 873-882
  Ahmed A. Ataullah; Ashraf Aboulnaga; Frank Wm. Tompa
The recent introduction of several pieces of legislation mandating minimum and maximum retention periods for corporate records has prompted the Enterprise Content Management (ECM) community to develop various records retention solutions. Records retention is a significant subfield of records management, and legal retention requirements apply to corporate records regardless of their shape or form. Unfortunately, the scope of existing solutions has been largely limited to the proper identification, classification, and retention of documents, and not of data more generally.
   In this paper we address the problem of managed records retention in the context of relational database systems. The problem is significantly more challenging than it is for documents for several reasons. Foremost, there is no clear definition of what constitutes a business record in relational databases; it could be an entire table, a tuple, part of a tuple, or parts of several tuples from multiple tables. There are also no standardized mechanisms for purging, anonymizing and protecting relational records. Functional dependencies, user defined constraints, and side effects caused by triggers make it even harder to guarantee that any given record will actually be protected when it needs to be protected or expunged when the necessary conditions are met. Most importantly, relational tuples may be organized such that one piece of data may be part of various legal records and subject to several (possibly conflicting) retention policies.
   We address the above problems and present a complete solution for designing, managing, and enforcing records retention policies in relational database systems. We experimentally demonstrate that the proposed framework can guarantee compliance with a broad range of retention policies on an off-the-shelf system without incurring a significant performance overhead for policy monitoring and enforcement.

IR: medley

Joke retrieval: recognizing the same joke told differently BIBAFull-Text 883-892
  Lisa Friedland; James Allan
In a corpus of jokes, a human might judge two documents to be the "same joke" even if characters, locations, and other details are varied. A given joke could be retold with an entirely different vocabulary while still maintaining its identity. Since most retrieval systems consider documents to be related only when their word content is similar, we propose joke retrieval as a domain where standard language models may fail. Other meaning-centric domains include logic puzzles, proverbs and recipes; in such domains, new techniques may be required to enable us to search effectively. For jokes, a necessary component of any retrieval system will be the ability to identify the "same joke," so we examine this task in both ranking and classification settings. We exploit the structure of jokes to develop two domain-specific alternatives to the "bag of words" document model. In one, only the punch lines, or final sentences, are compared; in the second, certain categories of words (e.g., professions and countries) are tagged and treated as interchangeable. Each technique works well for certain jokes. By combining the methods using machine learning, we create a hybrid that achieves higher performance than any individual approach.
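   A toy sketch of the two domain-specific document models described above -- punch-line-only comparison with interchangeable word classes -- where the word lists are hypothetical stand-ins for the paper's categories:

      # Compare only final sentences, with profession/country words
      # normalized to class tags so retellings still match.
      CLASSES = {"lawyer": "<PROFESSION>", "doctor": "<PROFESSION>",
                 "france": "<COUNTRY>", "japan": "<COUNTRY>"}

      def normalize(sentence):
          words = (w.strip(".:,!?") for w in sentence.lower().split())
          return {CLASSES.get(w, w) for w in words}

      def punchline_overlap(joke_a, joke_b):
          # Jaccard overlap of the class-tagged punch lines
          a, b = normalize(joke_a[-1]), normalize(joke_b[-1])
          return len(a & b) / len(a | b)

      j1 = ["A lawyer walks into a bar.", "Says the lawyer: in France anything goes."]
      j2 = ["A doctor walks into a bar.", "Says the doctor: in Japan anything goes."]
      print(punchline_overlap(j1, j2))  # 1.0 after class tagging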
Ranked feature fusion models for ad hoc retrieval BIBAFull-Text 893-900
  Jeremy Pickens; Gene Golovchinsky
We introduce the Ranked Feature Fusion framework for information retrieval system design. Typical information retrieval formalisms such as the vector space model, the best-match model, and the language model first combine features (such as term frequency and document length) into a unified representation, and then use the representation to rank documents. We take the opposite approach: documents are first ranked by the relevance of a single feature value and are assigned scores based on their relative ordering within the collection. A separate ranked list is created for every feature value, and these lists are then fused to produce a final document scoring. This new "rank then combine" approach is extensively evaluated and is shown to be as effective as traditional "combine then rank" approaches. The model is easy to understand and contains fewer parameters than other approaches. Finally, the model is easy to extend (integration of new features is trivial) and to modify; such extensions include, but are not limited to, relevance feedback and distribution flattening.
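   A minimal "rank then combine" sketch: each feature ranks all documents on its own, rank positions become scores, and the per-feature lists are fused by summation (values are made up; this is not the paper's exact scoring function):

      # Score documents by rank position per feature, then fuse the lists.
      def rank_scores(values, reverse=True):
          order = sorted(range(len(values)), key=lambda i: values[i],
                         reverse=reverse)
          return {doc: 1.0 / (rank + 1) for rank, doc in enumerate(order)}

      def fuse(*per_feature):
          docs = per_feature[0].keys()
          return sorted(docs, key=lambda d: -sum(f[d] for f in per_feature))

      tf = [3, 0, 5]                  # term frequency per document
      brevity = [0.9, 1.0, 0.4]       # shorter documents score higher
      print(fuse(rank_scores(tf), rank_scores(brevity)))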
AdaSum: an adaptive model for summarization BIBAFull-Text 901-910
  Jin Zhang; Xueqi Cheng; Gaowei Wu; Hongbo Xu
Topic representation mismatch is a key problem in topic-oriented summarization, because the specified topic is usually too short to understand or interpret. This paper proposes a novel adaptive model for summarization, AdaSum, under the assumption that the summary and the topic representation can be mutually boosted. AdaSum aims to simultaneously optimize the topic representation and extract effective summaries. The model employs a mutual boosting process to minimize the topic representation mismatch for base summarizers. Furthermore, a linear combination of base summarizers is proposed to further reduce the topic representation mismatch, exploiting the diversity of base summarizers within a general learning framework. We prove that the training process of AdaSum can enhance the performance measure used. Experimental results on the DUC 2007 dataset show that AdaSum significantly outperforms baseline summarization methods (e.g. MRP, LexRank, and GSPS).
Modeling hidden topics on document manifold BIBAFull-Text 911-920
  Deng Cai; Qiaozhu Mei; Jiawei Han; Chengxiang Zhai
Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document over the hidden topics independently, and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both of these methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. Therefore, it is more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on an intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs topic modeling on this document manifold. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides a better representation in the sense of semantic structure.

IR: recommender systems

Tapping on the potential of q&a community by recommending answer providers BIBAFull-Text 921-930
  Jinwen Guo; Shengliang Xu; Shenghua Bao; Yong Yu
The rapidly increasing popularity of community-based Question Answering (cQA) services, e.g., Yahoo! Answers and Baidu Zhidao, has attracted great attention from both academia and industry. Beyond basic problems like question searching and answer finding, the low participation rate of users is a crucial problem that limits the development potential of cQA services. In this paper, we focus on addressing this problem by recommending answer providers: a question is given as a query and a ranked list of users is returned according to the likelihood of answering the question. Building on the intuitive idea for recommendation, we introduce a topic-level model to improve on heuristic term-level methods, which are treated as the baselines. The proposed approach consists of two steps: (1) discovering latent topics in the content of questions and answers, as well as latent interests of users, to build user profiles; (2) recommending answerers for newly arrived questions based on the latent topics and the term-level model. Specifically, we develop a general generative model for questions and answers in cQA, which is then altered to obtain a novel, computationally tractable Bayesian network model. Experiments are carried out on real-world data crawled from Yahoo! Answers between Jun 12, 2007 and Aug 04, 2007, consisting of 118,510 questions, 772,962 answers, and 150,324 users. The experimental results reveal significant improvements over the baseline methods and validate the positive influence of topic-level information.
SoRec: social recommendation using probabilistic matrix factorization BIBAFull-Text 931-940
  Hao Ma; Haixuan Yang; Michael R. Lyu; Irwin King
Data sparsity, scalability, and prediction quality have been recognized as the three most crucial challenges that every collaborative filtering algorithm or recommender system confronts. Many existing approaches to recommender systems can neither handle very large datasets nor easily deal with users who have made very few ratings or even none at all. Moreover, traditional recommender systems assume that all users are independent and identically distributed; this assumption ignores the social interactions or connections among users. In view of the exponential growth of information generated by online social networks, social network analysis is becoming important for many Web applications. Following the intuition that a person's social network will affect personal behavior on the Web, this paper proposes a factor analysis approach based on probabilistic matrix factorization that addresses the data sparsity and poor prediction accuracy problems by employing both users' social network information and rating records. The complexity analysis indicates that our approach can be applied to very large datasets since it scales linearly with the number of observations, while the experimental results show that our method performs much better than the state-of-the-art approaches, especially when users have made few or no ratings.
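   A gradient-descent sketch of the shared-factor idea: user factors are fit simultaneously to the rating matrix and to a social-tie matrix. The dimensions, rates, and plain squared-error objective are illustrative assumptions, not the paper's exact formulation:

      import numpy as np

      def sorec_step(R, S, U, V, Z, lr=0.01, lam=0.1, alpha=0.5):
          # R: user x item ratings, S: user x user ties (0 = unobserved);
          # U, V, Z: user, item, and social latent factors.
          for (i, j), r in np.ndenumerate(R):
              if r:                         # observed rating
                  e = r - U[i] @ V[j]
                  U[i] += lr * (e * V[j] - lam * U[i])
                  V[j] += lr * (e * U[i] - lam * V[j])
          for (i, k), s in np.ndenumerate(S):
              if s:                         # observed social tie
                  e = s - U[i] @ Z[k]
                  U[i] += lr * alpha * (e * Z[k] - lam * U[i])
                  Z[k] += lr * alpha * (e * U[i] - lam * Z[k])

      rng = np.random.default_rng(0)
      R = np.array([[5.0, 0.0], [0.0, 3.0]])
      S = np.array([[0.0, 1.0], [1.0, 0.0]])
      U, V, Z = (rng.normal(size=(2, 4)) for _ in range(3))
      for _ in range(200):
          sorec_step(R, S, U, V, Z)
      print(U @ V.T)                        # reconstructed ratings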
Probabilistic polyadic factorization and its application to personalized recommendation BIBAFull-Text 941-950
  Yun Chi; Shenghuo Zhu; Yihong Gong; Yi Zhang
Multiple-dimensional, i.e., polyadic, data exist in many applications, such as personalized recommendation and multiple-dimensional data summarization. Analyzing all the dimensions of polyadic data in a principled way is a challenging research problem. Most existing methods separately analyze the marginal relationships among pairwise dimensions and then combine the results afterwards. Motivated by the fact that various dimensions of polyadic data jointly affect each other, we propose a probabilistic polyadic factorization approach to directly model all the dimensions simultaneously in a unified framework. We then show the connection between the probabilistic polyadic factorization and a non-negative version of the Tucker tensor factorization. We provide detailed theoretical analysis of the new modeling framework, discuss implementation techniques for our models, and propose several extensions to the basic framework. We then apply the proposed models to the application of personalized recommendation. Extensive experiments on a social bookmarking dataset, Delicious, and a paper citation dataset, CiteSeer, demonstrate the effectiveness of the proposed models.
A random walk on the red carpet: rating movies with user reviews and pagerank BIBAFull-Text 951-960
  Derry Tanti Wijaya; Stéphane Bressan
Although PageRank was designed to estimate the popularity of Web pages, it is a general algorithm that can be applied to the analysis of graphs other than that of hypertext documents. In this paper, we explore its application to sentiment analysis and opinion mining, i.e., the ranking of items based on textual user reviews. We first propose various techniques using collocation and pivot words to extract a weighted graph of terms from user reviews and to account for positive and negative opinions. We refer to this graph as the sentiment graph. Using PageRank and a very small set of adjectives (such as 'good', 'excellent', etc.), we rank the different items. We illustrate and evaluate our approach using reviews of box office movies by users of a popular movie review site. The results show that our approach is very effective and that the ranking it computes is comparable to the ranking obtained from the box office figures. The results also show that our approach is able to compute context-dependent ratings.
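   A minimal personalized-PageRank sketch over a toy weighted term-item graph, with the teleport restricted to the seed adjectives (the construction of the sentiment graph from reviews is the paper's contribution and is not reproduced):

      # Power iteration with teleport mass only on seed nodes.
      def pagerank(adj, seeds, d=0.85, iters=50):
          nodes = list(adj)
          r = {n: 1.0 / len(nodes) for n in nodes}
          for _ in range(iters):
              r = {n: (1 - d) * (1.0 / len(seeds) if n in seeds else 0.0)
                      + d * sum(r[m] * w / sum(adj[m].values())
                                for m in nodes
                                for v, w in adj[m].items() if v == n)
                   for n in nodes}
          return r

      graph = {"good": {"movie_a": 2.0},
               "excellent": {"movie_a": 1.0, "movie_b": 1.0},
               "movie_a": {"good": 1.0}, "movie_b": {"excellent": 1.0}}
      print(pagerank(graph, seeds={"good", "excellent"}))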

KM: feature selection

REDUS: finding reducible subspaces in high dimensional data BIBAFull-Text 961-970
  Xiang Zhang; Feng Pan; Wei Wang
Finding latent patterns in high dimensional data is an important research problem with numerous applications. The most well known approaches for high dimensional data analysis are feature selection and dimensionality reduction. Being widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging applications, however, scientists are interested in the local latent patterns held by feature subspaces, which may be invisible via any global transformation.
   In this paper, we investigate the problem of finding strong linear and nonlinear correlations hidden in feature subspaces of high dimensional data. We formalize this problem as identifying reducible subspaces in the full dimensional space. Intuitively, a reducible subspace is a feature subspace whose intrinsic dimensionality is smaller than the number of features. We present an effective algorithm, REDUS, for finding the reducible subspaces. Two key components of our algorithm are finding the overall reducible subspace, and uncovering the individual reducible subspaces from the overall reducible subspace. A broad experimental evaluation demonstrates the effectiveness of our algorithm.
Mining influential attributes that capture class and group contrast behaviour BIBAFull-Text 971-980
  Elsa Loekito; James Bailey
Contrast data mining is a key tool for finding differences between sets of objects, or classes, and contrast patterns are a popular method for discrimination between two classes. However, such patterns can be limited in two primary ways: i) They do not readily allow second order differentiation -- i.e. discovering contrasts of contrasts, ii) Mining contrast patterns often results in an overwhelming volume of output for the user. To address these limitations, this paper proposes a method which can identify contrast behaviour across both classes and also groups of classes. Furthermore, to increase interpretability for the user, it presents a new technique for finding the attributes which represent the key underlying factors behind the contrast behaviour. The associated mining task is computationally challenging and we describe an efficient algorithm to handle it, based on binary decision diagrams. Experimental results demonstrate that our technique can efficiently identify and explain contrast behaviour which would be difficult or impossible to isolate using standard techniques.
Real-time data pre-processing technique for efficient feature extraction in large scale datasets BIBAFull-Text 981-990
  Ying Liu; Lucian V. Lita; R. Stefan Niculescu; Kun Bai; Prasenjit Mitra; C. Lee Giles
Due to the continuous and rampant increase in the size of domain-specific data sources, there is a real and sustained need for fast processing in time-sensitive applications, such as medical record information extraction at the point of care and genetic feature extraction for personalized treatment, as well as in off-line knowledge discovery, such as creating evidence-based medicine. Since parallel multi-string matching is at the core of most data mining tasks in these applications, faster online matching in static and streaming data is needed to improve the overall efficiency of such knowledge discovery. To address this need, which traditional information extraction and retrieval techniques do not handle efficiently, we propose a Block Suffix Shifting-based approach, which improves on state-of-the-art multi-string matching algorithms such as Aho-Corasick, Commentz-Walter, and Wu-Manber. The strength of our approach is its ability to exploit the block structures of domain-specific data for off-line and online parallel matching. Experiments on several real-world datasets show how our approach translates into significant performance improvements.
Structure feature selection for graph classification BIBAFull-Text 991-1000
  Hongliang Fei; Jun Huan
With the development of highly efficient graph data collection technology in many application fields, classification of graph data has emerged as an important topic in the data mining and machine learning community. Towards building highly accurate classification models for graph data, we present an efficient graph feature selection method. In our method, we use frequent subgraphs as features for graph classification. Unlike existing methods, we consider the spatial distribution of the subgraph features in the graph data and select those that have consistent spatial locations.
   We have applied our feature selection method to several cheminformatics benchmarks. It demonstrates a significant improvement in prediction accuracy compared to state-of-the-art methods.

Panel discussion 2

The social (open) workspace BIBAFull-Text 1529
  David A. Evans; Susan Feldman; Ed H. Chi; Nataša Milic-Frayling; Igor Perisic
Social networking promises individuals new dimensions of freedom to interact, associate, and give expression to their talents. Recently, systems such as Mechanical Turk have started to facilitate self-organizing collaboration on work-related tasks. Such developments raise interesting questions. Is it possible to create (and sustain) businesses that do not have traditional, formal structure -- without traditional "employees"? Can we find and organize (and optimize) talent on the web for task-oriented work -- spontaneously and efficiently? How do people relate to one another in possibly evanescent workgroups? One aspect of the challenge of the Social Workspace is understanding and modeling user behavior and the economic basis for creating, preserving, and exchanging value in the marketplace when workgroup identity, attitudes toward property, and the recruitment and management of talent are not organized under traditional company structures. Another aspect is the technology needed to support virtual organizations and work. The panel will discuss trends in social work and the evolving (scientific) basis of our understanding of new models of workers and organizations.
Unsolved problems in search (and how we approach them) BIBAFull-Text 1001
  W. Bruce Croft
Search applications have become ubiquitous and very successful. Major advances have been made in understanding how to deliver effective results very efficiently for a class of queries. As the range of applications broadens to include web search, desktop search, enterprise search, vertical search, social search, etc., the number of new research challenges appears to grow rather than shrink. Many of these challenges are variations on underlying themes and principles that information retrieval has focused on for more than 40 years. In this talk, several unsolved problems arising from new search applications will be discussed and some potential paths to solutions will be outlined.

IR: advertising & filtering

To swing or not to swing: learning when (not) to advertise BIBAFull-Text 1003-1012
  Andrei Broder; Massimiliano Ciaramita; Marcus Fontoura; Evgeniy Gabrilovich; Vanja Josifovski; Donald Metzler; Vanessa Murdock; Vassilis Plachouras
Web textual advertising can be interpreted as a search problem over the corpus of ads available for display in a particular context. In contrast to conventional information retrieval systems, which always return results if the corpus contains any documents lexically related to the query, in Web advertising it is acceptable, and occasionally even desirable, not to show any results. When no ads are relevant to the user's interests, then showing irrelevant ads should be avoided since they annoy the user and produce no economic benefit. In this paper we pose a decision problem "whether to swing", that is, whether or not to show any of the ads for the incoming request. We propose two methods for addressing this problem, a simple thresholding approach and a machine learning approach, which collectively analyzes the set of candidate ads augmented with external knowledge. Our experimental evaluation, based on over 28,000 editorial judgments, shows that we are able to predict, with high accuracy, when to "swing" for both content match and sponsored search advertising.
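   The simpler of the two proposed directions, thresholding, reduces to a one-line decision rule; the scores and cutoff below are made up for illustration:

      # Show ads only if the best candidate clears a tuned relevance cutoff.
      def should_swing(candidate_scores, threshold=0.42):
          return bool(candidate_scores) and max(candidate_scores) >= threshold

      print(should_swing([0.31, 0.55, 0.12]))  # True  -> display ads
      print(should_swing([0.10, 0.20]))        # False -> show nothing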
Search advertising using web relevance feedback BIBAFull-Text 1013-1022
  Andrei Z. Broder; Peter Ciccolo; Marcus Fontoura; Evgeniy Gabrilovich; Vanja Josifovski; Lance Riedel
The business of Web search, a $10 billion industry, relies heavily on sponsored search, in which a few carefully selected paid advertisements are displayed alongside algorithmic search results. A key technical challenge in sponsored search is to select ads that are relevant to the user's query. Identifying relevant ads is challenging because queries are usually very short, and because users, consciously or not, choose terms intended to lead to optimal Web search results, not to optimal ads. Furthermore, the ads themselves are short and usually formulated to capture the reader's attention rather than to facilitate query matching.
   Traditionally, matching of ads to queries employed standard information retrieval techniques using the bag of words approach. Here we propose to go beyond the bag of words, and augment both queries and ads with additional knowledge-rich features. We use Web search results initially returned for the query to create a pool of relevant documents. Classifying these documents with respect to an external taxonomy and identifying salient named entities give rise to two new feature types. Empirical evaluation based on over 9,000 query-ad pairwise judgments confirms that using augmented queries produces highly relevant ads. Our methodology also relaxes the requirement for each ad to explicitly specify the exhaustive list of queries ("bid phrases") that can trigger it.
A two-stage text mining model for information filtering BIBAFull-Text 1023-1032
  Yuefeng Li; Xujuan Zhou; Peter Bruza; Yue Xu; Raymond Y. K. Lau
Mismatch and overload are the two fundamental issues affecting the effectiveness of information filtering. Both term-based and pattern-based (phrase-based) approaches have been employed to address these issues. However, they all suffer from some limitations with regard to effectiveness. This paper proposes a novel solution that includes two stages: an initial topic filtering stage followed by a stage involving pattern taxonomy mining. The objective of the first stage is to address mismatch by quickly filtering out probably irrelevant documents. The threshold used in the first stage is motivated theoretically. The objective of the second stage is to address overload by applying pattern mining techniques to rationalize the data relevance of the reduced document set produced by the first stage. Substantial experiments on RCV1 show that the proposed solution achieves encouraging performance.
Automatic online news topic ranking using media focus and user attention based on aging theory BIBAFull-Text 1033-1042
  Canhui Wang; Min Zhang; Liyun Ru; Shaoping Ma
News topics, which are constructed from news stories using the techniques of Topic Detection and Tracking (TDT), make it convenient for users to see what is going on through the Internet. However, it is almost impossible to view all the generated topics because of their large number. It would therefore be helpful if all topics were ranked, so that the top ones, which are both timely and important, can be viewed with high priority. Generally, topic ranking is determined by two primary factors: how frequently and recently a topic is reported by the media, and how much attention users pay to it. Both media focus and user attention vary over time, so the effect of time on topic ranking is already included. However, inconsistency exists between the two factors. In this paper, an automatic online news topic ranking algorithm is proposed based on inconsistency analysis between media focus and user attention. News stories are organized into topics, which are ranked in terms of both media focus and user attention. Experiments performed on practical Web datasets show that the topic ranking result reflects the influence of time, the media, and users. The main contributions of this paper are as follows. First, we present a quantitative measure of the inconsistency between media focus and user attention, which provides a basis for topic ranking and experimental evidence that there is a gap between what the media provide and what users view. Second, to the best of our knowledge, this is the first attempt to synthesize the two factors into one algorithm for automatic online topic ranking.
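   A toy combination of media focus (story counts) and user attention (click counts) under exponential time decay; the paper's aging-theory model and inconsistency analysis are more involved than this sketch:

      import math

      def topic_score(events, now, half_life=48.0, w_media=0.5):
          # events: (timestamp_hours, kind) pairs, kind in {"story", "click"}
          decay = lambda t: math.exp(-math.log(2) * (now - t) / half_life)
          media = sum(decay(t) for t, k in events if k == "story")
          users = sum(decay(t) for t, k in events if k == "click")
          return w_media * media + (1 - w_media) * users

      print(topic_score([(0, "story"), (40, "story"), (46, "click")], now=48.0))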

IR: blog

Key blog distillation: ranking aggregates BIBAFull-Text 1043-1052
  Craig Macdonald; Iadh Ounis
Searchers on the blogosphere often need to identify key bloggers with interests similar to their own. A main difference between this blog distillation task and normal ad hoc or Web document retrieval is that each blog can be seen as an aggregate of its constituent posts. On the other hand, we show that the task is similar to the expert search task, where a person's expertise is derived from the aggregate of their publications or emails. In this paper, we investigate several aspects of blog retrieval. First, we examine whether a blog should be represented as a single unit or by considering each of its posts as an indicator of its relevance, showing that expert search techniques can be adapted for blog search. Second, we examine whether indexing only the XML feed provided by each blog (which is often incomplete) is sufficient, or whether the full text of each blog post should be downloaded. Lastly, we use approaches that detect the central or recurring interests of each blog to increase the retrieval effectiveness of the system. Using the TREC 2007 Blog dataset, the results show that our proposed expert search paradigm is indeed useful for identifying key bloggers, achieving high retrieval effectiveness.
Blog site search using resource selection BIBAFull-Text 1053-1062
  Jangwon Seo; W. Bruce Croft
A blog site consists of many individual blog postings. Current blog search services focus on retrieving postings but there is also a need to identify relevant blog sites. Blog site search is similar to resource selection in distributed information retrieval, in that the target is to find relevant collections of documents. We introduce resource selection techniques for blog site search and evaluate their performance. Further, we propose a "diversity factor" that measures the topic diversity of each blog site. Our results show that the appropriate combination of the resource selection techniques and the diversity factor can achieve significant improvements in retrieval performance compared to baselines. We also report results using these techniques on the TREC blog distillation task.
An effective statistical approach to blog post opinion retrieval BIBAFull-Text 1063-1072
  Ben He; Craig Macdonald; Jiyin He; Iadh Ounis
Finding opinionated blog posts is still an open problem in information retrieval, as exemplified by the recent TREC Blog tracks. Most current solutions involve the use of external resources and manual effort in identifying subjective features. In this paper, we propose a novel and effective dictionary-based statistical approach, which automatically derives evidence for subjectivity from the blog collection itself, without requiring any manual effort. Our experiments show that the proposed approach achieves remarkable and statistically significant improvements over robust baselines, including the best TREC baseline run. In addition, at relatively little computational cost, our approach is as effective in retrieving opinionated blog posts as a computationally expensive approach using Natural Language Processing techniques.

KM: clustering

A consensus based approach to constrained clustering of software requirements BIBAFull-Text 1073-1082
  Chuan Duan; Jane Cleland-Huang; Bamshad Mobasher
Managing large-scale software projects involves a number of activities such as viewpoint extraction, feature detection, and requirements management, all of which require a human analyst to perform the arduous task of organizing requirements into meaningful topics and themes. Automating these tasks through the use of data mining techniques such as clustering could potentially increase both the efficiency of performing the tasks and the reliability of the results. Unfortunately, the unique characteristics of this domain, such as high dimensional, sparse, noisy data sets, resulting from short and ambiguous expressions of need, as well as the need for the interactive engagement of stakeholders at various stages of the process, present difficult challenges for standard clustering algorithms. In this paper, we propose a semi-supervised clustering framework, based on a combination of consensus-based and constrained clustering techniques, which can effectively handle these challenges. Specifically, we provide a probabilistic analysis for informative constraint generation based on a co-association matrix, and utilize consensus clustering to combine multiple constrained partitions in order to generate high-quality, robust clusters. Our approach is validated through a series of experiments on six well-studied TREC data sets and on two sets of user requirements.
Data weaving: scaling up the state-of-the-art in data clustering BIBAFull-Text 1083-1092
  Ron Bekkerman; Martin Scholz
The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable, but rarely both. This paper is concerned with information-theoretic clustering (ITC), which has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often outperformed by others of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving -- a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal -- it allows simultaneous clustering of several types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, resulting in the powerful DataLoom algorithm. In our experiments with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate its scalability, we simultaneously clustered the rows and columns of a contingency table with over 120 billion entries.
EDSC: efficient density-based subspace clustering BIBAFull-Text 1093-1102
  Ira Assent; Ralph Krieger; Emmanuel Müller; Thomas Seidl
Subspace clustering mines clusters hidden in subspaces of high-dimensional data sets. Density-based approaches have been shown to successfully mine clusters of arbitrary shape even in the presence of noise in full space clustering. Exhaustive search of all density-based subspace clusters, however, results in infeasible runtimes for large high-dimensional data sets. This is due to the exponential number of possible subspace projections in addition to the high computational cost of density-based clustering.
   In this paper, we propose lossless efficient detection of density-based subspace clusters. In our EDSC (efficient density-based subspace clustering) algorithm we reduce the high computational cost of density-based subspace clustering by a complete multistep filter-and-refine algorithm. Our first hypercube filter step avoids exhaustive search of all regions in all subspaces by enclosing potentially density-based clusters in hypercubes. Our second filter step provides additional pruning based on a density monotonicity property. In the final refinement step, the exact unbiased density-based subspace clustering result is detected. As we prove that pruning is lossless in both filter steps, we guarantee completeness of the result.
   In thorough experiments on synthetic and real world data sets, we demonstrate substantial efficiency gains. Our lossless EDSC approach outperforms existing density-based subspace clustering algorithms by orders of magnitude.
An effective algorithm for mining 3-clusters in vertically partitioned data BIBAFull-Text 1103-1112
  Faris Alqadah; Raj Bhatnagar
Conventional clustering algorithms group similar data points together along one dimension of a data table. Bi-clustering simultaneously clusters both dimensions of a data table. 3-clustering goes one step further and aims to concurrently cluster two data tables that share a common set of row labels, but whose column labels are distinct. Such clusters reveal the underlying connections between the elements of all three sets. We present a novel algorithm that discovers 3-clusters across vertically partitioned data. Our approach presents two new and important formulations: first we introduce the notion of a 3-cluster in partitioned data; and second we present a mathematical formulation that measures the quality of such clusters. Our algorithm discovers high quality, arbitrarily positioned, overlapping clusters, and is efficient in time. These results are exhibited in a comprehensive study on real datasets.

IR: enterprise search

Multi-aspect expertise matching for review assignment BIBAFull-Text 1113-1122
  Maryam Karimzadehgan; ChengXiang Zhai; Geneva Belford
Review assignment is a common task that many people, such as conference organizers, journal editors, and grant administrators, have to perform routinely. As a computational problem, it involves matching a set of candidate reviewers with a paper or proposal to be reviewed. A common deficiency of all existing work on this problem is that it does not consider the multiple aspects of topics or expertise, and instead matches the entire document to be reviewed against the overall expertise of a reviewer. As a result, if a document contains multiple subtopics, which often happens, existing methods do not attempt to assign reviewers to cover all the subtopics; it is quite possible that all the assigned reviewers cover the major subtopic well while covering no other subtopic. In this paper, we study how to model multiple aspects of expertise and assign reviewers so that together they cover all subtopics in the document well. We propose three general strategies for solving this problem and propose new evaluation measures for this task. We also create a multi-aspect review assignment test set using ACM SIGIR publications. Experimental results on this data set show that the proposed methods are effective for assigning reviewers to cover all topical aspects of a document.
Dr. Searcher and Mr. Browser: a unified hyperlink-click graph BIBAFull-Text 1123-1132
  Barbara Poblete; Carlos Castillo; Aristides Gionis
We introduce a unified graph representation of the Web, which includes both structural and usage information. We model this graph using a simple union of the Web's hyperlink and click graphs. The hyperlink graph expresses link structure among Web pages, while the click graph is a bipartite graph of queries and documents denoting users' searching behavior extracted from a search engine's query log.
   Our main motivation is to model in a unified way the two main activities of users on the Web, searching and browsing, and at the same time to analyze the effects of random walks on this new graph. The intuition behind this task is to measure how the combination of link structure and usage data provides additional information beyond that contained in each structure independently.
   Our experimental results show that both hyperlink and click graphs have strengths and weaknesses when it comes to using their stationary distribution scores for ranking Web pages. Furthermore, our evaluation indicates that the unified graph always generates consistent and robust scores that closely follow the best result obtained from either individual graph, even when applied to "noisy" data. It is our belief that the unified Web graph has several useful properties for improving current Web document ranking, as well as for generating new rankings of its own. In particular, stationary distribution scores derived from random walks on the combined graph can be used as an indicator of whether structural or usage data are more reliable in different situations.
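   A minimal sketch of the union construction: page-to-page hyperlink edges merged with bipartite query-to-page click edges into one adjacency structure that a single random walk can traverse (identifiers are hypothetical):

      from collections import defaultdict

      def unify(hyperlinks, clicks):
          # hyperlinks: (page, page) pairs; clicks: (query, page) pairs,
          # added in both directions since the click graph is bipartite.
          g = defaultdict(set)
          for a, b in hyperlinks:
              g[a].add(b)
          for q, p in clicks:
              g["q:" + q].add(p)
              g[p].add("q:" + q)
          return g

      g = unify([("home", "about")], [("company info", "about")])
      print(dict(g))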
Modeling multi-step relevance propagation for expert finding BIBAFull-Text 1133-1142
  Pavel Serdyukov; Henning Rode; Djoerd Hiemstra
An expert finding system allows a user to type a simple text query and retrieve the names and contact information of individuals who possess the expertise expressed in the query. This paper proposes a novel approach to expert finding in large enterprises or intranets by modeling candidate experts (persons), web documents, and the various relations among them with so-called expertise graphs. In contrast to state-of-the-art approaches, which estimate personal expertise through one-step propagation of relevance probability from documents to related candidates, our methods are based on the principle of multi-step relevance propagation in topic-specific expertise graphs. We model the process of expert finding by probabilistic random walks of three kinds: finite, infinite, and absorbing. Experiments on TREC Enterprise Track data originating from two large organizations show that our methods using multi-step relevance propagation improve over the baseline one-step propagation method in almost all cases.
Trada: tree based ranking function adaptation BIBAFull-Text 1143-1152
  Keke Chen; Rongqing Lu; C. K. Wong; Gordon Sun; Larry Heck; Belle Tseng
Machine-learned ranking approaches have shown success in web search engines. With the increasing demand for effective ranking functions in different search domains, a major bottleneck has emerged: insufficient training data, which significantly limits the fast development and deployment of machine-learned ranking functions for new web search domains. In this paper, we propose a new approach, called tree-based ranking function adaptation ("tree adaptation"), to address this problem. Tree adaptation assumes that ranking functions are trained with regression-tree-based modeling methods, such as Gradient Boosting Trees. It takes such a ranking function from one domain and tunes its tree-based structure with a small amount of training data from the target domain. Its unique features are that (1) it automatically identifies the parts of the model that need adjustment for the new domain, and (2) it appropriately weights training examples considering both local and global distributions. Experiments show that tree adaptation provides better-quality ranking functions for a new domain than other modeling methods.

IR: structured documents

Structural relevance: a common basis for the evaluation of structured document retrieval BIBAFull-Text 1153-1162
  M. S. Ali; Mariano P. Consens; Gabriella Kazai; Mounia Lalmas
This paper presents a unified framework for the evaluation of a range of structured document retrieval (SDR) approaches and tasks. The framework is based on a model of tree retrieval, evaluated using a novel extension of the Structural Relevance (SR) measure. The measure replaces the assumption of independence in traditional information retrieval (IR) with a notion of redundancy that takes into account the user navigation inside documents while seeking relevant information. Unlike existing metrics for SDR, our proposed framework does not require the computation of an ideal ranking which has, thus far, prevented the practical application of such measures. Instead, SR builds on a Markovian model of user navigation that can be estimated through the use of structural summaries. The results of this paper (supported by experimental validation using INEX data) show that SR defined over a tree retrieval model can provide a common basis for the evaluation of SDR approaches across various structured search tasks.
A generative retrieval model for structured documents BIBAFull-Text 1163-1172
  Le Zhao; Jamie Callan
Structured documents contain elements defined by the author(s) and annotations assigned by other people or processes. Structured documents pose challenges for probabilistic retrieval models when there are mismatches between the structured query and the actual structure in a relevant document, or when erroneous structure is introduced by an annotator. This paper makes three contributions. First, a new generative retrieval model is proposed to deal with the mismatch problem. This new model extends the basic keyword language model by treating structure as a hidden variable during the generation process. Second, variations of the model are compared. Third, term-level and structure-level smoothing strategies are studied. Evaluation was conducted with INEX XML retrieval and question-answering retrieval tasks. Experimental results indicate that the optimal structured retrieval model is task dependent, that two-level Dirichlet smoothing significantly outperforms two-level Jelinek-Mercer smoothing, and that, with accurate structured queries, the proposed structured retrieval model significantly outperforms keyword retrieval on both the QA and INEX datasets.
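   For reference, the standard Dirichlet-smoothed term estimate, applied twice to mimic the two-level idea (element model smoothed by the document model, which is itself smoothed by the collection); the counts and mu values are illustrative assumptions:

      def dirichlet(tf, length, p_background, mu):
          # P(term | model) smoothed toward a background distribution
          return (tf + mu * p_background) / (length + mu)

      p_doc = dirichlet(tf=5, length=300, p_background=0.001, mu=1000)
      p_elem = dirichlet(tf=2, length=40, p_background=p_doc, mu=200)
      print(p_elem)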
A densitometric approach to web page segmentation BIBAFull-Text 1173-1182
  Christian Kohlschütter; Wolfgang Nejdl
Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication, and full-text search. In this paper we describe a new approach to segmenting HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density appears to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
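   A rough sketch of the text-density signal -- words per wrapped line -- which tends to be low for navigation fragments and high for article prose (the 1D-partitioning step and the statistical model are not reproduced):

      def text_density(text, line_width=80):
          words = text.split()
          chars = sum(len(w) + 1 for w in words)
          lines = max(1, -(-chars // line_width))   # ceiling division
          return len(words) / lines

      nav = "Home | About | Contact"
      body = ("The quick brown fox jumps over the lazy dog and keeps running "
              "through a long paragraph of continuous article text meant to "
              "fill several wrapped lines with dense prose.")
      print(text_density(nav), text_density(body))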
Using structured text for large-scale attribute extraction BIBAFull-Text 1183-1192
  Sujith Ravi; Marius Pasca
We propose a weakly-supervised approach for extracting class attributes from structured text available within Web documents. The overall precision of the extracted attributes is around 30% higher than with previous methods operating on Web documents. In addition to attribute extraction, this approach also automatically identifies values for a subset of the extracted class attributes.

KM: text mining

Identification of class specific discourse patterns BIBAFull-Text 1193-1202
  Anup Chalamalla; Sumit Negi; L. Venkata Subramaniam; Ganesh Ramakrishnan
In this paper we address the problem of extracting important (and unimportant) discourse patterns from call center conversations. Call centers provide dialog-based call-in support for customers to address their queries, requests, and complaints. A call center is the direct interface between an organization and its customers, and it is important to capture the voice-of-customer by gathering insights into the customer experience. We have observed that the calls received at a call center contain segments that follow specific patterns typical of the issue being addressed in the call. We present methods to extract such patterns from the calls. We show that by aggregating over a few hundred calls, specific discourse patterns begin to emerge for each class of calls. Further, we show that such discourse patterns are useful for classifying calls and for identifying parts of the calls that provide insights into customer behaviour.
Scalable community discovery on textual data with relations BIBAFull-Text 1203-1212
  Huajing Li; Zaiqing Nie; Wang-Chien Lee; Lee Giles; Ji-Rong Wen
Every piece of textual data is generated as a means for its authors to convey opinions regarding specific topics. Authors deliberately organize their writings and create links, i.e., references and acknowledgments, for better expression. It is therefore of interest to study texts as well as their relations to understand the underlying topics and communities. Although many efforts exist in the literature on data clustering and topic mining, they are not applicable to community discovery on a large document corpus for several reasons. First, few of them consider both textual attributes and relations. Second, scalability remains a significant issue for large-scale datasets. Additionally, most algorithms rely on a set of initial parameters that are hard to capture and tune. Motivated by these observations, a hierarchical community model is proposed in this paper which distinguishes community cores from affiliated members. We present our efforts to develop a scalable community discovery solution for a large-scale document corpus. Our proposal quickly identifies potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, an innovative attribute-based core merge process is introduced so that the algorithm promises to return consistent communities regardless of the initial parameters. Experimental results suggest that the proposed method scales well with corpus size and feature dimensionality, with more than 15% improvement in topical precision compared with popular clustering techniques.
Information shared by many objects BIBAFull-Text 1213-1220
  Chong Long; Xiaoyan Zhu; Ming Li; Bin Ma
If Kolmogorov complexity [25] measures the information in one object and Information Distance measures the information shared by two objects, how do we measure the information shared by many objects? This paper provides an initial pragmatic study of this fundamental data mining question. First, E_m(x_1, x_2, ..., x_n) is defined to be the minimum amount of thermodynamic energy needed to convert from any x_i to any x_j. With this definition, several theoretical problems are solved. Second, the newly proposed theory is applied to select a comprehensive review and a specialized review from many reviews: (1) core feature words, expanded words, and dependent words are extracted; (2) comprehensive and specialized reviews are selected according to the information among them. This method of selecting a single review can be extended to select multiple reviews as well. Finally, experiments show that this comprehensive and specialized review mining method, based on our new theory, does the job efficiently.
Extremely fast text feature extraction for classification and indexing BIBAFull-Text 1221-1230
  George Forman; Evan Kirshenbaum
Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
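   A minimal sketch of folding lowercasing, word-boundary detection, and hashing into a single pass over the text; the rolling hash and simple boundary test are simplistic stand-ins for the paper's method:

      def hashed_features(text, buckets=2**20):
          # one pass: hash each word while scanning, no intermediate strings
          feats, h, in_word = set(), 0, False
          for ch in text:
              if ch.isalnum():
                  h = (h * 31 + ord(ch.lower())) % buckets
                  in_word = True
              elif in_word:                 # word boundary: emit and reset
                  feats.add(h)
                  h, in_word = 0, False
          if in_word:
              feats.add(h)
          return feats

      print(hashed_features("Fast Text-Feature extraction, FAST!"))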

DB: mobile and distributed data management

Valid scope computation for location-dependent spatial query in mobile broadcast environments BIBAFull-Text 1231-1240
  Ken C. K. Lee; Josh Schiffman; Baihua Zheng; Wang-Chien Lee
Wireless data broadcast is an efficient and scalable means of providing information access to a large population of clients in mobile environments. With Location-Based Services (LBSs) deployed on a broadcast channel, mobile clients can collect data from the channel to answer their location-dependent spatial queries (LDSQs). Since the results of LDSQs may become invalid when a mobile client moves to a new location, knowledge of the valid scopes of LDSQ results is needed to help clients determine whether previous LDSQ results can be reused after they move, which improves query response time and reduces client energy consumption. In this paper, we devise efficient algorithms to determine valid scopes for various LDSQs, including range, window, and nearest neighbor queries, along with LDSQ processing over a broadcast channel. We conduct an extensive set of experiments to evaluate the performance of our proposed algorithms. While the proposed valid scope algorithm incurs only a little extra processing overhead, unnecessary LDSQ reevaluation is largely eliminated, providing faster query response and saving client energy.
Adaptive distributed indexing for structured peer-to-peer networks BIBAFull-Text 1241-1250
  Linh Thai Nguyen; Wai Gen Yee; Ophir Frieder
Structured peer-to-peer networks support keyword search by building a distributed index over the collective content shared by all peers. Building the index and processing queries both involve data transfer among peers, so it is important to keep both activities bandwidth-efficient. However, this goal is difficult to attain, as smaller, less precise indices reduce index building and access costs but increase query processing cost, which potentially increases overall cost. We study the trade-off between indexing cost and query processing cost in a structured peer-to-peer network and propose a cost-reducing, adaptive, distributed indexing technique based on the term distributions in locally shared content and user query logs. Using this information, we reduce costs by tuning the precision of the index. Our approach is to group local documents and to index the groups instead of either individual documents or entire peer collections. We control total cost by controlling the number and contents of groups. We propose a probabilistic model to estimate the cost of grouping, which allows us to identify the optimal number of groups to create. In addition, we propose a cost-based distance function to guide the document grouping process. Experimental results show that our adaptive indexing technique reduces cost by up to 47% compared with peer-level grouping and by up to 73% compared with document-level grouping.
PROQID: partial restarts of queries in distributed databases BIBAFull-Text 1251-1260
  Jon Olav Hauglid; Kjetil Nørvåg
In a number of application areas, distributed database systems can be used to provide persistent storage of data while offering efficient access to both local and remote data. With an increasing number of sites (computers) involved in a query, the probability of failure at query time increases. Recovery has previously focused only on database updates, while query failures have been handled by a complete restart of the query. This technique is not always applicable in the context of large queries and queries with deadlines. In this paper we present an approach for partial restart of queries that incurs minimal extra network traffic during query recovery. Based on results from experiments with an implementation of the partial restart technique in a distributed database system, we demonstrate its applicability and its significant reduction of query cost in the presence of failures.

IR: QA

Answering questions with authority BIBAFull-Text 1261-1270
  Andrew Hickl
This paper presents a novel approach to textual question answering (QA) which identifies answers to natural language questions by leveraging large collections of question-answer pairs extracted from web sources or generated automatically from text.
   Instead of the traditional retrieve-and-rerank approach common to most previous factoid question answering systems, we introduce a new model of answer authority which allows question-answering systems to estimate the quality of answers not just in isolation -- but in the larger context of the information contained in the corpus as a whole.
   Our approach in this paper hinges on the creation of a new representation of the information stored in a document collection, known as a Question-Answer Database (QUAB). We assume that a QUAB represents a weighted directed graph consisting of the set of factoid question-answer pairs (QAPs) that can be asked -- and answered -- given the content of a corpus. Once a set of QAPs has been generated, we use inferential relationships identified by a system for recognizing textual entailment to construct the link structure necessary to compute the authority of each interconnected question or answer.
   We have found that access to the QUAB graph not only improves the accuracy of current answer retrieval techniques, but also allows for determining when no valid answer can be found for a question in a text corpus. Experimental results show that the authority derived from a graph of question-answer pairs can increase the performance of a factoid QA system by nearly 30%.
Cache-aware load balancing for question answering BIBAFull-Text 1271-1280
  David Dominguez-Sal; Mihai Surdeanu; Josep Aguilar-Saborit; Josep Lluis Larriba-Pey
The need for high-performance, high-throughput Question Answering (QA) systems demands their migration to distributed environments. However, even in such cases it is necessary to provide the distributed system with cooperative caches and load balancing facilities in order to achieve the desired goals. Until now, the literature on QA has not considered such a complex system as a whole: typically, the load balancer assigns tasks based only on the CPU and I/O loads, without considering the status of the system cache.
   This paper investigates the load balancing problem, proposing two novel algorithms that take into account the distributed cache status in addition to the CPU and I/O load on each processing node. We have implemented and tested the proposed algorithms in a fully fledged distributed QA system. The two algorithms show that using the status of the cache is decisive in achieving good performance and high throughput for QA systems.
A system for finding biological entities that satisfy certain conditions from texts BIBAFull-Text 1281-1290
  Wei Zhou; Clement Yu; Weiyi Meng
Finding biological entities (such as genes or proteins) that satisfy certain conditions from texts is an important and challenging task in biomedical information retrieval and text mining. It is essential for many biomedical applications, such as drug discovery which normally requires collecting existing scientific facts from documents. This paper presents an effective IR system for this task, in which 1) domain knowledge is incorporated to improve retrieval effectiveness; 2) query expansion with related concepts on multiple semantic levels is employed; 3) a gene symbol disambiguation technique is implemented. We evaluated these techniques and examined two different concept-based IR models. Experiments based upon the proposed framework yield significant improvement (22% for automatic and 16.7% for non-automatic) over the best reported results of passage retrieval in the Genomics track of TREC 2007.

KM: information extraction

Intra-document structural frequency features for semi-supervised domain adaptation BIBAFull-Text 1291-1300
  Andrew Arnold; William W. Cohen
In this work we try to bridge the gap often encountered by researchers who find themselves with few or no labeled examples from their desired target domain, yet still have access to large amounts of labeled data from other related, but distinct source domains, and seemingly no way to transfer knowledge from one to the other. Experimentally, we focus on the problem of extracting protein mentions from academic publications in the field of biology, where the source domain data are abstracts labeled with protein mentions, and the target domain data are wholly unlabeled captions. We mine the large number of such full text articles freely available on the Internet in order to supplement the limited amount of annotated data available. By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we are able to generate robust features that are insensitive to changes in marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Finally, lacking labeled target testing data, we employ comparative user preference studies to evaluate the relative performance of the proposed methods with respect to existing baselines.
Academic conference homepage understanding using constrained hierarchical conditional random fields BIBAFull-Text 1301-1310
  Xin Xin; Juanzi Li; Jie Tang; Qiong Luo
We address the problem of academic conference homepage understanding for the Semantic Web. This problem consists of three labeling tasks -- labeling conference function pages, function blocks, and attributes. Unlike traditional information extraction tasks, the data in academic conference homepages has complex structural dependencies across multiple Web pages. In addition, there are logical constraints in the data. In this paper, we propose a unified approach, Constrained Hierarchical Conditional Random Fields, to accomplish the three labeling tasks simultaneously. In this approach, complex structural dependencies can be well described. Also, the constrained Viterbi algorithm in the inference process can avoid logical errors. Experimental results on real-world conference data demonstrate that this approach outperforms cascaded labeling methods by 3.6% in F1-measure and that the constrained inference process can improve the accuracy by 14.3%. Based on the proposed approach, we develop a prototype system of a user-oriented semantic academic conference calendar. The user simply needs to specify what conferences he/she is interested in; the system then finds, extracts, and updates the semantic information from the Web, and builds a calendar automatically for the user. The semantic conference data can be used in other applications, such as finding sponsors and finding experts. The proposed approach can be applied to other information extraction tasks as well.
Identifying table boundaries in digital documents via sparse line detection BIBAFull-Text 1311-1320
  Ying Liu; Prasenjit Mitra; C. Lee Giles
Most prior work on information extraction has focused on extracting information from the text of digital documents. Often, however, the most important information reported in an article is presented in tabular form. If the data reported in tables can be extracted and stored in a database, the data can be queried and joined with other data using database management systems. In order to prepare the data source for table search, accurately detecting the table boundary plays a crucial role in the later table structure decomposition. Table boundary detection and content extraction is a challenging problem because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection performance by exploiting the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to the table boundary detection task. The experimental results not only compare the performance of the machine learning methods against a heuristics-based method, but also demonstrate the effectiveness of sparse line analysis in table boundary detection.
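   An illustrative sketch of the sparse-line test only (the eight line labels and the CRF/SVM models are not reproduced, and the thresholds below are guesses): a line is treated as sparse when its character density is low or it contains wide whitespace gaps, as table rows typically do.

      def is_sparse(line: str, min_density: float = 0.6, min_gap: int = 3):
          # Table rows tend to be "sparse": little text, wide whitespace gaps.
          text = line.rstrip("\n")
          if not text.strip():
              return False
          density = sum(1 for c in text if not c.isspace()) / len(text)
          return density < min_density or " " * min_gap in text.strip()

      for ln in ["This is an ordinary sentence from the running text.",
                 "Method      Precision   Recall",
                 "CRF         0.93        0.91"]:
          print(is_sparse(ln), repr(ln))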

Poster session 1 database

Privacy-preserving data publishing for horizontally partitioned databases BIBAFull-Text 1321-1322
  Pawel Jurczyk; Li Xiong
There is an increasing need for sharing data repositories containing personal information across multiple distributed, possibly untrusted, and private databases. Such data sharing is subject to constraints imposed by privacy of data subjects as well as data confidentiality of institutions or data providers. We developed a set of decentralized protocols that enable data sharing for horizontally partitioned databases given these constraints. Our approach includes a distributed anonymization protocol that allows independent data providers to build a virtual anonymized database, and a distributed querying protocol that allows clients to query the virtual database.
CE2: towards a large scale hybrid search engine with integrated ranking support BIBAFull-Text 1323-1324
  Haofen Wang; Thanh Tran; Chang Liu
The Web contains a large number of documents and, increasingly, semantic data in the form of RDF triples, many of which are annotations associated with documents. While structured queries are the principal means of retrieving semantic data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both documents and semantic data can address more complex information needs. In this paper, we present CE2, an integrated solution that leverages mature database and information retrieval technologies to tackle the challenges of hybrid search at large scale. For scalable storage, CE2 combines a database with inverted indices. Hybrid query processing is supported in CE2 through novel algorithms and data structures, which allow advanced ranking schemes to be integrated tightly into the process. Experiments conducted on Dbpedia and Wikipedia show that CE2 provides good performance in terms of both effectiveness and efficiency.
Scaling up duplicate detection in graph data BIBAFull-Text 1325-1326
  Melanie Herschel; Felix Naumann
Duplicate detection identifies different representations of the same real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. For the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness; scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks.
   We scale up duplicate detection in graph data (DDG) to large amounts of data using the support of a relational database system. We first generalize the process of DDG and then present how to scale DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data.
ROAD: an efficient framework for location dependent spatial queries on road networks BIBAFull-Text 1327-1328
  Ken C. K. Lee; Wang-Chien Lee; Baihua Zheng
In this research, we develop ROAD, a system framework for processing location dependent spatial queries (LDSQs) that search for spatial objects of interest on road networks. By exploiting search space pruning, ROAD is very efficient and flexible for various LDSQs on different types of objects over large-scale networks. In ROAD, a large road network is organized as a set of interconnected regional sub-networks (called Rnets) augmented with 1) shortcuts for accelerating search traversals; and 2) object abstracts for guiding object search. In this poster, we outline this framework and explain how it can support efficient location-dependent nearest neighbor search.
View and index selection for query-performance improvement: quality-centered algorithms and heuristics BIBAFull-Text 1329-1330
  Maxim Kormilitsin; Rada Chirkova; Yahya Fathi; Matthias Stallmann
Selecting and precomputing indexes and materialized views, with the goal of improving query-processing performance, is an important part of database-performance tuning. The significant complexity of the view- and index-selection problem may result in high total cost of ownership for database systems. In this paper, we develop efficient methods that deliver user-specified quality of the set of selected views and indexes when given view- and index-based plans as problem inputs. Here, quality means proximity to the globally optimum performance for the input query workload given the input query plans. Our experimental results and comparisons on synthetic and benchmark instances demonstrate the competitiveness of our approach and show that it provides a winning combination with end-to-end view- and index-selection frameworks such as those of [1, 2].
SQL extension for exploring multiple tables BIBAFull-Text 1331-1332
  Sung Jin Kim; Junghoo John Cho
Standard SQL assumes that users are aware of all tables and their schemas when writing queries. This assumption may be valid when users deal with a relatively small number of tables, but writing a SQL query over a large number of tables is often challenging: (1) users do not know which tables are relevant to their query, (2) it is too cumbersome to explicitly list tens (or even hundreds) of relevant tables in the FROM clause, and (3) the schemas of those tables are not identical. We propose an intuitive yet powerful extension to SQL that helps users explore and aggregate information spread over a large number of tables.
PBFilter: indexing flash-resident data through partitioned summaries BIBAFull-Text 1333-1334
  Shaoyi Yin; Philippe Pucheral; Xiaofeng Meng
NAND Flash has become the most popular persistent data storage medium for mobile and embedded devices. The hardware characteristics of NAND Flash (e.g. page granularity for read/write with a block-erase-before-rewrite constraint, limited number of erase cycles) preclude in-place updates. In this paper, we propose a new indexing scheme, called PBFilter, designed from the outset to exploit the peculiarities of NAND Flash.

Poster session 1/industry

Transaction reordering with application to synchronized scans BIBAFull-Text 1335-1336
  Gang Luo; Jeffrey F. Naughton; Curt J. Ellmann; Michael W. Watzke
Traditional workload management methods mainly focus on the current system status while information about the interaction between queued and running transactions is largely ignored. An exception to this is the transaction reordering method, which reorders the transaction sequence submitted to the RDBMS and improves the transaction throughput by considering both the current system status and information about the interaction between queued and running transactions. The existing transaction reordering method only considers the reordering opportunities provided by analyzing the lock conflict information among multiple transactions. This significantly limits the applicability of the transaction reordering method. In this paper, we extend the existing transaction reordering method into a general transaction reordering framework that can incorporate various factors as the reordering criteria. We show that by analyzing the resource utilization information of transactions, the transaction reordering method can also improve the system throughput by increasing the resource sharing opportunities among multiple transactions. We provide a concrete example on synchronized scans and demonstrate the advantages of our method through experiments with a commercial parallel RDBMS.
Yizkor books: a voice for the silent past BIBAFull-Text 1337-1338
  Jason J. Soo; Rebecca J. Cathey; Ophir Frieder; Michlean J. Amir; Gideon Frieder
Yizkor Book collections contain firsthand commemorative accounts of events from the era surrounding the rise and fall of Nazi Germany, including documents from before, during, and after the Holocaust. Prior to our effort, information regarding the content and location of each Yizkor Book volume was limited. We established a centralized index and metadata repository for the Yizkor Book collection and developed a detailed search interface accessible worldwide.

Poster session 1/information retrieval

An approximate string matching approach for handling incorrectly typed urls BIBAFull-Text 1339-1340
  Mihai Stroe; Radu Berinde; Cosmin Negruseri; Dan Popovici
In this paper we approach the problem of providing corrections for incorrectly typed URLs. This problem is significantly different from the classical spelling correction problem. We describe our contribution -- building a custom data structure and a search algorithm that can find approximate matches for incorrect URLs. We evaluate the quality of our results through experiments with analysts. Our system is now being used in the Google search engine.
Speed up semantic search in p2p networks BIBAFull-Text 1341-1342
  Qiang Wang; Rui Li; Lei Chen; Jie Lian; M. Tamer Özsu
Peer-to-peer architectures have become popular in modern massively distributed systems, which often operate at very large scale and contain a huge volume of heterogeneous data. To facilitate information retrieval in P2P networks, we consider a semantic search approach in which syntax-based queries are shipped to peers based on semantic correlations. Motivated by experience in Web information retrieval, we propose a novel ontology-based scheme that measures the similarity of peer interests accurately and consistently in a decentralized way, and groups peers under a scalable hierarchical overlay network. Given queries, our approach either floods them within local peer groups or guides them towards remote groups based on the similarity of interests. Our work overcomes the limitations of existing P2P hybrid-search approaches by avoiding costly data popularity measurement. Performance evaluation and comparison against baseline algorithms show that our approach provides a better solution for information retrieval in large-scale P2P networks.
A note on search based forecasting of ad volume in contextual advertising BIBAFull-Text 1343-1344
  Xuerui Wang; Andrei Broder; Marcus Fontoura; Vanja Josifovski
In contextual advertising, estimating the number of impressions of an ad is critical in planning and budgeting advertising campaigns. However, producing this forecast, even within large margins of error, is quite challenging. We attack this problem by simulating the presence of a given ad with its associated bid over historical data, involving billions of impressions. This apparently enormous computational task is reduced to a search task involving only the set of distinct pages in the data. Furthermore the search is made more efficient using a two-level search process. Experimental results show that our approach can accurately forecast the expected number of impressions of contextual ads in real time.
An extension of PLSA for document clustering BIBAFull-Text 1345-1346
  Young-Min Kim; Jean-François Pessiot; Massih Reza Amini; Patrick Gallinari
In this paper we propose an extension of the PLSA model in which an extra latent variable allows the model to co-cluster documents and terms simultaneously. We show on three datasets that our extended model produces statistically significant improvements with respect to two clustering measures over the original PLSA and the multinomial mixture MM models.
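   For reference, the original PLSA decomposition that this extension builds on models each document-word co-occurrence through a latent topic z:

      P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

   The poster's additional latent variable, which lets the model co-cluster documents and terms simultaneously, is not reproduced here.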
Online spam-blog detection through blog search BIBAFull-Text 1347-1348
  Linhong Zhu; Aixin Sun; Byron Choi
In this work, we propose a novel post-indexing spam-blog (or splog) detection method that capitalizes on the results returned by blog search engines. More specifically, we analyze the search results of a sequence of temporally-ordered queries returned by a blog search engine, and build and maintain blog profiles for those blogs whose posts frequently appear in the top-ranked search results. With the blog profiles, four splog scoring functions are evaluated using real data collected from a popular blog search engine. Our experiments show that the proposed method can effectively detect splogs with high accuracy.
Nested region algebra extended with variables for tag-annotated text search BIBAFull-Text 1349-1350
  Katsuya Masuda; Jun'ichi Tsujii
This paper presents a framework for searching text regions in tag-annotated text by specifying the annotated information, using Region Algebra. We extend the efficient algorithm for region algebra to handle both nested and crossed regions, and introduce variables for attribute values to treat tag annotations in which attributes refer to other tag regions. Our framework has been implemented in a text search engine for MEDLINE, a large textbase of abstracts in medical science. Experiments on tag-annotated MEDLINE abstracts demonstrate the effectiveness of specifying annotations and the efficiency of our framework.
Searching the wikipedia with contextual information BIBAFull-Text 1351-1352
  Antti Ukkonen; Carlos Castillo; Debora Donato; Aristides Gionis
We propose a framework for searching the Wikipedia with contextual information. Our framework extends the typical keyword search, by considering queries of the type (q,p), where q is a set of terms (as in classical Web search), and p is a source Wikipedia document. The query terms q represent the information that the user is interested in finding, and the document p provides the context of the query. The task is to rank other documents in Wikipedia with respect to their relevance to the query terms q given the context document p. By associating a context to the query terms, the search results of a search initiated in a particular page can be made more relevant.
   We suggest a number of features that extend the classical query-search model so that the context document p is considered. We then use RankSVM (Joachims 2002) to learn weights for the individual features given suitably constructed training data. Documents are ranked at query time using the inner product of the feature and the weight vectors. The experiments indicate that the proposed method considerably improves results obtained by a more traditional approach that does not take the context into account.
Winnowing-based text clustering BIBAFull-Text 1353-1354
  Javier Parapar; Álvaro Barreiro
We present an approach to document clustering based on winnowing fingerprints that achieves good effectiveness with considerable savings in memory space and computation time.
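   A sketch of the underlying winnowing fingerprint primitive (Schleimer et al.'s scheme in its simplest form; k and w are illustrative, and the clustering built on top of it is not reproduced):

      def winnow(text: str, k: int = 5, w: int = 4):
          s = "".join(text.lower().split())  # normalize case and whitespace
          kgram_hashes = [hash(s[i:i + k]) for i in range(len(s) - k + 1)]
          # Keep the minimum hash of every window of w consecutive k-grams.
          return {min(kgram_hashes[i:i + w])
                  for i in range(len(kgram_hashes) - w + 1)}

      a = winnow("Winnowing-based text clustering saves memory and time.")
      b = winnow("Winnowing based TEXT clustering saves memory and time!")
      print(round(len(a & b) / len(a | b), 2))  # Jaccard overlap of fingerprints

   Documents can then be compared, and clustered, through the overlap of their small fingerprint sets rather than their full term vectors, which is where the memory and time savings come from.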
Using sequence classification for filtering web pages BIBAFull-Text 1355-1356
  Binyamin Rosenfeld; Ronen Feldman; Lyle Ungar
Web pages often contain text that is irrelevant to their main content, such as advertisements, generic format elements, and references to other pages on the same site. When used by automatic content-processing systems, e.g., for Web indexing, text classification, or information extraction, this irrelevant text often produces a substantial amount of noise. This paper describes a trainable filtering system based on a feature-rich sequence classifier that removes irrelevant parts from pages while keeping the content intact. Most of the features the system uses are purely form-related: HTML tags and their positions, sizes of elements, etc. This keeps the system general and domain-independent. We also experiment with content words and show that while they perform very poorly alone, they can slightly improve the performance of pure-form features without jeopardizing domain-independence. Our system achieves very high accuracy (95% and above) on several collections of Web pages. We also run a series of tests with different features and different classifiers, comparing the contribution of different components to system performance, and comparing two known sequence classifiers, Robust Risk Minimization (RRM) and Conditional Random Fields (CRF), in a novel setting.
Passage relevance models for genomics search BIBAFull-Text 1357-1358
  Jay Urbain; Ophir Frieder; Nazli Goharian
We present a passage relevance model for integrating semantic and statistical evidence of biomedical concepts and topics in context using the framework of a probabilistic graphical model. Component models of topics, concepts, terms, and document are represented as potential functions within a Markov Random Field, and the probability of a passage being relevant to a biologist's information need is represented as the joint distribution across all potential functions. Relevance model feedback of top ranked passages is used to improve distributional estimates of concepts and topics in context, and a dimensional indexing strategy is used for efficient aggregation of concept and term statistics. By integrating multiple sources of evidence including dependencies between topics, concepts, and terms, we seek to improve genomics literature passage retrieval precision. Using this model, we demonstrate statistically significant improvements in retrieval precision using a large genomics literature corpus.
Cross-document cross-lingual coreference retrieval BIBAFull-Text 1359-1360
  Elif Aktolga; Marc-Allen Cartright; James Allan
In this work, we address coreference retrieval, which involves identifying aliases that are distinct references to an entity. We begin with a known alias and discover unknown aliases that refer to the same entity. We use Entity Language Models to capture the contextual language around the known alias, which aids in finding new aliases. We also show that modeling the significant dates of the known aliases improves alias discovery performance.
Siphon++: a hidden-web crawler for keyword-based interfaces BIBAFull-Text 1361-1362
  Karane Vieira; Luciano Barbosa; Juliana Freire; Altigran Silva
The hidden Web consists of data that is generally hidden behind form interfaces and, as such, is out of reach for traditional search engines. With the goal of leveraging the high-quality information in this largely unexplored portion of the Web, we propose a new strategy for automatically retrieving data hidden behind keyword-based form interfaces. Unlike previous approaches to this problem, our strategy adapts query generation and selection by detecting features of the index. We describe an extensive experimental evaluation which shows that our strategy is able to derive appropriate queries to obtain high coverage while, at the same time, avoiding the retrieval of redundant data; and that it obtains higher coverage and is more efficient than approaches that use a fixed strategy for query generation.
Investigating external corpus and clickthrough statistics for query expansion in the legal domain BIBAFull-Text 1363-1364
  Tonya Custis; Khalid Al-Kofahi
We apply language modeling keyword search augmented with Berger and Lafferty's (1999) translation model for query expansion to formulate three query expansion methods using word co-occurrence statistics from a large external corpus and user clickthrough data. We study the performance of these methods on a vertical domain (case law documents) using standard metrics and an evaluation framework designed specifically to measure the performance of query expansion under varying degrees of query-document term mismatch.
Corpus microsurgery: criteria optimization for medical cross-language ir BIBAFull-Text 1365-1366
  Monica Rogati; Yiming Yang; Jaime Carbonell
Automatic subset selection from a parallel corpus significantly improves cross-lingual information retrieval (CLIR) performance, in addition to increasing its efficiency. Our selection method extracts relevant training data by incorporating additional criteria (i.e., estimated corpus quality, taxonomy projection, and size) alongside lexical-based criteria. The challenge lies in combining these criteria using a meaningful scoring function that can be used to rank parallel sentence candidates. We choose the weighted geometric mean for its soft-AND properties, and we optimize the criteria weights by wrapping the CLIR task in an optimization shell. Because the convexity properties of the search space are indeterminate, we have explored continuous reactive tabu search (CRTS), a global optimization method. We use a large parallel corpus in the medical domain to examine the effect of the adaptation criteria and their combination on CLIR performance. In our experiments, 100 selected sentences yield 90% of the performance obtained with 5,000 times more in-domain parallel sentences. Our optimized criteria weights considerably outperform the uniform distribution baseline, as well as lexical similarity adaptation.
Metadata extraction and indexing for map search in web documents BIBAFull-Text 1367-1368
  Qingzhao Tan; Prasenjit Mitra; C. Lee Giles
In academic and scientific articles, maps are widely used to provide related geographic information and to give readers a visual understanding of the document content. As more digital documents containing maps become accessible on the Web, there is a growing demand for a Web search system that provides users with tools to retrieve documents based on the information available within a document's maps. In this paper, we design methods and algorithms to extract, identify, and index maps from academic and scientific documents in digital libraries. Experimental results show that our approach can accurately locate maps and significantly improve retrieval quality for maps in digital documents.

Poster session 1/knowledge management

An integration strategy for mining product features and opinions BIBAFull-Text 1369-1370
  Qingliang Miao; Qiudan Li; Ruwei Dai
With the development of Web 2.0, the web has become an extremely valuable source for mining opinions. In this paper, we study how to automatically mine product features and opinions by integrating multiple review sources. We propose an integration strategy to solve the problem. Experiments show that the proposed strategy is effective.
Overlapping community structure detection in networks BIBAFull-Text 1371-1372
  Nan Du; Bai Wang; Bin Wu
Many systems in nature and human society take the form of networks with community structures. In this paper, we describe a simple algorithm COCD (Clique-based Overlapping Community Detection) to efficiently mine the overlapping communities in large-scale networks, which is useful for us to have a better understanding of the nested sub-structures embedded in the whole network.
Coreference resolution using expressive logic models BIBAFull-Text 1373-1374
  Ki Chan; Wai Lam; Xiaofeng Yu
Coreference resolution is regarded as a crucial step for acquiring linkages among extracted pieces of information. Traditionally, coreference resolution models make use of independent attribute-value features over pairs of noun phrases. However, dependencies and deeper relations between features can more adequately describe the properties of coreference relations between noun phrases. In this paper, we propose a framework for coreference resolution based on first-order logic and a probabilistic graphical model, the Markov Logic Network. The proposed framework enables the use of background knowledge and captures more complex coreference linkage properties through rich expression of conditions. Moreover, the proposed conditions can capture the structural pattern within a noun phrase as well as contextual information between noun phrases. Our experiments show improvements from the use of the expressive logic models and of pattern-based conditions.
A method to predict social annotations BIBAFull-Text 1375-1376
  Ming-Hung Hsu; Hsin-Hsi Chen
This paper predicts the stabilized tag set of a resource using feedback from a small number of user annotations, aiming to reduce the need for large amounts of user annotation and to address the cold-start problem in a social annotation system.
Large maximal cliques enumeration in sparse graphs BIBAFull-Text 1377-1378
  Natwar Modani; Kuntal Dey
Here we study a variant of the maximal clique enumeration problem that incorporates a minimum size criterion. We describe preprocessing techniques to reduce the graph size. This is of practical interest since enumerating maximal cliques is a computationally hard problem and the execution time increases rapidly with the input size. We discuss the basics of an algorithm for enumerating large maximal cliques that exploits the constraint on the minimum size of the desired maximal cliques. Social networks are prime examples of large sparse graphs where enumerating large maximal cliques is of interest. We present experimental results on the social network formed by the call detail records of one of the world's largest telecom service providers. Our results show that the preprocessing methods achieve significant reduction in graph size. We also characterize the execution behaviour of our large maximal clique enumeration algorithm.
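   One standard preprocessing step consistent with the minimum-size constraint (the poster's exact techniques are not reproduced): a vertex can belong to a clique of size k or more only if it has at least k-1 neighbors, so low-degree vertices can be deleted iteratively.

      from collections import deque

      def prune_for_cliques(adj: dict, k: int):
          # Iteratively delete vertices of degree < k-1; only survivors can
          # participate in maximal cliques of size >= k.
          adj = {v: set(ns) for v, ns in adj.items()}
          queue = deque(v for v, ns in adj.items() if len(ns) < k - 1)
          while queue:
              v = queue.popleft()
              if v not in adj:
                  continue  # already removed
              for u in adj.pop(v):
                  adj[u].discard(v)
                  if len(adj[u]) < k - 1:
                      queue.append(u)
          return adj

      g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
      print(sorted(prune_for_cliques(g, 3)))  # vertex 4 cannot be in a 3-clique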
Summarization of social activity over time: people, actions and concepts in dynamic networks BIBAFull-Text 1379-1380
  Yu-Ru Lin; Hari Sundaram; Aisling Kelliher
We present a framework for automatically summarizing social group activity over time. The problem is important for understanding large-scale online social networks, which have diverse social interactions and exhibit temporal dynamics. In this work we construct summaries by extracting activity themes. We propose a novel unified temporal multi-graph framework for extracting activity themes over time. We use a non-negative matrix factorization (NMF) approach to derive two interrelated latent spaces for users and concepts. Activity themes are extracted from the derived latent spaces to construct the group activity summary. Experiments on real-world Flickr datasets demonstrate that our technique outperforms baseline algorithms such as LSI, and is additionally able to extract temporally representative activities to construct meaningful group activity summaries.
Using tag semantic network for keyphrase extraction in blogs BIBAFull-Text 1381-1382
  Lizhen Qu; Christof Müller; Iryna Gurevych
Folksonomies provide a comfortable way to search and browse the blogosphere. As the tags in the blogosphere are sparse, ambiguous and too general, this paper proposes both a supervised and an unsupervised approach that extract tags from posts using a tag semantic network. We evaluate the two methods on a blog dataset and observe an improvement in F1-measure from 0.23 to 0.50 when compared to the baseline system.
Handling implicit geographic evidence for geographic ir BIBAFull-Text 1383-1384
  Nuno Cardoso; Mário J. Silva; Diana Santos
Most geographic information retrieval systems depend on the detection and disambiguation of place names in documents, assuming that documents with a specific geographic scope contain explicit place names in the text that are strongly related to the document scopes. However, some non-geographic names, such as companies, monuments or sport events, may also provide indirect yet relevant evidence that can significantly contribute to the assignment of geographic scopes to documents. In this paper, we analyze the amount of implicit and explicit geographic evidence in newspaper documents, and measure its impact on geographic information retrieval by evaluating the performance of a retrieval system using the GeoCLEF evaluation data.
Estimating real-valued characteristics of criminals from their recorded crimes BIBAFull-Text 1385-1386
  Richard Bache; Fabio Crestani
Offender profiling concerns making inferences about a criminal from the crime(s) he has committed. Where descriptions of the crimes are recorded electronically, text mining techniques provide a means by which recorded characteristics of an offender can be linked with features of his crimes as revealed in the text. Past studies have used Language Modelling to identify characteristics that can be described by a categorical variable, e.g., gender. Here we adapt the Language Modelling approach to allow estimation of numerical quantities such as age and distance travelled.
Representative entry selection for profiling blogs BIBAFull-Text 1387-1388
  Jinfeng Zhuang; Steven C. H. Hoi; Aixin Sun; Rong Jin
Many applications in blog search and mining face the challenge of handling huge volumes of blog data, in which a single blog can contain hundreds or even thousands of entries. We investigate novel techniques for profiling blogs by selecting a subset of representative entries for each blog. We propose two principles for guiding the entry selection task: representativeness and diversity. Further, we formulate the entry selection task as a combinatorial optimization problem and propose a greedy yet effective algorithm that finds a good approximate solution by exploiting the theory of submodular functions. We suggest blog classification for judging the performance of the proposed entry selection techniques and evaluate their performance on a real blog dataset, obtaining encouraging results.
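   A sketch of greedy selection for a monotone submodular objective (the poster's actual objective combining representativeness and diversity is not reproduced; the facility-location-style function below is a stand-in with the same submodularity property, for which greedy selection carries the usual (1 - 1/e) approximation guarantee):

      def greedy_select(sim, budget):
          # sim[i][j]: similarity between entries i and j.
          # Objective F(S) = sum_i max_{j in S} sim[i][j] -- monotone submodular.
          n = len(sim)
          selected, best = [], [0.0] * n
          for _ in range(budget):
              def gain(c):  # marginal gain of adding candidate c
                  return sum(max(sim[i][c] - best[i], 0.0) for i in range(n))
              c = max((j for j in range(n) if j not in selected), key=gain)
              selected.append(c)
              best = [max(best[i], sim[i][c]) for i in range(n)]
          return selected

      sim = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
      print(greedy_select(sim, 2))  # picks one entry from each "cluster"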
Efficient web matrix processing based on dual reordering BIBAFull-Text 1389-1390
  Chih-Ming Hsu; Ming-Syan Chen
PageRank is one of the most important ranking techniques used in today's search engines. Since billions of pages already exist on the Web, many PageRank acceleration techniques have been explored. However, in the real Web, just as there are pages without out-links, there are many pages without in-links. (In this paper, a Web page without in-links is called a reverse-dangling page.) Given the state of the art, we propose a new reordered PageRank algorithm, called the two-way reordered PageRank algorithm, which exploits both dangling nodes and reverse-dangling nodes to reduce the computational complexity of computing the PageRank vector.
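   For reference, a baseline power iteration with explicit dangling-node handling (the two-way reordering itself is not reproduced). Note that a reverse-dangling page receives only the teleport term plus its share of the dangling mass, which is the kind of structure a reordering can exploit:

      def pagerank(links, alpha=0.85, iters=50):
          nodes = sorted(links)
          n = len(nodes)
          r = {v: 1.0 / n for v in nodes}
          for _ in range(iters):
              # Mass sitting on dangling nodes is redistributed uniformly.
              dangling = sum(r[v] for v in nodes if not links[v])
              nxt = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
              for v in nodes:
                  for u in links[v]:
                      nxt[u] += alpha * r[v] / len(links[v])
              r = nxt
          return r

      # "c" is dangling (no out-links); "d" is reverse-dangling (no in-links).
      web = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a"]}
      print({v: round(x, 3) for v, x in pagerank(web).items()})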
Coreex: content extraction from online news articles BIBAFull-Text 1391-1392
  Jyotika Prasad; Andreas Paepcke
We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amount of text, the number of links it contains and additional heuristics. The method is site-independent and does not use any language-based features. We tested our algorithm on a set of 1120 news article pages from 27 domains. Our algorithm achieved over 97% precision and 98% recall, and an average processing speed of under 15ms per page.
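   A toy sketch of the scoring idea (the paper's exact node features and weights are not given here; the weight below is a guess): score each candidate container by its plain text versus its link text, and keep the best-scoring node.

      import xml.etree.ElementTree as ET

      PAGE = """<html><body>
        <div><a href="/">Home</a> <a href="/news">News</a></div>
        <div><p>Article body text goes here, long enough to dominate
        the score relative to the navigation links around it.</p></div>
      </body></html>"""

      def score(node):
          text = sum(len(t.strip()) for t in node.itertext())
          link = sum(len(t) for a in node.iter("a") for t in a.itertext())
          return text - 3 * link  # reward plain text, penalize link-heavy regions

      root = ET.fromstring(PAGE)
      best = max(root.iter("div"), key=score)
      print(ET.tostring(best, encoding="unicode"))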
A novel email abstraction scheme for spam detection BIBAFull-Text 1393-1394
  Chi-Yao Tseng; Pin-Chieh Sung; Ming-Syan Chen
Prior studies on collaborative spam filtering with near-duplicate similarity matching schemes mainly represent each email by a succinct abstraction derived from the email content text. Since these abstractions cannot fully capture the evolving nature of spam, we propose in this paper a novel email abstraction scheme that considers email layout structure to represent emails. With the proposed abstraction, we design a near-duplicate matching scheme to efficiently match each incoming email against a huge spam database.
Tag-based filtering for personalized bookmark recommendations BIBAFull-Text 1395-1396
  Pavan Kumar Vatturi; Werner Geyer; Casey Dugan; Michael Muller; Beth Brownholtz
This paper investigates using social tags for the purpose of making personalized content recommendations. Our tag-based recommender creates a personalized bookmark recommendation model for each user based on "current" and "general interest" tags, defined by different time intervals.
Closing the loop in webpage understanding BIBAFull-Text 1397-1398
  Chunyu Yang; Yong Cao; Zaiqing Nie; Jie Zhou; Ji-Rong Wen
Little work has been done towards an integrated statistical model for understanding webpage structures and processing natural language sentences within HTML elements. This paper proposes a novel framework called WebNLP, which enables bidirectional integration of page structure understanding and text understanding in an iterative manner. Experiments show that the WebNLP framework achieves significantly better performance.

Poster session 2 database

Efficient processing of probabilistic spatio-temporal range queries over moving objects BIBAFull-Text 1399-1400
  Bruce S. E. Chung; Wang-Chien Lee; Arbee L. P. Chen
Range queries over the current and future positions of moving objects have received growing interest in the research community. Existing methods, however, assume that an object only moves along an anticipated path. In this paper, we study the problem of answering probabilistic range queries on moving objects based on an uncertainty model, which captures the possible movements of objects with probabilities. We conduct a performance study, which shows that our proposal significantly reduces the number of object examinations and the overall cost of query evaluation.
Data degradation: making private data less sensitive over time BIBAFull-Text 1401-1402
  Nicolas Anciaux; Luc Bouganim; Harold van Heerde; Philippe Pucheral; Peter M. G. Apers
Trail disclosure is the leakage of privacy sensitive data, resulting from negligence, attack or abusive scrutinization or usage of personal digital trails. To prevent trail disclosure, data degradation is proposed as an alternative to the limited retention principle. Data degradation is based on the assumption that long lasting purposes can often be satisfied with a less accurate, and therefore less sensitive, version of the data. Data will be progressively degraded such that it still serves application purposes, while decreasing accuracy and thus privacy sensitivity.
A light weighted damage tracking quarantine and recovery scheme for mission-critical database systems BIBAFull-Text 1403-1404
  Kun Bai; Peng Liu
As online applications gain popularity in today's E-Business world, enabling a DBMS to survive attacks is becoming crucial because of the increasingly critical role that database servers play. Although a number of research projects have tackled the emerging data corruption threats, existing mechanisms are still limited in meeting highly desired requirements such as near-zero run-time overhead and zero system down time. In this paper, we propose TRACE, a lightweight database Damage Tracking, Quarantine, and Recovery (DTQR) solution with negligible run-time overhead.
Query optimization in xml-based information integration BIBAFull-Text 1405-1406
  Dongfeng Chen; Rada Chirkova; Maxim Kormilitsin; Fereidoon Sadri; Timo J. Salo
The problem of decentralized data sharing is relevant for a wide range of applications and is still a source of major theoretical and practical challenges, in spite of many years of sustained research in information integration. We focus on the challenge of efficient query evaluation in information-integration systems, with the objective of developing query-processing strategies that are widely applicable and easy to implement in real-life applications. Our algorithms take into account important features of today's data-sharing applications, namely: XML as the likely interface to, or representation of, data sources; the potential for information overlap across data sources; and the need for inter-source processing (i.e., joins of data across data sources) in many applications.
   To the best of our knowledge, our methods are the first to account for the practical issues of information overlap across data sources and of inter-source processing. While most of our algorithms are platform- and implementation-independent, we also propose XML-specific optimization techniques that allow for system-level tuning of query processing performance. Finally, using real-life datasets and our implementation of an information-integration system shell, we provide experimental results that demonstrate that our algorithms are efficient and competitive in the information-integration setting. For all the details, please see [1].
Estimating the number of answers with guarantees for structured queries in p2p databases BIBAFull-Text 1407-1408
  Marcel Karnstedt; Kai-Uwe Sattler; Michael Haß; Manfred Hauswirth; Brahmananda Sapkota; Roman Schmidt
Structured P2P overlays supporting standard database functionalities are a popular choice for building large-scale distributed data management systems. In such systems, estimating the number of answers to structured queries can help approximate query completeness, but is especially challenging. In this paper, we propose to use routing graphs to achieve this. We introduce the general approach and briefly discuss further aspects such as overhead and guarantees.
Evaluating partial tree-pattern queries on XML streams BIBAFull-Text 1409-1410
  Xiaoying Wu; Dimitri Theodoratos
Streaming evaluation is a popular way of evaluating queries on XML documents. Besides its many advantages, it is also the only option for a number of important XML applications. Unfortunately, existing algorithms focus almost exclusively on tree-pattern queries (TPQs). Requirements for flexible querying of XML data have recently motivated the introduction of query languages that are more general and flexible than TPQs.
   We consider a partial tree-pattern query (PTPQ) language which generalizes and strictly contains TPQs. PTPQs can express a fragment of XPath which comprises reverse axes and the node identity equality (is) operator, in addition to forward axes, wildcards and predicates. We outline an original streaming algorithm for PTPQs. Our algorithm is the first to support the streaming evaluation of such a broad fragment of XPath.
Characterization of TPC-H queries for a column-oriented database on a dual-core AMD Athlon processor BIBAFull-Text 1411-1412
  Pranav Vaidya; Jaehwan (John) Lee
In this paper, we characterize the performance of the TPC-H benchmark for MonetDB, a popular column-oriented database, running on a dual-core AMD Athlon machine. Specifically, we measure the performance of key microarchitectural components and analyze in detail the nature of various stalls, namely cache stalls, branch misprediction stalls, and resource stalls. We compare our results with published results on the characterization of TPC-H for row-oriented databases. In contrast to previous approaches, we use thread-level monitoring of database threads to study the performance of the database in isolation from the rest of the system.

Poster session 2/information retrieval

Natural language retrieval of grocery products BIBAFull-Text 1413-1414
  Petteri Nurmi; Eemil Lagerspetz; Wray Buntine; Patrik Floréen; Joonas Kukkonen; Peter Peltonen
In this paper we describe modifications to a natural language grocery retrieval system, introduced in our earlier work. We also compare our system against an off-the-shelf retrieval tool, and show that our system is significantly better for top-ranked retrieval results.
Improve the effectiveness of the opinion retrieval and opinion polarity classification BIBAFull-Text 1415-1416
  Wei Zhang; Lifeng Jia; Clement Yu; Weiyi Meng
Opinion retrieval is a document retrieval and ranking process: a relevant document must be relevant to the query and contain opinions toward the query. Opinion polarity classification is an extension of opinion retrieval; it classifies a retrieved document as positive, negative, or mixed, according to the overall polarity of the query-relevant opinions in the document. This paper (1) proposes several new techniques that improve the effectiveness of an existing opinion retrieval system, and (2) presents a novel two-stage model to solve the opinion polarity classification problem. In this model, every query-relevant opinionated sentence in a document retrieved by our opinion retrieval system is classified as positive or negative by an SVM classifier; a second classifier then determines the overall opinion polarity of the document. Experimental results show that both the opinion retrieval system with the proposed techniques and the polarity classification model outperform the best reported systems.
A latent variable model for query expansion using the hidden Markov model BIBAFull-Text 1417-1418
  Qiang Huang; Dawei Song
We propose a novel probabilistic method based on the Hidden Markov Model (HMM) to learn the structure of a Latent Variable Model (LVM) for query language modeling. In the proposed LVM, combinations of query terms are viewed as the latent variables, and the segmented chunks from the feedback documents are used as the observations given these latent variables. Our extensive experiments show that our method significantly outperforms a number of strong baselines in terms of both effectiveness and robustness.
A survey of pre-retrieval query performance predictors BIBAFull-Text 1419-1420
  Claudia Hauff; Djoerd Hiemstra; Franciska de Jong
The focus of research on query performance prediction is to predict the effectiveness of a query given a search system and a collection of documents. If the performance of queries can be estimated in advance of, or during the retrieval stage, specific measures can be taken to improve the overall performance of the system. In particular, pre-retrieval predictors predict the query performance before the retrieval step and are thus independent of the ranked list of results; such predictors base their predictions solely on query terms, the collection statistics and possibly external sources such as WordNet. In this poster, 22 pre-retrieval predictors are categorized and assessed on three different TREC test collections.
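   One classic predictor of the kind surveyed (the poster's categorization covers many more) is the average inverse document frequency of the query terms, computable from collection statistics alone before any retrieval; a toy sketch with a hypothetical three-document collection:

      import math
      from collections import Counter

      docs = ["the quick brown fox", "the lazy dog",
              "query performance prediction for retrieval"]
      N = len(docs)
      df = Counter(t for d in docs for t in set(d.split()))  # document frequencies

      def avg_idf(query: str) -> float:
          # Higher average IDF suggests a more discriminative (easier) query.
          terms = query.split()
          return sum(math.log((N + 1) / (df[t] + 0.5)) for t in terms) / len(terms)

      print(round(avg_idf("query performance"), 3))
      print(round(avg_idf("the dog"), 3))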
Modeling document features for expert finding BIBAFull-Text 1421-1422
  Jianhan Zhu; Dawei Song; Stefan Rüger; Xiangji Huang
We argue that expert finding is sensitive to multiple document features in an organization, and therefore, can benefit from the incorporation of these document features. We propose a unified language model, which integrates multiple document features, namely, multiple levels of associations, PageRank, indegree, internal document structure, and URL length. Our experiments on two TREC Enterprise Track collections, i.e., the W3C and CSIRO datasets, demonstrate that the natures of the two organizational intranets and two types of expert finding tasks, i.e., key contact finding for CSIRO and knowledgeable person finding for W3C, influence the effectiveness of different document features. Our work provides insights into which document features work for certain types of expert finding tasks, and helps design expert finding strategies that are effective for different scenarios.
Mining named entity transliteration equivalents from comparable corpora BIBFull-Text 1423-1424
  Raghavendra Udupa; K. Saravanan; A. Kumaran; Jagadeesh Jagarlamudi
Estimating retrieval effectiveness using rank distributions BIBAFull-Text 1425-1426
  Vishwa Vinay; Natasa Milic-Frayling; Ingemar Cox
In this paper, we consider the task of estimating query effectiveness, i.e., assessing retrieval system performance in the absence of user relevance judgments. In our approach we model the score associated with each document in the result set as a Gaussian random variable. The mean and the variance of each document score can then be used to estimate the probability that a document will be ranked above another, and thus to calculate the expected rank of the document in the ranked list. We propose to measure system effectiveness by comparing the predicted and actual ranks of the retrieved documents. In our experiments we consider two retrieval models and five document scoring methods and evaluate their impact on the proposed estimation measures. Our experiments with standardized data sets that include document relevance judgments, on the task of predicting relative query effectiveness, show that the expected rank metric is robust to variations in document scoring and retrieval algorithms.
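   One consistent formalization of this construction, assuming independent Gaussian scores s_i ~ N(\mu_i, \sigma_i^2), is

      P(s_i > s_j) = \Phi\!\left( \frac{\mu_i - \mu_j}{\sqrt{\sigma_i^2 + \sigma_j^2}} \right), \qquad \mathbb{E}[\mathrm{rank}(i)] = 1 + \sum_{j \neq i} P(s_j > s_i)

   where \Phi is the standard normal CDF; the effectiveness estimate then follows from comparing these expected ranks with the actual ranks in the result list.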
Semi-supervised ranking aggregation BIBAFull-Text 1427-1428
  Shouchun Chen; Fei Wang; Yaangqiu Song; Changshui Zhang
Ranking aggregation is important in data mining and information retrieval. In this paper, we propose a semi-supervised ranking aggregation method in which the orders of several item pairs are labeled as side information. The core idea is to learn a ranking function based on the ordering agreement of different rankers. The ranking scores assigned by this ranking function on the labeled data are consistent with the given pairwise order constraints, while the ranking scores on the unlabeled data obey the intrinsic manifold structure of the ranked items. The experimental results show that our method works well.
Ranking in folksonomy systems: can context help? BIBAFull-Text 1429-1430
  Fabian Abel; Nicola Henze; Daniel Krause
Folksonomy systems have been shown to contribute to the quality of Web search ranking strategies. In this paper, we analyze and compare different graph-based ranking algorithms, namely FolkRank, SocialPageRank, and SocialSimRank. We enhance these algorithms by exploiting the context of tag assignments, and evaluate the results on the GroupMe! dataset. In GroupMe!, users can organize and maintain arbitrary Web resources in self-defined groups. When users annotate resources in GroupMe!, this can be interpreted in the context of a certain group. The grouping activity delivers valuable semantic information about resources and their context. We show how to use this information to improve the detection of relevant search results, and compare different strategies for ranking result lists in folksonomy systems.
Evaluating topic models for information retrieval BIBAFull-Text 1431-1432
  Xing Yi; James Allan
We explore the utility of different types of topic models, both probabilistic and not, for retrieval purposes. We show that: (1) topic models are effective for document smoothing; (2) more elaborate topic models that capture topic dependencies provide no additional gains; (3) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (4) topics discovered on the whole corpus are too coarse-grained to be useful for query expansion. Experiments to measure topic models' ability to predict held-out likelihood confirm past results on small corpora, but suggest that simple approaches to topic modeling are better for large corpora.
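For concreteness, document smoothing with a topic model (finding (1) above) is commonly realized as a linear interpolation of the maximum-likelihood document language model with a topic-based model; a standard formulation, though not necessarily the exact one evaluated here, is

    p(w \mid d) = \lambda\, p_{ml}(w \mid d) + (1 - \lambda) \sum_{z} p(w \mid z)\, p(z \mid d)

where z ranges over topics and \lambda controls the smoothing strength; finding (3) amounts to replacing the topic mixture with a mixture over a document's most similar neighbors.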
A novel statistical Chinese language model and its application in pinyin-to-character conversion BIBAFull-Text 1433-1434
  Bo Lin; Jun Zhang
In this paper, we present a novel Chinese language model and study its applications, in particular Chinese pinyin-to-character conversion. In the new model, each word is associated with a supporting context constructed by mining the frequent sets of nearby phrases and their distances to the word. Such information was usually overlooked in previous n-gram models and their variants. We apply the model to Chinese pinyin-to-character conversion and find that it offers a better solution to Chinese text input. The model has lower perplexity in our evaluation and higher prediction accuracy than the state-of-the-art n-gram Markov model for Chinese.
Integrating clustering and multi-document summarization to improve document understanding BIBAFull-Text 1435-1436
  Dingding Wang; Shenghuo Zhu; Tao Li; Yun Chi; Yihong Gong
Document understanding techniques such as document clustering and multi-document summarization have been receiving much attention in recent years. Current document clustering methods usually represent documents as a term-document matrix and perform clustering algorithms on it. Although these clustering methods can group the documents satisfactorily, it is still hard for people to grasp the meanings of the documents since there is no satisfactory interpretation for each document cluster. In this paper, we propose a new language model to simultaneously cluster and summarize documents. By utilizing the mutual influence of document clustering and summarization, our method yields (1) a better document clustering method with more meaningful interpretations and (2) a better document summarization method that takes document context information into consideration.
Answering general time sensitive queries BIBAFull-Text 1437-1438
  Wisam Dakka; Luis Gravano; Panagiotis G. Ipeirotis
Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for "recency" queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We extensively evaluated our techniques using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examined several alternatives for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. Our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
Search-based query suggestion BIBAFull-Text 1439-1440
  Jiang-Ming Yang; Rui Cai; Feng Jing; Shuo Wang; Lei Zhang; Wei-Ying Ma
In this paper, we propose a unified strategy that combines the query log and search results for query suggestion. In this way, we leverage both users' search intentions for popular queries and the power of search engines for unpopular queries. The suggested queries are ranked according to their relevance and quality, and each suggestion is described with a rich snippet including a photo and a related description.
Entity-based query reformulation using wikipedia BIBAFull-Text 1441-1442
  Yang Xu; Fan Ding; Bin Wang
Many real-world applications increasingly involve both structured data and text, and entity-based retrieval is an important problem in this realm. In this paper, we present an automatic query reformulation approach based on entities detected in each query. The aim is to utilize the semantics associated with entities to enhance document retrieval. This is done by expanding a query with terms/phrases related to the entities in the query. We exploit Wikipedia as a large repository of entity information. Our reformulation approach consists of three major steps: (1) detect a representative entity in the query; (2) expand the query with entity-related terms/phrases; and (3) incorporate term dependency features. We evaluate our approach on the ad-hoc retrieval task over four TREC collections, including two large web collections. Experimental results show that significant improvement is possible by utilizing entity-related information.

Poster session 2/knowledge management

Group-based learning: a boosting approach BIBAFull-Text 1443-1444
  Weijian Ni; Jun Xu; Hang Li; Yalou Huang
This paper points out that many machine learning problems in IR should be, and can be, formalized in a novel way, referred to as 'group-based learning'. In group-based learning, it is assumed that training data as well as testing data consist of groups. The classifier is created and utilized across groups. Furthermore, evaluation in testing and also in training is conducted at the group level, with the use of evaluation measures defined on a group. This paper addresses the problem and presents a Boosting algorithm to perform the new learning task. The algorithm, referred to as AdaBoost.Group, is proven to improve accuracy in terms of group-based measures during training.
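The abstract does not spell out AdaBoost.Group's update rule, but the general shape of boosting with group-level evaluation can be sketched as follows; train_weak and group_score (a group-level measure in [0, 1]) are assumed callbacks, not the paper's API:

    import math

    def adaboost_group(groups, train_weak, group_score, rounds=10):
        # Maintain one weight per group; train a weak learner on
        # group-weighted data; reweight groups by how poorly the
        # learner scores on them under a group-level measure.
        w = [1.0 / len(groups)] * len(groups)
        learners, alphas = [], []
        for _ in range(rounds):
            h = train_weak(groups, w)
            scores = [group_score(h, g) for g in groups]
            err = sum(wi * (1.0 - s) for wi, s in zip(w, scores))
            if err == 0.0 or err >= 0.5:
                break
            alpha = 0.5 * math.log((1.0 - err) / err)
            w = [wi * math.exp(-alpha * (2.0 * s - 1.0)) for wi, s in zip(w, scores)]
            total = sum(w)
            w = [wi / total for wi in w]
            learners.append(h)
            alphas.append(alpha)
        return learners, alphas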
Collaborative partitioning with maximum user satisfaction BIBAFull-Text 1445-1446
  Fred S. Annexstein; Svetlana Strunjaš
Our collaborative partitioning model posits a bicriteria objective in which we seek the best item clustering that satisfies the most users at the highest level of satisfaction. We consider two basic methods for determining user satisfaction. The first method is based on how well each user's preferences match a given partition, and the second method is based on average correlation scores taken over sufficiently large subpopulations of users. We show these problems are NP-Hard and develop a set of heuristic approaches for solving them. We provide lower bounds on the satisfaction level on random data, and error bounds in the planted partition model, which provide confidence levels for our heuristic methods. Finally, we present experiments on several real examples that demonstrate the effectiveness of our framework.
Efficient frequent pattern mining over data streams BIBAFull-Text 1447-1448
  Syed Khairuzzaman Tanbeer; Chowdhury Farhan Ahmed; Byeong-Soo Jeong; Young-Koo Lee
This paper proposes a prefix-tree structure, called the CPS-tree (Compact Pattern Stream tree), that efficiently discovers the exact set of recent frequent patterns from a high-speed data stream. The CPS-tree introduces a dynamic tree restructuring technique for handling stream data, which allows it to achieve a highly compact frequency-descending tree structure at runtime and facilitates an efficient FP-growth-based [1] mining technique.
GHOST: an effective graph-based framework for name distinction BIBAFull-Text 1449-1450
  Xiaoming Fan; Jianyong Wang; Bing Lv; Lizhu Zhou; Wei Hu
Name ambiguity stems from the fact that many people or objects share identical names. In this paper, we focus on investigating the problem in digital libraries to distinguish publications written by authors with identical names. We present an effective graph-based framework, GHOST (abbr. GrapH-based framewOrk for name diStincTion), to solve the problem systematically. We evaluated the framework on the real DBLP dataset, and the experimental results show that GHOST outperforms the state-of-the-art method.
Deriving non-redundant approximate association rules from hierarchical datasets BIBAFull-Text 1451-1452
  Gavin Shaw; Yue Xu; Shlomo Geva
Association rule mining plays an important role in knowledge and information discovery. However, there are still shortcomings in the quality of the discovered rules, and often the number of discovered rules is huge and contains redundancies, especially in the case of multi-level datasets. Previous work has shown that mining non-redundant rules is a promising approach to solving this problem, with work by [6,8,9,10] focusing on single-level datasets. Recent work by Shaw et al. [7] has extended the non-redundant approaches presented in [6,8,9] to include the elimination of redundant exact basis rules from multi-level datasets. Here we propose a continuation of the work in [7] that allows for the removal of hierarchically redundant approximate basis rules from multi-level datasets by using a dataset's hierarchy or taxonomy.
Pattern-based semantic class discovery with multi-membership support BIBAFull-Text 1453-1454
  Shuming Shi; Xiaokang Liu; Ji-Rong Wen
A semantic class is a collection of items (words or phrases) sharing common semantic properties. This paper proposes an approach to constructing one or multiple semantic classes for an input item. Two challenges are addressed: multi-membership, and noise-tolerance.
Detecting significant distinguishing sets among bi-clusters BIBAFull-Text 1455-1456
  Faris Alqadah; Raj Bhatnagar
A fundamental task of data analysis is comprehending what distinguishes clusters found within the data. We present the problem of mining distinguishing sets, which seeks to find sets of objects or attributes that induce the most incremental change between adjacent bi-clusters of a binary dataset. Viewing the lattice of bi-clusters formed within a data set as a weighted directed graph, we mine the most significant distinguishing sets by growing a maximal-cost spanning tree of the lattice.
Semi-supervised metric learning by maximizing constraint margin BIBAFull-Text 1457-1458
  Fei Wang; Shouchun Chen; Changshui Zhang; Tao Li
Distance metric learning is a long-standing problem in supervised learning. In this paper, we consider the problem of learning a proper distance metric under the guidance of weak supervisory information. Specifically, this information takes the form of pairwise constraints which specify whether a pair of data points belong to the same class (must-link constraints) or to different classes (cannot-link constraints). Given these constraints, our algorithm aims to learn a distance metric under which points with must-link constraints are pushed as close together as possible, while points with cannot-link constraints are pulled as far apart as possible. Finally, experimental results are presented to show the effectiveness of our method.
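One simple way to realize this push-pull objective is a linear projection that maximizes the difference between the cannot-link and must-link scatter matrices; the eigen-solution below is a minimal sketch of that idea under a linear-metric assumption, not the paper's exact algorithm:

    import numpy as np

    def constraint_margin_metric(X, must, cannot, dim):
        # X: (n, d) data; must/cannot: lists of index pairs (i, j).
        # Find an orthonormal W maximizing tr(W' (S_cannot - S_must) W),
        # i.e. spreading cannot-link pairs while shrinking must-link pairs.
        def scatter(pairs):
            S = np.zeros((X.shape[1], X.shape[1]))
            for i, j in pairs:
                diff = (X[i] - X[j])[:, None]
                S += diff @ diff.T
            return S
        M = scatter(cannot) - scatter(must)
        _, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
        W = vecs[:, -dim:]               # top-dim eigenvectors
        return W                         # distance: ||W.T @ x - W.T @ y||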
On quantifying changes in temporally evolving dataset BIBAFull-Text 1459-1460
  Rohan Choudhary; Sameep Mehta; Amitabha Bagchi
In this paper, we present a general framework to quantify changes in temporally evolving data. We focus on changes that materialize due to evolution and interactions of features extracted from the data. The changes are captured by the following key transformations: create, merge, split, continue, and cease. First, we identify various factors which influence the importance of each transformation. These factors are then combined using a weight vector. The weight vector encapsulates domain knowledge. We evaluate our algorithm using the following datasets: DBLP, IMDB, Text and Scientific Dataset.
Fast spatial co-location mining without cliqueness checking BIBAFull-Text 1461-1462
  Zhongshan Lin; SeungJin Lim
In this paper, we propose a novel spatial co-location mining algorithm which automatically generates co-located spatial features without generating any non-clique candidates at each level. Consequently, our algorithm is more efficient than other existing level-wise co-location algorithms because no cliqueness checking is performed. In addition, our algorithm produces a smaller number of co-location candidates than the other existing algorithms.
Decomposition of terminology graphs for domain knowledge acquisition BIBAFull-Text 1463-1464
  Fidelia Ibekwe-SanJuan; Eric SanJuan; Michael S. E. Vogeley
We propose a graph decomposition algorithm for analyzing the structure of complex graph networks. After multi-word term extraction, we apply techniques from text mining and visual analytics in a novel way, integrating symbolic and numeric information to build clusters of domain topics. Terms are clustered based on surface linguistic variations, and clusters are inserted into an association network based on their intersection with documents. The graph is then decomposed, based on its atom graph structure, into a central (non-decomposable) atom and peripheral atoms. The whole process is applied to publications from the Sloan Digital Sky Survey (SDSS) project in the field of astronomy. The mapping obtained was evaluated by a domain expert and appears to have captured interesting conceptual relations between different domain topics.
In the development of a Spanish metamap BIBAFull-Text 1465-1466
  Francisco M. Carrero; José Carlos Cortizo; José María Gómez; Manuel de Buenaga
MetaMap is an online application that allows mapping text to UMLS Metathesaurus concepts, which is very useful for interoperability among different languages and systems within the biomedical domain. MetaMap Transfer (MMTx) is a Java program that makes MetaMap available to biomedical researchers. Currently there is no Spanish version of MetaMap, which hinders the use of the UMLS Metathesaurus to extract concepts from Spanish biomedical texts.
   Our ongoing research is mainly focused on using biomedical concepts for cross-lingual text classification and retrieval [3]. In this context, the use of concepts instead of a bag-of-words representation allows us to approach text classification tasks independently of the language [4]. In this paper we evaluate the possibility of combining automatic translation techniques with the use of biomedical ontologies to produce an English text that can be processed by MMTx.
Scalable complex pattern search in sequential data BIBAFull-Text 1467-1468
  Leila Kaghazian; Dennis McLeod; Reza Sadri
Searching data streams has traditionally been very limited, either in the complexity of the search or in the size of the searched dataset. In this paper, we investigate the design and optimization of constructs that enable SQL to express complex patterns. In particular, we propose the RSPS (recursive sequential pattern search) algorithm, which exploits the inter-dependencies between the elements of a sequential pattern to minimize repeated passes over the same data. Experimental results show impressive performance gains, with speedups of up to 100 times.
Combining concept hierarchies and statistical topic models BIBAFull-Text 1469-1470
  Chaitanya Chemudugunta; Padhraic Smyth; Mark Steyvers
Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, on the other hand, tend to be semantically richer due to careful selection of words to define concepts but they tend not to cover the themes in a data set exhaustively. In this paper, we propose a probabilistic framework to combine a hierarchy of human-defined semantic concepts with statistical topic models to seek the best of both worlds. Experimental results using two different sources of concept hierarchies and two collections of text documents indicate that this combination leads to systematic improvements in the quality of the associated language models as well as enabling new techniques for inferring and visualizing the semantics of a document.

Poster session 3: database

Energy-efficient skyline query processing and maintenance in sensor networks BIBAFull-Text 1471-1472
  Weifa Liang; Baichen Chen; Jeffrey Xu Yu
The skyline query, an important operator in databases for multi-preference analysis and decision making, has received much attention recently due to its wide range of applications. In this paper, we consider the skyline query problem in wireless sensor networks with the objective of maximizing the network lifetime, proposing filter-based distributed algorithms for skyline evaluation and maintenance. We also conduct preliminary experiments to evaluate the performance of the proposed algorithms. The experimental results demonstrate that the proposed algorithms significantly outperform existing algorithms on various datasets.
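For reference, the dominance test that defines a skyline is simple; a naive centralized version (the paper's contribution is the filter-based, in-network evaluation, which this sketch does not attempt) looks like:

    def dominates(p, q):
        # p dominates q if p is at least as good in every dimension and
        # strictly better in at least one (smaller assumed preferable).
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    def skyline(points):
        # Keep exactly the points that no other point dominates.
        return [p for p in points if not any(dominates(q, p) for q in points)]

Filter-based algorithms push predicates derived from such tests toward the sensors, so that dominated tuples can be dropped before they consume transmission energy.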
Table summarization with the help of domain lattices BIBAFull-Text 1473-1474
  K. Selçuk Candan; Huiping Cao; Yan Qi; Maria Luisa Sapino
Table summarization is necessary in various scenarios where it is hard to display a large table. It can benefit from knowledge about acceptable value clustering alternatives. In this paper, we formulate the problem of table summarization with the help of domain knowledge lattices. We provide the outline of a fuzzy mechanism to express alternative clustering strategies. We further sketch a novel ranked set cover based evaluation mechanism (RSC) to tackle the inherent complexity.
Protecting location privacy against location-dependent attack in mobile services BIBAFull-Text 1475-1476
  Xiao Pan; Jianliang Xu; Xiaofeng Meng
Privacy preservation has recently received considerable attention in location-based mobile services. In this paper, we present the location-dependent attack resulting from continuous and dependent location updates, and propose an incremental clique-based cloaking algorithm, called ICliqueCloak, to defend against it. The main idea is to incrementally maintain maximal cliques for location cloaking in an undirected graph that takes into consideration the effect of continuous location updates.
Polyhedral transformation for indexed rank order correlation queries BIBAFull-Text 1477-1478
  Philon Nguyen; Nematollaah Shiri
Rank order correlation has been used extensively when the data is non-parametric or when the relationship between two variables is nonlinear and monotonic. In such cases, linear correlation measures, such as the product-moment coefficient, are inadequate and fail to detect correlative relations. We present a polyhedral indexing technique for rank order correlation queries over time series data. We use an interesting geometric interpretation of rank order correlation which lends itself to indexing by spatial indexes such as R-trees. Our experimental results indicate one to two orders of magnitude improvement over sequential scan -- the only alternative solution.
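For reference, the standard rank order (Spearman) correlation between two rankings of n items, with d_i the difference between the ranks assigned to item i, is

    \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}

and it is queries over this quantity, rather than over the raw values, that the polyhedral index is designed to answer.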
Workload-based optimization of integration processes BIBAFull-Text 1479-1480
  Matthias Boehm; Uwe Wloka; Dirk Habich; Wolfgang Lehner
The efficient execution of integration processes between distributed, heterogeneous data sources and applications is a challenging research area of data management. These integration processes are an abstraction for workflow-based integration tasks, as used in EAI servers and WfMS. A major problem is significant workload change during runtime. The performance of integration processes strongly depends on these dynamic workload characteristics, and hence workload-based optimization is important. However, existing approaches to workflow optimization address only rule-based optimization and disregard changing workload characteristics. To overcome the problem of inefficient process execution in the presence of workload shifts, we present an approach for the workload-based optimization of instance-based integration processes and show that significant execution time reductions are possible.

Poster session 3/information retrieval

Suppressing outliers in pairwise preference ranking BIBAFull-Text 1487-1488
  Vitor R. Carvalho; Jonathan L. Elsas; William W. Cohen; Jaime G. Carbonell
Many of the recently proposed algorithms for learning feature-based ranking functions are based on the pairwise preference framework, in which, instead of taking documents in isolation, document pairs are used as instances in the learning process. One disadvantage of this process is that a noisy relevance judgment on a single document can lead to a large number of mislabeled document pairs. This can jeopardize robustness and degrade overall ranking performance. In this paper we study the effects of outlying pairs in rank learning with pairwise preferences and introduce a new meta-learning algorithm capable of suppressing these undesirable effects. This algorithm works as a second optimization step in which any linear baseline ranker can be used as input. Experiments on eight different ranking datasets show that this optimization step produces statistically significant performance gains over state-of-the-art methods.
Re-considering neighborhood-based collaborative filtering parameters in the context of new data BIBAFull-Text 1481-1482
  Adele E. Howe; Ryan D. Forbes
The Movielens dataset and the Herlocker et al. study of 1999 have been very influential in collaborative filtering. Yet, the age of both invites re-examining their applicability. We use Netflix challenge data to re-visit the prior results. In particular, we re-evaluate the parameters of Herlocker et al.'s method on two critical factors: measuring similarity between users and normalizing the ratings of the users. We find that normalization plays a significant role and that Pearson Correlation is not necessarily the best similarity metric.
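For context, the similarity metric being re-evaluated is the classic Pearson correlation over items two users have co-rated; a minimal sketch, omitting refinements such as significance weighting, follows:

    def pearson_sim(ratings_u, ratings_v):
        # ratings_u, ratings_v: dicts mapping item id -> rating.
        common = set(ratings_u) & set(ratings_v)
        if len(common) < 2:
            return 0.0
        mu_u = sum(ratings_u[i] for i in common) / len(common)
        mu_v = sum(ratings_v[i] for i in common) / len(common)
        num = sum((ratings_u[i] - mu_u) * (ratings_v[i] - mu_v) for i in common)
        den_u = sum((ratings_u[i] - mu_u) ** 2 for i in common) ** 0.5
        den_v = sum((ratings_v[i] - mu_v) ** 2 for i in common) ** 0.5
        if den_u == 0.0 or den_v == 0.0:
            return 0.0
        return num / (den_u * den_v)

The mean-centering step is one form of the rating normalization whose importance the study confirms.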
Efficient estimation of the size of text deep web data source BIBFull-Text 1485-1486
  Jianguo Lu
A georeferencing multistage method for locating geographic context in web search BIBAFull-Text 1485-1486
  Álvaro Zubizarreta; Pablo de la Fuente; José M. Cantera; Mario Arias; Jorge Cabrero; Guido García; César Llamas; Jesús Vegas
The geographic scope of Web pages is becoming an essential dimension of Web search, especially for mobile users. This paper presents a multistage method for assigning a geographic focus to Web pages (GeoReferencing) according to their text contents. We suggest several heuristics for disambiguating toponyms and a scoring procedure for focus determination. Furthermore, we provide an experimental methodology for evaluating accuracy. Finally, we obtained promising results of over 70% accuracy at city-level resolution.
Incorporating place name extents into geo-ir ranking BIBAFull-Text 1489-1490
  Hiroyuki Toda; Norihito Yasuda; Yumiko Matsuura; Ryoji Kataoka
This paper proposes a novel Geo-IR ranking method that realizes effective searches that emphasize the user's immediate surroundings. It assesses the extent implied by place names in documents and then emphasizes place names that are highly specific in terms of identifying locations.
The effect of contextualization at different granularity levels in content-oriented xml retrieval BIBAFull-Text 1491-1492
  Paavo Arvola; Jaana Kekäläinen; Marko Junkkari
In a hierarchical XML structure, the ancestors form the context of an XML element. The process of taking an element's context into account in element scoring is called contextualization. The aim of this paper is to separate different granularity levels and test the effect of contextualization at these levels.
Using the current browsing context to improve search relevance BIBAFull-Text 1493-1494
  Mandar A. Rahurkar; Silviu Cucerzan
In this paper, we investigate the problem of improving the relevance of a Web search engine by adapting it to the dynamic needs of the user. We examine a representative case of sudden change in information need, namely the exposure of the user to news data. In our earlier work we showed that the majority of queries submitted by users after browsing documents in the news domain are related to the most recently browsed document. We explore several methods of biasing the search by performing query expansion and re-ranking of the search results of a major search engine for queries identified as good candidates for contextualization. We show that these methods substantially increase the similarity between the obtained top 10 search results and the most recently browsed document.
Using a graph-based ontological user profile for personalizing search BIBAFull-Text 1495-1496
  Mariam Daoud; Lynda Tamine-Lechani; Mohand Boughanem
In this poster, we describe a personalized search approach which involves a graph-based user profile derived from an ontology and a session boundary recognition mechanism. The user profile captures short-term user interests and is used for re-ranking the search results of queries in the same search session. Session boundary recognition is based on tracking changes in the dominant concepts held by the query and the user profile. Experimental evaluation was carried out using the HARD 2003 TREC collection and shows that our approach is effective.
Measuring user preference changes in digital libraries BIBAFull-Text 1497-1498
  Yang Sun; Huajing Li; Isaac G. Councill; Wang-Chien Lee; C. Lee Giles
Much research has been conducted using web access logs to study implicit user feedback and infer user preferences from clickstreams. However, little research measures the changes of user preferences of ranking documents over time. We present a study that measures the changes of user preferences based on an analysis of access logs of a large scale digital library over one year. A metric based on the accuracy of predicting future user actions is proposed. The results show that although user preferences change over time, the majority of user actions should be predictable from previous browsing behavior in the digital library.
Utilization of navigational queries for result presentation and caching in search engines BIBAFull-Text 1499-1500
  Rifat Ozcan; Ismail Sengor Altingovde; Özgür Ulusoy
We propose result page models with varying granularities for navigational queries and show that this approach provides a better utilization of cache space and reduces bandwidth requirements.
SHOPSMART: product recommendations through technical specifications and user reviews BIBAFull-Text 1501-1502
  Alexander Yates; James Joseph; Ana-Maria Popescu; Alexander D. Cohn; Nick Sillick
This paper describes a new method for providing recommendations tailored to a user's preferences using text mining techniques and online technical specifications of products. We first learn a model that can predict the price of a product given automatically-determined features describing technical specifications and users' opinions. We then use this model to rank a list of products based on individual users' preferences about various features. On a data set collected from Amazon reviews and online technical specifications, rankings produced by this model rank the best product for a user in the 87th percentile of products in its category, on average. Our approach outperforms several comparison systems by 21 percentiles or more.
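The abstract implies a two-stage pipeline: fit a price predictor over product feature vectors, then score products by preference-weighted predicted worth relative to the asking price. The sketch below is only one plausible reading of that pipeline; the feature matrix F, the prefs vector, and the scoring rule are illustrative assumptions, not the paper's method:

    import numpy as np

    def fit_price_model(F, prices):
        # Least-squares fit of price from feature vectors F (n x d).
        w, *_ = np.linalg.lstsq(F, prices, rcond=None)
        return w

    def rank_products(F, prices, w, prefs):
        # Score each product by the preference-weighted worth its
        # features predict, net of its actual price; rank descending.
        value = F @ (w * prefs)
        return np.argsort(value - prices)[::-1]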
Trust, authority and popularity in social information retrieval BIBAFull-Text 1503-1504
  Gabriella Kazai; Natasa Milic-Frayling
We present a social information retrieval (SIR) model comprising the social network of actors (e.g., authors, publishers, consumers), the graph representing relations in data (e.g., publications), and the links between the social and data network that reflect activities in the network such as search, authoring, annotation, etc. Building on this hybrid network, we describe relevance in terms of the trust propagated through the network and rendered onto a given item. In particular, relevance is a function of the approval votes from the associated sub-graph and the reputation of the sub-graph nodes. We explore a model that differentiates between approval from actors who are perceived authorities by the user and the approval by a wider community, representing the popular opinion.
A spam resistant family of concavo-convex ranks for link analysis BIBAFull-Text 1505-1506
  Sreangsu Acharyya; Joydeep Ghosh
A parameterized family of non-linear, link-analytic ranking functions is proposed that includes PageRank as a special case and uses the convexity properties of these functions to be more resistant to link spam attacks. A contribution of the paper is the construction of such a scheme with provable uniqueness and convergence guarantees. The paper also demonstrates that even in an unlabelled scenario this family can have spam resistance comparable to TrustRank [3], which uses labels of spam or not-spam on a training set. The proposed method can use labels, if available, to improve its performance and provide state-of-the-art link spam protection.
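For orientation, the linear member of the family is the standard PageRank recurrence over N pages with damping factor \alpha:

    r(i) = \frac{1 - \alpha}{N} + \alpha \sum_{j \in \mathrm{in}(i)} \frac{r(j)}{\mathrm{out}(j)}

Roughly speaking, the concavo-convex family transforms the contributions on the right-hand side nonlinearly, which is the source of the claimed resistance to concentrated spam links.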

Poster session 3/knowledge management

Boosting social annotations using propagation BIBAFull-Text 1507-1508
  Shenghua Bao; Bohai Yang; Ben Fei; Shengliang Xu; Zhong Su; Yong Yu
This paper is concerned with the problem of boosting social annotations using propagation, which is also called social propagation. In particular, we focus on propagating social annotations of web pages (e.g., annotations in Del.icio.us). Although social annotations are developing fast, they cover only a small proportion of Web pages on the World Wide Web. To alleviate the low coverage problem, a general propagation model based on the Random Surfer is proposed. Specifically, four steps are included: basic propagation, multiple-annotation propagation, multiple-link-type propagation, and constraint-guided propagation. Experimental results show that the proposed model is very effective in increasing the coverage of annotations as well as preserving the properties of social annotations.
Effective pattern taxonomy mining in text documents BIBAFull-Text 1509-1510
  Yuefeng Li; Sheng-Tang Wu; Xiaohui Tao
Many data mining techniques have been proposed for mining useful patterns in databases. However, how to effectively utilize discovered patterns is still an open research issue, especially in the domain of text mining. Most existing methods adopt term-based approaches. However, they all suffer from the problems of polysemy and synonymy. This paper presents an innovative technique, pattern taxonomy mining, to improve the effectiveness of using discovered patterns for finding useful information. Substantial experiments on RCV1 demonstrate that the proposed solution achieves encouraging performance.
Incorporating topical support documents into a small training set in text categorization BIBAFull-Text 1511-1512
  Kyung Soon Lee
This paper explores the incorporation of topical support documents into a training set as a means of compensating for a shortage of positive training data in text categorization. To support topical representation, our method applies a simple transformation to documents, i.e., making new documents from existing positive documents by squaring a conventional term weight. The topical support documents thus created are expected not only to preserve the topic, but even to improve the topical representation by emphasizing terms with higher weights. Experiments with support vector machines showed the effectiveness on the RCV1 collection with small numbers of positive training documents. Our topical support representation achieved 52.01% and 8.83% improvements in micro-averaged F1 for 33 and 56 RCV1 Topic categories, with fewer than 100 and 300 positive training documents, respectively. Result analyses based on robustness indicate that topical support documents contribute to a steady and stable improvement.
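The squaring transformation itself is a one-liner; a minimal sketch over a term-weight dictionary follows (the renormalization step is an assumption, not stated in the abstract):

    def support_document(doc_weights):
        # Square each term weight and renormalize, so already-dominant
        # topical terms claim a larger share of the mass.
        squared = {t: w * w for t, w in doc_weights.items()}
        total = sum(squared.values()) or 1.0
        return {t: w / total for t, w in squared.items()}

    # e.g. weights {0.6, 0.3, 0.1} become roughly {0.78, 0.20, 0.02}:
    # the strongest topical terms are emphasized, as intended.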
Exploiting context to detect sensitive information in call center conversations BIBAFull-Text 1513-1514
  Tanveer A. Faruquie; Sumit Negi; Anup Chalamalla; L. Venkata Subramaniam
Protecting sensitive information while preserving the shareability and usability of data is becoming increasingly important. In call centers, a lot of customer-related sensitive information is stored in audio recordings. In this work, we address the problem of protecting sensitive information in audio recordings and speech transcripts. We present a semi-supervised method to model sensitive information as a directed graph. The effectiveness of this approach is demonstrated by applying it to the problem of detecting and locating credit card transactions in real-life conversations between agents and customers in a call center.
Multi-scale characterization of social network dynamics in the blogosphere BIBAFull-Text 1515-1516
  Munmun De Choudhury; Hari Sundaram; Ajita John; Dorée Duncan Seligmann
We have developed a computational framework to characterize social network dynamics in the blogosphere at the individual, group, and community levels. Such characterization could be used by corporations to help drive targeted advertising and to track the moods and sentiments of consumers. We tested our model on a widely read technology blog called Engadget. Our results show that communities transition between states of high and low entropy, depending on sentiments (positive / negative) about external happenings. We also propose an innovative method to establish the utility of the extracted knowledge by correlating the mined knowledge with external time series data (the stock market). Our validation results show that the characterized groups exhibit high stock market movement predictability (89%), and the removal of 'impactful' groups makes the community less resilient by lowering predictability (26%) and affecting the composition of the groups in the rest of the community.
Semi-supervised text categorization by active search BIBAFull-Text 1517-1518
  Zenglin Xu; Rong Jin; Kaizhu Huang; Michael R. Lyu; Irwin King
In automated text categorization, given a small number of labeled documents, it is very challenging, if not impossible, to build a reliable classifier that achieves high classification accuracy. To address this problem, a novel web-assisted text categorization framework is proposed in this paper. Important keywords are first automatically identified from the available labeled documents to form queries. Search engines are then utilized to retrieve from the Web a multitude of relevant documents, which are then exploited by a semi-supervised framework. To the best of our knowledge, this work is the first study of this kind. An extensive experimental study shows the encouraging results of the proposed text categorization framework: using Google as the web search engine, the proposed framework is able to reduce the classification error by 30% when compared with the state-of-the-art supervised text categorization method.
Clustering multi-way data via adaptive subspace iteration BIBAFull-Text 1519-1520
  Wei Peng; Tao Li; Bo Shao
Clustering multi-way data is a very important research topic due to the intrinsically rich structures in real-world datasets. In this paper, we propose a subspace clustering algorithm for multi-way data, called ASI-T (Adaptive Subspace Iteration on Tensor). ASI-T is a special version of High Order SVD (HOSVD), and it simultaneously performs subspace identification using 2DSVD and data clustering using K-Means. The experimental results on synthetic and real-world data demonstrate the effectiveness of ASI-T.
A coarse-grain grid-based subspace clustering method for online multi-dimensional data streams BIBAFull-Text 1521-1522
  Jae Woo Lee; Won Suk Lee
This paper proposes a subspace clustering algorithm which combines grid-based clustering with frequent itemset mining. Given a d-dimensional data stream, the on-going distribution statistics of its data elements in every one-dimensional data space are monitored by a list of fine-grain grid-cells called a sibling list, so that all the one-dimensional clusters are accurately identified. By tracing sets of frequently co-occurring one-dimensional clusters, it is possible to find coarse-grain dense rectangular spaces in higher dimensional subspaces. An ST-tree is introduced to continuously monitor dense rectangular spaces in all the subspaces of the d dimensions. Among these spaces, those whose densities are greater than or equal to a user-defined minimum support threshold Smin correspond to the final clusters.
A matrix-based approach for semi-supervised document co-clustering BIBAFull-Text 1523-1524
  Yanhua Chen; Lijun Wang; Ming Dong
In order to derive high-quality information from text, the field of text mining has advanced swiftly from simple document clustering to co-clustering of documents and words. However, document co-clustering without any prior knowledge or background information is a challenging problem. In this paper, we propose a Semi-Supervised Non-negative Matrix Factorization (SS-NMF) based framework for document co-clustering. Our method computes a new word-document matrix by incorporating user-provided constraints through distance metric learning. Using an iterative algorithm, we perform tri-factorization of the new matrix to infer the document and word clusters. Through extensive experiments conducted on publicly available data sets, we demonstrate the superior performance of SS-NMF for document co-clustering.
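In tri-factorization terms, the clustering step solves a problem of roughly the following shape, where \tilde{X} is the constraint-adjusted word-document matrix and the nonnegative factors indicate word clusters (F) and document clusters (G); the exact SS-NMF objective and update rules may differ:

    \min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \bigl\| \tilde{X} - F S G^{\top} \bigr\|_F^2

The middle factor S absorbs the association strengths between word clusters and document clusters.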
Categorizing blogger's interests based on short snippets of blog posts BIBAFull-Text 1525-1526
  Jiahui Liu; Larry Birnbaum; Bryan Pardo
Blogs have become an important medium for people to express opinions and share information on the web. Predicting the interests of bloggers can be beneficial for information retrieval and knowledge discovery in the blogosphere. In this paper, we propose a two-layer classification model to categorize the interests of bloggers based on sets of short snippets collected from their blog posts. Experiments were conducted on a list of bloggers collected from blog directories, with their snippets collected from Google Blog Search. The results show that the proposed method is robust to errors at the lower level and achieves satisfactory performance in categorizing bloggers' interests.