HCI Bibliography Home | HCI Journals | About TOIS | Journal Info | TOIS Journal Volumes | Detailed Records | RefWorks | EndNote | Hide Abstracts
TOIS Tables of Contents: 192021222324252627282930313233

ACM Transactions on Information Systems 29

Editors:Jamie Callan
Standard No:ISSN 1046-8188; HF S548.125 A33
Links:Table of Contents
  1. TOIS 2010-12 Volume 29 Issue 1
  2. TOIS 2011-04 Volume 29 Issue 2
  3. TOIS 2011-07 Volume 29 Issue 3
  4. TOIS 2011-12 Volume 29 Issue 4

TOIS 2010-12 Volume 29 Issue 1

Efficient set intersection for inverted indexing BIBAFull-Text 1
  J. Shane Culpepper; Alistair Moffat
Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed "integer" representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
Engineering basic algorithms of an in-memory text search engine BIBAFull-Text 2
  Frederik Transier; Peter Sanders
Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operation on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list.
   Based on this result, we present the algorithmic core of a full text data base that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.A similar system is now running in practice in each core of the distributed data base engine TREX of SAP.
Utilizing inter-passage and inter-document similarities for reranking search results BIBAFull-Text 3
  Eyal Krikon; Oren Kurland; Michael Bendersky
We present a novel language-model-based approach to reranking search results; that is, reordering the documents in an initially retrieved list so as to improve precision at top ranks. Our model integrates whole-document information with that induced from passages. Specifically, inter-passage, inter-document, and query-based similarities, which constitute a rich source of information, are combined in our model. Empirical evaluation shows that the precision-at-top-ranks performance of our model is substantially better than that of the initial ranking upon which reranking is performed. Furthermore, the performance is substantially better than that of a commonly used passage-based document ranking method that does not exploit inter-item similarities. Our model also generalizes and outperforms a recently proposed reranking method that utilizes inter-document similarities, but which does not exploit passage-based information. Finally, the model's performance is superior to that of a state-of-the-art pseudo-feedback-based retrieval approach.
Improving graph-walk-based similarity with reranking: Case studies for personal information management BIBAFull-Text 4
  Einat Minkov; William W. Cohen
Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can be all addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested to tune edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggest that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains.
Dependable filtering: Philosophy and realizations BIBAFull-Text 5
  Matteo Dell'Amico; Licia Capra
Digital content production and distribution has radically changed our business models. An unprecedented volume of supply is now on offer, whetted by the demand of millions of users from all over the world. Since users cannot be expected to browse through millions of different items to find what they might like, filtering has become a popular technique to connect supply and demand: trusted users are first identified, and their opinions are then used to create recommendations. In this domain, users' trustworthiness has been measured according to one of the following two criteria: taste similarity (i.e., "I trust those who agree with me"), or social ties (i.e., "I trust my friends, and the people that my friends trust"). The former criterion aims at identifying concordant users, but is subject to abuse by malicious behaviors. The latter aims at detecting well-intentioned users, but fails to capture the natural subjectivity of tastes. In this article, we propose a new definition of trusted recommenders, addressing those users that are both well-intentioned and concordant. Based on this characterisation, we propose a novel approach to information filtering that we call dependable filtering. We describe alternative algorithms realizing this approach, and demonstrate, by means of extensive performance evaluation on a variety of real large-scale datasets, the high degree of both accuracy and robustness they entail.
Extraction, characterization and utility of prototypical communication groups in the blogosphere BIBAFull-Text 6
  Munmun De Choudhury; Hari Sundaram; Ajita John; Doree Duncan Seligmann
This article analyzes communication within a set of individuals to extract the representative prototypical groups and provides a novel framework to establish the utility of such groups. Corporations may want to identify representative groups (which are indicative of the overall communication set) because it is easier to track the prototypical groups rather than the entire set. This can be useful for advertising, identifying "hot" spots of resource consumption as well as in mining representative moods or temperature of a community. Our framework has three parts: extraction, characterization, and utility of prototypical groups. First, we extract groups by developing features representing communication dynamics of the individuals. Second, to characterize the overall communication set, we identify a subset of groups within the community as the prototypical groups. Third, we justify the utility of these prototypical groups by using them as predictors of related external phenomena; specifically, stock market movement of technology companies and political polls of Presidential candidates in the 2008 U.S. elections.
   We have conducted extensive experiments on two popular blogs, Engadget and Huffington Post. We observe that the prototypical groups can predict stock market movement/political polls satisfactorily with mean error rate of 20.32%. Further, our method outperforms baseline methods based on alternative group extraction and prototypical group identification methods. We evaluate the quality of the extracted groups based on their conductance and coverage measures and develop metrics: predictivity and resilience to evaluate their ability to predict a related external time-series variable (stock market movement/political polls). This implies that communication dynamics of individuals are essential in extracting groups in a community, and the prototypical groups extracted by our method are meaningful in characterizing the overall communication sets.

TOIS 2011-04 Volume 29 Issue 2

Diagnostic Evaluation of Information Retrieval Models BIBAFull-Text 7
  Hui Fang; Tao Tao; Chengxiang Zhai
Developing effective retrieval models is a long-standing central challenge in information retrieval research. In order to develop more effective models, it is necessary to understand the deficiencies of the current retrieval models and the relative strengths of each of them. In this article, we propose a general methodology to analytically and experimentally diagnose the weaknesses of a retrieval function, which provides guidance on how to further improve its performance. Our methodology is motivated by the empirical observation that good retrieval performance is closely related to the use of various retrieval heuristics. We connect the weaknesses and strengths of a retrieval function with its implementations of these retrieval heuristics, and propose two strategies to check how well a retrieval function implements the desired retrieval heuristics. The first strategy is to formalize heuristics as constraints, and use constraint analysis to analytically check the implementation of retrieval heuristics. The second strategy is to define a set of relevance-preserving perturbations and perform diagnostic tests to empirically evaluate how well a retrieval function implements retrieval heuristics. Experiments show that both strategies are effective to identify the potential problems in implementations of the retrieval heuristics. The performance of retrieval functions can be improved after we fix these problems.
Concept-Based Information Retrieval Using Explicit Semantic Analysis BIBAFull-Text 8
  Ofer Egozi; Shaul Markovitch; Evgeniy Gabrilovich
Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term cooccurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used, hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
Improving Recommender Systems by Incorporating Social Contextual Information BIBAFull-Text 9
  Hao Ma; Tom Chao Zhou; Michael R. Lyu; Irwin King
Due to their potential commercial value and the associated great research challenges, recommender systems have been extensively studied by both academia and industry recently. However, the data sparsity problem of the involved user-item matrix seriously affects the recommendation quality. Many existing approaches to recommender systems cannot easily deal with users who have made very few ratings. In view of the exponential growth of information generated by online users, social contextual information analysis is becoming important for many Web applications. In this article, we propose a factor analysis approach based on probabilistic matrix factorization to alleviate the data sparsity and poor prediction accuracy problems by incorporating social contextual information, such as social networks and social tags. The complexity analysis indicates that our approach can be applied to very large datasets since it scales linearly with the number of observations. Moreover, the experimental results show that our method performs much better than the state-of-the-art approaches, especially in the circumstance that users have made few ratings.
Contextual Video Recommendation by Multimodal Relevance and User Feedback BIBAFull-Text 10
  Tao Mei; Bo Yang; Xian-Sheng Hua; Shipeng Li
With Internet delivery of video content surging to an unprecedented level, video recommendation, which suggests relevant videos to targeted users according to their historical and current viewings or preferences, has become one of most pervasive online video services. This article presents a novel contextual video recommendation system, called VideoReach, based on multimodal content relevance and user feedback. We consider an online video usually consists of different modalities (i.e., visual and audio track, as well as associated texts such as query, keywords, and surrounding text). Therefore, the recommended videos should be relevant to current viewing in terms of multimodal relevance. We also consider that different parts of videos are with different degrees of interest to a user, as well as different features and modalities have different contributions to the overall relevance. As a result, the recommended videos should also be relevant to current users in terms of user feedback (i.e., user click-through). We then design a unified framework for VideoReach which can seamlessly integrate both multimodal relevance and user feedback by relevance feedback and attention fusion. VideoReach represents one of the first attempts toward contextual recommendation driven by video content and user click-through, without assuming a sufficient collection of user profiles available. We conducted experiments over a large-scale real-world video data and reported the effectiveness of VideoReach.
Effects of Usage-Based Feedback on Video Retrieval: A Simulation-Based Study BIBAFull-Text 11
  David Vallet; Frank Hopfgartner; Joemon M. Jose; Pablo Castells
We present a model for exploiting community-based usage information for video retrieval, where implicit usage information from past users is exploited in order to provide enhanced assistance in video retrieval tasks, and alleviate the effects of the semantic gap problem. We propose a graph-based model for all types of implicit and explicit feedback, in which the relevant usage information is represented. Our model is designed to capture the complex interactions of a user with an interactive video retrieval system, including the representation of sequences of user-system interaction during a search session. Building upon this model, four recommendation strategies are defined and evaluated. An evaluation strategy is proposed based on simulated user actions, which enables the evaluation of our recommendation strategies over a usage information pool obtained from 24 users performing four different TRECVid tasks. Furthermore, the proposed simulation approach is used to simulate usage information pools with different characteristics, with which the recommendation approaches are further evaluated on a larger set of tasks, and their performance is studied with respect to the scalability and quality of the available implicit information.
Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents BIBAFull-Text 12
  Bingjun Sun; Prasenjit Mitra; C. Lee Giles; Karl T. Mueller
End-users utilize chemical search engines to search for chemical formulae and chemical names. Chemical search engines identify and index chemical formulae and chemical names appearing in text documents to support efficient search and retrieval in the future. Identifying chemical formulae and chemical names in text automatically has been a hard problem that has met with varying degrees of success in the past. We propose algorithms for chemical formula and chemical name tagging using Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy than existing (published) methods. After chemical entities have been identified in text documents, they must be indexed. In order to support user-provided search queries that require a partial match between the chemical name segment used as a keyword or a partial chemical formula, all possible (or a significant number of) subformulae of formulae that appear in any document and all possible subterms (e.g., "methyl") of chemical names (e.g., "methylethyl ketone") must be indexed. Indexing all possible subformulae and subterms results in an exponential increase in the storage and memory requirements as well as the time taken to process the indices. We propose techniques to prune the indices significantly without reducing the quality of the returned results significantly. Finally, we propose multiple query semantics to allow users to pose different types of partial search queries for chemical entities. We demonstrate empirically that our search engines improve the relevance of the returned results for search queries involving chemical entities.

TOIS 2011-07 Volume 29 Issue 3

Content redundancy in YouTube and its application to video tagging BIBAFull-Text 13
  Jose San Pedro; Stefan Siersdorfer; Mark Sanderson
The emergence of large-scale social Web communities has enabled users to share online vast amounts of multimedia content. An analysis of YouTube reveals a high amount of redundancy, in the form of videos with overlapping or duplicated content. We use robust content-based video analysis techniques to detect overlapping sequences between videos. Based on the output of these techniques, we present an in-depth study of duplication and content overlap in YouTube, and analyze various dependencies between content overlap and meta data such as video titles, views, video ratings, and tags. As an application, we show that content-based links provide useful information for generating new tag assignments. We propose different tag propagation methods for automatically obtaining richer video annotations. Experiments on video clustering and classification as well as a user evaluation demonstrate the viability of our approach.
Exploring the music similarity space on the web BIBAFull-Text 14
  Markus Schedl; Tim Pohle; Peter Knees; Gerhard Widmer
This article comprehensively addresses the problem of similarity measurement between music artists via text-based features extracted from Web pages. To this end, we present a thorough evaluation of different term-weighting strategies, normalization methods, aggregation functions, and similarity measurement techniques. In large-scale genre classification experiments carried out on real-world artist collections, we analyze several thousand combinations of settings/parameters that influence the similarity calculation process, and investigate in which way they impact the quality of the similarity estimates. Accurate similarity measures for music are vital for many applications, such as automated playlist generation, music recommender systems, music information systems, or intelligent user interfaces to access music collections by means beyond text-based browsing. Therefore, by exhaustively analyzing the potential of text-based features derived from artist-related Web pages, this article constitutes an important contribution to context-based music information research.
Toward a semantic granularity model for domain-specific information retrieval BIBAFull-Text 15
  Xin Yan; Raymond Y. K. Lau; Dawei Song; Xue Li; Jian Ma
Both similarity-based and popularity-based document ranking functions have been successfully applied to information retrieval (IR) in general. However, the dimension of semantic granularity also should be considered for effective retrieval. In this article, we propose a semantic granularity-based IR model that takes into account the three dimensions, namely similarity, popularity, and semantic granularity, to improve domain-specific search. In particular, a concept-based computational model is developed to estimate the semantic granularity of documents with reference to a domain ontology. Semantic granularity refers to the levels of semantic detail carried by an information item. The results of our benchmark experiments confirm that the proposed semantic granularity based IR model performs significantly better than the similarity-based baseline in both a bio-medical and an agricultural domain. In addition, a series of user-oriented studies reveal that the proposed document ranking functions resemble the implicit ranking functions exercised by humans. The perceived relevance of the documents delivered by the granularity-based IR system is significantly higher than that produced by a popular search engine for a number of domain-specific search tasks. To the best of our knowledge, this is the first study regarding the application of semantic granularity to enhance domain-specific IR.
Fast construction of the HYB index BIBAFull-Text 16
  Hannah Bast; Marjan Celikik
As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, faceted search, error-tolerant search, database-style select and join, and semantic search. In this work we show that HYB can be constructed at least as fast as INV, and often up to twice as fast. This is because HYB, by its nature, requires only a half-inversion of the data and allows an efficient in-place instead of the traditional merge-based index construction. We also pay particular attention to the cache efficiency of the in-memory posting accumulation, an issue that has not been addressed in previous work, and show that our simple multilevel posting accumulation scheme yields much fewer cache misses compared to related approaches. Finally, we show that HYB supports fast dynamic index updates more easily than INV.

TOIS 2011-12 Volume 29 Issue 4

Upper-bound approximations for dynamic pruning BIBAFull-Text 17
  Craig Macdonald; Iadh Ounis; Nicola Tonellotto
Dynamic pruning strategies for information retrieval systems can increase querying efficiency without decreasing effectiveness by using upper bounds to safely omit scoring documents that are unlikely to make the final retrieved set. Often, such upper bounds are pre-calculated at indexing time for a given weighting model. However, this precludes changing, adapting or training the weighting model without recalculating the upper bounds. Instead, upper bounds should be approximated at querying time from various statistics of each term to allow on-the-fly adaptation of the applied retrieval strategy. This article, by using uniform notation, formulates the problem of determining a term upper-bound given a weighting model and discusses the limitations of existing approximations. Moreover, we propose an upper-bound approximation using a constrained nonlinear maximization problem. We prove that our proposed upper-bound approximation does not impact the retrieval effectiveness of several modern weighting models from various different families. We also show the applicability of the approximation for the Markov Random Field proximity model. Finally, we empirically examine how the accuracy of the upper-bound approximation impacts the number of postings scored and the resulting efficiency in the context of several large Web test collections.
Ranking function adaptation with boosting trees BIBAFull-Text 18
  Keke Chen; Jing Bai; Zhaohui Zheng
Machine-learned ranking functions have shown successes in Web search engines. With the increasing demands on developing effective ranking functions for different search domains, we have seen a big bottleneck, that is, the problem of insufficient labeled training data, which has significantly slowed the development and deployment of machine-learned ranking functions for different domains. There are two possible approaches to address this problem: (1) combining labeled training data from similar domains with the small target-domain labeled data for training or (2) using pairwise preference data extracted from user clickthrough log for the target domain for training. In this article, we propose a new approach called tree-based ranking function adaptation (Trada) to effectively utilize these data sources for training cross-domain ranking functions. Tree adaptation assumes that ranking functions are trained with the Stochastic Gradient Boosting Trees method -- a gradient boosting method on regression trees. It takes such a ranking function from one domain and tunes its tree-based structure with a small amount of training data from the target domain. The unique features include (1) automatic identification of the part of the model that needs adjustment for the new domain and (2) appropriate weighing of training examples considering both local and global distributions. Based on a novel pairwise loss function that we developed for pairwise learning, the basic tree adaptation algorithm is also extended (Pairwise Trada) to utilize the pairwise preference data from the target domain to further improve the effectiveness of adaptation. Experiments are performed on real datasets to show that tree adaptation can provide better-quality ranking functions for a new domain than other methods.
GRAS: An effective and efficient stemming algorithm for information retrieval BIBAFull-Text 19
  Jiaul H. Paik; Mandar Mitra; Swapan K. Parui; Kalervo Järvelin
A novel graph-based language-independent stemming algorithm suitable for information retrieval is proposed in this article. The main features of the algorithm are retrieval effectiveness, generality, and computational efficiency. We test our approach on seven languages (using collections from the TREC, CLEF, and FIRE evaluation platforms) of varying morphological complexity. Significant performance improvement over plain word-based retrieval, three other language-independent morphological normalizers, as well as rule-based stemmers is demonstrated.
Recommendation systems with complex constraints: A course recommendation perspective BIBAFull-Text 20
  Aditya Parameswaran; Petros Venetis; Hector Garcia-Molina
We study the problem of making recommendations when the objects to be recommended must also satisfy constraints or requirements. In particular, we focus on course recommendations: the courses taken by a student must satisfy requirements (e.g., take two out of a set of five math courses) in order for the student to graduate. Our work is done in the context of the CourseRank system, used by students to plan their academic program at Stanford University. Our goal is to recommend to these students courses that not only help satisfy constraints, but that are also desirable (e.g., popular or taken by similar students). We develop increasingly expressive models for course requirements, and present a variety of schemes for both checking if the requirements are satisfied, and for making recommendations that take into account the requirements. We show that some types of requirements are inherently expensive to check, and we present exact, as well as heuristic techniques, for those cases. Although our work is specific to course requirements, it provides insights into the design of recommendation systems in the presence of complex constraints found in other applications.
Correlation-based retrieval for heavily changed near-duplicate videos BIBAFull-Text 21
  Jiajun Liu; Zi Huang; Heng Tao Shen; Bin Cui
The unprecedented and ever-growing number of Web videos nowadays leads to the massive existence of near-duplicate videos. Very often, some near-duplicate videos exhibit great content changes, while the user perceives little information change, for example, color features change significantly when transforming a color video with a blue filter. These feature changes contribute to low-level video similarity computations, making conventional similarity-based near-duplicate video retrieval techniques incapable of accurately capturing the implicit relationship between two near-duplicate videos with fairly large content modifications. In this paper, we introduce a new dimension for near-duplicate video retrieval. Different from existing near-duplicate video retrieval approaches which are based on video-content similarity, we explore the correlation between two videos. The intuition is that near-duplicate videos should preserve strong information correlation in spite of intensive content changes. More effective retrieval with stronger tolerance is achieved by replacing video-content similarity measures with information correlation analysis. Theoretical justification and experimental results prove the effectiveness of correlation-based near-duplicate retrieval.
Query modeling for entity search based on terms, categories, and examples BIBAFull-Text 22
  Krisztian Balog; Marc Bron; Maarten de Rijke
Users often search for entities instead of documents, and in this setting, are willing to provide extra input, in addition to a series of query terms, such as category information and example entities. We propose a general probabilistic framework for entity search to evaluate and provide insights in the many ways of using these types of input for query modeling. We focus on the use of category information and show the advantage of a category-based representation over a term-based representation, and also demonstrate the effectiveness of category-based expansion using example entities. Our best performing model shows very competitive performance on the INEX-XER entity ranking and list completion tasks.