HCI Bibliography Home | HCI Journals | About TOIS | Journal Info | TOIS Journal Volumes | Detailed Records | RefWorks | EndNote | Hide Abstracts
TOIS Tables of Contents: 18192021222324252627282930313233

ACM Transactions on Information Systems 28

Editors:Jamie Callan
Standard No:ISSN 1046-8188; HF S548.125 A33
Links:Table of Contents
  1. TOIS 2010 Volume 28 Issue 1
  2. TOIS 2010 Volume 28 Issue 2
  3. TOIS 2010 Volume 28 Issue 3
  4. TOIS 2010 Volume 28 Issue 4

TOIS 2010 Volume 28 Issue 1

Probabilistic static pruning of inverted files BIBAFull-Text 1
  Roi Blanco; Alvaro Barreiro
Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval performance, according to some criterion. Generally, pruning criteria are derived from term weighting functions, which assign weights to terms according to their contribution to a document's contents. Usually, document-term occurrences that are assigned a low weight are ruled out from the index. The main assumption is that those entries contribute little to the document content.
   We present a novel pruning technique that is based on a probabilistic model of IR. We employ the Probability Ranking Principle as a decision criterion over which posting list entries are to be pruned. The proposed approach requires the estimation of three probabilities, combining them in such a way that we gather all the necessary information to apply the aforementioned criterion.
   We evaluate our proposed pruning technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is proposing a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.
Statistical lattice-based spoken document retrieval BIBAFull-Text 2
  Tee Kiah Chia; Khe Chai Sim; Haizhou Li; Hwee Tou Ng
Recent research efforts on spoken document retrieval have tried to overcome the low quality of 1-best automatic speech recognition transcripts, especially in the case of conversational speech, by using statistics derived from speech lattices containing multiple transcription hypotheses as output by a speech recognizer. We present a method for lattice-based spoken document retrieval based on a statistical n-gram modeling approach to information retrieval. In this statistical lattice-based retrieval (SLBR) method, a smoothed statistical model is estimated for each document from the expected counts of words given the information in a lattice, and the relevance of each document to a query is measured as a probability under such a model. We investigate the efficacy of our method under various parameter settings of the speech recognition and lattice processing engines, using the Fisher English Corpus of conversational telephone speech. Experimental results show that our method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattice-based retrieval method based on the Okapi BM25 model.
Semantic clustering of XML documents BIBAFull-Text 3
  Andrea Tagarelli; Sergio Greco
Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
Learning author-topic models from text corpora BIBAFull-Text 4
  Michal Rosen-Zvi; Chaitanya Chemudugunta; Thomas Griffiths; Padhraic Smyth; Mark Steyvers
We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing for example, generalizations of the notion of an author, are also briefly discussed.

TOIS 2010 Volume 28 Issue 2

Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load BIBAFull-Text 5
  Diego Puppin; Fabrizio Silvestri; Raffaele Perego; Ricardo Baeza-Yates
This article introduces an architecture for a document-partitioned search engine, based on a novel approach combining collection selection and load balancing, called load-driven routing. By exploiting the query-vector document model, and the incremental caching technique, our architecture can compute very high quality results for any query, with only a fraction of the computational load used in a typical document-partitioned architecture. By trading off a small fraction of the results, our technique allows us to strongly reduce the computing pressure to a search engine back-end; we are able to retrieve more than 2/3 of the top-5 results for a given query with only 10% the computing load needed by a configuration where the query is processed by each index partition. Alternatively, we can slightly increase the load up to 25% to improve precision and get more than 80% of the top-5 results. In fact, the flexibility of our system allows a wide range of different configurations, so as to easily respond to different needs in result quality or restrictions in computing power. More important, the system configuration can be adjusted dynamically in order to fit unexpected query peaks or unpredictable failures. This article wraps up some recent works by the authors, showing the results obtained by tests conducted on 6 million documents, 2,800,000 queries and real query cost timing as measured on an actual index.
Exploiting query logs for cross-lingual query suggestions BIBAFull-Text 6
  Wei Gao; Cheng Niu; Jian-Yun Nie; Ming Zhou; Kam-Fai Wong; Hsiao-Wuen Hon
Query suggestion aims to suggest relevant queries for a given query, which helps users better specify their information needs. Previous work on query suggestion has been limited to the same language. In this article, we extend it to cross-lingual query suggestion (CLQS): for a query in one language, we suggest similar or relevant queries in other languages. This is very important to the scenarios of cross-language information retrieval (CLIR) and other related cross-lingual applications. Instead of relying on existing query translation technologies for CLQS, we present an effective means to map the input query of one language to queries of the other language in the query log. Important monolingual and cross-lingual information such as word translation relations and word co-occurrence statistics, and so on, are used to estimate the cross-lingual query similarity with a discriminative model. Benchmarks show that the resulting CLQS system significantly outperforms a baseline system that uses dictionary-based query translation. Besides, we evaluate CLQS with French-English and Chinese-English CLIR tasks on TREC-6 and NTCIR-4 collections, respectively. The CLIR experiments using typical retrieval models demonstrate that the CLQS-based approach has significantly higher effectiveness than several traditional query translation methods. We find that when combined with pseudo-relevance feedback, the effectiveness of CLIR using CLQS is enhanced for different pairs of languages.
Efficient k-nearest neighbor searching in nonordered discrete data spaces BIBAFull-Text 7
  Dashiell Kolbe; Qiang Zhu; Sakti Pramanik
Numerous techniques have been proposed in the past for supporting efficient k-nearest neighbor (k-NN) queries in continuous data spaces. Limited work has been reported in the literature for k-NN queries in a nonordered discrete data space (NDDS). Performing k-NN queries in an NDDS raises new challenges. The Hamming distance is usually used to measure the distance between two vectors (objects) in an NDDS. Due to the coarse granularity of the Hamming distance, a k-NN query in an NDDS may lead to a high degree of nondeterminism for the query result. We propose a new distance measure, called Granularity-Enhanced Hamming (GEH) distance, which effectively reduces the number of candidate solutions for a query. We have also implemented k-NN queries using multidimensional database indexing in NDDSs. Further, we use the properties of our multidimensional NDDS index to derive the probability of encountering valid neighbors within specific regions of the index. This probability is used to develop a new search ordering heuristic. Our experiments on synthetic and genomic data sets demonstrate that our index-based k-NN algorithm is efficient in finding k-NNs in both uniform and nonuniform data sets in NDDSs and that our heuristics are effective in improving the performance of such queries.
Exploiting neighborhood knowledge for single document summarization and keyphrase extraction BIBAFull-Text 8
  Xiaojun Wan; Jianguo Xiao
Document summarization and keyphrase extraction are two related tasks in the IR and NLP fields, and both of them aim at extracting condensed representations from a single text document. Existing methods for single document summarization and keyphrase extraction usually make use of only the information contained in the specified document. This article proposes using a small number of nearest neighbor documents to improve document summarization and keyphrase extraction for the specified document, under the assumption that the neighbor documents could provide additional knowledge and more clues. The specified document is expanded to a small document set by adding a few neighbor documents close to the document, and the graph-based ranking algorithm is then applied on the expanded document set to make use of both the local information in the specified document and the global information in the neighbor documents. Experimental results on the Document Understanding Conference (DUC) benchmark datasets demonstrate the effectiveness and robustness of our proposed approaches. The cross-document sentence relationships in the expanded document set are validated to be beneficial to single document summarization, and the word cooccurrence relationships in the neighbor documents are validated to be very helpful to single document keyphrase extraction.
Effects of position and number of relevant documents retrieved on users' evaluations of system performance BIBAFull-Text 9
  Diane Kelly; Xin Fu; Chirag Shah
Information retrieval research has demonstrated that system performance does not always correlate positively with user performance, and that users often assign positive evaluation scores to search systems even when they are unable to complete tasks successfully. This research investigated the relationship between objective measures of system performance and users' perceptions of that performance. In this study, subjects evaluated the performance of four search systems whose search results were manipulated systematically to produce different orderings and numbers of relevant documents. Three laboratory studies were conducted with a total of eighty-one subjects. The first two studies investigated the effect of the order of five relevant and five nonrelevant documents in a search results list containing ten results on subjects' evaluations. The third study investigated the effect of varying the number of relevant documents in a search results list containing ten results on subjects' evaluations. Results demonstrate linear relationships between subjects' evaluations and the position of relevant documents in a search results list and the total number of relevant documents retrieved. Of the two, number of relevant documents retrieved was a stronger predictor of subjects' evaluation ratings and resulted in subjects using a greater range of evaluation scores.

TOIS 2010 Volume 28 Issue 3

Dynamic lightweight text compression BIBAFull-Text 10
  Nieves Brisaboa; Antonio Fariña; Gonzalo Navarro; José Paramá
We address the problem of adaptive compression of natural language text, considering the case where the receiver is much less powerful than the sender, as in mobile applications. Our techniques achieve compression ratios around 32% and require very little effort from the receiver. Furthermore, the receiver is not only lighter, but it can also search the compressed text with less work than that necessary to decompress it. This is a novelty in two senses: it breaks the usual compressor/decompressor symmetry typical of adaptive schemes, and it contradicts the long-standing assumption that only semistatic codes could be searched more efficiently than the uncompressed text. Our novel compression methods are preferable in several aspects over the existing adaptive and semistatic compressors for natural language texts.
Arnoldi versus GMRES for computing PageRank: A theoretical contribution to google's PageRank problem BIBAFull-Text 11
  Gang Wu; Yimin Wei
PageRank is one of the most important ranking techniques used in today's search engines. A recent very interesting research track focuses on exploiting efficient numerical methods to speed up the computation of PageRank, among which the Arnoldi-type algorithm and the GMRES algorithm are competitive candidates. In essence, the former deals with the PageRank problem from an eigen problem, while the latter from a linear system, point of view. However, there is little known about the relations between the two approaches for PageRank. In this article, we focus on a theoretical and numerical comparison of the two approaches. Numerical experiments illustrate the effectiveness of our theoretical results.
Learning with click graph for query intent classification BIBAFull-Text 12
  Xiao Li; Ye-Yi Wang; Dou Shen; Alex Acero
Topical query classification, as one step toward understanding users' search intent, is gaining increasing attention in information retrieval. Previous works on this subject primarily focused on enrichment of query features, for example, by augmenting queries with search engine results. In this work, we investigate a completely orthogonal approach -- instead of improving feature representation, we aim at drastically increasing the amount of training data. To this end, we propose two semisupervised learning methods that exploit user click-through data. In one approach, we infer class memberships of unlabeled queries from those of labeled ones according to their proximities in a click graph; and then use these automatically labeled queries to train classifiers using query terms as features. In a second approach, click graph learning and query classifier training are conducted jointly with an integrated objective. Our methods are evaluated in two applications, product intent and job intent classification. In both cases, we expand the training data by over two orders of magnitude, leading to significant improvements in classification performance. An additional finding is that with a large amount of training data obtained in this fashion, a classifier based on simple query term features can outperform those using state-of-the-art, augmented features.
Using topic themes for multi-document summarization BIBAFull-Text 13
  Sanda Harabagiu; Finley Lacatusu
The problem of using topic representations for multidocument summarization (MDS) has received considerable attention recently. Several topic representations have been employed for producing informative and coherent summaries. In this article, we describe five previously known topic representations and introduce two novel representations of topics based on topic themes. We present eight different methods of generating multidocument summaries and evaluate each of these methods on a large set of topics used in past DUC workshops. Our evaluation results show a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.
Combining relations for information extraction from free text BIBAFull-Text 14
  Mstislav Maslennikov; Tat-Seng Chua
Relations between entities of the same semantic type tend to be sparse in free texts. Therefore, combining relations is the key to effective information extraction (IE) on free text datasets with a small set of training samples. Previous approaches to bootstrapping for IE used different types of relations, such as dependency or co-occurrence, and faced the problems of paraphrasing and misalignment of instances. To cope with these problems, we propose a framework that integrates several types of relations. After extracting candidate entities, our framework evaluates relations between them at the phrasal, dependency, semantic frame, and discourse levels. For each of these levels, we build a classifier that outputs a score for relation instances. In order to integrate these scores, we propose three strategies: (1) integrate evaluation scores from each relation classifier; (2) incorporate the elimination of negatively labeled instances in a previous strategy; and (3) add cascading of extracted relations into strategy (2). Our framework improves the state-of-art results for supervised systems by 8%, 15%, 3%, and 5% on MUC4 (terrorism); MUC6 (management succession); ACE RDC 2003 (news, general types); and ACE RDC 2003 (news, specific types) domains respectively.
STEvent: Spatio-temporal event model for social network discovery BIBAFull-Text 15
  Hady W. Lauw; Ee-Peng Lim; Hweehwa Pang; Teck-Tim Tan
Spatio-temporal data concerning the movement of individuals over space and time contains latent information on the associations among these individuals. Sources of spatio-temporal data include usage logs of mobile and Internet technologies. This article defines a spatio-temporal event by the co-occurrences among individuals that indicate potential associations among them. Each spatio-temporal event is assigned a weight based on the precision and uniqueness of the event. By aggregating the weights of events relating two individuals, we can determine the strength of association between them. We conduct extensive experimentation to investigate both the efficacy of the proposed model as well as the computational complexity of the proposed algorithms. Experimental results on three real-life spatio-temporal datasets cross-validate each other, lending greater confidence on the reliability of our proposed model.
Probabilistic models for answer-ranking in multilingual question-answering BIBAFull-Text 16
  Jeongwoo Ko; Luo Si; Eric Nyberg; Teruko Mitamura
This article presents two probabilistic models for answering ranking in the multilingual question-answering (QA) task, which finds exact answers to a natural language question written in different languages. Although some probabilistic methods have been utilized in traditional monolingual answer-ranking, limited prior research has been conducted for answer-ranking in multilingual question-answering with formal methods. This article first describes a probabilistic model that predicts the probabilities of correctness for individual answers in an independent way. It then proposes a novel probabilistic method to jointly predict the correctness of answers by considering both the correctness of individual answers as well as their correlations. As far as we know, this is the first probabilistic framework that proposes to model the correctness and correlation of answer candidates in multilingual question-answering and provide a novel approach to design a flexible and extensible system architecture for answer selection in multilingual QA. An extensive set of experiments were conducted to show the effectiveness of the proposed probabilistic methods in English-to-Chinese and English-to-Japanese cross-lingual QA, as well as English, Chinese, and Japanese monolingual QA using TREC and NTCIR questions.

TOIS 2010 Volume 28 Issue 4

Clustering-based incremental web crawling BIBAFull-Text 17
  Qingzhao Tan; Prasenjit Mitra
When crawling resources, for example, number of machines, crawl-time, and so on, are limited, so a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated to their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate to their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster, and depending upon whether a significant number of these Web pages have changed in the last crawl cycle, it decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end-user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
PageRank without hyperlinks: Structural reranking using links induced by language models BIBAFull-Text 18
  Oren Kurland; Lillian Lee
The ad hoc retrieval task is to find documents in a corpus that are relevant to a query. Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural reranking approach to ad-hoc retrieval that applies to settings with no hyperlink information. We reorder the documents in an initially retrieved set by exploiting implicit asymmetric relationships among them. We consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another. We study a number of reranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks; the best resultant performance is comparable, and often superior, to that of a state-of-the-art pseudo-feedback-based retrieval approach. In addition, we demonstrate the merits of our language-model-based method for inducing interdocument links by comparing it to previously suggested notions of interdocument similarities (e.g., cosines within the vector-space model).
   We also show that our methods for inducing centrality are substantially more effective than approaches based on document-specific characteristics, several of which are novel to this study.
An information-theoretic framework for semantic-multimedia retrieval BIBAFull-Text 19
  João Magalhães; Stefan Rüger
This article is set in the context of searching text and image repositories by keyword. We develop a unified probabilistic framework for text, image, and combined text and image retrieval that is based on the detection of keywords (concepts) using automated image annotation technology. Our framework is deeply rooted in information theory and lends itself to use with other media types.
   We estimate a statistical model in a multimodal feature space for each possible query keyword. The key element of our framework is to identify feature space transformations that make them comparable in complexity and density. We select the optimal multimodal feature space with a minimum description length criterion from a set of candidate feature spaces that are computed with the average-mutual-information criterion for the text part and hierarchical expectation maximization for the visual part of the data. We evaluate our approach in three retrieval experiments (only text retrieval, only image retrieval, and text combined with image retrieval), verify the framework's low computational complexity, and compare with existing state-of-the-art ad-hoc models.
A similarity measure for indefinite rankings BIBAFull-Text 20
  William Webber; Alistair Moffat; Justin Zobel
Ranked lists are encountered in research and daily life and it is often of interest to compare these lists even when they are incomplete or have only some members in common. An example is document rankings returned for the same query by different search engines. A measure of the similarity between incomplete rankings should handle nonconjointness, weight high ranks more heavily than low, and be monotonic with increasing depth of evaluation; but no measure satisfying all these criteria currently exists. In this article, we propose a new measure having these qualities, namely rank-biased overlap (RBO). The RBO measure is based on a simple probabilistic user model. It provides monotonicity by calculating, at a given depth of evaluation, a base score that is non-decreasing with additional evaluation, and a maximum score that is nonincreasing. An extrapolated score can be calculated between these bounds if a point estimate is required. RBO has a parameter which determines the strength of the weighting to top ranks. We extend RBO to handle tied ranks and rankings of different lengths. Finally, we give examples of the use of the measure in comparing the results produced by public search engines and in assessing retrieval systems in the laboratory.
The task-dependent effect of tags and ratings on social media access BIBAFull-Text 21
  Maarten Clements; Arjen P. De Vries; Marcel J. T. Reinders
Recently, online social networks have emerged that allow people to share their multimedia files, retrieve interesting content, and discover like-minded people. These systems often provide the possibility to annotate the content with tags and ratings.
   Using a random walk through the social annotation graph, we have combined these annotations into a retrieval model that effectively balances the personal preferences and opinions of like-minded users into a single relevance ranking for either content, tags, or people. We use this model to identify the influence of different annotation methods and system design aspects on common ranking tasks in social content systems.
   Our results show that a combination of rating and tagging information can improve tasks like search and recommendation. The optimal influence of both sources on the ranking is highly dependent on the retrieval task and system design. Results on content search and tag suggestion indicate that the profile created by a user's annotations can be used effectively to adapt the ranking to personal preferences. The random walk reduces sparsity problems by smoothly integrating indirectly related concepts in the relevance ranking, which is especially valuable for cold-start users or individual tagging systems like YouTube and Flickr.
Mining near-duplicate graph for cluster-based reranking of web video search results BIBAFull-Text 22
  Zi Huang; Bo Hu; Hong Cheng; Heng Tao Shen; Hongyan Liu; Xiaofang Zhou
Recently, video search reranking has been an effective mechanism to improve the initial text-based ranking list by incorporating visual consistency among the result videos. While existing methods attempt to rerank all the individual result videos, they suffer from several drawbacks. In this article, we propose a new video reranking paradigm called cluster-based video reranking (CVR). The idea is to first construct a video near-duplicate graph representing the visual similarity relationship among videos, followed by identifying the near-duplicate clusters from the video near-duplicate graph, then ranking the obtained near-duplicate clusters based on cluster properties and intercluster links, and finally for each ranked cluster, a representative video is selected and returned. Compared to existing methods, the new CVR ranks clusters and exhibits several advantages, including superior reranking by utilizing more reliable cluster properties, fast reranking on a small number of clusters, diverse and representative results. Particularly, we formulate the near-duplicate cluster identification as a novel maximally cohesive subgraph mining problem. By leveraging the designed cluster scoring properties indicating the cluster's importance and quality, random walk is applied over the near-duplicate cluster graph to rank clusters. An extensive evaluation study proves the novelty and superiority of our proposals over existing methods.