
ACM Transactions on Information Systems 23

Editors: W. Bruce Croft
Dates: 2005
Volume: 23
Publisher: ACM
Standard No: ISSN 1046-8188; HF S548.125 A33
Papers: 16
Links: Table of Contents
  1. TOIS 2005 Volume 23 Issue 1
  2. TOIS 2005 Volume 23 Issue 2
  3. TOIS 2005 Volume 23 Issue 3
  4. TOIS 2005 Volume 23 Issue 4

TOIS 2005 Volume 23 Issue 1

Introduction to genomic information retrieval BIBFull-Text 1-2
  Hugh E. Williams
An efficient normalized maximum likelihood algorithm for DNA sequence compression BIBAFull-Text 3-34
  Gergely Korodi; Ioan Tabus
This article presents an efficient algorithm for DNA sequence compression, which achieves the best compression ratios reported over a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximate matching blocks, (2) encoding by a first-order context coder, and (3) representing strings in the clear, to make efficient use of the redundancy sources in DNA data while maintaining fast execution times. One of the main algorithmic features is the constraint that matching blocks include reasonably long contiguous matches, which not only significantly reduces the search time but also allows the NML model to be modified to exploit the constraint and obtain smaller code lengths. The algorithm handles the changing statistics of DNA data adaptively, and by predictively encoding the matching pointers it succeeds in compressing long approximate matches. In addition to comparisons with previous DNA encoding methods, we present compression results for the recently published human genome data.
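As a rough illustration of the normalized maximum likelihood idea behind such an encoder (a sketch of the general NML principle, not the authors' discrete-regression model; the Bernoulli mismatch model and the block sizes are assumptions), the snippet below computes the NML code length for signalling which positions of an approximate-match block disagree with its reference block.

    from math import comb, log2

    def nml_code_length(n, k):
        """NML code length in bits for a length-n binary mismatch mask with k ones,
        under the Bernoulli model class (an illustrative stand-in only)."""
        def max_likelihood(j):
            # maximized Bernoulli likelihood of any mask with j ones out of n
            if j == 0 or j == n:
                return 1.0
            p = j / n
            return (p ** j) * ((1 - p) ** (n - j))
        # normalizer: maximized likelihood summed over every possible mask
        normalizer = sum(comb(n, j) * max_likelihood(j) for j in range(n + 1))
        return -log2(max_likelihood(k)) + log2(normalizer)

    # e.g. the cost of describing 3 substitutions inside a 48-base matching block
    print(round(nml_code_length(48, 3), 2), "bits")

The real encoder additionally constrains matches to contain long contiguous runs and mixes in a first-order context coder, which this toy calculation ignores.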
A methodology for analyzing SAGE libraries for cancer profiling BIBAFull-Text 35-60
  Jorg Sander; Raymond T. Ng; Monica C. Sleumer; Man Saint Yuen; Steven J. Jones
Serial Analysis of Gene Expression (SAGE) has proven to be an important alternative to microarray techniques for global profiling of mRNA populations. We have developed preprocessing methodologies to address noise caused by sequencing error, normalization methodologies to account for libraries sampled at different depths, and missing-tag imputation methodologies to aid in the analysis of poorly sampled SAGE libraries. We also performed subspace selection with the Wilcoxon rank sum test to exclude tags that have similar expression levels regardless of source. Using these methodologies, we clustered 88 SAGE libraries derived from cancerous and normal tissues, as well as cell line material, with the OPTICS algorithm. Our results produced eight dense clusters representing ovarian cancer cell line, brain cancer cell line, brain cancer bulk tissue, prostate tissue, pancreatic cancer, breast cancer cell line, normal brain, and normal breast bulk tissue. The ovarian cancer and brain cancer cell lines clustered closely together, leading to a further investigation of possible associations between these two cancer types. We also investigated the utility of gene expression data in classifying normal versus cancerous tissues. Our results indicate that brain and breast cancer libraries have strong identities, allowing robust discrimination from their normal counterparts. However, the SAGE expression data provide poor predictive accuracy in discriminating between prostate and ovarian cancers and their respective normal tissues.
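The filtering and clustering steps mentioned above can be approximated with off-the-shelf tools; the sketch below is my own illustration rather than the authors' pipeline, and the random expression matrix, the labels, and the 0.05 threshold are assumptions.

    import numpy as np
    from scipy.stats import ranksums
    from sklearn.cluster import OPTICS

    rng = np.random.default_rng(0)
    # hypothetical normalized expression matrix: rows = SAGE libraries, columns = tags
    expr = rng.poisson(5, size=(88, 500)).astype(float)
    is_cancer = rng.integers(0, 2, size=88).astype(bool)   # hypothetical labels

    # subspace selection: keep tags whose expression differs between the two groups
    keep = []
    for j in range(expr.shape[1]):
        stat, p = ranksums(expr[is_cancer, j], expr[~is_cancer, j])
        if p < 0.05:                      # assumed significance threshold
            keep.append(j)
    reduced = expr[:, keep] if keep else expr

    # density-based clustering of the libraries with OPTICS
    labels = OPTICS(min_samples=5).fit_predict(reduced)
    print("tags kept:", len(keep), "cluster labels:", sorted(set(labels)))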
Historical spatio-temporal aggregation BIBAFull-Text 61-102
  Yufei Tao; Dimitris Papadias
Spatio-temporal databases store information about the positions of individual objects over time. However, in many applications such as traffic supervision or mobile communication systems, only summarized data, like the number of cars in an area for a specific period, or phone-calls serviced by a cell each day, is required. Although this information can be obtained from operational databases, its computation is expensive, rendering online processing inapplicable. In this paper, we present specialized methods, which integrate spatio-temporal indexing with pre-aggregation. The methods support dynamic spatio-temporal dimensions for the efficient processing of historical aggregate queries without a priori knowledge of grouping hierarchies. The superiority of the proposed techniques over existing methods is demonstrated through a comprehensive probabilistic analysis and an extensive experimental evaluation.
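For contrast with the pre-aggregation indexes the article develops, here is a minimal baseline of my own (the grid, the update stream, and the linear-scan query are deliberate simplifications): per-cell, per-timestamp counts are maintained so that an aggregate query sums pre-computed buckets instead of scanning raw object positions.

    from collections import defaultdict

    class GridAggregate:
        """Toy pre-aggregation: observation counts per (grid cell, timestamp)."""
        def __init__(self, cell_size):
            self.cell_size = cell_size
            self.counts = defaultdict(int)          # (cx, cy, t) -> count

        def record(self, x, y, t):
            cx, cy = int(x // self.cell_size), int(y // self.cell_size)
            self.counts[(cx, cy, t)] += 1

        def query(self, x1, y1, x2, y2, t1, t2):
            """Observations inside rectangle [x1,x2) x [y1,y2) during [t1,t2]."""
            c, total = self.cell_size, 0
            for (cx, cy, t), n in self.counts.items():
                if t1 <= t <= t2 and x1 <= cx * c < x2 and y1 <= cy * c < y2:
                    total += n
            return total

    agg = GridAggregate(cell_size=10)
    agg.record(3, 4, t=1); agg.record(12, 4, t=1); agg.record(3, 5, t=2)
    print(agg.query(0, 0, 10, 10, 1, 2))   # -> 2

The article's indexes exist precisely to avoid the linear scan inside query() while still supporting ad hoc regions and intervals.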
Incorporating contextual information in recommender systems using a multidimensional approach BIBAFull-Text 103-145
  Gediminas Adomavicius; Ramesh Sankaranarayanan; Shahana Sen; Alexander Tuzhilin
The article presents a multidimensional (MD) approach to recommender systems that can provide recommendations based on additional contextual information besides the typical information on users and items used in most of the current recommender systems. This approach supports multiple dimensions, profiling information, and hierarchical aggregation of recommendations. The article also presents a multidimensional rating estimation method capable of selecting two-dimensional segments of ratings pertinent to the recommendation context and applying standard collaborative filtering or other traditional two-dimensional rating estimation techniques to these segments. A comparison of the multidimensional and two-dimensional rating estimation approaches is made, and the tradeoffs between the two are studied. Moreover, the article introduces a combined rating estimation method, which identifies the situations where the MD approach outperforms the standard two-dimensional approach and uses the MD approach in those situations and the standard two-dimensional approach elsewhere. Finally, the article presents a pilot empirical study of the combined approach, using a multidimensional movie recommender system that was developed for implementing this approach and testing its performance.
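A minimal sketch of the reduction-based idea described above (not the authors' system; the single context dimension, the toy ratings, and the segment-mean predictor are assumptions): select the two-dimensional segment of ratings matching the recommendation context, then apply an ordinary estimator to that segment.

    # ratings as (user, item, rating, context); the context field stands in for
    # additional dimensions such as time, place, or companion
    ratings = [
        ("alice", "m1", 5, "weekend"), ("alice", "m2", 2, "weekday"),
        ("bob",   "m1", 4, "weekend"), ("bob",   "m2", 3, "weekend"),
        ("carol", "m2", 5, "weekend"),
    ]

    def predict(user, item, context):
        # 1. contextual pre-filtering: keep only ratings from the matching segment
        segment = [r for r in ratings if r[3] == context]
        # 2. any standard 2D estimator on the segment; here, the item's segment mean
        item_ratings = [r[2] for r in segment if r[1] == item]
        if item_ratings:
            return sum(item_ratings) / len(item_ratings)
        # fall back to the unreduced 2D data when the segment is too sparse
        all_item = [r[2] for r in ratings if r[1] == item]
        return sum(all_item) / len(all_item) if all_item else None

    print(predict("alice", "m2", "weekend"))   # mean of weekend ratings for m2 -> 4.0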

TOIS 2005 Volume 23 Issue 2

Evaluating implicit measures to improve web search BIBFull-Text 147-168
  Steve Fox; Kuldeep Karnawat; Mark Mydland; Susan Dumais; Thomas White
Ad Hoc, self-supervising peer-to-peer search networks BIBAFull-Text 169-200
  Brian F. Cooper; Hector Garcia-Molina
Peer-to-peer search networks are a popular and widely deployed means of searching massively distributed digital information repositories. Unfortunately, as such networks grow, peers may become overloaded processing messages from other peers. This article examines how to reduce the load on nodes in P2P networks by allowing them to self-organize into a relatively efficient network, and then self-tune to make the network even more efficient. Two local operations used by a peer are introduced: connect(), in which the peer forms an ad hoc search or index link to another peer, and break(), in which the peer breaks a link that is producing too much load. By replacing fixed rules with dynamic local decision-making, such "self-supervising" networks can better adjust to network conditions. Different ways to implement connect() and break() are described, and the network structures that form under different configurations are examined. Simulation results indicate that the ad hoc networks formed using the described techniques are more efficient than popular supernode topologies for several important scenarios. Results for the fault tolerance and search latency of such ad hoc networks are also presented.
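The connect() and break() operations come from the abstract; the load model, the threshold, and the choice of which link to drop in the sketch below are assumed simplifications (and break is a Python keyword, so the method is called break_link here).

    import random

    class Peer:
        def __init__(self, name, max_load=100):
            self.name = name
            self.links = set()
            self.load = 0             # e.g. search messages handled per tick
            self.max_load = max_load

        def connect(self, other):
            """Form an ad hoc search/index link to another peer."""
            if other is not self:
                self.links.add(other)
                other.links.add(self)

        def break_link(self):
            """Drop a link that is producing too much load (here, simply a random one)."""
            if self.links:
                victim = random.choice(list(self.links))
                self.links.discard(victim)
                victim.links.discard(self)

        def tick(self, incoming_messages):
            """Local self-tuning rule: shed a link whenever load exceeds capacity."""
            self.load = incoming_messages
            if self.load > self.max_load:
                self.break_link()

    a, b = Peer("a"), Peer("b")
    a.connect(b)
    a.tick(incoming_messages=150)        # overloaded, so a breaks a link
    print(len(a.links), len(b.links))    # -> 0 0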
CrimeNet explorer: a framework for criminal network knowledge discovery BIBAFull-Text 201-226
  Jennifer J. Xu; Hsinchun Chen
Knowledge about the structure and organization of criminal networks is important for both crime investigation and the development of effective strategies to prevent crimes. However, except for network visualization, criminal network analysis remains primarily a manual process. Existing tools do not provide advanced structural analysis techniques that allow extraction of network knowledge from large volumes of criminal-justice data. To help law enforcement and intelligence agencies discover criminal network knowledge efficiently and effectively, in this research we proposed a framework for automated network analysis and visualization. The framework included four stages: network creation, network partition, structural analysis, and network visualization. Based upon it, we have developed a system called CrimeNet Explorer that incorporates several advanced techniques: a concept space approach, hierarchical clustering, social network analysis methods, and multidimensional scaling. Results from controlled experiments involving student subjects demonstrated that our system could achieve higher clustering recall and precision than did untrained subjects when detecting subgroups from criminal networks. Moreover, subjects identified central members and interaction patterns between groups significantly faster with the help of structural analysis functionality than with only visualization functionality. No significant gain in effectiveness was present, however. Our domain experts also reported that they believed CrimeNet Explorer could be very useful in crime investigation.
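The partition, structural-analysis, and visualization stages map onto standard techniques; the sketch below is my own illustration rather than CrimeNet Explorer itself, and the five-suspect association matrix is invented toy data.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.manifold import MDS

    # hypothetical association weights between 5 suspects (higher = stronger tie),
    # e.g. the output of a concept-space / co-occurrence analysis of case reports
    w = np.array([
        [0.0, 0.9, 0.8, 0.1, 0.0],
        [0.9, 0.0, 0.7, 0.0, 0.1],
        [0.8, 0.7, 0.0, 0.2, 0.1],
        [0.1, 0.0, 0.2, 0.0, 0.9],
        [0.0, 0.1, 0.1, 0.9, 0.0],
    ])
    dist = 1.0 - w                       # turn association strength into distance
    np.fill_diagonal(dist, 0.0)

    # network partition: hierarchical clustering into two subgroups
    condensed = dist[np.triu_indices(5, k=1)]
    groups = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")

    # network visualization: 2-D layout by multidimensional scaling
    xy = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
    print(groups)          # subgroup label per suspect
    print(xy.round(2))     # coordinates for plotting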

TOIS 2005 Volume 23 Issue 3

A market-based approach to recommender systems BIBAFull-Text 227-266
  Yan Zheng Wei; Luc Moreau; Nicholas R. Jennings
Recommender systems have been widely advocated as a way of coping with the problem of information overload for knowledge workers. Given this, multiple recommendation methods have been developed. However, it has been shown that no one technique is best for all users in all situations. Thus we believe that effective recommender systems should incorporate a wide variety of such techniques and that some form of overarching framework should be put in place to coordinate the various recommendations so that only the best of them (from whatever source) are presented to the user. To this end, we show that a marketplace, in which the various recommendation methods compete to offer their recommendations to the user, can be used in this role. Specifically, this article presents the principled design of such a marketplace (including the auction protocol, the reward mechanism, and the bidding strategies of the individual recommendation agents) and evaluates the market's capability to effectively coordinate multiple methods. Through analysis and simulation, we show that our market is capable of shortlisting recommendations in decreasing order of user perceived quality and of correlating the individual agent's internal quality rating to the user's perceived quality.
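A very rough sketch of how a marketplace can shortlist recommendations follows; the sealed-bid auction, the bidding strategy, and the reward rule here are assumptions for illustration, not the protocol designed and analyzed in the article.

    class RecommenderAgent:
        def __init__(self, name, budget=10.0):
            self.name, self.budget = name, budget

        def bid(self, item, internal_quality):
            # assumed strategy: stake part of the budget in proportion to the
            # agent's own estimate of the recommendation's quality
            return min(self.budget, self.budget * 0.2 * internal_quality)

    def run_auction(offers, slots=3):
        """offers: list of (agent, item, internal_quality); returns the shortlist."""
        bids = sorted(((agent.bid(item, q), agent, item) for agent, item, q in offers),
                      key=lambda b: b[0], reverse=True)
        winners = bids[:slots]
        for amount, agent, _ in winners:          # winners pay their bids
            agent.budget -= amount
        return winners

    def reward(winners, accepted_items, payoff=2.0):
        # agents whose recommendations the user accepts earn budget, so methods whose
        # internal quality tracks user-perceived quality can afford higher future bids
        for amount, agent, item in winners:
            if item in accepted_items:
                agent.budget += amount + payoff

    a, b = RecommenderAgent("content-based"), RecommenderAgent("collaborative")
    shortlist = run_auction([(a, "doc1", 0.9), (b, "doc2", 0.6), (b, "doc3", 0.3)])
    reward(shortlist, accepted_items={"doc1"})
    print([item for _, _, item in shortlist], round(a.budget, 2))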
A novel document retrieval method using the discrete wavelet transform BIBAFull-Text 267-298
  Laurence A. F. Park; Kotagiri Ramamohanarao; Marimuthu Palaniswami
Current information retrieval methods either ignore the term positions or deal with exact term positions; the former can be seen as coarse document resolution, the latter as fine document resolution. We propose a new spectral-based information retrieval method that is able to utilize many different levels of document resolution by examining the term patterns that occur in the documents. To do this, we take advantage of the multiresolution analysis properties of the wavelet transform. We show that we are able to achieve higher precision when compared to vector space and proximity retrieval methods, while producing fast query times and using a compact index.
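To make the multiresolution idea concrete, here is a small hand-rolled Haar transform (an illustrative stand-in, not the article's spectral weighting scheme; the eight-block occurrence signal is invented): coarse coefficients summarize how often a term appears across large chunks of a document, while finer coefficients capture where inside those chunks it occurs.

    def haar_dwt(signal):
        """Full Haar decomposition of a length-2^k occurrence signal.
        Returns [overall average, then detail coefficients, coarse to fine]."""
        coeffs, s = [], list(signal)
        while len(s) > 1:
            averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
            details  = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
            coeffs = details + coeffs        # finer details sit further right
            s = averages
        return s + coeffs

    # 1 where the query term occurs in each of 8 equal-sized document blocks
    occurrences = [1, 1, 0, 0, 0, 0, 1, 0]
    print(haar_dwt(occurrences))
    # the first coefficient reflects overall frequency; the rest, positional pattern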
Trustworthy 100-year digital objects: durable encoding for when it's too late to ask BIBAFull-Text 299-324
  H. M. Gladney; R. A. Lorie
How can an author store digital information so that it will be reliably intelligible, even years later when he or she is no longer available to answer questions? Methods that might work are not good enough; what is preserved today should be reliably intelligible whenever someone wants it. Prior proposals fail because they generally confound saved data with irrelevant details of today's information technology -- details that are difficult to define, extract, and save completely and accurately. We use a virtual machine to represent and eventually to render any data whatsoever. We focus on a case of intermediate difficulty -- an executable procedure -- and identify a variant for every other data type. This solution might be more elaborate than needed to render some text, image, audio, or video data. Simple data can be preserved as representations using well-known standards. We sketch practical methods for files ranging from simple structures to those containing computer programs, treating simple cases here and deferring complex cases for future work. Enough of the complete solution is known to enable practical aggressive preservation programs today.
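The approach hinges on a small, fully specified virtual machine whose behaviour can be re-implemented from its written description long after today's platforms are gone; the toy stack machine below conveys only the flavour of that idea (its instruction set and the archived program are invented, and it is not the machine the authors specify).

    def run(program):
        """Interpret a tiny, fully documented stack machine.
        Instructions: ("push", n) | ("add",) | ("mul",) | ("print",)"""
        stack = []
        for instr in program:
            op = instr[0]
            if op == "push":
                stack.append(instr[1])
            elif op == "add":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "mul":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "print":
                print(stack.pop())
            else:
                raise ValueError(f"unknown instruction {op!r}")

    # an "archived" program: anyone who re-implements run() from its short written
    # specification can still render the result on whatever hardware then exists
    run([("push", 6), ("push", 7), ("mul",), ("print",)])   # -> 42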
Evaluating implicit feedback models using searcher simulations BIBAFull-Text 325-361
  Ryen W. White; Ian Ruthven; Joemon M. Jose; C. J. Van Rijsbergen
In this article we describe an evaluation of relevance feedback (RF) algorithms using searcher simulations. Since these algorithms select additional terms for query modification based on inferences made from searcher interaction, not on relevance information searchers explicitly provide (as in traditional RF), we refer to them as implicit feedback models. We introduce six different models that base their decisions on the interactions of searchers and use different approaches to rank query modification terms. The aim of this article is to determine which of these models should be used to assist searchers in the systems we develop. To evaluate these models we used searcher simulations that afforded us more control over the experimental conditions than experiments with human subjects and allowed complex interaction to be modeled without the need for costly human experimentation. The simulation-based evaluation methodology measures how well the models learn the distribution of terms across relevant documents (i.e., learn what information is relevant) and how well they improve search effectiveness (i.e., create effective search queries). Our findings show that an implicit feedback model based on Jeffrey's rule of conditioning outperformed other models under investigation.
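Jeffrey's rule of conditioning, which underlies the best-performing model, generalizes Bayesian updating to uncertain evidence: P'(A) = sum_i P(A | E_i) * Q(E_i), where Q holds the revised probabilities of a partition {E_i}. The numbers and the two-event partition in the sketch below are illustrative, not the article's term-weighting model.

    def jeffrey_update(p_a_given_e, q_e):
        """Jeffrey's rule: P'(A) = sum_i P(A | E_i) * Q(E_i)."""
        assert abs(sum(q_e.values()) - 1.0) < 1e-9
        return sum(p_a_given_e[e] * q for e, q in q_e.items())

    # A = "term t should be added to the query"; the partition is whether the
    # document the searcher interacted with is relevant or not
    p_term_given = {"relevant": 0.30, "not_relevant": 0.05}

    # interaction evidence (say, a long dwell time) shifts belief that the viewed
    # document is relevant from 0.5 to 0.8 without making it certain
    print(jeffrey_update(p_term_given, {"relevant": 0.5, "not_relevant": 0.5}))  # 0.175
    print(jeffrey_update(p_term_given, {"relevant": 0.8, "not_relevant": 0.2}))  # 0.25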

TOIS 2005 Volume 23 Issue 4

Taxonomy generation for text segments: A practical web-based approach BIBAFull-Text 363-396
  Shui-Lung Chuang; Lee-Feng Chien
It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibility of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed to create the hierarchical topic structure of the text segments: segments with close concepts are grouped together in a cluster, and related clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and should benefit many information systems.
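A minimal sketch of the enrich-then-cluster idea (the snippets are hard-coded stand-ins for live search results, and plain agglomerative clustering replaces the article's purpose-built hierarchy construction):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    # short text segments to organize, each enriched with (here: invented) snippets
    segments = {
        "jaguar speed":    "jaguar big cat fastest land animal top speed sprint",
        "cheetah habitat": "cheetah big cat savanna habitat africa grassland",
        "python tutorial": "python programming language beginner tutorial code",
        "java generics":   "java programming language generics type parameters",
    }

    names = list(segments)
    # represent every segment by the tf-idf vector of its enriched text
    vectors = TfidfVectorizer().fit_transform([segments[n] for n in names]).toarray()

    # group segments with close concepts; a real system builds a deeper tree
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
    for label, name in sorted(zip(labels, names)):
        print(label, name)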
Set-based vector model: An efficient approach for correlation-based ranking BIBAFull-Text 397-429
  Bruno Pôssas; Nivio Ziviani; Wagner Meira, Jr.; Berthier Ribeiro-Neto
This work presents a new approach for ranking documents in the vector space model. The novelty lies in two fronts. First, patterns of term co-occurrence are taken into account and are processed efficiently. Second, term weights are generated using a data mining technique called association rules. This leads to a new ranking mechanism called the set-based vector model. The components of our model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document. They can be efficiently obtained by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of the termset frequency in the document and its scarcity in the document collection. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2-gigabyte TREC-8 collection, the set-based vector model leads to a gain in average precision figures of 14.7% and 16.4% for disjunctive and conjunctive queries, respectively, with respect to the standard vector space model. These gains increase to 24.9% and 30.0%, respectively, when proximity information is taken into account. Query processing times are larger but, on average, still comparable to those obtained with the standard vector model (increases in processing time varied from 30% to 300%). Our results suggest that the set-based vector model provides a correlation-based ranking formula that is effective with general collections and computationally practical.
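The ranking can be illustrated with a toy scorer; this is my own sketch (the proximity window, the frequency-times-scarcity weighting, and the three tiny documents are assumptions, and a real implementation mines termsets with association rules rather than brute-force enumeration).

    from itertools import combinations
    from math import log

    docs = {
        "d1": "information retrieval models rank documents by term statistics".split(),
        "d2": "set based models use term correlations for ranking".split(),
        "d3": "cooking ranks among the most popular hobbies".split(),
    }

    def termset_occurrences(tokens, termset, window=5):
        """Count passages in which every term of the termset co-occurs."""
        return sum(1 for i in range(len(tokens))
                   if all(t in set(tokens[i:i + window]) for t in termset))

    def score(query_terms, doc_tokens):
        total = 0.0
        # every non-empty subset of the query is a candidate termset
        for r in range(1, len(query_terms) + 1):
            for termset in combinations(query_terms, r):
                tf = termset_occurrences(doc_tokens, termset)
                df = sum(1 for d in docs.values() if termset_occurrences(d, termset))
                if tf and df:
                    total += tf * log(1 + len(docs) / df)   # frequency x scarcity
        return total

    query = ["term", "ranking"]
    ranking = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
    print(ranking)   # documents ordered by set-based score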
Learning to crawl: Comparing classification schemes BIBAFull-Text 430-462
  Gautam Pant; Padmini Srinivasan
Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with Support Vector Machine or Neural Network. Further, the weak performance of Naive Bayes can be partly explained by extreme skewness of posterior probabilities generated by it. We also observe that despite similar performances, different topical crawlers cover subspaces on the Web with low overlap.
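A classifier-guided topical crawler is, at its core, a best-first search whose frontier is ordered by classifier scores. The skeleton below shows that control loop; fetching and link extraction are stubbed out, and the scoring function is a placeholder for a trained Naive Bayes, SVM, or neural network model.

    import heapq

    def crawl(seeds, score_link, fetch, extract_links, budget=100):
        """Sequential best-first crawl: always expand the most promising URL next."""
        frontier = [(-1.0, url) for url in seeds]      # max-heap via negated scores
        heapq.heapify(frontier)
        visited, fetched = set(), []
        while frontier and len(fetched) < budget:
            neg_score, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            fetched.append((url, -neg_score))
            for link in extract_links(page):
                if link not in visited:
                    # the classifier biases the crawl toward promising parts of the Web graph
                    heapq.heappush(frontier, (-score_link(page, link), link))
        return fetched

    # toy stand-ins so the skeleton runs end to end
    web = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
    order = crawl(["seed"],
                  score_link=lambda page, link: {"a": 0.9, "b": 0.2, "c": 0.5}.get(link, 0.0),
                  fetch=lambda url: url,
                  extract_links=lambda page: web.get(page, []))
    print([u for u, _ in order])   # -> ['seed', 'a', 'c', 'b']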
Evolution of web site design patterns BIBAFull-Text 463-497
  Melody Y. Ivory; Rodrick Megraw
The Web enables broad dissemination of information and services; however, the ways in which sites are designed can either facilitate or impede users' benefit from these resources. We present a longitudinal study of web site design from 2000 to 2003. We analyze over 150 quantitative measures of interface aspects (e.g., amount of text on pages, numbers and types of links, consistency, accessibility, etc.) for 22,000 pages and over 1,500 sites that received ratings from Internet professionals. We examine characteristics of highly rated sites and provide three perspectives on the evolution of web site design patterns: (1) descriptions of design patterns during each time period; (2) changes in design patterns across the three time periods; and (3) comparisons of design patterns to those that are recommended in the relevant literature (i.e., texts by recognized experts and user studies). We illustrate how design practices conform to or deviate from recommended practices and the consequent implications. We show that the most glaring deficiency of web sites, even for sites that are highly rated, is their inadequate accessibility, in particular for browser scripts, tables, and form elements.
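The kinds of quantitative page measures analyzed in the study can be computed directly from markup; the sketch below (a simplified illustration using only the standard library, not the authors' measurement toolkit) counts links, tables, form controls, scripts, and words of visible text for one page.

    from html.parser import HTMLParser

    class PageMeasures(HTMLParser):
        """Collects a few simple interface measures from one HTML page."""
        def __init__(self):
            super().__init__()
            self.counts = {"links": 0, "tables": 0, "form_controls": 0, "scripts": 0}
            self.words = 0
            self._in_script = False

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.counts["links"] += 1
            elif tag == "table":
                self.counts["tables"] += 1
            elif tag in ("input", "select", "textarea"):
                self.counts["form_controls"] += 1
            elif tag == "script":
                self.counts["scripts"] += 1
                self._in_script = True

        def handle_endtag(self, tag):
            if tag == "script":
                self._in_script = False

        def handle_data(self, data):
            if not self._in_script:
                self.words += len(data.split())

    html = "<p>Welcome to the <a href='/about'>about</a> page.</p><table><tr><td>x</td></tr></table>"
    m = PageMeasures(); m.feed(html)
    print(m.counts, "words:", m.words)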