
ACM Transactions on Information Systems 22

Editors: W. Bruce Croft
Dates: 2004
Volume: 22
Publisher: ACM
Standard No: ISSN 1046-8188; HF S548.125 A33
Papers: 20
Links: Table of Contents
  1. TOIS 2004 Volume 22 Issue 1
  2. TOIS 2004 Volume 22 Issue 2
  3. TOIS 2004 Volume 22 Issue 3
  4. TOIS 2004 Volume 22 Issue 4

TOIS 2004 Volume 22 Issue 1

Introduction to recommender systems: Algorithms and evaluation BIB Full-Text 1-4
  Joseph A. Konstan
Evaluating collaborative filtering recommender systems BIBA Full-Text 5-53
  Jonathan L. Herlocker; Joseph A. Konstan; Loren G. Terveen; John T. Riedl
Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain, where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalence class were strongly correlated, while metrics from different equivalence classes were uncorrelated.
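
A quick way to see how metrics collapse into equivalence classes is to correlate their scores across a set of systems. The sketch below uses synthetic ratings and four hypothetical recommenders (not the article's data or metric suite) to show two error metrics behaving as one class:

    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.integers(1, 6, 200).astype(float)    # held-out ratings on a 1-5 scale
    # predictions of four hypothetical recommenders of varying accuracy
    runs = [truth + rng.normal(0, s, 200) for s in (0.4, 0.7, 1.0, 1.4)]

    mae = [np.mean(np.abs(p - truth)) for p in runs]
    rmse = [np.sqrt(np.mean((p - truth) ** 2)) for p in runs]

    # MAE and RMSE rank the systems nearly identically, i.e. they act as
    # members of the same equivalence class
    print(np.corrcoef(mae, rmse)[0, 1])
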
Ontological user profiling in recommender systems BIBA Full-Text 54-88
  Stuart E. Middleton; Nigel R. Shadbolt; David C. De Roure
We explore a novel ontological approach to user profiling within recommender systems, working on the problem of recommending on-line academic research papers. Our two experimental systems, Quickstep and Foxtrot, create user profiles from unobtrusively monitored behaviour and relevance feedback, representing the profiles in terms of a research paper topic ontology. A novel profile visualization approach is taken to acquire profile feedback. Research papers are classified using ontological classes, and collaborative recommendation algorithms are used to recommend papers seen by similar people on their current topics of interest. Two small-scale experiments, with 24 subjects over 3 months, and a large-scale experiment, with 260 subjects over an academic year, are conducted to evaluate different aspects of our approach. Ontological inference is shown to improve user profiling, external ontological knowledge to successfully bootstrap a recommender system, and profile visualization to improve profiling accuracy. The overall performance of our ontological recommender systems is also presented and favourably compared to other systems in the literature.
Latent semantic models for collaborative filtering BIBA Full-Text 89-115
  Thomas Hofmann
Collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, that is, a database of available user preferences. In this article, we describe a new family of model-based algorithms designed for this task. These algorithms rely on a statistical modelling technique that introduces latent class variables in a mixture model setting to discover user communities and prototypical interest profiles. We investigate several variations to deal with discrete and continuous response variables as well as with different objective functions. The main advantages of this technique over standard memory-based methods are higher accuracy, constant time prediction, and an explicit and compact model representation. The latter can also be used to mine for user communities. The experimental evaluation shows that substantial improvements in accuracy over existing methods and published results can be obtained.
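
A minimal EM sketch of the aspect model at the core of this family, with a latent class z and decomposition P(u,i) = sum_z P(z)P(u|z)P(i|z); refinements from the article such as tempered EM and Gaussian response models for ratings are omitted:

    import numpy as np

    def plsa_em(N, K=2, iters=50, seed=0):
        # N: (users x items) count matrix; K: number of latent communities
        rng = np.random.default_rng(seed)
        U, I = N.shape
        Pz = np.full(K, 1.0 / K)
        Pu_z = rng.random((K, U)); Pu_z /= Pu_z.sum(1, keepdims=True)
        Pi_z = rng.random((K, I)); Pi_z /= Pi_z.sum(1, keepdims=True)
        for _ in range(iters):
            # E-step: responsibilities P(z|u,i)
            R = Pz[:, None, None] * Pu_z[:, :, None] * Pi_z[:, None, :]
            R /= R.sum(0, keepdims=True) + 1e-12
            # M-step: re-estimate parameters from expected counts
            W = R * N[None, :, :]
            Pz = W.sum((1, 2)); Pz /= Pz.sum()
            Pu_z = W.sum(2); Pu_z /= Pu_z.sum(1, keepdims=True) + 1e-12
            Pi_z = W.sum(1); Pi_z /= Pi_z.sum(1, keepdims=True) + 1e-12
        return Pz, Pu_z, Pi_z

The per-class profiles P(i|z) are the prototypical interest profiles and can be read off directly, which is the compact, minable model representation the abstract refers to.
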
Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering BIBA Full-Text 116-142
  Zan Huang; Hsinchun Chen; Daniel Zeng
Recommender systems are being widely applied in many application settings to suggest products, services, and information items to potential consumers. Collaborative filtering, the most successful recommendation approach, makes recommendations based on past transactions and feedback from consumers sharing similar interests. A major problem limiting the usefulness of collaborative filtering is the sparsity problem, which refers to a situation in which transactional or feedback data is sparse and insufficient to identify similarities in consumer interests. In this article, we propose to deal with this sparsity problem by applying an associative retrieval framework and related spreading activation algorithms to explore transitive associations among consumers through their past transactions and feedback. Such transitive associations are a valuable source of information to help infer consumer interests and can be explored to deal with the sparsity problem. To evaluate the effectiveness of our approach, we have conducted an experimental study using a data set from an online bookstore. We experimented with three spreading activation algorithms including a constrained Leaky Capacitor algorithm, a branch-and-bound serial symbolic search algorithm, and a Hopfield net parallel relaxation search algorithm. These algorithms were compared with several collaborative filtering approaches that do not consider the transitive associations: a simple graph search approach, two variations of the user-based approach, and an item-based approach. Our experimental results indicate that spreading activation-based approaches significantly outperformed the other collaborative filtering methods as measured by recommendation precision, recall, the F-measure, and the rank score. We also observed the over-activation effect of the spreading activation approach, that is, incorporating transitive associations with past transactional data that is not sparse may "dilute" the data used to infer user preferences and lead to degradation in recommendation performance.
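
The transitive-association idea can be pictured as activation spreading over the bipartite consumer-product graph. The loop below is an illustrative relaxation in the spirit of the Hopfield net variant, not a reimplementation of the paper's three algorithms:

    import numpy as np

    def spread(A, start_items, decay=0.5, iters=3):
        # A: binary (users x items) transaction matrix
        act = np.zeros(A.shape[1])
        act[list(start_items)] = 1.0                 # activate a user's known items
        for _ in range(iters):
            users = np.tanh(A @ act)                 # items -> users
            act = np.maximum(act, decay * np.tanh(A.T @ users))  # users -> items
        return act   # items reachable only transitively still receive activation

Running more iterations widens the activation's reach, which also hints at the over-activation effect the authors report: on dense data, distant transitive neighbors begin to dilute the signal.
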
Item-based top-N recommendation algorithms BIBA Full-Text 143-177
  Mukund Deshpande; George Karypis
The explosive growth of the World Wide Web and the emergence of e-commerce have led to the development of recommender systems -- a personalized information filtering technology used to identify a set of items that will be of interest to a certain user. User-based collaborative filtering is the most successful technology for building recommender systems to date and is extensively used in many commercial recommender systems. Unfortunately, the computational complexity of these methods grows linearly with the number of customers, which in typical commercial applications can be several million. To address these scalability concerns, model-based recommendation techniques have been developed. These techniques analyze the user-item matrix to discover relations between the different items and use these relations to compute the list of recommendations.
   In this article, we present one such class of model-based recommendation algorithms that first determines the similarities between the various items and then uses them to identify the set of items to be recommended. The key steps in this class of algorithms are (i) the method used to compute the similarity between the items, and (ii) the method used to combine these similarities in order to compute the similarity between a basket of items and a candidate recommender item. Our experimental evaluation on eight real datasets shows that these item-based algorithms are up to two orders of magnitude faster than the traditional user-neighborhood based recommender systems and provide recommendations with comparable or better quality.
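
The two key steps can be made concrete under one common set of choices, cosine similarity between item columns and summed basket-to-candidate similarities; the article evaluates several alternatives for both steps, so this is only a plausible instance:

    import numpy as np

    def item_based_topn(A, basket, k=20, n=10):
        # A: binary (users x items) matrix; basket: item ids already selected
        norms = np.linalg.norm(A, axis=0) + 1e-12
        S = (A.T @ A) / np.outer(norms, norms)       # item-item cosine similarity
        np.fill_diagonal(S, 0.0)
        for j in range(S.shape[1]):                  # keep only each item's k nearest
            drop = np.argsort(S[:, j])[:-k]
            S[drop, j] = 0.0
        scores = S[:, list(basket)].sum(axis=1)      # combine basket similarities
        scores[list(basket)] = -np.inf               # never re-recommend the basket
        return np.argsort(scores)[::-1][:n]

Because the similarity model is precomputed offline, the online step is a few sparse lookups, which is where the speedup over user-neighborhood methods comes from.
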

TOIS 2004 Volume 22 Issue 2

A study of smoothing methods for language models applied to information retrieval BIBA Full-Text 179-214
  Chengxiang Zhai; John Lafferty
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and to then rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. Experimental results show that not only is the retrieval performance generally sensitive to the smoothing parameters, but also the sensitivity pattern is affected by the query type, with performance being more sensitive to smoothing for verbose queries than for keyword queries. Verbose queries also generally require more aggressive smoothing to achieve optimal performance. This suggests that smoothing plays two different roles -- to make the estimated document language model more accurate and to "explain" the noninformative words in the query. In order to decouple these two distinct roles of smoothing, we propose a two-stage smoothing strategy, which yields better sensitivity patterns and facilitates the setting of smoothing parameters automatically. We further propose methods for estimating the smoothing parameters automatically. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to -- or better than -- the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
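
For concreteness, here is query-likelihood scoring under one of the smoothing methods the article compares, Dirichlet prior smoothing, where p(w|d) = (c(w;d) + mu*p(w|C)) / (|d| + mu); the default mu and the floor for unseen terms below are illustrative choices, not the paper's tuned settings:

    import math
    from collections import Counter

    def dirichlet_loglik(query, doc, ctf, clen, mu=2000.0):
        # query, doc: token lists; ctf: collection term frequencies; clen: total tokens
        tf, dlen, score = Counter(doc), len(doc), 0.0
        for w in query:
            p_wc = ctf.get(w, 0.5) / clen            # crude floor for unseen terms
            score += math.log((tf[w] + mu * p_wc) / (dlen + mu))
        return score

Documents are ranked by this log-likelihood; with verbose queries, heavier smoothing effectively flattens the contribution of the query's noninformative words, which is the second smoothing role the abstract identifies.
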
Multidocument summarization: An added value to clustering in interactive retrieval BIBA Full-Text 215-241
  Manuel J. Mana-Lopez; Manuel De Buenaga; Jose M. Gomez-Hidalgo
An increasingly common problem in effective information access is the presence, within the same corpus, of multiple documents that contain similar information. Generally, users may be interested in locating, for a topic addressed by a group of similar documents, one or several particular aspects. This kind of task, called instance or aspectual retrieval, has been explored in several TREC Interactive Tracks. In this article, we propose to complement the classification capacity of clustering techniques with an indicative extract of the contents of several sources, produced by means of multidocument summarization techniques. Two kinds of summaries are provided. The first covers the similarities of each cluster of documents retrieved. The second shows the particularities of each document with respect to the common topic in the cluster. The multitopic structure of the documents is used to determine similarities and differences of topics within a cluster of documents. The system is independent of document domain and genre. An evaluation of the proposed system with users shows significant improvements in effectiveness. The results of previous experiments comparing clustering algorithms are also reported.
Anchor text mining for translation of Web queries: A transitive translation approach BIBA Full-Text 242-269
  Wen-Hsiang Lu; Lee-Feng Chien; Hsi-Jian Lee
To discover translation knowledge in diverse data resources on the Web, this article proposes an effective approach to finding translation equivalents of query terms and constructing multilingual lexicons through the mining of Web anchor texts and link structures. Although Web anchor texts are wide-scoped hypertext resources, not every pair of languages has sufficient anchor texts for effective extraction of translations of Web queries. For more generalized applications, the approach is therefore designed around a transitive translation model: the translation equivalents of a query term can be extracted via its translation in an intermediate language. To reduce interference from translation errors, the approach further integrates a competitive linking algorithm into the process of determining the most probable translation. A series of experiments has been conducted, including performance tests on term translation extraction, cross-language information retrieval, and translation suggestions for practical Web search services. The experimental results show that the proposed approach is effective in extracting translations of unknown queries, is easy to combine with the probabilistic retrieval model to improve cross-language retrieval performance, and is very useful when the considered language pair lacks a sufficient number of anchor texts. Based on this approach, an experimental system called LiveTrans has been developed for English-Chinese cross-language Web search.
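
The competitive linking step can be sketched as a greedy one-to-one matching over candidate translation scores: repeatedly accept the best-scoring pair and retire both terms. Tie-breaking and thresholding details are omitted:

    def competitive_linking(scores):
        # scores: {(source_term, target_term): similarity}; higher is better
        links, used_s, used_t = {}, set(), set()
        for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
            if s not in used_s and t not in used_t:
                links[s] = t                         # accept the best remaining pair
                used_s.add(s); used_t.add(t)
        return links

In the transitive setting, the scores themselves would come from composing source-to-intermediate and intermediate-to-target translation evidence mined from anchor texts.
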
Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries BIBA Full-Text 270-312
  Marcos Andre Goncalves; Edward A. Fox; Layne T. Watson; Neill A. Kipp
Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development efforts diverge and interoperability suffer. In this article, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which allow us to define digital libraries rigorously and usefully. Streams are sequences of arbitrary items used to describe both static and dynamic (e.g., video) content. Structures can be viewed as labeled directed graphs, which impose organization. Spaces are sets with operations on those sets that obey certain constraints. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies are sets of entities and activities and the relationships among them. Together these abstractions provide a formal foundation to define, relate, and unify concepts -- among others, of digital objects, metadata, collections, and services -- required to formalize and elucidate "digital libraries". The applicability, versatility, and unifying power of the 5S model are demonstrated through its use in three distinct applications: building and interpretation of a DL taxonomy, informal and formal analysis of case studies of digital libraries (NDLTD and OAI), and utilization as a formal basis for a DL description language.
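
The first two abstractions admit a compact notational gloss, consistent with the abstract though far short of the article's full definitions:

    % a stream is a sequence over some set T of items
    S : \mathbb{N} \to T
    % a structure is a labeled directed graph over vertices V
    (V, E, \ell), \qquad E \subseteq V \times V, \qquad \ell : V \cup E \to L
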
XIRQL: An XML query language based on information retrieval concepts BIBA Full-Text 313-356
  Norbert Fuhr; Kai Grossjohann
XIRQL ("circle") is an XML query language that incorporates imprecision and vagueness for both structural and content-oriented query conditions. The corresponding uncertainty is handled by a consistent probabilistic model. The core features of XIRQL are (1) document ranking based on index term weighting, (2) specificity-oriented search for retrieving the most relevant parts of documents, (3) datatypes with vague predicates for dealing with specific types of content and (4) structural vagueness for vague interpretation of structural query conditions. A XIRQL database may contain several classes of documents, where all documents in a class conform to the same DTD; links between documents also are supported. XIRQL queries are translated into a path algebra, which can be processed by our HyREX retrieval engine.

TOIS 2004 Volume 22 Issue 3

Relevance models to help estimate document and query parameters BIBA Full-Text 357-380
  David Bodoff
A central idea of Language Models is that documents (and perhaps queries) are random variables, generated by data-generating functions that are characterized by document (query) parameters. The key new idea of this paper is that a relevance judgment is also generated stochastically, and that its data-generating function is governed by those same document and query parameters. The result of this addition is that any available relevance judgments are easily incorporated as additional evidence about the true document and query model parameters. An additional aspect of this approach is that it also resolves the long-standing problem of document-oriented versus query-oriented probabilities. The general approach can be used with a wide variety of hypothesized distributions for documents, queries, and relevance. We test the approach on Reuters Corpus Volume 1, using one set of possible distributions. Experimental results show that the approach does succeed in incorporating relevance data to improve estimates of both document and query parameters, but on this data and for the specific distributions we hypothesized, performance was no better than two separate one-sided models. We conclude that the model's theoretical contribution is its integration of relevance models, document models, and query models, and that the potential for additional performance improvement over one-sided methods requires refinements.
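
One way to write the core idea down is as a joint likelihood in which a judged triple (d, q, r) ties the two parameter sets together; this factorization is illustrative, and the paper's concrete distributional choices differ:

    \mathcal{L}(\theta_d, \theta_q) \;=\; p(d \mid \theta_d)\, p(q \mid \theta_q)\, p(r \mid \theta_d, \theta_q)

Maximizing jointly over theta_d and theta_q lets each relevance judgment inform both the document model and the query model at once, which is how judgments become additional evidence about both parameter sets.
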
Efficient mining of both positive and negative association rules BIBA Full-Text 381-405
  Xindong Wu; Chengqi Zhang; Shichao Zhang
This paper presents an efficient method for mining both positive and negative association rules in databases. The method extends traditional associations to include association rules of the forms A implies not-B, not-A implies B, and not-A implies not-B, which indicate negative associations between itemsets. With a pruning strategy and an interestingness measure, our method scales to large databases. The method has been evaluated using both synthetic and real-world databases, and our experimental results demonstrate its effectiveness and efficiency.
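
The counting behind a negative rule reduces to ordinary supports via the identity supp(A and not-B) = supp(A) - supp(A and B); the sketch below computes support and confidence for A implies not-B, while the paper's actual contribution, the pruning strategy and interestingness measure, sits on top of such counts:

    def negative_rule(transactions, A, B):
        # transactions: list of item sets; A, B: itemsets (Python sets)
        n = len(transactions)
        supp_a = sum(A <= t for t in transactions) / n
        supp_ab = sum((A | B) <= t for t in transactions) / n
        supp = supp_a - supp_ab                      # supp(A and not-B)
        conf = supp / supp_a if supp_a else 0.0      # conf(A => not-B)
        return supp, conf

For example, with transactions [{"milk"}, {"milk", "bread"}, {"milk", "beer"}], negative_rule(..., {"milk"}, {"bread"}) yields support 2/3 and confidence 2/3.
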
Trustworthy 100-year digital objects: Evidence after every witness is dead BIBA Full-Text 406-436
  Henry M. Gladney
In ancient times, wax seals impressed with signet rings were affixed to documents as evidence of their authenticity. A digital counterpart is a message authentication code fixed firmly to each important document. If a digital object is sealed together with its own audit trail, each user can examine this evidence to decide whether to trust the content -- no matter how distant this user is in time, space, and social affiliation from the document's source.
   We propose an architecture and design that accomplish this: encapsulation of digital object content with metadata describing its origins, cryptographic sealing, webs of trust for public keys rooted in a forest of respected institutions, and a certain way of managing information identifiers. These means will satisfy emerging needs in civilian and military record management, including medical patient records, regulatory records for aircraft and pharmaceuticals, business records for financial audit, legislative and legal briefs, and scholarly works.
   This is true for any kind of digital object, independent of its purposes and of most data type and representation details, and provides every kind of user -- information authors and editors, librarians and collection managers, and information consumers -- with autonomy for implied tasks. Our prototype will conform to applicable standards, will be interoperable over most computing bases, and will be compatible with existing digital library software.
   The proposed architecture integrates software that is mostly available and widely accepted.
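
A toy version of the seal-and-verify mechanics using Python's standard hmac module; the article's design relies on public-key signatures anchored in webs of trust rather than the shared secret assumed here:

    import hashlib, hmac, json

    def seal(content: bytes, provenance: dict, key: bytes) -> dict:
        record = {"sha256": hashlib.sha256(content).hexdigest(),
                  "provenance": provenance}          # origin metadata / audit trail
        payload = json.dumps(record, sort_keys=True).encode()
        record["seal"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
        return record

    def verify(record: dict, key: bytes) -> bool:
        body = {k: record[k] for k in ("sha256", "provenance")}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(record["seal"], expected)

Anyone holding the verification key can check the seal long after the document's source is gone, which is the "evidence after every witness is dead" property in miniature.
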
PocketLens: Toward a personal recommender system BIBA Full-Text 437-476
  Bradley N. Miller; Joseph A. Konstan; John Riedl
Recommender systems using collaborative filtering are a popular technique for reducing information overload and finding products to purchase. One limitation of current recommenders is that they are not portable: they can only run on large computers connected to the Internet. A second limitation is that they require the user to trust the owner of the recommender with personal preference data. Personal recommenders hold the promise of delivering high-quality recommendations on palmtop computers, even when disconnected from the Internet. Further, they can protect the user's privacy by storing personal information locally, or by sharing it in encrypted form. In this article we present the new PocketLens collaborative filtering algorithm along with five peer-to-peer architectures for finding neighbors. We evaluate the architectures and algorithms in a series of offline experiments. These experiments show that PocketLens can run on connected servers, on usually connected workstations, or on occasionally connected portable devices, and can produce recommendations whose quality matches that of the best published algorithms to date.
Distributed content-based visual information retrieval system on peer-to-peer networks BIBA Full-Text 477-501
  Irwin King; Cheuk Hang Ng; Ka Cheung Sia
With recent advances in distributed computing, the limitations of information retrieval from a centralized image collection can be removed by allowing distributed image data sources to interact with each other for data storage sharing and information retrieval. In this article, we present our design and implementation of DISCOVIR: a DIStributed COntent-based Visual Information Retrieval system using a Peer-to-Peer (P2P) network. We describe the system architecture and detail the interactions among various system modules. Specifically, we propose a Firework Query Model for distributed information retrieval, which aims to reduce the network traffic of query passing in the network. We carry out experiments to evaluate the distributed image retrieval system and the Firework information retrieval algorithm. The results show that the algorithm reduces network traffic while increasing search performance.

TOIS 2004 Volume 22 Issue 4

Qualitative decision making in adaptive presentation of structured information BIBA Full-Text 503-539
  Ronen I. Brafman; Carmel Domshlak; Solomon E. Shimony
We present a new approach for adaptive presentation of structured information, based on preference-based constrained optimization techniques rooted in qualitative decision theory. In this approach, document presentation is viewed as a configuration problem whose goal is to determine the optimal presentation of a document, while taking into account the preferences of the content provider, viewer interaction with the browser, and, possibly, some layout constraints. The preferences of the content provider are represented by a CP-net, a graphical, qualitative preference model developed by Boutilier et al. [1999]. The layout constraints are represented as geometric constraints, integrated within the optimization process. We discuss the theoretical basis of our approach, as well as implemented prototype systems for Web pages and for general media-rich document presentation.
Analysis of lexical signatures for improving information persistence on the World Wide Web BIBA Full-Text 540-572
  Seung-Taek Park; David M. Pennock; C. Lee Giles; Robert Krovetz
A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
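
A sketch of the TFIDF variant of signature generation; the DF-only and TF-only variants change just the scoring line, and the term statistics are assumed to be given:

    import math
    from collections import Counter

    def lexical_signature(doc_tokens, df, n_docs, k=5):
        # df: per-term document frequency across a corpus of n_docs documents
        tf = Counter(doc_tokens)
        score = lambda w: tf[w] * math.log(n_docs / (1 + df.get(w, 0)))
        return sorted(tf, key=score, reverse=True)[:k]

The k chosen words are then submitted to a search engine as a query; the study's finding is that DF-leaning scores sharpen unique identification while TF- and TFIDF-leaning scores do better at surfacing relevant substitutes.
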
Fast phrase querying with combined indexes BIBA Full-Text 573-594
  Hugh E. Williams; Justin Zobel; Dirk Bahle
Search engines need to evaluate queries extremely fast, a challenging task given the quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this article we consider how phrase queries can be efficiently supported with low disk overheads. Our previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. Alternatively, special-purpose phrase indexes can be used, but it is not feasible to index all phrases. We propose combinations of nextword indexes and phrase indexes with inverted files as a solution to this problem. Our experiments show that combined use of a partial nextword, partial phrase, and conventional inverted index allows evaluation of phrase queries in a quarter of the time required to evaluate such queries with an inverted file alone; the additional space overhead is only 26% of the size of the inverted file.
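
The lookup structure behind a nextword index can be sketched as a two-level map from a word to its successor words and their postings; production systems compress these lists heavily, so this only shows the shape of the data structure:

    from collections import defaultdict

    class NextwordIndex:
        def __init__(self):
            # word -> next word -> [(doc_id, position), ...]
            self.index = defaultdict(lambda: defaultdict(list))

        def add(self, doc_id, tokens):
            for pos, (w, nxt) in enumerate(zip(tokens, tokens[1:])):
                self.index[w][nxt].append((doc_id, pos))

        def pair_postings(self, w1, w2):
            # postings for the two-word phrase "w1 w2"
            return self.index[w1].get(w2, [])

A phrase query is answered by intersecting pair postings, avoiding much of the positional intersection work a plain inverted file must do; building the nextword structure only for common firstwords is what keeps the partial variant small.
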
Information systems interoperability: What lies beneath? BIBA Full-Text 595-632
  Jinsoo Park; Sudha Ram
Interoperability is the most critical issue facing businesses that need to access information from multiple information systems. Our objective in this research is to develop a comprehensive framework and methodology to facilitate semantic interoperability among distributed and heterogeneous information systems. The proposed framework for managing various semantic conflicts provides a unified view of the underlying representational and reasoning formalism for the semantic mediation process. It is then used as a basis for automating the detection and resolution of semantic conflicts among heterogeneous information sources. We define several types of semantic mediators to achieve semantic interoperability. A domain-independent ontology is used to capture various semantic conflicts. A mediation-based query processing technique is developed to provide uniform and integrated access to multiple heterogeneous databases. A usable prototype is implemented as a proof-of-concept for this work. Finally, the usefulness of our approach is evaluated using three cases in different application domains, with various heterogeneous datasets used during the evaluation phase. The results suggest that correct identification and construction of both schema and ontology-schema mapping knowledge play very important roles in achieving interoperability at both the data and schema levels.