HCI Bibliography Home | HCI Journals | About TOIS | Journal Info | TOIS Journal Volumes | Detailed Records | RefWorks | EndNote | Hide Abstracts
TOIS Tables of Contents: 080910111213141516171819202122232425262728

ACM Transactions on Information Systems 18

Editors:W. Bruce Croft
Standard No:ISSN 1046-8188; HF S548.125 A33
Links:Table of Contents
  1. TOIS 2000 Volume 18 Issue 1
  2. TOIS 2000 Volume 18 Issue 2
  3. TOIS 2000 Volume 18 Issue 3
  4. TOIS 2000 Volume 18 Issue 4

TOIS 2000 Volume 18 Issue 1

Evaluating the performance of distributed architectures for information retrieval using a variety of workloads BIBAKFull-Text 1-43
  Brendon Cahoon; Kathryn S. McKinley; Zhihong Lu
The information explosion across the Internet and elswhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, ter ms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
Keywords: distributed information retrieval architectures
Shortest-substring retrieval and ranking BIBAKFull-Text 44-78
  Charles L. A. Clarke; Gordon V. Cormack
We present a model for arbitrary passage retrieval using Boolean queries. The model is applied to the task of ranking documents, or other structural elements, in the order of their expected relevance. Features such as phrase matching, truncation, and stemming integrate naturally into the model. Properties of Boolean algebra are obeyed, and the exact-match semantics of Boolean retrieval are preserved. Simple inverted-list file structures provide an efficient implementation. Retrieval effectiveness is comparable to that of standard ranking techniques. Since global statistics are not used, the method is of particular value in distributed environments. Since ranking is based on arbitrary passages, the structural elements to be ranked may be specified at query time and do not need to be restricted to predefined elements.
Keywords: Boolean retrieval model, passage retrieval, relevance ranking
Improving the effectiveness of information retrieval with local context analysis BIBAKFull-Text 79-112
  Jinxi Xu; W. Bruce Croft
Techniques for automatic query expansion have been extensively studied in information research as a means of addressing the word mismatch between queries and documents. These techniques can be categorized as either global or local. While global techniques rely on analysis of a whole collection to discover word relationships, local techniques emphasize analysis of the top-ranked documents retrieved for a query. While local techniques have shown to be more effective that global techniques in general, existing local techniques are not robust and can seriously hurt retrieved when few of the retrieval documents are relevant. We propose a new technique, called local context analysis, which selects expansion terms based on cooccurrence with the query terms within the top-ranked documents. Experiments on a number of collections, both English and non-English, show that local context analysis offers more effective and consistent retrieval results.
Keywords: cooccurrence, document analysis, feedback, global techniques, information retrieval, local context analysis, local techniques

TOIS 2000 Volume 18 Issue 2

Fast and flexible word searching on compressed text BIBAKFull-Text 113-139
  Edleno Silva de Moura; Gonzalo Navarro; Nivio Ziviani; Ricardo Baeza-Yates
We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
Keywords: compressed pattern matching, natural language text compression, word searching, word-based Huffman coding
Extending document management systems with user-specific active properties BIBAKFull-Text 140-170
  Paul Dourish; W. Keith Edwards; Anthony LaMarca; John Lamping; Karin Petersen; Michael Salisbury; Douglas B. Terry; James Thornton
Document properties are a compelling infrastructure on which to develop document management applications. A property-based approach avoids many of the problems of traditional hierarchical storage mechanisms, reflects document organizations meaningful to user tasks, provides a means to integrate the perspectives of multiple individuals and groups, and does this all within a uniform interaction framework. Document properties can reflect not only categorizations of documents and document use, but also expressions of desired system activity, such as sharing criteria, replication management, and versioning. Augmenting property-based document management systems with active properties that carry executable code enables the provision of document-based services on a property infrastructure. The combination of document properties as a uniform mechanism for document management, and active properties as a way of delivering document services, represents a new paradigm for document management infrastructures. The Placeless Documents system is an experimental prototype developed to explore this new paradigm. It is based on the seamless integration of user-specific, active properties. We present the fundamental design approach, explore the challenges and opportunities it presents, and show our architectures deals with them.
Keywords: active properties, component software, document management systems, document services, user experience
Efficient content-based indexing of large image databases BIBAKFull-Text 171-210
  Essam A. El-Kwae; Mansur R. Kabuka
Large image databases have emerged in various applications in recent years. A prime requisite of these databases is the means by which their contents can be indexed and retrieved. A multilevel signature file called the Two Signature Multi-level Signature File (2SMLSF) is introduced as an efficient access structure for large image databases. The 2SMLSF encodes image information into binary signatures and creates a tree structures can be efficiently searched to satisfy a user's query. Two types of signatures are generated. Type I signatures are used at all tree levels except the leaf level and are based only on the domain objects included in the image. Type II signatures, on the other hand, are stored at the leaf level and are based on the included domain objects and their spatial relationships. The 2SMLSF was compared analytically to existing signature file techniques. The 2SMLSF significantly reduces the storage requirements; the index structure can answer more queries; and the 2SMLSF performance significantly improves over current techniques. Both storage reduction and performance improvement increase with the number of objects per image and the number of images in the database. For an example large image database, a storage reduction of 78% may be achieved while the performance improvement may reach 98%.
Keywords: content analysis and indexing, document managing, image databases, index generation, multimedia databases

TOIS 2000 Volume 18 Issue 3

Chimera: hypermedia for heterogeneous software development environments BIBAKFull-Text 211-245
  Kenneth M. Anderson; Richard N. Taylor; E. James, Jr. Whitehead
Emerging software development environments are characterized by heterogeneity: they are composed of diverse object stores, user interfaces, and tools. This paper presents an approach for providing hypermedia services in this heterogeneous setting. Central notions of the approach include the following: anchors are established with respect to interactive views of objects, rather than the objects themselves; composable, n-ary links can be established between anchors on different views of objects which may be stored in distinct object bases; viewers may be implemented in different programming languages; and, hypermedia services are provided to multiple, concurrently active, viewers. The paper describes the approach, supporting architecture, and lessons learned. Related work in the areas of supporting heterogeneity and hypermedia data modeling is discussed. The system has been employed in a variety of contexts including research, development, and education.
Keywords: heterogeneous hypermedia, hypermedia system architectures, link servers, open hypermedia systems, software development environments
The maximum entropy approach and probabilistic IR models BIBAKFull-Text 246-287
  Warren R. Greiff; Jay M. Ponte
This paper takes a fresh look at modeling approaches to information retrieval that have been the basis of much of the probabilistically motivated IR research over the last 20 years. We shall adopt a subjectivist Bayesian view of probabilities and argue that classical work on probabilistic retrieval is best understood from this perspective. The main focus of the paper will be the ranking formulas corresponding to the Binary Independence Model (BIM), presented originally by Roberston and Sparck Jones [1977] and the Combination Match Model (CMM), developed shortly thereafter by Croft and Harper [1979]. We will show how these same ranking formulas can result from a probabilistic methodology commonly known as Maximum Entropy (MAXENT).
Keywords: idf weighting, binary independence model, combination match, linked dependence, probability ranking principle
Data integration using similarity joins and a word-based information representation language BIBAFull-Text 288-321
  William W. Cohen
The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Herer we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language "names" for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a "soft" database management system which supports "similarity joins," based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outerperforming exact matching with a plausible global domain on a second.

TOIS 2000 Volume 18 Issue 4

Model-driven development of Web applications: the AutoWeb system BIBAKFull-Text 323-382
  Piero Fraternali; Paolo Paolini
This paper describes a methodology for the development of WWW applications and a tool environment specifically tailored for the methodology. The methodology and the development environment are based upon models and techniques already used in the hypermedia, information systems, and software engineering fields, adapted and blended in an original mix. The foundation of the proposal is the conceptual design of WWW applications, using HDM-lite, a notation for the specification of structure, navigation, and presentation semantics. The conceptual schema is then translated into a "traditional" database schema, which describes both the organization of the content and the desired navigation and presentation features. The WWW pages can therefore be dynamically generated from the database content, following the navigation requests of the user. A CASE environment, called Autoweb System, offers a set of software tools, which assist the design and the execution of a WWW application, in all its different aspects, Real-life experiences of the use of the methodology and of the AutoWeb System in both the industrial and academic context are reported.
Keywords: HTML, WWW, application, development, intranet, modeling
Beneath the surface of organizational processes: a social representation framework for business process redesign BIBAKFull-Text 383-422
  Gary Katzenstein; F. Javier Lerch
This paper raises the question, "What is an effective representation framework for organizational process design?" By combining our knowledge of existing process models with data from a field study, the paper develops criteria for an effective process representation. Using these criteria and the case study, the paper integrates the process redesign and information system literatures to develop a representation framework that captures a process' social context. The paper argues that this social context framework, which represents people's motivations, social relationships, and social constraints, gives redesigners a richer sense of the process and allows process redesigners to simultaneously change social and logistic systems. The paper demonstrates the framework and some of its benefits and limitations.
Keywords: business process redesign, organizational change, process representation
Beyond intratransaction association analysis: mining multidimensional intertransaction association rules BIBAKFull-Text 423-454
  Hongjun Lu; Ling Feng; Jiawei Han
In this paper, we extend the scope of mining association rules from traditional single-dimensional intratransaction associations, to multidimensional intertransaction associations. Intratransaction associations are the associations among items with the same transaction, where the notion of the transaction could be the items bought by the same customer, the events happened on the same day, and so on. However, an intertransaction association describes the association relationships among different transactions, such as "if(company) A's stock goes up on day 1, B's stock will go down on day 2, but go up on day 4." In this case, whether we treat company or day as the unit of transaction, the associated items belong to different transactions. Moreover, such an intertransaction association can be extended to associate multiple contextual properties in the same rule, so that multidimensional intertransaction associations can be defined and discovered. A two-dimensional intertransaction association rule example is "After McDonald and Burger King open branches, KFC will open a branch two months later and one mile away," which involves two dimensions: time and space. Mining intertransaction associations poses more challenges on efficient processing than mining intratransaction associations. Interestingly, intratransaction association can be treated as a special case of intertransaction association from both a conceptual and algorithmic point of view. In this study, we introduce the notion of multidimensional intertransaction association rules, study their measurements -- support and confidence -- and develop algorithms for mining intertransaction associations by extension of Apriori. We overview our experience using the algorithms on both real-life and synthetic data sets. Further extensions of multidimensional intertransaction association rules and potential applications are also discussed.
Keywords: association rules, data mining, intra/intertransaction, multidimensional context