| Evaluating the performance of distributed architectures for information retrieval using a variety of workloads | | BIBAK | Full-Text | 1-43 | |
| Brendon Cahoon; Kathryn S. McKinley; Zhihong Lu | |||
| The information explosion across the Internet and elsewhere offers access to
an increasing number of document collections. In order for users to effectively
access these collections, information retrieval (IR) systems must provide
coordinated, concurrent, and distributed access. In this article, we explore
how to achieve scalable performance in a distributed system for collection
sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR
system based on a multithreaded version of the Inquery system, along with a
simulation model of it. We
measure performance as a function of system parameters such as client command
rate, number of document collections, terms per query, query term frequency,
number of answers returned, and command mixture. Our results show that it is
important to model both query and document commands because the heterogeneity
of commands significantly impacts performance. Based on our results, we
recommend simple changes to the prototype and evaluate the changes using the
simulator. Because of the significant resource demands of information
retrieval, it is not difficult to generate workloads that overwhelm system
resources regardless of the architecture. However, under some realistic
workloads, we demonstrate system organizations for which response time
gracefully degrades as the workload increases and performance scales with the
number of processors. This scalable architecture includes a surprisingly small
number of brokers through which a large number of clients and servers
communicate. Keywords: distributed information retrieval architectures | |||
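As a rough illustration of the broker-mediated organization the abstract describes, the toy discrete-event sketch below models clients submitting commands through a broker to a pool of servers and reports mean response time. All parameters (arrival rate, service time, server count) are invented for illustration; this is not the authors' validated simulator.

```python
import random

# Toy broker model: Poisson command arrivals, exponential service times,
# broker dispatches each command to the server that frees up earliest.
random.seed(0)
ARRIVAL_RATE = 5.0      # commands per second across all clients (assumed)
SERVICE_TIME = 0.5      # mean seconds per command at a server (assumed)
NUM_SERVERS = 4
NUM_COMMANDS = 10_000

free_at = [0.0] * NUM_SERVERS   # time each server next becomes idle
clock, latencies = 0.0, []
for _ in range(NUM_COMMANDS):
    clock += random.expovariate(ARRIVAL_RATE)
    s = min(range(NUM_SERVERS), key=free_at.__getitem__)
    start = max(clock, free_at[s])            # wait if the server is busy
    free_at[s] = start + random.expovariate(1.0 / SERVICE_TIME)
    latencies.append(free_at[s] - clock)      # queueing delay + service

print(f"mean response time: {sum(latencies) / len(latencies):.3f}s")
```

Raising ARRIVAL_RATE toward NUM_SERVERS / SERVICE_TIME shows the graceful-then-sharp degradation in response time that the article studies.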
| Shortest-substring retrieval and ranking | | BIBAK | Full-Text | 44-78 | |
| Charles L. A. Clarke; Gordon V. Cormack | |||
| We present a model for arbitrary passage retrieval using Boolean queries.
The model is applied to the task of ranking documents, or other structural
elements, in the order of their expected relevance. Features such as phrase
matching, truncation, and stemming integrate naturally into the model.
Properties of Boolean algebra are obeyed, and the exact-match semantics of
Boolean retrieval are preserved. Simple inverted-list file structures provide
an efficient implementation. Retrieval effectiveness is comparable to that of
standard ranking techniques. Since global statistics are not used, the method
is of particular value in distributed environments. Since ranking is based on
arbitrary passages, the structural elements to be ranked may be specified at
query time and do not need to be restricted to predefined elements. Keywords: Boolean retrieval model, passage retrieval, relevance ranking | |||
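A core primitive behind ranking by arbitrary passages is finding, for a set of query terms, the shortest token window that covers them all. The sketch below is a simplified, hypothetical rendering (conjunctive queries only, scored by inverse cover length); the paper's model supports full Boolean queries and preserves exact-match Boolean semantics.

```python
# Shortest cover: the smallest window of tokens containing every term.
def shortest_cover(tokens, terms):
    last, best = {}, None          # last position seen for each term
    for i, tok in enumerate(tokens):
        if tok in terms:
            last[tok] = i
            if len(last) == len(terms):
                lo = min(last.values())
                if best is None or i - lo < best[1] - best[0]:
                    best = (lo, i)
    return best                    # (start, end), or None if no cover

def score(tokens, terms):
    cover = shortest_cover(tokens, terms)
    return 0.0 if cover is None else 1.0 / (cover[1] - cover[0] + 1)

doc = "the quick brown fox jumps over the lazy dog".split()
print(shortest_cover(doc, {"fox", "dog"}), score(doc, {"fox", "dog"}))
```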
| Improving the effectiveness of information retrieval with local context analysis | | BIBAK | Full-Text | 79-112 | |
| Jinxi Xu; W. Bruce Croft | |||
| Techniques for automatic query expansion have been extensively studied in
information retrieval research as a means of addressing the word mismatch
between queries
and documents. These techniques can be categorized as either global or local.
While global techniques rely on analysis of a whole collection to discover word
relationships, local techniques emphasize analysis of the top-ranked documents
retrieved for a query. While local techniques have been shown to be more
effective than global techniques in general, existing local techniques are not
robust and can seriously hurt retrieval when few of the retrieved documents are
relevant.
We propose a new technique, called local context analysis, which selects
expansion terms based on cooccurrence with the query terms within the
top-ranked documents. Experiments on a number of collections, both English and
non-English, show that local context analysis offers more effective and
consistent retrieval results. Keywords: cooccurrence, document analysis, feedback, global techniques, information
retrieval, local context analysis, local techniques | |||
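The central move in local context analysis is to score candidate expansion terms by their cooccurrence with query terms inside the top-ranked documents. The sketch below is a deliberately simplified stand-in: the scoring here (raw cooccurrence credit dampened by an idf-like factor) conveys the shape of the idea, not the paper's exact formula.

```python
import math
from collections import Counter

def expansion_terms(query, top_docs, all_docs, k=5):
    n = len(all_docs)
    df = Counter(t for d in all_docs for t in set(d))   # document frequency
    scores = Counter()
    for doc in top_docs:
        toks = set(doc)
        for cand in toks - set(query):
            # credit for cooccurring with query terms, discounted for
            # candidates that are common across the whole collection
            scores[cand] += sum(1 for q in query if q in toks) * \
                            math.log((n + 1) / (df[cand] + 1))
    return [t for t, _ in scores.most_common(k)]

docs = [["ibm", "stock", "market", "shares"],
        ["stock", "market", "trading", "shares"],
        ["apple", "fruit", "orchard"]]
print(expansion_terms(["stock", "market"], docs[:2], docs))
```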
| Fast and flexible word searching on compressed text | | BIBAK | Full-Text | 113-139 | |
| Edleno Silva de Moura; Gonzalo Navarro; Nivio Ziviani; Ricardo Baeza-Yates | |||
| We present a fast compression technique for natural language texts. The
novelties are that (1) decompression of arbitrary portions of the text can be
done very efficiently, (2) exact search for words and phrases can be done on
the compressed text directly, using any known sequential pattern-matching
algorithm, and (3) word-based approximate and extended search can also be done
efficiently without any decoding. The compression scheme uses a semistatic
word-based model and a Huffman code where the coding alphabet is byte-oriented
rather than bit-oriented. We compress typical English texts to about 30% of
their original size, against 40% and 35% for Compress and Gzip, respectively.
Compression time is close to that of Compress and approximately half of the
time of Gzip, and decompression time is lower than that of Gzip and one third
of that of Compress. We present three algorithms to search the compressed text.
They allow a large number of variations over the basic word and phrase search
capability, such as sets of characters, arbitrary regular expressions, and
approximate matching. Separators and stopwords can be discarded at search time
without significantly increasing the cost. When searching for simple words, the
experiments show that running our algorithms on a compressed text is twice as
fast as running the best existing software on the uncompressed version of the
same text. When searching complex or approximate patterns, our algorithms are
up to 8 times faster than the search on uncompressed text. We also discuss the
impact of our technique on inverted files pointing to logical blocks and argue
for the possibility of keeping the text compressed all the time, decompressing
only for displaying purposes. Keywords: compressed pattern matching, natural language text compression, word
searching, word-based Huffman coding | |||
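The property that makes direct search possible is that codewords are byte-aligned and flagged, so a compressed pattern can only match on codeword boundaries. The toy below uses a tagged byte code (high bit set on each codeword's first byte) and numbers words by first appearance rather than assigning true frequency-based Huffman codes, so it illustrates the search mechanics but not the compression ratios:

```python
def codeword(i):
    # base-128 digits of i, most significant first; tag the first byte
    digits = []
    while True:
        digits.append(i % 128)
        i //= 128
        if i == 0:
            break
    digits.reverse()
    digits[0] |= 0x80
    return bytes(digits)

def compress(words, vocab):
    out = bytearray()
    for w in words:
        if w not in vocab:
            vocab[w] = codeword(len(vocab))
        out += vocab[w]
    return bytes(out)

def find_phrase(text, pattern):
    # The pattern's first byte is tagged, so any hit starts on a codeword
    # boundary; we only need to check that it also ends on one.
    i = text.find(pattern)
    while i != -1:
        end = i + len(pattern)
        if end == len(text) or text[end] & 0x80:
            return i
        i = text.find(pattern, i + 1)
    return -1

vocab = {}
text = compress("to be or not to be".split(), vocab)
print(find_phrase(text, vocab["to"] + vocab["be"]))   # 0: phrase at start
```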
| Extending document management systems with user-specific active properties | | BIBAK | Full-Text | 140-170 | |
| Paul Dourish; W. Keith Edwards; Anthony LaMarca; John Lamping; Karin Petersen; Michael Salisbury; Douglas B. Terry; James Thornton | |||
| Document properties are a compelling infrastructure on which to develop
document management applications. A property-based approach avoids many of the
problems of traditional hierarchical storage mechanisms, reflects document
organizations meaningful to user tasks, provides a means to integrate the
perspectives of multiple individuals and groups, and does this all within a
uniform interaction framework. Document properties can reflect not only
categorizations of documents and document use, but also expressions of desired
system activity, such as sharing criteria, replication management, and
versioning. Augmenting property-based document management systems with active
properties that carry executable code enables the provision of document-based
services on a property infrastructure. The combination of document properties
as a uniform mechanism for document management, and active properties as a way
of delivering document services, represents a new paradigm for document
management infrastructures. The Placeless Documents system is an experimental
prototype developed to explore this new paradigm. It is based on the seamless
integration of user-specific, active properties. We present the fundamental
design approach, explore the challenges and opportunities it presents, and show
how our architecture deals with them. Keywords: active properties, component software, document management systems, document
services, user experience | |||
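A minimal sketch of the active-property idea, using hypothetical names (Document, save, the callable-valued property): properties are a flat bag attached to a document, and a property whose value is executable code runs when the relevant document event fires.

```python
class Document:
    def __init__(self, content, **properties):
        self.content = content
        self.properties = properties      # user-specific, per-document bag

    def save(self):
        # Active properties carry code; fire them on the save event.
        for name, value in self.properties.items():
            if callable(value):
                value(self, name)
        print("saved:", self.content[:24])

def replicate(doc, prop):
    print(f"[{prop}] shipping a copy to the backup store")

doc = Document("quarterly report draft", owner="alice", backup=replicate)
doc.save()
```

The point of the design is that services such as replication or sharing ride along with the document itself instead of living in one particular repository.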
| Efficient content-based indexing of large image databases | | BIBAK | Full-Text | 171-210 | |
| Essam A. El-Kwae; Mansur R. Kabuka | |||
| Large image databases have emerged in various applications in recent years.
A prime requisite of these databases is the means by which their contents can
be indexed and retrieved. A multilevel signature file called the Two Signature
Multi-level Signature File (2SMLSF) is introduced as an efficient access
structure for large image databases. The 2SMLSF encodes image information into
binary signatures and creates a tree structure that can be efficiently searched to
satisfy a user's query. Two types of signatures are generated. Type I
signatures are used at all tree levels except the leaf level and are based only
on the domain objects included in the image. Type II signatures, on the other
hand, are stored at the leaf level and are based on the included domain objects
and their spatial relationships. The 2SMLSF was compared analytically to
existing signature file techniques. The 2SMLSF significantly reduces the
storage requirements; the index structure can answer more queries; and the
2SMLSF performance significantly improves over current techniques. Both storage
reduction and performance improvement increase with the number of objects per
image and the number of images in the database. For an example large image
database, a storage reduction of 78% may be achieved while the performance
improvement may reach 98%. Keywords: content analysis and indexing, document managing, image databases, index
generation, multimedia databases | |||
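The signature machinery rests on superimposed coding: each feature sets a few bits in a fixed-width vector, signatures are OR-ed together, and a query can match only if its bits are a subset of the image's. The sketch below is illustrative only (bit width, bits per feature, and the hash are arbitrary choices, and the multilevel tree is omitted):

```python
import hashlib

BITS = 64

def sig(features):
    s = 0
    for f in features:
        h = int.from_bytes(hashlib.md5(f.encode()).digest()[:8], "big")
        for k in range(3):                 # set 3 bits per feature
            s |= 1 << ((h >> (k * 8)) % BITS)
    return s

def type1(objects):                        # upper tree levels: objects only
    return sig(objects)

def type2(objects, relations):             # leaf level: objects + spatial pairs
    return sig(list(objects) + [f"{a}:{r}:{b}" for a, r, b in relations])

image = type2(["house", "tree"], [("tree", "left_of", "house")])
query = type1(["tree"])
print("candidate" if (query & image) == query else "filtered out")
```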
| Chimera: hypermedia for heterogeneous software development environments | | BIBAK | Full-Text | 211-245 | |
| Kenneth M. Anderson; Richard N. Taylor; E. James Whitehead, Jr. | |||
| Emerging software development environments are characterized by
heterogeneity: they are composed of diverse object stores, user interfaces, and
tools. This paper presents an approach for providing hypermedia services in
this heterogeneous setting. Central notions of the approach include the
following: anchors are established with respect to interactive views of
objects, rather than the objects themselves; composable, n-ary links can be
established between anchors on different views of objects which may be stored
in distinct object bases; viewers may be implemented in different programming
languages; and, hypermedia services are provided to multiple, concurrently
active, viewers. The paper describes the approach, supporting architecture, and
lessons learned. Related work in the areas of supporting heterogeneity and
hypermedia data modeling is discussed. The system has been employed in a
variety of contexts including research, development, and education. Keywords: heterogeneous hypermedia, hypermedia system architectures, link servers,
open hypermedia systems, software development environments | |||
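A minimal data-model sketch of the two central notions, with hypothetical class names: anchors attach to a (viewer, object) pair rather than to the object itself, and links are n-ary sets of anchors that may span distinct object bases.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Anchor:
    viewer: str       # the interactive view, e.g. an editor
    obj: str          # the object viewed, possibly in another object base
    location: str     # viewer-specific selection, e.g. a line range

@dataclass
class Link:
    name: str
    anchors: set = field(default_factory=set)   # n-ary: any number of ends

a1 = Anchor("emacs", "repo1/parser.c", "lines 10-20")
a2 = Anchor("graph-viewer", "repo2/design.model", "node 7")
a3 = Anchor("browser", "http://example.org/spec", "#section-3")
link = Link("parser-design-spec", {a1, a2, a3})
print(len(link.anchors), "anchors in", link.name)
```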
| The maximum entropy approach and probabilistic IR models | | BIBAK | Full-Text | 246-287 | |
| Warren R. Greiff; Jay M. Ponte | |||
| This paper takes a fresh look at modeling approaches to information
retrieval that have been the basis of much of the probabilistically motivated
IR research over the last 20 years. We shall adopt a subjectivist Bayesian view
of probabilities and argue that classical work on probabilistic retrieval is
best understood from this perspective. The main focus of the paper will be the
ranking formulas corresponding to the Binary Independence Model (BIM),
presented originally by Robertson and Sparck Jones [1977] and the Combination
Match Model (CMM), developed shortly thereafter by Croft and Harper [1979]. We
will show how these same ranking formulas can result from a probabilistic
methodology commonly known as Maximum Entropy (MAXENT). Keywords: idf weighting, binary independence model, combination match, linked
dependence, probability ranking principle | |||
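For reference, the two ranking formulas the abstract names are usually written as follows, where $p_t$ and $q_t$ are the probabilities that term $t$ occurs in a relevant and a nonrelevant document, $N$ is the collection size, and $n_t$ the document frequency of $t$ (standard notation, not quoted from the paper):

```latex
% Binary Independence Model ranking formula:
\[
  \mathrm{score}(d) \;=\; \sum_{t \in q \cap d}
    \log \frac{p_t \, (1 - q_t)}{q_t \, (1 - p_t)}
\]
% Combination Match Model, obtained by holding p_t constant and
% estimating q_t \approx n_t / N, which yields an idf-like sum:
\[
  \mathrm{score}(d) \;=\; \sum_{t \in q \cap d}
    \left( C + \log \frac{N - n_t}{n_t} \right)
\]
```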
| Data integration using similarity joins and a word-based information representation language | | BIBA | Full-Text | 288-321 | |
| William W. Cohen | |||
| The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Here we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language "names" for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a "soft" database management system which supports "similarity joins," based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second. | |||
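The heart of a similarity join is scoring row pairs by TF-IDF cosine similarity of their textual names. The brute-force sketch below shows only that scoring step (WHIRL itself embeds it in a Datalog-style query language with pruning to avoid comparing all pairs); the example rows are made up:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(t for d in docs for t in d)
    vecs = []
    for d in docs:
        v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def similarity_join(left, right, top=3):
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    scored = [(sum(a[t] * b.get(t, 0.0) for t in a), l, r)
              for a, l in zip(lv, left) for b, r in zip(rv, right)]
    return sorted(scored, reverse=True)[:top]

for s, l, r in similarity_join(["ACME Software Inc", "Foo Bar Corp"],
                               ["ACME Software Incorporated", "Baz Industries"]):
    print(f"{s:.2f}  {l!r} ~ {r!r}")
```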
| Model-driven development of Web applications: the AutoWeb system | | BIBAK | Full-Text | 323-382 | |
| Piero Fraternali; Paolo Paolini | |||
| This paper describes a methodology for the development of WWW applications
and a tool environment specifically tailored for the methodology. The
methodology and the development environment are based upon models and
techniques already used in the hypermedia, information systems, and software
engineering fields, adapted and blended in an original mix. The foundation of
the proposal is the conceptual design of WWW applications, using HDM-lite, a
notation for the specification of structure, navigation, and presentation
semantics. The conceptual schema is then translated into a "traditional"
database schema, which describes both the organization of the content and the
desired navigation and presentation features. The WWW pages can therefore be
dynamically generated from the database content, following the navigation
requests of the user. A CASE environment, called the AutoWeb System, offers a
set of software tools that assist the design and the execution of a WWW
application in all its different aspects. Real-life experiences of the use of
the methodology and of the AutoWeb System in both the industrial and academic
context are reported. Keywords: HTML, WWW, application, development, intranet, modeling | |||
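A bare-bones rendering of the model-driven pipeline the abstract describes, reduced to two steps with invented names: a conceptual entity produces a relational table, and pages are generated on demand from the table's content. HDM-lite and AutoWeb additionally model navigation and presentation semantics, which this omits.

```python
import sqlite3

schema = {"Artist": ["name", "bio"]}          # toy conceptual schema

db = sqlite3.connect(":memory:")
for entity, attrs in schema.items():          # schema -> relational table
    db.execute(f"CREATE TABLE {entity} ({', '.join(attrs)})")
db.execute("INSERT INTO Artist VALUES ('Miles Davis', 'trumpeter')")

def render(entity):                           # rows -> dynamic HTML page
    rows = db.execute(f"SELECT * FROM {entity}").fetchall()
    items = "".join(f"<li>{' - '.join(r)}</li>" for r in rows)
    return f"<h1>{entity}</h1><ul>{items}</ul>"

print(render("Artist"))
```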
| Beneath the surface of organizational processes: a social representation framework for business process redesign | | BIBAK | Full-Text | 383-422 | |
| Gary Katzenstein; F. Javier Lerch | |||
| This paper raises the question, "What is an effective representation
framework for organizational process design?" By combining our knowledge of
existing process models with data from a field study, the paper develops
criteria for an effective process representation. Using these criteria and the
case study, the paper integrates the process redesign and information system
literatures to develop a representation framework that captures a process'
social context. The paper argues that this social context framework, which
represents people's motivations, social relationships, and social constraints,
gives redesigners a richer sense of the process and allows process redesigners
to simultaneously change social and logistic systems. The paper demonstrates
the framework and some of its benefits and limitations. Keywords: business process redesign, organizational change, process representation | |||
| Beyond intratransaction association analysis: mining multidimensional intertransaction association rules | | BIBAK | Full-Text | 423-454 | |
| Hongjun Lu; Ling Feng; Jiawei Han | |||
| In this paper, we extend the scope of mining association rules from
traditional single-dimensional intratransaction associations, to
multidimensional intertransaction associations. Intratransaction associations
are the associations among items within the same transaction, where the notion of
the transaction could be the items bought by the same customer, the events
that happened on the same day, and so on. However, an intertransaction association
describes the association relationships among different transactions, such as
"if(company) A's stock goes up on day 1, B's stock will go down on day 2, but
go up on day 4." In this case, whether we treat company or day as the unit of
transaction, the associated items belong to different transactions. Moreover,
such an intertransaction association can be extended to associate multiple
contextual properties in the same rule, so that multidimensional
intertransaction associations can be defined and discovered. A two-dimensional
intertransaction association rule example is "After McDonald's and Burger King
open branches, KFC will open a branch two months later and one mile away,"
which involves two dimensions: time and space. Mining intertransaction
associations poses more challenges on efficient processing than mining
intratransaction associations. Interestingly, intratransaction association can
be treated as a special case of intertransaction association from both a
conceptual and algorithmic point of view. In this study, we introduce the
notion of multidimensional intertransaction association rules, study their
measurements -- support and confidence -- and develop algorithms for mining
intertransaction associations by extension of Apriori. We overview our
experience using the algorithms on both real-life and synthetic data sets.
Further extensions of multidimensional intertransaction association rules and
potential applications are also discussed. Keywords: association rules, data mining, intra/intertransaction, multidimensional
context | |||
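The key representational move is the "extended item": an item paired with its offset along a dimension (here, days), counted over a sliding window. This brute-force sketch illustrates that idea on the abstract's stock example; the paper's actual algorithms extend Apriori with candidate generation and pruning over multiple dimensions.

```python
from collections import Counter
from itertools import combinations

def intertrans_pairs(days, maxspan=3, minsup=2):
    counts = Counter()
    for d0 in sorted(days):
        # extended items visible in the window starting at day d0
        ext = {(item, d - d0)
               for d in range(d0, d0 + maxspan + 1)
               for item in days.get(d, ())}
        for pair in combinations(sorted(ext), 2):
            counts[pair] += 1               # support over window positions
    return [(p, c) for p, c in counts.items() if c >= minsup]

# stock-movement events keyed by day
days = {1: ["A_up"], 2: ["B_down"], 4: ["A_up"], 5: ["B_down"]}
for rule, sup in intertrans_pairs(days):
    print(rule, "support", sup)             # (('A_up', 0), ('B_down', 1)) 2
```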