| Predicate rewriting for translating Boolean queries in a heterogeneous information system | | BIBAK | Full-Text | 1-39 | |
| Chen-Chuan K. Chang; Hector Garcia-Molina; Andreas Paepcke | |||
| Searching over heterogeneous information sources is difficult in part
because of the nonuniform query languages. Our approach is to allow users to
compose Boolean queries in one rich front-end language. For each user query and
target source, we transform the user query into a subsuming query that can be
supported by the source but that may return extra documents. The results are
then processed by a filter query to yield the correct final results. In this
article we introduce the architecture and associated mechanism for query
translation. In particular, we discuss techniques for rewriting predicates in
Boolean queries into native subsuming forms, which is a basis of translating
complex queries. In addition, we present experimental results for evaluating
the cost of postfiltering. We also discuss the drawbacks of this approach and
cases when it may not be effective. We have implemented prototype versions of
these mechanisms and demonstrated them on heterogeneous Boolean systems. Keywords: Boolean queries, content-based retrieval, filtering, predicate rewriting,
query subsumption, query translation | |||
| Methods for information server selection | | BIBAK | Full-Text | 40-76 | |
| David Hawking; Paul Thistlewaite | |||
| The problem of using a broker to select a subset of available information
servers in order to achieve a good trade-off between document retrieval
effectiveness and cost is addressed. Server selection methods which are capable
of operating in the absence of global information, and where servers have no
knowledge of brokers, are investigated. A novel method using Lightweight Probe
queries (LWP method) is compared with several methods based on data from past
query processing, while Random and Optimal server rankings serve as controls.
Methods are evaluated, using TREC data and relevance judgments, by computing
ratios, both empirical and ideal, of recall and early precision for the subset
versus the complete set of available servers. Estimates are also made of the
best-possible performance of each of the methods. LWP and Topic Similarity
methods achieved best results, each being capable of retrieving about 60% of
the relevant documents for only one-third of the cost of querying all servers.
Subject to the applicable cost model, the LWP method is likely to be preferred
because it is suited to dynamic environments. The good results obtained with a
simple automatic LWP implementation were replicated using different data and a
larger set of query topics. Keywords: Lightweight Probe queries, information servers, network servers, server
ranking, server selection, text retrieval | |||
| The equalizing impact of a group support system on status differentials | | BIBAK | Full-Text | 77-100 | |
| Bernard C. Y. Tan; Kwok-kee Wei; Richard T. Watson | |||
| This study investigates the impact of the electronic communication
capability of a group support system (GSS) on status differentials in small
groups. A laboratory experiment was used to answer the research questions.
Three support levels were studied: manual, face-to-face GSS, and dispersed GSS.
Two task types were examined: intellective and preference. Five dependent
variables reflecting different aspects of status differentials were measured:
status influence, sustained influence, residual disagreement, perceived
influence, and decision confidence. The results show that manual groups had
higher status influence, sustained influence, and decision confidence, but
lower residual disagreement than face-to-face GSS and dispersed GSS groups.
Preference task groups also produced higher status influence and sustained
influence, but lower residual disagreement compared to intellective task
groups. In addition, manual groups working on the preference task reported
higher perceived influence than face-to-face GSS and dispersed GSS groups
working on the same task. These findings suggest that when groups are engaged
in activities for which status differentials are undesirable, a GSS can be used
in both face-to-face and dispersed settings to dampen status differentials.
Moreover, when a task amplifies status differentials, the use of a GSS tends to
produce corresponding stronger dampening effects. Keywords: electronic communication, group support systems, status differentials, task
type | |||
| A flexible authorization mechanism for relational data management systems | | BIBAK | Full-Text | 101-140 | |
| Elisa Bertino; Sushil Jajodia; Pierangela Samarati | |||
| In this article, we present an authorization model that can be used to
express a number of discretionary access control policies for relational data
management systems. The model permits both positive and negative authorizations
and supports exceptions at the same time. The model is flexible in that the
users can specify, for each authorization they grant, whether the authorization
can allow for exceptions or whether it must be strongly obeyed. It provides
authorization management for groups with exceptions at any level of the group
hierarchy, and temporary suspension of authorizations. The model supports
ownership together with decentralized administration of authorizations.
Administrative privileges can also be restricted so that owners retain control
over their tables. Keywords: access control mechanism, access control policy, authorization, data
management system, group management support, relational database | |||
| Context-sensitive learning methods for text categorization | | BIBAK | Full-Text | 141-173 | |
| William W. Cohen; Yoram Singer | |||
| Two recently implemented machine-learning algorithms, RIPPERand
sleeping-experts for phrases, are evaluated on a number of large text
categorization problems. These algorithms both construct classifiers that allow
the "context" of a word w to affect how (or even whether) the presence or
absence of w will contribute to a classification. However, RIPPER and
sleeping-experts differ radically in many other respects: differences include
different notions as to what constitutes a context, different ways of combining
contexts to construct a classifier, different methods to search for a
combination of contexts, and different criteria as to what contexts should be
included in such a combination. In spite of these differences, both RIPPER and
sleeping-experts perform extremely well across a wide variety of categorization
problems, generally outperforming previously applied learning methods. We view
this result as a confirmation of the usefulness of classifiers that represent
contextual information. Keywords: context-sensitive models, mistake-driven algorithms, on-line learning, rule
learning, text categorization | |||
| A robust framework for content-based retrieval by spatial similarity in image databases | | BIBAK | Full-Text | 174-198 | |
| Essam A. El-Kwae; Mansur R. Kabuka | |||
| A framework for retrieving images by spatial similarity (FRISS) in ima ge
databases is presented. In this framework, a robust retrieval by spatial
similarity (RSS) algorithm is defined as one that incorporates both directional
and topological spatial constraints, retrieves similar images, and recognized
images even after they undergo translation, scaling, rotation (both perfect and
multiple), or any arbitrary combination of transformations. The FRISS framework
is discussed and used as a base for comparing various existing RSS algorithms.
Analysis shows that none of them satisfies all the FRISS specifications. An
algorithm, SIMdtc, is then presented. SIMdtc introduces the concept of a
rotation correction angle (RCA) to align objects in one image spatially closer
to matching objects in another image for more accurate similarity assessment.
Similarity between two images is a function of the number of common objects
between them and the closeness of directional and topological spatial
relationships between object pairs in both images. The SIMdtc retrieval is
invariant under translation, scaling, and perfect rotation, and the algorithm
is able to rank multiple rotation variants. The algorithm was tested using
synthetic images and the TESSA image database. Analysis shows the robustness of
the SIMdtc algorithm over current algorithms. Keywords: content-based retrieval, image databases, multimedia databases, query
formulation, retrieval models, similarity retrieval, spatial similarity | |||
| Incremental formalization with the hyper-object substrate | | BIBA | Full-Text | 199-227 | |
| Frank M., III Shipman; Raymond J. McCall | |||
| Computers require formally represented information to perform computations that support users; yet users who have needed such support have often proved to be unable or unwilling to formalize it. To address this problem, this article introduces an approach called incremental formalization, in which, first, users express information informally and then the system aids them in formalizing it. Incremental formalization requires a system architecture the (1) integrates formal and informal representations and (2) supports progressive formalization of information. The system should have both tools to capture naturally available informal information and techniques to suggest possible formalizations of this information. The hyper-object substrate (HOS) was developed to satisfy these requirements. HOS has been applied to a number of problem domains, including network design, archeological site analysis, and neuroscience education. Users have been successful in adding informal information and then later formalizing it incrementally with the aid of the system. Our experience with HOS has reaffirmed the need for information spaces to evolve during use and has identified additional considerations in the design and instantiation of systems enabling and supporting incremental formalization. | |||
| A decision-theoretic approach to database selection in networked IR | | BIBAK | Full-Text | 229-249 | |
| Norbert Fuhr | |||
| In networked IR, a client submits a query to a broker, which is in contact
with a large number of databases. In order to yield a maximum number of
documents at minimum cost, the broker has to make estimates about the retrieval
cost of each database, and then decide for each database whether or not to use
it for the current query, and if, how many documents to retrieve from it. For
this purpose, we develop a general decision-theoretic model and discuss
different cost structures. Besides cost for retrieving relevant versus
nonrelevant documents, we consider the following parameters for each database:
expected retrieval quality, expected number of relevant documents in the
database and cost factors for query processing and document delivery. For
computing the overall optimum, a divide-and-conquer algorithm is given. If
there are several brokers knowing different databases, a preselection of
brokers can only be performed heuristically, but the computation of the optimum
can be done similarly to the single-broker case. In addition, we derive a
formula which estimates the number of relevant documents in a database based on
dictionary information. Keywords: networked retrieval, probabilistic retrieval, probability ranking principle,
resource discovery | |||
| A corpus analysis approach for automatic query expansion and its extension to multiple databases | | BIBAK | Full-Text | 250-269 | |
| Susan Gauch; Jianying Wang; Satya Mahesh Rachakonda | |||
| Searching online text collections can be both rewarding and frustrating.
While valuable information can be found, typically many irrelevant documents
are also retrieved, while many relevant ones are missed. Terminology mismatches
between the user's query and document contents are a main cause of retrieval
failures. Expanding a user's query with related words can improve search
performances, but finding and using related words is an open problem. This
research uses corpus analysis techniques to automatically discover similar
words directly from the contents of the databases which are not tagged with
part-of-speech labels. Using these similarities, user queries are automatically
expanded, resulting in conceptual retrieval rather than requiring exact word
matches between queries and documents. We are able to achieve a 7.6%
improvement for TREC 5 queries and up to a 28.5% improvement on the
narrow-domain Cystic Fibrosis collection. This work has been extended to
multidatabase collections where each subdatabase has a collection-specific
similarity matrix associated with it. If the best matrix is selected,
substantial search improvements are possible. Various techniques to select the
appropriate matrix for a particular query are analyzed, and a 4.8% improvement
in the results is validated. Keywords: query expansion | |||
| Context interchange: new features and formalisms for the intelligent integration of information | | BIBAK | Full-Text | 270-293 | |
| Cheng Hian Goh; Stefane Bressan; Stuart Madnick; Michael Siegel | |||
| The Context Interchange strategy presents a novel perspective for mediated
data access in which semantic conflicts among heterogeneous systems are not
identified a priori, but are detected and reconciled by a context mediator
through comparison of contexts axioms corresponding to the systems engaged in
data exchange. In this article, we show that queries formulated on shared
views, export schema, and shared "ontologies" can be mediated in the same way
using the Context Interchange framework. The proposed framework provides a
logic-based object-oriented formalsim for representing and reasoning about data
semantics in disparate systems, and has been validated in a prototype
implementation providing mediated data access to both traditional and web-based
information sources. Keywords: abductive reasoning, information integration, mediators, semantic
heterogeneity, semantic interoperability | |||
| Harp: a distributed query system for legacy public libraries and structured databases | | BIBAK | Full-Text | 294-319 | |
| Ee-Peng Lim; Ying Lu | |||
| The main purpose of a digital library is to facilitate users easy access to
enormous amount of globally networked information. Typically, this information
includes preexisting public library catalog data, digitized document
collections, and other databases. In this article, we describe the distributed
query system of a digital library prototype system known as HARP. In the HARP
project, we have designed and implemented a distributed query processor and its
query front-end to support integrated queries to preexisting public library
catalogs and structured databases. This article describes our experiences in
the design of an extended Sequel (SQL) query language known as HarpSQL. It also
presents the design and implementation of the distributed query system. Our
experience in distributed query processor and user interface design and
development will be highlighted. We believe that our prototyping effort will
provide useful lessons to the development of a complete digital library
infrastructure. Keywords: Internet databases, digital libraries, interoperable databases | |||
| Interface and data architecture for query preview in networked information systems | | BIBAK | Full-Text | 320-341 | |
| Catherine Plaisant; Ben Shneiderman; Khoa Doan; Tom Bruns | |||
| There are numerous problems associated with formulating queries on networked
information systems. These include increased data volume and complexity,
accompanied by slow network access. This article proposes a new approach to a
network query user interfaces that consists of two phases: query preview and
query refinement. This new approach is based on the concepts of dynamic queries
and query previews, which guides users in rapidly and dynamically eliminating
undesired records, reducing the data volume to a manageable size, and refining
queries locally before submission over a network. Examples of two applications
are given: a Restaurant Finder and a prototype for NASA's Earth Observing
Systems Data Information Systems (EOSDIS). Data architecture is discussed, and
user feedback is presented. Keywords: EOSDIS, direct manipulation, dynamic query, graphical user interface, query
preview, query refinement, science data | |||
| Integrating geometrical and linguistic analysis for email signature block parsing | | BIBAK | Full-Text | 343-366 | |
| Hao Chen; Jianying Hu; Richard W. Sproat | |||
| The signature block is a common structured component found in email
messages. Accurate identification and analysis of signature blocks is important
in many multimedia messaging and information retrieval applications such as
email text-to-speech rendering, automatic construction of personal address
databases, and interactive message retrieval. It is also a very challenging
task, because signature blocks often appear in complex two-dimensional layouts
which are guided only by loose conventions. Traditional text analysis methods
designed to deal with sequential text cannot handle two-dimensional structures,
while the highly unconstrained nature of signature blocks makes the application
of two-dimensional grammars very difficult. In this article, we describe an
algorithm for signature block analysis which combines two-dimensional
structural segmentation with one-dimensional grammatical constraints. The
information obtained from both layout and linguistic analysis is integrated in
the form of weighted finite-state transducers. The algorithm is currently
implemented as a component in a preprocessing system for email text-to-speech
rendering. Keywords: email signature block, finite-state transducer, geometrical analysis,
linguistic analysis, text-to-speech rendering | |||
| PIC matrices: a computationally tractable class of probabilistic query operators | | BIBAK | Full-Text | 367-405 | |
| Warren R. Greiff; W. Bruce Croft; Howard Turtle | |||
| The inference network model of information retrieval allows a probabilistic
interpretation of query operators. In particular, Boolean query operators are
conveniently modeled as link matrices of the Bayesian Network. Prior work has
shown, however, that these operators do not perform as well as the pnorm
operators used for modeling query operators in the context of the vector space
model. This motivates the search for alternative probabilistic formulations for
these operators. The design of such alternatives must contend with the issue of
computational tractability, since the evaluation of an arbitrary operator
requires exponential time. We define a flexible class of link matrices that are
natural candidates for the implementation of query operators and an O(n2)
algorithm (n = the number of parent nodes) for the computation of probabilities
involving link matrices of this class. We present experimental results
indicating that Boolean operators implemented in terms of link matrices from
this class perform as well as pnorm operators in the context of the INQUERY
inference network. Keywords: Bayesian networks, Boolean queries, computational complexity, inference
networks, link matrices, piecewise linear functions, pnorm, probabilistic
information retrieval, query operators | |||
| Efficient passage ranking for document databases | | BIBAK | Full-Text | 406-439 | |
| Marcin Kaszkiel; Justin Zobel; Ron Sacks-Davis | |||
| Queries to text collections are resolved by ranking the documents in the
collection and returning the highest-scoring documents to the user. An
alternative retrieval method is to rank passages, that is, short fragments of
documents, a strategy that can improve effectiveness and identify relevant
material in documents that are too large for users to consider as a whole.
However, ranking of passages can considerably increase retrieval costs. In this
article we explore alternative query evaluation techniques, and develop new
techniques for evaluating queries on passages. We show experimentally that,
appropriately implemented, effective passage retrieval is practical in limited
memory on a desktop machine. Compared to passage ranking with adaptations of
current document ranking algorithms, our new "DO-TOS" passage-ranking algorithm
requires only a fraction of the resources, at the cost of a small loss of
effectiveness. Keywords: inverted files, passage retrieval, query evaluation, text databases, text
retrieval | |||
| The impact on retrieval effectiveness of skewed frequency distributions | | BIBAK | Full-Text | 440-465 | |
| Mark Sanderson; C. J. Van Rijsbergen | |||
| We present an analysis of word senses that provides a fresh insight into the
impact of word ambiguity on retrieval effectiveness with potential broader
implications for other processes of information retrieval. Using a methodology
of forming artifically ambiguous words, known as pseudowords, and through
reference to other researchers' work, the analysis illustrates that the
distribution of the frequency of occurrance of the senses of a word plays a
strong role in ambiguity's impact of effectiveness. Further investigation shows
that this analysis may also be applicable to other processes of retrieval, such
as Cross Language Information Retrieval, query expansion, retrieval of OCR'ed
texts, and stemming. The analysis appears to provide a means of explaining, at
least in part, reasons for the processes' impact (or lack of it) on
effectiveness. Keywords: pseudowords, word sense ambiguity, word sense disambiguation | |||