| Hyperform: A Hypermedia System Development Environment | | BIBAK | PDF | 1-31 | |
| Uffe K. Wiil; John J. Leggett | |||
| Development of hypermedia systems is a complex matter. The current trend
toward open, extensible, and distributed multiuser hypermedia systems further
complicates the development process. As a means of reducing this
complexity, there has been an increasing interest in hyperbase management
systems that allow hypermedia system developers to abstract from the
intricacies and complexity of the hyperbase layer and fully attend to
application and user interface issues. Design, development, and deployment
experiences of a dynamic, open, and distributed multiuser hypermedia system
development environment called Hyperform are presented. Hyperform is based on
the concepts of extensibility, tailorability, and rapid prototyping of
hypermedia system services. Open, extensible hyperbase management systems
permit hypermedia system developers to tailor hypermedia functionality for
specific applications and can serve as a platform for research. The Hyperform
development environment comprises multiple instances of four component
types: (1) a hyperbase management system server, (2) a tool integrator, (3)
editors, and (4) participating tools. Hyperform has been deployed in Unix
environments, and experiments have shown that Hyperform greatly reduces the
effort required to provide customized hyperbase management system support for
distributed multiuser hypermedia systems. Keywords: Design, Experimentation, Management, Advanced hypermedia system
architecture, Extensible hyperbase management system, Object-oriented extension
language, H.1.1 Models and principles, Systems and information theory, General
systems theory, H.2.1 Database management, Logical design, Data models, H.2.4
Database management, Systems, Distributed databases, H.2.8 Database management,
Database applications, H.3.4 Information storage and retrieval, Systems and
software, H.3.m Information storage and retrieval, Miscellaneous, Hypermedia | |||
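To make the extensibility claim concrete, here is a minimal sketch of a server whose services are installed and replaced at run time, which is the pattern the abstract describes. Hyperform itself builds on its own object-oriented extension language, not Python, and the class and service names below are invented for illustration.

```python
# A sketch of run-time tailorability: a hyperbase server whose services
# can be added or swapped while it runs. Names are illustrative only.

class HyperbaseServer:
    def __init__(self):
        self.services = {}

    def install(self, name, service):
        """Tailor the server by installing a service at run time."""
        self.services[name] = service

    def call(self, name, *args):
        return self.services[name](*args)

store = {}
server = HyperbaseServer()
server.install("create_node", lambda nid, data: store.update({nid: data}))
server.install("get_node", lambda nid: store[nid])

server.call("create_node", "n1", {"text": "anchor source"})
print(server.call("get_node", "n1"))

# Rapid prototyping: later replace a service with a version-aware variant.
server.install("get_node", lambda nid: dict(store[nid], version=1))
```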
| A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems | | BIBAK | PDF | 32-66 | |
| Norbert Fuhr; Thomas Rölleke | |||
| We present a probabilistic relational algebra (PRA), a generalization of
standard relational algebra. In PRA, tuples are assigned
probabilistic weights giving the probability that a tuple belongs to a
relation. Based on intensional semantics, the tuple weights of the result of a
PRA expression always conform to the underlying probabilistic model. We also
show for which expressions extensional semantics yields the same results.
Furthermore, we discuss complexity issues and indicate possibilities for
optimization. With regard to databases, the approach allows for representing
imprecise attribute values, whereas for information retrieval, probabilistic
document indexing and probabilistic search term weighting can be modeled. We
introduce the concept of vague predicates which yield probabilistic weights
instead of Boolean values, thus allowing for queries with vague selection
conditions. With these features, PRA implements uncertainty and vagueness in
combination with the relational model. Keywords: Theory, Hypertext, Retrieval, Imprecise data, Logical retrieval model,
Probabilistic retrieval, Relational data model, Uncertain data, Vague
predicates, H.3.3 Information storage and retrieval, Information search and
retrieval, Retrieval models, H.2.1 Database management, Logical design, Data
models | |||
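The algebra lends itself to a small illustration. The sketch below attaches a membership probability to every tuple and combines weights by multiplication under an independence assumption, i.e., the extensional semantics; as the abstract notes, this agrees with the intensional definition only for certain expressions. Relation contents and the vague predicate are fabricated.

```python
# Toy probabilistic relations: each tuple carries the probability that it
# belongs to its relation. Weights combine by multiplication, i.e., the
# *extensional* semantics; the article's PRA is defined intensionally via
# event expressions, and the two agree only for certain expressions.

def select(rel, pred):
    """Apply a (possibly vague) predicate; its weight scales the tuple weight."""
    out = {}
    for tup, p in rel.items():
        w = pred(tup)                 # vague predicate: weight in [0, 1]
        if p * w > 0:
            out[tup] = p * w
    return out

def join(r, s, i, j):
    """Equi-join on attribute positions i and j, assuming independence."""
    return {t1 + t2: p1 * p2
            for t1, p1 in r.items()
            for t2, p2 in s.items() if t1[i] == t2[j]}

# Probabilistic document indexing: P(term is a good index term for doc).
indexed = {("d1", "databases"): 0.9, ("d2", "retrieval"): 0.8}
published = {("d1", 1996): 1.0, ("d2", 1990): 1.0}

# Vague selection: prefer publication years close to 1997.
def about_1997(tup):
    return max(0.0, 1.0 - abs(tup[1] - 1997) / 10.0)

print(select(published, about_1997))     # weights 0.9 and ~0.3
print(join(indexed, published, 0, 0))    # joined tuples keep product weights
```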
| Customizing Information Capture and Access | | BIBAK | PDF | 67-101 | |
| Daniela Rus; Devika Subramanian | |||
| This article presents a customizable architecture for software agents that
capture and access information in large, heterogeneous, distributed electronic
repositories. The key idea is to exploit underlying structure at various
levels of granularity to build high-level indices with task-specific
interpretations. Information agents construct such indices and are configured
as a network of reusable modules called structure detectors and segmenters. We
illustrate our architecture with the design and implementation of smart
information filters in two contexts: retrieving stock market data from Internet
newsgroups and retrieving technical reports from Internet FTP sites. Keywords: Algorithms, Documentation, Experimentation, Standardization, H.3.3
Information storage and retrieval, Information search and retrieval, Selection
process, H.3.4 Information storage and retrieval, Systems and software, Current
awareness systems, Information networks | |||
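A hypothetical two-stage pipeline suggests the flavor of the architecture: a segmenter carves a raw stream into units, and a structure detector gives the units a task-specific interpretation, here recovering stock quotes from a posting. Module names and the quote format are illustrative, not the authors' code.

```python
import re

# A made-up information agent in the spirit of the article: a "segmenter"
# splits a raw stream into candidate units, and a "structure detector"
# interprets each unit for the task at hand.

def paragraph_segmenter(text):
    """Segment a newsgroup posting into paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

QUOTE = re.compile(r"\b([A-Z]{1,5})\s+(\d+(?:\.\d+)?)\b")

def stock_quote_detector(segment):
    """Detect (ticker, price) structure inside one segment."""
    return [(sym, float(price)) for sym, price in QUOTE.findall(segment)]

def agent(text):
    """Compose the reusable modules into a filtering pipeline."""
    hits = []
    for seg in paragraph_segmenter(text):
        hits.extend(stock_quote_detector(seg))
    return hits

posting = "Market chatter...\n\nIBM 152.5 closed up today.\n\nSee you."
print(agent(posting))   # [('IBM', 152.5)]
```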
| Making a Digital Library: The Contents of the CORE Project | | BIBAK | PDF | 103-123 | |
| Richard Entlich; Lorrin Garson; Michael Lesk; Lorraine Normore; Jan Olsen; Stuart Weibel | |||
| The CORE (Chemical Online Retrieval Experiment) project is a library of
primary journal articles in chemistry. Any library has an inside and an
outside; in this article we describe the inside of the library and the methods
for building the system and accumulating the database. A later article will
describe the outside (user experiences). Among electronic-library projects,
the CORE project is unusual in that it has both ASCII derived from typesetting
and image data for all its pages, and among experimental electronic-library
projects, it is unusually large. We describe here (a) the processes of
scanning and analyzing about 400,000 pages of primary journal material, (b) the
conversion of a similar amount of textual database material, (c) the linking of
these two data sources, and (d) the indexing of the text material. Keywords: Algorithms, Design, Experimentation, Human factors, Image segmentation,
H.3.2 Information storage and retrieval, Information storage, H.3.6 Information
storage and retrieval, Library automation, Large text archives, H.5.2
Information interfaces and presentation, User interfaces, Interaction styles,
I.4.5 Image processing, Reconstruction | |||
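Step (c), linking the image and text sources, can be pictured as a join on a shared bibliographic key; the key fields and record shapes below are assumptions, and the project's actual linkage was certainly richer.

```python
# Hypothetical linkage of scanned-page records to converted text records
# via a (journal, volume, page) key. Field names are assumptions.

images = [{"journal": "JACS", "volume": 115, "page": 100, "tiff": "p100.tif"}]
texts  = [{"journal": "JACS", "volume": 115, "page": 100, "sgml": "a1.sgml"}]

def link(images, texts):
    by_key = {(t["journal"], t["volume"], t["page"]): t for t in texts}
    return [(i, by_key.get((i["journal"], i["volume"], i["page"])))
            for i in images]

print(link(images, texts))
```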
| A Text Compression Scheme that Allows Fast Searching Directly in the Compressed File | | BIBAK | PDF | 124-136 | |
| Udi Manber | |||
| A new text compression scheme is presented in this article. The main
purpose of this scheme is to speed up string matching by searching the
compressed file directly. The scheme requires no modification of the
string-matching algorithm, which is used as a black box; any string-matching
procedure can be used. Instead, the pattern is modified; only the outcome of
the matching of the modified pattern against the compressed file is
decompressed. Since the compressed file is smaller than the original file, the
search is faster both in terms of I/O time and processing time than a search in
the original file. For typical text files, we achieve about a 30% reduction in
space and slightly less in search time. A 30% space saving is not competitive
with good text compression schemes, so this scheme should not be used where
space is the predominant concern. The intended applications of this scheme are files
that are searched often, such as catalogs, bibliographic files, and address
books. Such files are typically not compressed, but with this scheme they can
remain compressed indefinitely, saving space while allowing faster search at
the same time. A particular application to an information retrieval system
that we developed is also discussed. Keywords: Algorithms, Performance, Data compression search, E.4 Data, Coding and
information theory, Data compaction and compression, F.2.2 Analysis of
algorithms and problem complexity, Nonnumerical algorithms and problems,
Pattern matching, H.3.3 Information storage and retrieval, Information search
and retrieval, Search process | |||
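A toy version of the idea: replace frequent character pairs with spare byte codes, compress the pattern with the same table, and run any ordinary string search over the compressed text. The pair table is made up, and the boundary cases the article handles carefully (a pattern character fusing with the preceding text character) are ignored here.

```python
# Searching compressed text directly: frequent pairs become single spare
# byte codes, and the *pattern* is compressed with the same table before
# an off-the-shelf string search. Pair table invented for illustration.

PAIRS = {("t", "h"): "\x80", ("i", "n"): "\x81", ("e", "r"): "\x82"}

def compress(s):
    out, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and (s[i], s[i + 1]) in PAIRS:
            out.append(PAIRS[(s[i], s[i + 1])])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

text = "the gathering in the north"
ctext = compress(text)                  # shorter than text
cpattern = compress("gathering")        # pattern modified the same way

# Any string-matching procedure works as a black box on the compressed data:
pos = ctext.find(cpattern)
print(len(ctext), "<", len(text), "match at compressed offset", pos)
```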
| The Effect of Accessing Nonmatching Documents on Relevance Feedback | | BIBAK | PDF | 137-153 | |
| Mark D. Dunlop | |||
| Traditional information retrieval (IR) systems only allow users access to
documents that match their current query, and therefore, users can only give
relevance feedback on matching documents (or those with a matching strength
greater than a set threshold). This article shows that, in systems that allow
access to nonmatching documents (e.g., hybrid hypertext and information
retrieval systems), the strength of the effect of giving relevance feedback
varies between matching and nonmatching documents. For positive feedback the
results shown here are encouraging, as they can be justified by an intuitive
view of the process. However, for negative feedback the results show behavior
that cannot easily be justified and that varies greatly depending on the model
of feedback used. Keywords: Experimentation, Theory, Free-text information retrieval, Hypertext,
Negative feedback, Probabilistic model, Relevance feedback, Vector space model,
H.3.3 Information storage and retrieval, Information search and retrieval,
Search process | |||
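The phenomenon is easy to reproduce with a standard Rocchio-style update, a textbook stand-in for the article's feedback models rather than its exact formulation: feedback on a document sharing no terms with the query behaves quite differently from feedback on a matching one.

```python
from collections import Counter

# Rocchio-style feedback in a vector-space model; alpha/beta/gamma are
# standard textbook weights, not the article's models.

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    new_q = Counter({t: alpha * w for t, w in query.items()})
    for d in rel_docs:
        for t, w in d.items():
            new_q[t] += beta * w / len(rel_docs)
    for d in nonrel_docs:
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrel_docs)
    return Counter({t: w for t, w in new_q.items() if w > 0})

query = Counter({"hypertext": 1.0})
matching = Counter({"hypertext": 0.8, "links": 0.6})
nonmatching = Counter({"retrieval": 0.9})   # shares no term with the query

# Positive feedback on the nonmatching document pulls entirely new terms
# into the query; on the matching one it mostly reweights existing terms.
print(rocchio(query, [nonmatching], []))
print(rocchio(query, [matching], []))
```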
| Access Control for Large Collections | | BIBAK | PDF | 154-194 | |
| H. M. Gladney | |||
| Efforts to place vast information resources at the fingertips of each
individual in large user populations must be balanced by commensurate attention
to information protection. For distributed systems with less-structured tasks,
more-diversified information, and a heterogeneous user set, the computing
system must administer enterprise-chosen access control policies. One kind of
resource is a digital library that emulates massive collections of paper and
other physical media for clerical, engineering, and cultural applications.
This article considers the security requirements for such libraries and
proposes an access control method that mimics organizational practice by
combining a subject tree with ad hoc role granting that controls privileges for
many operations independently, that treats all but one of the privileged roles
(e.g., auditor, security officer) like every other individual authorization,
and that binds access control information to objects indirectly for scaling,
flexibility, and reflexive protection. We sketch a realization and show that
it will perform well, generalizes many deployed and proposed access control
policies, and permits individual data centers to implement other models
economically and without disruption. Keywords: Management, Security, Access control, Digital library, Document, Electronic
library, Information security, C.2.4 Computer-communication networks,
Distributed systems, Distributed applications, C.2.4 Computer-communication
networks, Distributed systems, Distributed databases, D.2.0 Software
engineering, General, Protection mechanisms, D.4.6 Operating systems, Security
and protection, H.2.0 Database management, General, H.2.8 Database management,
Database applications, H.3.5 Information storage and retrieval, Online
information services, H.3.6 Information storage and retrieval, Library
automation, Large text archives | |||
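One ingredient, inheriting grants down a subject tree with roles treated like any other grantee, fits in a short sketch; the tree, privilege names, and grants below are illustrative only.

```python
# Subject-tree access control: a grant on an interior node covers its
# subtree, and roles are just another kind of grantee. All data invented.

PARENT = {"/chem/journals": "/chem", "/chem": "/", "/": None}

# (subject-tree node, grantee) -> set of privileges
GRANTS = {
    ("/chem", "alice"): {"read"},
    ("/chem/journals", "role:auditor"): {"read", "audit"},
}

def allowed(node, grantees, privilege):
    """Walk from the object's node to the root, honoring inherited grants."""
    while node is not None:
        for g in grantees:
            if privilege in GRANTS.get((node, g), set()):
                return True
        node = PARENT.get(node)
    return False

# alice inherits 'read' on the journals subtree from the grant on /chem:
print(allowed("/chem/journals", ["alice"], "read"))                 # True
print(allowed("/chem/journals", ["bob", "role:auditor"], "audit"))  # True
print(allowed("/chem", ["bob"], "read"))                            # False
```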
| Experiences with Selecting Search Engines using Metasearch | | BIBAK | PDF | 195-222 | |
| Daniel Dreilinger; Adele E. Howe | |||
| Search engines are among the most useful and high-profile resources on the
Internet. The problem of finding information on the Internet has been replaced
with the problem of knowing where search engines are, what they are designed to
retrieve, and how to use them. This article describes and evaluates
SavvySearch, a metasearch engine designed to intelligently select and interface
with multiple remote search engines. The primary metasearch issue examined is
the importance of carefully selecting and ranking remote search engines for
user queries. We studied the efficacy of SavvySearch's incrementally acquired
metaindex approach to selecting search engines by analyzing the effect of time
and experience on performance. We also compared the metaindex approach to the
simpler categorical approach and showed how much experience is required to
surpass the simple scheme. Keywords: Algorithms, Experimentation, Information retrieval, Machine learning, Search
engine, WWW, H.3.3 Information storage and retrieval, Information search and
retrieval, H.3.4 Information storage and retrieval, Systems and software | |||
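The metaindex idea can be caricatured as a per-(term, engine) score table nudged by user feedback; the update amounts and engine names are invented, and SavvySearch's actual scheme differs in detail.

```python
from collections import defaultdict

# An incrementally acquired metaindex: per (term, engine) effectiveness
# scores, raised when a user follows a result and lowered when an engine
# returns nothing for the query.

metaindex = defaultdict(float)          # (term, engine) -> score

def record_feedback(terms, engine, followed, no_results):
    for t in terms:
        if followed:
            metaindex[(t, engine)] += 1.0
        if no_results:
            metaindex[(t, engine)] -= 0.5

def rank_engines(terms, engines):
    """Rank engines by accumulated per-term evidence for this query."""
    return sorted(engines,
                  key=lambda e: sum(metaindex[(t, e)] for t in terms),
                  reverse=True)

record_feedback(["jazz", "mp3"], "engineA", followed=True, no_results=False)
record_feedback(["jazz"], "engineB", followed=False, no_results=True)
print(rank_engines(["jazz"], ["engineA", "engineB"]))  # engineA first
```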
| Data Structures for Efficient Broker Implementation | | BIBAK | PDF | 223-253 | |
| Anthony Tomasic; Luis Gravano; Calvin Lue; Peter Schwarz; Laura Haas | |||
| With the profusion of text databases on the Internet, it is becoming
increasingly hard to find the most useful databases for a given query. To
attack this problem, several existing and proposed systems employ brokers to
direct user queries, using a local database of summary information about the
available databases. This summary information must effectively distinguish
relevant databases and must be compact while allowing efficient access. We
offer evidence that one broker, GlOSS, can be effective at locating databases
of interest even in a system of hundreds of databases, and we examine the
performance of accessing the GlOSS summaries for two promising storage methods:
the grid file and partitioned hashing. We show that both methods can be tuned
to provide good performance for a particular workload (within a broad range of
workloads), and we discuss the tradeoffs between the two data structures. As a
side effect of our work, we show that grid files are more broadly applicable
than previously thought; in particular, we show that by varying the policies
used to construct the grid file we can provide good performance for a wide
range of workloads even when storing highly skewed data. Keywords: Algorithms, Measurement, Performance, Broker architecture, Broker
performance, Distributed information, GlOSS, Grid files, Partitioned hashing,
H.2.2 Database management, Physical design, Access methods, H.3.1 Information
storage and retrieval, Content analysis and indexing, Indexing methods, H.3.3
Information storage and retrieval, Information search and retrieval, Search
process, H.3.4 Information storage and retrieval, Systems and software,
Information networks | |||
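A GlOSS-style broker can be suggested by ranking databases on expected matches computed from per-database document frequencies under a term-independence assumption; the summary data here is fabricated.

```python
# Broker-side summaries: only document frequencies per database are kept,
# and databases are ranked by the expected number of documents matching
# *all* query terms, assuming term independence.

SUMMARIES = {
    "db1": {"size": 1000, "df": {"grid": 100, "file": 300}},
    "db2": {"size": 500,  "df": {"grid": 5,   "file": 400}},
}

def estimate_matches(db, terms):
    size, df = SUMMARIES[db]["size"], SUMMARIES[db]["df"]
    est = size
    for t in terms:
        est *= df.get(t, 0) / size
    return est

def rank_databases(terms):
    return sorted(SUMMARIES, key=lambda db: estimate_matches(db, terms),
                  reverse=True)

# db1: 1000 * (100/1000) * (300/1000) = 30 expected matches
# db2:  500 * (5/500)    * (400/500)  =  4 expected matches
print(rank_databases(["grid", "file"]))   # ['db1', 'db2']
```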
| Modeling Word Occurrences for the Compression of Concordances | | BIBAK | PDF | 254-290 | |
| A. Bookstein; S. T. Klein; T. Raita | |||
| An earlier paper developed a procedure for compressing concordances,
assuming that all elements occurred independently. The models introduced in
that paper are extended here to take the possibility of clustering into
account. The concordance is conceptualized as a set of bitmaps, in which the
bit locations represent documents, and the one-bits represent the occurrence of
given terms. Hidden Markov Models (HMMs) are used to describe the clustering
of the one-bits. However, for computational reasons, the HMM is approximated
by traditional Markov models. A set of criteria is developed to constrain the
allowable set of n-state models, and a full inventory is given for n <= 4.
Graph-theoretic reduction and complementation operations are defined among the
various models and are used to provide a structure relating the models studied.
Finally, the new methods were tested on the concordances of the English Bible
and of two of the world's largest full-text retrieval systems: the Trésor de la
Langue Française and the Responsa Project. Keywords: Algorithms, Performance, Classification of graph nodes, Concordance storage,
Concordance organization, Graph structure, E.2 Data, Data storage
representations, Composite structures, E.4 Data, Coding and information theory,
Data compaction and compression, F.1.2 Computation by abstract devices, Modes
of computation, G.2.2 Discrete mathematics, Graph theory, H.3.2 Information
storage and retrieval, Information storage, File organization | |||
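Why clustering matters for compression can be shown numerically: the ideal code length of a bursty bitmap is smaller under a first-order Markov model than under an independence model. The bitmap is made up, and the article develops much richer n-state models than this two-state sketch.

```python
import math

# Compare ideal code lengths of a term's document bitmap under a
# Bernoulli (independence) model and a first-order Markov model that
# captures runs of one-bits.

bitmap = "0000111100000000111110000000"

def bernoulli_bits(bits):
    p = bits.count("1") / len(bits)
    return -sum(math.log2(p if b == "1" else 1 - p) for b in bits)

def markov_bits(bits):
    # Transition counts with add-one smoothing; first bit coded separately.
    counts = {("0","0"):1, ("0","1"):1, ("1","0"):1, ("1","1"):1}
    for prev, cur in zip(bits, bits[1:]):
        counts[(prev, cur)] += 1
    cost = 1.0                                # one bit for the first symbol
    for prev, cur in zip(bits, bits[1:]):
        p = counts[(prev, cur)] / (counts[(prev, "0")] + counts[(prev, "1")])
        cost += -math.log2(p)
    return cost

print(f"independence: {bernoulli_bits(bitmap):.1f} bits, "
      f"Markov: {markov_bits(bitmap):.1f} bits")  # Markov is smaller here
```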
| Recursive Hashing Functions for n-Grams | | BIBAK | PDF | 291-320 | |
| Jonathan D. Cohen | |||
| Many indexing, retrieval, and comparison methods are based on counting or
cataloguing n-grams in streams of symbols. The fastest method of implementing
such operations is through the use of hash tables. Rapid hashing of
consecutive n-grams is best done using a recursive hash function, in which the
hash value of the current n-gram is derived from the hash value of its
predecessor. This article generalizes recursive hash functions found in the
literature and proposes new methods offering superior performance.
Experimental results demonstrate substantial speed improvement over
conventional approaches, while retaining near-ideal hash value distribution. Keywords: Algorithms, Experimentation, Theory, Hashing, Hashing functions, n-grams,
Recursive hashing, E.2 Data, Data storage representations, Hash-table
representations, G.2.1 Discrete mathematics, Combinatorics, Recurrences and
difference equations, G.3 Mathematics of computing, Probability and statistics,
Probabilistic algorithms, H.3.1 Information storage and retrieval, Content
analysis and indexing, Indexing methods, H.3.2 Information storage and
retrieval, Information storage, H.3.3 Information storage and retrieval,
Information search and retrieval | |||
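The recursion the article generalizes looks like this in a polynomial (Karp-Rabin-style) instance: each n-gram hash is derived from its predecessor's in constant time by removing the oldest symbol's contribution. The radix and modulus are arbitrary illustrative constants.

```python
# A minimal recursive (rolling) n-gram hash: the hash of the current
# n-gram is computed in O(1) from the hash of its predecessor.

RADIX, MOD = 257, (1 << 31) - 1

def ngram_hashes(symbols, n):
    """Yield (position, hash) for every n-gram, updating recursively."""
    if len(symbols) < n:
        return
    top = pow(RADIX, n - 1, MOD)       # weight of the outgoing symbol
    h = 0
    for s in symbols[:n]:
        h = (h * RADIX + s) % MOD
    yield 0, h
    for i in range(n, len(symbols)):
        h = ((h - symbols[i - n] * top) * RADIX + symbols[i]) % MOD
        yield i - n + 1, h

data = b"recursive hashing of n-grams"
table = {}
for pos, h in ngram_hashes(data, 4):
    table.setdefault(h, []).append(pos)   # catalogue n-grams by hash
print(len(table), "distinct hash values for", len(data) - 3, "4-grams")
```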
| On Automated Message Processing in Electronic Commerce and Work Support Systems: Speech Act Theory and Expressive Felicity | | BIBAK | PDF | 321-367 | |
| Steven O. Kimbrough; Scott A. Moore | |||
| Electronic messaging, whether in an office environment or for electronic
commerce, is normally carried out in natural language, even when supported by
information systems. For a variety of reasons, it would be useful if
electronic messaging systems could have semantic access to, that is, access to
the meanings and contents of, the messages they process. Given that natural
language understanding is not a practicable alternative, there remain three
approaches to delivering systems with semantic access: electronic data
interchange (EDI), tagged messages, and the development of a formal language
for business communication (FLBC). We favor the last of these. In this
article we compare and contrast these three approaches, present a theoretical
basis for an FLBC (using speech act theory), and describe a prototype
implementation. Keywords: Languages, Theory, Electronic commerce, Formal language for business
communication, Speech act theory, I.2.4 Artificial intelligence, Knowledge
representation formalisms and methods, Representation languages | |||
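The speech-act view can be miniaturized as messages that pair an illocutionary force with machine-readable propositional content, letting a system dispatch on the force without natural-language understanding; the forces and handler below are illustrative, not the authors' FLBC syntax.

```python
from dataclasses import dataclass
from typing import Any

# A toy speech-act message: the illocutionary force says what kind of act
# the message performs; the content is machine-readable.

@dataclass
class Message:
    sender: str
    force: str          # e.g. "assert", "request", "promise"
    content: Any        # propositional content

def handle(msg: Message):
    if msg.force == "request":
        return f"schedule action: {msg.content}"
    if msg.force == "assert":
        return f"update beliefs with: {msg.content}"
    if msg.force == "promise":
        return f"record commitment by {msg.sender}: {msg.content}"
    return "unknown force"

order = Message("buyer@acme", "request",
                {"act": "ship", "item": "widget", "qty": 100})
print(handle(order))
```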
| A Multilevel Approach to Intelligent Information Filtering: Model, System, and Evaluation | | BIBAK | PDF | 368-399 | |
| J. Mostafa; S. Mukhopadhyay; W. Lam; M. Palakal | |||
| In information-filtering environments, uncertainties associated with
changing interests of the user and the dynamic document stream must be handled
efficiently. In this article, a filtering model is proposed that decomposes
the overall task into subsystem functionalities and highlights the need for
multiple adaptation techniques to cope with uncertainties. A filtering system,
SIFTER, has been implemented based on the model, using established techniques
in information retrieval and artificial intelligence. These techniques include
document representation by a vector-space model, document classification by
unsupervised learning, and user modeling by reinforcement learning. The system
can filter information based on content and a user's specific interests. The
user's interests are automatically learned with only limited user intervention
in the form of optional relevance feedback for documents. We also describe
experimental studies conducted with SIFTER to filter computer and information
science documents collected from the Internet and commercial database services.
The experimental results demonstrate that the system performs very well in
filtering documents in a realistic problem setting. Keywords: Algorithms, Experimentation, Theory, Automated document representation,
Information filtering, User modeling, H.3.3 Information storage and retrieval,
Information search and retrieval, Clustering, H.3.3 Information storage and
retrieval, Information search and retrieval, Selection process, I.2.6
Artificial intelligence, Learning, I.7.3 Text processing, Index generation | |||
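The decomposition can be compressed into a few lines: vector-space documents, classification to the nearest cluster centroid, and a per-class interest profile adjusted from optional relevance feedback. The fixed centroids and learning rate below stand in for SIFTER's learned clusters and reinforcement scheme.

```python
import math

# Filtering pipeline sketch: vector documents, nearest-centroid classes,
# and a per-class interest weight nudged by optional feedback.

CENTROIDS = {"databases": {"sql": 1.0, "index": 0.8},
             "networks":  {"tcp": 1.0, "router": 0.7}}
interest = {"databases": 0.5, "networks": 0.5}   # learned user model

def cosine(u, v):
    dot = sum(u.get(t, 0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc):
    return max(CENTROIDS, key=lambda c: cosine(doc, CENTROIDS[c]))

def feedback(doc, relevant, rate=0.2):
    """Reinforce the interest weight of the document's class."""
    c = classify(doc)
    target = 1.0 if relevant else 0.0
    interest[c] += rate * (target - interest[c])

doc = {"sql": 0.9, "index": 0.4}
print(classify(doc), interest[classify(doc)])
feedback(doc, relevant=True)
print(interest)   # interest in 'databases' moved toward 1.0
```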
| Proximal Nodes: A Model to Query Document Databases by Content and Structure | | BIBAK | PDF | 400-435 | |
| Gonzalo Navarro; Ricardo Baeza-Yates | |||
| A model to query document databases by both their content and structure is
presented. The goal is to obtain a query language that is expressive in
practice while being efficiently implementable, a combination not achieved
in previous work. The key ideas of the model are a set-oriented
query language based on operations on nearby structure elements of one or more
hierarchies, together with content and structural indexing and bottom-up
evaluation. The model is evaluated in regard to expressiveness and efficiency,
showing that it provides a good trade-off between both goals. Finally, it is
shown how to include media other than text in the model. Keywords: Algorithms, Design, Human factors, Languages, Performance, Expressivity and
efficiency of query languages, Hierarchical documents, Structured text, Text
algebras, H.1.2 Information systems, User/machine systems, Human information
processing, H.2.1 Database management, Logical design, Data models, H.2.2
Database management, Physical design, Access methods, H.2.3 Database
management, Languages, Query languages, H.2.4 Database management, Systems,
Query processing, H.3 Information storage and retrieval, I.7.2 Text processing,
Document preparation, Format and notation, I.7.2 Text processing, Document
preparation, Languages and systems, I.7.2 Text processing, Document
preparation, Standards, I.7.3 Text processing, Document preparation, Index
generation | |||
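The model's flavor, set-oriented operators combining structural nodes with nearby content matches bottom-up, can be miniaturized as follows; the node layout and operator name are invented for illustration.

```python
# Structure nodes cover text segments; a content index yields match
# positions; a set-oriented operator relates the two bottom-up.

TEXT = "relational models ... query languages ... structured text"

# Structure hierarchy: each node covers a [start, end) segment of TEXT.
NODES = {
    "chapter1": (0, 21),
    "chapter2": (22, 57),
    "section2.1": (22, 39),
}

def content_matches(word):
    """Positions where the word occurs (the content index)."""
    out, i = [], TEXT.find(word)
    while i != -1:
        out.append(i)
        i = TEXT.find(word, i + 1)
    return out

def containing(node_names, positions):
    """Set-oriented operator: nodes whose segment contains a match."""
    return {n for n in node_names
            if any(NODES[n][0] <= p < NODES[n][1] for p in positions)}

# "Which chapters talk about queries?" -- content and structure combined:
print(containing(["chapter1", "chapter2"], content_matches("query")))
```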