| Libraries and Digital Property Rights | | BIBA | Full-Text | 1-10 | |
| Mark Stefik; Giuliana Lavendel | |||
| The realization of the digital library -- a computer system to enable anyone with a workstation to have access to any of the published works of mankind -- has stayed out of reach because of a presumed technical problem: once a written work is digitized, it becomes so easy to make and distribute copyright-infringing copies that publishers would go out of business. A technical solution to this problem, based on trusted systems and digital property rights, is now becoming available. The big issues for libraries -- social and institutional policy challenges -- are still ahead. | |||
| Object Database Support for Digital Libraries | | BIBA | Full-Text | 11-23 | |
| Serge Abiteboul | |||
| In this paper, we discuss some aspects of database support for digital libraries.
From a DL perspective, database systems, and in particular object database systems, provide a solid basis for future DL systems. More generally, database research provides solutions to many DL issues, even if these solutions are partial or fragmented. Where possible, work should not be duplicated, and good software and ideas should be reused. From a DB perspective, we want to stress that digital libraries pose compelling applications and challenges for DBMS technology. They suggest a number of improvements to DBMSs that could be beneficial beyond DL applications. | |||
| Enhancing Community and Collaboration in the Virtual Library | | BIBAK | Full-Text | 25-40 | |
| Rob Procter; Andy McKinlay; Ana Goldenberg; Elisabeth Davenport; Peter Burnhill; Sheila Cannell | |||
| The advent of the virtual library is usually presented as a welcome development for library users. Unfortunately, the emphasis often placed upon convenience of access tends to reinforce the perception that using information resources is a solitary activity. In fact, information retrieval (IR) in the conventional library is often a highly collaborative activity, involving users' peers and experts such as librarians. If the design of virtual library services fails to take into account the ways in which physical spaces help engender a sense of community and facilitate collaboration, their users will be denied timely and effective access to valuable sources of assistance.
We report an investigation of collaboration issues in IR. We begin by defining a generic model of collaboration and of collaborative spaces, and then describe the design of a prototype multimedia-based system intended to foster a sense of community and collaboration among its users. Keywords: information retrieval; collaboration; virtual library | |||
| Comprehension and Object Recognition Capabilities for Presentations of Simultaneous Video Key Frame Surrogates | | BIBA | Full-Text | 41-54 | |
| Laura A. Slaughter; Ben Shneiderman; Gary Marchionini | |||
| The demand for more efficient browsing of video data is expected to increase as greater access to this type of data becomes available. This experiment examined one technique for displaying video data: key frame surrogates presented as a "slide show". Subjects viewed key frames for between one and four video clips simultaneously. Following this presentation, the subjects performed object recognition and gist comprehension tasks in order to determine human thresholds for divided attention across these multiple displays. We expected subject performance to degrade as the number of simultaneous slide shows increased. For both object recognition and gist comprehension tasks, performance decreased between the one-slide-show display and the two, three, or four slide show displays. With two or three simultaneous presentations, performance was about the same, and object recognition and comprehension of the video clips remained adequate. Performance dropped to unacceptable levels when four slide shows were displayed at once. | |||
| Automating the Construction of Authority Files in Digital Libraries: A Case Study | | BIBA | Full-Text | 55-71 | |
| James C. French; Allison L. Powell; Eric Schulman; John L. Pfaltz | |||
| The issue of quality control has become increasingly important as more online databases are integrated into digital libraries, since data quality can have a dramatic effect on the search effectiveness of an online system. Authority work, the need to discover and reconcile variant forms of strings in bibliographic entries, will become more difficult. Spelling variants, misspellings, and translation and transliteration differences all increase the difficulty of retrieving information. This paper is a case study of our efforts to automate the creation of an authority file for authors' institutional affiliations in the Astrophysics Data System. The techniques surveyed here for the detection and categorization of variant forms have broader applicability and may be used to help automate authority work for other bibliographic fields. | |||
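As a concrete illustration of variant-form detection, here is a minimal, hypothetical sketch, not the authors' ADS procedure: it clusters affiliation strings by normalization and approximate string matching, with all names and the 0.8 threshold invented for the example.
```python
# Hypothetical sketch: grouping variant affiliation strings under one
# canonical authority record via normalization and approximate matching.
import re
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    s = re.sub(r"[^\w\s]", " ", s.lower())
    return " ".join(s.split())

def build_authority_file(affiliations, threshold=0.8):
    """Cluster variant strings; the first member of a cluster is canonical."""
    clusters = []  # list of (canonical_normalized_form, [raw variants])
    for raw in affiliations:
        norm = normalize(raw)
        for canon, variants in clusters:
            if SequenceMatcher(None, norm, canon).ratio() >= threshold:
                variants.append(raw)
                break
        else:
            clusters.append((norm, [raw]))
    return clusters

variants = [
    "Harvard-Smithsonian Center for Astrophysics",
    "Harvard Smithsonian Ctr. for Astrophysics",
    "Univ. of Virginia",
    "University of Virginia",
]
for canon, members in build_authority_file(variants):
    print(canon, "->", members)
```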
| Using Semantic, Geographical, and Temporal Relationships to Enhance Search and Retrieval in Digital Catalogs | | BIBAK | Full-Text | 73-86 | |
| Klaus Tochtermann; Wolf-Fritz Riekert; Gerlinde Wiest; Jürgen Seggelke; Birgit Mohaupt-Jahr | |||
| The amount and quality of information available on the Internet increase steadily. To search for information, users are provided with search engines which often return unsatisfactory results. Against this background, digital catalog systems are becoming more and more popular. Unlike such search engines, they contain information about information (meta-information) available on the Internet or in the holdings of digital libraries, but not the information itself. Users can benefit from these systems in two ways, depending on what information is modeled in them: firstly, these systems allow for new types of queries; secondly, the quality of retrieval results is improved. This paper sets out how semantic, geographical, and temporal relationships can be integrated into digital catalog systems and how these relationships can be used to enhance search and retrieval in such systems. The presentation covers both the underlying concepts and a comprehensive description of a digital catalog system which is already used by environmental agencies. Keywords: Digital Catalog Systems; Semantic; Geographical; Temporal Relationships; German Environmental Information Network | |||
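To make the idea of geographically and temporally enhanced catalog queries concrete, here is a toy sketch; the record fields, coordinates, and query shown are invented and far simpler than the deployed system.
```python
# Illustrative only: a minimal metadata record carrying geographic and
# temporal extents, queried by overlap rather than by keyword alone.
from dataclasses import dataclass

@dataclass
class CatalogRecord:
    title: str
    bbox: tuple       # (min_lon, min_lat, max_lon, max_lat)
    years: tuple      # (start_year, end_year)
    broader: str = "" # semantic relationship to a broader topic, if any

def bbox_overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(records, bbox, years):
    """Return records whose spatial and temporal extents intersect the query."""
    return [r for r in records
            if bbox_overlaps(r.bbox, bbox)
            and r.years[0] <= years[1] and years[0] <= r.years[1]]

records = [
    CatalogRecord("Rhine water quality 1990-1995", (6.0, 47.0, 9.0, 52.0),
                  (1990, 1995), broader="water pollution"),
    CatalogRecord("Alpine air measurements", (9.5, 46.0, 13.0, 47.5), (1996, 1997)),
]
print(search(records, bbox=(5.0, 46.5, 8.0, 50.0), years=(1994, 1996)))
```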
| Metadata Repositories using PICS | | BIBA | Full-Text | 87-98 | |
| Renato Iannella | |||
| Metadata is 'information about data': metadata describes some aspect of data on the Internet. There has recently been significant activity on defining the semantic and technical aspects of metadata for use on the Internet and the WWW. A number of metadata sets have been proposed, together with the technological framework to support the interchange of metadata. These initiatives will have a dramatic effect on how the Web is indexed and will substantially improve the discovery of resources on the Internet. This paper discusses the provision of a mechanism for a registry of metadata schemas. A proposal using an enhanced version of PICS is presented, which will enable global interoperability across various extensible metadata sets. | |||
| Relevance Feedback and Query Expansion for Searching the Web: A Model for Searching a Digital Library | | BIBA | Full-Text | 99-112 | |
| Alan F. Smeaton; Francis Crimmins | |||
| A fully operational large-scale digital library is likely to be based on a distributed architecture, so a number of independent search engines may be used to index different, overlapping portions of the library's contents. In any case, different media (text, audio, image, etc.) will be indexed for retrieval by different search engines, so techniques which provide a coherent and unified search over a suite of underlying independent search engines are likely to be an important part of navigating a digital library. In this paper we present an architecture and a system for searching the world's largest DL, the World Wide Web. What makes our system novel is that we use a suite of underlying web search engines to do the bulk of the work while our system orchestrates them in parallel to provide a higher level of information retrieval functionality. Thus it is our meta search engine, and not the underlying direct search engines, that provides the relevance feedback and query expansion options for the user. The paper presents the design and architecture of the system as implemented, describes an initial version which has been operational for almost a year, and outlines the operation of the advanced version. | |||
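A minimal sketch of the meta-search pattern the abstract describes, querying engines in parallel and layering relevance feedback on top; the stub engines and the naive term-frequency feedback are stand-ins, not the authors' implementation.
```python
# Fan a query out to several underlying engines concurrently, merge the
# ranked results, and expand the query from user-marked relevant documents.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def search_engine_a(query): return [("docA1", 0.9), ("docA2", 0.6)]  # stub
def search_engine_b(query): return [("docB1", 0.8), ("docA1", 0.5)]  # stub

ENGINES = [search_engine_a, search_engine_b]

def meta_search(query):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda engine: engine(query), ENGINES))
    merged = Counter()
    for results in result_lists:          # sum scores across engines
        for doc, score in results:
            merged[doc] += score
    return merged.most_common()

def expand_query(query, relevant_texts, k=3):
    """Naive relevance feedback: add the k most frequent new terms
    from documents the user judged relevant."""
    terms = Counter(w for t in relevant_texts for w in t.lower().split())
    for w in query.lower().split():
        terms.pop(w, None)
    return query + " " + " ".join(w for w, _ in terms.most_common(k))

print(meta_search("digital library"))
print(expand_query("digital library", ["library catalog metadata indexing"]))
```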
| Text Segmentation by Topic | | BIBA | Full-Text | 113-125 | |
| Jay M. Ponte; W. Bruce Croft | |||
| We investigate the problem of text segmentation by topic. Applications for this task include topic tracking of broadcast speech data and topic identification in full-text databases. Researchers have tackled similar problems before, but with different goals. This study focuses on data with relatively small segment sizes, in which within-segment sentences have relatively few words in common, making the problem challenging. We present a method for segmentation which makes use of a query expansion technique to find common features for the topic segments. Experiments with the technique show that it can be effective. | |||
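For orientation, a sketch of segmentation by lexical cohesion: boundaries are placed where vocabulary overlap between adjacent sentence blocks dips. The paper's method additionally uses query expansion to enrich sparse segments; that step is omitted here.
```python
# Place a topic boundary wherever the cosine similarity between the word
# counts of adjacent sentence windows falls below a threshold.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(sentences, window=2, threshold=0.1):
    """Return sentence indices at which a topic boundary is placed."""
    boundaries = []
    for i in range(window, len(sentences) - window + 1):
        left = Counter(w for s in sentences[i - window:i] for w in s.lower().split())
        right = Counter(w for s in sentences[i:i + window] for w in s.lower().split())
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries

sents = ["the game went to overtime", "the striker scored twice",
         "rain is expected tomorrow", "a cold front moves in tonight"]
print(segment(sents, window=2))  # boundary expected after sentence 2
```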
| Scalable Text Retrieval for Large Digital Libraries | | BIBA | Full-Text | 127-145 | |
| David Hawking | |||
| It is argued that digital libraries of the future will contain terabyte-scale collections of digital text and that full-text searching techniques will be required to operate over collections of this magnitude. Algorithms expected to be capable of scaling to these data sizes using clusters of modern workstations are described. First, basic indexing and retrieval algorithms operating, on a single workstation, at performance levels comparable to other leading systems over gigabytes of text are presented. Next, simple mechanisms for extending query processing capacity to much greater collection sizes are presented: to tens of gigabytes for single workstations and to terabytes for clusters of such workstations. Query-processing efficiency on a single workstation is shown to deteriorate dramatically when data size is increased above a certain multiple of physical memory size. By contrast, the number of clustered workstations necessary to maintain a constant level of service increases linearly with data size. Experiments using clusters of up to 16 workstations are reported. A non-replicated 20-gigabyte collection was indexed in just over 5 hours using a ten-workstation cluster, and scalability results are presented for query processing over replicated collections of up to 102 gigabytes. | |||
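A toy sketch of the document-partitioned query processing the abstract implies: each "workstation" indexes its own shard and a broker merges per-shard results. Compression, caching, and replication, essential in the real system, are omitted.
```python
# Each shard builds a local inverted index; the broker broadcasts the query
# and merges the per-shard top-k lists into a global ranking.
from collections import defaultdict
import heapq

class Shard:
    def __init__(self, docs):
        self.index = defaultdict(dict)   # term -> {doc_id: term_frequency}
        for doc_id, text in docs.items():
            for w in text.lower().split():
                self.index[w][doc_id] = self.index[w].get(doc_id, 0) + 1

    def query(self, terms, k=10):
        scores = defaultdict(int)
        for t in terms:
            for doc_id, tf in self.index.get(t, {}).items():
                scores[doc_id] += tf     # simplistic tf scoring
        return heapq.nlargest(k, scores.items(), key=lambda x: x[1])

def broker_query(shards, query, k=10):
    partials = [hit for s in shards for hit in s.query(query.lower().split(), k)]
    return heapq.nlargest(k, partials, key=lambda x: x[1])

shards = [Shard({"d1": "terabyte text retrieval"}),
          Shard({"d2": "text indexing at scale"})]
print(broker_query(shards, "text retrieval"))
```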
| Awareness Services for Digital Libraries | | BIBA | Full-Text | 147-171 | |
| Arturo Crespo; Hector Garcia-Molina | |||
| We propose an architecture for Digital Library repositories where one or more data stores persistently hold the digital objects (e.g., documents), and interact with clients that perform indexing, replication, intellectual property management, revenue management, and other functions. One of the most critical components in such stores is the awareness mechanism, used to notify clients of inserted, deleted or changed objects. In this paper we survey the various awareness schemes (including snapshot, timestamp and log based), describing them all as variations of a single unified scheme. This makes it possible to understand their relative differences and strengths. In particular we focus on a signature-based awareness scheme that we believe is especially well suited for digital libraries, and show enhancements to improve its performance. | |||
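As an illustration, a minimal timestamp-based awareness scheme, one of the variants the paper surveys: the store stamps every mutation, and a client asks for everything changed since its last poll. The signature-based scheme the authors favor is a refinement not shown here.
```python
# The store assigns a monotonically increasing logical timestamp to every
# insert, update, and delete; clients poll with their last-seen timestamp.
import itertools

class Store:
    def __init__(self):
        self.clock = itertools.count(1)
        self.objects = {}     # object_id -> (value, last_modified)
        self.deleted = {}     # object_id -> deletion_time

    def put(self, oid, value):
        self.objects[oid] = (value, next(self.clock))
        self.deleted.pop(oid, None)

    def delete(self, oid):
        if oid in self.objects:
            del self.objects[oid]
            self.deleted[oid] = next(self.clock)

    def changes_since(self, t):
        changed = [oid for oid, (_, ts) in self.objects.items() if ts > t]
        removed = [oid for oid, ts in self.deleted.items() if ts > t]
        return changed, removed

store = Store()
store.put("doc1", "v1"); store.put("doc2", "v1")
checkpoint = 2                        # client last polled at logical time 2
store.put("doc1", "v2"); store.delete("doc2")
print(store.changes_since(checkpoint))   # (['doc1'], ['doc2'])
```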
| Towards a Common Infrastructure for Large-scale Distributed Applications | | BIBA | Full-Text | 173-193 | |
| Christos Nikolaou; Manolis Marazakis; Dimitris Papadakis; Yiorgos Yeorgiannakis; Jakka Sairamesh | |||
| This paper discusses the requirements of current and emerging large-scale distributed applications and emphasizes the need for a common infrastructure to support them. A design for an infrastructure that aims at satisfying these requirements is presented. Moreover, it is shown how key aspects of important large-scale applications can exploit the services included in the proposed infrastructure. The paper concludes by discussing the current status of a prototype implementation and our research plan. | |||
| Machine Learning + On-line Libraries = IDL | | BIBA | Full-Text | 195-214 | |
| Giovanni Semeraro; Floriana Esposito; Donato Malerba; Nicola Fanizzi; Stefano Ferilli | |||
| One of the current issues faced by information professionals is that of building digital libraries. In this context, two key points are information capture, which involves complex pattern recognition problems, and the integration of different DBMS technologies, in order to connect many libraries into a single virtual library. This paper presents IDL, a prototypical intelligent digital library service. IDL addresses both of the problems mentioned above: the former, by integrating learning tools and techniques in order to make the task of capturing the information to be stored and indexed by content in a digital library effective, efficient, and economically feasible; the latter, by defining a metaquery language which provides for the interoperability of the various digital libraries to be connected. | |||
| Building a Multi-lingual Electronic Text Collection of Folk Tales as a Set of Encapsulated Document Objects: An Approach for Casual Users to Browse Multi-lingual Documents on the Fly | | BIBA | Full-Text | 215-231 | |
| Myriam Dartois; Akira Maeda; Takehisa Fujita; Tetsuo Sakaguchi; Shigeo Sugimoto; Koichi Tabata | |||
| Folk tales are an important heritage of every nation. Electronic text collections of folk tales are meaningful information resources for people who wish to learn about foreign cultures and their languages. This paper describes an electronic text collection of old folk tales developed using a multilingual document browsing system called the MHTML browser system, a gateway service that helps clients access and display WWW documents written in foreign or multiple languages that the client browser cannot display by itself. The MHTML browser system converts a WWW document into a form which contains the source text and the minimum set of font glyphs required to display the text. The converted document object is sent to the client with a set of applets which display the document on the client browser. Since the glyphs are sent to the client from the MHTML gateway, the client does not need to have the fonts for the multilingual document installed, provided that the client is Java-enabled. The folk tale collection currently includes ten old Japanese folk tales. Each tale is written in English, French, and Japanese, and the user can show the three texts of a tale simultaneously on his/her WWW browser, e.g., Netscape Navigator or Internet Explorer. Thus, a casual user with an off-the-shelf WWW browser can view a multilingual document on the fly without any additional procedures to set up his/her environment. In this paper, we first discuss the technological background of MHTML and the multilingual browser service for the digital library, as well as the issues involved in building the folk tale collection. | |||
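The core MHTML idea, shipping the text together with only the glyphs it actually uses, can be sketched as follows; the glyph "bitmaps" below are placeholder strings, whereas the real system sends font data for display by Java applets.
```python
# Bundle a document's source text with the minimum glyph set needed to
# render it, so the client needs no locally installed fonts.
def package_document(text, font):
    needed = sorted(set(text))                     # unique characters only
    missing = [c for c in needed if c not in font]
    if missing:
        raise ValueError(f"font lacks glyphs for: {missing}")
    return {"text": text, "glyphs": {c: font[c] for c in needed}}

# Hypothetical tiny "font": character -> glyph bitmap (stub strings here).
font = {c: f"<glyph:{c}>" for c in "むかし "}
doc = package_document("むかし むかし", font)
print(len(doc["glyphs"]), "glyphs shipped for", len(doc["text"]), "characters")
```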
| Automated Indexing with Thesaurus Descriptors: A Co-occurrence Based Approach to Multilingual Retrieval | | BIBA | Full-Text | 233-252 | |
| Reginald Ferber | |||
| Indexing documents with descriptors from a multilingual thesaurus is one approach to multilingual information retrieval. However, manual indexing is expensive. Automated indexing methods generally use terms found in the document. Thesaurus descriptors, by contrast, are complex terms that are often not used in documents or that have specific meanings within the thesaurus; most weighting schemes of automated indexing methods are therefore not suited to selecting thesaurus descriptors.
In this paper a linear associative system is described that uses similarity values extracted from a large corpus of manually indexed documents to construct a rank ordering of the descriptors for a given document title. The system is adaptive and has to be tuned with a training sample of records for the specific task. The system was tested on a corpus of some 80,000 bibliographic records. The results show high variability with changing parameter values, indicating that it is very important to empirically adapt the model to the specific situation in which it is used. The overall median rank of the manually assigned descriptors in the automatically generated ranked list of all 3,631 descriptors is 14 for the set used to adapt the system and 11 for a test set not used in the optimization process. This result shows that the optimization is not an overfitting to a specific training set but a real adaptation of the model to the setting. | |||
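A simplified sketch of the associative mechanism: learn term-descriptor co-occurrence weights from manually indexed records, then rank descriptors for a new title by summed association strength. The paper's system is tuned far more carefully; this shows only the principle.
```python
# Train term -> descriptor association weights from (title, descriptors)
# pairs, then score every descriptor for an unseen title.
from collections import defaultdict

def train(records):
    """records: list of (title, [descriptors]) pairs."""
    assoc = defaultdict(lambda: defaultdict(float))
    for title, descriptors in records:
        for term in set(title.lower().split()):
            for d in descriptors:
                assoc[term][d] += 1.0
    return assoc

def rank_descriptors(assoc, title):
    scores = defaultdict(float)
    for term in set(title.lower().split()):
        for d, w in assoc.get(term, {}).items():
            scores[d] += w
    return sorted(scores.items(), key=lambda x: -x[1])

training = [
    ("acid rain in european forests", ["air pollution", "forestry"]),
    ("monitoring rain water quality", ["water pollution", "monitoring"]),
]
model = train(training)
print(rank_descriptors(model, "rain damage to forests"))
```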
| Cross-Language Information Retrieval in a Multilingual Legal Domain | | BIBA | Full-Text | 253-268 | |
| Paraic Sheridan; Martin Braschler; Peter Schäuble | |||
| We describe here the application of a cross-language information retrieval technique based on similarity thesauri in the domain of Swiss law. We present the theory of similarity thesauri, which are information structures derived from corpora, and show how they can be used for cross-language retrieval. We also discuss the collections of Swiss legal documents and show how we have used them to construct an environment in which we can directly evaluate the performance of our cross-language retrieval system. Evaluation shows that, in the best case, cross-language retrieval works as well as monolingual retrieval. We conclude that providing cross-language access to digital libraries is already a viable possibility. | |||
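For illustration, a hedged sketch of deriving cross-language term associations from paired documents and using them to translate a query; the paper's similarity thesauri use proper weighting over comparable corpora, for which raw co-occurrence counts stand in here.
```python
# Count German/French term co-occurrences over paired documents, then
# replace each query term with its strongest target-language associates.
from collections import defaultdict

def build_thesaurus(aligned_pairs):
    """aligned_pairs: (german_text, french_text) document pairs."""
    co = defaultdict(lambda: defaultdict(int))
    for de, fr in aligned_pairs:
        for s in set(de.lower().split()):
            for t in set(fr.lower().split()):
                co[s][t] += 1
    return co

def translate_query(thesaurus, query, k=2):
    out = []
    for term in query.lower().split():
        best = sorted(thesaurus.get(term, {}).items(), key=lambda x: -x[1])[:k]
        out.extend(t for t, _ in best)
    return " ".join(out)

pairs = [("bundesgericht urteil", "tribunal federal arret"),
         ("urteil steuerrecht", "arret droit fiscal")]
th = build_thesaurus(pairs)
print(translate_query(th, "urteil"))   # 'arret' ranks first
```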
| The Digital Library and Computational Philology: The BAMBI Project | | BIBAK | Full-Text | 269-285 | |
| Andrea Bozzi; Sylvie Calabretto | |||
| The work presented in this paper has been developed within a European project called BAMBI. It enhances the accessibility of ancient manuscripts and presents new ways of working with them. More precisely, the BAMBI project aims to produce a software tool allowing historians, and more particularly codicologists and philologists, to read manuscripts, write annotations, and navigate between the words of the transcription and the matching piece of image in the digitized picture of the manuscript.
In the first part of this paper, we present the functions and the design of a Hypermedia Workstation. In the second part, we describe how HyTime (Hypermedia/Time-based Structuring Language) can be used as a modelling language to describe work on manuscripts (annotations, links, ...). We present relevant parts of the HyTime model and demonstrate that the model thus obtained can also serve as a basis for implementation. Keywords: Ancient Manuscripts; Digital Library; Hypermedia; HyTime; Philological Workstation | |||
| Multivalent Annotations | | BIBA | Full-Text | 287-303 | |
| Thomas A. Phelps; Robert Wilensky | |||
| Paper is still preferred to digital document systems for tasks involving annotating, folding, juxtaposing, or otherwise treating the document as a tactile object. Based on the Multivalent Documents model, Multivalent annotations bring an open-ended variety of user-extensible, sharable manipulations to digital documents of potentially any source format, from PostScript to SGML. Several very different forms of distributed annotation based on this model have been implemented. The Multivalent framework composes together annotations of any type, which can result in novel, useful combinations. | |||
| A Semantic Network Approach to Semi-Structured Documents Repositories | | BIBA | Full-Text | 305-324 | |
| Vassilis Christophides; Martin Doerr; Irini Fundulaki | |||
| Using database technology for the administration of digital libraries offers many advantages in a multi-user and distributed environment. However, conventional DBMSs are not particularly suited to managing semi-structured data with heterogeneous, irregular, evolving structures, as in the case of the SGML documents found in digital libraries. To overcome the difficulties imposed by the rigid schema of conventional systems, several schema-less approaches have been proposed. Using instead the unconstrained, extensible schemata offered by object-oriented semantic network systems, we are able both to map document-specific structures to database classes and to model the associated constraint information as integrated schema annotations. In this paper we present the benefits of this approach for creating, accessing, and processing heterogeneous SGML documents, and in particular for exploiting the shared semantics of evolving SGML structures. A corresponding application is currently being implemented in the context of the AQUARELLE project. | |||
| Modelling the Retrieval of Structured Documents Containing Texts and Images | | BIBA | Full-Text | 325-344 | |
| Carlo Meghini; Fabrizio Sebastiani; Umberto Straccia | |||
| We present a model for complex documents possibly consisting of a hierarchically structured set of images or texts. Documents are represented at the form level (as sets of physical features of the representing objects), at the content level (as sets of properties of the represented entities), and at the structure level. A uniform and powerful query language allows queries to be issued that transparently combine features pertaining to form, content, and structure alike. Queries are expressions of a (fuzzy) logical language. While the part of a query that pertains to (medium-independent) content is processed directly by an inferential engine, the part that pertains to (medium-dependent) form is entrusted to specialised document processing procedures linked to the logical language by a procedural attachment mechanism. The model thus combines the power of state-of-the-art document processing techniques with the advantages of a clean, logically defined framework for understanding multimedia document retrieval. | |||
| Probabilistic Retrieval of OCR Degraded Text Using N-Grams | | BIBA | Full-Text | 345-359 | |
| Stephen M. Harding; W. Bruce Croft; C. Weir | |||
| The retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2- and 3-grams, or of 2-, 3-, 4- and 5-grams, gave better retrieval performance than standard (word-based) queries on the same data at degradation levels of 10 percent or worse. A second method, which uses n-grams to identify appropriate matching and near-matching terms for query expansion, also performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web-based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query terms highlighted. | |||
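A sketch of n-gram matching for OCR-degraded terms, the query-expansion step described above: vocabulary words are indexed by character n-grams, and near-matches for a corrupted term are found with Dice's coefficient. The weighting schemes used in the paper are not reproduced here.
```python
# Find vocabulary terms that nearly match an OCR-corrupted query term by
# comparing their combined 2-gram and 3-gram sets.
def ngrams(word, n):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def grams(word):
    return ngrams(word, 2) | ngrams(word, 3)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def near_matches(term, vocabulary, threshold=0.4):
    g = grams(term)
    scored = ((w, dice(g, grams(w))) for w in vocabulary)
    return sorted((p for p in scored if p[1] >= threshold), key=lambda x: -x[1])

vocab = ["government", "governance", "ornament", "comment"]
print(near_matches("gov3rnment", vocab))   # OCR error: '3' for 'e'
```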
| Deposit for Dutch Electronic Publications: Research and Practice in The Netherlands | | BIBA | Full-Text | 361-373 | |
| Trudi C. Noordermeer | |||
| The objective of this article is to describe the state of affairs of the Deposit for Dutch Electronic Publications, which is organized by the Koninklijke Bibliotheek, the National Library of The Netherlands. It describes in general the results and current status of the research carried out in the period April 1996 to December 1997. Research topics include selection, acquisition, bibliographical and technical description, unique identification, migration, storage, authenticity, and the experience with a limited number of test records used to define the workflow. Further, tests with publishers such as Elsevier Science and Kluwer Academic Publishers are described. The objective of the Deposit for Dutch Electronic Publications is to preserve electronic off-line and on-line publications for the remote future, as a last resort. | |||
| Charging for a Digital Library -- The Business Model and the Cost Models of the MeDoc Digital Library | | BIBA | Full-Text | 375-385 | |
| Michael Breu; Ricarda Weber | |||
| MeDoc is a German digital library project bringing together 7 developing institutions and 24 pilot user institutions as well as 12 international publishing houses. MeDoc provides uniform access to a variety of information sources and an information broker service. MeDoc offers a range of billable digital books and journals contributed by the participating publishing houses.
Operating a digital library has not only many technical but also important economic aspects. The contents of a digital library can be regarded as information merchandise, just like paper books or journals bought in a book store. In order to encourage publishing houses to contribute their books and journals to digital libraries, suitable business models must be defined. New, innovative cost models, like floating licenses or fine-grained pay-per-view, become both necessary and feasible in network-based digital libraries. This paper introduces the MeDoc business model and discusses various cost models and their applicability to the services of a digital library. It gives an overview of the MeDoc license and pricing policy and the applied cost models. | |||
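A hypothetical sketch of the floating-license idea: a publisher grants N concurrent seats per title, and readers check one out while viewing. The MeDoc accounting machinery is of course far more elaborate.
```python
# A floating license caps concurrent readers per title rather than
# binding copies to individual users.
class FloatingLicense:
    def __init__(self, title, seats):
        self.title, self.seats, self.in_use = title, seats, set()

    def check_out(self, user):
        if len(self.in_use) >= self.seats:
            raise RuntimeError(f"all {self.seats} seats for '{self.title}' busy")
        self.in_use.add(user)

    def check_in(self, user):
        self.in_use.discard(user)

lic = FloatingLicense("Compiler Construction", seats=2)
lic.check_out("alice"); lic.check_out("bob")
try:
    lic.check_out("carol")            # third concurrent reader is refused
except RuntimeError as e:
    print(e)
lic.check_in("alice"); lic.check_out("carol")   # seat freed, carol may read
```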
| Bibliothèque Nationale de France's Audiovisual System: Digital Audio, Video, and Photo Consultation in a Library | | BIBA | Full-Text | 387-403 | |
| Sylvie Mony | |||
| Digital audio, video, and photo consultation has been an operational service at the Bibliothèque nationale de France, and one much appreciated by users, since 20 December 1996, the date of the library's opening to the general public.
In December, in the Audiovisual Room of the general public level, readers could consult digitized audiovisual materials on 45 audiovisual workstations, including 120 hours of video (documentaries), 250 hours of audio (interviews and music), and 50,000 photos. These digitized collections can grow until the servers reach full capacity: 300 hours of video, 500 hours of audio, and 300,000 photos. In this paper, we describe the reasons why the Bibliothèque nationale de France chose a digital system to make part of its audiovisual collections available; the stages of setting up this audiovisual system; and the first lessons learned after six months in service. | |||
| The Electronic Colloquium on Computational Complexity (ECCC): A Digital Library in Use | | BIBA | Full-Text | 405-421 | |
| Jochen Bern; Carsten Damm; Christoph Meinel | |||
| The Electronic Colloquium on Computational Complexity (ECCC) is a digital library that specifically addresses a current problem of scientific publishing, more precisely the problem of presenting suitably filtered work to other researchers, for the field of computational complexity. Its detailed concepts were developed in discussions with a scientific board of researchers in the field, and ECCC now fills the gap between author-controlled electronic publication (preprint servers, very fast but lacking content filtering) and conventional journal or conference proceedings publication (currently taking months, if not over a year, from submission to publication). Additionally, like a real colloquium, ECCC supports ongoing discussions through the publication of comments on already published material. Furthermore, authors can present improved versions of their publications while maintaining bibliographic consistency through version control.
In this paper, we will first describe the situation ECCC is meant to remedy (Sections 1 and 2) and then detail the setup with respect to organization (3.1), basic functionality (3.2 through 3.4), cooperation with other services (3.5) and plans for the future (3.6). | |||