
JCDL'06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries

Fullname:ACM/IEEE Joint Conference on Digital Libraries
Note:Opening Information Horizons
Editors:Michael L. Nelson; Cathy Marshall
Location:Chapel Hill, NC, USA
Dates:2006-Jun-11 to 2006-Jun-15
ISBN 1-59593-354-9; ACM Order Number 606062
Links:Conference Home Page
  1. Visualization for libraries
  2. Named entities 1
  3. Classification and links
  4. Panel
  5. Digital preservation
  6. Document analysis
  7. Time and space
  8. Digital library curriculum
  9. Panel
  10. Images and sound
  11. Information retrieval 1
  12. Supporting education
  13. Metadata in action
  14. Information retrieval 2
  15. Usage and relationships
  16. Named entities 2
  17. Posters
  18. Demos

Visualization for libraries

Exploring digital libraries: integrating browsing, searching, and visualization BIBAFull-Text 1-10
  Rao Shen; Naga Srinivas Vemuri; Weiguo Fan; Ricardo da S. Torres; Edward A. Fox
Exploring services for digital libraries (DLs) include two major paradigms, browsing and searching, as well as other services such as clustering and visualization. In this paper, we formalize and generalize DL exploring services within a DL theory. We develop theorems to indicate that browsing and searching can be converted or mapped to each other under certain conditions. The theorems guide the design and implementation of exploring services for an integrated archaeological DL, ETANA-DL. Its integrated browsing and searching can support users in moving seamlessly between these operations, minimizing context switching, and keeping users focused. It also integrates browsing and searching into a single visual interface for DL exploration. A user study to evaluate ETANA-DL's exploring services helped validate our hypotheses.
combinFormation: a mixed-initiative system for representing collections as compositions of image and text surrogates BIBAFull-Text 11-20
  Andruid Kerne; Eunyee Koh; Blake Dworaczyk; J. Michael Mistrot; Hyun Choi; Steven M. Smith; Ross Graeber; Daniel Caruso; Andrew Webb; Rodney Hill; Joel Albea
People need to find, work with, and put together information. Diverse activities, such as scholarly research, comparison shopping, and entertainment involve collecting and connecting information resources. We need to represent collections in ways that promote understanding of individual information resources and also their relationships. Representing individual resources with images as well as text makes good use of human cognitive facilities. Composition, an alternative to lists, means putting representations of elements in a collection together using design principles to form a connected whole.
   We develop combinFormation, a mixed-initiative system for representing collections as compositions of image and text surrogates. The system provides a set of direct manipulation facilities for forming, editing, organizing, and distributing collections as compositions. Additionally, to assist users in sifting through the vast expanse of potentially relevant information resources, the system also includes a generative agent that can proactively engage in processes of collecting information resources and forming image and text surrogates. A generative temporal visual composition agent develops the collection and its visual representation over time, enabling users to see more possibilities. To keep the user in control, we develop interactive techniques that enable the user to direct the agent.
   For evaluation, we conducted a field study in an undergraduate general education course offered in the architecture department. Alternating groups of students used combinFormation as an aid in preparing one of two major assignments involving information discovery to support processes of invention. The students that used combinFormation were found to perform better.
InfoGallery: informative art services for physical library spaces BIBAFull-Text 21-30
  Kaj Grønbæk; Anne Rohde; BalaSuthas Sundararajah; Sidsel Bech-Petersen
Much focus in digital libraries research has been devoted to new online services rather than services for the visitors in the physical library. This paper describes InfoGallery, which is a web-based infrastructure for enriching the physical library space with informative art "exhibitions" of digital library material and other relevant information, such as RSS news streams, event announcements etc. InfoGallery presents information in an aesthetically attractive manner on a variety of surfaces in the library, including cylindrical displays and floors. The infrastructure consists of a server structure, an editor application and a variety of display clients. The paper discusses the design of the infrastructure and its utilization of RSS, podcasts and manually edited news. Applications in the library domain are described and the experiences are discussed.

Named entities 1

The challenge of virginia banks: an evaluation of named entity analysis in a 19th-century newspaper collection BIBAFull-Text 31-40
  Gregory Crane; Alison Jones
This paper evaluates automatic extraction of ten named entity classes from a 19th-century newspaper, the Civil War years of the Richmond Times Dispatch, digitized with IMLS support by the University of Richmond. This paper analyzes success with ten categories of entities prominent in these newspapers and the particular problems that these classes of named entities raise. Personal and place names are familiar, but some more important categories (such as ship names and military units) illustrate some of the challenges that named entity identification confronts as it evolves into a fundamental tool not only for automatic metadata generation but also for searching and browsing. We conclude by suggesting the kinds of knowledge sources that digital libraries need to assemble as part of their machine-readable reference collections to support named entity identification as a core service.
Learning to deduplicate BIBAFull-Text 41-50
  Moises G. de Carvalho; Marcos Andre Goncalves; Alberto H. F. Laender; Altigran S. da Silva
Identifying record replicas in Digital Libraries and other types of digital repositories is fundamental to improve the quality of their content and services as well as to yield eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine evidence used to identify records as being replicas. In this paper, we present the results of experiments we have carried out with a novel Machine Learning approach we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), is able to automatically generate similarity functions to identify record replicas in a given repository. The generated similarity functions properly combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The results of the experiments show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
An effective approach to entity resolution problem using quasi-clique and its application to digital libraries BIBAFull-Text 51-52
  Byung-Won On; Ergin Elmacioglu; Dongwon Lee; Jaewoo Kang; Jian Pei
We study how to resolve entities that contain a group of related elements in them (e.g., an author entity with a list of citations or an intermediate result by GROUP BY SQL query). Such entities, named grouped-entities, frequently occur in many applications. By exploiting contextual information mined from the group of elements per entity in addition to syntactic similarity, we show that our approach, Quasi-Clique, improves precision and recall up to 91% when used together with a variety of existing entity resolution solutions, but never worsens them.
Also by the same author: AKTiveAuthor, a citation graph approach to name disambiguation BIBAFull-Text 53-54
  Duncan M. McRae-Spencer; Nigel R. Shadbolt
The desire for definitive data and the semantic web drive for inference over heterogeneous data sources requires co-reference resolution to be performed on those data. In particular, name disambiguation is required to allow accurate publication lists, citation counts and impact measures to be determined. This paper describes a graph-based approach to author disambiguation on large-scale citation networks. Using self-citation, co-authorship and document source analyses, AKTiveAuthor clusters papers, achieving precision of 0.997 and recall of 0.818 over a test group of eight surname clusters.

Classification and links

Probabilistic, object-oriented logics for annotation-based retrieval in digital libraries BIBAFull-Text 55-64
  Ingo Frommholz; Norbert Fuhr
In this paper we introduce POLAR, a probabilistic object-oriented logical framework for annotation-based information retrieval. In POLAR, the knowledge about digital objects, annotations and their relationships in a digital library repository can be modelled considering certain characteristics of annotations and annotated objects. Insights about these characteristics are gained by an analysis of the annotation models behind existing systems and a discussion of an object-oriented, logical view on relevant objects in a digital library. Retrieval methods applied in a digital library should take annotations into account to satisfy users' information needs. POLAR thus supports a wide range of flexible and powerful annotation-based fact and content queries by making use of knowledge and relevance augmentation. An evaluation of our approach on email discussions shows performance improvements when annotation characteristics are considered.
Bibliometric impact measures leveraging topic analysis BIBAFull-Text 65-74
  Gideon S. Mann; David Mimno; Andrew McCallum
Measurements of the impact and history of research literature provide a useful complement to scientific digital library collections. Bibliometric indicators have been extensively studied, mostly in the context of journals. However, journal-based metrics poorly capture topical distinctions in fast-moving fields, and are increasingly problematic with the rise of open-access publishing. Recent developments in latent topic models have produced promising results for automatic sub-field discovery. The fine-grained, faceted topics produced by such models provide a clearer view of the topical divisions of a body of research literature and the interactions between those divisions. We demonstrate the usefulness of topic models in measuring impact by applying a new phrase-based topic discovery model to a collection of 300,000 Computer Science publications, collected by the Rexa automatic citation indexing system.
A comparative study of citations and links in document classification BIBAFull-Text 75-84
  Thierson Couto; Marco Cristo; Marcos Andre Goncalves; Pavel Calado; Nivio Ziviani; Edleno Moura; Berthier Ribeiro-Neto
It is well known that links are an important source of information when dealing with Web collections. However, the question remains on whether the same techniques that are used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links, in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains up to 37% over text based classifiers, while measures based on bibliographic coupling perform better in a digital library. We also propose a simple and effective way of combining a traditional text based classifier with a citation-link based classifier. This combination is based on the notion of classifier reliability and presented gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes for these results. We discovered that misclassifications by the citation-link based classifiers are in fact difficult cases, hard to classify even for humans.


Panel

Augmenting interoperability across scholarly repositories BIBAFull-Text 85
  Tony Hey; Herbert Van de Sompel; Don Waters; Cliff Lynch; Carl Lagoze
The panel will discuss various aspects related to an invitational meeting held at the Mellon Foundation on April 20th and 21st, 2006, aimed at identifying concrete steps that can be taken to reach new levels of interoperability across scholarly repositories. The focus of the meeting was specifically on repository interfaces that support locating, identifying, harvesting, retrieving and submitting complex digital objects.

Digital preservation

Quantifying software requirements for supporting archived office documents using emulation BIBAFull-Text 86-94
  Thomas Reichherzer; Geoffrey Brown
This paper addresses the issues associated with building software images to support a collection of archived documents using machine emulators. Emulation has been proposed as a strategy for preservation of digital documents that require their original software for access. The creation of software images is a critical component in archiving documents via emulation. The software images include the operating system, application software, and supporting software artifacts such as fonts and Codecs (Compression-Decompression algorithm). A practical emulation environment to support a digital document requires both an emulator and a software image. This paper considers the issues associated with creating such software images to support Microsoft Office documents. In particular, we discuss a set of software tools and strategies that we developed to automatically analyze the dependencies of Microsoft Office documents on software resources and supporting files. As a proof of concept, the tools and strategies have been applied to establish dependencies of Office documents from a document library containing approximately 200,000 documents and to automatically collect missing resources such as fonts. The software tools are a first step toward an interactive system that aids in the construction of robust emulation environments for preserving digital artifacts. However, they may also be used in other contexts, for example, to support screening of documents for archiving and migration to new platforms to ensure correct visualization.
Building a research library for the history of the web BIBAFull-Text 95-102
  William Y. Arms; Selcuk Aya; Pavel Dmitriev; Blazej J. Kot; Ruth Mitchell; Lucia Walle
This paper describes the building of a research library for studying the Web, especially research on how the structure and content of the Web change over time. The library is particularly aimed at supporting social scientists for whom the Web is both a fascinating social phenomenon and a mirror on society.
   The library is built on the collections of the Internet Archive, which has been preserving a crawl of the Web every two months since 1996. The technical challenges in organizing this data for research fall into two categories: high-performance computing to transfer and manage the very large amounts of data, and human-computer interfaces that empower research by non-computer specialists.
The processing of digitized works BIBAFull-Text 103-104
  Jose Borbinha; Joao Gil; Gilberto Pedrosa; Joao Penas
This paper describes the processing of digitised works at the National Library of Portugal, as done in the scope of the National Digital Library initiative (BND). This comprises the normalization of the names of the images, the creation of technical metadata, image processing, OCR, indexing, and the creation of derived copies for preservation and copies for access in PNG, JPG, GIF, and PDF. The structural descriptions of all the objects are done in METS.
Document level interoperability for collection creators BIBAFull-Text 105-106
  David Bainbridge; Kaun Yu (Jeffrey) Ke; Ian H. Witten
Digital library interoperability for both documents and metadata is a critical and complex issue. Although many relevant standards have been developed, and continue to evolve, in practice things are not quite so easy as they seem. We have built a software environment called the Exchange Center that helps digital librarians manage the process of sourcing documents and metadata from various repositories, adding local content where necessary, and exporting the resulting collection into formats that are suitable for digital library repositories. This paper describes the software, which is built on Greenstone but does not require its use as the final digital library server.
Repository software evaluation using the audit checklist for certification of trusted digital repositories BIBAFull-Text 107-108
  Joanne S. Kaczmarek; Thomas G. Habing; Janet Eke
The NDIIPP ECHO DEPository project [1] digital repository evaluation will use an augmented version of the draft Audit Checklist for Certification of Trusted Digital Repositories (Audit Checklist) [2] to provide a framework for examining how well currently popular repository software applications support the notion of a "trusted digital repository." The evaluation will also demonstrate the application of a scoring software evaluation methodology similar to one developed by the Center for Data Insight (CDI) at Northern Arizona University [3], used for evaluating data mining software. This scoring methodology in conjunction with the Audit Checklist can be used as a tool by librarians, archivists, and other data custodians to make informed decisions as they develop digital preservation management services.

Document analysis

A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books BIBAFull-Text 109-118
  Shaolei Feng; R. Manmatha
A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.
Combining DOM tree and geometric layout analysis for online medical journal article segmentation BIBAFull-Text 119-128
  Jie Zou; Daniel Le; George R. Thoma
We describe an HTML web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-Converted-HTML files). The web page content is modeled by a zone tree structure based primarily on the geometric layout of the web page. For a given journal article, a zone tree is generated by combining DOM tree analysis and a recursive X-Y cut algorithm. Combined with other visual cues, such as background color, font size, font color and so on, the page is segmented into homogeneous regions. Evaluation is conducted with 104 articles from 11 journals. Out of 9726 ground-truth zones, 9376 zones are correctly segmented, for an accuracy of 96.40%. Segmenting the entire web page into zones can significantly expedite and increase the accuracy of the subsequent information retrieval steps.
Automatic categorization of figures in scientific documents BIBAFull-Text 129-138
  Xiaonan Lu; Prasenjit Mitra; James Z. Wang; C. Lee Giles
Figures are very important non-textual information contained in scientific documents. Current digital libraries do not provide users with tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functionalities in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are utilized in the architecture for discriminating among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific document digital library.
XML views for electronic editions BIBAFull-Text 139-140
  Ionut E. Iacob; Alex Dekhtyar
In this paper we discuss the implementation of user-defined views over multihierarchical document-centric XML documents.

Time and space

Exploring erotics in Emily Dickinson's correspondence with text mining and visual interfaces BIBAFull-Text 141-150
  Catherine Plaisant; James Rose; Bei Yu; Loretta Auvil; Matthew G. Kirschenbaum; Martha Nell Smith; Tanya Clement; Greg Lord
This paper describes a system to support humanities scholars in their interpretation of literary work. It presents a user interface and web architecture that integrates text mining, a graphical user interface and visualization, while attempting to remain easy to use by non-specialists. Users can interactively read and rate documents found in a digital library collection, prepare training sets, review results of classification algorithms and explore possible indicators and explanations. Initial evaluation steps suggest that there is a rationale for "provocational" text mining in literary interpretation.
Time period directories: a metadata infrastructure for placing events in temporal and geographic context BIBAFull-Text 151-160
  Vivien Petras; Ray R. Larson; Michael Buckland
Metadata is ordinarily used to describe documents, but it can also constitute a form of infrastructure for access to networked resources and for traversal of those resources. One problematic area for access to digital library resources has been the search for time periods or events. If there is a capability to search for time, it is usually a date search - a standardized and precise form but unfortunately rarely used in common chronological expressions. For example, a user interested in the "Vietnam war", "Clinton Administration" or the "Elizabethan Period" must either know the corresponding dates, or rely on simple keyword matching for those period names. We consider the ability to interpret user statements of periods or eras as ranges of dates and to associate them with particular locations an important feature of an information system. This paper describes the Time Period Directory, a metadata infrastructure for named time periods linking them with their geographic location as well as a canonical time period range.
ETANA-ADD: an interactive tool for integrating archaeological DL collections BIBAFull-Text 161-162
  Naga Srinivas Vemuri; Rao Shen; Sameer Tupe; Weiguo Fan; Edward A. Fox
ETANA-DL is an archaeology digital library built based on the principles of Open Digital Libraries. A key challenge addressed in ETANA-DL is integration of new archaeological sites. To enable archaeologists to build OAI data providers for easy integration, we developed an interactive software tool for database-to-XML generation, schema mapping, and global archive generation. This tool greatly enhances our ability to build new Open Archives. We tested the tool with data from the Umm el-Jimal site.
Enabling exploration: Travelers in the Middle East Archive BIBAFull-Text 163-164
  Lisa Spiro; Marie Wise; Geneva Henry; Chuck Bearden; Sid Byrd; Eva Garza; Michael Decker
In this paper, we describe the Travelers in the Middle East Archive (TIMEA), a digital archive focused on Western explorations in the Middle East between the 18th and early 20th centuries. TIMEA brings together TEI-encoded texts and digital images stored in DSpace, research and teaching materials in Connexions, and GIS maps made available online through ArcIMS. By using the functionality of three distinct systems, TIMEA enables users to more fully understand the materials, place them in context, and conduct queries. We outline the rationale for this architecture, the challenges it presents, and our approach to providing an integrated user experience.

Digital library curriculum

Digital library education: the current status BIBAFull-Text 165-174
  Yongqing Ma; Warwick Clegg; Ann O'Brien
In this paper, we review and examine the current status of digital library education and compare the range of provision with that found in earlier studies [1, 2, 3]. It is found that the number of institutions offering programmes or courses in digital library education is still increasing. About 43% of these programmes or courses are stand-alone rather than integrated with wider material. The curriculum design and focused teaching areas appear more systematic and comprehensive than in earlier studies. Over half the institutions examined in this study have posted their detailed course information on-line. Most courses offered are now based on a combination of theory and practice, and are available at different levels. There are increasing opportunities for funding for developing new initiatives in digital library education. However, since digital library education is still at an early stage, an optimized model of best practice in digital library education has not yet emerged.
Curriculum development for digital libraries BIBAFull-Text 175-184
  Jeffrey Pomerantz; Barbara M. Wildemuth; Seungwon Yang; Edward A. Fox
The Virginia Tech Department of Computer Science (VT CS) and the University of North Carolina at Chapel Hill School of Information and Library Science (UNC SILS) have launched a curriculum development project in the area of digital libraries. Educational resources will be developed based on the ACM/IEEE-CS Computing Curriculum 2001. Lesson plans and modules will be developed in a variety of areas (that cover the topics of papers and conference sessions in the field), evaluated by experts in those areas, and then pilot tested in CS and LIS courses. An analysis of papers on digital library-related topics from several corpora was performed, to identify the areas in which more and less work has already been performed on these topics; this analysis will guide the initial stages of this curriculum development.
Learning by building digital libraries BIBAFull-Text 185-186
  David M. Nichols; David Bainbridge; J. Stephen Downie; Michael B. Twidale
The implications of using digital library software in educational contexts, for both students and software developers, are discussed using two case studies of students building digital libraries.
What do digital librarians do BIBAFull-Text 187-188
  Youngok Choi; Edie Rasmussen
Without well-educated digital librarians, digital libraries cannot reach their full potential. In order to offer relevant courses and programs to train digital librarians, educators need feedback from practitioners. Current digital library professionals in academic libraries in the United States were surveyed to determine their activities, skills and training gaps. The findings have implications for the design of digital library education in order to meet workplace needs.


Panel

The NDIIPP preservation network: progress, problems, and promise BIBKFull-Text 189
  Laura E. Campbell; Helen R. Tibbo; Peter Leousis
Keywords: Library of Congress, NDIIPP, data management, digital preservation

Images and sound

Windowing time in digital libraries BIBAFull-Text 190-191
  Michael G. Christel
This paper discusses the specification, organization, and utility of time references identified in digital library materials, emphasizing how to treat date references that cannot be resolved to a single day. The HistoryMakers oral history archive is used to illustrate the concept of windowing such time in digital library interfaces.
Concept maps to support oral history search and use BIBAFull-Text 192-193
  Ryen W. White; Hyunyoung Song; Jay Liu
In this paper we describe a novel technique to support information seeking in oral history archives using concept maps. We conducted a pilot study with teachers engaged in work tasks using a prototype concept mapping tool. Results suggest that concept maps can help searchers, especially when tasks are complex.
Facilitating access to large digital oral history archives through Informedia technologies BIBAFull-Text 194-195
  Michael G. Christel; Julieanna Richardson; Howard D. Wactlar
This paper discusses the application of speech alignment, image processing, and language understanding technologies to build efficient interfaces into large digital oral history archives, as exemplified by a thousand hour HistoryMakers corpus. Browsing, querying, and navigation features are discussed.
Review mining for music digital libraries: phase II BIBAFull-Text 196-197
  J. Stephen Downie; Xiao Hu
We continue our work on the automatic mining of user-created music reviews towards the goal of connecting user opinions to music objects in Music Digital Libraries (MDL). We demonstrate an experimental system that automatically discovered the key descriptive patterns that differentiated positive from negative reviews, which helps us to better understand our successful Phase I results. Comparison to an earlier study indicates an important consistency across projects that warrants further investigation.
Looking for a picture: an analysis of everyday image information searching BIBAFull-Text 198-199
  Sally Jo Cunningham; Masood Masoodian
There is at present a dearth of information on the everyday image information behavior of ordinary people. Analysis of a set of 64 image-related searches provides insight into potentially useful facilities for an image digital library.
Image-based evaluation of video-acquired research skills BIBAFull-Text 200-201
  Unmil P. Karadkar; Marlo Nordt; Richard Furuta; Cody Lee; Christopher Quick
We are exploring the use of image interfaces for testing video-acquired research skills. We studied user performance on three testing image layouts that differ in their use of the available display real estate and in the flexibility of managing the time available to them. Our results confirm that image layout affects user performance on particular tasks and that experts use different strategies from novices. These alternative layouts will be useful for viewing and understanding digital image collections.

Information retrieval 1

Keyphrase extraction-based query expansion in digital libraries BIBAFull-Text 202-209
  Min Song; Il Yeol Song; Robert B. Allen; Zoran Obradovic
In pseudo-relevance feedback, the two key factors affecting the retrieval performance most are the source from which expansion terms are generated and the method of ranking those expansion terms. In this paper, we present a novel unsupervised query expansion technique that utilizes keyphrases and POS phrase categorization. The keyphrases are extracted from the retrieved documents and weighted with an algorithm based on information gain and co-occurrence of phrases. The selected keyphrases are translated into Disjunctive Normal Form (DNF) based on the POS phrase categorization technique for better query reformulation. Furthermore, we study whether ontologies such as WordNet and MeSH improve the retrieval performance in conjunction with the keyphrases. We test our techniques on TREC 5, 6, and 7 as well as a MEDLINE collection. The experimental results show that the use of keyphrases with POS phrase categorization produces the best average precision.
Categorizing web search results into meaningful and stable categories using fast-feature techniques BIBAFull-Text 210-219
  Bill Kules; Jack Kustanowitz; Ben Shneiderman
When search results against digital libraries and web resources have limited metadata, augmenting them with meaningful and stable category information can enable better overviews and support user exploration. This paper proposes six fast-feature techniques that use only features available in the search result list, such as title, snippet, and URL, to categorize results into meaningful categories. They use credible knowledge resources, including a US government organizational hierarchy, a thematic hierarchy from the Open Directory Project (ODP) web directory, and personal browse histories, to add valuable metadata to search results. In three tests the percent of results categorized for five representative queries was high enough to suggest practical benefits: general web search (76-90%), government web search (39-100%), and the Bureau of Labor Statistics website (48-94%). An additional test submitted 250 TREC queries to a search engine and successfully categorized 66% of the top 100 using the ODP and 61% of the top 350. Fast-feature techniques have been implemented in a prototype search engine. We propose research directions to improve categorization rates and make suggestions about how web site designers could re-organize their sites to support fast categorization of search results.
A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE BIBAFull-Text 220-229
  Illhoi Yoo; Xiaohua Hu
Document clustering has been used for better document retrieval, document browsing, and text mining in digital libraries. In this paper, we perform a comprehensive comparison study of various document clustering approaches such as three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and Suffix Tree Clustering in terms of the efficiency, the effectiveness, and the scalability. In addition, we apply a domain ontology to document clustering to investigate whether an ontology such as MeSH improves clustering quality for MEDLINE articles. Because an ontology is a formal, explicit specification of a shared conceptualization for a domain of interest, the use of ontologies is a natural way to solve traditional information retrieval problems such as synonym/hypernym/hyponym problems. We conducted fairly extensive experiments based on different evaluation metrics such as misclassification index, F-measure, cluster purity, and Entropy on very large article sets from MEDLINE, the largest digital library in biomedicine.

Supporting education

Metadata aggregation and "automated digital libraries": a retrospective on the NSDL experience BIBAFull-Text 230-239
  Carl Lagoze; Dean Krafft; Tim Cornwell; Naomi Dushay; Dean Eckstrom; John Saylor
Over three years ago, the Core Integration team of the National Science Digital Library (NSDL) implemented a digital library based on metadata aggregation using Dublin Core and OAI-PMH. The initial expectation was that such low-barrier technologies would be relatively easy to automate and administer. While this architectural choice permitted rapid deployment of a production NSDL, our three years of experience have contradicted our original expectations of easy automation and low people cost. We have learned that alleged "low-barrier" standards are often harder to deploy than expected. In this paper we report on this experience and comment on the general cost, the functionality, and the ultimate effectiveness of this architecture.
Using resources across educational digital libraries BIBAFull-Text 240-241
  Mimi Recker; Bart Palmer
This article reports on analyses of usage and design activities by users of the Instructional Architect (IA), an end-user authoring tool designed to support easy access to and use of NSDL and online resources in creating instructional materials. This analysis provides a unique window for understanding how users use resources from multiple digital libraries, and the related issues of resource granularity and context dependence. Analyses suggest that active use and design with online resources is relegated to 'early adopters'. These users designed significantly more instructional projects with more content and more online resources than less-active users. Users in general appeared to value digital library resources, and at a smaller granularity than cataloged.
Template-based authoring of educational artifacts BIBAFull-Text 242-243
  Sarah Davis; Paul Bogen; Lauren Cifuentes; Luis Francisco-Revilla; Richard Furuta; Takeisha Hubbard; Unmil P. Karadkar; Daniel Pogue; Frank Shipman
The Walden's Paths project is developing tools for leveraging student learning with the incredible amount of educational material on the Web. Specialized templates based on established educational frameworks, learning theories, or activities aid path authors in creating pedagogically sound paths by guiding them in collecting and structuring the information included in the path. We describe a template based on the Inquiry-Based Learning educational framework and an implementation that provides support in applying the template to the path authoring process.
EcoPod: a mobile tool for community based biodiversity collection building BIBAFull-Text 244-253
  YuanYuan Yu; Jeannie A. Stamberger; Aswath Manoharan; Andreas Paepcke
Biological studies rely heavily on large collections of species observations. All of these collections cannot be compiled by biology professionals alone. Skilled amateurs can assist by contributing observations they make in the field. The challenge with such contributions is their potentially questionable quality. We present our PDA-based application EcoPod, which replaces traditional paper field guides with a mobile computing platform. EcoPod aims both to increase the efficiency of the identification process and its reliability. The application solicits as little information from the user as possible. At the same time it places no restrictions on the sequencing of the identification process. This approach is to make our solution attractive to both skilled amateurs and professionals. The tool creates a record of the identification process, thereby providing an audit trail for quality assurance. EcoPod's user interface driver computes information gain over identification metadata to maximize screen utilization. The tool ingests SDD, an international standard for XML datasets that describe organisms.
Factors motivating use of digital libraries BIBAFull-Text 254-255
  Flora McMartin; Ellen Iverson; Cathryn Manduca; Alan Wolf; Glenda Morgan
Knowledge about how users use digital libraries and their contents is inextricably tied to a library's ability to sustain itself, grow its services and meet the needs of its users. This paper reports on the preliminary results of a study of how science, technology, engineering and mathematics (STEM) instructors perceive and use digital libraries. Preliminary findings indicate that: they do not differentiate between digital libraries and other kinds of content that comes from the web, they seek content to supplement traditional teaching methods and their reliance on Google and personal networks impedes their ability to recall the primary sources of useful content.

Metadata in action

Scaffolding the infrastructure of the computational science digital library BIBAFull-Text 256-257
  Diana Tanase; Michael Bruce; Jonathan Stuart-Moore; David A. Joiner
This paper describes from a developer's point of view the integration of a content management system (Plone) and a metadata repository (CWIS) in order to create an interactive online digital library for publishing and evaluation of computational science materials. It explains how CSERD's project requirements were addressed by setting up a framework for collaboration between the two systems mentioned above.
Dynamic generation of OAI servers BIBAFull-Text 258-259
  J. Alfredo Sanchez; Antonio Razo; Juan Manuel Cordova; Abraham Villegas
We describe Voai and Xoai, two software environments that facilitate the automatic construction of OAI servers for collections managed via relational and XML databases, respectively. We have used Voai and Xoai to generate OAI servers for diverse collections. We use freely available tools and do not impose programming requirements upon the users. By making this software publicly available, we aim to facilitate the process of joining the OAI community and becoming data providers.
FRBR: enriching and integrating digital libraries BIBAFull-Text 260-269
  George Buchanan
FRBR (Functional Requirements for Bibliographic Records) is a promising framework for supporting rich indexation, and therefore rich interaction, in digital libraries. However, it is poorly reported in the digital library research literature and practical examples of its use are seldom discussed. In this paper, we introduce an implemented architecture for FRBR support that can supplement existing digital library systems. We also demonstrate the benefits gained by the user when FRBR data is used to enrich the user's interaction with the digital library.
Learning from artifacts: metadata utilization analysis BIBAFull-Text 270-271
  William E. Moen; Shawne D. Miksa; Amy Eklund; Serhiy Polyakov; Gregory Snyder
Describes the MARC Content Designation Utilization Project, which is examining a very large set of metadata records as artifacts of the library cataloging enterprise. This is the first large-scale examination of descriptive metadata utilization. Presents an overview of study activities and suggests the study's significance to the broader use of metadata in digital libraries.
Looking back, looking forward: a metadata standard for LANL's aDORe repository BIBAFull-Text 272-273
  Beth Goldsmith; Frances Knudson
Although often disparaged or dismissed in the library community, the MARC standard, notably the MARCXML standard, provides surprising flexibility and robustness for mapping disparate metadata to a vendor-neutral format for storage, exchange, and downstream use.

Information retrieval 2

Measuring inter-indexer consistency using a thesaurus BIBAFull-Text 274-275
  Olena Medelyan; Ian H. Witten
When professional indexers independently assign terms to a given document, the term sets generally differ between indexers. Studies of inter-indexer consistency measure the percentage of matching index terms, but none of them consider the semantic relationships that exist amongst these terms. We propose to represent multiple-indexers data in a vector space and use the cosine metric as a new consistency measure that can be extended by semantic relations between index terms. We believe that this new measure is more accurate and realistic than existing ones and therefore more suitable for evaluation of automatically extracted index terms.
Learning metadata from the evidence in an on-line citation matching scheme BIBAFull-Text 276-285
  Isaac G. Councill; Huajing Li; Ziming Zhuang; Sandip Debnath; Levent Bolelli; Wang Chien Lee; Anand Sivasubramaniam; C. Lee Giles
Citation matching, or the automatic grouping of bibliographic references that refer to the same document, is a data management problem faced by automatic digital libraries for scientific literature such as CiteSeer and Google Scholar. Although several solutions have been offered for citation matching in large bibliographic databases, these solutions typically require expensive batch clustering operations that must be run offline. Large digital libraries containing citation information can reduce maintenance costs and provide new services through efficient online processing of citation data, resolving document citation relationships as new records become available. Additionally, information found in citations can be used to supplement document metadata, requiring the generation of a canonical citation record from merging variant citation subfields into a unified "best guess" from which to draw information. Citation information must be merged with other information sources in order to provide a complete document record. This paper outlines a system and algorithms for online citation matching and canonical metadata generation. A Bayesian framework is employed to build the ideal citation record for a document that carries the added advantages of fusing information from disparate sources and increasing system resilience to erroneous data.
Using controlled query generation to evaluate blind relevance feedback algorithms BIBAFull-Text 286-295
  Chris Jordan; Carolyn Watters; Qigang Gao
Currently in document retrieval there are many algorithms, each with different strengths and weaknesses. There is some difficulty, however, in evaluating the impact of the test query set on retrieval results. The traditional evaluation process, the Cranfield evaluation paradigm, which uses a corpus and a set of user queries, focuses on making the queries as realistic as possible. Unfortunately such query sets lack the fine-grained control necessary to test algorithm properties. We present an approach called Controlled Query Generation (CQG) that creates query sets from documents in the corpus in a way that regulates the theoretic information quality of each query. This allows us to generate reproducible and well defined sets of queries of varying length and term specificity. Imposing this level of control over the query sets used for testing retrieval algorithms enables the rigorous simulation of different query environments to identify specific algorithm properties before introducing user queries. In this work, we demonstrate the usefulness of CQG by generating three different query environments to investigate characteristics of two blind relevance feedback approaches.
Thesaurus based automatic keyphrase indexing BIBAFull-Text 296-297
  Olena Medelyan; Ian H. Witten
We propose a new method that enhances automatic keyphrase extraction by using semantic information on terms and phrases gleaned from a domain-specific thesaurus. We evaluate the results against keyphrase sets assigned by a state-of-the-art keyphrase extraction system and those assigned by six professional indexers.

Usage and relationships

An architecture for the aggregation and analysis of scholarly usage data BIBAFull-Text 298-307
  Johan Bollen; Herbert Van de Sompel
Although recording of usage data is common in scholarly information services, its exploitation for the creation of value-added services remains limited due to concerns regarding, among others, user privacy, data validity, and the lack of accepted standards for the representation, sharing and aggregation of usage data. This paper presents a technical, standards-based architecture for sharing usage information, which we have designed and implemented. In this architecture, OpenURL-compliant linking servers aggregate usage information of a specific user community as it navigates the distributed information environment that it has access to. This usage information is made OAI-PMH harvestable so that usage information exposed by many linking servers can be aggregated to facilitate the creation of value-added services with a reach beyond that of a single community or a single information service. This paper also discusses issues that were encountered when implementing the proposed approach, and it presents preliminary results obtained from analyzing a usage data set containing about 3,500,000 requests aggregated by a federation of linking servers at the California State University system over a 20 month period.
An experimental framework for comparative digital library evaluation: the logging scheme BIBAFull-Text 308-309
  Claus-Peter Klas; Norbert Fuhr; Sascha Kriewel; Hanne Albrechtsen; Giannis Tsakonas; Sarantos Kapidakis; Christos Papatheodorou; Preben Hansen; Laszlo Kovacs; Andras Micsik; Elin Jacob
Evaluation of digital libraries assesses their effectiveness, quality and overall impact. In this paper we present a novel, multi-level logging framework that will provide complete coverage of the different aspects of DL usage for user-system interactions. Based on this framework, we can analyse for various DL stakeholders the logging data according to their specific interests. In addition, analysis tools and a freely accessible log data repository will yield synergies and sustainability in DL evaluation and encourage a community for DL evaluation by providing for discussion on a common ground.
Insights into collections gaps through examination of null result searches in DLESE BIBAFull-Text 310-311
  Barbara DeFelice; Kim A. Kastens; Constance Rinaldo; John Weatherley
We describe the analysis of zero result searches in DLESE, the Digital Library for Earth System Education, with the intent to use the information to discover gaps in the collection. Close examination of null result searches reveals insights into the kinds of information sought by users but which is missing from the collection. Although it is not possible to consistently isolate collection gaps as a cause for null result searches, it is possible to define a set of null result searches that are very likely to have been caused by collections gaps. This information can be used to improve a collection in specific subject areas. We recommend using this method, along with other inputs, for a digital library with a specific collection scope but a broad and mostly unknown user base, and we recognize the need for automating this analysis.
The social life of books in the humane library BIBAFull-Text 312-313
  Yoram Chisik; Nancy Kaplan
The development of public libraries may have inadvertently brought the age of marginalia to a close but now that digital copies no longer require us to refrain from writing in a shared text, it is possible to create sociable books, texts that sustain communities of readers. How might people respond to opportunities to share their readings through marginalia and how might the process of reading for pleasure be altered by situating it in a more social space? The current study examining sociable reading among a small group of middle-school girls demonstrates the potential of reading sociably and affirms the value of developing digital library books to support social exchanges among readers.

Named entities 2

Search engine driven author disambiguation BIBAFull-Text 314-315
  Yee Fan Tan; Min Yen Kan; Dongwon Lee
In scholarly digital libraries, author disambiguation is an important task that attributes a scholarly work to specific authors. This is critical when individuals share the same name. We present an approach to this task that analyzes the results of automatically-crafted web searches. A key observation is that pages from rare web sites are a stronger source of evidence than pages from common web sites, which we model as Inverse Host Frequency (IHF). Our system is able to achieve an average accuracy of 0.836.
Tagging of name records for genealogical data browsing BIBAFull-Text 316-325
  Mike Perrow; David Barber
In this paper we present a method of parsing unstructured textual records briefly describing a person and their direct relatives, which we use in the construction of a browsing tool for genealogical data. The records have been created by researchers who are currently digitising a collection of historical archives stored at the Abbaye de Saint-Maurice, Switzerland. The string 'Beatrix, daughter of Johannes Trona, of Saillon' is a typical example of a record. We wish to annotate every term (word and symbol) in our records with a label which describes whether the term is a name (e.g. 'Beatrix'), a place (e.g. 'Saillon'), or a relationship (e.g. 'daughter'). Using this information, we are able to derive both a canonical form for each name (e.g. 'Beatrix Trona'), and the relationships between people. We build upon work developed for the cleaning and standardization of names for record linkage corpora, adding several enhancements to deal with our more difficult data, which contains common name structures of French, Italian and Latin, over hundreds of years. We present an approach to this problem that works interactively with a user to annotate the data set accurately, greatly reducing the human effort required. We do this by learning a Hidden Markov Model representing a record structure, and finding structural patterns in new records. Finally, we present a brief overview of a tool we are developing to help genealogical researchers browse and search the data.
Automatic feature thesaurus enrichment: extracting generic terms from digital gazetteer BIBAFull-Text 326-333
  Jun Wang; Ning Ge
ADL Gazetteer is a digitalized worldwide gazetteer developed in the Alexandria Digital Library (ADL) Project, which contains millions of geographic names (placenames). The placenames are indexed with type terms from the ADL Feature Type Thesaurus (FTT), a hierarchical category scheme. The paper proposes a two-step method to enrich the category scheme automatically: to discover frequent generic terms by detecting phrase boundaries with a mutual information-based method, and to correlate the generic terms with the relevant type terms by hierarchical clustering. The correlation pairs established can then be used to supplement the FTT with the generic terms found. The extensive experiments conducted on millions of ADLG placenames demonstrated the effectiveness of the proposed methods. Besides the thesaurus enrichment, the potential applications of this research include: to suggest likely type terms when categorizing new placenames, and to help users choose likely search terms.

Posters


An assessment of access and use rights for licensed scholarly digital resources BIBAFull-Text 334
  Kristin R. Eschenfelder; Ian Benton
This research in progress investigates what types of technological protection measures are being used on collections of licensed scholarly resources. It seeks to ascertain the range and variation in access and rights restrictions, and whether observed restrictions were described in acceptable use statements and resource licenses.
Information seeking in academic learning environments: an exploratory factor analytic approach to understanding design features BIBKFull-Text 335
  Shu-Shing Lee; Yin-Leng Theng; Dion Hoe-Lian Goh; Schubert Shou-Boon Foo
Keywords: exploratory factor analysis, information retrieval, interface design, subjective relevance
A performance support systems approach to digital publishing in libraries BIBAFull-Text 336
  Chuck Thomas; Robert H. McDonald
Electronic performance support tools are used in many workplaces, but digital libraries have not evaluated their potential usefulness. In a pilot project, the Florida State University Libraries developed inexpensive performance support tools for three types of in-house digital publishing. This strategy improved productivity and quality control.
Selecting books: a performance-based study BIBAFull-Text 337
  Nina Wacholder; Lu Liu; Ying-Hsang Liu
Our research compares the impact of paper vs. electronic presentation of text on the book selection process. Our focus is on the stage of book selection in which users study the content of a book to decide whether it will be useful for their intended purpose. Effectiveness is operationalized as accurate determination of whether a non-fiction book contains enough discussion of a particular topic to be useful for a research paper. 24 undergraduates participated in a balanced study in which they were given a topic-book pair and asked to decide whether the book was useful for the topic. We explore the differences in performance, with specific reference to the role of the search function, table-of-contents and index.
User perceptions of a federated search system BIBAFull-Text 338
  Ingrid Hsieh-Yee; Rong Tang; Shanyun Zhang
To examine how users make sense of a federated search system we collected data from professional and novice searchers, using a survey instrument that contained simulated searches. A main task of participants was to provide a narrative and a drawing of their understanding of how MetaLib works. The poster presents the methodology and findings, identifies design issues related to federated search systems, and discusses strategies for increasing information literacy in federated search.
Automatic extraction of table metadata from digital documents BIBAFull-Text 339-340
  Ying Liu; Prasenjit Mitra; C. Lee Giles; Kun Bai
Tables are used to present, list, summarize, and structure important data in documents. In scholarly articles, they are often used to present the relationships among data and highlight a collection of results obtained from experiments and scientific analysis. In digital libraries, extracting this data automatically and understanding the structure and content of tables are very important to many applications. Automatic identification, extraction, and search for the contents of tables can be made more precise with the help of metadata. In this paper, we propose a set of medium-independent table metadata to facilitate table indexing, searching, and exchanging. To extract the contents of tables and their metadata, an automatic table metadata extraction algorithm is designed and tested on PDF documents.
A tool for teaching principles of image metadata generation BIBAFull-Text 341
  Palakorn Achananuparp; Katherine W. McCain; Robert B. Allen
We developed a simple web-based prototype to familiarize students with digital library tools. To assist the students with the indexing task, the prototype provided basic functionality, including a metadata input form and a photo search interface. The students generally expressed positive feedback toward the use of digital library tools in their image indexing project.
Evaluating the national science digital library BIBAFull-Text 342
  Michael Khoo
NSDL Core Integration is conducting a program-wide evaluation of all NSDL program activities. The evaluation will inventory and describe NSDL achievements to date, and identify directions for future development. The scale and complexity of the NSDL program - 200 projects over 5 years - poses significant challenges for the evaluation. The poster outlines the theoretical and practical approaches being used to guide and coordinate evaluation activities.
Multi-linguistic collaborative distance learning: from information translation to knowledge translation BIBAFull-Text 343
  Xiangming Mu
A new Video-based Multi-linguistic Collaborative Distance Learning (VMC-DE) system was proposed to support knowledge translation by integrating information, translation, interactive learning, knowledge and context into an interactive learning environment. Two types of user interfaces are under development.
Metadata data dictionary for analog sound recordings BIBAFull-Text 344
  Catherine Lai; Ichiro Fujinaga
This paper introduces a new metadata data dictionary design to assist in the consistent creation of digital libraries of analog sound recording and to promote their interoperability.
Feasibility of developing curriculum standards metadata BIBAFull-Text 345
  Ron T. Brown; Sharon W. Bowers
We asked 18 teachers about their use of video and digital video in the classroom. In general teachers desired digital collections organized by curriculum objectives because curriculum objectives allowed them to quickly narrow search results based on particular units and objectives. To understand the implications for creating metadata, an example video was tagged according to 3 different state curricular standards. The time intensive task required both in-depth knowledge of the video and of the standards. This finding has implications for third party metadata generation.
On-demand metadata extraction network (OMEN) BIBAFull-Text 346
  Ichiro Fujinaga; Daniel McEnnis
A new method for federated searching of music archives using a grid-based dynamic feature extraction system is proposed.
Managing intellectual property issues in a commons of geographic data BIBKFull-Text 347
  James Campbell; Marilyn Lutz; David McCurry; Harlan Onsrud; Kenton Williams
Keywords: commons, geographic data, intellectual property
Scientific research groups, digital libraries, & education: metadata from nanoscale simulation code BIBAFull-Text 348
  Laura M. Bartolo; Cathy S. Lowe; Sharon C. Glotzer; Christopher R. Iacovella
The NSDL Materials Digital Library Pathway (MatDL) is working with materials scientists to capture, in Dublin Core XML format, optimal description of nanoscale computer simulation output as research codes are executed. The long term goal of the work is to enable users, such as research groups and students, to efficiently and effectively manage their results for internal use, for exchange with outside collaborators, for use in educational settings, and for submissions to digital libraries.
Interface design for browsing faceted metadata BIBAFull-Text 349
  Jonathan Stuart-Moore; Monte Evans; Patricia Jacobs
The team developing the new version of the Computational Science Education Reference Desk (CSERD) has recognized the need for, and implemented, a more flexible and interactive system for finding resources using a combination of browsing and searching.
Developing a metadata schema for CSERD: a computational science digital library BIBAFull-Text 350
  D. E. Swain; Jill Wagy; Marilyn McClelland; Patricia Jacobs
The poster traces ongoing efforts to develop and refine a metadata schema for the Computational Science Education Reference Desk (CSERD). Design and development is informed by evolving metadata standards for educational resources, usability studies, audience analysis, and interoperability guidelines for National Science Digital Library (NSDL), NSDL Metadata Registry, and digital libraries, such as Merlot. The poster will illustrate and define each of these as "facets" of metadata structures.
Creating a multi disciplinary digital library in the 5S framework BIBAFull-Text 351
  Michael Drutar; Charles Coleman; Edward Fox
This study identifies (1) the steps involved in framing a multi disciplinary digital library within the 5S Model, a formal model for digital libraries, and (2) the major benefits the 5S Model delivers toward simplifying a digital library's resource information structure, including the creation of simplified resource classification trees and the development of a user interface that interacts with such trees to improve browsability.
   This poster presents a graphical mapping of how individual naming functions of the spatial temporal organization (Fnodes) are mapped to the user interface. Moreover, the poster will display examples of how the spaces element is used to create "HTML HELPERS", which ultimately result in greater ease of use on the user end. The blending of these components to serve all disciplines of a digital library is the fundamental ingredient in creating an established online repository.
An analysis of the bid behavior of the 2005 JCDL program committee BIBKFull-Text 352
  Marko A. Rodriguez; Johan Bollen; Herbert Van de Sompel
Keywords: digital libraries, network analysis, peer-review process
Supporting biological information work: research and education for digital resources and long-lived data BIBAFull-Text 353
  Carole L. Palmer; Melissa H. Cragin; P. Bryan Heidorn
New practices are emerging in all stages of biological research, from data collection through dissemination of results. Through a series of cooperative projects with biologists working in data-intensive and informatics-based domains, we have documented requirements for digital libraries, tool development, and data management techniques to support contemporary scientific practice. This research is now serving as the foundation for a new biological informatics master's program to train scientific information specialists to manage and integrate scientific information and tools to support scientific problem solving and communication.
Finding a metaphor for collecting and disseminating distributed NSDL content and communications BIBAFull-Text 354
  Carol Minton Morris; Helene Hembrooke; Lynette Rayle
The National Science Digital Library (NSDL) dramatically broadens the information about STEM resources that it can accept and make available to its users with the introduction of the NSDL Data Repository (NDR) architecture [1].
   On Ramp is a platform for managing workflow, and creating, editing, distributing and storing content from multiple users and groups in a variety of formats to transform information into knowledge by enabling the NSDL community to engage in a rich exchange of information [2]. Flexible content that is small, modular, and adaptable is favored to promote this type of distributed reusable and multilayered information in an educational digital library such as NSDL.
   In this poster we trace the process used to determine a metaphor for the NSDL On Ramp (ONR) content and communications system by exhibiting iterative designs for a user interface derived from ONR User Survey results.
A curated harvesting approach to establishing a multi-protocol online subject portal BIBAFull-Text 355
  Robert Sanderson; John Harrison; Clare Llewellyn
We describe a curated harvesting approach to creating and maintaining a subject portal, comprising selected records harvested from remote services via information retrieval standards such as SRU, Z39.50 and OAI-PMH. The result was a web-based data curation interface where administrative users can configure access to remote resources, define the queries to be performed against them, and review records for inclusion in end user searches.
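   As a rough illustration of the kind of request such a portal might be configured to issue, the sketch below builds OAI-PMH and SRU request URLs. It is not the authors' implementation: the function names and the example base URLs are hypothetical, and only the parameter names (`verb`, `metadataPrefix`, `set`, `operation`, `version`, `query`, `maximumRecords`) are taken from the OAI-PMH and SRU protocols.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU searchRetrieve request URL (hypothetical helper)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # example.org endpoints are placeholders, not real services.
    print(oai_list_records_url("http://example.org/oai"))
    print(sru_search_url("http://example.org/sru", "pottery"))
```

   A curator-facing interface could store one such configured request per remote resource and run it on a schedule, leaving the harvested records in a review queue.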
Incorporating computational science activities in high school algebra BIBAFull-Text 356
  Joseph DeLuca; David A. Joiner
Despite great increases in the role of computation in Science, Technology, Engineering and Mathematics (STEM), there has been no comprehensive curriculum for computational science in K-12 education [5]. The June 2005 President's Information Technology Advisory Committee (PITAC) report stated that "only a small fraction of the potential of computational science is being realized", and "the diverse technical skills and technologies ... constitute a critical U.S. infrastructure that we under appreciate and undervalue at our peril [4]." Despite a growing focus on STEM education, a substantial shortage exists of Americans qualified to work in STEM professions, including scientific research [1]. Progress in training computational scientists is lagging demand in the U.S. today. As this decade is seeing growth in the number of graduate, undergraduate, and teacher training programs in computational science [7], it is vital that the curriculum and materials to infuse computation into K-12 schools are made available.
   Previous studies have shown how interactive learning objects can be incorporated into teaching, allowing teachers to make classrooms more engaging and student active, provided faculty using the resources have adequate training, a willingness to modify their teaching styles, and access to or time to create quality interactive assignments [6]. The Computational Science Education Reference Desk (CSERD), a Pathway project of the National Science Digital Library, collects learning objects for teaching about and teaching with computation, reviewing items in its catalog on the basis of verification, validation, and accreditation to help provide faculty with information regarding the quality of the learning objects [3].
   This study attempts to determine the effectiveness of a set of interactive learning materials from the CSERD collection in teaching concepts in a freshman Algebra I class. Materials from the CSERD resource Project Interactivate [2] will be used in a series of 4 lessons through February and March 2006 at a parochial school in Northeastern New Jersey. Students will take a pre- and post-test on topics covered in this period. Students and teachers will be surveyed to determine their attitudes towards the use of computation in learning and towards mathematics in general. Additionally, students will submit a daily feedback statement after each augmented lesson.
Adapting peer verification, validation and accreditation processes for digital libraries BIBAFull-Text 357
  Linda Schmalbeck; Jonathan Stuart-Moore; Monte Evans
This poster describes an on-going process to adapt a public access peer-based verification, validation and accreditation system for a digital library that is designed to serve the science, technology, engineering and mathematics (STEM) education community.
Quantifying the accuracy of relational statements in Wikipedia: a methodology BIBAFull-Text 358
  Gabriel Weaver; Barbara Strickland; Gregory Crane
An initial evaluation of the English Wikipedia indicates that it may provide accurate data for disambiguating and finding relations among named entities.
The ingest and maintenance of electronic records: moving from theory to practice BIBAFull-Text 359
  Kevin L. Glick; Eliot Wilczek; Robert Dockins
This poster will present the findings of an NHPRC electronic records research grant conducted by Tufts University and Yale University.
Technical architecture overview: tools for acquisition, packaging and ingest of web objects into multiple repositories BIBAFull-Text 360
  Shweta Rani; Jay Goodkin; Judy Cobb; Tom Habing; Richard Urban; Janet Eke; Richard Pearce-Moses
This poster describes a model for acquiring, packaging and ingesting web objects for archiving in multiple repositories. This ongoing work is part of the ECHO DEPository Project, a 3-year NDIIPP-partner digital preservation project at the University of Illinois at Urbana-Champaign with partners OCLC, a consortium of content provider partners, and the Library of Congress.
Browsing affordance designs for the human-centered computing education digital library BIBAFull-Text 361
  Edward Clarkson; James D. Foley
Browsing is a widespread user behavior in the digital library (DL) environment; there is an array of existing techniques that afford browsing and are readily applicable to digital libraries. We outline the designs of two such methods based on well-known techniques: treemaps and ScentTrails.
Indexing institutional data to promote library resource discovery BIBAFull-Text 362
  Tito Sierra
Most academic research libraries provide subject guides or data-driven subject portals on their websites to help users find information resources by topical research area. Unfortunately, these guides and portals are underutilized because users fail to discover them in their information search process. We describe an approach in development at NCSU to increase the discovery of library subject portals, and topically organized library resources in general. This approach exploits the rich topical content in available institutional data stores to generate subject recommendations related to the user's search query.
Keeping the context: an investigation in preserving collections of digital video BIBAFull-Text 363
  Christopher A. Lee; Helen R. Tibbo; Dawne Howard; Yaxiao Song; Terrell Russell; Paul Jones
There has been a recent dramatic shift from analog to digital creation, management and use of video, creating unprecedented opportunities to develop rich, interactive collections, but without proper care, much of this digital video could be inaccessible or incomprehensible in the future. Several projects have explored technical challenges and potential strategies for ensuring long-term access to digital video collections. A number of initiatives have also generated sets of proposed metadata for digital video. Most of the above activities have focused on ensuring that videos can be discovered, accessed and rendered over time.
   Another active stream of research has examined how users can best navigate, understand, view, interact with and annotate collections of digital video. This research has generated valuable lessons, tools and observations to support current users. However, it has generally not investigated how the components of a digital video collection might support or fail to support future users of videos.
   The Preserving Video Objects and Context (VidArch) project -- NSF Grant # IIS 0455970, involving the authors, Gary Marchionini and Gary Geisler -- lies at the intersection between the two streams of research described above. We are developing a preservation framework for digital video context. Among other issues, we are considering: Are there interface elements from current collections (e.g. surrogates, navigation aids, behaviors) that should be retained over time, in order to support long-term use and understanding of the videos? How might curators of digital video collections decide which contextual elements are important and then devise strategies for preserving them?
   According to the glossary of the Society of American Archivists, context is the "organizational, functional, and operational circumstances surrounding materials' creation, receipt, storage, or use, and its relationship to other materials." Documents derive value and meaning from relationships with other documents within the same collection. Rather than treating each item as a discrete entity, archival theory and practice suggest that digital videos should be managed, preserved and presented to users in a way that reflects the social and documentary context in which they were originally embedded.
   Access systems for text-based collections often rely on surrogates, such as indices, catalogs, and abstracts. In addition to facilitating information navigation, discovery and retrieval, surrogates also provide valuable contextual information about the documents. In archival descriptive practices, attention to context is expressed through the creation of finding aids, which include not only inventories of the contents of collections, but also background information about the actors and activities that generated the materials, and the ways they were organized by their original creators or recipients. Recent research has produced and investigated an analogous set of surrogates for digital video collections. These include textual descriptions, titles, captions, and annotations, but they also include surrogates that are themselves still or moving images: video segments, keyframes, slide shows, and fast forwards.
   VidArch is focused on two collections within the Open Video repository: the complete set of videos that the National Aeronautics and Space Administration (NASA) produces and broadcasts to advance learning and appreciation for science; and a set of videos of juried presentations to various annual Association for Computing Machinery (ACM) conferences. The two collections reflect several forms of documentation that may be valuable to preserve in order to convey the context of the videos: text-based surrogates, image-based surrogates (story boards and fast forwards), links to related videos, use history data, and supporting documents (e.g. lesson plans). We have generated archival finding aids to the two collections in order to reflect contextual information that is not readily available within Open Video. Such documentary elements should not simply be treated as part of the current interface to the collection but should also be considered as potential targets of long-term preservation in their own right.
   This poster presents an information model for digital video context and places the information model within the context of recent guidance on metadata for digital video, metadata for digital preservation, and the Reference Model for an Open Archival Information System (OAIS).
cloudalicious: folksonomy over time BIBAFull-Text 364
  Terrell Russell
Cloudalicious is an online visualization tool that has been designed to give insight into how "tag clouds", or folksonomies, develop over time. A folksonomy is an organic system of text labels attributed to an object by the users of that object. The most common object so far to be the subject of this tagging has been the online bookmark. Stabilization of a URL's tag cloud over time is the clearest result of this type of visualization. Any diagonal movement on the graphs, indicative of a change in the tags being used to describe a URL, should garner further discussion.
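   The idea of a tag cloud stabilizing over time can be sketched as follows. This is only an illustration under stated assumptions, not Cloudalicious itself: the abstract does not say how the graphs are computed, so the sketch assumes that each tag's cumulative share of all tags applied to one URL is tracked after every tagging event. The function name and the sample events are hypothetical.

```python
from collections import Counter


def tag_share_over_time(events):
    """Return the cumulative share of each tag after every tagging event.

    `events` is a list of (timestamp, tag) pairs for a single URL,
    already sorted by timestamp (an assumption of this sketch).
    """
    counts = Counter()
    history = []
    for timestamp, tag in events:
        counts[tag] += 1
        total = sum(counts.values())
        # Snapshot of the relative weight of every tag seen so far.
        shares = {t: c / total for t, c in counts.items()}
        history.append((timestamp, shares))
    return history


if __name__ == "__main__":
    sample = [(1, "python"), (2, "code"), (3, "python"), (4, "python")]
    for timestamp, shares in tag_share_over_time(sample):
        print(timestamp, shares)
```

   Under this reading, a stabilized cloud shows up as share curves that flatten out, while a sustained rise or fall in one tag's share corresponds to the "diagonal movement" mentioned in the abstract.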
Apparatus and methods for production of printed aromatic and gustative information BIBAFull-Text 365
  Berg P. Hyacinthe
As we advance with the implementation of novel technologies, apparatuses and protocols that combine existing technologies with emergent ones are becoming more relevant to ensure smoother transitions and overcome challenges that new technologies alone cannot address. The author essentially suggests a new type of information exchange which focuses on the olfactory/gustatory perceptual realm of smell and taste. In principle, a printing module transforms signals received from the processing unit into olfactory documents that can, in turn, be stored and preserved as scented texts on thin layers of a gustative medium.
Teaching box builder: customizing pedagogical contexts for use of digital library resources in classrooms BIBAFull-Text 366
  Huda J. Khan; Keith E. Maull
This poster and accompanying demonstration introduce the Teaching Box Builder application, which, as currently being implemented, supports the development of pedagogically rich inquiry-based earth science lessons using digital library resources.
ClaimID: a system for personal identity management BIBAFull-Text 367
  Frederic Stutzman; Terrell Russell
In this poster, the authors describe a system that enables individuals to create a representation of their online identity. Realizing that online identity, especially personal identity as represented in search, is difficult to collect and verify, the authors propose a system that enables individuals to collect and self-classify the information that is about them online.
Pathways core: a data model for cross-repository services BIBAFull-Text 368
  Jeroen Bekaert; Xiaoming Liu; Herbert Van de Sompel; Carl Lagoze; Sandy Payette; Simeon Warner
As part of the NSF-funded Pathways project, we have created an interoperable data model to facilitate object re-use and a broad spectrum of cross-repository services. The resulting Pathways Core data model is designed to be lightweight to implement, and to be widely applicable as a shared profile or as an overlay on data models currently used in repository systems and applications. We consider the data models underlying the Fedora, DSpace and aDORe repository systems, and a number of XML-based formats used for the representation of compound objects, including MPEG-21 DIDL, METS, and IMS/CP.
   At the heart of the Pathways Core data model (Fig. 1) are the entity and datastream elements. entity elements model the abstract aspects of digital objects and align with works and expressions in FRBR [1]. An entity can model anything from a digital object to a collection of digital objects (other entities), to a node created merely to express abstract properties. Core properties of entities are hasIdentifier, hasProviderInfo, hasLineage, and hasProviderPersistence. If a repository attaches providerInfo to an entity, it provides a handle to access the entity from the repository, supporting its use and re-use. Persistence of this handle may be indicated with providerPersistence. The hasLineage property indicates the entity (or entities) from which the entity carrying the property was derived. Other properties, such as hasSemantic, which conveys the intellectual genre of the entity (e.g. journal article), can be added. datastream elements model the concrete aspects of a digital object; these align with items in FRBR, and can be thought of as aspects at the level of bitstreams. An entity may have any number of datastreams. Two properties of datastream have been defined as part of the Pathways Core: hasLocation conveys a URI that can be resolved to yield a bitstream; and hasFormat conveys the digital format of the bitstream. If a datastream has multiple hasLocation properties, resolution of the conveyed URIs yields bit-equivalent bitstreams.
   The Pathways Core data model can be serialized in a variety of ways, and an RDF serialization as well as a profile of MPEG-21 DIDL have been created as reference implementations. We have also conducted the following experiment to illustrate the power of the Pathways Core. A number of heterogeneous repositories implemented an OpenURL-based obtain interface from which, given the providerInfo of an entity, an RDF serialization of the entity compliant with the Pathways Core could be retrieved. Using this interface, an overlay journal can collect serializations of some entities (scholarly papers) from the different collaborating repositories, and assemble those into a new issue of the journal. The overlay journal then itself implemented the same obtain interface, and as a result, an RDF serialization of the entire journal, an issue, and an article could be extracted. This interface could then, for example, be used by a preservation repository to collect content from the overlay journal for ingest and mirroring. This experiment illustrates how cross-repository services and workflows can be facilitated through support of an interoperable data model (the Pathways Core) and an interoperable service interface (the OpenURL-based obtain interface).
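   The entity and datastream elements described above can be pictured with a small in-memory sketch. This is not the RDF or MPEG-21 DIDL reference serialization; the Python class layout, the snake_case field names, the `members` field (used here to model an entity that is a collection of other entities) and the `collect_locations` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datastream:
    # hasLocation: URIs that resolve to bit-equivalent bitstreams.
    has_location: List[str]
    # hasFormat: digital format of the bitstream.
    has_format: str


@dataclass
class Entity:
    has_identifier: str
    has_provider_info: Optional[str] = None
    has_provider_persistence: Optional[str] = None
    # hasLineage: identifiers of the entities this one was derived from.
    has_lineage: List[str] = field(default_factory=list)
    # hasSemantic: intellectual genre, e.g. "journal article".
    has_semantic: Optional[str] = None
    datastreams: List[Datastream] = field(default_factory=list)
    # Assumed field: child entities when this entity models a collection.
    members: List["Entity"] = field(default_factory=list)


def collect_locations(entity: Entity) -> List[str]:
    """Gather every hasLocation URI reachable from an entity (hypothetical helper)."""
    locations = [uri for ds in entity.datastreams for uri in ds.has_location]
    for member in entity.members:
        locations.extend(collect_locations(member))
    return locations
```

   In the overlay journal experiment, an issue could then be modeled as an entity whose members are article entities collected from different repositories, with each article's hasLineage pointing back to the source entity.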
Pilot testing the DigiQUAL protocol: lessons learned BIBAFull-Text 369
  Martha Kyrillidou; Sarah Giersch
The Association of Research Libraries is developing the DigiQUAL protocol to assess the service quality provided by digital libraries (DLs). In 2005, statements about DL service quality were put through a two-step validation process with DL developers and then with users in an online survey.
Using citations for ranking in digital libraries BIBKFull-Text 370
  Birger Larsen; Peter Ingwersen
Keywords: citation indexing, information retrieval

Demos


LexiURL web link analysis for digital libraries BIBAFull-Text 371
  Alesia Zuccala; Mike Thelwall
The purpose of this demonstration is to show how LexiURL may be used with a search engine to download links to and colinks with a digital library site for "Web intelligence" purposes.
Exploring content-actor paired network data using iterative query refinement with NetLens BIBAFull-Text 372
  Hyunmo Kang; Catherine Plaisant; Bongshin Lee; Benjamin B. Bederson
Networks have remained a challenge for information retrieval and visualization because of the rich set of tasks that users want to accomplish. This paper demonstrates a tool, NetLens, to explore a Content-Actor paired network data model. The NetLens interface was designed to allow users to pose a series of elementary queries and iteratively refine visual overviews and sorted lists. This enables the support of complex queries that are traditionally hard to specify in node-link visualizations. NetLens is general and scalable in that it applies to any dataset that can be represented with our abstract Content-Actor data model.
A content-based video browsing system based on visual neighbor similarity BIBAFull-Text 373
  Xiangming Mu
A new interactive shot level video navigation system is developed to support three types of content-based browsing functions: Neighbor clustering, Visual similarity, and Visual Neighbor Similarity (VNS) browsing.
OpenArXiv = arXiv + RDBMS + web services BIBKFull-Text 374
  Justin Fisher; Hyunyoung Kil; Dongwon Lee
Keywords: API, arXiv, digital library, web services
Real-time collaboration through visual search and voice-over-IP BIBAFull-Text 375
  Cathal Hoare; Humphrey Sorensen
Numerous digital libraries (DLs), electronic archives (EAs) and portal services have been developed. These services allow online structured access to digitised information, facilitating remote access for educators and students. Often, DL information is remotely located, as are its users. The authors can envision numerous circumstances where two remotely located parties may wish to opportunistically examine an online resource - e-learning environments for example. We are particularly interested in assisting users whose collaboration revolves around discussion of a common visual resource (documents and collections of documents in the case under discussion). By providing a single tool for information seeking and multi-user collaboration, we believe that the amount of preparation required for an online session is reduced, while the flexibility allowed to parties to conduct ad-hoc examinations of a resource is increased. This paper proposes a framework to address this functionality deficit by describing a document foraging tool that provides facilities for both visual exploration of a document set and Voice-over-IP (VoIP) based collaborative features.
Demonstrating the use of a SenseCam in two domains BIBAFull-Text 376
  Seungwon Yang; Ben Congleton; George Luc; Manuel A. Perez-Quinones; Edward A. Fox
MyLifeBits is both an application and a framework to manage a personal lifetime of memories. We will demonstrate the use of a small digital library that manages data from two Microsoft SenseCams, used by: 1) students in the Virginia-Maryland Regional College of Veterinary Medicine, and 2) students supported by our Assistive Technologies office.
SIMPEL: a superimposed multimedia presentation editor and player BIBAFull-Text 377
  Uma Murthy; Kapil Ahuja; Sudarshan Murthy; Edward A. Fox
In a variety of applications such as learning, we need to integrate multimedia information into convenient packages (like presentations). The challenges involved in this process are: selecting or working with information elements at sub-document level while retaining the original context; describing the integration or packaging of such elements; and making use of minimal storage during this activity.
   Current multimedia authoring software, like RealProducer (http://www.realnetworks.com/products/producer/), tends to repeatedly copy information, or to limit granularity of information referenced. Although editors for the Synchronized Multimedia Integration Language (http://www.w3.org/AudioVideo/), such as GRiNS (http://www.oratrix.com/Products/G2E), address some of the aforementioned challenges, they are difficult to use and require considerable training effort before a user can work with them.
   We developed the Superimposed Multimedia Presentation Editor and Player (SIMPEL), a tool to address these challenges. SIMPEL allows a user to reference information of many types, at varying granularity, without replicating the referenced information. It also allows the user to compose synchronized multimedia presentations. For example, for a specific topic a user can select an audio clip, some images, and some text. He can then "play" (render in specific panes of a window) this information-set in some order. Figure 1 shows a snapshot of a SIMPEL presentation. Pane A contains an audio clip. Panes B, C, and D show selected information within web pages.
   SIMPEL is in a genre of applications called superimposed applications (SAs), which allow users to superimpose new interpretations over existing or base information [1]. SAs employ "marks", references to selected regions within base information. SIMPEL uses the Superimposed Pluggable Architecture for Contexts and Excerpts (SPARCE), middleware that provides mark management and other services for SAs [2]. SIMPEL has been implemented for Windows in Visual Basic .NET and uses XML for storing presentation data.
   Future work on SIMPEL will include support for pre-fetching media files (for better performance) and packaging and sharing of SIMPEL presentations. We also plan to index marks and make them searchable, thus facilitating further reuse. A more detailed report on SIMPEL is available at http://pubs.dlib.vt.edu:9090/48/.
Extended XQuery for digital libraries BIBAFull-Text 378
  Alex Dekhtyar; Ionut E. Iacob; Kevin Kiernan; Dorothy C. Porter
Documents have, in general, a multihierarchical structure (such as physical organization in the form of pages and lines, content organization in the form of paragraphs and sentences, etc.). Searching multihierarchical XML encoding presents a number of unique challenges for both computer scientists and document experts. We present an extension of the XQuery language suitable for searching multihierarchical XML.
ETANA-GIS: GIS for archaeological digital libraries BIBAFull-Text 379
  Douglas Gorton; Rao Shen; Naga Srinivas Vemuri; Weiguo Fan; Edward A. Fox
With the growing importance of mapping land, regions, and their related features, Geographic Information Systems (GIS) has become an ever more important standard in fields where such detailed study of land features is required. Our archaeology digital library, ETANA-DL (http://etana.dlib.vt.edu), contains thousands of records from eight member excavations. Here, we draw on the Space aspect of the 5S meta-model [1] for digital libraries and demonstrate a methodology used to integrate archaeological GIS data with the wealth of information within ETANA-DL. ETANA-GIS connects the digital library's textual records with a spatial representation of their original locations, enhancing users' understanding of the find.
   Using a dataset of the University of Toronto's Tell Madaba excavation project [2], we developed an interactive, Web-based representation of the original ArcGIS document (accessible from the ETANA-DL homepage). For dynamic generation of maps from geospatial data, we use the MapServer [3] project, a mature project that boasts a rich toolset of features for cartographic related image generation. MapServer can directly utilize ArcGIS layer resources, but some translation and additional authoring must occur for proper image generation. Then, using PHP, the MapScript MapServer API, and navigation tools, the map was ported to an interactive, Web-accessible format. Based on a study of alternatives, the technology we chose for our technique seemed to be the best suited for digital library integration and is also completely open source.
   To explore the presentation of the map, a user employs the navigation tools displayed in the corner of the main view (see Figure 1). In addition, full control of displayed layers, a smaller map showing overall view and context, as well as a dynamic scale bar are available for use. To integrate the Web-based version of the Tell Madaba GIS map with the existing digital library, the layers depicting archaeological divisions are clickable and labeled for easy identification. Any area queried results in a pop-up box with ETANA-DL's records and artifacts for that area.
   While this integration connects the digital library with the spatial representation of the region, the unique quality of various GIS maps causes certain difficulties. The lack of a standard for denoting spatial divisions in GIS is one hindrance to producing a more automated technique. Future work will include more automation, usability evaluation, and integration of additional excavations. We hope integration of the digital library and GIS greatly aids users' understanding of the spatial organization of the included data.
MANGAS infrastructure BIBAFull-Text 380
  Hugo Manguinhas; Jose Borbinha
This demonstration shows a set of tools for managing and performing quality control processes to monitor and enforce quality over UNIMARC descriptive metadata records. These tools share a common infrastructure consisting mainly of information coded in XML and tools to process it. This system is currently being used at the National Library of Portugal in production services for the quality control and maintenance of the national union catalogue.
How science web sites are leveraging DLESE search web services to extend value to their users BIBAFull-Text 381
  Lynne Davis; John Weatherley
This demonstration illustrates the use of two search services offered by the Digital Library for Earth System Education (DLESE) and shows how they have been used to create customized discovery interfaces for library resources in science Web sites.
Unsupervised structure discovery for biodiversity information BIBKFull-Text 382
  Hong Cui; Richard M. McCourt; Monique Feist
Keywords: document structure, unsupervised machine learning
Visualizing an enterprise social network from email BIBKFull-Text 383
  Weizhong Zhu; Chaomei Chen; Robert B. Allen
Keywords: email, evolution, social network, visualization