
JCDL'10: Proceedings of the 2010 Joint International Conference on Digital Libraries

Fullname: Proceedings of the 2010 joint international conference on Digital libraries
Editors: Jane Hunter; Carl Lagoze; Lee Giles; Yuan-Fang Li
Location: Gold Coast, Queensland, Australia
Dates: 2010-Jun-21 to 2010-Jun-25
Publisher: ACM
Standard No: ISBN 1-4503-0085-5, 978-1-4503-0085-8; hcibib: DL10
Papers: 66
Pages: 410
  1. Annotations & markup
  2. Scholarly publications
  3. Search 1
  4. Historical text & documents
  5. Collaborative information environments
  6. Personal collections
  7. Visualization
  8. Data mining
  9. Infrastructure & systems
  10. Integration of physical and digital media
  11. Search 2
  12. Theory & frameworks
  13. Social aspects
  14. Digital preservation
  15. Posters
  16. Demonstrations

Annotations & markup

Making web annotations persistent over time 1-10
  Robert Sanderson; Herbert Van de Sompel
As Digital Libraries (DL) become more aligned with the web architecture, their functional components need to be fundamentally rethought in terms of URIs and HTTP. Annotation, a core scholarly activity enabled by many DL solutions, exhibits a clearly unacceptable characteristic when existing models are applied to the web: due to the representations of web resources changing over time, an annotation made about a web resource today may no longer be relevant to the representation that is served from that same resource tomorrow.
   We assume the existence of archived versions of resources, and combine the temporal features of the emerging Open Annotation data model with the capability offered by the Memento framework that allows seamless navigation from the URI of a resource to archived versions of that resource, and arrive at a solution that provides guarantees regarding the persistence of web annotations over time. More specifically, we provide theoretical solutions and proof-of-concept experimental evaluations for two problems: reconstructing an existing annotation so that the correct archived version is displayed for all resources involved in the annotation, and retrieving all annotations that involve a given archived version of a web resource.
Keywords: annotation, digital preservation, persistence, web architecture
Transferring structural markup across translations using multilingual alignment and projection 11-20
  David Bamman; Alison Babeu; Gregory Crane
We present here a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing errorful OCR, and achieve an accuracy rate of 88.2% projecting 13,023 XML tags from source documents to their transcribed translations, with an 83.6% accuracy rate when projecting to texts containing uncorrected OCR. This approach has the potential to allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.
Keywords: annotation projection, knowledge transfer, multilingual alignment
ProcessTron: efficient semi-automated markup generation for scientific documents 21-28
  Guido Sautter; Klemens Böhm; Conny Kühne; Tobias Mathäß
Digitizing legacy documents and marking them up with XML is important for many scientific domains. However, creating comprehensive semantic markup of high quality is challenging. Respective processes consist of many steps, with automated markup generation and intermediate manual correction. These corrections are extremely laborious. To reduce this effort, this paper makes two contributions: First, it proposes ProcessTron, a lightweight markup-process-control mechanism. ProcessTron assists users in two ways: It ensures that the steps are executed in the appropriate order, and it points the user to possible errors during manual correction. Second, ProcessTron has been deployed in real-world projects, and this paper reports on our experiences. A core observation is that ProcessTron more than halves the time users need to mark up a document. Results from laboratory experiments, which we have conducted as well, confirm this finding.
Keywords: data-driven markup process control, semantic xml markup

Scholarly publications

Scholarly paper recommendation via user's recent research interests 29-38
  Kazunari Sugiyama; Min-Yen Kan
We examine the effect of modeling a researcher's past works in recommending scholarly papers to the researcher. Our hypothesis is that an author's published works constitute a clean signal of the latent interests of a researcher. A key part of our model is to enhance the profile derived directly from past works with information coming from the past works' referenced papers as well as papers that cite the work. In our experiments, we differentiate between junior researchers that have only published one paper and senior researchers that have multiple publications. We show that filtering these sources of information is advantageous -- when we additionally prune noisy citations, referenced papers and publication history, we achieve statistically significant higher levels of recommendation accuracy.
Keywords: digital library, information retrieval, recommendation, user modeling
Effective self-training author name disambiguation in scholarly digital libraries 39-48
  Anderson A. Ferreira; Adriano Veloso; Marcos André Gonçalves; Alberto H. F. Laender
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, addressing the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only few examples are available, are the need of the hour for such systems. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need of any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
Keywords: bibliographic citations, name disambiguation
Citing for high impact 49-58
  Xiaolin Shi; Jure Leskovec; Daniel A. McFarland
The question of citation behavior has always intrigued scientists from various disciplines. While general citation patterns have been widely studied in the literature, we develop the notion of citation projection graphs by investigating the citations among the publications that a given paper cites. We investigate how patterns of citations vary between various scientific disciplines and how such patterns reflect the scientific impact of the paper. We find that idiosyncratic citation patterns are characteristic for low impact papers, while narrow, discipline-focused citation patterns are common for medium impact papers. Our results show that crossing-community, or bridging, citation patterns are high risk and high reward since such patterns are characteristic for both low and high impact papers. Last, we observe that citation networks have recently been trending toward more bridging and interdisciplinary forms.
Keywords: citation networks, citation projection, publication impact

Search 1

Evaluating methods to rediscover missing web pages from the web infrastructure 59-68
  Martin Klein; Michael L. Nelson
Missing web pages (pages that return the 404 "Page Not Found" error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page's title, generate the page's lexical signature (LS), obtain the page's tags from the bookmarking website delicious.com and generate an LS from the page's link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top ranked from Yahoo!. However, the combination of methods improves the retrieval performance. Considering the complexity of the LS generation, querying the title first and, in case of insufficient results, querying the LSs second is the preferable setup. This combination accounts for more than 75% of URIs returned top ranked.
Keywords: digital preservation, search engines, web page discovery
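The abstract above does not say how a lexical signature is generated. A common reading of the term is the handful of a page's terms with the highest TF-IDF weight, and the sketch below illustrates only that reading. The function name `lexical_signature`, the smoothed IDF formula, the tiny in-memory corpus and the cut-off `k` are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def lexical_signature(page_tokens, corpus, k=5):
    """Return the k terms of a page with the highest TF-IDF weight.

    page_tokens: list of tokens from the page of interest.
    corpus: list of token lists used to estimate document frequency
            (a hypothetical stand-in for a real reference collection).
    """
    tf = Counter(page_tokens)
    doc_sets = [set(doc) for doc in corpus]
    n_docs = len(doc_sets)

    def idf(term):
        # Smoothed inverse document frequency (one of several common variants).
        df = sum(1 for s in doc_sets if term in s)
        return math.log((n_docs + 1) / (df + 1))

    # Rank by descending TF-IDF; break ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return ranked[:k]


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "memento", "archive"],
]
page = ["the", "memento", "memento", "archive", "the"]
signature = lexical_signature(page, corpus, k=2)
```

The resulting terms would then be joined into a search engine query; a term that occurs in every reference document, such as "the" here, gets zero weight and drops out.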
Search behaviors in different task types 69-78
  Jingjing Liu; Michael J. Cole; Chang Liu; Ralf Bierig; Jacek Gwizdka; Nicholas J. Belkin; Jun Zhang; Xiangmin Zhang
Personalization of information retrieval tailors search towards individual users to meet their particular information needs by taking into account information about users and their contexts, often through implicit sources of evidence such as user behaviors. Task types have been shown to influence search behaviors including usefulness judgments. This paper reports on an investigation of user behaviors associated with different task types. Twenty-two undergraduate journalism students participated in a controlled lab experiment, each searching on four tasks which varied on four dimensions: complexity, task product, task goal and task level. Results indicate regular differences associated with different task characteristics in several search behaviors, including task completion time, decision time (the time taken to decide whether a document is useful or not), and eye fixations, etc. We suggest these behaviors can be used as implicit indicators of the user's task type.
Keywords: eye tracking, information retrieval, personalization, task type, user behavior
Exploiting time-based synonyms in searching document archives 79-88
  Nattiya Kanhabua; Kjetil Nørvåg
Query expansion of named entities can be employed in order to increase the retrieval effectiveness. A peculiarity of named entities compared to other vocabulary terms is that they are very dynamic in appearance, and synonym relationships between terms change with time. In this paper, we present an approach to extracting synonyms of named entities over time from the whole history of Wikipedia. In addition, we will use their temporal patterns as a feature in ranking and classifying them into two types, i.e., time-independent or time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relationships change over time. Further, we describe how to make use of both types of synonyms to increase the retrieval effectiveness, i.e., query expansion with time-independent synonyms for an ordinary search, and query expansion with time-dependent synonyms for a search with respect to temporal criteria. Finally, through an evaluation based on TREC collections, we demonstrate how retrieval performance of queries consisting of named entities can be improved using our approach.
Keywords: query expansion, synonym detection, temporal search

Historical text & documents

Using word sense discrimination on historic document collections 89-98
  Nina Tahmasebi; Kai Niklas; Thomas Theuerkauf; Thomas Risse
Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing the found word senses over time, we can reveal and use important information that will improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed while keeping today's language in mind and have thus been evaluated on well selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is influenced also by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785-1985.
Keywords: OCR error impact, historic document collections, information extraction, word sense discrimination
Chinese calligraphy specific style rendering system 99-108
  Zhenting Zhang; Jiangqin Wu; Kai Yu
Rendering handwritten characters in the specific style of a famous artwork is fascinating. In this paper, a system is built to render the user's handwritten characters with a specific style. First, a stroke database is established. When rendering a character, the strokes are extracted and recognized, then proper radicals and strokes are filtered, and finally these strokes are deformed and the result is generated. The Special Nine Grid (SNG) is presented to help recognize radicals and strokes. The Rule-base Stroke Deformation Algorithm (RSDA) is proposed to deform the original strokes according to the handwriting strokes. The rendering result manifests the specific style with high quality. It is feasible for people to generate the tablet or other artworks with the proposed system.
Keywords: rule-base stroke deformation, special nine grid, specific style rendering
Translating handwritten bushman texts 109-118
  Kyle Williams; Hussein Suleman
The Bleek and Lloyd Collection is a collection of artefacts documenting the life and language of the Bushman people of southern Africa in the 19th century. Included in this collection is a handwritten dictionary that contains English words and their corresponding |xam Bushman language translations. This dictionary allows for the manual translation of |xam words that appear in the notebooks of the Bleek and Lloyd collection. This, however, is not practical due to the size of the dictionary, which contains over 14000 entries. To solve this problem a content-based image retrieval system was built that allows for the selection of a |xam word from a notebook and returns matching words from the dictionary. The system shows promise with some search keys returning relevant results.
Keywords: CBIR, cultural heritage preservation, digital libraries, handwritten manuscripts, image processing, information retrieval

Collaborative information environments

Do Wikipedians follow domain experts?: a domain-specific study on Wikipedia knowledge building 119-128
  Yi Zhang; Aixin Sun; Anwitaman Datta; Kuiyu Chang; Ee-Peng Lim
Wikipedia is one of the most successful online knowledge bases, attracting millions of visits daily. Not surprisingly, its huge success has in turn led to immense research interest for a better understanding of the collaborative knowledge building process. In this paper, we performed a (terrorism) domain-specific case study, comparing and contrasting the knowledge evolution in Wikipedia with a knowledge base created by domain experts. Specifically, we used the Terrorism Knowledge Base (TKB) developed by experts at MIPT. We identified 409 Wikipedia articles matching TKB records, and went ahead to study them from three aspects: creation, revision, and link evolution. We found that the knowledge building in Wikipedia had largely been independent, and did not follow TKB -- despite the open and online availability of the latter, as well as awareness of at least some of the Wikipedia contributors about the TKB source. In an attempt to identify possible reasons, we conducted a detailed analysis of contribution behavior demonstrated by Wikipedians. It was found that most Wikipedians contribute to a relatively small set of articles each. Their contribution was biased towards one or very few article(s). At the same time, each article's contributions are often championed by very few active contributors including the article's creator. We finally arrive at a conjecture that the contributions in Wikipedia are more to cover knowledge at the article level rather than at the domain level.
Keywords: Wikipedia, contributing behavior, knowledge building
Spatiotemporal mapping of Wikipedia concepts 129-138
  Adrian Popescu; Gregory Grefenstette
Space and time are important dimensions in the representation of a large number of concepts. However there exists no available resource that provides spatiotemporal mappings of generic concepts. Here we present a link-analysis based method for extracting the main locations and periods associated to all Wikipedia concepts. Relevant locations are selected from a set of geotagged articles, while relevant periods are discovered using a list of people with associated life periods. We analyze article versions over multiple languages and consider the strength of a spatial/temporal reference to be proportional to the number of languages in which it appears. To illustrate the utility of the spatiotemporal mapping of Wikipedia concepts, we present an analysis of cultural interactions and a temporal analysis of two domains. The Wikipedia mapping can also be used to perform rich spatiotemporal document indexing by extracting implicit spatial and temporal references from texts.
Keywords: concept, cultural, interaction, multilinguism, spatial-temporal, wikipedia
Crowdsourcing the assembly of concept hierarchies 139-148
  Kai Eckert; Mathias Niepert; Christof Niemann; Cameron Buckner; Colin Allen; Heiner Stuckenschmidt
The "wisdom of crowds" is accomplishing tasks that are cumbersome for individuals yet cannot be fully automated by means of specialized computer algorithms. One such task is the construction of thesauri and other types of concept hierarchies. Human expert feedback on the relatedness and relative generality of terms, however, can be aggregated to dynamically construct evolving concept hierarchies. The InPhO (Indiana Philosophy Ontology) project bootstraps feedback from volunteer users unskilled in ontology design into a precise representation of a specific domain. The approach combines statistical text processing methods with expert feedback and logic programming to create a dynamic semantic representation of the discipline of philosophy.
   In this paper, we show that results of comparable quality can be achieved by leveraging the workforce of crowdsourcing services such as the Amazon Mechanical Turk (AMT). In an extensive empirical study, we compare the feedback obtained from AMT's workers with that from the InPhO volunteer users providing an insight into qualitative differences of the two groups. Furthermore, we present a set of strategies for assessing the quality of different users when gold standards are missing. We finally use these methods to construct a concept hierarchy based on the feedback acquired from AMT workers.
Keywords: crowdsourcing, similarity, thesaurus learning

Personal collections

A user-centered design of a personal digital library for music exploration 149-158
  David Bainbridge; Brook J. Novak; Sally Jo Cunningham
We describe the evaluation of a personal digital library environment designed to help musicians capture, enrich and store their ideas using a spatial hypermedia paradigm. The target user group is musicians who primarily use audio and text for composition and arrangement, rather than formal music notation. Using the principle of user-centered design, the software implementation was guided by a diary study involving nine musicians which suggested five requirements for the software to support: capturing, overdubbing, developing, storing, and organizing. Moreover, the underlying spatial data-model was exploited to give raw audio compositions a hierarchical structure, and -- to aid musicians in retrieving previous ideas -- a search facility is available to support both query by humming and text-based queries. A user evaluation of the completed design with eleven subjects indicated that musicians, in general, would find the hypermedia environment useful for capturing and managing their moments of musical creativity and exploration. More specifically, they would make use of the query by humming facility and the hierarchical track organization, but not the overdubbing facility as implemented.
Keywords: music composition, personal digital music library, spatial hypermedia
Improving mood classification in music digital libraries by combining lyrics and audio 159-168
  Xiao Hu; J. Stephen Downie
Mood is an emerging metadata type and access point in music digital libraries (MDL) and online music repositories. In this study, we present a comprehensive investigation of the usefulness of lyrics in music mood classification by evaluating and comparing a wide range of lyric text features including linguistic and text stylistic features. We then combine the best lyric features with features extracted from music audio using two fusion methods. The results show that combining lyrics and audio significantly outperformed systems using audio-only features. In addition, the examination of learning curves shows that the hybrid lyric + audio system needed fewer training samples to achieve the same or better classification accuracies than systems using lyrics or audio singularly. These experiments were conducted on a unique large-scale dataset of 5,296 songs (with both audio and lyrics for each) representing 18 mood categories derived from social tags. The findings push forward the state-of-the-art on lyric sentiment analysis and automatic music mood classification and will help make mood a practical access point in music digital libraries.
Keywords: audio features, feature fusion, lyric sentiment analysis, music digital libraries, music mood classification, supervised learning
Visualizing personal digital collections 169-172
  Weijia Xu; Maria Esteva; Suyog Dott Jain
This paper describes the use of a relational database management system (RDBMS) and treemap visualization to represent and analyze a group of personal digital collections created in the context of work and with no external metadata. We evaluated the visualization vis-à-vis the results of previous personal information management (PIM) studies. We suggest that this visualization supports analyses that allow understanding PIM practices over time.
Keywords: database applications, digital collections, information visualization, personal information management (PIM), treemap
Interpretation of web page layouts by blind users 173-176
  Luis Francisco-Revilla; Jeff Crow
Digital libraries must support assistive technologies that allow people with disabilities such as blindness to use, navigate and understand their documents. Increasingly, many documents are Web-based and present their contents using complex layouts. However, approaches that translate two-dimensional layouts to one-dimensional speech produce a very different user experience and loss of information. To address this issue, we conducted a study of how blind people navigate and interpret layouts of news and shopping Web pages using current assistive technology. The study revealed that blind people do not parse Web pages fully during their first visit, and that they can miss important parts. The study also provided insights for improving assistive technologies.
Keywords: assistive technology, blind users, web page layouts

Visualization

Supporting document triage via annotation-based multi-application visualizations 177-186
  Soonil Bae; DoHyoung Kim; Konstantinos Meintanis; J. Michael Moore; Anna Zacchi; Frank Shipman; Haowei Hsieh; Catherine C. Marshall
For open-ended information tasks, users must sift through many potentially relevant documents, a practice we refer to as document triage. Normally, people perform triage using multiple applications in concert: a search engine interface presents lists of potentially relevant documents; a document reader displays their contents; and a third tool -- a text editor or personal information management application -- is used to record notes and assessments. To support document triage, we have developed an extensible multi-application architecture that initially includes an information workspace and a document reader. An Interest Profile Manager infers users' interests from their interactions with the triage applications, coupled with the characteristics of the documents they are interacting with. The resulting interest profile is used to generate visualizations that direct users' attention to documents or parts of documents that match their inferred interests. The novelty of our approach lies in the aggregation of activity records across applications to generate fine-grained models of user interest.
Keywords: document triage, multi-application user modeling, visualization
Flexible access to photo libraries via time, place, tags, and visual features 187-196
  Andreas Girgensohn; Frank Shipman; Thea Turner; Lynn Wilcox
Photo libraries are growing in quantity and size, requiring better support for locating desired photographs. MediaGLOW is an interactive visual workspace designed to address this concern. It uses attributes such as visual appearance, GPS locations, user-assigned tags, and dates to filter and group photos. An automatic layout algorithm positions photos with similar attributes near each other to support users in serendipitously finding multiple relevant photos. In addition, the system can explicitly select photos similar to specified photos. We conducted a user evaluation to determine the benefit provided by similarity layout and the relative advantages offered by the different layout similarity criteria and attribute filters. Study participants had to locate photos matching probe statements. In some tasks, participants were restricted to a single layout similarity criterion and filter option. Participants used multiple attributes to filter photos. Layout by similarity without additional filters turned out to be one of the most used strategies and was especially beneficial for geographical similarity. Lastly, the relative appropriateness of the single similarity criterion to the probe significantly affected retrieval performance.
Keywords: geographic data, photo libraries, photo retrieval, similarity criteria, tagged photos, visual similarity
Interactively browsing movies in terms of action, foreshadowing and resolution 197-200
  Stewart Greenhill; Brett Adams; Svetha Venkatesh
We describe a novel video player that uses Temporal Semantic Compression (TSC) to present a compressed summary of a movie. Compression is based on tempo which is derived from film rhythms. The technique identifies periods of action, drama, foreshadowing and resolution, which can be mixed in different amounts to vary the kind of summary presented. The compression algorithm is embedded in a video player, so that the summary can be interactively recomputed during playback.
Keywords: compression, media aesthetics, video browsing
Timeline interactive multimedia experience (time): on location access to aggregate event information 201-204
  Jeff Crow; Eryn Whitworth; Ame Wongsa; Luis Francisco-Revilla; Swati Pendyala
Attending a complex scheduled social event, such as a multi-day music festival, requires a significant amount of planning before and during its progression. Advancements in mobile technology and social networks enable attendees to contribute content in real-time that can provide useful information to many. Currently access to and presentation of such information is challenging to use during an event. The Timeline Interactive Multimedia Experience (TIME) system aggregates information posted to multiple social networks and presents the flow of information in a multi-touch timeline interface. TIME was designed to be placed on location to allow real-time access to relevant information that helps attendees to make plans and navigate their crowded surroundings.
Keywords: complex scheduled events, events, multi-touch, planning, social media, timeline

Data mining

Domain-specific iterative readability computation 205-214
  Jin Zhao; Min-Yen Kan
We present a new algorithm to measure domain-specific readability. It iteratively computes the readability of domain-specific resources based on the difficulty of domain-specific concepts and vice versa, in a style reminiscent of other bipartite graph algorithms such as Hyperlink-Induced Topic Search (HITS) and the Stochastic Approach for Link-Structure Analysis (SALSA). While simple, our algorithm outperforms standard heuristic measures and remains competitive among supervised-learning approaches. Moreover, it is less domain-dependent and portable across domains as it does not rely on an annotated corpus or expensive expert knowledge that supervised or domain-specific methods require.
Keywords: domain-specific information retrieval, graph-based algorithm, iterative computation, readability measure
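The abstract above likens its method to bipartite graph algorithms such as HITS, but does not give the update rules. As a rough illustration of that family only, the sketch below runs a plain HITS-style mutual reinforcement over a hypothetical document-to-concept graph: concept scores are summed from the documents that mention them, document scores are summed from their concepts, and both are renormalized each round. The function name, the input format and the use of sums with L2 normalization are assumptions; the resulting numbers are relative weights, not the paper's readability scores.

```python
import math


def hits_style_scores(doc_concepts, iterations=50):
    """HITS-style iteration on a bipartite document/concept graph.

    doc_concepts: dict mapping a document id to the set of concept ids
                  it mentions (a hypothetical input format).
    Returns (doc_score, concept_score), each L2-normalized.
    """
    concepts = sorted({c for cs in doc_concepts.values() for c in cs})
    doc_score = {d: 1.0 for d in doc_concepts}
    concept_score = {c: 1.0 for c in concepts}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {k: v / norm for k, v in scores.items()}

    for _ in range(iterations):
        # Concepts inherit weight from the documents that mention them.
        concept_score = normalize({
            c: sum(doc_score[d] for d, cs in doc_concepts.items() if c in cs)
            for c in concepts
        })
        # Documents inherit weight from the concepts they mention.
        doc_score = normalize({
            d: sum(concept_score[c] for c in cs)
            for d, cs in doc_concepts.items()
        })
    return doc_score, concept_score


docs = {
    "intro": {"sum"},
    "advanced": {"sum", "integral", "manifold"},
}
doc_score, concept_score = hits_style_scores(docs)
```

With all scores starting at 1.0, the iteration converges toward the principal eigenvector of the graph, so the document linked to more concepts ends up with the larger weight. A readability measure would need update rules and seed values chosen for that purpose.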
Evaluating topic models for digital libraries 215-224
  David Newman; Youn Noh; Edmund Talley; Sarvnaz Karimi; Timothy Baldwin
Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces through their ability to automatically learn and apply subject tags to each and every item in a collection, and their ability to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential, and empirically evaluate the true value of a given topic model to humans. In this work, we sketch out some sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains. We show how a scoring model -- based on pointwise mutual information of word pairs using Wikipedia, Google and MEDLINE as external data sources -- performs well at predicting human scores. This automated scoring of topics is an important first step to integrating topic modeling into digital libraries.
Keywords: evaluation, topic models, topic quality, user studies
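The scoring model in the abstract above is described only as being based on pointwise mutual information (PMI) of word pairs. A minimal sketch of that general idea follows, assuming document-level co-occurrence counts from a small in-memory reference corpus and a plain mean over all word pairs of a topic. The function name `pmi_coherence`, the smoothing constant `eps` and the toy corpus are illustrative assumptions; the paper's exact formulation and data sources may differ.

```python
import math
from itertools import combinations


def pmi_coherence(topic_words, documents, eps=1e-12):
    """Mean PMI over all word pairs of a topic.

    topic_words: the top words of one topic.
    documents: list of token lists standing in for an external
               reference corpus (hypothetical input).
    eps: small constant so pairs that never co-occur get a large
         negative score instead of log(0).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2 = prob(w1), prob(w2)
        if p1 == 0 or p2 == 0:
            continue  # a word absent from the corpus carries no evidence
        scores.append(math.log((prob(w1, w2) + eps) / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0


reference = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "bond", "market"],
]
coherent = pmi_coherence(["cat", "dog"], reference)
incoherent = pmi_coherence(["cat", "stock"], reference)
```

Words that tend to appear in the same documents yield a positive score, while a topic that mixes unrelated words is pulled strongly negative by the pairs that never co-occur.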
FRBRization of MARC records in multiple catalogs 225-234
  Hugo Miguel Álvaro Manguinhas; Nuno Miguel Antunes Freire; José Luis Brinquete Borbinha
This paper addresses the problem of using the FRBR model to support the presentation of results. It describes a service implementing new algorithms and techniques for transforming existing MARC records into the FRBR model for this specific purpose. This work was developed in the context of the TELPlus project and processed 100,000 bibliographic and authority records from multilingual catalogs of 12 European countries.
Keywords: FRBR, FRBRization, bibliographic records, multilingual catalogs

Infrastructure & systems

Exposing the hidden web for chemical digital libraries BIBAKFull-Text 235-244
  Sascha Tönnies; Benjamin Köhncke; Oliver Koepler; Wolf-Tilo Balke
In recent years, the vast amount of digitally available content has led to the creation of many topic-centered digital libraries. In the domain of chemistry, too, more and more digital collections are available, but complex query formulation still hampers their intuitive adoption. This is because information seeking in chemical documents is focused on chemical entities, for which current standard search relies on complex structures which are hard to extract from documents. Moreover, although simple keyword searches would often be sufficient, current collections simply cannot be indexed by Web search providers due to the ambiguity of chemical substance names. In this paper we present a framework for automatically generating metadata-enriched index pages for all documents in a given chemical collection. All information is then linked to the respective documents and thus provides an easy-to-crawl metadata repository promising to open up digital chemical libraries. Our experiments, indexing an open access journal, show not only that the documents can be found using a simple Google search via the automatically created index pages, but also that the search outperforms full-text indexing in terms of both precision/recall and performance. Finally, we compare our indexing against a classical structure search and find that keyword-based search can indeed solve at least some of the daily tasks in chemical workflows. Using our framework thus promises to expose a large part of the currently still hidden chemical Web, making the techniques employed interesting for chemical information providers like digital libraries and open access journals.
Keywords: chemical digital collections, digital libraries, hidden web, information extraction, information retrieval, web search
oreChem ChemXSeer: a semantic digital library for chemistry BIBAKFull-Text 245-254
  Na Li; Leilei Zhu; Prasenjit Mitra; Karl Mueller; Eric Poweleit; C. Lee Giles
Representing the semantics of unstructured scientific publications will certainly facilitate access and search and hopefully lead to new discoveries. However, current digital libraries are usually limited to classic flat structured metadata even for scientific publications that potentially contain rich semantic metadata. In addition, how to search the scientific literature of linked semantic metadata is an open problem. We have developed a semantic digital library, oreChem ChemXSeer, that models chemistry papers with semantic metadata. It stores and indexes extracted metadata from a chemistry paper repository, ChemXSeer, using "compound objects".
   We use the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) (http://www.openarchives.org/ore/) standard to define a compound object that aggregates metadata fields related to a digital object. Aggregated metadata can be managed and retrieved easily as one unit, resulting in improved ease-of-use, and has the potential to improve the semantic interpretation of shared data. We show how metadata can be extracted from documents and aggregated using OAI-ORE. ORE objects are created on demand; thus, we are able to search for a set of linked metadata with one query.
   We were also able to model new types of metadata easily. For example, chemists are especially interested in finding information related to experiments in documents. We show how paragraphs containing experiment information in chemistry papers can be extracted and tagged based on a chemistry ontology with 470 classes, and then represented in ORE along with other document-related metadata. Our algorithm uses a classifier with features that are words that are typically only used to describe experiments, such as "apparatus", "prepare", etc. Using a dataset comprised of documents from the Royal Society of Chemistry digital library, we show that our proposed method performs well in extracting experiment-related paragraphs from chemistry documents.
Keywords: ChemXSeer, digital library, metadata extraction, oai-ore, seersuite, semantic web, support vector machines
BinarizationShop: a user-assisted software suite for converting old documents to black-and-white BIBAKFull-Text 255-258
  Fanbo Deng; Zheng Wu; Zheng Lu; Michael S. Brown
Converting a scanned document to a binary format (black and white) is a key step in the digitization process. While many existing binarization algorithms operate robustly for well-kept documents, these algorithms often produce less than satisfactory results when applied to old documents, especially those degraded with stains and other discolorations. For these challenging documents, user assistance can be advantageous in directing the binarization procedure. Many existing algorithms, however, are poorly designed to incorporate user assistance. In this paper, we discuss a software framework, BinarizationShop, that combines a series of binarization approaches that have been tailored to exploit user assistance. This framework provides a practical approach for converting difficult documents to black and white.
Keywords: binarization, document processing, user-assisted software
Using an ontology and a multilingual glossary for enhancing the nautical archaeology digital library BIBAKFull-Text 259-262
  Carlos Monroy; Richard Furuta; Filipe Castro
Access to materials in digital collections has been extensively studied within digital libraries. Exploring a collection requires customized indices and novel interfaces to allow users new exploration mechanisms. Materials or objects can then be found by way of full-text, faceted, or thematic indexes. There has been a marked interest not only in finding objects in a collection, but in discovering relationships and properties. For example, multiple representations of the same object enable the use of visual aids to augment collection exploration. Depending on the domain and characteristics of the objects in a collection, relationships among components can be used to enrich the process of understanding their contents. In this context, the Nautical Archaeology Digital Library (NADL) includes multilingual textual- and visual-rich objects (shipbuilding treatises, illustrations, photographs, and drawings). In this paper we describe an approach for enhancing access to a collection of ancient technical documents, illustrations, and photographs documenting archaeological excavations. Because of the nature of our collection, we exploit a multilingual glossary along with an ontology. Preliminary tests of our prototype suggest the feasibility of our method for enhancing access to the collection.
Keywords: information retrieval, interfaces, multilingual technical manuscripts, nautical archaeology, ship reconstruction, technical documents

Integration of physical and digital media

In-depth utilization of Chinese ancient maps: a hybrid approach to digitizing map resources in CADAL BIBAKFull-Text 263-272
  Zhenchao Ye; Ling Zhuang; Jiangqin Wu; Chenyang Du; Baogang Wei; Yin Zhang
Digital maps have recently become increasingly popular as an intuitive and interactive platform for data presentation, and applications integrated with digital maps have attracted much attention. However, no off-the-shelf systems or services are available once the time span of the maps is extended to historical ones. There are a large number of valuable ancient atlases in the CADAL digital library, but they are seldom used because those in image format are not convenient for users to read or search. In this paper, we propose a novel hybrid approach to utilizing these atlases directly and constructing applications based on ancient maps. We call it CAMAME, which stands for Chinese Ancient Maps Automatic Marking and Extraction. We create a gazetteer to store the geographic information of sites which will be projected onto the map, then use a kernel method to do the regression and correct the estimated results with image processing and local regression methods. The empirical results show that CAMAME is effective and efficient: most of the valuable data in the map images is marked and identified. Some Chinese literary chronicle applications that exhibit ancient literary and related historical information over the digitized atlas resources in the CADAL digital library were developed.
Keywords: atlases, digital library, image processing, kernel method
The fused library: integrating digital and physical libraries with location-aware sensors BIBAKFull-Text 273-282
  George R. Buchanan
This paper reports an investigation into the connection of the workspace of physical libraries with digital library services. Using simple sensor technology, we provide focused access to digital resources on the basis of the user's physical context, including the topic of the stacks they are next to, and the content of books on their reading desks. Our research developed the technological infrastructure to support this fused interaction, investigated current patron behavior in physical libraries, and evaluated our system in a user-centred pilot study. The outcome of this research demonstrates the potential utility of the fused library, and provides a starting point for future exploitation.
Keywords: digital libraries, human factors, physical interaction
What humanists want: how scholars use source materials BIBAKFull-Text 283-292
  Neal Audenaert; Richard Furuta
Despite the growing prominence of digital libraries as tools to support humanities scholars, little is known about the work practices and needs of these scholars as they pertain to working with source documents. In this paper we present our findings from a formative user study consisting of semi-structured interviews with eight scholars.
   We find that the use of source materials (by which we mean the original physical documents or digital facsimiles with minimal editorial intervention) in scholarship is not a simple, straightforward examination of a document in isolation. Instead, scholars study source materials as an integral part of a complex ecosystem of inquiry that seeks to understand both the text being studied and the context in which that text was created, transmitted and used. Drawing examples from our interviews, we address critical questions of why scholars use source documents and what information they hope to gain by studying them. We also briefly summarize key note-taking practices as a means for assessing the potential to design user interfaces that support scholarly work-practices.
Keywords: digital humanities, source documents, user studies

Search 2

Context identification of sentences in related work sections using a conditional random field: towards intelligent digital libraries BIBAKFull-Text 293-302
  M. A. Angrosh; Stephen Cranefield; Nigel Stanger
Identification of contexts associated with sentences is becoming increasingly necessary for developing intelligent information retrieval systems. This article describes a supervised learning mechanism employing a conditional random field (CRF) for context identification and sentence classification. Specifically, we focus on sentences in related work sections in research articles. Based on a generic rhetorical pattern, a framework for modelling the sequential flow in these sections is proposed. Adopting a generalization strategy, each of these sentences is transformed into a set of features, which forms our dataset. We distinguish between two kinds of features for each of these sentences viz., citation features and sentence features. While an overall accuracy of 96.51% is achieved by using a combination of both citation and sentence features, the use of sentence features alone yields an accuracy of 93.22%. The results also show F-Scores ranging from 0.99 to 0.90 for various classes indicating the robustness of our application.
Keywords: citation classification, conditional random fields, linear chain CRFs, sentence classification
Can an intermediary collection help users search image databases without annotations? BIBAKFull-Text 303-312
  Robert Villa; Martin Halvey; Hideo Joho; David Hannah; Joemon M. Jose
Developing methods for searching image databases is a challenging and ongoing area of research. A common approach is to use manual annotations, although generating annotations can be expensive in terms of time and money, and therefore may not be justified in many situations. Content-based search techniques which extract visual features from image data can be used, but users are typically forced to express their information need using example images, or through sketching interfaces. This can be difficult if no visual example of the information need is available, or when the information need cannot be easily drawn.
   In this paper, we consider an alternative approach which allows a user to search for images through an intermediate database. In this approach, a user can search using text in the intermediate database as a way of finding visual examples of their information need. The visual examples can then be used to search a database that lacks annotations. Three experiments are presented which investigate this process. The first experiment automatically selects the image queries from the intermediary database; the second instead uses images which have been hand-picked by users. A third experiment, an interactive study, is then presented; this study compares the intermediary interface to text search, where we consider text as an upper bound of performance. For this last study, an interface which supports the intermediary search process is described. Results show that while performance does not match manual annotations, users are able to find relevant material without requiring collection annotations.
Keywords: content-based image retrieval, search strategies
Social network document ranking BIBAKFull-Text 313-322
  Liang Gou; Xiaolong (Luke) Zhang; Hung-Hsuan Chen; Jung-Hyun Kim; C. Lee Giles
In search engines, ranking algorithms measure the importance and relevance of documents mainly based on the contents and relationships between documents. User attributes are usually not considered in ranking. This user-neutral approach, however, may not meet the diverse interests of users, who may demand different documents even with the same queries. To satisfy this need for more personalized ranking, we propose a ranking framework, Social Network Document Rank (SNDocRank), that considers both document contents and the relationship between a searcher and document owners in a social network. This method combines the traditional tf-idf ranking for document contents with our Multi-level Actor Similarity (MAS) algorithm to measure to what extent document owners and the searcher are structurally similar in a social network. We implemented our ranking method in a simulated video social network based on data extracted from YouTube and tested its effectiveness on video search. The results show that, compared with a traditional ranking method like tf-idf, the SNDocRank algorithm returns more relevant documents. More specifically, a searcher can get significantly better results by being in a larger social network, having more friends, and being associated with larger local communities in a social network.
Keywords: information retrieval, multilevel actor similarity, ranking, social networks

Theory & frameworks

A mathematical framework for modeling and analyzing migration time BIBAKFull-Text 323-332
  Feng Luan; Mads Nygård; Thomas Mestl
File format obsolescence has so far been considered the major risk in long-term storage of digital objects. There are, however, growing indications that file transfer may be a real threat, as the migration time, i.e., the time required to migrate petabytes of data, may easily take years. Hardware support, however, is usually limited to 3-4 years, and a situation can emerge in which a new migration has to be started although the previous one is not yet finished. This paper chooses a process modeling approach to obtain estimates of upper and lower bounds for the required migration time. The advantage is that information about potential bottlenecks can be acquired. Our theoretical considerations are validated by migration tests at the National Library of Norway (NB) as well as at our department.
Keywords: long-term preservation, migration, performance, process modeling, storage
Digital libraries for scientific data discovery and reuse: from vision to practical reality BIBAKFull-Text 333-340
  Jillian C. Wallis; Matthew S. Mayernik; Christine L. Borgman; Alberto Pepe
Science and technology research is becoming not only more distributed and collaborative, but more highly instrumented. Digital libraries provide a means to capture, manage, and access the data deluge that results from these research enterprises. We have conducted research on data practices and participated in developing data management services for the Center for Embedded Networked Sensing since its founding in 2002 as a National Science Foundation Science and Technology Center. Over the course of eight years, our digital library strategy has shifted dramatically in response to changing technologies, practices, and policies. We report on the development of several DL systems and on the lessons learned, which include the difficulty of anticipating data requirements from nascent technologies, building systems for highly diverse work practices and data types, the need to bind together multiple single-purpose systems, the lack of incentives to manage and share data, the complementary nature of research and development in understanding practices, and sustainability.
Keywords: collaborative research, cyberinfrastructure, data deluge, distributed research, escience
Ensemble PDP-8: eight principles for distributed portals BIBAKFull-Text 341-344
  Edward A. Fox; Yinlin Chen; Monika Akbar; Clifford A. Shaffer; Stephen H. Edwards; Peter Brusilovsky; Dan Garcia; Lois Delcambre; Felicia Decker; David Archer; Richard Furuta; Frank Shipman; Stephen Carpenter; Lillian Cassel
Ensemble, the National Science Digital Library (NSDL) Pathways project for Computing, builds upon a diverse group of prior NSDL, DL-I, and other projects. Ensemble has shaped its activities according to principles related to design, development, implementation, and operation of distributed portals. Here we articulate 8 key principles for distributed portals (PDPs). While our focus is on education and pedagogy, we expect that our experiences will generalize to other digital library application domains. These principles inform, facilitate, and enhance the Ensemble R&D and production activities. They allow us to provide a broad range of services, from personalization to coordination across communities. The eight PDPs can be briefly summarized as: (1) Articulation across communities using ontologies. (2) Browsing tailored to collections. (3) Integration across interfaces and virtual environments. (4) Metadata interoperability and integration. (5) Social graph construction using logging and metrics. (6) Superimposed information and annotation integrated across distributed systems. (7) Streamlined user access with IDs. (8) Web 2.0 multiple social network system interconnection.
Keywords: adaptive education system, distributed portal, ontology, superimposed information
Discovering Australia's research data BIBAKFull-Text 345-348
  Stefanie Kethers; Xiaobin Shen; Andrew E. Treloar; Ross G. Wilkinson
Access to data crucial to research is often slow and difficult. When research problems cross disciplinary boundaries, problems are exacerbated. This paper argues that it is important to make it easier to find and access data that might be found in an institution, in a disciplinary data store, in a government department, or held privately. We explore how to meet ad hoc needs that cannot easily be supported by a disciplinary ontology, and argue that web pages that describe data collections with rich links and rich text are valuable. We describe the approach followed by the Australian National Data Service (ANDS) in making such pages available. Finally, we discuss how we plan to evaluate this approach.
Keywords: Australian research data commons, e-research, metadata

Social aspects

This is what I'm doing and why: reflections on a think-aloud study of DL users' information behaviour BIBAKFull-Text 349-352
  Stephann Makri; Ann Blandford; Anna L. Cox
Many user-centred studies of digital libraries (DLs) include a think-aloud element and are usually conducted with the purpose of identifying usability issues related to the DLs used or understanding aspects of users' information behaviour. However, few of these studies present detailed accounts of how their think-aloud data was collected and analysed or reflect on this process. In this paper, we discuss and reflect on the decisions made when planning and conducting a think-aloud study of lawyers' interactive information behaviour. Our discussion is framed by Blandford et al.'s PRET A Rapporter ('ready to report') framework -- a framework that can be used to plan, conduct and describe user-centred studies of DL use from an information work perspective.
Keywords: methodology, reflection, think-aloud, user study
Customizing science instruction with educational digital libraries BIBAKFull-Text 353-356
  Tamara Sumner
The Curriculum Customization Service enables science educators to customize their instruction with interactive digital library resources. Preliminary results from a field trial with 124 middle and high school teachers suggest that the Service offers a promising model for embedding educational digital libraries into teaching practices and for supporting teachers to integrate customizing into their curriculum planning.
Keywords: customizing instruction, differentiated instruction, educational digital libraries, personalization, science education, software infrastructure for teachers
Impact and prospect of social bookmarks for bibliographic information retrieval BIBAKFull-Text 357-360
  Kazuhiro Seki; Huawei Qin; Kuniaki Uehara
This paper presents our ongoing study of the current/future impact of social bookmarks (or social tags) on information retrieval (IR). Our main research question in the present work is "How do social tags compare with conventional, yet reliable, manual indexing from the viewpoint of IR performance?". To answer the question, we look at the biomedical literature and begin by examining basic statistics of social tags from CiteULike in comparison with Medical Subject Headings (MeSH) annotated in the Medline bibliographic database. Then, using the data, we conduct various experiments in an IR setting, which reveal that social tags work complementarily with MeSH and that retrieval performance would improve as the coverage of CiteULike grows.
Keywords: controlled vocabulary, folksonomy, free keywords, subject headings
Merging metadata: a sociotechnical study of crosswalking and interoperability BIBAKFull-Text 361-364
  Michael Khoo; Catherine Hall
Digital library interoperability relies on the use of a common metadata format. However, implementing a common metadata format among multiple digital libraries is not always a straightforward exercise. This paper reviews some of the metadata issues that arose during the merger of two digital libraries, the Internet Public Library and the Librarian's Internet Index. As part of the merger, each library's metadata was crosswalked to Dublin Core. This required considerable work. A sociotechnical analysis suggests that the metadata for each library had been shaped in complex ways over time by local factors, and that this complexity negatively impacted the efficiency of the crosswalk. Some implications of this finding for digital library interoperability are discussed.
Keywords: Dublin core, crosswalk, interoperability, metadata, operations, organizational knowledge, organizations, sociotechnical

Digital preservation

Emulation based services in digital preservation BIBAKFull-Text 365-368
  Klaus Rechert; Dirk von Suchodoletz; Randolph Welte
The creation of most digital objects occurs solely in interactive graphical user interfaces which were available at the particular time period. Archiving and preservation organizations are faced with large amounts of such objects of various types. At some point they will need to process these automatically to make them available to their users or convert them to a commonly used format. A substantial problem is to provide a wide range of different users with access to ancient environments and to allow using the original environment for a given object. We propose an abstract architecture for emulation services in digital preservation to provide remote user interfaces to emulation over computer networks without the need to install additional software components. Furthermore, we describe how these ideas can be integrated in a framework of web services for common preservation tasks like viewing or migrating digital objects.
Keywords: digital library, digital preservation, emulation, interactive software, long-term access

Posters

Many-to-many information connections in a distributed digital library portal BIBAKFull-Text 369-370
  Lillian N. Cassel; Edward A. Fox; Richard Furuta; Lois M. L. Delcambre
The Ensemble computing education portal is part of the US NSF's National Science Digital Library (NSDL). The underlying assumption in Ensemble's design is that people will not come just because we build something new. The information must be available from wherever potential users are. This poster describes early efforts to provide multiple community oriented entry points to multiple sources relevant to computing educators.
Keywords: NSDL, digital library, distributed portal
SPIRO-V: a collaborative approach to controlled vocabularies gathering and management BIBAKFull-Text 371-372
  Lina Huang; Rahul A. Deshmukh; Javed Mostafa; Jane Greenberg
This paper describes SPIRO-V, a collaborative controlled vocabulary development system integrating automatic and manual approaches for domain-specific vocabulary acquisition, and leveraging the knowledge of field experts.
Keywords: clinical study, controlled vocabulary construction
Generating citation digests for scientific publications BIBAKFull-Text 373-374
  Richard Easty; Nikolay Nikolov
Science is characterized nowadays by unprecedented growth in the number of publications. Thus it would be helpful if there were a way to summarize the contents of the publications or explain the argumentative relationship between them (e.g. support, further improvement, critique). Such semantic analysis might involve analyzing the citation contexts (the paragraphs where a certain publication is referred to by another publication). Here we present our work on a system that creates the pre-requisites for such analysis by harvesting publications from the web, extracting the contexts from them, and aggregating them into citation digests that are retrieved in the context of user interactions with web sites that mention these publications.
Keywords: browser extension, citation contexts, science literature
AIRFrame: integrating diverse digital collections in astrobiology BIBAKFull-Text 375-376
  Rich Gazan
Astrobiology is an inherently interdisciplinary field concerned with questions of life in the universe. This paper describes the design and ongoing implementation of the Astrobiology Integrative Research Framework (AIRFrame), an open source, ontology-driven information system designed to ingest and analyze heterogeneous inputs of both published and unpublished data, and to identify and illustrate latent connections between research in astrobiology's diverse constituent fields.
Keywords: astrobiology, collaboration, interdisciplinary science
A public education tool for tsunami disasters based on walking tours in TDL BIBAKFull-Text 377-378
  Sayaka Imai; Yoshinari Kanamori; Nobuo Shuto
This paper proposes a public education tool for tsunami disasters based on the tsunami digital library (TDL).
Keywords: GPS mobile phones, public education, tsunami digital library
A search engine for Japanese academic papers BIBAKFull-Text 379-380
  Emi Ishita; Teru Agata; Atsushi Ikeuchi; Nozue Michiko; Miyata Yosuke; Shuichi Ueda
A search engine for Japanese academic papers rendered in PDF is described. Evaluation results indicate fewer zero-result queries and higher precision in the top-10 documents than was obtained for the same Japanese queries using Google Scholar or Scirus.
Keywords: PDF, academic papers, search engine
Analyzing viewing patterns while reading picture books BIBAKFull-Text 381-382
  Emi Ishita; Shinji Mine; Chihiro Kunimoto; Junko Shiozaki; Keiko Kurata; Shuichi Ueda
We examine the eye movements of children who can read books on their own as they read printed picture books. Our analysis focuses on two points: 1) Is it the pictures or the text that they most frequently gaze at? and 2) In what sequence do they read picture books? Our results indicate that children look at both text and pictures, but that there are large variations in the ratio of viewing time for each child. Both circular and linear patterns are found in the sequence of eye movements.
Keywords: eye tracking, viewing patterns
Personalizing information retrieval for people with different levels of topic knowledge BIBKFull-Text 383-384
  Jingjing Liu; Nicholas J. Belkin
Keywords: decision time, dwell time, personalization of IR, topic knowledge
Rethinking preservation validation with the preserved object and repository risks ontology (PORRO) BIBAKFull-Text 385-386
  Andrew McHugh; Mounia Lalmas
For securing digital longevity, the processes of preservation planning and evaluation are fundamentally implicit and share similar complexity. Means are required for the identification, documentation and association of those properties of data, representation and management mechanisms that in combination lend value, facilitate interaction and influence the preservation process. These properties may be almost limitless in terms of diversity, but are integral to the establishment of classes of risk exposure, and the planning and deployment of appropriate preservation strategies. We present PORRO, an ontology based approach for documenting objects, repositories and risk information, intended to support preservation decision making and evaluation.
Keywords: digital preservation, ontologies, validation
ForeCite: towards a reader-centric scholarly digital library BIBAKFull-Text 387-388
  Thuy Dung Nguyen; Min-Yen Kan; Dinh-Trung Dang; Markus Hänse; Ching Hoi Andy Hong; Minh-Thang Luong; Jesse Prabawa Gozali; Kazunari Sugiyama; Yee Fan Tang
We present ForeCite (FC), a prototype reader-centric digital library that supports the scholar in using scholarly documents. FC integrates three user interfaces: a bibliometric component, a document reader and annotation system, and a bibliographic management application.
Keywords: ForeCite, argumentative zoning, document logical structure, scholarly digital library
An architecture for a distributed digital library from the desktop up: the fascinator BIBAKFull-Text 389-390
  Peter Sefton; Duncan Dickinson
This poster describes the architecture of a new kind of digital repository service that includes components that run on desktop computers, designed to close the gap between Institutional Repositories (IRs) and the day-to-day electronic work environment used by researchers, and to address the too-often heard cry from repository managers of "we built it but they didn't come."
   The team at the Australian Digital Futures Institute are working with researchers to provide software that can (a) index and expose the research data content on their hard disks, (b) extract metadata from files, and (c) automatically process data according to highly configurable workflows, including producing web-ready renditions of research objects such as documents and domain-specific data visualizations (such as chemical molecules), and converting video and images so that they may be easily previewed.
   The architecture is inspired by the success of consumer software in two ways: the way entertainment programs organize content via faceted browse and search interfaces using embedded metadata, and the way photographic software allows content to be grouped into collections and pushed to online services, which are essentially repositories.
Keywords: information systems, repositories, research, search
A digital library architecture supporting massive small files and efficient replica maintenance BIBAKFull-Text 391-392
  Chunhui Shen; Weiming Lu; Jiangqin Wu; Baogang Wei
In this paper, we present a service infrastructure based on a distributed file system for massive storage in digital libraries. In addition, we address the small-file problem by merging small files into big ones, and propose a novel dynamic replica number adjustment scheme to ensure maximal availability and reliability in a limited storage space.
Keywords: digital libraries, distributed system, replication, small file
Text clustering with important words using normalization BIBAKFull-Text 393-394
  Shunyao Wu; Jinlong Wang; Huy Quan Vu; Gang Li
Important words, which usually appear in the Title, Subject and Keywords, can briefly reflect the main topic of a document. In recent years, it has become common practice to exploit the semantic topic of documents and use important words to achieve document clustering, especially for short texts such as news articles. This paper proposes a novel method to extract important words from the Subject and Keywords of articles, and then partition documents using only those important words. Because the frequencies of important words are usually low and the scale matrix dataset for important words is small, a normalization method is then proposed to normalize the scale dataset so that more accurate results can be achieved by sufficiently exploiting the limited information. The experiments validate the effectiveness of our method.
Keywords: document clustering, important words, normalization

Demonstrations

Liquid journals: scientific journals in the Web 2.0 era BIBAKFull-Text 395-396
  Marcos Baez; Alejandro Mussi; Fabio Casati; Aliaksandr Birukou; Maurizio Marchese
In this demo we introduce a platform and a model of the journal in the age of the Web, called the liquid journal. The goal of the model (and of the supporting platform) is to disseminate knowledge in the best possible way while also supporting scientists in credit attribution. In a nutshell, liquid journals are collections of "interesting" links to scientific contributions, such as papers, blogs, and datasets, that are related to certain topics. Content enters the journal either through queries over both conventional and non-conventional sources on the Web or manually through the group of editors. Liquid journals combine depth and breadth in bringing a wider spectrum of scientific contributions from different communities, while also focusing editors' and readers' attention on the things they care about. The demo illustrates the features and benefits of the proposed platform.
Keywords: Web, academic journals, enhanced search
Multiple sources with multiple portals: a demonstration of the ensemble computing portal in second life BIBAKFull-Text 397-398
  B. Stephen, II Carpenter; Richard Furuta; Frank Shipman; Allison Huie; Daniel Pogue; Edward A. Fox; Spencer Lee; Peter Brusilovsky; Lillian Cassel; Lois Delcambre
This demonstration is an overview of our Ensemble pathway project with group members on-location at the conference and in the virtual world of Second Life from remote locations providing a live walk-through tour of our project online. This approach allows the demonstration to extend beyond the allocated conference session as a means to attract people to JCDL/ICADL.
Keywords: computing portal, ensemble, second life, virtual worlds
Capturing and curating published data BIBAKFull-Text 399-400
  Tim DiLauro; Mark Cyzyk; Elliot Metsger; Mark Patton
Verifiability and reproducibility are core tenets of the scholarly communication process. For many scientific publications, however, it is often the case that supporting datasets are not preserved, even when the article text is. And when they are, it is usually as a collection of files without relationships amongst one another or to the articles with which they are associated. There are some existing approaches that attempt to link datasets with articles after the fact (e.g., NED), but they are relatively few and involve substantial human intervention.
   The Digital Research and Curation Center in the Johns Hopkins University Sheridan Libraries, in conjunction with its partners, has developed a proof-of-concept system that demonstrates an approach to capturing datasets during the process of submitting the associated article. As part of this process, linkages are established between the datasets and the article.
Keywords: OAI-ORE, data curation, data publication, data services, scholarly communication
OntoFrame S3: academic research information portal service using semantic web technologies and linguistic knowledge BIBAKFull-Text 401-402
  Seungwoo Lee; Mikyoung Lee; Pyung Kim; Hanmin Jung; Won-Kyung Sung
In this paper, we show how Semantic Web technologies, empowered by linguistic knowledge, can be used for information connection and fusion in an academic research information service.
Keywords: academic research information service, ontology, reasoning, semantic word network
Entertainment history museums in virtual worlds: video game and music preservation in second life BIBAKFull-Text 403-404
  Spencer Lee; Bradley Willis; Joseph S., Jr. Bourne; Edward A. Fox
This research explores and demonstrates the use of Second Life (the popular 3D virtual world) for the purpose of digitally preserving various aspects of video game and music history. Physical game interfaces like joysticks, advertisements used for games, and famous game characters and cultural icons from that history are displayed and preserved in multiple video game exhibits for different eras. Selected game characters are digitally recreated in 3D format as Second Life avatar appearances. Historical changes in musical instruments, musicians, and genres are displayed and preserved likewise. Selected musical instruments are digitally recreated as 3D models playing their real sounds. Some of them will be available for visitors to play in basic ways.
Keywords: 3D, digital preservation, entertainment, game, history, music, second life, virtual worlds
Integrating Greenstone with an interactive map visualizer BIBAKFull-Text 405-406
  Sam McIntosh; David Bainbridge
This extended abstract describes recent work in combining interactive map functionality with the Greenstone 3 digital library software research framework.
Keywords: digital library integration, interactive map visualizer
Subject metadata support powered by Maui BIBKFull-Text 407-408
  Olena Medelyan; Vye Perrone; Ian H. Witten
Keywords: keyword extraction, metadata extraction, subject heading extraction, web interface
Recommender system for MIR research community BIBAKFull-Text 409-410
  Yi Yu; Vincent Oria; J. Stephen Downie
In this demonstration, we show a recommender system for the Music Information Retrieval (MIR) research community. We extract the key topics and tags by analyzing the ten-year cumulative ISMIR proceedings, and recommend papers and research colleagues to users in an interactive way.
Keywords: ISMIR, music-IR, recommender systems, social networks