TPDL 2012: Proceedings of the International Conference on Theory and Practice of Digital Libraries

Fullname:TPDL 2012: Theory and Practice of Digital Libraries: Second International Conference
Editors:Panayiotis Zaphiris; George Buchanan; Edie Rasmussen; Fernando Loizides
Location:Paphos, Cyprus
Dates:2012-Sep-23 to 2012-Sep-27
Publisher:Springer Berlin Heidelberg
Series:Lecture Notes in Computer Science 7489
Standard No:DOI: 10.1007/978-3-642-33290-6; ISBN: 978-3-642-33289-0 (print), 978-3-642-33290-6 (online); hcibib: TPDL12
  1. User Behaviour
  2. Mobiles & Place
  3. Heritage and Sustainability
  4. Preservation
  5. Linked Data
  6. Analysing and Enriching Documents
  7. Content and Metadata Quality
  8. Folksonomy and Ontology
  9. Information Retrieval
  10. Organising Collections
  11. Extracting and Indexing
  12. Poster Papers
  13. Demonstration Papers

User Behaviour

What Would 'Google' Do? Users' Mental Models of a Digital Library Search Engine 1-12
  Michael Khoo; Catherine Hall
A mental model is a model that people have of themselves, others, the environment, and the things with which they interact, such as technologies. Mental models can support the user-centered development of digital libraries: if we can understand how users perceive digital libraries, we can design interfaces that take these perceptions into account. In this paper, we describe a novel method for eliciting a generic mental model from users, in this case of a digital library's search engine. The method is based on a content analysis of users' mental representations of the system's usability, which they generated in heuristic evaluations. The content analysis elicited features that the evaluators thought important for the search engine. The resulting mental model represents a generic model of the search engine, rather than a clustering of individuals' mental models of the same search engine. The model includes a number of references to Web search engines as ideal models, but these references are idealistic rather than realistic. We conclude that users' mental models of Web search engines should not be taken at face value. The implications of this finding for digital library development and design are discussed.
Keywords: human-computer interaction; human factors; mental models; search; search engine; users; user-centered design
An Exploration of ebook Selection Behavior in Academic Library Collections 13-24
  Dana McKay; Annika Hinze; Ralf Heese; Nicholas Vanderschantz; Claire Timpany; Sally Jo Cunningham
Academic libraries have offered ebooks for some time; however, little is known about how readers interact with them while making relevance decisions. In this paper we seek to address that gap by analyzing ebook transaction logs for books in a university library.
Keywords: ebooks; log analysis; book selection; HCI; information behavior
Information Seekers' Visual Focus during Time Constraint Document Triage 25-31
  Fernando Loizides
Time constraints are a commonly accepted limitation on a user's information seeking process. Physical time constraints can cause users to have a low tolerance for time-consuming information seeking tasks. This paper examines the effects of time constraints on the document triage process in an eye-tracked, lab-based study. Visual attention under three time constraints is reported. Similarities and differences to previous triage data are also reported, contributing to an ongoing research investigation of the general document triage process.
Keywords: Document Triage; Information Seeking; Relevance Decisions
Which Words Do You Remember? Temporal Properties of Language Use in Digital Archives 32-37
  Nina Tahmasebi; Gerhard Gossen; Thomas Risse
Knowing the behavior of terms in written texts can help us tailor models, algorithms and resources to improve access to digital libraries and help us answer information needs in archives spanning longer periods. In this paper we investigate the behavior of English written text in blogs in comparison to traditional texts from the New York Times, The Times Archive, and the British National Corpus. We show that user-generated content, similar to spoken content, differs in characteristics from 'professionally' written text and exhibits more dynamic behavior.

Mobiles & Place

Toward Mobile-Friendly Libraries: The Status Quo 38-50
  Dongwon Lee
As the number of users accessing web sites from their mobile devices rapidly increases, it becomes increasingly important for libraries to make their homepages "mobile-friendly." However, to the best of our knowledge, there has been little attempt to survey how ready existing libraries are for this upcoming mobile era and to quantitatively analyze the findings via data exploration methods. In this paper, using the W3C's tool, mobileOK, we characterize the mobile-friendliness of a comprehensive set of more than 400 libraries with respect to locations (e.g., world-wide vs. US vs. EU) and types (e.g., desktop vs. mobile). Based on our findings, we conclude that the majority of current libraries (regardless of locations and types) are not mobile-friendly at all (with low mobile-friendliness scores of 0.16-0.21). In addition, using mobilization tools, we demonstrate that the mobile-friendliness of library homepages can be improved significantly (i.e., 67%-82%). As such, much more effort is needed to make library homepages more mobile-friendly.
Listen to Tipple: Creating a Mobile Digital Library with Location-Triggered Audio Books 51-56
  Annika Hinze; David Bainbridge
This paper explores the role of audio as a means to access ebooks while the user is at the locations that are referred to in the books. The books are sourced from a digital library and can either be accompanied by pre-recorded audio or synthesized using text-to-speech. The paper discusses the implications of audio access for ebooks with particular reference to HCI challenges.
Re-finding Physical Documents: Extending a Digital Library into a Human-Centred Workplace 57-63
  Annika Hinze; Amay Dighe
It is often difficult for busy people to keep track of or re-find documents in their own workplace. Very few methods have been developed for finding a physical object's location in an office. Most of the existing methods require that some kind of structured approach be followed by the user. We created a "Human-Centred Workplace" system that does not require orderly users. The system embeds passive tags in documents and uses cameras in the office to track changes in the documents' locations. This paper introduces the design and implementation of the system, explores its use in an office environment and gives an initial evaluation of our prototypical implementation.

Heritage and Sustainability

User Needs for Enhanced Engagement with Cultural Heritage Collections 64-75
  Mark S. Sweetnam; Maristella Agosti; Nicola Orio; Chiara Ponchia; Christina M. Steiner; Eva-Catherine Hillemann; Micheál Ó Siochrú; Séamus Lawless
This paper presents research carried out in order to elicit user needs for the design and development of a digital library and research platform intended to enhance user engagement with cultural heritage collections. It outlines a range of user constituencies for this digital library. The paper outlines a taxonomy of intended users for this system and describes in detail the characteristics and requirements of these users for the facilitation and enhancement of their engagement with and use of textual and visual cultural artefacts.
Digital Library Sustainability and Design Processes 76-88
  Anne Adams; Pauline Ngimwa
This paper highlights the importance of sustainability in digital library design processes and frames these arguments within current digital library forums and literature. Sustainability of digital libraries is analysed through an empirical study of 10 best practice digital library projects across three African countries (Uganda, South Africa, Kenya). Through a retrospective review of the projects' design processes, the paper focuses on the role of technologies / platforms (bespoke, open source, proprietary, web 2 and mobile) in the sustainability of these systems. In-depth interviews with 38 stakeholders were triangulated against a documentary analysis and observational data and the findings integrated through a grounded theory analysis. The results identify the importance of flexibility in technologies that enable customization of educational digital resources to meet specific institutional and subject discipline needs. Comparative evidence is presented that highlights poor sustainability when inflexible systems do not consider scalability or maintenance issues.
Keywords: Sustainability; design processes; flexibility; African HE; case studies
Creation of Textual Versions of Historical Documents from Polish Digital Libraries 89-94
  Adam Dudczak; Milosz Kmieciak; Marcin Werla
This paper describes the results of initial work aimed at increasing the number and improving the quality of textual versions of the historical documents available in Polish digital libraries. The digital libraries community lacks tools that integrate the existing digitisation workflow with a customizable OCR engine and crowd-based text correction; this paper describes work on providing such a solution. Apart from today's state of the art in this field, this paper includes a description of the Virtual Transcription Laboratory (VTL) prototype, a crowdsourcing platform that utilizes the Tesseract OCR engine. The last chapter outlines results of the prototype's evaluation on a real-life dataset of historical documents from the IMPACT project. Results prove the applicability of the proposed solution as an enhancement of the digitisation workflow.
Increasing Recall for Text Re-use in Historical Documents to Support Research in the Humanities 95-100
  Marco Büchler; Gregory Crane; Maria Moritz; Alison Babeu
High precision text re-use detection allows humanists to discover where and how particular authors are quoted (e.g., the different sections of Plato's work that come in and out of vogue). This paper reports on on-going work to provide the high recall text re-use detection that humanists often demand. Using an edition of one Greek work that marked quotations and paraphrases from the Homeric epics as our testbed, we were able to achieve a recall of at least 94% while maintaining a precision of 73%. This particular study is part of a larger effort to detect text re-use across 15 million words of Greek and 10 million words of Latin available or under development as openly licensed TEI XML.
Keywords: historical text re-use; hypertextuality; Homer; Athenaeus

Preservation

PrEV: Preservation Explorer and Vault for Web 2.0 User-Generated Content 101-112
  Anqi Cui; Liner Yang; Dejun Hou; Min-Yen Kan; Yiqun Liu; Min Zhang; Shaoping Ma
We present the Preservation Explorer and Vault (PrEV) system, a city-centric multilingual digital library that archives and makes available Web 2.0 resources, and aims to store a comprehensive record of what urban lifestyle is like. To match the current state of the digital environment, a key architectural design choice in PrEV is to archive not only Web 1.0 web pages, but also Web 2.0 multilingual resources that include multimedia, real-time microblog content, as well as mobile application descriptions (e.g., iPhone app) in a collaborative manner. PrEV performs the preservation of such resources for posterity, and makes them available for programmatic retrieval by third party agents, and for exploration by scholars with its user interface.
Keywords: Preservation; Archive Visualization; API; Web 2.0; User-Generated Content; NExT; PrEV
Preserving Scientific Processes from Design to Publications 113-124
  Rudolf Mayer; Andreas Rauber; Martin Alexander Neumann; John Thomson; Gonçalo Antunes
Digital Preservation has so far focused mainly on digital objects that are static in their nature, such as text and multimedia documents. However, there is an increasing demand to extend the applications towards dynamic objects and whole processes, such as scientific workflows in the domain of E-Science. This calls for a revision and extension of current concepts, methods and practices. Important questions to address are e.g. what needs to be captured at ingest, how do the digital objects need to be described, which preservation actions are applicable and how can the preserved objects be evaluated. In this paper we present a conceptual model for capturing the required information and show how this can be linked to evaluating the re-invocation of a preserved process.
Keywords: Digital Preservation; Context
Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? 125-137
  Hany SalahEldeen; Michael L. Nelson
Social media content has grown exponentially in recent years and the role of social media has evolved from just narrating life events to actually shaping them. In this paper we explore how many resources shared in social media are still available on the live web or in public web archives. By analyzing six different event-centric datasets of resources shared in social media in the period from June 2009 to March 2012, we found about 11% lost and 20% archived after just a year and an average of 27% lost and 41% archived after two and a half years. Furthermore, we found a nearly linear relationship between time of sharing of the resource and the percentage lost, with a slightly less linear relationship between time of sharing and archiving coverage of the resource. From this model we conclude that after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.
Keywords: Web Archiving; Social Media; Digital Preservation
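The closing estimate of the abstract above can be read as a simple linear rule. The sketch below only restates that one sentence as arithmetic; the function name and the choice to reject inputs of less than one year are assumptions made for this illustration, not the authors' code, and the constants are the rounded figures quoted in the abstract.

```python
def estimated_percent_lost(days_since_sharing: float) -> float:
    """Estimate the percentage of shared resources lost after a given time.

    Uses the rounded figures from the abstract: about 11% lost after the
    first year, then a further 0.02% per day. The abstract states no rate
    for the first year, so earlier inputs are rejected rather than guessed.
    """
    first_year_days = 365
    if days_since_sharing < first_year_days:
        raise ValueError("the rule is only stated for one year or more")
    return 11.0 + 0.02 * (days_since_sharing - first_year_days)


# Example: the estimate exactly one year and two years after sharing.
for days in (365, 730):
    print(days, round(estimated_percent_lost(days), 2))
```

Because the constants are rounded, the rule should be treated as a rough guide rather than a reproduction of the measured averages reported in the abstract.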
Automatic Vandalism Detection in Wikipedia with Active Associative Classification 138-143
  Maria Sumbana; Marcos André Gonçalves; Rodrigo Silva Oliveira; Jussara M. Almeida; Adriano Veloso
Wikipedia and other free editing services for collaboratively generated content have quickly grown in popularity. However, the lack of editing control has made these services vulnerable to various types of malicious actions such as vandalism. State-of-the-art vandalism detection methods are based on supervised techniques, thus relying on the availability of large and representative training collections. Building such collections, often with the help of crowdsourcing, is very costly due to a natural skew towards very few vandalism examples in the available data as well as dynamic patterns. Aiming at reducing the cost of building such collections, we present a new active sampling technique coupled with an on-demand associative classification algorithm for Wikipedia vandalism detection. We show that our classifier enhanced with a simple undersampling technique for building the training set outperforms state-of-the-art classifiers such as SVMs and kNNs. Furthermore, by applying active sampling, we are able to reduce the need for training by almost 96% with only a small impact on detection results.
Applying Digital Library Technologies to Nuclear Forensics 144-149
  Electra Sutton; Chloe Reynolds; Fredric C. Gey; Ray R. Larson
Digital Libraries will enhance the value of forensic endeavors if they provide tools that enable data mining capabilities. In fact, collecting data without such tools can result in investigators becoming overwhelmed. Currently, the quantity of highly dangerous radioactive materials is increasing with the advancement of civilizations' scientific inventions. This creates a demand for an equivalently sophisticated forensics capability that prevents misuse and brings malicious intent to justice. Our forensics approach applies digital library and data mining techniques. Specifically, the forensic investigator will utilize our digital library system, which has been enhanced with advanced data mining query tools, in order to determine attribution of materials to their geographic sources and threat levels, enabling tracing and rating of smuggling activities.

Linked Data

Identifying References to Datasets in Publications 150-161
  Katarina Boland; Dominique Ritze; Kai Eckert; Brigitte Mathiak
Research data and publications are usually stored in separate and structurally distinct information systems. Often, links between these resources are not explicitly available, which complicates the search for previous research. In this paper, we propose a pattern induction method for the detection of study references in full texts. Since these references are not specified in a standardized way and may occur inside a variety of different contexts -- i.e., captions, footnotes, or continuous text -- our algorithm is required to induce very flexible patterns. To overcome the sparse distribution of training instances, we induce patterns iteratively using a bootstrapping approach. We show that our method achieves promising results for the automatic identification of data references and is a first step towards building an integrated information system.
Keywords: Digital Libraries; Information Extraction; Recognition of Dataset References; Iterative Pattern Induction; Bootstrapping
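The general idea of iterative pattern induction by bootstrapping, as named in the abstract above, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' algorithm: patterns are reduced to the single word on each side of a known name, candidate names are assumed to be runs of capitalised or numeric tokens, and the corpus and dataset names (ALPHA 2008, BETA 2010, Gamma 72) are invented.

```python
import re

# Assumption: a dataset name is a run of tokens that each start with a
# capital letter or a digit, e.g. "ALPHA 2008".
NAME = r"((?:[A-Z0-9][\w-]*)(?: [A-Z0-9][\w-]*)*)"

# Invented toy corpus; real full texts are far noisier.
CORPUS = [
    "We use data from ALPHA 2008 for the analysis.",
    "Results are based on data from BETA 2010 for all models.",
    "The BETA 2010 survey covers households.",
    "The Gamma 72 survey was also used.",
]


def induce_patterns(sentences, names):
    """Turn each occurrence of a known name into a (left word, right word) pattern."""
    patterns = set()
    for sentence in sentences:
        for name in names:
            regex = r"(\w+) " + re.escape(name) + r" (\w+)"
            for match in re.finditer(regex, sentence):
                patterns.add((match.group(1), match.group(2)))
    return patterns


def apply_patterns(sentences, patterns):
    """Collect every candidate name that appears between a known context pair."""
    found = set()
    for sentence in sentences:
        for left, right in patterns:
            regex = re.escape(left) + " " + NAME + " " + re.escape(right)
            for match in re.finditer(regex, sentence):
                found.add(match.group(1))
    return found


def bootstrap(sentences, seed_names, max_rounds=10):
    """Alternate pattern induction and extraction until no new names appear."""
    names = set(seed_names)
    for _ in range(max_rounds):
        patterns = induce_patterns(sentences, names)
        new_names = apply_patterns(sentences, patterns) - names
        if not new_names:
            break
        names |= new_names
    return names


print(sorted(bootstrap(CORPUS, {"ALPHA 2008"})))
```

Starting from one seed, each round learns new contexts from the names found so far, which is how a bootstrapping loop compensates for having very few labelled instances. A realistic system would also need to score patterns to limit drift, which this sketch omits.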
Collaborative Tagging of Art Digital Libraries: Who Should Be Tagging? -- A Case Study 162-172
  M. Mahoui; C. Boston-Clay; R. Stein; N. Tirupattur
Collaborative tagging is attracting a growing community in art museums, as manifested by several initiatives such as the Steve Museum project and the Posse initiative at the Brooklyn Museum. The driving force for these projects is the quest for increased and improved access to artifact collections such as art collections. Previous results of studying the nature of tags provided by users reveal that these tags have little overlap with museum documentation; on the other hand, there is good overlap with terms from vocabulary sources such as the Art and Architecture Thesaurus (AAT). This paper reports a case study that we performed where the aim was to include tags provided by "average" users from the broader community, not necessarily closely related to the art field, as was the focus of the previous studies. The study we performed comparing tags generated by average users, expert users and metadata seems to indicate the unique role that tags provided by average users would play in facilitating the interaction with art digital libraries.
Keywords: User tags; annotation; art digital libraries; metadata; comparative study
A System for Exposing Linguistic Linked Open Data 173-178
  Emanuele Di Buccio; Giorgio Maria Di Nunzio; Gianmaria Silvello
In this paper we introduce the Atlante Sintattico d'Italia, Syntactic Atlas of Italy (ASIt) enterprise, which is a linguistic project aiming to account for minimally different variants within a sample of closely related languages. One of the main goals of ASIt is to share linguistic data and make it re-usable. In order to create a universally available resource and be compliant with other relevant linguistic projects, we define a Resource Description Framework (RDF) model for the ASIt linguistic data, thus providing an instrument to expose these data as Linked Open Data (LOD). By exploiting RDF native capabilities we overcome the ASIt methodological and technical peculiarities and enable different linguistic projects to read, manipulate and re-use linguistic data.
Linking the Parliamentary Record: A New Approach to Metadata for Legislative Proceedings 179-184
  Richard Gartner
This paper discusses an on-going project which aims to develop an XML architecture for linking Parliamentary and other legislative proceedings. The project has developed a schema which allows key components of the record to be linked semantically and a set of controlled vocabularies to support these linkages. The project will convert two collections of proceedings to the schema and develop a prototype web-based union catalogue for them.
Keywords: metadata; Parliamentary proceedings; XML; controlled vocabularies

Analysing and Enriching Documents

A Ground Truth Bleed-Through Document Image Database 185-196
  Róisín Rowley-Brooke; François Pitié; Anil C. Kokaram
This paper introduces a new database of 25 recto/verso image pairs from documents suffering from bleed-through degradation, together with manually created foreground text masks. The structure and creation of the database are described, and three bleed-through restoration methods are compared in two ways: visually, and quantitatively using the ground truth masks.
Keywords: Document database; bleed-through; document restoration
Identifying "Soft 404" Error Pages: Analyzing the Lexical Signatures of Documents in Distributed Collections 197-208
  Luis Meneses; Richard Furuta; Frank Shipman
Collections of Web-based resources are often decentralized, leaving the task of identifying and locating removed resources to collection managers who must rely on HTTP response codes. When a resource is no longer available, the server is supposed to return a 404 error code. In practice, and to be friendlier to human readers, many servers respond with a 200 OK code and indicate in the text of the response that the document is no longer available. In the reported study, 3.41% of servers respond in this manner. To help collection managers identify these "friendly" or "soft" 404s, we developed two methods that use a Naïve Bayes classifier based on known valid responses and known 404 responses. The classifier was able to predict soft 404 pages with a precision of 99% and a recall of 92%. We will also elaborate on the results obtained from our study and will detail the lessons learned.
Keywords: Soft 404; Web resource management; distributed collections
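To make the classification step in the abstract above concrete, here is a minimal multinomial Naïve Bayes sketch trained on known valid responses and known 404 responses. It is an assumption-laden illustration, not either of the authors' two methods: the tokenizer, the add-one smoothing, the class labels and the four training texts are all invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Log prior plus the smoothed log likelihood of each token.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / (total + len(self.vocabulary)))
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Invented training responses; a real study would use crawled page bodies.
classifier = NaiveBayes()
classifier.train("Welcome to the digital collection of historical maps", "valid")
classifier.train("Browse photographs and manuscripts from the archive", "valid")
classifier.train("Sorry, the page you requested could not be found", "soft404")
classifier.train("This document is no longer available on this server", "soft404")

print(classifier.classify("The requested document is no longer available"))
```

The point of the sketch is that the decision rests on the text of a 200 OK response rather than on its status code, which is why a lexical model can catch pages that the HTTP layer reports as healthy.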
User-Defined Semantic Enrichment of Full-Text Documents: Experiences and Lessons Learned 209-214
  Annika Hinze; Ralf Heese; Alexa Schlegel; Markus Luczak-Rösch
Semantic annotation of digital documents is typically done at meta-data level. However, for fine-grained access, semantic enrichment of text elements or passages is needed. Automatic annotation is not of sufficient quality to enable focused search and retrieval: either too many or too few terms are semantically annotated. User-defined semantic enrichment allows for a more targeted approach. We developed a tool for semantic annotation of digital documents and conducted a number of studies to evaluate its acceptance by and usability for non-expert users. This paper discusses the lessons learned about both the semantic enrichment process and our methodology of exposing non-experts to semantic enrichment.
Semantic Document Selection -- Historical Research on Collections That Span Multiple Centuries 215-221
  Daan Odijk; Ork de Rooij; Maria-Hendrike Peetz; Toine Pieters; Maarten de Rijke; Stephen Snelders
The availability of digitized collections of historical data, such as newspapers, increases every day. With that, so does the wish for historians to explore these collections. Methods that are traditionally used to examine a collection do not scale up to today's collection sizes. We propose a method that combines text mining with exploratory search to provide historians with a means of interactively selecting and inspecting relevant documents from very large collections. We assess our proposal with a case study on a prototype system.

Content and Metadata Quality

Finding Quality Issues in SKOS Vocabularies 222-233
  Christian Mader; Bernhard Haslhofer; Antoine Isaac
The Simple Knowledge Organization System (SKOS) is a standard model for controlled vocabularies on the Web. However, SKOS vocabularies often differ in terms of quality, which reduces their applicability across system boundaries. Here we investigate how we can support taxonomists in improving SKOS vocabularies by pointing out quality issues that go beyond the integrity constraints defined in the SKOS specification. We identified potential quantifiable quality issues and formalized them into computable quality checking functions that can find affected resources in a given SKOS vocabulary. We implemented these functions in the qSKOS quality assessment tool, analyzed 15 existing vocabularies, and found possible quality issues in all of them.
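As an illustration of what a "computable quality checking function" over a vocabulary might look like, the sketch below runs two checks on a toy concept hierarchy. The two checks (concepts with no hierarchical links, and cycles in the broader relation) are examples chosen for this listing and are not taken from the abstract; the dictionary representation and all concept names are hypothetical, and none of this is qSKOS code.

```python
# Hypothetical toy vocabulary: concept id -> list of its broader concepts.
VOCABULARY = {
    "animals": [],
    "mammals": ["animals"],
    "cats": ["mammals"],
    "misc": [],            # linked to nothing
    "loop_a": ["loop_b"],  # loop_a and loop_b are each other's broader concept
    "loop_b": ["loop_a"],
}


def find_orphan_concepts(vocabulary):
    """Return concepts that have no broader concept and are nobody's broader concept."""
    referenced = {b for broader in vocabulary.values() for b in broader}
    return {
        concept
        for concept, broader in vocabulary.items()
        if not broader and concept not in referenced
    }


def find_cyclic_concepts(vocabulary):
    """Return concepts that can reach themselves by following broader links."""
    cyclic = set()
    for start in vocabulary:
        stack = list(vocabulary[start])
        seen = set()
        while stack:
            current = stack.pop()
            if current == start:
                cyclic.add(start)
                break
            if current in seen or current not in vocabulary:
                continue
            seen.add(current)
            stack.extend(vocabulary[current])
    return cyclic


print(sorted(find_orphan_concepts(VOCABULARY)))
print(sorted(find_cyclic_concepts(VOCABULARY)))
```

Each function returns the affected resources rather than a pass/fail flag, mirroring the abstract's description of functions that "find affected resources in a given SKOS vocabulary".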
On MultiView-Based Meta-learning for Automatic Quality Assessment of Wiki Articles 234-246
  Daniel Hasan Dalip; Marcos André Gonçalves; Marco Cristo; Pável Calado
The Internet has seen a surge of new types of repositories with free access and open collaborative editing. However, this large amount of information, made available democratically and virtually without any control, raises questions about its quality. In this work, we investigate the use of meta-learning techniques to combine sets of semantically related quality indicators (a.k.a. views) in order to automatically assess the quality of wiki articles. The idea is inspired by the combination of multiple (quality) experts. We perform a thorough analysis of the proposed multiview-based meta-learning approach in 3 collections. In our experiments, meta-learning was able to improve the performance of a state-of-the-art method in all tested datasets, with gains of up to 27% in quality assessment.

Folksonomy and Ontology

A Methodology for Folksonomy Evaluation 247-259
  Spyros Daglas; Constantia Kakali; Dionysis Kakavoulis; Marina Koumaki; Christos Papatheodorou
In recent years, folksonomies have been created and maintained in libraries and other information organizations alongside the traditional subject indexing systems. Folksonomies, consisting of tags, often express the "wisdom of the crowd". Despite their weaknesses and unstructured form, they reveal the language of users or even the terminology of the experts. This knowledge could be exploited in order to update and enrich the indexing vocabularies, thus improving information services. This paper deals with the design of an evaluation model for social tags. It introduces four quality indicators for an information service which offers social tagging functionalities, a weighted metric for the tag assessment process and a set of evaluation criteria that help information professionals select meaningful tags as new descriptors for a set of bibliographic records.
Keywords: social tagging; folksonomy; subject indexing; evaluation
Advanced Automatic Mapping from Flat or Hierarchical Metadata Schemas to a Semantic Web Ontology -- Requirements, Languages, Tools 260-272
  Justyna Walkowska; Marcin Werla
This paper is dedicated to the issue of automatic mapping from flat or hierarchical metadata schemas to Semantic Web data formats. It proposes a checklist of requirements for such mappings and, based on this checklist, tries to compare functionalities of existing mapping tools. Finally, it introduces jMet2Ont, an open source mapping tool created by PSNC during the SYNAT research project.
Keywords: Metadata Mapping; Semantic Web; Dublin Core; MARC21; CIDOC CRM; FRBRoo; EDM; OWL; RDF
Ontological Formalization of Scientific Experiments Based on Core Scientific Metadata Model 273-279
  Armand Brahaj; Matthias Razum; Frank Schwichtenberg
This paper describes an ontology for the representation of contextual information for laboratory-centered scientific experiments based on the Core Scientific Metadata Model. This information describes entities such as instruments, investigations, studies, researchers, and institutions that play a key role in the generation of research data, thus forming an important source for understanding the provenance of the data. Formalization of this information in the form of an ontology and reusing existing and well-established vocabularies foster the publication of research data and accompanying provenance metadata as Linked Open Data. Core Scientific Model Ontology (CSMO) is part of a larger effort, which includes data acquisition in the laboratory and semi-automated metadata generation. It is intended to support cataloging, data curation and data reuse. A formal definition of the RDF classes and properties introduced for CSMO is provided. We demonstrate the efficacy of this ontology by applying it to two different research domains.
Keywords: Ontologies; research data management; data cataloging; contextual information; scientific experiments; science study; CSMD; CSMO
Domain Analysis for a Video Game Metadata Schema: Issues and Challenges 280-285
  Jin Ha Lee; Joseph T. Tennis; Rachel Ivy Clarke
As interest in video games increases, so does the need for intelligent access to them. However, traditional organization systems and standards fall short. Through domain analysis and cataloging real-world examples while attempting to develop a formal metadata schema for video games, we encountered challenges in description. Inconsistent, vague, and subjective sources of information for genre, release date, feature, region, language, developer and publisher information confirm the importance of developing a standardized description model for video games.

Information Retrieval

A Benchmark for Content-Based Retrieval in Bivariate Data Collections 286-297
  Maximilian Scherer; Tatiana von Landesberger; Tobias Schreck
Huge amounts of various research data are produced and made publicly available in digital libraries. An important category is bivariate data (measurements of one variable versus another). Examples of bivariate data include observations of temperature and ozone levels (e.g., in environmental observation), domestic production and unemployment (e.g., in economics), or education and income levels (in the social sciences). For accessing these data, content-based retrieval is an important query modality. It allows researchers to search for specific relationships among data variables (e.g., quadratic dependence of temperature on altitude). However, such retrieval is to date a challenge, as it is not clear which similarity measures to apply. Various approaches have been proposed, yet no benchmarks to compare their retrieval effectiveness have been defined.
   In this paper, we construct a benchmark for retrieval of bivariate data. It is based on a large collection of bivariate research data. To define similarity classes, we use category information that was annotated by domain experts. The resulting similarity classes are used to compare several recently proposed content-based retrieval approaches for bivariate data, by means of precision and recall. This study is the first to present an encompassing benchmark data set and compare the performance of respective techniques. We also identify potential research directions based on the results obtained for bivariate data. The benchmark and implementations of similarity functions are made available, to foster research in this emerging area of content-based retrieval.
Keywords: bivariate data; benchmarking; content-based retrieval; feature extraction
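The abstract above compares retrieval approaches "by means of precision and recall" against expert-defined similarity classes. The following sketch shows how those two measures are computed for a single query; the function name and the document identifiers are invented for illustration, and treating the query's similarity class as the relevant set is an assumption about how such a benchmark would be scored.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: identifiers returned by the retrieval approach
    relevant:  identifiers in the same similarity class as the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result list and similarity class for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Precision is the share of returned items that belong to the query's class, and recall is the share of the class that was returned, so averaging both over many queries gives a comparable score for each similarity function.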
Web Search Personalization Using Social Data 298-310
  Dong Zhou; Séamus Lawless; Vincent Wade
Web search that utilizes social tagging data suffers from an extreme example of the vocabulary mismatch problem encountered in traditional Information Retrieval (IR). This is due to the personalized, unrestricted vocabulary that users choose to describe and tag each resource. Previous research has proposed the utilization of query expansion to deal with search in this rather complicated space. However, non-personalized approaches based on relevance feedback and personalized approaches based on co-occurrence statistics have only demonstrated limited improvements. This paper proposes an Iterative Personalized Query Expansion Algorithm for Web Search (iPAW), which is based on individual user profiles mined from the annotations and resources the user has marked. The method also incorporates a user model constructed from a co-occurrence matrix and from a Tag-Topic model where annotations and web documents are connected in a latent graph. The experimental results suggest that the proposed personalized query expansion method can produce better results than both the classical non-personalized search approach and other personalized query expansion methods. An "adaptivity factor" was further investigated to adjust the level of personalization.
Keywords: Personalized Web Search; Query Expansion; Social Data; Tag-Topic Model; Graph Algorithm
A Model for Searching Musical Scores by Instrumentation BIBAFull-Text 311-316
  Michel Beigbeder
We propose here a preliminary study on the definition of a search model that finds musical scores exactly or approximately matching a query, where the query defines the exact instrumentation wanted by the user. We define two versions of approximate matching. In the first, the ranking is done with a crisp matching of the instruments. In the second, we relax this constraint and use a similarity between instruments. We present a first experiment and outline future work.
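As a rough illustration of the two approximate matchings this abstract describes, the sketch below scores a query against a score's instrumentation in a crisp way (exact instrument overlap) and in a relaxed way (partial credit for similar instruments). It is not the author's model: the scoring formulas, the greedy pairing, and the similarity values are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical similarity values between distinct instruments
# (invented for illustration; not taken from the paper).
SIMILARITY = {
    frozenset({"violin", "viola"}): 0.8,
    frozenset({"flute", "oboe"}): 0.5,
}


def sim(a, b):
    """Similarity of two instruments: 1.0 if identical, table lookup otherwise."""
    if a == b:
        return 1.0
    return SIMILARITY.get(frozenset({a, b}), 0.0)


def crisp_score(query, score):
    """Crisp matching: exact multiset overlap, normalised by the larger ensemble."""
    overlap = sum((Counter(query) & Counter(score)).values())
    return overlap / max(len(query), len(score))


def relaxed_score(query, score):
    """Relaxed matching: greedily pair each query instrument with its most
    similar still-unused instrument in the score and sum the similarities."""
    remaining = list(score)
    total = 0.0
    for inst in query:
        if not remaining:
            break
        best = max(remaining, key=lambda other: sim(inst, other))
        total += sim(inst, best)
        remaining.remove(best)
    return total / max(len(query), len(score))


query = ["violin", "violin", "cello"]
candidate = ["violin", "viola", "cello"]
crisp = crisp_score(query, candidate)
relaxed = relaxed_score(query, candidate)
```

Under these assumptions the relaxed score ranks the string trio with a viola above a score that merely lacks the second violin, because the viola earns partial credit.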
Extending Term Suggestion with Author Names BIBAKFull-Text 317-322
  Philipp Schaer; Philipp Mayr; Thomas Lüke
Term suggestion or recommendation modules can help users to formulate their queries by mapping their personal vocabularies onto the specialized vocabulary of a digital library. When we examined actual user queries of the social sciences digital library Sowiport, we found that nearly one third of the users were explicitly looking for author names rather than terms. Common term recommenders neglect this fact. By picking up the idea of polyrepresentation, we could show that in a standardized IR evaluation setting we can significantly increase retrieval performance by adding topically related author names to the query. This positive effect only appears when the query is additionally expanded with thesaurus terms. When only the author names are added to a query, we often observe a query drift that leads to worse results.
Keywords: Term Suggestion; Query Suggestion; Evaluation; Digital Libraries; Query Expansion; Polyrepresentation

Organising Collections

Evaluating the Use of Clustering for Automatically Organising Digital Library Collections BIBAFull-Text 323-334
  Mark Hall; Paul Clough; Mark Stevenson
Large digital libraries have become available over the past years through digitisation and aggregation projects. These large collections present a challenge to the new user who wishes to discover what is available in the collections. Subject classification can help in this task; however, in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms provide a solution to this, but the question remains whether they produce clusters that are sufficiently cohesive and distinct to be used in supporting discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.
A Unique Arrangement: Organizing Collections for Digital Libraries, Archives, and Repositories BIBAKFull-Text 335-344
  Jeff Crow; Luis Francisco-Revilla; April Norris; Shilpa Shukla; Ciaran B. Trace
Digital libraries increasingly host collections that are archival in nature, and contain digitized and born-digital materials. In order to preserve the evidentiary value of these materials, the collection organization must capture the general context and preserve the relationships among objects. Archival processing is a well-established method for organizing collections this way. However, the current archival workflow leads to artificial boundaries between materials and delays in getting digitized content online because physical and born-digital materials are processed independently, and digitized materials not at all. In response, this work explores the approach of processing materials in a digitized form using a large multi-touch table. This alternative workflow provides the first step towards integrating the archival processing of digital and physical materials, and can expedite the process of making the materials available online. However, this approach demands high quality digitization and requires that archivists perform additional tasks like matching multi-sided, multi-paged documents.
Keywords: Multi-touch; archival processing; digitized materials
Mix-n-Match: Building Personal Libraries from Web Content BIBAKFull-Text 345-356
  Matthias Geel; Timothy Church; Moira C. Norrie
We present an approach to web content aggregation that allows information to be harvested from web pages, independent of specific markup languages. It builds on ideas from data warehousing and we present solutions to the well-known problems of data integration, namely detection of equivalences and data cleaning, adapted to this context. We describe how the content aggregation engine has been realised as an extensible framework in such a way that end-users as well as developers can use the associated tools to create personal libraries of content extracted from the web.
Keywords: content aggregation; data integration; data harvesting
Machine Learning in Building a Collection of Computer Science Course Syllabi BIBAKFull-Text 357-362
  Nakul Rathod; Lillian N. Cassel
Syllabi are rich educational resources. However, finding Computer Science syllabi on a generic search engine does not work well. Towards our goal of building a syllabus collection, we have trained various Decision Tree, Naive Bayes, Support Vector Machine and Feed-Forward Neural Network classifiers to distinguish Computer Science syllabi from other web pages. We have also trained our classifiers to distinguish between Artificial Intelligence and Software Engineering syllabi. Our best classifiers are 95% accurate at both tasks. We present an analysis of the various feature selection methods and classifiers we used, hoping to help others developing their own collections.
Keywords: Syllabus; Feature Selection; Text Classification; Machine Learning
PubLight: Managing Publications Using a Task-Oriented Approach BIBAKFull-Text 363-369
  Matthias Geel; Michael Nebeling; Moira C. Norrie
We report on the development of a powerful and task-oriented tool for the management of research publications. The work was motivated by a survey showing that researchers still rely heavily on basic tools such as text editors for managing bibliographic data. We present the approach as well as the resulting tool, PubLight, and compare the features of this tool with existing reference management systems.
Keywords: task-oriented information management; publications tool; reference management

Extracting and Indexing

Improved Bibliographic Reference Parsing Based on Repeated Patterns BIBAKFull-Text 370-382
  Guido Sautter; Klemens Böhm
Parsing details like author names and titles out of bibliographic references of scientific publications is an important issue. However, most existing techniques are tailored to the highly standardized reference styles used in the last two to three decades. Their performance tends to degrade when faced with the wider variety of reference styles used in older, historic publications. Thus, existing techniques are of limited use when creating comprehensive bibliographies covering both historic and contemporary scientific publications. This paper presents RefParse, a generic approach to bibliographic reference parsing that is independent of any specific reference style. Its core feature is an inference mechanism that exploits the regularities inherent in any list of references to deduce its format. Our evaluation shows that RefParse outperforms existing parsers both for contemporary and for historic reference lists.
Keywords: Parsing; Bibliography Data; Algorithms
Catching the Drift -- Indexing Implicit Knowledge in Chemical Digital Libraries BIBAKFull-Text 383-395
  Benjamin Köhncke; Sascha Tönnies; Wolf-Tilo Balke
In the domain of chemistry the information gathering process is highly focused on chemical entities. But due to synonyms and different entity representations the indexing of chemical documents is a challenging process. Considering the field of drug design, the task is even more complex. Domain experts from this field are usually not interested in any chemical entity itself, but in representatives of some chemical class showing a specific reaction behavior. For describing such a reaction behavior of chemical entities the most interesting parts are their functional groups. The restriction of each chemical class is somehow also related to the entities' reaction behavior, but further based on the chemist's implicit knowledge. In this paper we present an approach dealing with this implicit knowledge by clustering chemical entities based on their functional groups. However, since such clusters are generally too unspecific, containing chemical entities from different chemical classes, we further divide them into sub-clusters using fingerprint based similarity measures. We analyze several uncorrelated fingerprint/similarity measure combinations and show that the most similar entities with respect to a query entity can be found in the respective sub-cluster. Furthermore, we use our approach for document retrieval introducing a new similarity measure based on Wikipedia categories. Our evaluation shows that the sub-clustering leads to suitable results enabling sophisticated document retrieval in chemical digital libraries.
Keywords: chemical digital collections; document ranking; clustering
Using Visual Cues for the Extraction of Web Image Semantic Information BIBAFull-Text 396-401
  Georgina Tryfou; Nicolas Tsapatsoulis
Mining information for the images that currently exist in huge amounts on the web has been a main scientific interest during the past years. Several methods have been exploited, and web image information is extracted from textual sources such as image file names, anchor texts, existing keywords and, of course, surrounding text. However, the systems that attempt to mine information for images using surrounding text suffer from several problems, such as the inability to correctly assign all relevant text to an image and to discard the irrelevant text. A novel method for extracting web image information is discussed in the present paper. The proposed system uses visual cues in order to cluster a web page into several regions and assign to each hosted image the text that most probably refers to it. Three different approaches to the problem of text to image assignment are discussed and evaluated. The evaluation procedure indicates the advantages of using visual cues and two-dimensional Euclidean measures for extracting information for web images.

Poster Papers

Malleable Finding Aids BIBAKFull-Text 402-407
  Scott R. Anderson; Robert B. Allen
We show a prototype implementation of a Wiki-based Malleable Finding Aid that provides features to support user engagement and we discuss the contribution of individual features such as graphical representations, a table of contents, interactive sorting of entries, and the possibility for user tagging. Finally, we explore the implications of Malleable Finding Aids for collections which are richly inter-linked and which support a fully social Archival Commons.
Keywords: Archives; Finding Aid; User Engagement; Wiki
Improving Retrieval Results with Discipline-Specific Query Expansion BIBAKFull-Text 408-413
  Thomas Lüke; Philipp Schaer; Philipp Mayr
Choosing the right terms to describe an information need is becoming more difficult as the amount of available information increases. Search-Term-Recommendation (STR) systems can help to overcome these problems. This paper evaluates the benefits that may be gained from the use of STRs in Query Expansion (QE). We create 17 STRs, 16 based on specific disciplines and one giving general recommendations, and compare the retrieval performance of these STRs. The main findings are: (1) QE with specific STRs leads to significantly better results than QE with a general STR, (2) QE with specific STRs selected by a heuristic mechanism of topic classification leads to better results than the general STR, however (3) selecting the best matching specific STR in an automatic way is a major challenge of this process.
Keywords: Term Suggestion; Information Retrieval; Thesaurus; Query Expansion; Digital Libraries; Search Term Recommendation
An Evaluation System for Digital Libraries BIBAKFull-Text 414-419
  Alexander Nussbaumer; Eva-Catherine Hillemann; Christina M. Steiner; Dietrich Albert
Evaluation is an important task for digital libraries, because it reveals relevant information about their quality. This paper presents a conceptual and technical approach to support the systematic evaluation of digital libraries, together with a system that assists during the entire evaluation process in three ways. First, it allows for formally modelling the evaluation goals and designing the evaluation process. Second, it allows for data collection in a continuous and non-continuous, invasive and non-invasive way. Third, it automatically creates reports based on the defined evaluation models. An example evaluation is used to outline how the evaluation process can be designed and supported with this system.
Keywords: evaluation; evaluation system; digital libraries; continuous data collection; evaluation report
Enhancing Digital Libraries and Portals with Canonical Structures for Complex Objects BIBAFull-Text 420-425
  Scott Britell; Lois M. L. Delcambre; Lillian N. Cassel; Edward A. Fox; Richard Furuta
Individual digital library resources are of interest in their own right, but, in some domains, resources can be part of (perhaps multiple) complex objects. We focus on domains with complex objects where a digital library user can benefit from seeing and browsing a resource in the context of its structure(s). We introduce canonical structures that can represent local digital library structures; the canonical structures allow us to provide sophisticated browsing/navigation aids in a generic way. We evaluate a means to transfer the structure of our resources to a digital library portal. We implement and evaluate approaches based on OAI-PMH and OAI-ORE using Dublin Core -- with and without a custom namespace. We also transfer the canonical structure to a portal where our navigation widget is implemented.
Exploiting the Social and Semantic Web for Guided Web Archiving BIBAKFull-Text 426-432
  Thomas Risse; Stefan Dietze; Wim Peters; Katerina Doka; Yannis Stavrakas; Pierre Senellart
The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions, and other events. In this paper we present the ARCOMEM architecture that uses semantic information such as entities, topics, and events complemented with information from the social Web to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease the access and allow retrieval based on conditions that involve high-level concepts.
Keywords: Web Archiving; Web Crawler; Text Analysis; Social Web
Query Expansion of Zero-Hit Subject Searches: Using a Thesaurus in Conjunction with NLP Techniques BIBAKFull-Text 433-438
  Sarantos Kapidakis; Anna Mastora; Manolis Peponakis
The focus of our study is zero-hit queries in keyword subject searches and the effort of increasing recall in these cases by reformulating and, then, expanding the initial queries using an external source of knowledge, namely a thesaurus. To this end, the objectives of this study are twofold. First, we perform the mapping of query terms to the thesaurus terms. Second, we use the matched terms to expand the user's initial query by taking advantage of the thesaurus relations and implementing natural language processing (NLP) techniques. We report on the overall procedure and elaborate on key points and considerations of each step of the process.
Keywords: Query expansion; Thesaurus; Zero-hit queries; NLP techniques
Towards Digital Repository Interoperability: The Document Indexing and Semantic Tagging Interface for Libraries (DISTIL) BIBAKFull-Text 439-444
  Michael Khoo; Douglas Tudhope; Ceri Binding; Eileen G. Abels; Xia Lin; Diane Massam
The question of how to integrate diverse digital repositories into a unified information infrastructure, accessible and discoverable through simple interfaces, remains a central research issue for digital libraries. Many collections are described by specialized metadata, which currently has to be mapped and crosswalked to a standard format in order to be useful. However, this metadata work can be expensive and resource consuming. We describe work-in-progress with DISTIL (Document Indexing & Semantic Tagging Interface for Libraries) to support federated cross-collection search in humanities and the social sciences. DISTIL proposes to support interoperability by generating Dewey Decimal Classification 'tags' from individual metadata records. The resulting tags can then be used to support cross-collection browsing. We focus here on some of the initial pre-processing stages of the metadata workflow, which include cleaning and formatting metadata records, in order to extract terms that can then be used to generate the DDC tags. Some initial strategies for and issues with this workflow are described.
Keywords: dewey decimal classification; digital humanities; interoperability; metadata; social sciences; tagging
Aggregating Content for Europeana: A Workflow to Support Content Providers BIBAKFull-Text 445-454
  Valentina Vassallo; Marzia Piccininno
This document comes from the experiences of the authors as leaders of Work Packages about "Coordination of content" within the digital library projects to aggregate content to Europeana. In particular, it will focus on two projects, ATHENA and Linked Heritage (LH), with the definition of a workflow and a structured organization for content aggregation. The large amount of digital objects, coming from various European cultural institutions, has to be aggregated (at national level and/or for Europeana) creating good practices and implementing solutions to sustain the material aggregation in a long term perspective.
Keywords: Aggregation; Standards; Europeana
Diva: A Web-Based High-Resolution Digital Document Viewer BIBAKFull-Text 455-460
  Andrew Hankinson; Wendy Liu; Laurent Pugin; Ichiro Fujinaga
This paper introduces the Diva (Document Image Viewer with Ajax) project. Diva is a multi-page image viewer, designed for web-based digital libraries to present documents in a web browser. Key features of Diva include: "lazily loading" only the parts of the document the user is viewing, the ability to "zoom" in and out for viewing high-resolution page images, support for Pyramid TIFF or multi-resolution JPEG 2000 images, a multi-page "grid" view for page images, and HTML5 canvas support for document image rotation and brightness/contrast control. We briefly discuss the history and motivation behind its development, provide an overview of how it compares to other document image viewers, illustrate the different components of Diva and how it works, and provide examples of how this may be used in a digital library context.
Keywords: Document images; image viewer; web applications
Collaborative Authoring of Walden's Paths BIBAFull-Text 461-467
  Yuangling Li; Paul Logasa Bogen II; Daniel Pogue; Richard Furuta; Frank Shipman
This paper presents a prototype of an authoring tool that allows users to collaboratively build, annotate, manage, share and reuse collections of distributed resources from the World Wide Web. This extends the Walden's Paths project's work to help educators bring resources found on the World Wide Web into a linear contextualized structure. The introduction of a collaborative authoring feature fosters collaborative learning activities through social interaction among participants, where participants can coauthor paths in groups. In addition, the prototype supports path sharing, branching and reuse: individual participants can contribute private collections of knowledge resources to the group, and paths completed by a group can be shared among group members, so that participants can tailor, extend, reorder and/or replace nodes to create sub-versions of shared paths for different information needs.
Quantitative Analysis of Search Sessions Enhanced by Gaze Tracking with Dynamic Areas of Interest BIBAFull-Text 468-473
  Vu Tuan Tran; Norbert Fuhr
After presenting the ezDL experimental framework for the evaluation of user interfaces to digital library systems, we describe a new method for the quantitative analysis of user search sessions at a cognitive level. We combine system logs with gaze tracking, which is enhanced by a new framework for capturing dynamic areas of interest. This observation data is mapped onto a user action level. Then the user search process is modeled as a Markov-chain. The analysis not only allows for a better understanding of user behavior, but also points out possible system improvements.
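To make the Markov-chain step in this abstract concrete, here is a minimal sketch of estimating first-order transition probabilities from logged sequences of user actions. It is an illustration only, not the authors' implementation: the action labels, the session data, and the function name `transition_probabilities` are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from action sequences.

    Each session is a list of action labels in the order they occurred.
    Returns a dict mapping each action to a dict of next-action probabilities.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for actions in sessions:
        # Count every observed (current action, next action) pair.
        for current, nxt in zip(actions, actions[1:]):
            counts[current][nxt] += 1

    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: n / total for nxt, n in followers.items()}
    return probs


# Hypothetical action labels and sessions, invented for this example.
sessions = [
    ["query", "view_results", "view_detail", "query"],
    ["query", "view_results", "query", "view_results", "view_detail"],
]
probs = transition_probabilities(sessions)
```

With this toy data, a result list is followed by a detail view in two of three cases and by a new query in the third. Transitions of this kind are what such an analysis would inspect when looking for possible system improvements.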
Generating Content for Digital Libraries Using an Interactive Content Management System BIBAKFull-Text 474-479
  Uros Damnjanovic; Sorin Hermon
The goal of this paper is to present an interactive content management system for generating content of a digital library. The idea is to use interaction and data visualization techniques in the process of content generation, to check, understand and modify available information. We show the importance of interacting with data during the process of library creation and how this can lead to better quality of data. We present browsing functionalities for exploring relations within data that are used in the Human Sanctuary project. The set of developed tools can easily be extended and used for generating content in any other digital library project.
Keywords: Digital libraries; digital repositories; user interface; data visualization; human computer interaction
Enhancing the Curation of Botanical Data Using Text Analysis Tools BIBAKFull-Text 480-485
  Clare Llewellyn; Clare Grover; Jon Oberlander; Elspeth Haston
Automatic text analysis tools have significant potential to improve the productivity of those who organise large collections of data. However, to be effective, they have to be both technically efficient and provide a productive interaction with the user. Geographic referencing of historical botanical data is difficult and time-consuming, and relies heavily on the expertise of the curators. Botanical specimens that have poor quality labelling are often disregarded and the information is lost. This work highlights how automated analysis methods can be used to assist in the curation of a botanical specimen library.
Keywords: text analysis; text mining; geographical location; assisted curation; botany
Ranking Distributed Knowledge Repositories BIBAFull-Text 486-491
  Robert Neumayer; Krisztian Balog; Kjetil Nørvåg
Increasingly many knowledge bases are published as Linked Data, driving the need for effective and efficient techniques for information access. Knowledge repositories are naturally organised around objects or entities and constitute a promising data source for entity-oriented search. There is a growing body of research on the subject; however, it is almost always (implicitly) assumed that a centralised index of all data is available. In this paper, we address the task of ranking distributed knowledge repositories -- a vital component of federated search systems -- and present two probabilistic methods based on generative language modeling techniques. We present a benchmarking testbed based on the test suites of the Semantic Search Challenge series to evaluate our approaches. In our experiments, we show that both our ranking approaches provide competitive performance and offer a viable alternative to centralised retrieval.

Demonstration Papers

The CMDI MI Search Engine: Access to Language Resources and Tools Using Heterogeneous Metadata Schemas BIBAFull-Text 492-495
  Junte Zhang; Marc Kemps-Snijders; Hans Bennis
The CLARIN Metadata Infrastructure (CMDI) provides a solution for access to different types of language resources and tools across Europe. Researchers have different research data and tools, which are large-scale and described differently with domain-specific metadata. In the context of the Search & Develop (S&D) project at the Meertens Institute within CLARIN, we present a system description of an advanced search engine that semantically converges differently structured metadata records based on CMDI for search and retrieval. It allows different groups of users -- such as language researchers -- to search across yet unexplored research data and locate relevant data for new insights, and find existing tools that could provide novel use cases.
SIARD Archive Browser BIBAKFull-Text 496-499
  Arif Ur Rahman; Gabriel David; Cristina Ribeiro
SIARD Suite enables us to preserve a relational database in an open format. It migrates a relational database to SIARD format and preserves technical and contextual metadata along with the primary data ensuring long term accessibility.
   This paper introduces a web application, the SIARD Archive Browser, which allows operations on the archive such as searching for a specific record, counting records in a table containing a keyword, sorting by a column and making joins. In many use cases, the application avoids the need to load a preserved database to a DBMS.
Keywords: SIARD; database archiving; database preservation
PATHS -- Exploring Digital Cultural Heritage Spaces BIBAFull-Text 500-503
  Mark Hall; Eneko Agirre; Nikolaos Aletras; Runar Bergheim; Konstantinos Chandrinos; Paul Clough; Samuel Fernando; Kate Fernie; Paula Goodale; Jillian Griffiths; Oier Lopez de Lacalle; Andrea de Polo; Aitor Soroa; Mark Stevenson
Large amounts of digital cultural heritage (CH) information have become available over the past years, requiring more powerful exploration systems than just a search box. The PATHS system aims to provide an environment in which users can successfully explore a large, unknown collection through two modalities: following existing paths to learn about what is available and then freely exploring.
FrbrVis: An Information Visualization Approach to Presenting FRBR Work Families BIBAKFull-Text 504-507
  Tanja Mercun; Maja Zumer; Trond Aalberg
Although FRBR is becoming an important player in the bibliographic world, we have not seen many discussions or examples of how FRBR-based entities or relationships could best be displayed, explored or interacted with within a user interface. The paper presents a FrbrVis prototype as one possible approach to presenting FRBR-based bibliographic data using hierarchical information visualization structures and looks into how FRBR concepts have been implemented into an interactive user interface display.
Keywords: FRBR; Information Visualization; User Interface; Interaction
Metadata Enrichment Services for the Europeana Digital Library BIBAFull-Text 508-511
  Giacomo Berardi; Andrea Esuli; Sergiu Gordea; Diego Marcheggiani; Fabrizio Sebastiani
We demonstrate a metadata enrichment system for the Europeana digital library. The system allows the institutions that provide Europeana with pointers to their content (in the form of metadata records -- MRs) to enrich their MRs by classifying them under a classification scheme of their choice, and to extract/highlight entities of significant interest within the MRs themselves. The use of a supervised learning metaphor allows each content provider (CP) to generate classifiers and extractors tailored to the CP's specific needs, thus allowing the tool to be effectively available to the multitude (2000+) of Europeana CPs.
Collaboratively Creating a Thematic Repository Using Interactive Table-Top Technology BIBAKFull-Text 512-516
  Fernando Loizides; Christina Vasiliou; Andri Ioannou; Panayiotis Zaphiris
This paper reports on the design and development of a surface computing application in support of collaborative idea creation and thematic categorisation. C.A.R.T (Collaborative Assisted Repository for Tabletops) allows up to 4 users to simultaneously interact with virtual objects, each containing a single concept, to create thematic categories. Each object, which replicates a physical post-it on a multi-touch tabletop, is created by one of the team members either prior to the meeting or during the initial stage. The application then encourages debate and conversation by presenting the ideas one at a time for users to discuss and categorise. The resulting idea repository can be used for roadmap creation as well as comparative studies using further participants. The application's main task is similar to that of card sorting and affinity diagramming. We report on the functionality of the application, which was designed and developed following a user-centred approach.
Keywords: Digital Repositories; Idea Mapping; Tabletops