JCDL'14: Proceedings of the 2014 ACM/IEEE-CS Joint Conference on Digital Libraries

Fullname:Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries
Editors:George Buchanan; Martin Klein; Andreas Rauber; Sally Jo Cunningham
Location:London, England
Dates:2014-Sep-08 to 2014-Sep-12
Standard No:ISBN: 978-1-4503-2077-1; hcibib: DL14
  1. Preservation Strategies
  2. Recommendation and Indexing
  3. Publications impacts
  4. Personal DL Design
  5. Building Systems
  6. Browsing and Searching
  7. Item identification
  8. Quality Data and Metadata
  9. Topics, evolution, and relationships
  10. Knowledge infrastructure and repositories
  11. Data transformation and description
  12. Web archives and memory
  13. Citation, citation, citation
  14. Education and collaboration
  15. Other

Preservation Strategies

When should I make preservation copies of myself? 1-10
  Charles L. Cartledge; Michael L. Nelson
We investigate how different replication policies ranging from least aggressive to most aggressive affect the level of preservation achieved by autonomic processes used by web objects (WOs). Based on simulations of small-world graphs of WOs created using the Unsupervised Small-World algorithm, we report quantitative and qualitative results for graphs ranging in order from 10 to 5000 WOs. Our results show that a moderately aggressive replication policy makes the best use of distributed host resources by not causing spikes in CPU resources nor spikes in LAN activity while meeting preservation goals.
An argument for archiving Facebook as a heterogeneous personal store 11-20
  Catherine C. Marshall; Frank M. Shipman
A decade ago, the locus of activity for our digital belongings -- photos, email, videos, documents, and the like -- was on our personal computers. Now the situation is different. Not only is personal media born-digital, it may also spend its entire life stored online in social media services and cloud stores, and locally on portable devices. Studies have revealed that most people lack the requisite skills to archive their digital belongings, regardless of where they are stored; furthermore, people value the context offered by these large-scale, socially intertwined online stores. So why not archive the contents of a major social media service like Facebook to ensure the permanence of a meaningful portion of people's personal digital belongings? Rather than being delighted by this idea, participants in a study of digital ownership have expressed squeamishness about institutional efforts to archive social media: Facebook is not only viewed as private and vulnerable to violations of content ownership, but also as lacking long-term value. However, measures such as data embargoes, aggregation, and permissions mitigate participants' fears and objections to some extent. In this paper, we will use an example of biographical research, coupled with the results of a recent study, to argue that Facebook should be archived by a public institution.
Implementing Digital Preservation Strategy: Developing content collection profiles at the British Library 21-24
  Michael Day; Ann MacDonald; Maureen Pennock; Akiko Kimura
The British Library is increasingly a digital library. Through both digitization and acquisition, it has built up significant collections of digital content covering a very wide range of content types. Most recently, the extension of legal deposit provisions to non-print works in 2013 has meant that it -- working in conjunction with the other UK legal deposit libraries -- has begun to collect new categories of digital content, including periodic harvests of the UK Web domain. In order to support this, the Library has also invested heavily in developing scalable infrastructures for the acquisition, storage and management of large amounts of digital content. The British Library Digital Preservation Strategy, 2013-2016 focuses on embedding digital sustainability as an organizational principle across the Library and on helping to manage preservation risks and challenges across all digital collection content lifecycles. This practice paper describes work being undertaken by the Digital Preservation Team at the British Library to develop content profiles of high-level digital collections that will support the implementation of the strategy, in particular for the capture of long-term preservation requirements.
The Archival Acid Test: Evaluating archive performance on advanced HTML and JavaScript 25-28
  Mat Kelly; Michael L. Nelson; Michele C. Weigle
When preserving web pages, archival crawlers sometimes produce a result that varies from what an end-user expects. To quantitatively evaluate the degree to which an archival crawler is capable of comprehensively reproducing a web page from the live web into the archives, the crawlers' capabilities must be evaluated. In this paper, we propose a set of metrics to evaluate the capability of archival crawlers and other preservation tools using the Acid Test concept. For a variety of web preservation tools, we examine previous captures within web archives and note the features that produce incomplete or unexpected results. From there, we design the test to produce a quantitative measure of how well each tool performs its task.
Recommendation and Indexing

Recommendation based on Deduced Social Networks in an educational digital library 29-38
  Monika Akbar; Clifford A. Shaffer; Weiguo Fan; Edward A. Fox
Discovering useful resources can be difficult in digital libraries with large content collections. Many educational digital libraries (edu-DLs) host thousands of resources. One approach to avoiding information overload involves modeling user behavior. But this often depends on user feedback, along with the demographic information found in user account profiles, in order to model and predict user interests. However, edu-DLs often host collections with open public access, allowing users to navigate through the system without needing to provide identification. With few identifiable users, building models linked to user accounts provides insufficient data to recommend useful resources. Analyzing user activity on a per-session basis, to deduce a latent user network, can take place even without user profiles or prior use history. The resulting Deduced Social Network (DSN) can be used to improve DL services. An example of a DSN is a graph whose nodes are sessions and whose edges connect two sessions that view the same resource. In this paper we present a recommendation framework for edu-DLs that depends on deduced connections between users. Results show that a recommendation system built from DSN-dependent parameters can improve performance compared to when only text similarity between resources is used. Our approach can potentially improve recommendation for DL resources when implicit user activities (e.g., view, click, search) are abundant but explicit user activities (e.g., account creation, rating, comment) are unavailable.
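The DSN example given in the abstract above (nodes are sessions, edges join two sessions that viewed the same resource) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_dsn`, the layout of the session log, and the choice to weight each edge by the number of shared resources are all assumptions made here.

```python
from collections import defaultdict
from itertools import combinations


def build_dsn(session_views):
    """Return DSN edges as {(session_a, session_b): shared_resource_count}.

    The edge weight is an assumption of this sketch; the abstract only
    says that an edge connects two sessions that view the same resource.
    """
    # Invert the log: resource -> set of sessions that viewed it.
    sessions_by_resource = defaultdict(set)
    for session, resources in session_views.items():
        for resource in set(resources):
            sessions_by_resource[resource].add(session)

    # Every pair of sessions sharing a resource gets an edge; the weight
    # counts how many distinct resources the pair has in common.
    edges = defaultdict(int)
    for sessions in sessions_by_resource.values():
        for a, b in combinations(sorted(sessions), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical anonymous session log: session id -> resources viewed.
views = {
    "s1": ["r1", "r2"],
    "s2": ["r2", "r3"],
    "s3": ["r3"],
    "s4": ["r9"],
}
dsn = build_dsn(views)
```

Here `s4` stays isolated because no other session viewed `r9`, which shows that the graph can be built from implicit activity alone, without any account data.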
Dynamic taxonomy composition via keyqueries 39-48
  Tim Gollub; Michael Volske; Matthias Hagen; Benno Stein
This paper presents an unsupervised framework for dynamic, subject-oriented taxonomy composition in digital libraries, which can naturally integrate existing library classification systems. The taxonomy classes in our approach correspond to so-called keyqueries that are run against the digital library's full-text retrieval system. Given a document, a keyquery is a small set of keywords for which the document achieves a high relevance score. Keyqueries can hence be viewed as a general and concise description of the returned retrieval results. The keyquery framework addresses important problems of static classification systems: overlarge classes and overly complex taxonomy structures. If, for instance, a leaf class grows to an indigestible size, keyqueries for the contained documents provide a suitable split mechanism. Since queries are well-known to library users from their daily web search experience, they increase the structural complexity in a transparent way. The paper also presents a strategy for taxonomy-based library exploration. Given a user's information need in the form of library documents, we synthesize a hierarchy of keyqueries that covers this library subset. We manage to solve this difficult set covering problem on-the-fly by combining inverted and reverted indexes along with heuristic search space pruning within a map-reduce application. An empirical evaluation with an ACM collection of scientific papers demonstrates the efficiency and effectiveness of our taxonomy composition framework.
Personalised PageRank for making recommendations in digital cultural heritage collections 49-52
  Arantxa Otegi; Eneko Agirre; Paul Clough
In this paper we describe the use of Personalised PageRank (PPR) to generate recommendations from a large collection of cultural heritage items. Various methods for computing item-to-item similarities are investigated, together with representing the collection as a network over which random walks can be taken. The network can represent similarity between item metadata, item co-occurrences in search logs, and the similarity of items based on linking them to Wikipedia articles and categories. To evaluate the use of PPR, search logs from Europeana are used to simulate user interactions. PPR on each information source is compared to a standard retrieval-based baseline, resulting in higher performance.
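The random-walk idea behind PPR in the abstract above can be illustrated with a small power-iteration sketch. It is not the authors' system: the toy item graph, the function name `personalised_pagerank`, the damping value of 0.85, and the rule that dangling nodes return their mass to the seed items are all assumptions chosen for the illustration.

```python
def personalised_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PPR on {node: [neighbours]}.

    With probability (1 - damping) the walk restarts at a seed item
    instead of a uniformly random node, which is what personalises it.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    nxt[nb] += share
            else:
                # Dangling node (assumption): hand its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank


# Hypothetical item-similarity network; "a" is the item the user viewed.
items = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"], "e": []}
ranks = personalised_pagerank(items, seeds={"a"})
recommended = sorted((n for n in ranks if n != "a"), key=ranks.get, reverse=True)
```

Items close to the seed in the network receive more rank, so sorting the non-seed items by score yields the recommendation list. The same routine could be run on a metadata-similarity, search-log, or Wikipedia-link network simply by swapping the graph.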
Making research data findable in digital libraries: A layered model for user-oriented indexing of survey data 53-56
  Tanja Friedrich; Andreas Oskar Kempf
The growing amount of data in research and the desired culture of data sharing make it necessary to improve data documentation in digital libraries. On these grounds we present a conceptual model for subject indexing of research data. Taking the example of social science survey data, we examine the applicability of established indexing principles. Based on these principles, our research incorporates the special characteristics of social science survey data, leading us to a model of layered subject indexing.
Publications impacts

Characterizing scholar popularity: A case study in the Computer Science research community 57-66
  Glauber D. Goncalves; Flavio Figueiredo; Jussara M. Almeida; Marcos A. Goncalves
A lively debate among scholars concerns the popularity, productivity and impact of research. This paper aims to contribute to this discussion by quantifying the impact of various academic features on a scholar's popularity throughout her career. Using a list of over 2 million publications in the Computer Science research area obtained from two large digital libraries, we analyze how features that capture the number and rate of publications, number and quality of publication venues, and the importance of the scholar in the co-authorship network relate to the scholar's popularity. We also investigate the temporal dynamics of scholar popularity, identifying a few common profiles, and characterizing scholars in each profile according to their academic features.
Community-based endogamy as an influence indicator 67-76
  Thiago H. P. Silva; Mirella M. Moro; Ana Paula C. Silva; Wagner Meira; Alberto H. F. Laender
Evaluating researchers (individually or in groups) usually depends on qualifying their publications and influence. Here, we aid this crucial task by introducing two new metrics (C-Endo and Comb) that rely on the concept of endogamy for communities of authors who publish in conferences and journals, and produce patents. Endogamy here measures how tightly structured the groups of authors are within a community. We validate and evaluate the metrics by using real datasets, two ground-truth rankings and citation count. We also perform random sampling analysis to account for any imbalance in the ground-truth rankings. Overall, such a thorough evaluation shows that our metrics are successful in defining community-based endogamy as an influence indicator.
Disambiguating publication venue titles using association rules 77-86
  Denilson Alves Pereira; Eduardo Emanuel Braga da Silva; Ahmed A. A. Esmin
Research agencies in several countries evaluate the impact of the scientific publications of researcher groups to define their investments, and one of the main metrics used is the quality of the publication venues where their works were published. Several bibliometric indexes have been formulated to measure the quality of a publication venue. However, given a set of citations extracted, for example, from the curricula vitae of a researcher group, effectively using bibliometric indexes to evaluate their quality requires correctly identifying the publication venue title of each citation. This task is not easy, since there are no unique identifiers for publication venues. Frequently, citations contain abbreviated forms and acronyms; publication venues share similar titles; and sometimes they change their titles, divide or merge, creating new ones. Traditional digital libraries deal with this problem by creating Authority Files. In this work, we present a twofold contribution: (i) the creation of a Computer Science publication venue authority file and (ii) the proposal of a method that uses association rules to disambiguate publication venue titles originating from citations. The disambiguator is a supervised learning method that uses the authority file to train a classifier, whose generated model is a set of association rules to identify publication venues. Experiments show that our method obtains better results than three state-of-the-art baselines.
Personal DL Design

From user needs to opportunities in personal information management: A case study on organisational strategies in cross-media information spaces 87-96
  Sandra Trullemans; Beat Signer
The efficient management of our daily information in physical and digital information spaces is a well-known problem. Current research on personal information management (PIM) aims to understand and improve organisational and re-finding activities. We present a case study about organisational strategies in cross-media information spaces, consisting of physical as well as digital information. In contrast to existing work, we provide a unified view on organisational strategies and investigate how re-finding cues differ across the physical and digital space. We further introduce a new mixing organisational strategy which is used in addition to the well-known filing and piling strategies. Last but not least, based on the results of our study we discuss opportunities and pitfalls for future descriptive PIM research and outline some directions for future PIM system design.
PerCon: A personal digital library for heterogeneous data 97-106
  Su Inn Park; Frank Shipman
Systems are needed to support access to and analysis of large heterogeneous scientific datasets. We developed PerCon, a data management and analysis environment, to support such activities. PerCon processes and integrates data gathered via queries to existing data providers to create a personal digital library of data. Users may then search, browse, visualize and annotate the data as they proceed with analysis and interpretation. Interpretation in PerCon takes place in a visual workspace in which multiple data visualizations and annotations are placed into spatial arrangements based on the current task. The system watches for patterns in the user's data selection and organization and through mixed-initiative interaction assists users by suggesting potentially relevant data from unexplored data sources. PerCon's data location and analysis capabilities were evaluated in a controlled study with 24 users. Study participants had to locate and analyze heterogeneous weather and river data with and without the visual workspace and mixed-initiative interaction, respectively. Results indicate that the visual workspace facilitated information representation and aided in the identification of relationships between datasets. The system's suggestions encouraged data exploration, leading participants to identify more evidence of correlation among data streams and more potential interactions among weather and river data.
Social information behaviour in physical libraries: Implications for the design of digital libraries 107-116
  Annika Hinze; Hayat Alqurashi; Nicholas Vanderschantz; Claire Timpany; Saad Alzahrani
Physical bookshops and libraries are visited by both individuals and groups of patrons, while digital libraries are designed primarily for individual users. This paper reports on a study exploring the behaviour of groups of patrons in physical libraries, detailing their collaboration and communication during book searches. We aim to identify how characteristics such as location, time, environment, ambiance, layout and personal motivation play a role in a group's search and browsing behaviour. We report the findings of observations of group collaboration in academic and public libraries, and compare the observed book and library use techniques employed by patron groups. Further, we examine the support for group collaboration in digital libraries and discuss the implications of our observations for the design of digital libraries that support group collaboration and interaction among users. To that end, the paper suggests features and functions that could be added to DLs to enable asynchronous group communication and interaction.
Building Systems

Towards building a scholarly big data platform: Challenges, lessons and opportunities 117-126
  Zhaohui Wu; Jian Wu; Madian Khabsa; Kyle Williams; Hung-Hsuan Chen; Wenyi Huang; Suppawong Tuarob; Sagnik Ray Choudhury; Alexander Ororbia; Prasenjit Mitra; C. Lee Giles
We introduce a big data platform that provides various services for harvesting scholarly information and enabling efficient scholarly applications. The core architecture of the platform is built on a secured private cloud; it crawls data using a scholarly focused crawler that leverages a dynamic scheduler, processes the data with a map-reduce-based crawl-extraction-ingestion (CEI) workflow, and stores the results in distributed repositories and databases. Services such as scholarly data harvesting, information extraction, and user information and log data analytics are integrated into the platform and provided by an OAI and RESTful API. We also introduce a set of scholarly applications built on top of this platform, including citation recommendation and collaborator discovery.
Bridging the gap between real world repositories and Scalable Preservation Environments 127-136
  Bolette Ammitzboll Jurik; Asger Askov Blekinge; Rune Bruun Ferneke-Nielsen; Per Moldrup-Dalum
Integrating large scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved a daunting task. In this paper we show how this integration can be achieved using software developed in the SCAPE project. The SCAPE integration is based on four steps: retrieving the metadata records from the repository, reading the records and their references to data files, updating the records, and storing them back in the repository. This allows full use of the Hadoop system for massively distributed processing without causing excessive load on the repository.
Using ACM DL paper metadata as an auxiliary source for building educational collections 137-140
  Yinlin Chen; Edward A. Fox
Some digital libraries harvest metadata records from multiple content providers to build their collections. However, the quality and quantity of such metadata records are limited by what is harvested. To ensure collection growth, and to expand the scope beyond just what can be harvested, additional content acquisition methods are needed. Accordingly, we discuss how the Ensemble project (a pathway effort in the NSDL) is broadening its collection with the help of machine learning. Since Ensemble aims to aid computing education, we make use of ACM Digital Library records as a resource to help with transfer learning. We have built classifiers that can identify if a potential additional resource is about computing education. We approached this as a cross-domain text classification problem and developed suitable methods for feature extraction and bootstrapping for classifier training. Our experiments on three datasets of computing education metadata records show our approach can enhance the quality and quantity of records being added to Ensemble.
Crowd-sourcing Web knowledge for metadata extraction 141-144
  Zhaohui Wu; Wenyi Huang; Chen Liang; C. Lee Giles
We explore a new metadata extraction framework that requires no human annotators, with the ground truth harvested from the Web. A new training sample is selected based not only on its uncertainty and representativeness in the unlabeled pool, but also on its availability and credibility in Web knowledge bases. We construct a dataset of 4329 books with valid metadata and evaluate our approach using 5 Web book databases as oracles. Empirical results demonstrate its effectiveness and efficiency.
Browsing and Searching

Lend me some sugar: Borrowing rates of neighbouring books as evidence for browsing 145-154
  Dana McKay; Wally Smith; Shanton Chang
There is more to choosing a book than simply keyword searching. Browsing is a fundamental part of the information seeking process, and one that information seekers profess to value, though it has attracted little study. This dearth of research is undoubtedly in part because browsing is nebulous and difficult to quantify. In this paper we use a large circulation dataset from an academic library consortium to examine whether books in the library stacks are loaned in clusters, with a view firstly to confirming the existence of book browsing that has been reported anecdotally, and secondly to quantifying its impact on loan patterns.
Improving the visibility of geospatial data on the Web 155-164
  Javier Lacasta; Francisco J. Lopez-Pellicer; Walter Renteria-Agualimpia; Javier Nogueras-Iso
Geospatial information is a common resource used at personal and corporate levels for decision making. Nowadays, a relevant percentage of the geospatial data on the web is provided by standardized services. However, due to deficiencies in the service content descriptions, the data required for a task are not easy to find. To improve the description of geospatial information on the Web, this work proposes a process to construct a Linked Data model of geospatial resources that allows semantic searching and browsing. This is done by crawling the web in search of available geospatial services, and enriching their descriptions with concepts from common knowledge organization models. As a use case, we have created a Linked Data model describing the Web Map Services published by Spanish organizations.
The anatomy of a search and mining system for digital humanities 165-168
  Martyn Harris; Mark Levene; Dell Zhang; Dan Levene
Samtla (Search And Mining Tools with Linguistic Analysis) is an online integrated research environment designed in collaboration with historians and linguists to facilitate the study of digitised texts written in any language. It currently supports the research of two corpora: the Genizah collection held by the Taylor-Schechter Genizah Research Unit in Cambridge University, and a collection of Aramaic incantation texts from late antiquity. In contrast to standard search engines and text mining systems that rely on the bag-of-words representation of text, Samtla provides the retrieval and discovery of fuzzy text patterns/motifs (aka "formulae" to historians), which is achieved through applying a character-based n-gram statistical language model built on top of a powerful generalised suffix tree data structure. This paper briefly describes the major components of Samtla and their underlying techniques.
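To make the character-based n-gram language model mentioned in the abstract above more concrete, the sketch below scores query strings against a tiny corpus. It is only an illustration under stated assumptions: Samtla builds its model on a generalised suffix tree, whereas this sketch swaps in plain dictionary counts with add-one smoothing, and the class name `CharNgramModel`, the `^` padding symbol, and the sample text are hypothetical.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (a stand-in, not Samtla's)."""

    def __init__(self, text, n=3):
        self.n = n
        padded = "^" * (n - 1) + text  # "^" marks the start of the text
        self.vocab = set(padded)
        spans = range(len(padded) - n + 1)
        # Count each n-gram and the (n-1)-character context that precedes
        # its final character.
        self.ngrams = Counter(padded[i:i + n] for i in spans)
        self.contexts = Counter(padded[i:i + n - 1] for i in spans)

    def log_prob(self, query):
        """Sum of smoothed log P(char | previous n-1 chars) over the query."""
        padded = "^" * (self.n - 1) + query
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v))
        return total


# Hypothetical formulaic text standing in for a digitised corpus.
model = CharNgramModel("in the name of the lord in the name of the king")
score_match = model.log_prob("the name")
score_noise = model.log_prob("teh nmae")
```

Because the model works on characters rather than whole words, a query that shares most of its character sequences with the corpus scores higher than a scrambled one of the same length, which is the property that makes fuzzy pattern retrieval possible across languages.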
Increasing the visibility of library records via consortial search engine 169-172
  Onne Mets; Silvia Gstrein; Veronika Grundhammer
In this paper we describe a common search engine which currently comprises the records of public domain literature from 29 libraries across Europe. These libraries offer the EOD (eBooks on Demand) digitization-on-request service and make digitized materials available. The search engine (http://search.books2ebooks.eu) has been developed to enable users to browse the respective content simultaneously from several library catalogues. The current case study provides a description of the search engine, statistical trends of user engagement, and their implications. Our findings show the effectiveness of such collaboration, especially in terms of increasing the visibility of data and engaging new user groups. The lessons learned should encourage the library community, including smaller language groups, to cooperate more actively.
Item identification

Combining domain-specific heuristics for author name disambiguation 173-182
  Alan Filipe Santana; Marcos Andre Goncalves; Alberto H. F. Laender; Anderson Ferreira
Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but with the burden of having to rely on manually labelled training sets for the learning process. Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem. In this paper, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions and apply supervision only to optimize such parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios.
Detecting and modeling local text reuse 183-192
  David A. Smith; Ryan Cordel; Elizabeth Maddock Dillon; Nick Stramp; John Wilkerson
Texts propagate through many social networks and provide evidence for their structure. We describe and evaluate efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these techniques to two case studies: analyzing the culture of free reprinting in the nineteenth-century United States and the development of bills into legislation in the U.S. Congress. Using these divergent case studies, we evaluate both the efficiency of the approximate local text reuse detection methods and the accuracy of the results. These techniques allow us to explore how ideas spread, which ideas spread, and which subgroups shared ideas.
Identifying the same records across multiple Ukiyo-e image databases using textual data in different languages 193-196
  Biligsaikhan Batjargal; Takeo Kuyama; Fuminori Kimura; Akira Maeda
This paper proposes a novel method for identifying the same records across multiple databases in different languages. In order to identify the same records, we calculate the similarities between records by comparing the text values of metadata elements. The proposed method, i.e. finding the same records across multiple databases, will help users to know which organization has a certain record and its customized versions regardless of languages and differences in formats. Although the proposed approach was demonstrated on Japanese Ukiyo-e databases, it might be applicable to other disciplines for bridging the gaps between databases in different languages.
Reducing computational effort for plagiarism detection by using citation characteristics to limit retrieval space 197-200
  Norman Meuschke; Bela Gipp
This paper proposes a hybrid approach to plagiarism detection in academic documents that integrates detection methods using citations, semantic argument structure, and semantic word similarity with character-based methods to achieve a higher detection performance for disguised plagiarism forms. Currently available software for plagiarism detection exclusively performs text string comparisons. These systems find copies, but fail to identify disguised plagiarism, such as paraphrases, translations, or idea plagiarism. Detection approaches that consider semantic similarity on word and sentence level exist and have consistently achieved higher detection accuracy for disguised plagiarism forms compared to character-based approaches. However, the high computational effort of these semantic approaches makes them infeasible for use in real-world plagiarism detection scenarios. The proposed hybrid approach uses citation-based methods as a preliminary heuristic to reduce the retrieval space with a relatively low loss in detection accuracy. This preliminary step can then be followed by a computationally more expensive semantic and character-based analysis. We show that such a hybrid approach allows semantic plagiarism detection to become feasible even on large collections for the first time.
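The filter-then-refine structure described in the abstract above can be sketched as follows. This is a hedged illustration, not the authors' method: counting shared references stands in for their citation-based heuristics, Jaccard overlap of character 5-grams stands in for the expensive semantic and character-based analysis, and the function names, the threshold, and the sample documents are all hypothetical.

```python
def char_ngrams(text, n=5):
    """Set of character n-grams after lowercasing and whitespace normalisation."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect(query, collection, min_shared_citations=2):
    """Return ([(similarity, doc_id), ...] best first, number of candidates)."""
    # Step 1 (cheap): keep only documents sharing enough references with
    # the query. Shared-reference count is an assumption of this sketch.
    candidates = [
        doc for doc in collection
        if len(query["citations"] & doc["citations"]) >= min_shared_citations
    ]
    # Step 2 (expensive): compare text only for the surviving candidates.
    query_grams = char_ngrams(query["text"])
    scored = [(jaccard(query_grams, char_ngrams(doc["text"])), doc["id"])
              for doc in candidates]
    return sorted(scored, reverse=True), len(candidates)


# Hypothetical collection; citation ids are placeholders.
collection = [
    {"id": "d1", "citations": {"c1", "c2", "c3"},
     "text": "Digital libraries preserve scholarly records for future readers."},
    {"id": "d2", "citations": {"c7", "c8"},
     "text": "Digital libraries preserve scholarly records for future readers."},
    {"id": "d3", "citations": {"c1", "c2", "c9"},
     "text": "River gauges report water levels every hour."},
]
query = {"citations": {"c1", "c2", "c3", "c4"},
         "text": "Digital libraries preserve scholarly records for later readers."}
results, n_candidates = detect(query, collection)
```

Only two of the three documents reach the text comparison, which is where the computational saving comes from. Document `d2` is dropped despite its near-identical text because it shares no references with the query, illustrating the loss in detection accuracy that the abstract acknowledges.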
Quality Data and Metadata

Quality assessment of collaborative content with minimal information 201-210
  Daniel H. Dalip; Harlley Lima; Marcos Andre Goncalves; Marco Cristo; Pavel Calado
Content generated by users is one of the most interesting phenomena of published media. However, the possibility of unrestricted editing is a source of doubts about its quality. This issue has motivated many studies on how to automatically assess content quality in collaborative web sites. Generally, these studies use machine learning techniques to combine a large number of quality indicators into a single value representing the overall quality of the document. This need for a high number of indicators, however, has detrimental implications for both the efficiency and the effectiveness of the quality assessment algorithms. In this work, we exploit and extend a feature selection method based on the SPEA2 multi-objective genetic algorithm. Results show that we can reduce the feature set to 15% to 25% of the original, while obtaining error rates comparable to the state of the art.
Fast Image-based Chinese Calligraphic Character Retrieval on Large Scale Data 211-220
  Pengcheng Gao; Jiangqin Wu; Yuan Lin; Yang Xia; Mao Tianjiao; Wei Baogang
Chinese calligraphy, the art of handwriting, draws a lot of attention for its beauty and elegance. In CADAL, a Calligraphic Character Dictionary (CCD), which contains hundreds of thousands of character images labeled with semantic meaning, has been constructed and provided online to common users. It is a great challenge to perform quick and accurate image-based calligraphic character retrieval on CCD. In this paper, a novel shape descriptor, Oriented Shape Context (OSC), is proposed based on the traditional Shape Context (SC) to perform similarity searching. Together with GIST, the GIST-OSC descriptor is proposed to represent calligraphic character images for efficient and effective retrieval. In addition, an effective retrieval schema is proposed that works in two steps. First, approximate nearest neighbors of the query image are found quickly using GIST; then one-to-one fine matching between the approximate nearest neighbors and the query image is performed using OSC. Our experiments show that the GIST-OSC descriptor and the retrieval schema are efficient and effective for Chinese calligraphic character retrieval on large scale data.
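The two-step retrieval schema in the abstract above (a fast coarse search followed by one-to-one fine matching on a shortlist) can be sketched generically. The sketch does not compute GIST or OSC: each item simply carries a short "coarse" vector and a longer "fine" vector as hypothetical stand-ins, the coarse step is an exact scan rather than an approximate nearest-neighbour index, and all names and values are assumptions.

```python
import heapq
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def two_step_retrieve(query, database, shortlist_size=2, top_k=1):
    """Coarse shortlist first, then fine re-ranking of the shortlist only."""
    # Step 1: the cheap coarse descriptor (standing in for GIST) selects
    # a small shortlist of nearest neighbours.
    shortlist = heapq.nsmallest(
        shortlist_size, database,
        key=lambda item: euclidean(query["coarse"], item["coarse"]),
    )
    # Step 2: the costly fine descriptor (standing in for OSC) is compared
    # one-to-one against the shortlist, never against the whole database.
    ranked = sorted(shortlist,
                    key=lambda item: euclidean(query["fine"], item["fine"]))
    return [item["label"] for item in ranked[:top_k]]


# Hypothetical descriptors for three character images.
database = [
    {"label": "char_a", "coarse": [0.1, 0.0], "fine": [5.0, 5.0, 5.0, 5.0]},
    {"label": "char_b", "coarse": [0.2, 0.1], "fine": [1.0, 1.0, 1.0, 1.1]},
    {"label": "char_c", "coarse": [3.0, 3.0], "fine": [1.0, 1.0, 1.0, 1.0]},
]
query = {"coarse": [0.0, 0.0], "fine": [1.0, 1.0, 1.0, 1.0]}
best = two_step_retrieve(query, database)
```

In this toy data `char_a` is the closest item under the coarse descriptor, but fine matching promotes `char_b`. The item `char_c` never reaches the fine step, so the shortlist size controls the trade-off between speed and the risk of missing a good match.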
Human and machine error analysis on dependency parsing of ancient Greek texts BIBAKFull-Text 221-224
  Saeed Majidi; Gregory Crane
Automatically generated metadata from large collections is an essential component of digital libraries. It is beginning to emerge as fundamental to the study of languages. Morphosyntactic annotation captures the form of individual words and their function. Nonetheless automated syntactic analysis is still imperfect and human annotators can be significantly more accurate. On the other hand, human work is expensive and even humans find some constructions difficult to annotate correctly. Comparing the performance of human annotators with that of an automatic parser is thus important for exploring how the two methods can best be combined. In the present study, we compare the frequency of the different types of errors made by student annotators with those made by different dependency parsers when annotating ancient Greek. With a few exceptions, the frequency of the different types of errors was similar for human and machine. The significance of these results is briefly discussed.
The feasibility of investing in manual correction of metadata for a large-scale digital library BIBAKFull-Text 225-228
  Hung-Hsuan Chen; Madian Khabsa; C. Lee Giles
Given a large-scale digital library that automatically crawls and parses PDF files to generate metadata for documents and authors, we estimate the number of person-hours required to correct a small portion of the metadata, in the hope that a large portion of users can benefit from these corrections. We obtain user requests by analyzing CiteSeerX's log files from September 2009 to March 2013. We found that the distribution of user requests for search is highly imbalanced: most document search queries and author search queries concentrate on a small set of terms. As a result, even for a large-scale digital library, we estimate it is affordable to invest a few person-hours to check the correctness of a small number of metadata records, and thus provide benefits to a good portion of document search and author search requests.
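The argument in this abstract rests on simple arithmetic: when requests are concentrated on a few terms, correcting the metadata behind those terms covers a large share of all requests. A minimal sketch of that calculation follows; the function name `coverage_of_top_k` and the sample request log are hypothetical and are not drawn from the CiteSeerX data.

```python
from collections import Counter


def coverage_of_top_k(requests, k):
    """Share of all requests that target the k most requested terms."""
    if not requests:
        return 0.0
    counts = Counter(requests)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(requests)


# Invented, deliberately skewed request log: one term dominates.
sample_log = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
share_top_1 = coverage_of_top_k(sample_log, 1)
share_top_2 = coverage_of_top_k(sample_log, 2)
```

With this invented log, checking one term out of four already covers more than half of the requests, which mirrors the shape of the claim without reproducing any figure from the paper.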
Topics, evolution, and relationships

A framework for analyzing semantic change of words across time BIBAKFull-Text 229-238
  Adam Jatowt; Kevin Duh
Recently, large amounts of historical texts have been digitized and made accessible to the public. Thanks to this, it has become possible for the first time to analyze the evolution of language through the use of automatic approaches. In this paper, we show the results of an exploratory analysis aiming to investigate methods for studying and visualizing changes in word meaning over time. In particular, we propose a framework for exploring semantic change at the lexical level, at the contrastive-pair level, and at the sentiment orientation level. We demonstrate several kinds of NLP approaches that altogether give users a deeper understanding of word evolution. We use two diachronic corpora that are currently the largest available historical language corpora. Our results indicate that the task is feasible and that satisfactory outcomes can already be achieved by using simple approaches.
Representing topic labels for exploring digital libraries BIBAKFull-Text 239-248
  Nikolaos Aletras; Timothy Baldwin; Jey Han Lau; Mark Stevenson
Topic models have been shown to be a useful way of representing the content of large document collections, for example via visualisation interfaces (topic browsers). These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is using a set of keywords, i.e. the top-n words with highest marginal probability within the topic. However, alternative topic representations have been proposed, including textual and image labels. In this paper, we compare different topic representations, i.e. sets of topic words, textual phrases and images, in a document retrieval task. We asked participants to retrieve relevant documents based on pre-defined queries within a fixed time limit, presenting topics in one of the following modalities: (1) sets of keywords, (2) textual labels, and (3) image labels. Our results show that textual labels are easier for users to interpret than keywords and image labels. Moreover, the precision of retrieved documents for textual and image labels is comparable to the precision achieved by representing topics using sets of keywords, demonstrating that labelling methods are an effective alternative topic representation.
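The keyword representation mentioned in this abstract (the top-n words with the highest probability within a topic) can be sketched in a few lines. The function name `top_n_words` and the sample distribution are hypothetical; ties are broken alphabetically here purely to make the output deterministic.

```python
def top_n_words(topic_word_probs, n):
    """Return the n words with the highest probability in one topic.

    topic_word_probs maps each word to its probability within the topic.
    """
    ranked = sorted(topic_word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Invented word distribution for a single topic.
topic = {"library": 0.30, "archive": 0.25, "metadata": 0.20, "the": 0.05, "web": 0.20}
keywords = top_n_words(topic, 3)
```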
Topical establishment leveraging literature evolution BIBAKFull-Text 249-252
  Han Xu; Eric Martin; Ashesh Mahidadia
From an evolutionary perspective, a body of research is an evolving ecosystem, consisting of research topics subjected to a form of natural selection as topics come into existence, and thrive more or less over a variable period of time. Identifying the form of establishment of a given topic in a scientific domain, in terms of its momentum at the time of inquiry, can provide useful insights into where this topic is heading, and can facilitate effective literature research. Here we propose to identify three forms of establishment of topics, emerging from a comparison between two different methodologies in ranking papers, taking advantage of the mutual relationship between recognition of papers and recognition of topics. More specifically, by analysing the correlation between the rankings obtained by applying both methodologies, we discover three clusters of topics, each of which is associated with a particular momentum of establishment.
Method for supporting analysis of personal relationships through place names extracted from documents BIBAKFull-Text 253-256
  Fuminori Kimura; Akira Maeda
Visualizing information extracted from text is helpful for intuitively understanding the information. Extracting and visualizing personal relationships from text is one of the promising applications of this approach. Existing methods usually estimate personal relationships from direct co-occurrences of personal names that appear in a text. In our previous work, we proposed a method for extracting personal relationships from indirect co-occurrence relationships obtained through place names. This method can estimate the relationships among persons who do not necessarily have direct relationships. These relationships are visualized in a network graph. However, it becomes difficult to grasp the relationships when the number of persons increases. In this paper, we propose a method that supports analyzing the extracted personal relationships through place names and that is based on our previous work. Our goal is to support analysis by providing the information of the clustering of closely related people and important place names for each cluster. The proposed method was applied to a Japanese historical chronicle written in the 12th century. Experimental results showed a strong correspondence to the known historical facts. The results also indicate that the proposed method might be able to uncover the characteristics of people whose histories are not clearly known yet.
Knowledge infrastructure and repositories

The ups and downs of knowledge infrastructures in science: Implications for data management BIBAKFull-Text 257-266
  Christine L. Borgman; Peter T. Darch; Ashley E. Sands; Jillian C. Wallis; Sharon Traweek
The promise of technology-enabled, data-intensive scholarship is predicated upon access to knowledge infrastructures that are not yet in place. Scientific data management requires expertise in the scientific domain and in organizing and retrieving complex research objects. The Knowledge Infrastructures project compares data management activities of four large, distributed, multidisciplinary scientific endeavors as they ramp their activities up or down; two are big science and two are small science. Research questions address digital library solutions, knowledge infrastructure concerns, issues specific to individual domains, and common problems across domains. Findings are based on interviews (n=113 to date), ethnography, and other analyses of these four cases, studied since 2002. Based on initial comparisons, we conclude that the roles of digital libraries in scientific data management often depend upon the scale of data, the scientific goals, and the temporal scale of the research projects being supported. Digital libraries serve immediate data management purposes in some projects and long-term stewardship in others. In small science projects, data management tools are selected, designed, and used by the same individuals. In the multi-decade time scale of some big science research, data management technologies, policies, and practices are designed for anticipated future uses and users. The need for library, archival, and digital library expertise is apparent throughout all four of these cases. Managing research data is a knowledge infrastructure problem beyond the scope of individual researchers or projects. The real challenges lie in designing digital libraries to assist in the capture, management, interpretation, use, reuse, and stewardship of research data.
CED²AR: The Comprehensive Extensible Data Documentation and Access Repository BIBAKFull-Text 267-276
  Carl Lagoze; Lars Vilhuber; Jeremy Williams; Benjamin Perry; William C. Block
We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED²AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while reusing and linking to existing archive and provider generated metadata. CED²AR is distinguished from other metadata repository-based applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains.
Big Brother is Watching You -- But in a Good Way BIBAKFull-Text 277-280
  Carlin St. Pierre; David Bainbridge; Bill Rogers
In any modern desktop environment the glyph compositor -- where raw text information is combined with font information and other attributes to render rasterized component images -- is part of the software's core functionality. In this paper we present work that shows it is computationally feasible to apply full-text indexing in real-time to the live stream of glyph compositor operations generated by a user's interaction with their desktop environment. By embedding indexing functionality at such a level, we effectively get to "see" (and more importantly remember) all the text that is drawn on the user's screen. With elements reminiscent of the Memex, we illustrate the technique in use through a personal digital library we have developed that enriches (through text-searching and context) the user's desktop experience by letting them go back in time to view information that had previously been displayed. We achieved this by augmenting our dynamically updated text index with time-stamped snapshots of the desktop. By recording the (x, y) positions of the text at the time it is rendered, the snapshots have a semi-live feel, whereby text can be selected for copy-and-paste operations for further use. Moreover, windows -- even if they were hidden behind others at the time the text was rendered -- can be brought to the front and their text accessed.
A comparative analysis of disciplinary data management workflows BIBAKFull-Text 281-284
  Sunje Dallmeier-Tiessen; Artemis Lavasa; Patricia Herterich; Laura Rueda; Rachael Kotarski; Elizabeth Newbold
Datasets are now an integral part of scholarly communication. The result is that research data has now become a reality in library and information science, and its curation requires dedicated workflows. Here, we compare two disciplinary examples from High-Energy Physics and Humanities and Social Sciences, both referenced to the OAIS conceptual model. Even though we know that the research datasets and their metadata (preparation and curation) are very different in both disciplines, it can be seen that the conceptual workflow models are very similar, including the assignment of persistent identifiers (PIDs). The latter is particularly interesting when discussing the design and implementation of transdisciplinary services in library and information science.
An Open Cultural Digital Content Infrastructure BIBAKFull-Text 285-288
  Ioanna-Ourania Stathopoulou; Panagiotis Stathopoulos; Haris Georgiadis; Nikos Houssos; Vangelis Banos; Evi Sachini
We present an Open Cultural Digital Content Infrastructure, a platform providing a coherent suite of loosely-coupled services that aim to promote metadata quality in repositories and facilitate metadata and digital content reuse. The key functions of the infrastructure are the aggregation of metadata and digital files and the automatic validation of metadata records and digital material for compliance with desired quality specifications. The system, which has recently moved to production, is currently being employed to ensure the quality standards of the output of more than 70 projects that support Greek cultural heritage organisations and are funded by the European Union structural funds. These projects are expected to produce more than 1.5 million digitized and born-digital items accompanied by detailed metadata. The validation is based on a set of quality and interoperability specifications that have been developed for the purpose. The infrastructure has been developed using an open source technology stack and tools and in particular reuses a number of components of the publicly available Europeana aggregator and portal software platform.
Data transformation and description

Zeri e LODE. Extracting the Zeri photo archive to linked open data: formalizing the conceptual model BIBAKFull-Text 289-298
  Ciro Mattia Gonano; Francesca Tomasi; Francesca Mambelli; Fabio Vitali; Silvio Peroni
This paper presents the first steps of a project to convert the notable Italian "Zeri photo archive" to a linked and open dataset. The full project entails the analysis of the records' description model (Scheda F) in order to define a suitable ontology by exploring existing data models, the creation of the RDF triple store, the creation of links to the cloud, and the definition of the user interface for browsing the linked open dataset. This paper presents and discusses the conceptual modeling of the data stored in the Zeri archival database.
Comic2CEBX: A system for automatic comic content adaptation BIBAKFull-Text 299-308
  Luyuan Li; Yongtao Wang; Liangcai Gao; Zhi Tang; Ching Y. Suen
Comics are popular almost throughout the world. With the help of comic document digitization, it is much easier for people to archive and browse comic works. However, there are still some big challenges along with the progress of comic document digitization. Among these challenges, comic content adaptation is an important one to be tackled. Existing works focus only on parts of this problem and do not provide a tangible solution for displaying comic contents on different devices. In this paper, we solve these problems by proposing Comic2CEBX, a system which can automatically convert a set of scanned comic page images into a CEBX file that allows reflowing of the original comic pages with fixed layouts. Taking raw comic images as inputs, our system first extracts three kinds of low-level visual patterns and then uses multilayer Conditional Random Fields to detect all the panels. Meanwhile, our system automatically identifies the reading orders of the panels within each page. Finally, we encapsulate the comic page images and the obtained page structure information (i.e., the panel detection results and the corresponding reading orders) to generate a CEBX file. Experimental results show that our comic page layout analysis method achieves better performance than the existing ones, and a use case presentation of the CEBX files produced by our system demonstrates that they bring a better comic reading experience, especially on mobile devices.
Explorations in Linked Data practice for early music corpora BIBAKFull-Text 309-312
  Tim Crawford; Ben Fields; David Lewis; Kevin Page
Exploring connections between pieces, people and places and relating them to culture as a whole is a central activity of musicology. As libraries increase the availability of musical information in digital form, the data available for such research also expands, but to take such resources together and combine them with others that are relevant, a further step of alignment and linkage is needed. We describe here the process and tools we applied to two corpora of early modern music: Early Music Online, which comprises catalogue metadata in MarcXML and facsimile images for approximately 8,500 items of early printed music; and the Electronic Corpus of Lute Music, containing over 1,000 pieces with supporting metadata. A supervised process with automated elements assists the musicologist to create a linked and extensible knowledge structure, aligning entities within and between corpora and to external Linked Data. Finally, we reflect upon how we believe these methods integrate with, and indeed form a crucial element of, the transformed process of modern digital scholarship.
Creating lightweight ontologies for dataset description: practical applications in a cross-domain research data management workflow BIBAKFull-Text 313-316
  Joao Aguiar Castro; Joao Rocha da Silva; Cristina Ribeiro
The description of data is a central task in research data management. Describing datasets requires deep knowledge of both the data and the data creation process to ensure adequate capture of their meaning and context. Metadata schemas are usually followed in resource description to enforce comprehensiveness and interoperability, but they can be hard for researchers to understand and adopt. We propose to address data description using ontologies, which can evolve easily, express semantics at different granularity levels and be directly used in system development. Considering that existing ontologies are often hard to use in a cross-domain research data management environment, we present an approach for creating lightweight ontologies to describe research data. We illustrate our process with two ontologies, and then use them as configuration parameters for Dendro, a software platform for research data management currently being developed at the University of Porto.
A preliminary evaluation of HathiTrust metadata: Assessing the sufficiency of legacy records BIBAKFull-Text 317-320
  Katrina Fenlon; Colleen Fallaw; Timothy Cole; Myung-Ja Han
Print-based libraries use metadata (specifically MARC catalog records) for both bibliographic control and to support discovery through online public access catalogs. Depending on its accuracy, completeness, and detail, metadata can afford an aerial view of a collection's topical strengths, scope of coverage, and item-to-item relationships, but the view offered is in part a function of metadata design. Most MARC records were created to support management of large print collections and optimized to meet the requirements of library online public access catalogs. How well do pre-existing MARC records serve the discovery needs of scholars using a large-scale digital library hosting collections of retrospectively digitized books and serials? This paper reports on an ongoing assessment of the utility of the MARC-based metadata underlying the HathiTrust Digital Library and explores the implications for advanced computational access to texts in the HathiTrust. We consider here the utility of metadata to scholars creating worksets for analysis, examining three user scenarios, which were gleaned from an ongoing user-requirements study done for the HathiTrust Research Center: (1) using metadata fields in combination for corpus characterization and discovery; (2) relying on metadata to identify resources of interest; and (3) using bibliographies of known items to seed research worksets. Our goal is to better understand the need for metadata remediation and augmentation and assess the scope of additional work required.
Web archives and memory

Not all mementos are created equal: Measuring the impact of missing resources BIBAKFull-Text 321-330
  Justin F. Brunelle; Mat Kelly; Hany SalahEldeen; Michele C. Weigle; Michael L. Nelson
Web archives do not capture every resource on every page that they attempt to archive. This results in archived pages missing a portion of their embedded resources. These embedded resources have varying historic, utility, and importance values. The proportion of missing embedded resources does not provide an accurate measure of their impact on the Web page; some embedded resources are more important to the utility of a page than others. We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users' perceptions of damage are not accurately estimated by the proportion of missing embedded resources. The proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that provides closer alignment to Web user perception, providing an overall improved agreement with users on memento damage by 17% and an improvement by 51% if the mementos are not similarly damaged. We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from 0.16 in 1998 to 0.13 in 2013). However, we show that a greater number of important embedded resources (2.05 per memento on average) are missing over time.
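The contrast this abstract draws between the plain proportion of missing resources and an importance-aware damage rating can be illustrated with a small sketch. The weights below are invented for illustration and the weighted-share formula is an assumption; the abstract does not specify how the authors' algorithm assigns importance, so this is not their method.

```python
def proportion_missing(resources):
    """Unweighted baseline: share of embedded resources that are missing."""
    if not resources:
        return 0.0
    missing = sum(1 for r in resources if r["missing"])
    return missing / len(resources)


def weighted_damage(resources):
    """Hypothetical rating: share of total importance weight that is missing."""
    total = sum(r["weight"] for r in resources)
    if total == 0:
        return 0.0
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total


# Invented page: only the stylesheet is missing, but it carries most weight.
page = [
    {"url": "style.css", "weight": 5.0, "missing": True},
    {"url": "logo.png", "weight": 1.0, "missing": False},
    {"url": "photo.jpg", "weight": 1.0, "missing": False},
    {"url": "tracker.gif", "weight": 1.0, "missing": False},
]
baseline = proportion_missing(page)
rating = weighted_damage(page)
```

On this invented page the baseline reports a quarter of the resources missing, while the weighted rating is much higher because the one missing resource is assumed to matter most to the page's usability.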
Finding pages on the unarchived Web BIBAKFull-Text 331-340
  Hugo C. Huurdeman; Anat Ben-David; Jaap Kamps; Thaer Samar; Arjen P. de Vries
Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies -- most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.
What triggers human remembering of events? A large-scale analysis of catalysts for collective memory in Wikipedia BIBAKFull-Text 341-350
  Nattiya Kanhabua; Tu Ngoc Nguyen; Claudia Niederee
Going beyond its role as an encyclopedia, Wikipedia becomes a global memory place for high-impact events, such as natural disasters and manmade incidents, thus influencing collective memory, i.e., the way we remember the past. Due to the importance of collective memory for framing the assessment of new situations, our actions and value systems, its open construction and negotiation in Wikipedia is an important new cultural and societal phenomenon. The analysis of this phenomenon does not only promise new insights into collective memory. It is also an important foundation for technology, which more effectively complements the processes of human forgetting and remembering and better enables us to learn from the past. In this paper, we analyse the long-term dynamics of Wikipedia as a global memory place for high-impact events. This complements existing work in analysing the collective memory negotiation and construction process in Wikipedia directly following the event. In more detail, we are interested in catalysts for reviving memories, i.e., in the fuel that keeps memories of past events alive, interrupting the general trend for fast forgetting. For this purpose, we study the triggers of revisiting behavior for a large set of event pages by exploiting page views and time series analysis, and identify the most important catalyst features.
Citation, citation, citation

Towards a stratified learning approach to predict future citation counts BIBAKFull-Text 351-360
  Tanmoy Chakraborty; Suhansanu Kumar; Pawan Goyal; Niloy Ganguly; Animesh Mukherjee
In this paper, we study the problem of predicting the future citation count of a scientific article a given time interval after its publication. To this end, we gather and conduct an exhaustive analysis on a dataset of more than 1.5 million scientific papers in the computer science domain. On analysis of the dataset, we notice that the citation count of the articles over the years follows a diverse set of patterns; on closer inspection we identify six broad categories of citation patterns. This important observation motivates us to adopt a stratified learning approach in the prediction task, whereby we propose a two-stage prediction model -- in the first stage, the model maps a query paper into one of the six categories, and then in the second stage a regression module is run only on the subpopulation corresponding to that category to predict the future citation count of the query paper. Experimental results show that the categorization of this huge dataset during the training phase leads to a remarkable improvement (around 50%) in comparison to the well-known baseline system.
Full-text based context-rich heterogeneous network mining approach for citation recommendation BIBAKFull-Text 361-370
  Xiaozhong Liu; Yingying Yu; Chun Guo; Yizhou Sun; Liangcai Gao
The citation relationship between scientific publications has been successfully used for scholarly bibliometrics, information retrieval and data mining tasks, and citation-based recommendation algorithms are well documented. While previous studies investigated citation relations from various viewpoints, most of them share the same assumption that, if paper1 cites paper2 (or author1 cites author2), they are connected, regardless of citation importance, sentiment, reason, topic, or motivation. However, this assumption is oversimplified. In this study, we employ an innovative "context-rich heterogeneous network" approach, which paves a new way for the citation recommendation task. In the network, we characterize 1) the importance of citation relationships between citing and cited papers, and 2) the topical citation motivation. Unlike earlier studies, the citation information, in this paper, is characterized by citation textual contexts extracted from the full-text citing paper. We also propose an algorithm to cope with the situation in which a large portion of full-text information is missing from the bibliographic repository. Evaluation results show that the context-rich heterogeneous network can significantly enhance citation recommendation performance.
RefSeer: A citation recommendation system BIBAKFull-Text 371-374
  Wenyi Huang; Zhaohui Wu; Prasenjit Mitra; C. Lee Giles
Citations are important in academic dissemination. To help researchers check the completeness of citations while authoring a paper, we introduce a citation recommendation system called RefSeer. Researchers can use it to find related works to cite while authoring papers. It can also be used by reviewers to check the completeness of a paper's references. RefSeer presents both topic-based global recommendation and citation-context-based local recommendation. By evaluating the quality of recommendation, we show that such a recommendation system can recommend citations with good precision and recall. We also show that our recommendation system is very efficient and scalable.
Do altmetrics follow the crowd or does the crowd follow altmetrics? BIBAKFull-Text 375-378
  Hamed Alhoori; Richard Furuta
Changes are occurring in scholarly communication as scientific discourse and research activities spread across various social media platforms. In this paper, we study altmetrics on the article and journal levels, investigating whether the online attention received by research articles is related to scholarly impact or may be due to other factors. We define a new metric, Journal Social Impact (JSI), based on eleven data sources: CiteULike, Mendeley, F1000, blogs, Twitter, Facebook, mainstream news outlets, Google Plus, Pinterest, Reddit, and sites running Stack Exchange (Q&A). We compare JSI against diverse citation-based metrics, and find that JSI significantly correlates with a number of them. These findings indicate that online attention to scholarly articles is related to traditional journal rankings and favors journals with a longer history of scholarly impact. We also find that journal-level altmetrics have strong significant correlations among themselves, compared with the weak correlations among article-level altmetrics. Another finding is that Mendeley and Twitter have the highest usage and coverage of scholarly activities. Among individual altmetrics, we find that the readership of academic social networks has the highest correlations with citation-based metrics. Our findings deepen the overall understanding of altmetrics and can assist in validating them.
Education and collaboration

Towards automatic identification of core concepts in educational resources BIBAKFull-Text 379-388
  Md Arafat Sultan; Steven Bethard; Tamara Sumner
Automatically identifying and extracting key ideas and concepts from educational resources is an important but challenging computational task. We present a supervised machine learning approach to assessing the "coreness" of concepts expressed by resource sentences. The algorithm has been developed and evaluated in the domain of science education where coreness refers to the degree to which a sentence embodies key concepts important to developing a robust understanding of the domain. Our method operates by automatically computing and leveraging the degree of semantic similarity between resource sentences and standard domain concepts designed by human experts for various STEM domains. In our experiments, the algorithm demonstrates high accuracy in identifying sentence coreness when there is agreement between human experts on the coreness rating. We also present performance comparisons with a number of baseline systems.
Using affective embodied agents in information literacy education BIBAKFull-Text 389-398
  Yan Ru Guo; Dion Hoe-Lian Goh; Brendan Luyt
This study aims to evaluate the impact of affective embodied agents (EAs) on students' learning performance in an online tutorial that teaches academic information seeking skills. A hundred and twenty tertiary students from two major universities participated in the between-subjects experiment. The results suggested that the use of affective EAs significantly increased students' learning motivation and enjoyment, compared to neutral-EAs or text-only conditions. However, there were no significant differences in knowledge retention between the three groups. This study paves the way for a better understanding of embedding affective EAs in online information literacy (IL) education. Furthermore, the improvement in students' learning motivation and enjoyment can serve as a basis for future research in this context.
Bend me, shape me: A practical experience of repurposing research data BIBAKFull-Text 399-402
  Dana McKay
This paper presents a practical experience of using a large, publicly available dataset for a purpose for which it was not originally collected. The process is examined from discovery to analysis with reference to the vaunted but seldom seen ideal of data digital libraries.
Research networks in data repositories BIBAKFull-Text 403-406
  Mark R. Costa; Jian Qin; Jun Wang
This paper reports our ongoing work investigating the structural features of scientific collaboration based on metadata collected from a scientific data repository (SDR). The background literature is reviewed to support our claim that metadata collected from SDRs offer a complementary data source to traditional publication metadata collected from digital libraries. Methodological considerations are discussed in association with using metadata from SDRs, including author name disambiguation and data parsing. Initial findings show that the network has some unique macro-level structural features while also agreeing with existing network theories. Challenges due to inconsistent metadata quality control procedures are also discussed in an attempt to reinforce claims that metadata should be designed to support both domain-specific retrieval and evaluation and assessment needs.
TagTick: A tool for annotation tagging over solr indexes BIBAKFull-Text 407-408
  Michele Artini; Claudio Atzori; Alessia Bardi; Sandro La Bruzzo; Paolo Manghi
"Annotation tagging" is an important curation action performed by authorized data curators who wish to classify, according to a common vocabulary, an Information Space of potentially heterogeneous objects (e.g. not sharing common classification schemes). To carry out their activities, data curators need annotation tagging tools which allow them to bulk tag or untag large sets of objects in temporary work sessions, where they can experiment in real time with the effect of their actions before making the changes visible to end-users. Real-time temporary bulk tagging is a non-trivial feature to implement, which strictly depends on the back-end used to index the Information Space. This demo presents TagTick, a tool which offers data curators a fully functional annotation tagging environment over the full-text index Apache Solr, considered a "de facto standard" in the field.
Keeping your aggregative infrastructure under control BIBAKFull-Text 409-410
  Michele Artini; Claudio Atzori; Paolo Manghi
"Aggregative Data Infrastructures" (ADIs) are systems devised to collect metadata descriptions (and files) from several data sources to construct uniform Information Spaces, hence providing cross-data source access via standard APIs or custom portals. ADIs typically deal with data collection workflows from arbitrary numbers of data sources, with heterogeneous access protocols, data exchange formats, and data models. Besides, they handle data processing workflows for the harmonization and enrichment of aggregated metadata. Correct workflow management is crucial to ensure Information Space consistency, but is in general hard to sustain. This demo will present the solution offered in the context of the OpenAIRE infrastructure, which today collects metadata and files from more than 450 data sources (and growing) of several typologies. The D-NET Workflow Management Suite user interfaces support data curators in orchestrating, over time and in a sustainable way, the configuration, execution, and monitoring of data collection and processing workflows for thousands of data sources.
Epimenides: An information system offering automated reasoning for the needs of digital preservation BIBAKFull-Text 411-412
  Yannis Kargakis; Yannis Tzitzikas
Epimenides is a system that can be used in the context of digital archives and digital libraries to help archivists check whether the archived digital artifacts remain intelligible and functional, and identify the consequences of probable losses. A distinctive feature of Epimenides is that it can also model converters and emulators, and the adopted modelling approach enables the automatic reasoning needed for reducing the human effort required for checking whether a task can be performed over a digital object (or digital collection in general).
Extraction of evolution descriptions from the web BIBAKFull-Text 413-414
  Helge Holzmann; Thomas Risse
The evolution of named entities affects exploration and retrieval tasks in digital libraries. An information retrieval system that is aware of name changes can actively support users in finding former occurrences of evolved entities. However, current structured knowledge bases, such as DBpedia or Freebase, do not provide enough information about evolutions, even though the data is available on their resources, like Wikipedia. Our Evolution Base prototype will demonstrate how excerpts describing name evolutions can be identified on these websites with a promising precision. The descriptions are classified by means of models that we trained based on a recent analysis of named entity evolutions on Wikipedia.
Rexplore: Unveiling the dynamics of scholarly data BIBAKFull-Text 415-416
  Francesco Osborne; Enrico Motta
Rexplore is a novel system that integrates semantic technologies, data mining techniques, and visual analytics to provide an innovative environment for making sense of scholarly data. Its functionalities include: i) a variety of views to make sense of important trends in research; ii) a novel semantic approach for characterising research topics; iii) a very fine-grained expert search with detailed multi-dimensional parameters; iv) an innovative graph view to relate a variety of academic entities; v) the ability to detect and explore the main communities within a research topic; vi) the ability to analyse research performance at different levels of abstraction, including individual researchers, organizations, countries, and research communities.
Explore the stacks: A system for exploration in large digital libraries BIBAKFull-Text 417-418
  Mark M. Hall
Providing access to large digital library collections to novice users requires novel interfaces that are not built around the concept of search, as novice users frequently struggle to formulate appropriate queries. This paper presents the "Explore the Stacks" system, which provides a novel, browsing-focused interface for exploring digital library collections that is applicable to Big Data scale digital libraries. The system is demonstrated using a collection of approximately one million book illustrations provided by the British Library.
Great War stories told by the people -- Crowdsourced cultural heritage in digital museums BIBAKFull-Text 419-420
  Ingo Frommholz; David Graves; Haiming Liu; Ashwin Kumar; Gordon Brady
The increasing interest in the centenary of the Great War 1914-1918 motivates the development of a digital library to capture and access valuable cultural heritage artefacts that would otherwise be lost. We will present a prototype to make available the story of the First World War in the local context of a British town, as told by the people today. The core of our prototype is crowdsourced ingest. To this end we apply the latest insights from information interaction and access to foster user engagement. Open standards like CIDOC/CRM facilitate the external provision of our data and the integration of external resources. In the demo we will present our current Great War Stories prototype and how researchers from the humanities as well as digital libraries researchers will be able to benefit from and contribute to the project.
When catalogs collide: A mashup of the bibliographic records from New Zealand's National Bibliography and the HathiTrust BIBAKFull-Text 421-422
  Steffan Safey; David Bainbridge
In this article we present work done developing an interactive comparison tool for large-scale catalogs using the general purpose open source digital library toolkit, Greenstone. The two catalogs selected to demonstrate the approach were the Bibliographic Records from New Zealand's National Bibliography and the HathiTrust. With Greenstone's triple-store extension activated, the two collections were ingested to form two Greenstone collections. Next, an interactive visualization tool was developed within the digital library's presentation layer to allow users to explore the two collections, comparing fields from the two collections and producing a variety of visualizations. The required interactivity was accomplished using AJAX calls to the Greenstone triple-store, further supported by the use of Javascript libraries for the presentation of the retrieved data in both visual and spreadsheet forms.
LODE: Linking digital humanities content to the web of data BIBAKFull-Text 423-424
  Timo Sztyler; Jakob Huber; Jan Noessner; Jaimie Murdock; Colin Allen; Mathias Niepert
Numerous digital library projects maintain their data collections in the form of text, images, and metadata. While data may be stored in many formats, from plain text to XML to relational databases, the use of the resource description framework (RDF) as a standardized representation has gained considerable traction during the last five years. Almost every digital humanities meeting has at least one session concerned with the topic of digital humanities, RDF, and linked data, including JCDL. While most existing work in linked data has focused on improving algorithms for entity matching, the aim of our Linked Open Data Enhancer (Lode) is to work "out of the box", enabling their use by humanities scholars, computer scientists, librarians, and information scientists alike. With Lode we enable non-technical users to enrich a local RDF repository with high-quality data from the Linked Open Data cloud. Lode links and enhances the local RDF repository without reducing the quality of the data. In particular, we support the user in the enhancement and linking process by providing intuitive user interfaces and by suggesting high-quality linking candidates using state-of-the-art matching algorithms. We hope that the Lode framework will be useful to digital humanities scholars, complementing other digital humanities tools.
The SCAPE preservation lifecycle BIBAKFull-Text 425-426
  Kresimir Duretec; Artur Kulmukhametov; Michael Kraxner; Markus Plangg; Christoph Becker; Luis Faria
Continuous activities such as preservation monitoring, planning and operations, including the provisioning of access mechanisms or the creation of derivatives through migration, are needed to enable continuous access to content across evolving technological contexts without affecting the authenticity of digital objects. This article describes the SCAPE preservation suite, a loosely coupled set of systems and open APIs that facilitate scalable content profiling, monitoring, planning and workflow execution.
SNAC: The Social Networks and Archival Context project -- Towards an archival authority cooperative BIBAKFull-Text 427-428
  Ray R. Larson; Daniel Pitti; Adrian Turner
Social Networks and Archival Context (SNAC) is a multi-year research and demonstration project that aims to address the longstanding research challenge of discovering, locating, and using distributed historical resources. It also seeks to redefine traditional online access points for those resources, by exposing information about the people, families, and organizations who created them in addition to their socio-historical contexts. Finally, SNAC endeavors to set the stage for a cooperative program for maintaining names of creators of archival materials, via the Encoded Archival Context -- Corporate Bodies, Persons, and Families (EAC-CPF) standard. This demonstration will show the prototype access and search systems for the second phase of SNAC, incorporating over 2 million records derived from Encoded Archival Descriptions (EAD), MARC Archival Records and EAC-CPF records from over 40 repositories and consortia including the Library of Congress, ArchivesHub, Archives nationales, the Bibliothèque nationale de France (BnF), and OCLC WorldCat.
Visualized Related Topics (VRT) system for health information retrieval BIBAKFull-Text 429-430
  Sukjin You; Joel DesArmo; Xiangming Mu; Sukwon Lee; Jessica C. Neal
To help bridge the gap between consumer users' vocabulary and the controlled vocabulary used to index health information, in this demo we implemented a Visualized Related Topics (VRT) browser system. The VRT was integrated into the "MeshMed" [2] system to support health information retrieval. The key technology behind the VRT browser is to select MeSH terms, which represent the related topics or subjects, from the top relevant documents. We rank these MeSH terms using the traditional Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. The VRT browser displays a graphic representation of these MeSH terms by creating a visual where the selected MeSH terms stem from the centered user query. The design goal is to provide users with an overview of the key topics of the search results. In addition, the VRT browser may also help users form better queries. Using the VRT browser, we will study how to effectively assist consumer users with their health information seeking.
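The TF-IDF ranking step mentioned in the abstract above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the VRT implementation: the function name `rank_terms_tfidf`, the input layout, and the toy counts are all hypothetical, and the plain `tf * log(N / df)` weighting is one common TF-IDF variant chosen here for simplicity.

```python
import math
from collections import Counter


def rank_terms_tfidf(top_docs, doc_freq, n_docs):
    """Rank index terms from the top-ranked documents by a simple TF-IDF score.

    top_docs: list of term lists, one per top-ranked document (hypothetical layout)
    doc_freq: dict mapping term -> number of collection documents indexed with it
    n_docs:   total number of documents in the collection
    """
    # Term frequency: how often each term occurs across the top documents.
    tf = Counter(term for doc in top_docs for term in doc)
    # Inverse document frequency down-weights terms that are common collection-wide.
    scores = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Toy data, invented purely for illustration.
top_docs = [["Asthma", "Child"], ["Asthma", "Inhalers"], ["Asthma", "Child"]]
doc_freq = {"Asthma": 50, "Child": 500, "Inhalers": 10}
ranked = rank_terms_tfidf(top_docs, doc_freq, n_docs=1000)
print(ranked)
```

In this toy case the rare term "Inhalers" outranks the frequent but generic term "Child", which is the kind of behaviour a related-topics display would want from the weighting.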
Modeling abstractions for dance digital libraries BIBAKFull-Text 431-432
  Katerina El Raheb; Yannis Ioannidis
The description of the human body and its movement is a fundamental and critical part of the content of a Dance Digital Library. It must be captured in an organized way, both for allowing user interaction (browse, search) and computational analysis (similarity comparison) of dances, as well as for exploring meaningful ways to present content to users. In this paper, we present a comprehensive modeling abstraction for such digital libraries, which consists of a multi-layered model that covers different levels for describing dance movement. We address the semantic challenge of organizing knowledge of dance by starting from defining a dance piece or work, going to the characterization of its structural movement components and their related concepts and standard detailed movement description and notation, i.e., Labanotation. In addition, we take into account the existing choreological research, as well as related work in other domains, such as music information systems and standards, i.e., IEEE 1599, and generic cultural heritage models, i.e., FRBRoo. These modeling abstractions have been devised in the context of a more general on-going effort to develop a Dance Digital Library System and will be instrumental in some critical functionality, i.e., searching by movement concepts and characteristics in a meaningful way for a wide range of users, and linking different manifestations of movement recordings, descriptions, prescriptions or representations.
Reading from paper versus reading from a touch-based tablet device in proofreading BIBAKFull-Text 433-434
  Hirohito Shibata; Kentaro Takano
This paper describes an experiment to evaluate the impact of the use of a touch-based digital reading device in active reading. We compared the performance of proofreading when using paper and when using a touch-based tablet device. Results showed that participants detected more errors when reading from paper than when reading from the tablet device. When reading from paper, participants frequently interacted with the text, for example by pointing to words or sliding their fingers or pens along sentences. This suggests that interaction with text plays an important role in proofreading tasks.
Vector-Borne Disease Network digital library BIBAKFull-Text 435-436
  Michelle Barker; Donald Brower; Natalie Meyers
The Vector-Borne Disease Network (VecNet) digital library provides part of a common analytical framework to assemble data on malaria transmission and make it accessible for the purposes of computational modeling. This poster-paper reports on VecNet digital library development, key decisions related to metadata standards, design, the central role of metadata and authority files in its architecture, and future directions of this Hydra/Fedora based repository solution.
CKGHV: a comprehensive knowledge graph for history visualization BIBAKFull-Text 437-438
  Yingzhen Zhu; Xinyi Cao; Yali Bian; Jiangqin Wu
How to help users learn history efficiently is a problem. To solve it, we propose CKGHV (Comprehensive Knowledge Graph for History Visualization). This paper focuses on analyzing character relationships of the Three Kingdoms and proposes a visualization called the overview map. A word cloud and a radiogram present information on battles and character relationships.
The value of risk management for data management in science and engineering BIBAKFull-Text 439-440
  Filipe Ferreira; Ricardo Vieira; Jose Borbinha
An established concept to address data management challenges in science and engineering is the Data Management Plan. However, we claim that in some complex scenarios the current principles for Data Management Plans might not be enough, especially when Risk Management turns out to be relevant. Therefore, we propose a method, based on ISO 31000, for science and engineering projects to create a Risk Management Plan that can complement the Data Management Plan. The validation of this proposal is presented in the real case of an engineering laboratory.
Utilizing digital humanities methods for quantifying Howell's State Trials BIBAKFull-Text 441-442
  Tracy Bergstrom; Donald Brower; Natalie Meyers
In this paper we describe the undertaking of a quantitative, historically oriented analysis of the law of England between 1650 and 1700 as represented in Howell's State Trials. Our goal was to analyze cases over time to support investigation into whether a quantitative analysis of the content of the 1650-1700 State Trials would exhibit an upward trend of religious tolerance.
On Cloud deployment of digital preservation environments BIBAKFull-Text 443-444
  Daniel Pop; Marian Neagul; Dana Petcu
Although migrating library applications to a Cloud environment is not an easy task, many libraries are interested in using Cloud infrastructure services broadly across their businesses, whether in a Public, Private or Hybrid Cloud. One of the migration expectations is the scalability of digital preservation architectures in Cloud environments. In this paper we address the scalability and portability of storage and compute platforms, which combine storage of large datasets and their processing. Concretely, we propose a toolkit, developed using the Puppet configuration management system, that facilitates the deployment of complex digital preservation platforms over heterogeneous Cloud environments, and we present, as a use case, its integration with the SCAPE platform.
Microscopic analysis of document handling while reading: Classification of behavior toward paper document BIBAKFull-Text 445-446
  Kentaro Takano; Hirohito Shibata; Junko Ichino; Tomonori Hashiyama; Shun'ichi Tano
We conducted a microscopic analysis of work-related reading to find ways to support reading in the workplace. We obtained empirical data from video recording, concurrent verbal reporting, and retrospective reporting of 18 participants in 10 target types of reading using paper. Using these data, we categorized the ways people interact with paper while reading in detail. We will discuss what kinds of support are required for work-related reading.
Amplifying scientific paper's abstract by leveraging data-weighted reconstruction BIBAKFull-Text 447-448
  Shansong Yang; Weiming Lu; Baogang Wei; Wenjia An
This paper considers the problem of amplifying a scientific paper's abstract by using its citation sentences. While a scientific paper's abstract is concise and subjective, its citation sentences, which can be regarded as one kind of comment, are redundant and objective. A summary combining the merits of these two resources is helpful to researchers. A data-weighted reconstruction approach is proposed to generate this summary, and each sentence's weight is learned through a ranking algorithm over a hypergraph of the bibliographic network. The experimental results show that the proposed approach performs significantly better than several document summarization techniques.
Data mapping framework in a digital library with computational epidemiology datasets BIBAKFull-Text 449-450
  S. M. Shamimul Hasan; Sandeep Gupta; Edward A. Fox; Keith Bisset; Madhav V. Marathe
Computational epidemiology employs computer models and informatics tools to reason about the spatio-temporal spread of diseases. The diversity of models, data sources, data representations, and modalities that are collected, used, and modified motivates the development of a digital library (DL) framework to support computational epidemiology. The heterogeneous content includes metadata, text, tables, spreadsheets, experimental descriptions, and large result files. There is no accepted framework that allows unified access to such content. We propose a framework for a digital library system tailored to such datasets to support computational network epidemiology.
Extraction and analysis of referenced web links in large-scale scholarly articles BIBAKFull-Text 451-452
  Ke Zhou; Richard Tobin; Claire Grover
In this paper we report on a sub-task undertaken as part of Hiberlink, a project which is examining the phenomenon of reference rot within scholarly works. In our sub-task we aim to quantify and understand the nature of occurrence of links to web resources referenced from papers in very large-scale scholarly collections. We first introduce the challenges involved in extracting links from scholarly articles and develop and evaluate the accuracy of a set of link extraction systems. Secondly, five collections containing millions of scholarly articles with different characteristics (across different disciplines, time periods and publication types) are studied and we demonstrate that web resources are widely cited in scholarly publications and should be an important concern for digital preservation.
What is this song about anyway?: Automatic classification of subject using user interpretations and lyrics BIBAKFull-Text 453-454
  Kahyun Choi; Jin Ha Lee; J. Stephen Downie
Metadata research for music digital libraries has traditionally focused on genre. Despite its potential for improving the ability of users to better search and browse music collections, music subject metadata is an unexplored area. The objective of this study is to expand the scope of music metadata research, in particular, by exploring music subject classification based on user interpretations of music. Furthermore, we compare this previously unexplored form of user data to lyrics at subject prediction tasks. In our experiment, we use datasets consisting of 900 songs annotated with user interpretations. To determine the significance of performance differences between the two sources, we applied Friedman's ANOVA test on the classification accuracies. The results show that user-generated interpretations are significantly more useful than lyrics as classification features (p < 0.05). The findings support the possibility of exploiting various existing sources for subject metadata enrichment in music digital libraries.
REEL: A Relation Extraction Learning framework BIBAKFull-Text 455-456
  Pablo Barrio; Goncalo Simoes; Helena Galhardas; Luis Gravano
We introduce the REEL (RElation Extraction Learning) framework, an open source framework that facilitates the development and evaluation of relation extraction systems over text collections. To define a relation extraction system for a new relation and text collection, users only need to specify the parsers to load the collection, the relation and its constraints, and the learning and extraction techniques to be used. This makes REEL a powerful framework to enable the deployment and evaluation of relation extraction systems for both application building and research.
Linking the Thesaurus for the Social Sciences to the Web of Linked Data BIBAKFull-Text 457-458
  Andias Wira Alam; Andreas Oskar Kempf; Benjamin Zapilko
In this paper, we apply different methods for linking subject headings of the Thesaurus for the Social Sciences (TheSoz) to DBpedia, the nucleus of the Web of Linked Data which is derived from the structured information of Wikipedia. Our method utilizes the backlinks and outlinks within Wikipedia for link detection. We examine to what extent the linking process can be optimized with the help of a network-based similarity measure, in order to achieve a higher precision and recall. We test two baseline methods, string alignment and language property matching and compare them to our own method. Our method outperforms the F-scores of the baselines by 10 percentage points.
A context model for digital preservation of processes and its application to a digital library system BIBAKFull-Text 459-460
  Rudolf Mayer; Andreas Rauber; Goncalo Antunes
Digital preservation is an important aspect to ensure authenticity, traceability and auditing in processes. Digital Library Systems are one example where data transformation processes are executed upon collections of data, and where such preservation of processes is an important aspect for the trustworthiness of the repository. We thus present a model for the semantic description of processes, and apply it to a Digital Library System.
The Organization information integration in the management of a Digital Library System BIBAKFull-Text 461-462
  Angela Di Iorio; Marco Schaerf
The Sapienza Digital Library collects digital resources from the University's different Organizations, representing the multidisciplinary community of Sapienza University. The poster presents the pre-ingestion process for creating and aggregating digital resources under the Organizational Collection conceptualization. The pre-ingestion building process makes it possible to automatically provide information about the resources' custody from their origination until their creation as OAIS Submission Information Packages. Any system able to provide archival, preservation or dissemination services could potentially use it while maintaining provenance information.
Life span of web pages: A survey of 10 million pages collected in 2001 BIBAKFull-Text 463-464
  Teru Agata; Yosuke Miyata; Emi Ishita; Atsushi Ikeuchi; Shuichi Ueda
This paper highlights the results of a survival survey and life span study of 10 million web pages, mainly in Japanese, that were collected for NTCIR-3 (web task) in 2001. To calculate web page life span, metadata was collected from Internet Archive's Wayback Machine via Memento. The life span study showed that the average life span of a web page is 1,132.1 days.
PageRank-based Word Sense Induction within Web Search Results Clustering BIBAKFull-Text 465-466
  Jose G. Moreno; Gael Dias
Word Sense Induction is an open problem in Natural Language Processing. Many recent works have been addressing this problem with a wide spectrum of strategies based on content analysis. In this paper, we present a sense induction strategy exclusively based on link analysis over the Web. In particular, we explore the idea that the main different senses of a given word share similar linking properties and can be found by performing clustering with link-based similarity metrics. The evaluation results show that PageRank-based sense induction achieves interesting results when compared to state-of-the-art content-based algorithms in the context of Web Search Results Clustering.
Enabling multilingual information access to digital collections: An investigation of metadata records translation BIBAKFull-Text 467-468
  Jiangping Chen; Olajumoke Azogu; Ryan Knudson
We conducted a research project exploring machine translation performance on digital metadata records. This short paper reports the background, research purposes, research design, experiments, and evaluation results.
Mink: Integrating the live and archived web viewing experience using web browsers and memento BIBAKFull-Text 469-470
  Mat Kelly; Michael L. Nelson; Michele C. Weigle
We describe Mink, a new web browser extension that provides a different model for integration of the live and archived web. While a user browses the live web, Mink actively queries the archives and reports other instances of the page in the archives without requiring active querying by the user. Further, by querying the archives dynamically and asynchronously, a user can view the extent to which the currently viewed page on the live web has been archived and proactively submit a request to various archives using an overlay on the live web page and a simple interface.
Cross-cultural mood regression for music digital libraries BIBAKFull-Text 471-472
  Xiao Hu; Yi-Hsuan Yang
Mood is a popular access point in music digital libraries and online music repositories, and is often represented as numerical values in a small number of emotion-related dimensions (e.g., valence and arousal). As music mood is recognized as culturally dependent, this study investigates whether regression models built with music data in one culture can be applied to music in another culture. Results indicate that cross-cultural predictions of both valence and arousal values are feasible.
Establishing an online access panel for interactive information retrieval research BIBAKFull-Text 473-474
  Dagmar Kern; Peter Mutschke; Philipp Mayr
We propose an online access panel to support the evaluation process of Interactive Information Retrieval (IIR) systems -- called IIRpanel. By maintaining an online access panel with users of IIR systems, we assume that the recurring effort to recruit participants for web-based as well as for lab studies can be minimized. We aim to use the online access panel not only for our own development processes but also to open it to other interested researchers in the field of IIR. In this paper we present the concept of IIRpanel as well as first implementation details.
Mood metadata for video games and interactive media BIBAKFull-Text 475-476
  Stephanie Rossi; Jin Ha Lee; Rachel Ivy Clarke
Video games are becoming an important part of digital library collections due to increasing popularity and the acknowledgement of their significance as cultural artifacts. In order to support robust search and browse functions, it is imperative to develop a metadata schema to effectively represent this medium. The potential of mood metadata in the domain of video game classification is little explored, despite the value given to it by gamers in user studies. Here, we present a Controlled Vocabulary (CV) for moods related to video games with 17 defined mood terms, equivalent terms, and game examples. This CV will enable catalogers to organize video games by mood, allowing mood to be used for search and collocation. In order to evaluate the applicability of this CV and discover which terms are most relevant for video games, we annotated the mood of a sample collection of 617 video game titles. In this poster, we discuss the issues and challenges we encountered in the creation and evaluation of the current CV and our future research goals.
Balancing factors affecting Virtual Reference Services: Identified from academic Librarians' perspective BIBAKFull-Text 477-478
  Sukjin You; Joel DesArmo; Xiangming Mu; Alexandra Dimitroff
Many digital libraries are providing Virtual Reference Services (VRS). There could be various approaches to increase the quality of VRS. In this study, we focused on two key factors: improving helpfulness and reducing users' feeling of intrusiveness. Studies indicated that librarian-initiated attempts for help may increase users' feeling of intrusiveness [2] [3]. It is challenging to provide high helpfulness along with less intrusiveness in VRS. This study aimed to identify factors that contribute to improving helpfulness and reducing intrusiveness. Data were collected through a survey using a systematic random sampling approach. Our initial results indicated that awareness, timing, and transparency were key factors affecting helpfulness and intrusiveness.
Articles, papers, chapters, theses -- who wins the visibility wars? BIBAKFull-Text 479-480
  Melius Weideman
Researchers need access to previous research to base their own work on. Some of the most commonly referenced materials are published in the form of journal articles, conference papers, books and book chapters, and research theses. The purpose of this research was to determine how these four categories of documents compare in terms of visibility to search engine crawlers. A questionnaire was used to gather data from international scholars on their completed research. Three types of queries were generated and over 3000 websites were inspected to determine the visibility of these outputs. Search engine result pages were inspected, and the rankings of the research documents were recorded and converted to a scoring system. The results have indicated that the four types of outputs enjoy varying degrees of exposure to search engines, with journal articles leading the way, and books/book chapters having the smallest degree of exposure to search engines. Some query types also produced better results than others. It was concluded that journal articles provide the best way to expose research work to Internet searchers through search engines.
Exploring relationships among video games BIBAKFull-Text 481-482
  Rachel Ivy Clarke; Jin Ha Lee; Jacob Jett; Simone Sacchi
This poster explores relationships among video games in an attempt to better understand the domain of video games and interactive media as well as improve user access to games. Video games are related in complex ways that cannot be adequately represented by contemporary conceptual models like Functional Requirements for Bibliographic Records (FRBR). Relationships between game editions, series, distribution methods and additional game content all pose challenges for those seeking to describe video games in a user-centered way.
Where should I publish? Detecting journal similarity based on what have been published there BIBAKFull-Text 483-484
  Rob Koopman; Shenghui Wang
Finding similar journals among a large number of existing ones is not a trivial task. Based on the hypothesis that similar journals publish similar articles, we propose in this paper a scalable method based on random projection to calculate the similarities between 35K journals based on 67 million articles published in them. We evaluate our results against Dewey Decimal codes and analyse the networks of similar journals.
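The random-projection idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how journals are represented, so the sparse term-count vectors, the Gaussian projection matrix, and every function name below are assumptions made for illustration only.

```python
import math
import random


def random_projection_matrix(n_features, n_dims, seed=0):
    """Build a dense Gaussian projection matrix (n_features x n_dims).

    Entries are drawn from N(0, 1/n_dims), a common choice that roughly
    preserves angles between vectors after projection.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_dims)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_dims)]
            for _ in range(n_features)]


def project(sparse_vector, matrix):
    """Project a sparse vector {feature_index: weight} to n_dims dimensions."""
    n_dims = len(matrix[0])
    out = [0.0] * n_dims
    for index, weight in sparse_vector.items():
        row = matrix[index]
        for d in range(n_dims):
            out[d] += weight * row[d]
    return out


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: three "journals" as term counts over a
    # 1000-term vocabulary. A and B share vocabulary; C does not.
    matrix = random_projection_matrix(n_features=1000, n_dims=256, seed=1)
    journal_a = project({0: 5, 1: 3, 2: 2}, matrix)
    journal_b = project({0: 4, 1: 3, 2: 1, 3: 1}, matrix)
    journal_c = project({500: 5, 501: 3, 502: 2}, matrix)
    print("A vs B:", cosine(journal_a, journal_b))
    print("A vs C:", cosine(journal_a, journal_c))
```

In this sketch each journal is reduced to a short dense vector, so pairwise comparison costs depend on the projected dimension rather than on vocabulary size, which is the usual reason random projection is considered scalable.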
A quantitative comparison on file folder structures of two groups of information workers BIBAKFull-Text 485-486
  Hong Zhang; Xiao Hu
This study compares file folder structures on personal computers of two groups of information workers, administrative staff and PhD students. A set of quantitative measures is calculated that discloses the differences and similarities between the folder structures of the two user groups. The results show that the group conducting more administrative activities has broader and shallower folders than the PhD group, which performs more research activities, and the folders of the PhD group are more populated over deeper levels of the trees than those of the administrative group. The study improves our understanding of the various quantitative measures in investigating personal computer folder structures, and furthermore contributes to our knowledge of the information organization structure in personal information systems.
A computational approach to understanding and predicting the behavior of educators using an online curriculum planning tool BIBAKFull-Text 487-488
  Ogheneovo Dibie; Keith Maull; Tamara Sumner
This paper presents a computational approach to understanding and predicting the behavior of Earth Science educators using an online curriculum planning tool incorporating digital library resources. It expands on prior work on understanding educators' adoption and use of digital library resources [2] by introducing a methodology for characterizing user behaviors and understanding the trends and frequent patterns of use that are observable from these behaviors.
An approach to named entity extraction from historical documents in traditional Mongolian script BIBAKFull-Text 489-490
  Biligsaikhan Batjargal; Garmaabazar Khaltarkhuu; Fuminori Kimura; Akira Maeda
In this poster, we propose an information extraction method for digitized ancient Mongolian documents by utilizing an ancient-modern dictionary. Named entities such as historical figures and place names will be extracted by employing text mining techniques that aim to reduce the labor-intensive annotation of historical text. The Text Encoding Initiative (TEI) guidelines will be applied to digital text representations that encode the historical figures and place names along with their interpretations and commentaries.
Design of Europeana Cloud technical infrastructure BIBAKFull-Text 491-492
  Pavel Kats; Marcin Mielnicki; Petr Knoth; Markus Muhr; Georgios Mamakis; Marcin Werla
In this paper, we present an overview of the Europeana Cloud system, a new undertaking of the Europeana Foundation and partner institutions that aims to provide a shared, cloud-based infrastructure for the aggregation and exchange of cultural heritage metadata and content for European institutions.
Building a dataset of sensitive information BIBAKFull-Text 493-494
  Clare Llewellyn; Laine Ruus; Ros Burnett; Steve Kirkwood; Mark Smith; Rocio von-Jungenfeld
Using text analysis tools to study large data sets is currently an area of popular interest. Prompted by the success of several big data research initiatives, researchers from a variety of disciplines wish to gather and analyse textual data [7]. Communication between members of diverse teams can present a problem, and developing a shared language and understanding of the task is necessary [6].
