
JCDL'13: Proceedings of the 2013 ACM/IEEE-CS Joint Conference on Digital Libraries

Fullname: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries
Editors: J. Stephen Downie; Robert H. McDonald; Timothy W. Cole; Robert Sanderson; Frank Shipman
Location: Indianapolis, Indiana
Dates: 2013-Jul-22 to 2013-Jul-26
Standard No: ISBN 978-1-4503-2077-1; hcibib: DL13
  1. Web 2.0
  2. Preservation I
  3. Education
  4. Information ranking
  5. Evaluation
  6. Information clustering
  7. Specialists DLs
  8. Name extraction
  9. Metadata
  10. Web replication
  11. Data
  12. Historical DLs
  13. Preservation II
  14. Posters
  15. Demonstrations

Web 2.0

Identification of useful user comments in social media: a case study on Flickr Commons (pp. 1-10)
  Elaheh Momeni; Ke Tao; Bernhard Haslhofer; Geert-Jan Houben
Cultural institutions are increasingly opening up their repositories and contributing digital objects to social media platforms such as Flickr. In return they often receive user comments containing information that could be incorporated into their catalog records. Since judging the usefulness of a large number of user comments is a labor-intensive task, our aim is to provide automated support for filtering potentially useful social media comments on digital objects. In this paper, we discuss the notion of usefulness in the context of social media comments and compare it from end-users' as well as expert users' perspectives. Then we present a machine-learning approach to automatically classify comments according to their usefulness. Our approach makes use of syntactic and semantic comment features and also considers user context. We present the results of an experiment on user comments received in six different Flickr Commons collections. They show that a few relatively straightforward features can be used to infer useful comments with up to 89% accuracy.
WikiMirs: a mathematical information retrieval system for Wikipedia (pp. 11-20)
  Xuan Hu; Liangcai Gao; Xiaoyan Lin; Zhi Tang; Xiaofan Lin; Josef B. Baker
Mathematical formulae in structural formats such as MathML and LaTeX are becoming increasingly available. Moreover, repositories and websites, including ArXiv and Wikipedia, and growing numbers of digital libraries use these structural formats to present mathematical formulae. This presents an important new and challenging area of research, namely Mathematical Information Retrieval (MIR). In this paper, we propose WikiMirs, a tool to facilitate mathematical formula retrieval in Wikipedia. WikiMirs is aimed at searching for similar mathematical formulae based upon both textual and spatial similarities, using a new indexing and matching model developed for layout structures. A hierarchical generalization technique is proposed to generate sub-trees from presentation trees of mathematical formulae, and similarity is calculated based upon matching at different levels of these trees. Experimental results show that WikiMirs can efficiently support sub-structure matching and similarity matching of mathematical formulae. Moreover, WikiMirs obtains both higher accuracy and better ranked results over Wikipedia in comparison to Wikipedia Search and Egomath. We conclude that WikiMirs provides a new, alternative, and hopefully better service for users to search mathematical expressions within Wikipedia.
Interacting with and through a digital library collection: commenting behavior in Flickr's The Commons (pp. 21-24)
  Sally Jo Cunningham; Malika Mahoui
There is growing interest among digital collection providers in engaging collection users in interacting with the collection (e.g. by tagging or annotating collection contents) and with the collection organizers and other users (e.g. to form loose "communities" associated with the collection). At present, little has been documented as to the uptake of these mechanisms in specific collections, or the range of behaviors that emerge as users bend existing facilities to their own needs. This paper is one step in that direction: it describes the social information behaviors exhibited in a cultural heritage photography collection in The Commons on Flickr, and suggests implications for digital library design in response to these behaviors.
A comparative study of academic and Wikipedia ranking (pp. 25-28)
  Xin Shuai; Zhuoren Jiang; Xiaozhong Liu; Johan Bollen
In addition to its broad popularity, Wikipedia is also widely used for scholarly purposes. Many Wikipedia pages pertain to academic papers, scholars, and topics, providing a rich ecology for scholarly uses. Scholarly references and mentions on Wikipedia may thus shape the "societal impact" of a certain scholarly communication item, but it is not clear whether they shape actual "academic impact". In this paper we compare the impact of papers, scholars, and topics according to two different measures, namely scholarly citations and Wikipedia mentions. Our results show that academic and Wikipedia impact are positively correlated. Papers, authors, and topics that are mentioned on Wikipedia have higher academic impact than those that are not mentioned. Our findings validate the hypothesis that Wikipedia can help assess the impact of scholarly publications and underpin relevance indicators for scholarly retrieval or recommendation systems.

Preservation I

A distributed archival network for process-oriented autonomic long-term digital preservation (pp. 29-38)
  Ivan Subotic; Lukas Rosenthaler; Heiko Schuldt
The reliable and consistent long-term preservation of digital content and metadata is becoming increasingly important -- even though the storage media used are potentially subject to failures, or the data formats may become obsolete over time. A common approach is to replicate data across several sites to increase their availability. Nevertheless, network, software, or hardware failures as well as the evolution of data formats have to be coped with in a timely and, ideally, autonomous way, without the intervention of an administrator. In this paper we present DISTARNET, a distributed, autonomous long-term digital preservation system. Essentially, DISTARNET exploits dedicated processes to ensure the integrity and consistency of data with a given replication degree. At the data level, DISTARNET supports complex data objects, the management of collections, annotations, and arbitrary links between digital objects. At the process level, dynamic replication management, consistency checking, and automated recovery of the archived digital objects are provided, utilizing autonomic behavior governed by preservation policies without any centralized component. We present the concepts and implementation of the distributed DISTARNET preservation approach. Most importantly, we provide details of the qualitative and quantitative evaluation of the DISTARNET system. The former addresses the effectiveness of the internal preservation processes, while the latter evaluates DISTARNET's performance regarding overall archiving storage capacity and scalability.
Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive (pp. 39-48)
  Scott G. Ainsworth; Michael L. Nelson
When a user views an archived page using the archive's user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed; drifting away from the datetime originally selected. When browsing sparsely-archived pages, this nearly-silent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive's Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
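The contrast between the two policies can be made concrete with a small simulation. The sketch below is illustrative only (toy capture datetimes, not the study's data): each page in a walk has a list of archive capture times, the archive returns the capture closest to the current target, and only the Sliding Target policy re-targets after each step.

```python
def nearest(captures, target):
    """Return the capture datetime (in days) closest to the target."""
    return min(captures, key=lambda c: abs(c - target))

def walk_drift(pages, start_target, sticky):
    """Walk a chain of pages, each archived at the given capture
    datetimes, and return the final drift from the original target."""
    target = start_target
    returned = start_target
    for captures in pages:
        returned = nearest(captures, target)
        if not sticky:  # the Sliding Target policy re-targets every step
            target = returned
    return abs(returned - start_target)

# Toy walk: page 1 exists only 30 days after the target; later pages
# have both an on-target capture and a later one.
pages = [[30], [0, 55], [0, 90]]
print(walk_drift(pages, 0, sticky=False))  # sliding drifts 90 days
print(walk_drift(pages, 0, sticky=True))   # sticky ends at 0 drift
```

Under the sliding policy each step's result becomes the next step's target, so small offsets compound; the sticky policy keeps pulling every request back toward the originally selected datetime.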
Medusa at the University of Illinois at Urbana-Champaign: a digital preservation service based on PREMIS (pp. 49-52)
  Kyle R. Rimkus; Thomas Habing
The Medusa digital preservation service at the University of Illinois at Urbana-Champaign provides a storage environment for digital content selected for long-term retention by content managers and producers affiliated with the Library in order to ensure its enduring access and use. This paper reports on Medusa development, with emphasis on the research processes that informed key decisions related to its design, the central role of PREMIS metadata in its architecture, and future directions of integrating PREMIS management into a Fedora repository architecture. In so doing, it describes a strategy of digital preservation content management that draws strength from the creation and management of comprehensive PREMIS preservation metadata records.
First steps in archiving the mobile web: automated discovery of mobile websites (pp. 53-56)
  Richard Schneider; Frank McCown
Smartphones and tablets are increasingly used to access the Web, and many websites now provide alternative sites tailored specifically for these mobile devices. Web archivists are in need of tools to aid in archiving this equally ephemeral Mobile Web. We present Findmobile, a tool for automating the discovery of mobile websites. We tested our tool in an experiment examining 10K popular websites and found that the technique most frequently used by popular websites to direct mobile users to mobile sites was automated client- and server-side redirection. We found that nearly half of mobile web pages differ dramatically from their stationary web counterparts and that the most popular websites are those most likely to have mobile-specific pages.

Education

Vertical selection in the information domain of children (pp. 57-66)
  Sergio Duarte Torres; Djoerd Hiemstra; Theo Huibers
In this paper we explore vertical selection methods for aggregated search in the specific domain of topics for children between 7 and 12 years old. We built a test collection consisting of 25 verticals, 3.8K queries, and relevance assessments for a large sample of these queries, mapping relevant verticals to queries. We gathered the relevance assessments by envisaging two aggregated search systems: one in which the Web vertical is always displayed and one in which each vertical is assessed independently of the Web vertical. We show that the two approaches lead to different sets of relevant verticals and that the former is prone to a bias towards visually oriented verticals. In the second part of this paper we estimate the sizes of the verticals for the target domain. We show that employing global and domain-specific size estimates of the verticals leads to significant improvements when using state-of-the-art methods of vertical selection. We also introduce a novel vertical and query representation based on tags from social media and show that its use leads to significant performance gains.
Automatic extraction of core learning goals and generation of pedagogical sequences through a collection of digital library resources (pp. 67-76)
  Ifeyinwa Okoye; Tamara Sumner; Steven Bethard
A key challenge facing educational technology researchers is how to provide structure and guidance when learners use unstructured and open tools such as digital libraries for their own learning. This work attempts to use computational methods to identify that structure in a domain-independent way and to support learners as they navigate and interpret the information they find. This article highlights a computational methodology for generating a pedagogical sequence through core learning goals extracted from a collection of resources, which in this case are resources from the Digital Library for Earth System Education (DLESE). This article describes how we use multi-document summarization to extract the core learning goals from the digital library resources and how we create a supervised classifier that performs a pair-wise classification of the core learning goals; the judgments from these classifications are used to automatically generate pedagogical sequences. Results show that we can extract good core learning goals and make pair-wise classifications that are up to 76% similar to those derived from pedagogical sequences created by two science education experts. Thus we can dynamically generate pedagogically meaningful learning paths through digital library resources.
Building a search engine for computer science course syllabi (pp. 77-86)
  Nakul Rathod; Lillian Cassel
Syllabi are rich educational resources. However, finding Computer Science syllabi with a generic search engine does not work well. Towards our goal of building a syllabus collection, we have trained various machine-learning classifiers to distinguish Computer Science syllabi from other web pages and to recognize the discipline they represent (AI or SE, for instance), among other things. We have crawled 50 Computer Science departments in the US and gathered 100,000 candidate pages. Our best classifiers are more than 90% accurate at identifying syllabi in real-world data. The syllabus repository we created is live for public use (at http://syllabus.sdakak.com) and contains more than 3000 syllabi that our classifiers filtered from the crawl data. We present an analysis of the various feature selection methods and classifiers used.

Information ranking

Ranking experts using author-document-topic graphs (pp. 87-96)
  Sujatha Das Gollapalli; Prasenjit Mitra; C. Lee Giles
Expert search or recommendation involves the retrieval of people (experts) in response to a query and, on occasion, a given set of constraints. In this paper, we address expert recommendation in academic domains that are different from the web and intranet environments studied in TREC. We propose and study graph-based models for expertise retrieval with the objective of enabling search using either a topic (e.g. "Information Extraction") or a name (e.g. "Bruce Croft"). We show that graph-based ranking schemes, despite being "generic", perform on par with expert ranking models specific to topic-based and name-based querying.
Aggregating productivity indices for ranking researchers across multiple areas (pp. 97-106)
  Harlley Lima; Thiago H. P. Silva; Mirella M. Moro; Rodrygo L. T. Santos; Wagner Meira, Jr.; Alberto H. F. Laender
The impact of scientific research has traditionally been quantified using productivity indices such as the well-known h-index. On the other hand, different research fields -- in fact, even different research areas within a single field -- may have very different publishing patterns, which may not be well described by a single, global index. In this paper, we argue that productivity indices should account for the singularities of the publication patterns of different research areas, in order to produce an unbiased assessment of the impact of scientific research. Inspired by ranking aggregation approaches in distributed information retrieval, we propose a novel approach for ranking researchers across multiple research areas. Our approach is generic and produces cross-area versions of any global productivity index, such as the volume of publications, citation count and even the h-index. Our thorough evaluation considering multiple areas within the broad field of Computer Science shows that our cross-area indices outperform their global counterparts when assessed against the official ranking produced by CNPq, the Brazilian National Research Council for Scientific and Technological Development. As a result, this paper contributes a valuable mechanism to support the decisions of funding bodies and research agencies, for example, in any research assessment effort.
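Among the global indices mentioned, the h-index has a compact definition: a researcher has index h if h of their papers each have at least h citations. A minimal sketch of that baseline quantity (the paper's cross-area aggregation itself is not reproduced here):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:  # the paper at this rank still "covers" itself
            h = rank
        else:
            break
    return h

# Five papers cited 6, 5, 3, 1 and 0 times yield an h-index of 3.
print(h_index([6, 5, 3, 1, 0]))  # 3
```

The cross-area argument in the paper is precisely that a single global cut-off like this one is biased across fields with different citation norms.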
IFME: information filtering by multiple examples with under-sampling in a digital library environment (pp. 107-110)
  Mingzhu Zhu; Chao Xu; Yi-Fang Brook Wu
With the amount of digitalized documents increasing exponentially, it is more difficult for users to keep up to date with the knowledge in their domain. In this paper, we present a framework named IFME (Information Filtering by Multiple Examples) in a digital library environment to help users identify the literature related to their interests by leveraging Positive Unlabeled (PU) learning. Using a few relevant documents provided by a user as positive examples (called P) and considering the documents in an online database as unlabeled data (called U), it ranks the documents in U using a PU learning algorithm. From the experimental results, we found that while the approach performed well when a large set of relevant feedback documents was available, it performed relatively poorly when the relevant feedback documents were few. We improved IFME by combining PU learning with under-sampling to tune the performance. Using Mean Average Precision (MAP), our experimental results indicated that with under-sampling, the performance improved significantly even when the size of P was small. We believe the PU learning based IFME framework brings insights for developing more effective digital library systems.
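As a rough, hypothetical illustration of the idea (not the authors' algorithm), the sketch below ranks unlabeled bag-of-words documents by similarity to the positive set P minus similarity to a pseudo-negative set under-sampled from U down to |P|; all documents here are toy data.

```python
def rank_pu(positives, unlabeled):
    """Rank unlabeled bag-of-words documents by a naive PU score.

    Under-sampling step: only |P| pseudo-negatives are drawn from U,
    so a tiny positive set is not swamped. A real implementation would
    draw them at random; we take the first |P| of U here so the
    example is reproducible.
    """
    pseudo_neg = unlabeled[:len(positives)]

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def score(doc):
        pos = max(jaccard(doc, p) for p in positives)
        neg = max(jaccard(doc, n) for n in pseudo_neg)
        return pos - neg

    return sorted(unlabeled, key=score, reverse=True)

P = [{"digital", "library", "preservation"}]
U = [{"cooking", "recipes"},
     {"digital", "library", "metadata"},
     {"sports", "scores"}]
print(rank_pu(P, U)[0])  # the document sharing terms with P ranks first
```

The under-sampling keeps the pseudo-negative pool the same size as P, which is the balancing effect the abstract credits for the improvement when P is small.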
Can't see the forest for the trees?: a citation recommendation system (pp. 111-114)
  Cornelia Caragea; Adrian Silvescu; Prasenjit Mitra; C. Lee Giles
Scientists continue to face challenges with the ever-increasing amount of information that has been produced on a worldwide scale during the last decades. When writing a paper, an author searches for the most relevant citations that started or were the foundation of a particular topic, which would very likely explain the thinking or algorithms that are employed. The search is usually done using specific keywords submitted to literature search engines such as Google Scholar and CiteSeer. However, finding relevant citations is distinct from finding articles that are only topically similar to an author's proposal. In this paper, we address the problem of citation recommendation using a singular value decomposition approach. The models are trained and evaluated on the CiteSeer digital library. The results of our experiments show that the proposed approach achieves significant success when compared with collaborative filtering methods on the citation recommendation task.
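The abstract does not give the exact formulation, but a common way to instantiate SVD-based citation recommendation is to factor a citing-paper by cited-paper matrix at low rank and score unseen candidates from the reconstruction. A toy NumPy sketch of that general technique (illustrative matrix only):

```python
import numpy as np

# Toy citing-paper x cited-paper matrix: entry (i, j) is 1 if
# paper i cites candidate j.
C = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Rank-k truncated SVD yields a smoothed citation-score matrix.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
scores = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# For paper 0, recommend the not-yet-cited candidate with the
# highest reconstructed score.
seen = C[0] > 0
candidate = int(np.argmax(np.where(seen, -np.inf, scores[0])))
print(candidate)  # 2: cited by the paper most similar to paper 0
```

The low-rank reconstruction propagates evidence through co-citation structure, which is why paper 0 is pointed at candidate 2 (cited by its nearest neighbor) rather than candidate 3.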

Evaluation

Comparative appraisal: systematic assessment of expressive qualities (pp. 115-124)
  Melanie Feinberg
Clifford Lynch describes the value of digital libraries as adding interpretive layers to collections of cultural heritage materials. However, standard forms of evaluation, which focus on the degree to which a system solves problems, are insufficient assessments of the expressive qualities that distinguish such interpretive content. This paper describes a form of comparative, structured appraisal that supplements the existing repertoire of assessment techniques. Comparative appraisal uses a situationally defined set of procedures to be followed by multiple assessors in examining a group of artifacts. While this approach aims for a goal of systematic comparison based on selected dimensions, it is grounded in the recognition that expressive qualities are not conventionally measurable and that absolute agreement between assessors is neither possible nor desirable. The conceptual basis for this comparative method is drawn from the literature of writing assessment.
Charting the digital library evaluation domain with a semantically enhanced mining methodology (pp. 125-134)
  Eleni Afiontzi; Giannis Kazadeis; Leonidas Papachristopoulos; Michalis Sfakakis; Giannis Tsakonas; Christos Papatheodorou
The digital library evaluation field has an evolving nature and is characterized by a noteworthy proclivity to enfold various methodological orientations. Given that the scientific literature in the domain is vast, researchers require tools that will exhibit either commonly accepted practices or areas for further investigation. In this paper, a data mining methodology is proposed to identify prominent patterns in the evaluation of digital libraries. Using machine learning techniques, all papers presented at the ECDL and JCDL conferences between 2001 and 2011 were categorized as relevant or non-relevant to the DL evaluation domain. Then, the relevant papers were semantically annotated according to the Digital Library Evaluation Ontology (DiLEO) vocabulary. The produced set of annotations was clustered into evaluation patterns for the most frequently used tools, methods, and goals of the domain. Our findings highlight the expressive nature of DiLEO, place emphasis on semantic annotation as a necessary step in handling domain-centric corpora, and underline the potential of the proposed methodology in the profiling of evaluation activities.
Mendeley group as a new source of interdisciplinarity study: how do disciplines interact on Mendeley? (pp. 135-138)
  Jiepu Jiang; Chaoqun Ni; Daqing He; Wei Jeng
In this paper, we study interdisciplinary structures by looking into how online academic groups of different disciplines share members and followers. Results based on Mendeley online groups show clear interdisciplinary structures, indicating that Mendeley online groups are a promising data source and offer a new perspective for disciplinarity and interdisciplinarity studies.
Following bibliometric footprints: the ACM digital library and the evolution of computer science (pp. 139-142)
  Shion Guha; Stephanie Steinhardt; Syed Ishtiaque Ahmed; Carl Lagoze
Using bibliometric methods, this exploratory work shows evidence of transitions in the field of computer science since the emergence of HCI as a distinct sub-discipline. We mined the ACM Digital Library in order to expose relationships between sub-disciplines in computer science, focusing in particular on the transformational nature of the SIG on Computer-Human Interaction (CHI) in relation to other SIGs. Our results suggest shifts in the field due to broader social, economic and political changes in computing research and are intended as a prolegomenon to further investigations.

Information clustering

Information-theoretic term weighting schemes for document clustering (pp. 143-152)
  Weimao Ke
We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed Least Information theory (LIT) provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities in the document clustering context: 1) LI Binary (LIB), which quantifies information due to the observation of a term's (binary) occurrence in a document; and 2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. Both quantities are computed given term distributions in the document collection as prior knowledge and can be used separately or combined to represent documents for text clustering. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering.
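The TF*IDF baseline used for comparison is standard; one common variant (raw term frequency times log inverse document frequency -- the abstract does not state the exact variant used) can be sketched as:

```python
import math
from collections import Counter

docs = [
    "information retrieval and document clustering".split(),
    "term weighting for document clustering".split(),
    "shannon entropy and information theory".split(),
]

N = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(t for d in docs for t in set(d))

def tfidf(doc):
    """Weight each term by term frequency x log inverse doc frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

w = tfidf(docs[0])
# "retrieval" occurs in only one document, so it outweighs
# "document", which occurs in two.
print(w["retrieval"] > w["document"])  # True
```

LIB and LIF replace exactly this corpus-level frequency weighting with quantities derived from the proposed least-information measure.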
Exploiting potential citation papers in scholarly paper recommendation (pp. 153-162)
  Kazunari Sugiyama; Min-Yen Kan
To help generate relevant suggestions for researchers, recommendation systems have started to leverage the latent interests in the publication profiles of the researchers themselves. While using such a publication citation network has been shown to enhance performance, the network is often sparse, making recommendation difficult. To alleviate this sparsity, we identify "potential citation papers" through the use of collaborative filtering. Also, as different logical sections of a paper have different significance, as a secondary contribution, we investigate which sections of papers can be leveraged to represent papers effectively.
   On a scholarly paper recommendation dataset, we show that recommendation accuracy significantly outperforms state-of-the-art baselines, as measured by nDCG and MRR, when we discover potential citation papers using imputed similarities via collaborative filtering and represent candidate papers using the full text while assigning more weight to the conclusion sections.
Addressing diverse corpora with cluster-based term weighting (pp. 163-166)
  Peter Organisciak
Highly heterogeneous collections present difficulties to term weighting models that are informed by corpus-level frequencies. Collections that span multiple languages or large time periods do not provide realistic statistics on which words are interesting to a system. This paper presents a case where diverse corpora can frustrate term weighting and proposes a modification that computes term weights relative to a document's class or cluster within the collection. In cases of diverse corpora, the proposed modification better represents the intuitions behind corpus-level document frequencies.
Interactive search result clustering: a study of user behavior and retrieval effectiveness (pp. 167-170)
  Xuemei Gong; Weimao Ke; Yan Zhang; Ramona Broussard
Scatter/Gather is a document browsing and information retrieval method based on document clustering. It is designed to facilitate user articulation of information needs through iterative clustering and interactive browsing. This paper reports on a study that investigated the effectiveness of Scatter/Gather browsing for information retrieval. We conducted a within-subject user study of 24 college students to investigate the utility of a Scatter/Gather system, to examine its strengths and weaknesses, and to receive feedback from users on the system. Results show that the clustering-based Scatter/Gather method was more difficult to use than the classic information retrieval systems in terms of user perception. However, clustering helped the subjects accomplish the tasks more efficiently. Scatter/Gather clustering was particularly useful in helping users finish tasks that they were less familiar with and allowed them to search with fewer words. Scatter/Gather tended to be more useful when it was more difficult for the user to do query specification for an information need. Topic familiarity and specificity had significant influences on user perceived retrieval effectiveness. The influences appeared to be greater with the Scatter/Gather system compared to a classic search system. Topic familiarity also had significant influences on query formulation.

Specialists DLs

Tipple: location-triggered mobile access to a digital library for audio books (pp. 171-180)
  Annika Hinze; David Bainbridge
This paper explores the role of audio as a means to access books in a digital library while being at the location referred to in the books. The books are sourced from the digital library and can either be accompanied by pre-recorded audio or synthesized using text-to-speech. The paper details the functional requirements, design and implementation of Tipple. The concept was extensively tested in three field studies.
Redeye: a digital library for forensic document triage (pp. 181-190)
  Paul L. Bogen; Amber McKenzie; Rob Gillen
Forensic document analysis has become an important aspect of the investigation of many different kinds of crimes, from money laundering to fraud and from cybercrime to smuggling. The current workflow for analysts includes powerful tools, such as Palantir and Analyst's Notebook, for moving from evidence to actionable intelligence, and tools for finding documents among the millions of files on a hard disk, such as Forensic Toolkit (FTK). However, the process of sorting through collections of seized documents to filter out noise from actual evidence is often left to highly labor-intensive manual effort. This paper presents the Redeye Analysis Workbench, a tool to help analysts move from manually sorting a collection of documents to performing intelligent document triage over a digital library. We discuss the tools and techniques we build upon, in addition to an in-depth discussion of our tool and how it addresses two major use cases we observed analysts performing. Finally, we include a new layout algorithm for radial graphs that is used to visualize clusters of documents in our system.
Local histories in global digital libraries: identifying demand and evaluating coverage (pp. 191-194)
  Katrina Fenlon; Virgil E. Varvel, Jr.
Digital collections of primary source materials have the potential to change how citizen historians and scholars research and engage with local history. The problem at the heart of this study is how to evaluate local history coverage, particularly among large-scale, distributed collections and aggregations. As part of an effort to holistically evaluate one such national aggregation, the Institute of Museum and Library Services (IMLS) Digital Collections and Content (DCC), we conducted a national survey of reference service providers at academic and public libraries throughout the United States. In this paper, we report the results of this survey that appear relevant to local history and collection evaluation, and consider the implications for scalable evaluation of local history coverage in massive, aggregative digital libraries.
Instrument distribution and music notation search for enhancing bibliographic music score retrieval (pp. 195-198)
  Laurent Pugin; Rodolfo Zitellini
Because of the unique characteristics of music scores, searching bibliographic music collections using traditional library systems can be a challenge. In this paper, we present two specific search functionalities added to the Swiss RISM database and describe how they improve the user experience. The first is a search functionality for instrument and vocal part distribution that leverages coded information available in the MarcXML records of the database. It enables scores for a precise ensemble distribution to be retrieved. The second is a search functionality for music notation excerpts transcribed from the beginnings of pieces, known as music incipits. The incipit search is achieved using a well-known music information retrieval (MIR) tool, Themefinder. A novelty of our implementation is that it can operate at three different levels (pitch, duration and metric), singly or combined, and that it is performed through a specifically developed intuitive graphical interface for note input and parameter selection. The two additions illustrate why it is important to take into consideration the particularities of music scores when designing a search system and how MIR tools can be beneficially integrated into existing heterogeneous bibliographic score collections.

Name extraction

A search engine approach to estimating temporal changes in gender orientation of first names BIBAFull-Text 199-208
  Brittany N. Smith; Mamta Singh; Vetle I. Torvik
This paper presents an approach for predicting the gender orientation of any given first name over time based on a set of search engine queries with the name prefixed by masculine and feminine markers (e.g., "Uncle Taylor"). We hypothesize that these markers can capture the great majority of variability in gender orientation, including temporal changes. To test this hypothesis, we train a logistic regression model, with time-varying marker weights, using marker counts from Bing.com to predict male/female counts for 85,406 names in US Social Security Administration (SSA) data during 1880-2008. The model misclassifies 2.25% of the people in the SSA dataset (slightly worse than the 1.74% pure error rate) and provides accurate predictions for names beyond the SSA. The misclassification rate is higher in recent years (due to increasing name diversity), for general English words (e.g., Will), for names from certain countries (e.g., China), and for rare names. However, the model tends to err on the side of caution by predicting neutral/unknown rather than making a false positive prediction. As hypothesized, the markers also capture temporal patterns of androgyny. For example, Daughter is a stronger female predictor for recent years while Grandfather is a stronger male predictor around the turn of the 20th century. The model has been implemented as a web-tool called Genni (available via http://abel.lis.illinois.edu/) that displays the predicted proportion of females vs. males over time for any given name. This should be a valuable resource for those who utilize names in order to discern gender on a large scale, e.g., to study the roles of gender and diversity in scholarly work based on digital libraries and bibliographic databases where the authors' names are listed.
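The marker-based idea above can be sketched in a few lines. This is a minimal illustration, not the paper's fitted model: the marker lists, the single weight on a smoothed log-ratio of feminine to masculine hit counts, and the neutral band are all our own assumptions.

```python
import math

# Hypothetical marker lists; the paper uses a larger set with
# time-varying learned weights per marker.
MASCULINE_MARKERS = ["Uncle", "Mr.", "Grandfather"]
FEMININE_MARKERS = ["Aunt", "Mrs.", "Grandmother"]

def predict_female_probability(marker_counts, weight=1.0, bias=0.0):
    """marker_counts: dict mapping marker -> search-engine hit count
    for queries like 'Uncle Taylor'. Returns P(name is female)."""
    masc = sum(marker_counts.get(m, 0) for m in MASCULINE_MARKERS)
    fem = sum(marker_counts.get(m, 0) for m in FEMININE_MARKERS)
    # Smoothed log-ratio of feminine to masculine evidence,
    # squashed through the logistic function.
    score = weight * math.log((fem + 1) / (masc + 1)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def classify(marker_counts, neutral_band=0.1):
    """Err on the side of caution: report 'neutral' near 0.5,
    mirroring the paper's preference for neutral/unknown over
    false positives."""
    p = predict_female_probability(marker_counts)
    if abs(p - 0.5) < neutral_band:
        return "neutral"
    return "female" if p > 0.5 else "male"
```

A name with overwhelmingly feminine marker hits classifies as female, a balanced one falls into the neutral band.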
A relevance feedback approach for the author name disambiguation problem BIBAFull-Text 209-218
  Thiago A. Godoi; Ricardo da S. Torres; Ariadne M. B. R. Carvalho; Marcos A. Gonçalves; Anderson A. Ferreira; Weiguo Fan; Edward A. Fox
This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming framework. Experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.
Extracting and matching authors and affiliations in scholarly documents BIBAFull-Text 219-228
  Huy Hoang Nhat Do; Muthu Kumar Chandrasekaran; Philip S. Cho; Min Yen Kan
We introduce Enlil, an information extraction system that discovers the institutional affiliations of authors in scholarly papers. Enlil consists of two steps: a conditional random field that first identifies authors and affiliations, and a support vector machine that then connects authors to their affiliations. We benchmark Enlil in three separate experiments drawn from three different sources: the ACL Anthology Corpus, the ACM Digital Library, and a set of cross-disciplinary scientific journal articles acquired by querying Google Scholar. Against a state-of-the-art production baseline, Enlil reports a statistically significant improvement in F_1 of nearly 10% (p < 0.01). In the case of multidisciplinary articles from Google Scholar, Enlil is benchmarked over both clean input (F_1 > 90%) and automatically-acquired input (F_1 > 80%).
   We have deployed Enlil in a case study involving Asian genomics research publication patterns to understand how government sponsored collaborative links evolve. Enlil has enabled our team to construct and validate new metrics to quantify the facilitation of research as opposed to direct publication.
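The two-stage design can be illustrated schematically. In this sketch both learned models are replaced by toy heuristics (a keyword test standing in for the conditional random field, and a nearest-following-affiliation rule standing in for the SVM), so it shows only the data flow, not Enlil's actual features or accuracy.

```python
# Stage one: label each header line AUTHOR or AFFILIATION.
# A CRF would use rich token-level features; here a keyword
# test is a deliberately crude stand-in.
def stage_one_label(lines):
    labeled = []
    for line in lines:
        kind = "AFFILIATION" if any(
            w in line for w in ("University", "Institute", "Lab")) else "AUTHOR"
        labeled.append((line, kind))
    return labeled

# Stage two: pair each author with an affiliation. An SVM would
# score candidate pairs by layout and marker features; here we
# simply take the nearest affiliation appearing after the author.
def stage_two_match(labeled):
    pairs = []
    for i, (text, kind) in enumerate(labeled):
        if kind != "AUTHOR":
            continue
        for later_text, later_kind in labeled[i + 1:]:
            if later_kind == "AFFILIATION":
                pairs.append((text, later_text))
                break
    return pairs
```

Running the two stages on a toy header block links each author line to the affiliation line that follows it.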

Metadata

User-centered approach in creating a metadata schema for video games and interactive media BIBAFull-Text 229-238
  Jin Ha Lee; Hyerim Cho; Violet Fox; Andrew Perti
Video games and interactive media are increasingly becoming an important part of our culture and everyday life, and subsequently, of archival and digital library collections. However, existing organizational systems often use vague or inconsistent terms to describe video games or attempt to use schemas designed for textual bibliographic resources. Our research aims to create a standardized metadata schema and encoding scheme that provides an intelligent and comprehensive way to represent video games. We conducted interviews with 24 gamers, focusing on their video game-related information needs and seeking behaviors. We also performed a domain analysis of current organizational systems used in catalog records and popular game websites, evaluating the metadata elements used to describe games. With these results in mind, we created a list of elements which form a metadata schema for describing video games, with both a core set of 16 elements and an extended set of 46 elements providing more flexibility in expressing the nature of a game.
Automatic tag recommendation for metadata annotation using probabilistic topic modeling BIBAFull-Text 239-248
  Suppawong Tuarob; Line C. Pouchard; C. Lee Giles
The increasing complexity and advancement of the ecological and environmental sciences encourage scientists across the world to collect data from multiple places, times, and thematic scales to verify their hypotheses. Accumulated over time, such data not only increase in amount, but also in the diversity of the data sources spread around the world. This poses a huge challenge for scientists who have to manually search for information. To alleviate such problems, ONEMercury has recently been implemented as part of the DataONE project to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata from the data hosted by multiple repositories and makes it searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can hinder effective retrieval. Here, we develop algorithms for automatic annotation of metadata. We transform the problem into a tag recommendation problem with a controlled tag library, and propose two variants of an algorithm for recommending tags. Our experiments on four datasets of environmental science metadata records not only show great promise for the performance of our method, but also shed light on the different natures of the datasets.
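The controlled-library tag recommendation task can be sketched with a deliberately simple scorer. This stands in for the paper's topic-model variants: we score each tag by word overlap between the new record and the records that tag already annotates. All names and the scoring rule are our own.

```python
from collections import Counter

def tokenize(text):
    return [w.lower().strip(".,") for w in text.split()]

def build_tag_profiles(annotated_records):
    """annotated_records: list of (text, tags). Aggregate a word
    profile per tag from the records it annotates; a topic model
    would learn soft word distributions instead."""
    profiles = {}
    for text, tags in annotated_records:
        words = Counter(tokenize(text))
        for tag in tags:
            profiles.setdefault(tag, Counter()).update(words)
    return profiles

def recommend_tags(profiles, record_text, k=2):
    """Score each tag by overlap with the sparse record, return
    the top-k tags with a positive score (ties broken by name)."""
    words = set(tokenize(record_text))
    scores = {
        tag: sum(count for w, count in profile.items() if w in words)
        for tag, profile in profiles.items()
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked if scores[t] > 0][:k]
```

A record about rainfall and soil moisture picks up the hydrology and soil tags from similar annotated records.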
The user-centered development and testing of a Dublin core metadata tool BIBAFull-Text 249-252
  Catherine Hall; Michael Khoo
Digital libraries are supported by good quality metadata, and thus by the use of good quality metadata tools. The design of metadata tools can be supported by following user-centered design processes. In this paper we discuss the application and evaluation of several cognitively-based rules, derived from the work of Donald Norman, to the design of a metadata tool for administering Dublin Core metadata. One overall finding was that while the use of the rules supported users in their immediate interactions with the tool interface, they provided less support for the more cognitively intensive tasks associated with developing a wider conceptual understanding of the purpose of metadata. The findings have implications for the wider development of tools to support metadata work in digital libraries and allied contexts.
Identification of works of manga using LOD resources: an experimental FRBRization of bibliographic data of comic books BIBAFull-Text 253-256
  Wenling He; Tetsuya Mihara; Mitsuharu Nagamori; Shigeo Sugimoto
Manga -- a Japanese term meaning graphic novel or comic -- has been globally accepted. In Japan, a huge number of manga monographs and magazines are published. The work entity defined in Functional Requirements for Bibliographic Records (FRBR) is useful for identifying and finding manga. This paper examines how to identify manga works in a set of bibliographic records maintained by the Kyoto International Manga Museum. It is known that authority data is useful for identifying works from bibliographic records. However, the authority data for manga is not rich, because manga has been recognized as a sub-culture resource and is generally not included in library collections. In this study, we used DBpedia, which is a large Linked Open Data (LOD) resource created from Wikipedia, to identify FRBR manga entities in bibliographic records. The results of this study show that using LOD resources is a reasonable way to identify works from bibliographic records. They also show that the accuracy and efficiency of work identification depend on the quality of the LOD resources used.

Web replication

Reading the correct history?: modeling temporal intention in resource sharing BIBAFull-Text 257-266
  Hany M. SalahEldeen; Michael L. Nelson
The web is trapped in the "perpetual now", and when users traverse from page to page, they are seeing the state of the web resource (i.e., the page) as it exists at the time of the click and not necessarily at the time when the link was made. Thus, a temporal discrepancy can arise between the resource at the time the page author created a link to it and the time when a reader follows the link. This is especially important in the context of social media: the ease of sharing links in a tweet or Facebook post allows many people to author web content, but the space constraints combined with poor awareness by authors often prevents sufficient context from being generated to determine the intent of the post. If the links are clicked as soon as they are shared, the temporal distance between sharing and clicking is so small that there is little to no difference in content. However, not all clicks occur immediately, and a delay of days or even hours can result in reading something other than what the author intended. We introduce the concept of a user's temporal intention upon publishing a link in social media. We investigate the features that could be extracted from the post, the linked resource, and the patterns of social dissemination to model this user intention. Finally, we analyze the historical integrity of the shared resources in social media across time. In other words, how much is the knowledge of the author's intent beneficial in maintaining the consistency of the story being told through social posts and in enriching the archived content coverage and depth of vulnerable resources?
An evaluation of caching policies for memento timemaps BIBAFull-Text 267-276
  Justin F. Brunelle; Michael L. Nelson
As defined by the Memento Framework, TimeMaps are machine-readable lists of time-specific copies -- called "mementos" -- of an archived original resource. In theory, as an archive acquires additional mementos over time, a TimeMap should be monotonically increasing. However, there are reasons why the number of mementos in a TimeMap would decrease, for example: archival redaction of some or all of the mementos, archival restructuring, and transient errors of one or more archives. We study TimeMaps for 4,000 original resources over a three month period, note their change patterns, and develop a caching algorithm for TimeMaps suitable for a reverse proxy in front of a Memento aggregator. We show that TimeMap cardinality is constant or monotonically increasing for 80.2% of all TimeMap downloads in the observation period. The goal of the caching algorithm is to exploit the ideally monotonically increasing nature of TimeMaps and not cache responses with fewer mementos than the already cached TimeMap. This new caching algorithm uses conditional cache replacement and a Time To Live (TTL) value to ensure the user has access to the most complete TimeMap available. Based on our empirical data, a TTL of 15 days will minimize the number of mementos missed by users, and minimize the load on archives contributing to TimeMaps.
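The conditional cache-replacement policy with a TTL might be sketched as follows; the class and method names are ours, and only the 15-day TTL value is taken from the paper's empirical finding.

```python
import time

TTL_SECONDS = 15 * 24 * 3600  # 15 days, per the paper's empirical finding

class TimeMapCache:
    """Reverse-proxy-style cache for TimeMaps. A fresh download
    replaces the cached copy only if it has at least as many
    mementos, so a transiently shrunken TimeMap cannot evict a
    more complete one; the TTL forces an eventual refresh."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock  # injectable for testing
        self._store = {}    # uri -> (mementos, cached_at)

    def get(self, uri):
        entry = self._store.get(uri)
        if entry is None:
            return None
        mementos, cached_at = entry
        if self.clock() - cached_at > self.ttl:
            del self._store[uri]  # expired: force a re-download
            return None
        return mementos

    def put(self, uri, mementos):
        cached = self._store.get(uri)
        # Conditional replacement: never overwrite with a smaller TimeMap.
        if cached is not None and len(mementos) < len(cached[0]):
            return False
        self._store[uri] = (list(mementos), self.clock())
        return True
```

With an injected clock, one can see a shrunken download being rejected and the TTL eventually expiring the entry.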
Extending sitemaps for ResourceSync BIBAFull-Text 277-280
  Martin Klein; Herbert Van de Sompel
The documents used in the ResourceSync synchronization framework are based on the widely adopted document format defined by the Sitemap protocol. In order to address requirements of the framework, extensions to the Sitemap format were necessary. This short paper describes the concerns we had about introducing such extensions, the tests we did to evaluate their validity, and aspects of the framework to address them.
Multimodal alignment of scholarly documents and their presentations BIBAFull-Text 281-284
  Bamdad Bahrani; Min-Yen Kan
We present a multimodal system for aligning scholarly documents to corresponding presentations in a fine-grained manner (i.e., per presentation slide and per paper section). Our method improves upon a state-of-the-art baseline that employs only textual similarity. Based on an analysis of baseline errors, we propose a three-pronged alignment system that combines textual, image, and ordering information to establish alignment. Our results show a statistically significant improvement of 25%, confirming the importance of visual content in improving alignment accuracy.

Data

Visual-interactive querying for multivariate research data repositories using bag-of-words BIBAFull-Text 285-294
  Maximilian Scherer; Tatiana von Landesberger; Tobias Schreck
Large amounts of multivariate data are collected in different areas of scientific research and industrial production. These data are collected, archived and made publicly available by research data repositories. In addition to meta-data based access, content-based approaches are highly desirable to effectively retrieve, discover and analyze data sets of interest. Several such methods, which allow users to search for particular curve progressions, have been proposed. However, a major challenge when providing content-based access -- interactive feedback during query formulation -- has not received much attention yet. This is important because it can substantially improve the user's search effectiveness. In this paper, we present a novel interactive feedback approach for content-based access to multivariate research data. We thereby enable query modalities that were not previously available for multivariate data. We provide instant search results and highlight query patterns in the result set. Real-time search suggestions give an overview of important patterns to look for in the data repository. For this purpose, we develop a bag-of-words index for multivariate data as the back-end of our approach. We apply our method to a large repository of multivariate data from the climate research domain, and describe a use-case for the discovery of interesting patterns in maritime climate research using our new visual-interactive query tools.
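A bag-of-words index for curves can be sketched by quantizing fixed-length windows of a series into symbolic "words"; this minimal variant (the window size, level names, and overlap score are our own assumptions, not the paper's index) shows how curve-shape similarity reduces to word matching.

```python
from collections import Counter

def to_words(series, window=4, levels=("low", "mid", "high"), lo=0.0, hi=1.0):
    """Cut a series into non-overlapping windows and quantize each
    window's mean into one of a few symbolic levels ('words')."""
    words = []
    for i in range(0, len(series) - window + 1, window):
        mean = sum(series[i:i + window]) / window
        idx = min(int((mean - lo) / (hi - lo) * len(levels)), len(levels) - 1)
        words.append(levels[max(idx, 0)])
    return Counter(words)

def score(query_words, doc_words):
    """Multiset intersection size: higher means more similar shape,
    so indexed series can be ranked against a sketched query."""
    return sum((query_words & doc_words).values())
```

A flat low-valued query curve then scores higher against a low-valued series than against a high-valued one.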
The challenges of digging data: a study of context in archaeological data reuse BIBAFull-Text 295-304
  Ixchel Faniel; Eric Kansa; Sarah Whitcher Kansa; Julianna Barrera-Gomez; Elizabeth Yakel
Field archaeology only recently developed centralized systems for data curation, management, and reuse. Data documentation guidelines, standards, and ontologies have yet to see wide adoption in this discipline. Moreover, repository practices have focused on supporting data collection, deposit, discovery, and access more than data reuse. In this paper we examine the needs of archaeological data reusers, particularly the context they need to understand, verify, and trust data others collect during field studies. We then apply our findings to the existing work on standards development. We find that archaeologists place the most importance on data collection procedures, but the reputation and scholarly affiliation of the archaeologists who conducted the original field studies, the wording and structure of the documentation created during field work, and the repository where the data are housed also inform reuse. While guidelines, standards, and ontologies address some aspects of the context data reusers need, they provide less guidance on others, especially those related to research design. We argue repositories need to address these missing dimensions of context to better support data reuse in archaeology.
Constructing an anonymous dataset from the personal digital photo libraries of mac app store users BIBAFull-Text 305-308
  Jesse Prabawa Gozali; Min-Yen Kan; Hari Sundaram
Personal digital photo libraries embody a large amount of information useful for research into photo organization, photo layout, and the development of novel photo browser features. Even when anonymity can be ensured, amassing a sizable dataset from these libraries is still difficult due to the visibility and cost such a study would require.
   We explore using the Mac App Store to reach more users to collect data from such personal digital photo libraries. More specifically, we compare and discuss how it differs from common data collection methods, e.g. Amazon Mechanical Turk, in terms of time, cost, quantity, and design of the data collection application.
   We have collected a large, openly available photo feature dataset in this manner. We illustrate the types of data that can be collected. In 60 days, we collected data from 20,778 photo sets (473,772 photos). Our study with the Mac App Store suggests that popular application distribution channels are a viable means to acquire massive data collections for researchers.
Modeling heterogeneous data resources for social-ecological research: a data-centric perspective BIBAFull-Text 309-312
  Miao Chen; Umashanthi Pavalanathan; Scott Jensen; Beth Plale
Digital repositories are grappling with an influx of scientific data brought about by the well publicized "data deluge" in science, business, and society. One particularly perplexing problem is the long-term archival and reuse of complex data sets. This paper presents an integrated approach to data discovery over heterogeneous data resources in social-ecological systems research. Social-ecological systems data is complex because the research draws from both social and natural sciences. Using a sample set of data resources from the domain, we explore an approach to discovery and representation of this data. Specifically, we develop an ontology-based process of organization and visualization from a data-centric perspective. We define data resources broadly and identify six key categories of resources that include data collected from site visits to shared ecological resources, the structure of research instruments, domain concepts, research designs, publications, theories and models. We identify the underlying relationships and construct an ontology that captures these relationships using semantic web languages. The ontology and a NoSQL data store at the back end store the data resource instances. These are integrated into a portal architecture we refer to as the Integrated Visualization of Social-Ecological Resources (IViSER) that allows users to both browse the relationships captured in the ontology and easily visualize the granular details of data resources.

Historical DLs

Non-linear book manifolds: learning from associations the dynamic geometry of digital libraries BIBAFull-Text 313-322
  Richard Nock; Frank Nielsen; Eric Briys
Mainstream approaches in the design of virtual libraries basically exploit the same ambient space as their physical twins. Our paper is an attempt to capture automatically the actual space on which the books live, and to learn the virtual library as a non-linear book manifold. This tackles tantalizing questions, chief among which is whether modeling should be static and book focused (e.g. using a bag-of-words encoding) or dynamic and user focused (e.g. relying on what we define as a bag-of-readers encoding). Experiments on a real-world digital library show that the latter encoding is a serious challenger to the former. Our results also show that the geometric layers of the learned manifold bring sizeable advantages for retrieval and visualization purposes. For example, the topological layer of the manifold allows us to craft manifold association rules; experiments show that they bring dramatic improvements over conventional association rules built from the discrete topology of book sets. The improvements embrace each of the major standpoints on association rule mining: computational, support, confidence, lift, and leverage.
LSH-based large scale Chinese calligraphic character recognition BIBAFull-Text 323-330
  Yuan Lin; Jiangqin Wu; Pengcheng Gao; Yang Xia; Tianjiao Mao
Chinese calligraphy is the art of handwriting and an important part of Chinese traditional culture. Due to the complexity of the shapes and styles of calligraphic characters, it is difficult for common people to recognize them, so a tool that helps users recognize unknown calligraphic characters would be valuable. Well-known OCR (Optical Character Recognition) technology can hardly help here because of the characters' deformation and complexity. Numerous collections of historical Chinese calligraphic works have been digitized and stored in the CADAL (China Academic Digital Associate Library) calligraphic system [1], and a huge database, CCD (Calligraphic Character Dictionary), has been built, which contains character images labeled with semantic meaning. In this paper, an LSH-based large scale Chinese calligraphic character recognition method based on CCD is proposed. In our method, the GIST descriptor is used to represent the global features of calligraphic character images, and LSH (locality-sensitive hashing) is used to search CCD for character images similar to the image to be recognized. The recognition is based on a semantic probability computed from the ranks of the retrieved images and their distances to the query image in the GIST feature space. Our experiments show that our method is effective and efficient for recognizing Chinese calligraphic character images.
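The retrieval step can be sketched with random-hyperplane LSH over feature vectors; this toy version uses small hand-made vectors in place of GIST descriptors, and all function names are ours.

```python
import random

def make_hyperplanes(dim, n_bits, seed=7):
    """Random Gaussian hyperplanes; each contributes one signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """Bit per hyperplane: which side of the plane the vector falls on.
    Similar vectors tend to share signatures."""
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

def build_index(labeled_vectors, planes):
    """Bucket (label, vector) pairs by signature."""
    index = {}
    for label, vec in labeled_vectors:
        index.setdefault(signature(vec, planes), []).append((label, vec))
    return index

def query(index, planes, vec):
    """Return candidate labels from the query's bucket only, nearest
    first (squared Euclidean distance within the bucket), instead of
    scanning the whole dictionary."""
    bucket = index.get(signature(vec, planes), [])

    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, vec))

    return [label for label, v in sorted(bucket, key=lambda lv: dist(lv[1]))]
```

Querying with a vector already in the index retrieves its own label first, since identical vectors share a signature and have distance zero.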
Automatic performance evaluation of dewarping methods in large scale digitization of historical documents BIBAFull-Text 331-334
  Maryam Rahnemoonfar; Beth Plale
Geometric distortions are among the major challenging issues in the analysis of historical document images. Such distortions appear as arbitrary warping, folds and page curl, and have detrimental effects upon recognition (OCR) and readability. While there are many dewarping techniques discussed in the literature, there exists no standard method by which their performance can be evaluated against each other. In particular, there is no satisfactory method capable of comparing the results of existing dewarping techniques on arbitrarily warped documents. The existing methods either rely on the visual comparison of the output and input images or depend on the recognition rate of an OCR system. In the case of historical documents, OCR either is not available or does not generate an acceptable result. In this paper, an objective and automatic evaluation methodology for document image dewarping techniques is presented. In the first step, all the baselines in the original distorted image as well as the dewarped image are modelled precisely and automatically. Then, based on the mathematical function of each line, a comprehensive metric which calculates the performance of a dewarping technique is introduced. The presented method does not require user interference at any stage of evaluation and therefore is quite objective. Experimental results, applied to two state-of-the-art dewarping methods and an industry-standard commercial system, demonstrate the effectiveness of the proposed dewarping evaluation method.
Semiautomatic recognition and georeferencing of places in early maps BIBAFull-Text 335-338
  Winfried Höhn; Hans-Günter Schmidt; Hendrik Schöneberg
Early maps are a valuable resource for historical research, which is why digital libraries for early maps are becoming a necessary tool for research support in the information age. In this article we introduce the Referencing and Annotation Tool (RAT), designed to extract information about all places displayed in a map and link them to places on a modern map. RAT automatically recognizes place markers in an early map according to a template specified by the user and estimates the position of the annotated place on the modern map, thus making georeferencing easier. After a brief summary of related projects, we describe the functionality of the system. We discuss the most important implementation details and the factors influencing recognition accuracy and performance. The advantages of our semiautomatic approach are high accuracy and a significant decrease in the user's cognitive load.

Preservation II

Access patterns for robots and humans in web archives BIBAFull-Text 339-348
  Yasmin A. AlNoamany; Michele C. Weigle; Michael L. Nelson
Although user access patterns on the live web are well-understood, there has been no corresponding study of how users, both humans and robots, access web archives. Based on samples from the Internet Archive's public Wayback Machine, we propose a set of basic usage patterns: Dip (a single access), Slide (the same page at different archive times), Dive (different pages at approximately the same archive time), and Skim (lists of what pages are archived, i.e., TimeMaps). Robots are limited almost exclusively to Dips and Skims, but human accesses are more varied between all four types. Robots outnumber humans 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses, and 4:1 in terms of megabytes transferred. Robots almost always access TimeMaps (95% of accesses), but humans predominately access the archived web pages themselves (82% of accesses). In terms of unique archived web pages, there is no overall preference for a particular time, but the recent past (within the last year) shows significant repeat accesses.
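The four usage patterns can be expressed as a small session classifier. This is our own simplification of the paper's definitions: a session is a list of (page, archive_time) accesses, with TimeMap-only sessions flagged separately.

```python
def classify_session(accesses, timemap_only=False):
    """accesses: list of (page, archive_time) tuples for one session.
    Returns one of the four patterns, or 'Mixed' when a session
    combines pages and times (a case the sketch does not subdivide)."""
    if timemap_only:
        return "Skim"   # only lists of archived captures requested
    if len(accesses) <= 1:
        return "Dip"    # a single access
    pages = {page for page, _ in accesses}
    times = {t for _, t in accesses}
    if len(pages) == 1 and len(times) > 1:
        return "Slide"  # the same page at different archive times
    if len(times) == 1 and len(pages) > 1:
        return "Dive"   # different pages at ~the same archive time
    return "Mixed"
```

Robot traffic in the study would mostly land in Dip and Skim; human sessions spread across all four patterns.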
Free benchmark corpora for preservation experiments: using model-driven engineering to generate data sets BIBAFull-Text 349-358
  Christoph Becker; Kresimir Duretec
Digital preservation is an active area of research, and recent years have brought forward an increasing number of characterisation tools for the object-level analysis of digital content. However, there is a profound lack of objective, standardised and comparable metrics and benchmark collections to enable experimentation and validation of these tools. While fields such as Information Retrieval have for decades been able to rely on benchmark collections annotated with ground truth to enable systematic improvement of algorithms and systems along objective metrics, the digital preservation field is yet unable to provide the necessary ground truth for such benchmarks. Objective indicators, however, are the key enabler for quantitative experimentation and innovation.
   This paper presents a systematic model-driven benchmark generation framework that aims to provide realistic approximations of real-world digital information collections with fully known ground truth that enables systematic quantitative experimentation, measurement and improvement against objective indicators. We describe the key motivation and idea behind the framework, outline the technological building blocks, and discuss results of the generation of page-based and hierarchical documents from a ground truth model. Based on a discussion of the benefits and challenges of the approach, we outline future work.
A scalable, distributed and dynamic workflow system for digitization processes BIBAFull-Text 359-362
  Hendrik Schöneberg; Hans-Günter Schmidt; Winfried Höhn
Creating digital representations of ancient manuscripts, prints and maps is a challenging task due to the sources' fragile and heterogeneous nature. Digitization requires a very specialized set of scanning hardware in order to cover the sources' diversity. The central task is obtaining the maximum reproduction quality while minimizing the error rate, which is difficult to achieve due to the large amounts of image data resulting from digitization, putting huge computational loads on image processing modules, error-detection and information retrieval heuristics. As digital copies initially do not contain any information about their sources' semantics, additional efforts have to be made to extract semantic metadata. This is an error-prone, time-consuming manual process, which calls for automated mechanisms to support the user. This paper introduces a decentralized, event-driven workflow system designed to overcome the above mentioned challenges. It leverages dynamic routing between workflow components, thus being able to quickly adapt to the sources' unique requirements. It provides a scalable approach that smooths out high computational loads on single units by using distributed computing, and provides modules for automated image pre-/post-processing, error-detection heuristics, data mining, semantic analysis, metadata augmentation, quality assurance, and export to established publishing platforms or long-term storage facilities.
Domain-specific image geocoding: a case study on Virginia tech building photos BIBAFull-Text 363-366
  Lin Tzy Li; Otávio A. B. Penatti; Edward A. Fox; Ricardo da S. Torres
The use of map-based browser services is of great relevance in numerous digital libraries. The implementation of such services, however, demands the use of geocoded data collections. This paper investigates the use of image content local representations in geocoding tasks. Performed experiments demonstrate that some of the evaluated descriptors yield effective results in the task of geocoding VT building photos. This study is the first step to geocode multimedia material related to the VT April 16, 2007 school shooting tragedy.

Posters

A classification scheme for algorithm citation function in scholarly works BIBAFull-Text 367-368
  Suppawong Tuarob; Prasenjit Mitra; C. Lee Giles
Algorithms are ubiquitous in the computer science literature. A search engine for algorithms has been tested as part of the CiteseerX suite; however, it only retrieves algorithms whose metadata is textually matched with the search query. Such a limitation occurs because a traditional search engine does not have the ability to understand what algorithms are and how they work. Here, we present an initial effort in understanding the semantics of algorithms. Specifically, we identify how an existing algorithm can be used in scholarly works and propose a classification scheme for algorithm function.
A figure search engine architecture for a chemistry digital library BIBAFull-Text 369-370
  Sagnik Ray Choudhury; Suppawong Tuarob; Prasenjit Mitra; Lior Rokach; Andi Kirk; Silvia Szep; Donald Pellegrino; Sue Jones; Clyde Lee Giles
Academic papers contain multiple figures representing important findings and experimental results; we present a search engine specifically focused on figures in academic documents. This search engine allows users to search figures in approximately 150,000 chemistry journal articles, though the method is easily extendable to other domains. Our system indexes figure captions and mentions extracted from PDF documents using a custom-built extractor. Recall and precision of figure extraction are in the 80 to 90% range. We give the framework for the extraction algorithm, the architecture, and the ranking function.
A memento web browser for iOS BIBAFull-Text 371-372
  Heather Tweedy; Frank McCown; Michael L. Nelson
The Memento framework allows web browsers to request and view archived web pages in a transparent fashion. However, Memento is still in the early stages of adoption, and browser-plugins are often required to enable Memento support. We report on a new iOS app called the Memento Browser, a web browser that supports Memento and gives iPhone and iPad users transparent access to the world's largest web archives.
A retrospective review on a decade of building a national science digital library to transform STEM education BIBAFull-Text 373-374
  Sarah Giersch; Flora McMartin
Since 2000, the National Science Foundation's NSDL program has made many direct contributions to digital library research and STEM education. Originally called the National STEM Digital Library (and now National STEM Distributed Learning), the program catalyzed significant technology developments and served to advance state-of-the-art teaching and learning practices during a period of dramatic technological change. This poster describes the results of a three-day writing workshop, convened in April 2012, which generated a retrospective report and a series of interviews on the NSDL building process.
A roadmap for data services BIBAFull-Text 375-376
  Inna Kouper; Katherine G. Akers; Natsuko H. Nicholls; Fe C. Sferdean
This poster describes our experiences as four CLIR/DLF postdoctoral fellows in developing data services at our respective universities. We report on our particular activities and achievements, which we synthesize into a common framework that can guide the development of data services at other academic institutions. The analysis of our experiences suggests the necessity of stronger cooperation of units within universities as well as increased and more diverse collaborations among universities.
ArcLink: optimization techniques to build and retrieve the temporal web graph BIBAFull-Text 377-378
  Ahmed AlSum; Michael L. Nelson
We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction, storage, and access to the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current web archive interfaces to return content and structural metadata for each URI.
Checking out: customizing and downloading complex and compound digital library resources BIBFull-Text 379-380
  Scott Britell; Lois Delcambre
CSSeer: an expert recommendation system based on CiteseerX BIBAFull-Text 381-382
  Hung-Hsuan Chen; Pucktada Treeratpituk; Prasenjit Mitra; C. Lee Giles
We propose CSSeer, a free and publicly available keyphrase-based recommendation system for expert discovery, built on the CiteSeerX digital library with Wikipedia as an auxiliary resource. CSSeer generates keyphrases from the title and abstract of each document in CiteSeerX; these keyphrases are then used to infer the authors' expertise. We compared CSSeer with two other state-of-the-art expert recommenders and found that the three systems produce moderately divergent recommendations on 20 benchmark queries. We therefore recommend that users consult several different recommenders to obtain a more complete expert list.
Environmental studies faculty attitudes towards sharing of research data BIBAFull-Text 383-384
  Nathan F. Hall
This research explores the attitudes of Environmental Studies faculty towards sharing research data. The findings are drawn from a broader unpublished study in progress on information behavior of Environmental Studies faculty in e-science and scholarly communications. The author conducted fourteen semi-structured interviews with tenure-track and tenured faculty in various environmental studies and earth science disciplines at two large state universities. Early findings and areas for further analysis are described.
Evaluation of header metadata extraction approaches and tools for scientific PDF documents BIBAFull-Text 385-386
  Mario Lipinski; Kevin Yao; Corinna Breitinger; Joeran Beel; Bela Gipp
This paper evaluates the performance of tools for the extraction of metadata from scientific articles. Accurate metadata extraction is an important task for automating the management of digital libraries. This comparative study is a guide for developers looking to integrate the most suitable and effective metadata extraction tool into their software. We shed light on the strengths and weaknesses of seven tools in common use. In our evaluation using papers from the arXiv collection, GROBID delivered the best results, followed by Mendeley Desktop. SciPlore Xtract, PDFMeat, and SVMHeaderParse also delivered good results depending on the metadata type to be extracted.
Exploring the usability of folksonomies in the online art museum community BIBAFull-Text 387-388
  Crystal N. Boston-Clay; Malika Mahoui; Kyle Jaebker
This paper presents a usability evaluation of the Indianapolis Museum of Art website -- as a typical art museum website supporting both tag-based search and user tagging of artwork -- in an effort to explore how users access artwork while interacting with the museum online search and retrieval system. The usability study examined the extent of usage of Steve Tagger capabilities (annotation and use of tags in the process of searching/accessing artwork resources) deployed on the website. The usability test results showed that 55% of the users were able to successfully locate information on the website using both traditional searching techniques and folksonomies. However, only 34% of the users were able to successfully locate artwork using tags only. On the other hand, 95% of the participants were able to annotate an object by adding a term or tag to describe the artwork.
Flickr feedback framework: a service model for leveraging user interactions BIBAFull-Text 389-390
  Jacob Jett; Megan Senseney; Carole L. Palmer
It has been well documented that cultural heritage institutions can enhance their metadata by sharing content through popular web services such as Flickr. Through the Flickr Feasibility Study, the IMLS Digital Collections and Content project examined how an aggregation service can facilitate participation of cultural heritage institutions in popular web services. This poster presents a proposed feedback framework through which an aggregation service can facilitate and increase the impact of Web user interactions with shared cultural heritage collections through direct metadata enhancement and user analysis.
Formal foundations for systematic digital library generation BIBAFull-Text 391-392
  Jonathan P. Leidig; Edward A. Fox
Many digital library design, development, and deployment processes are not based on systematic generation activities. The utilization of generation processes enables the precise definition of digital libraries, identification of existing software components, co-generation of multiple digital libraries, and evaluation of a digital library's coverage and completeness. The foundation of a digital library generation process is in the formal framework in which it is described. Two notable formal frameworks have previously been proposed for describing digital libraries and their content, architecture, functionality, and related societies. These two frameworks are merged in this effort to provide the foundation for a generation framework in support of an emerging class of scientific digital libraries.
Full-text and topic based authorrank and enhanced publication ranking BIBAFull-Text 393-394
  Jinsong Zhang; Xiaozhong Liu
The idea behind AuthorRank is that content created by more popular authors should rank higher than content created by less popular authors. This paper applies this idea to the analysis of scientific publications to test whether an optimized topical AuthorRank can replace or enhance topical PageRank for publication ranking. First, the PageRank with Priors (PRP) algorithm was employed to rank topic-based publications and authors. Second, the first author's reputation was used to generate an AuthorRank score. A linear combination of topical AuthorRank and PageRank was then compared with several baselines. As our evaluation results show, topical AuthorRank combined with topic-based PageRank outperforms the other baselines for publication ranking.
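The PageRank with Priors computation mentioned in the abstract can be sketched in a few lines; the toy citation graph, prior vector, and damping factor below are illustrative assumptions, not the authors' data or implementation:

```python
def pagerank_with_priors(links, prior, beta=0.85, iters=50):
    """Power iteration for PageRank with Priors: the random surfer
    teleports according to a topic-specific prior vector instead of
    the uniform distribution."""
    nodes = list(prior)
    rank = dict(prior)  # start from the prior
    for _ in range(iters):
        nxt = {n: (1 - beta) * prior[n] for n in nodes}
        for src, outs in links.items():
            if outs:  # distribute rank along out-links
                share = beta * rank[src] / len(outs)
                for dst in outs:
                    nxt[dst] += share
            else:     # dangling node: return its mass via the prior
                for n in nodes:
                    nxt[n] += beta * rank[src] * prior[n]
        rank = nxt
    return rank

# toy citation graph: A and B cite C; C cites nothing
links = {"A": ["C"], "B": ["C"], "C": []}
prior = {"A": 0.5, "B": 0.5, "C": 0.0}  # topic prior favours A and B
scores = pagerank_with_priors(links, prior)
```

Replacing the uniform teleport distribution with a topic-specific prior is what makes the ranking "topical": nodes favoured by the prior, and nodes they cite, accumulate more rank.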
HathiTrust research center: computational access for digital humanities and beyond BIBAFull-Text 395-396
  Beth Plale; Robert McDonald; Yiming Sun; Inna Kouper; Ryan Cobine; J. Stephen Downie; Beth Sandore Namachchivaya; John Unsworth
Academic libraries are increasingly looking to provide services that allow their users to work with digital collections in innovative ways, for example, to analyze large volumes of digitized collections. The HathiTrust Research Center (HTRC) is a large collaborative that provides an innovative research infrastructure for dealing with massive amounts of digital texts. In this poster, we report on the technical progress of the HTRC as well as on the efforts to build a user community around our cyberinfrastructure.
How do users' search tactic selections influence search outputs in exploratory search tasks? BIBAFull-Text 397-398
  Soohyung Joo; Iris Xie
This study investigates the relationships between users' search tactic selections and search outputs while conducting exploratory searches in digital libraries. Frequencies of different types of search tactics applied in an exploratory search task were counted. Based on correlation analysis and multiple regression analysis, we identified types of search tactics that are associated with aspectual recall. Preliminary results indicate that browsing and evaluating item tactics affect aspectual recall in exploratory search tasks.
Information visualization of nuclear decay chain libraries BIBAFull-Text 399-400
  Electra Sutton; Charles Wang; David Weisz; Fredric Gey; Ray R. Larson
This poster presents multiple information visualization techniques for scientific visualization of the nuclear isotope decay process, including (but not limited to) circle packing and directed graphs. The practical goal of this visualization process is to support nuclear forensics, the identification of the origin of intercepted smuggled nuclear materials.
Institutional structures for research data and metadata curation BIBFull-Text 401-402
  Matthew S. Mayernik
Investigating influential factors influencing users to share news in social media: a diffusion of innovations perspective BIBAFull-Text 403-404
  Long Ma; Chei Sian Lee; Dion Hoe-Lian Goh
This study aims to investigate the factors influencing news sharing in social media. Drawing from the diffusion of innovations theory (DOI), the influential factors identified are opinion leadership, homophily, tie strength, and news attributes. Combining social network analysis with multiple regression analysis, our results indicate that opinion leadership was the strongest predictor of users' news sharing, followed by news attributes and tie strength. Unexpectedly, we also found that homophily hampered news sharing in social media. Implications are discussed.
Mapping the intersection of science and philosophy BIBAFull-Text 405-406
  Jaimie Murdock; Robert Light; Colin Allen; Katy Börner
This poster presents what we believe to be the first attempt to empirically measure and visualize the cross-pollination of science and philosophy through citation patterns. Using the Stanford Encyclopedia of Philosophy as a proxy for the philosophical literature, we plot SEP citations onto the UCSD Map of Science to highlight areas of science which overlap with philosophical discussion. An outline of further studies is also discussed.
Modeling search assistance mechanisms within web-scale discovery systems BIBAFull-Text 407-408
  William H. Mischo; Mary C. Schlembach; Michael A. Norman
The University of Illinois Library has been conducting transaction log analyses to model user search behaviors within our Library gateway. These analyses have informed the development and implementation of various search assistance mechanisms designed to facilitate search strategy modification and enhance user search navigation methods. This paper discusses the efforts to effectively overlay search assistance mechanisms into the web-scale discovery system environment. These search assistance mechanisms seek to meet user search needs and to address known issues with web-scale systems. This paper describes an evolving search assistance model being deployed in a web-scale environment and reports our findings from transaction log studies and user surveys.
MOOD-lighting: massive open online discovery using solr and blacklight BIBAFull-Text 409-410
  Juliet L. Hardesty; Courtney Greene
In this poster, we present findings from the user experience and metadata perspectives of using Blacklight and Solr to combine large and distinct resource sets. We also share results of a survey of academic and educational institutions on their approaches to Solr indexing and end-user options and identify next steps toward articulating best practices for user experience around discovery in the context of metadata.
OmniMea: an approach to improved content recruitment for institutional repositories BIBAFull-Text 411-412
  James C. French; Allison L. Powell
A common complaint of providers of institutional repository services is that they have low utilization by their intended user base. The reasons are varied, but two are germane to our project: lack of individual incentives among potentially participating faculty; and a perceived high barrier to entry. The OmniMea project adopts a user-centric focus by advocating for personal repositories as a more appealing concept and by directing relevant intellectual output placed in these personal repositories to the institutional repository for long-term curation. While our approach was explicitly aimed at capturing long-tail data, it has turned out to be more generally applicable.
Pretty as a pixel: issues and challenges in developing a controlled vocabulary for video game visual styles BIBAFull-Text 413-414
  Andy Donovan; Hyerim Cho; Chris Magnifico; Jin Ha Lee
Despite the increase in interest in video games across commercial and academic areas, organizational systems for classifying them remain inadequate, particularly in describing the visual styles of video games. Because video games are by and large a visual medium, the ability to describe their visual "look" coherently and consistently greatly contributes to their discovery through classification. A set of controlled terms would be instrumental in complementing game recommendation engines and search applications in digital libraries to meet users' content-related information needs. In our study we examine the academic and user-generated content about video games' visual styles in order to extract potentially useful controlled vocabulary terms. These terms are then organized into facets and arranged into a classified schedule. In this poster, we discuss the challenges in our controlled vocabulary term definitions and their application.
Providing context for digital library content BIBAFull-Text 415-416
  Pamela Campbell; Katrina Stierholz
In 2004, the Federal Reserve Bank of St. Louis created FRASER (Federal Reserve Archival System for Economic Research, http://fraser.stlouisfed.org), a digital library of economic, financial, and banking data and policy documents. The history of American economic policy is documented in these publications, records, and archival materials. It has become evident that FRASER's growing user base needs additional context to improve users' ability to navigate, select, and understand the continually growing content. We have taken three approaches to improve the accessibility of our content: changes to the database, integration with other web databases, and addition of material to support teaching activities. We hope that these changes will broaden our audience and increase site traffic.
Publishing earthquake engineering research data BIBAFull-Text 417-418
  Stanislav Pejsa; Cheng Song
Earthquake engineering brings together researchers from seismology, structural, mechanical, and geotechnical engineering whose research results in saving lives and protecting property during earthquakes and tsunamis. Such diversity poses unique challenges for data management, data archiving, preservation, and data publication. The poster demonstrates innovative approaches to curation, visualization, and publishing of earthquake engineering research data in the NEEShub, a collaborative platform that provides a combined virtual research environment and data repository to researchers participating in the Network for Earthquake Engineering Simulations (NEES) and to the earthquake engineering community in general. The poster provides graphical depictions demonstrating the curation workflows established in NEES, the progression of data from unprocessed sensor measurements to datasets that can be analyzed by a variety of analytical and visualization tools, and finally their transformation into a citable published product. It documents ways in which NEEShub exposes research data and facilitates collaboration and sharing, as well as re-use and repurposing of the datasets. Furthermore, the poster illustrates some of the successes of the NEEShub in its four years of existence, including continuous growth in uploaded files, users, number of downloaded files, curated projects, and published datasets.
Recovering missing citations in a scholarly network: a 2-step citation analysis to estimate publication importance BIBAFull-Text 419-420
  Zhuoren Jiang; Xiaozhong Liu
Citation relationships between publications are important for assessing the importance of scholarly components (e.g., authors, publications, and venues) within a network. Missing citation metadata in scholarly databases, however, creates problems for classical citation-based ranking algorithms. In the ACM database, for example, 18.5% of publications don't have citation metadata. In this research we propose an innovative, 2-step method of citation analysis, to investigate the importance of publications for which citation data is missing. Preliminary evaluation results show that this method can effectively uncover the importance of publications without using citation metadata.
Semi-automated rediscovery of lost YouTube music videos BIBAFull-Text 421-422
  Daniel Sebastian; Frank McCown; Michael L. Nelson
Users frequently post popular material to YouTube, and in response, others link to these videos from social media, blogs, forums, and email. However, this content may be removed for numerous reasons, only to resurface again at another URL. This continuous movement and breaking of the web graph makes it difficult for users to relocate content that has moved within YouTube. We present Volitrax, an add-on for Firefox which redirects users to YouTube music videos that have moved to a different URL within YouTube. Volitrax acts as an intermediary that corrects the web graph transparently so YouTube links continue to work even after the content has changed locations.
Streamlining user interaction in tag-based conversational navigation of knowledge resource libraries BIBAFull-Text 423-424
  Jinyue Xia; David C. Wilson
This paper presents an approach for helping users more quickly discover relevant information resources in a tag-based system, where each resource is associated with a number of descriptive metadata tags. Our approach builds an adaptive conversational decision-tree structure to minimize the number of interactive cues required to help a user navigate to resources of interest. Initial experiments demonstrate the potential of the approach, with shallower decision trees supporting better overall interaction performance.
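One plausible reading of such a conversational decision tree is sketched below; the even-split question-selection criterion and the toy tag library are assumptions for illustration, not necessarily the authors' method:

```python
def best_question(resources, asked):
    """Pick the tag whose yes/no answer splits the remaining
    resource set most evenly, keeping the decision tree shallow."""
    tags = {t for ts in resources.values() for t in ts} - asked
    def imbalance(tag):
        yes = sum(1 for ts in resources.values() if tag in ts)
        return abs(2 * yes - len(resources))  # 0 = perfect 50/50 split
    return min(tags, key=imbalance) if tags else None

def navigate(resources, answer):
    """Narrow the candidate set, one question at a time, until a
    single resource remains; `answer(tag)` simulates the user."""
    asked = set()
    while len(resources) > 1:
        tag = best_question(resources, asked)
        if tag is None:
            break
        asked.add(tag)
        has = answer(tag)
        resources = {r: ts for r, ts in resources.items()
                     if (tag in ts) == has}
    return resources

# toy library: four resources described by metadata tags
library = {
    "r1": {"python", "tutorial"},
    "r2": {"python", "reference"},
    "r3": {"java", "tutorial"},
    "r4": {"java", "reference"},
}
target = library["r2"]
found = navigate(library, lambda tag: tag in target)
```

With four resources and well-chosen questions, two yes/no cues suffice here; a greedy even-split criterion like this is one standard way to keep the expected number of interactions low.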
Studying the data practices of a scientific community BIBAFull-Text 425-426
  Besiki Stvilia; Charles C. Hinnant; Shuheng Wu; Adam Worrall; Dong Joon Lee; Kathleen Burnett; Gary Burnett; Michelle M. Kazmer; Paul F. Marty
To be effective and at the same time sustainable, a community data curation model has to be aligned with the community's current work organization: practices and activities; divisions of labor; data and collaborative relationships; and the community's value structure, norms, and conventions for data, quality assessment, and data sharing. This poster discusses a framework for developing a community data curation model, using a case of the scientific community gathered around the National High Magnetic Field Laboratory, a large national lab. The poster also reports findings of preliminary research based on semi-structured interviews with a sample of the main stakeholder groups of the community.
The new ACM CCS and a computing ontology BIBAFull-Text 427-428
  Lillian N. Cassel; Sudhamsha Palivela; Srikanth Marepalli; AhiMahidhara Padyala; Rahul Deep; Siddhartha Terala
This poster presents an overview of the new ACM Computing Classification system and compares it with work done in creating an ontology of all computing-related topics. There are similarities and differences and the differences lead to conclusions about both approaches.
The Nuestra Iowa project: creating a digital collection as a tool for history education BIBAFull-Text 429-430
  Audrey Altman; Kelly Thompson; Haowei Hsieh
This poster describes the progress of a research project exploring how public digital publishing affects undergraduate research and learning. Participants are students in "Latina/o Immigration", an undergraduate-level history course at the University of Iowa. Students use a custom web interface to create a digital exhibit about the history of Latino/as in Iowa, using multimedia primary-source materials from the Iowa Women's Archives (IWA). Students also use the tool to learn the concepts related to metadata and digital libraries.
The open parks network BIBAFull-Text 431-432
  Yongyang Yu; Michael Witt; Mohamed Saber Abdelfattah; Christopher Vinson; Scott Hammel
The goal of the Open Parks Network (OPN) is to create a portal that connects park managers, researchers, policy makers, and citizens to each other and to valuable related cultural resources. Led by Clemson University in collaboration with the National Park Service and Purdue University, OPN is designed to provide a virtual community of professionals in parks and protected areas with the tools, resources, and knowledge base they need to conduct intensive research, perform their job duties effectively, and share information with colleagues and users on an international scale. To date, 80,000 of 200,000 archival images and 500,000 out of 2 million bound pages have been digitized from various parks, and a beta version of the platform will be available for demonstration in March 2013. The project includes the integration of a Fedora repository with the Joomla! content management system, with extensions to enable GIS functionality, expose metadata as Linked Data and for harvest using OAI-PMH, and enable users to create their own custom collections.
TheAdvisor: a webservice for academic recommendation BIBAFull-Text 433-434
  Onur Küçüktunç; Erik Saule; Kamer Kaya; Ümit V. Çatalyürek
The academic community has published millions of research papers to date, and the number of new papers is increasing over time. To discover new research, researchers typically rely on manual methods such as keyword-based search, reading conference proceedings, browsing publication lists of known experts, or checking the references of the papers they are interested in. Existing tools for literature search are suitable for a first-level bibliographic search. However, they do not allow complex second-level searches. In this paper, we present a web service called TheAdvisor (http://theadvisor.osu.edu) which helps users build a strong bibliography by extending the document set obtained after a first-level search. The service makes use of the citation graph for recommendation. It also features diversification, relevance feedback, graphical visualization, and venue and reviewer recommendation. In this work, we explain the design criteria and rationale we employed to make TheAdvisor a useful and scalable web service, along with a thorough experimental evaluation.
User interface evaluation of meta-indexes for search BIBAFull-Text 435-436
  Michael Huggett; Edie Rasmussen
In the Indexer's Legacy Project, we have created meta-indexes for domain-oriented collections of digital books in order to promote searching, navigation, and browsing in digital collections. Because the meta-index is a new knowledge structure, we have used focus groups and sample tasks to collect information on users' perception and use of meta-indexes. Users' responses were positive, and their suggestions led to improvements in our online Meta-Dex User Interface (MUI) tool, which will be tested in subsequent user studies.
Using google analytics to explore ETDs use BIBAFull-Text 437-438
  Midge Coates
This poster presents preliminary Google Analytics usage data for a collection of electronic theses and dissertations (ETDs). Correlation of page views with page type, user location, and source (referring link) shows that, during the study period, most in-state users found the collection via internal sources (University links) and viewed mostly home and navigation pages, while most out-of-state users found the collection via external sources (search engines, databases) and viewed mostly bibliographic information pages. Nearly all of those who viewed actual ETDs were out-of-state "direct" users who may have bookmarked the collection during a previous visit.
The SEAD DataNet prototype: data preservation services for sustainability science BIBAFull-Text 439-440
  Beth Plale; Robert H. McDonald; Kavitha Chandrasekar; Inna Kouper; Robert Light; Stacy R. Konkiel; Margaret Hedstrom; James Myers; Praveen Kumar
In this poster we present the SEAD project [1] and its prototype software, describe how SEAD approaches long-term data preservation and access through multiple partnerships, and show how it supports sustainability science researchers in their data management, analysis, and archival needs. SEAD's initial prototype system is currently being tested by ingesting datasets from the National Center for Earth Surface Dynamics (1.6 terabytes of data containing over 450,000 files) [2] and packaging them for transmission to long-term archival storage.


Demonstrations

CORE: aggregation use cases for open access BIBAFull-Text 441-442
  Petr Knoth; Zdenek Zdrahal
The push for free online availability of research outputs promoted by the Open Access (OA) movement is undoubtedly transforming the publishing industry. However, the mere availability of research outputs is insufficient. To exploit the full potential of OA, it must be possible to search, discover, mine, and analyse this content. To achieve this, it is essential to improve the existing OA technical infrastructure to effectively support these functionalities. Many of the vital benefits of OA are expected to come with the ability to reuse OA content in unanticipated ways. Access to the OA content must therefore be flexible yet practical, and content-based, not just metadata-based. In this demonstration, we present the CORE system, which aggregates millions of OA resources from hundreds of OA repositories and journals. We discuss the use cases aggregations should support and demonstrate how the CORE system addresses them, including searching, discovering, mining, and analysing content. We also show how aggregated OA content can be reused to build new applications on top of CORE's functionality.
Docear's PDF inspector: title extraction from PDF files BIBAFull-Text 443-444
  Joeran Beel; Stefan Langer; Marcel Genzmehr; Christoph Müller
In this demo paper we present Docear's PDF Inspector (DPI). DPI extracts titles from academic PDF files by applying a simple heuristic: the largest text on the first page of a PDF is assumed to be the title. This simple heuristic achieves accuracies of around 70% and outperforms the tools ParsCit and SciPlore Xtract in both run-time and accuracy. In addition, DPI is released under the free open source license GPL 2+ at http://www.docear.org, is written in Java, and runs on any major operating system.
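The largest-text heuristic is simple enough to sketch. The sketch below assumes the (text, font-size) spans of page one have already been extracted by a PDF parser; the span format and example values are hypothetical, not DPI's actual internals:

```python
def extract_title(spans):
    """DPI-style heuristic: treat the text set in the largest font
    on the first page as the title."""
    if not spans:
        return None
    max_size = max(size for _, size in spans)
    # join all largest-font spans, e.g. a title wrapped over two lines
    parts = [text for text, size in spans if size == max_size]
    return " ".join(parts).strip()

# spans as (text, font_size), in reading order, from page 1
page_one = [
    ("Journal of Examples, Vol. 1", 9.0),
    ("A Study of Title", 18.0),
    ("Extraction Heuristics", 18.0),
    ("Jane Doe and John Smith", 11.0),
    ("Abstract. We study...", 10.0),
]
title = extract_title(page_one)
# -> "A Study of Title Extraction Heuristics"
```

The roughly 70% accuracy reported above is plausible for such a rule: it fails on papers where, say, the journal banner or a section heading is set larger than the title.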
Docear4Word: reference management for Microsoft word based on BibTeX and the citation style language (CSL) BIBAFull-Text 445-446
  Joeran Beel; Marcel Genzmehr; Stefan Langer
In this demo-paper we introduce Docear4Word which enables researchers to insert and format their references and bibliographies in Microsoft Word. Docear4Word is based on BibTeX and the Citation Style Language (CSL), features over 1,700 citation styles (Harvard, IEEE, ACM, etc.), is published as open source tool on http://docear.org, and runs with Microsoft Word 2002 (and later) on Windows XP (and later). Docear4Word is similar to the MS-Word add-ons that reference managers like Endnote, Zotero, or Citavi offer with the difference that it is being developed to work with the de-facto standard BibTeX and hence to work with almost any reference manager.
FishTraits version 2: integrating ecological, biogeographic and bibliographic information BIBAFull-Text 447-448
  Zhiwu Xie; Emmanual A. Frimpong; Sunshin Lee
In this paper we describe the new development of FishTraits. Originating from an ecological database that documents and consolidates more than 100 traits for 809 fish species, the new version focuses on the integration of these traits data with the bibliographic and biogeographic information. We explain the overall design as well as the implementation details.
Greenbug: a hybrid web-inspector, debugger and design editor for greenstone BIBAFull-Text 449-450
  David Bainbridge; Sam J. McIntosh; David M. Nichols
In this paper we present Greenbug: a hybrid web inspector, debugger, and design editor developed for use with the open source digital library software Greenstone 3. Inspired by the web development tool Firebug, Greenbug is more tightly coupled with the underlying (digital library) server than Firebug is; for example, Greenbug has fine-grained knowledge of the connection between the underlying file system and the rendered web content, and also provides the ability to commit any changes made through the web interface back to the underlying file system. Moreover, because web page production in Greenstone 3 is the result of an XSLT processing pipeline, the necessarily well-formed hierarchical XML content can be manipulated into a graphical representation, which can then be edited directly through a visual interface supplied by Greenbug. We showcase the interface in use, provide a brief overview of implementation details, and conclude with a discussion of how the approach can be adapted to other XSLT transformation-based content management systems, such as DSpace.
IssueLab: the social sector's digital library BIBAFull-Text 451-452
  Lisa Brooks; Gabriela Fitz
In this paper, we describe IssueLab, a digital library and distribution network for social sector publications and resources.
The apollo archive explorer BIBAFull-Text 453-454
  Douglas W. Oard; Joseph Malionek
A system for exploring the rich recorded legacy of the Apollo missions to the Moon, using the event structure of each mission as an organizing principle, will be demonstrated. A scalable implementation is achieved by automating temporal, spatial, and topical content alignment across diverse media. Multiple access points are supported, including event-based access through the flight plan, time-based access using event timelines, and content-based access using information retrieval techniques.
The avalon media system: a platform for access-controlled delivery of time-based media BIBAFull-Text 455-456
  Jon W. Dunn; Stuart L. Baker
This demonstration will show version 1.0 of the Avalon Media System, an open source system being developed by Indiana University and Northwestern University to allow libraries and archives to provide online access to audio and video collections.
The digital atlas of American religion BIBAFull-Text 457-458
  Sharon M. Kandris; Neil Devadasan; Malika Mahoui; David J. Bodenhamer
In this demonstration-paper we introduce DAAR, the Digital Atlas of American Religion (www.religionatlas.org). The DAAR is a web-based research platform with innovative data exploration and visualization tools to support research in the humanities. Using a user-centered design approach, we incorporated historic religion data on adherence, membership, and congregations with historic census data and new religion typologies as the test-bed for the tools that we developed to establish the technology infrastructure and framework that can be used for other humanities data.
Introducing Docear's research paper recommender system BIBAFull-Text 459-460
  Joeran Beel; Stefan Langer; Marcel Genzmehr; Andreas Nürnberger
In this demo paper we present Docear's research paper recommender system. Docear is an academic literature suite to search, organize, and create research articles. The users' data (papers, references, annotations, etc.) is managed in mind maps and these mind maps are utilized for the recommendations. Using content-based filtering methods, Docear's recommender achieves click-through rates around 6%, in some scenarios even over 10%.