
Proceedings of the 2013 ACM Symposium on Document Engineering

Fullname: Proceedings of the 2013 ACM Symposium on Document Engineering
Editors: Simone Marinai; Kim Marriott
Location: Florence, Italy
Dates: 2013-Sep-10 to 2013-Sep-13
Publisher: ACM
Standard No: ISBN 978-1-4503-1789-4; ACM DL: Table of Contents; hcibib: DocEng13
Papers: 55
Pages: 286
Links: Conference Website
  1. Keynote address
  2. Digital humanities
  3. Dealing with multiple versions
  4. Search & sense making
  5. Architecture & processes
  6. Document recognition & analysis I
  7. Document layout & presentation generation I
  8. Document recognition & analysis II
  9. Metadata & annotation
  10. Multimedia I
  11. Posters & demonstrations
  12. Document layout & presentation generation II
  13. Multimedia II
  14. Workshops

Keynote address

Symbolic machine learning methods for historical document processing BIBKFull-Text 1-2
  Floriana Esposito
Keywords: Concept learning methods; Inductive Logic Programming; Incremental Learning; Semantic Processing

Digital humanities

Revisiting a summer vacation: digital restoration and typesetter forensics BIBAFull-Text 3-12
  Steven R. Bagley; David F. Brailsford; Brian W. Kernighan
In 1979 the Computing Science Research Center ('Center 127') at Bell Laboratories bought a Linotron 202 typesetter from the Mergenthaler company. This was a 'third generation' digital machine that used a CRT to image characters onto photographic paper. The intent was to use existing Linotype fonts and also to develop new ones to exploit the 202's line-drawing capabilities.
   Use of the 202 was hindered by Mergenthaler's refusal to reveal the inner structure and encoding mechanisms of the font files. This particular 202 was further dogged by extreme hardware and software unreliability.
   A memorandum describing the experience was written in early 1980 but was deemed to be too "sensitive" to release. The original troff input for the memorandum exists and now, more than 30 years later, the memorandum can be released. However, the only available record of its visual appearance was a poor-quality scanned photocopy of the original printed version.
   This paper details our efforts in rebuilding a faithful retypeset replica of the original memorandum, given that the Linotron 202 disappeared long ago, and that this episode at Bell Labs occurred 5 years before the dawn of PostScript (and later PDF) as de facto standards for digital document preservation.
   The paper concludes with some lessons for digital archiving policy drawn from this rebuilding exercise.
Interacting with digital cultural heritage collections via annotations: the CULTURA approach BIBAFull-Text 13-22
  Maristella Agosti; Owen Conlan; Nicola Ferro; Cormac Hampson; Gary Munnelly
This paper introduces the main characteristics of the digital cultural collections that constitute the use cases presently in use in the CULTURA environment. A section on related work follows, giving an account of pertinent efforts on the management of digital annotations that have been considered. Afterwards, the innovative annotation features of the CULTURA portal for digital humanities are described; those features are aimed at improving the interaction of non-specialist users and the general public with digital cultural heritage content. The annotation functions consist of two modules: the FAST annotation service as back-end and the CAT Web front-end integrated in the CULTURA portal. The annotation features have been, and are being, tested with different types of users, and useful feedback is being collated, with the overall aim of generalising the approach to diverse document collections beyond the area of cultural heritage.
Early modern OCR project (eMOP) at Texas A&M University: using Aletheia to train Tesseract BIBAFull-Text 23-26
  Katayoun Torabi; Jessica Durgan; Bryan Tarpley
Great effort is being made to collect and preserve historic manuscripts from the early modern and eighteenth-century periods; unfortunately, searching the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) collections can be extremely difficult for researchers because current Optical Character Recognition (OCR) engines struggle to read and recognize various historic fonts, especially in manuscripts of declining quality. To address this problem, the Early Modern OCR Project (eMOP) at the Initiative for the Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University seeks to train OCR engines to read historic documents more effectively in order to make the entirety of these collections accessible to searching. The first step in this project uses the Aletheia Desktop Tool, developed by the PRImA Research Lab at the University of Salford, to turn documents from the EEBO and ECCO collections into training sets that aid OCR engines, such as Google's Tesseract, in recognizing special characters such as ligatures, italics, and blackletter found within early modern fonts. In the year that the Aletheia team has been working to create these font training libraries, we have overcome several problems, including learning how to select, extract, and deliver the data that best suits Tesseract's training requirements. This work with Aletheia is part of a larger scholarly project that endeavors not only to make the EEBO and ECCO collections more accessible for data mining by researchers, but also to make available to the public the methodologies, workflow, and digital tools developed during the eMOP project, in order to aid libraries, museums, and scholars in other fields in their efforts to preserve and study our combined cultural history.

Dealing with multiple versions

Uncertain version control in open collaborative editing of tree-structured documents BIBAFull-Text 27-36
  M. Lamine Ba; Talel Abdessalem; Pierre Senellart
In order to ease content enrichment, exchange, and sharing, web-scale collaborative platforms such as Wikipedia or Google Docs enable unbounded interactions between a large number of contributors, without prior knowledge of their level of expertise and reliability. Version control is then essential for keeping track of the evolution of the shared content and its provenance. In such environments, uncertainty is ubiquitous due to the unreliability of the sources, the incompleteness and imprecision of the contributions, the possibility of malicious editing and vandalism acts, etc. To handle this uncertainty, we use a probabilistic XML model as a basic component of our version control framework. Each version of a shared document is represented by an XML tree, and the whole document, together with its different versions, is modeled as a probabilistic XML document. Uncertainty is evaluated using the probabilistic model and the reliability measure associated with each source, each contributor, or each editing event, resulting in an uncertainty measure on each version and each part of the document. We show that standard version control operations can be implemented directly as operations on the probabilistic XML model; efficiency with respect to deterministic version control systems is demonstrated on real-world datasets.
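   As an illustration of the idea (not the authors' implementation), the following Python sketch shows how a node of a probabilistic XML tree might record the editing events it depends on, with its probability derived from assumed contributor reliabilities; all names and numbers are hypothetical.

```python
# Illustrative sketch: a node in a probabilistic XML tree whose presence
# depends on editing events, each weighted by the (assumed) reliability of
# the contributor who performed it.
from dataclasses import dataclass, field

@dataclass
class PNode:
    label: str
    events: list = field(default_factory=list)    # event ids this node depends on
    children: list = field(default_factory=list)

def node_probability(node, reliability):
    """Probability that the node is present, assuming independent events."""
    p = 1.0
    for e in node.events:
        p *= reliability[e]          # reliability of the contributor/event
    return p

# hypothetical reliabilities of two editing events
reliability = {"rev1": 0.9, "rev2": 0.6}
title = PNode("title", events=["rev1"])
claim = PNode("sentence", events=["rev1", "rev2"])
doc = PNode("article", children=[title, claim])

print(node_probability(claim, reliability))   # approximately 0.54 = 0.9 * 0.6
```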
LSEQ: an adaptive structure for sequences in distributed collaborative editing BIBAFull-Text 37-46
  Brice Nédelec; Pascal Molli; Achour Mostefaoui; Emmanuel Desmontils
Distributed collaborative editing systems allow users to work together while distributed in time, space and across organizations. Distributed collaborative editors such as Google Docs, Etherpad or Git have grown in popularity over the years. A new kind of distributed editor has appeared recently, based on a family of distributed data structures replicated on several sites, called Conflict-free Replicated Data Types (CRDTs). This paper considers a CRDT that represents a distributed sequence of basic elements that can be lines, words or characters (a sequence CRDT). The possible operations on this sequence are the insertion and the deletion of elements. Compared to the state of the art, this approach is more decentralized and scales better in terms of the number of participants. However, its space complexity is linear with respect to the total number of inserts and the insertion points in the document. This makes the overall performance of such editors dependent on the editing behaviour of users. This paper proposes and models LSEQ, an adaptive allocation strategy for a sequence CRDT. LSEQ achieves, on average, sub-linear space complexity whatever the editing behaviour. A series of experiments validates LSEQ, showing that it outperforms existing approaches.
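   The Python sketch below illustrates, under simplifying assumptions, the kind of identifier allocation a sequence CRDT performs between two neighbouring positions. It is a toy approximation of an LSEQ-style strategy, not the authors' code, and it omits the site identifiers and clocks that a real CRDT would add for uniqueness.

```python
# Illustrative sketch: allocating a new position identifier between two
# neighbours in a sequence CRDT. Identifiers are paths of integers; the
# identifier space grows exponentially with tree depth (assumed base values).
import random

BASE = 32          # size of the first level; doubles at each depth (assumption)

def alloc_between(p, q, depth=0, path=None):
    """Return an identifier path strictly between identifiers p and q."""
    path = path or []
    base = BASE * (2 ** depth)
    lo = p[depth] if depth < len(p) else 0
    hi = q[depth] if depth < len(q) else base
    if hi - lo > 1:                       # room at this level
        # 'boundary+'-style choice: pick close to the left neighbour
        return path + [lo + random.randint(1, min(10, hi - lo - 1))]
    # no room: descend one level and try again
    return alloc_between(p, q, depth + 1, path + [lo])

left, right = [3], [4]
print(alloc_between(left, right))   # e.g. [3, 7], which sorts between [3] and [4]
```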
Introduction to the universal delta model BIBAFull-Text 47-56
  Gioele Barabucci
There is currently no shared formalization of the output of diff algorithms, the so-called deltas. From a theoretical point of view, without such a formalization it is difficult to compare the output of different algorithms. In more practical terms, the lack of a shared formalization makes it hard to create tools that support more than one diff algorithm.
   This paper introduces the universal delta model: a formal definition of changes (the pieces of information that record that something has changed), operations (the definitions of the kind of change that happened) and deltas (coherent summaries of what has changed between two documents). The fundamental mechanism that makes the changes as defined in the universal delta model a very expressive tool is the use of encapsulation relations between changes: changes are not only simple records of what has changed, they can also be combined into more complex changes to express the fact that the algorithm has detected more nuanced kinds of changes. The universal delta model has been applied successfully in various projects that served as an evaluation for the model. In addition to the model itself, this paper briefly describes one of these projects: the measurement of objective qualities of deltas as produced by various diff algorithms.
Version aware LibreOffice documents BIBAFull-Text 57-60
  Meenu Pandey; Ethan V. Munson
Version control systems provide a methodology for maintaining changes to a document over its lifetime and enable better management and control of evolving document collections, such as the source code for large software systems. However, no version control system supports similar functionality for office documents.
   Version Aware XML documents integrate full versioning functionality into an XML document type, using XML namespaces to avoid document type errors. Version aware XML documents contain a preamble with versions stored in reverse delta format, plus unique ID attributes attached to the nodes of the documents. They support the full branching and merging functionalities familiar to software engineers, in contrast to the constrained versioning models typical of Office applications.
   LibreOffice is an open-source office suite which is widely used for document creation. Each document is represented in the OpenDocument Format, which is a collection of XML files. The current project is an endeavor to show the practicality of the version aware XML documents approach by modifying the LibreOffice suite to support version awareness. We are modifying LibreOffice to accept and preserve both the preamble and the IDs of the version aware framework. Initially, other functionality will be provided by wrapper applications and independent tools, but full integration into the LibreOffice user interface is envisioned.

Search & sense making

Interactive text document clustering using feature labeling BIBAFull-Text 61-70
  Seyednaser Nourashrafeddin; Evangelos Milios; Dirk Arnold
We propose an interactive text document clustering method based on term labeling. The algorithm iteratively asks the user to cluster the top keyterms associated with document clusters. The keyterm clusters are used to guide the clustering method. Rather than using standard clustering algorithms, we propose a new text clusterer that uses term clusters. Terms that exist in a document corpus are clustered. Using a greedy approach, the term clusters are distilled in order to remove non-discriminative general terms. We then present a heuristic approach to extract seed documents associated with each distilled term cluster. These seeds are finally used to cluster all documents. We compared our interactive term labeling to a baseline interactive term selection algorithm on standard real-world text datasets. The experiments show that, with a comparable amount of user effort, our term labeling is more effective than the baseline term selection method.
A graph-based topic extraction method enabling simple interactive customization BIBAFull-Text 71-80
  Ajitesh Srivastava; Axel J. Soto; Evangelos Milios
It is often desirable to identify the concepts that are present in a corpus. A popular way to address this objective is to discover clusters of words or topics, for which many algorithms exist in the literature. Yet most of these methods lack the interpretability that would enable interaction with a user not familiar with their inner workings. This paper proposes a graph-based topic extraction algorithm, which can also be viewed as a soft clustering of the words present in a given corpus. Each topic, in the form of a set of words, represents an underlying concept in the corpus. The method allows easy interpretation of the clustering process, and hence enables user involvement at various steps. For a quantitative evaluation of the extracted topics, we use them as features to obtain a compact representation of documents for classification tasks. We compare the classification accuracy achieved by a reduced feature set obtained with our method against other topic extraction techniques, namely Latent Dirichlet Allocation and Non-negative Matrix Factorization. While the results from all three algorithms are comparable, the speed and easy interpretability of our algorithm make it more appropriate for interactive use by lay users.
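   The evaluation setup described above can be approximated with off-the-shelf components. The sketch below (not the authors' code) uses scikit-learn's LDA baseline to turn documents into topic features and feeds them to a classifier; the documents and labels are hypothetical.

```python
# Illustrative sketch: topics as a compact feature set for classification,
# using the LDA baseline mentioned above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["graph based topic extraction", "matrix factorization of term counts",
        "interactive clustering of documents", "topics as features for classification"]
labels = [0, 1, 0, 1]                      # hypothetical class labels

pipeline = make_pipeline(
    CountVectorizer(),                     # term-document counts
    LatentDirichletAllocation(n_components=2, random_state=0),  # topic features
    LogisticRegression(max_iter=1000),     # classifier on the reduced features
)
pipeline.fit(docs, labels)
print(pipeline.predict(["clustering words into topics"]))
```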
Searching online book documents and analyzing book citations BIBAFull-Text 81-90
  Zhaohui Wu; Sujatha Das; Zhenhui Li; Prasenjit Mitra; C. Lee Giles
Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections even though many books are freely available online. Academic books differ from papers in terms of their length, contents and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work gives a systematic study of building a search engine for books.
   We propose a hybrid approach for extracting the title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
Near duplicate detection in an academic digital library BIBAFull-Text 91-94
  Kyle Williams; C. Lee Giles
The detection and potential removal of duplicates is desirable for a number of reasons, such as reducing the need for unnecessary storage and computation, and providing users with uncluttered search results. This paper describes an investigation into the application of two scalable, state-of-the-art duplicate detection algorithms, simhash and shingling, for detecting near-duplicate documents in the CiteSeerX digital library. We empirically explored the duplicate detection methods, evaluated their performance and applicability to academic documents, and identified good parameters for the algorithms. We also analyzed the types of near duplicates identified by each algorithm. The highest F-scores achieved were 0.91 and 0.99 for the simhash and shingle-based methods respectively. The shingle-based method also identified a larger variety of duplicate types than the simhash-based method.
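   For readers unfamiliar with simhash, the following minimal Python sketch (not the CiteSeerX implementation) shows the fingerprinting idea: token hashes vote on each bit, and a small Hamming distance between fingerprints flags candidate near duplicates.

```python
# Illustrative simhash sketch: documents whose fingerprints differ in few
# bit positions are near-duplicate candidates.
import hashlib

def simhash(text, bits=64):
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

d1 = "near duplicate detection in an academic digital library"
d2 = "near duplicate detection in a large academic digital library"
print(hamming(simhash(d1), simhash(d2)))   # small distance => likely near duplicates
```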

Architecture & processes

Augmenting digital documents with negotiation capability BIBAFull-Text 95-98
  Jerzy Kaczorek; Bogdan Wiszniewski
Active digital documents are not only capable of performing various operations using their internal functionality and external services accessible in the environment in which they operate, but can also migrate on their own over a network of mobile devices that provide dynamically changing execution contexts. Such migrations may lead to conflicts between the preferences of the active document and those of the device on which it wishes to execute. In this paper we propose a solution for resolving such conflicts with automatic negotiations, allowing documents and devices to find contracts satisfying both sides. It is based on a simple bargaining model reinforced with machine learning mechanisms to classify string sequences representing negotiation histories.
A framework for usage-based document reengineering BIBAFull-Text 99-102
  Madjid Sadallah; Benoît Encelle; Azze-Eddine Mared; Yannick Prie
This ongoing work investigates usage-based document reengineering as a means to support authors in modifying their documents. Document usage (i.e. usage feedback) covers readers' explicit annotations and their reading traces. We first describe a conceptual framework with various levels of assistance for document reengineering: indications on reading, problem detection, reconception suggestions and automatic reconception propositions, drawing our examples from e-learning document management. We then present a technical framework for usage-based document reengineering and its associated models for document, annotation and trace representation.

Document recognition & analysis I

Visual saliency and terminology extraction for document annotation BIBAFull-Text 103-106
  Benjamin Duthil; Mickael Coustaty; Vincent Courboulay; Jean-Marc Ogier
The document digitization process has become a crucial economic issue in our society, and it is therefore necessary to organize this huge volume of documents. The work proposed in this paper introduces a new method to automatically classify documents, using a saliency-based segmentation process on the one hand, and terminology extraction and annotation on the other. The saliency-based segmentation is used to extract salient regions, in particular logos, while the terminology approach is used to annotate them and to automatically classify the document. The approach does not require human expertise and uses Google Images as a knowledge database. The results obtained on a real database of 1766 documents show the relevance of the approach.
An adaptive thresholding algorithm based on edge detection and morphological operations for document images BIBAFull-Text 107-110
  Renata Freire de Paiva Neves; Cleber Zanchettin; Carlos Alexandre Barros Mello
This paper presents a new algorithm to threshold document images. The proposed algorithm deals with complex background images, illumination and aspect variations, back-to-front interference, variation of brightness and differently positioned shadows. The algorithm has two phases. The first one uses edge detection and morphological operations to identify the text in the image. The second phase uses the positions of the text to define the threshold value in an adaptive process. Our approach presents promising results on complex-background images released by the Document Image Binarization Contest (DIBCO) when compared with other thresholding algorithms from the literature and the competition.
Information extraction efficiency of business documents captured with smartphones and tablets BIBAFull-Text 111-114
  Daniel Esser; Klemens Muthmann; Daniel Schuster
Businesses and large organizations currently prefer scanners for incorporating paper documents into their electronic document archives. While cameras integrated into mobile devices such as smartphones and tablets are commonly available, it is still unclear how using a mobile device for document capture influences document content recognition. This is especially important for information extraction carried out on documents captured in a mobile scenario. This paper therefore presents a set of experiments to compare automatic index data extraction from business documents in a static and in a mobile case. The paper shows what decline in extraction quality one can expect, explains the reasons and gives a short overview of possible solutions.
Dominant color segmentation of administrative document images by hierarchical clustering BIBAFull-Text 115-118
  Elodie Carel; Vincent Courboulay; Jean-Christophe Burie; Jean-Marc Ogier
This paper addresses the problem of color document image segmentation in an industrial context. Automated Document Recognition (ADR) systems greatly reduce the time and resource costs of companies by managing their huge amounts of administrative documents and by optimizing their workflow. Most of the time, a binarization is performed, owing to historical industrial processes; colorimetric information could, however, improve the process. In this paper, we propose a hierarchical clustering based approach to extract dominant color masks of documents. Our dataset comprises different kinds of scanned administrative document images such as invoices, forms, letters, and so on, and we do not know a priori the number of dominant colors in our documents. These masks will then be fed as input to an OCR engine in order to bring extra information about the colorimetric context. The approach requires neither user interaction nor parameter-setting steps. Experiments on several types of documents show the relevance of the proposed approach.
Optical font recognition using conditional random field BIBAFull-Text 119-122
  Aziza Satkhozhina; Ildus Ahmadullin; Jan P. Allebach
Automated publishing systems require large databases containing document page layout templates. Most of these layout templates are created manually. A lower cost alternative is to extract document page layouts from existing documents. In order to extract the layout from a scanned document image, it is necessary to perform Optical Font Recognition (OFR) since the font is an important element in layout design. In this paper, we use the Conditional Random Field (CRF) model to perform OFR. First, we extract typographical features of the text. Then, we train the probabilistic model using a log-linear parameterization of CRF. The advantage of using CRF is that it does not assume that the typographical features are independent of each other. We demonstrate the effectiveness of this approach on a set of 616 fonts.
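   As a rough illustration of the pipeline described above (not the authors' system), a linear-chain CRF can be trained on per-word typographical features with an off-the-shelf library; the feature names, measurements and font labels in the sketch below are hypothetical.

```python
# Illustrative sketch: a linear-chain CRF labels each word of a text line
# with a font class from per-word typographical features (values assumed).
import sklearn_crfsuite

def word_features(stroke_width, xheight_ratio, serif_score):
    return {
        "stroke_width": stroke_width,
        "xheight_ratio": xheight_ratio,
        "serif_score": serif_score,
    }

# one training sequence = the words of one text line
X_train = [[word_features(2.1, 0.48, 0.9), word_features(2.0, 0.47, 0.8)],
           [word_features(1.4, 0.52, 0.1), word_features(1.5, 0.53, 0.2)]]
y_train = [["TimesRoman", "TimesRoman"], ["Helvetica", "Helvetica"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```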
A shape-based layout descriptor for classifying spatial relationships in handwritten math BIBAFull-Text 123-126
  Francisco Alvaro; Richard Zanibbi
We consider the difficult problem of classifying spatial relationships between symbols and subexpressions in handwritten mathematical expressions. We first improve existing geometric features based on bounding boxes and center points, normalizing them using the distance between the centers of the two symbols or subexpressions in question. We then propose a novel feature set for layout classification, using polar histograms computed over points in handwritten strokes. A series of experiments are presented in which a Support Vector Machine is used with these new features to classify spatial relationships of five types in the MathBrush corpus (horizontal, superscript, subscript, below, and inside (e.g. in a square root)). The normalized geometric features provide an improvement over previously published results, while the shape-based features provide a natural representation with results comparable to those for the geometric features. Combining the features produced a very small improvement in accuracy.
Evaluating glyph binarizations based on their properties BIBAFull-Text 127-130
  Shira Faigenbaum; Arie Shaus; Barak Sober; Eli Turkel; Eli Piasetzky
Document binary images, created by different algorithms, are commonly evaluated based on a pre-existing ground truth. Previous research found several pitfalls in this methodology and suggested various approaches addressing the issue. This article proposes an alternative binarization quality evaluation solution for binarized glyphs, circumventing the ground truth. Our method relies on intrinsic properties of binarized glyphs. The features used for quality assessment are stroke width consistency, presence of small connected components (stains), edge noise, and the average edge curvature. Linear and tree-based combinations of these features are also considered. The new methodology is tested and shown to be nearly as sound as human experts' judgments.
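   One of the ground-truth-free cues listed above, the presence of small connected components ("stains"), can be computed along the following lines; this is an illustrative sketch with an assumed area threshold, not the authors' implementation.

```python
# Illustrative sketch: count small connected components ("stains") in a
# binarized glyph image as a quality cue.
import numpy as np
from scipy import ndimage

def small_component_count(binary_glyph, max_area=5):
    """binary_glyph: 2-D array with 1 for ink, 0 for background."""
    labels, n = ndimage.label(binary_glyph)
    areas = ndimage.sum(binary_glyph, labels, index=range(1, n + 1))
    return int(np.sum(np.asarray(areas) <= max_area))

glyph = np.zeros((20, 20), dtype=int)
glyph[5:15, 8:10] = 1        # a stroke
glyph[2, 2] = 1              # an isolated speck (stain)
print(small_component_count(glyph))   # 1
```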

Document layout & presentation generation I

Functional, extensible, SVG-based variable documents BIBAFull-Text 131-140
  John W. Lumley
Architectures for documents that vary in response to binding to data, or user interaction, are usually based on limited layout semantics, such as text flows, and simple data variability, such as replacing reserved constructs. By using a generalised XML graphical representation (SVG), decorated with an extensible set of layout intent declarations, and with embedded fragments of XSLT decorated with program retention directives, it is possible to produce self-contained documents that are both highly flexible and extensible and can adapt their presentation to multiple stages of data binding, as well as user interaction. The essentials of the architecture are presented with examples and details of the necessary implementation and support tools, most of which are written in declarative, functional XSLT. Recent developments in XSLT technologies make it possible to consider such documents operating within unmodified browsers -- techniques are discussed.
Hierarchical probabilistic model for news composition BIBAFull-Text 141-150
  Ildus Ahmadullin; Niranjan Damera-Venkata
We present a method for the automated composition of personalized newspapers. Traditional newsprint composition is a laborious and expensive manual process. We develop a two-level hierarchical page layout model that models aesthetic design choices using local (within article region) and global (page level) prior probability distributions. Given content to be composed, our model can infer the best way to divide a page into layout regions and simultaneously optimize content fit within these regions. We automate decisions on how to paginate articles, flow article text across pages, crop images, adjust whitespace, etc. for the best overall newspaper composition. We also show how content editing, which is a very important task in the traditional news workflow, can be incorporated in a semi-automated manner within our framework. Our model is a generalization of our prior work on probabilistic modeling of single-flow layouts to enable multiple article flows on a page, while still allowing one or more articles to break on a page and continue on subsequent pages.
Balancing font sizes for flexibility in automated document layout BIBAFull-Text 151-160
  Ricardo Piccoli; João Batista Oliveira
This paper presents an improved approach for automatically laying out content onto a document page, where the number and size of the items are unknown in advance. Our solution leverages earlier results from Oliveira (2008), wherein layouts are modeled by a guillotine partitioning of the page. The benefit of this method is its efficiency and its ability to place as many items on a page as desired. In our model, items have flexible representations and texts may freely change their font sizes to fit a particular area of the page. As a consequence, the optimization goal is to find a layout that produces the least noticeable difference between font sizes, in order to obtain the most aesthetically pleasing layout. Finding the best areas for text requires knowledge of how typesetting engines actually render text for a particular setting. As such, we also model the behavior of the TeX typesetting engine when computing the height occupied by a text block as a function of the font size, text length and line width. An analytical approximation for text placement is then presented, refined by curve fitting over TeX-generated data. As a practical result, the resulting layouts for a newspaper generation application are also presented. Finally, we discuss these results and directions for further research.
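   The kind of model described above can be sketched as follows; the functional form, the measurements and the fitted constants are assumptions made for illustration, not the formula derived in the paper.

```python
# Illustrative sketch: fit the height occupied by a text block as a function
# of font size (fixed text length and line width) by curve fitting over
# measured data points, as in the approach described above.
import numpy as np
from scipy.optimize import curve_fit

def block_height(font_size, a, b):
    # assumption: required height grows roughly with the square of the font
    # size (fewer characters per line and taller lines), plus a constant
    return a * font_size ** 2 + b

font_sizes = np.array([8, 9, 10, 11, 12], dtype=float)
measured   = np.array([62, 79, 97, 118, 140], dtype=float)  # hypothetical measurements (pt)

params, _ = curve_fit(block_height, font_sizes, measured)
print(params)                         # fitted (a, b)
print(block_height(10.5, *params))    # predicted height at an unseen font size
```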

Document recognition & analysis II

Document noise removal using sparse representations over learned dictionary BIBAFull-Text 161-168
  Do Thanh-Ha; Salvatore Tabbone; Oriol Ramos Terrades
In this paper, we propose an algorithm for denoising document images using sparse representations. Given a training set, this algorithm is able to learn the main document characteristics and also the kind of noise present in the documents. In this perspective, we propose to model the noise energy based on the normalized cross-correlation between pairs of noisy and non-noisy documents. Experimental results on several datasets demonstrate the robustness of our method compared with the state of the art.
Supervised polarity classification of Spanish tweets based on linguistic knowledge BIBAFull-Text 169-172
  David Vilares; Miguel Ángel Alonso; Carlos Gómez-Rodríguez
We describe a system that classifies the polarity of Spanish tweets. We adopt a hybrid approach, which combines machine learning and linguistic knowledge acquired by means of NLP. We use part-of-speech tags, syntactic dependencies and semantic knowledge as features for a supervised classifier. Lexical particularities of the language used in Twitter are taken into account in a pre-processing step. Experimental results improve over those of pure machine learning approaches and confirm the practical utility of the proposal.
Hi-Fi HTML rendering of multi-format documents in DoMinUS BIBAFull-Text 173-176
  Stefano Ferilli; Floriana Esposito; Domenico Redavid
Digital Libraries collect, organize and provide to end users large quantities of selected documents. While these documents come in a variety of formats, it is desirable that they are delivered to final users in a uniform way. Web formats are a suitable choice for this purpose. Although Web documents are very flexible as to layout presentation, which is determined at runtime by the interpreter, documents coming from a library should preserve their original layout when displayed to final users. Using raster images would not allow the user to access the actual content of the document's components (text and images). This paper presents a technique to render in an HTML file the original layout of a document, preserving the peculiarities of its components (text, images, formulas, tables, algorithms). It builds on the DoMInUS framework, which can process documents in several source formats.
PDFX: fully-automated PDF-to-XML conversion of scientific literature BIBAFull-Text 177-180
  Alexandru Constantin; Steve Pettifer; Andrei Voronkov
PDFX is a rule-based system designed to reconstruct the logical structure of scholarly articles in PDF form, regardless of their formatting style. The system's output is an XML document that describes the input article's logical structure in terms of title, sections, tables, references, etc. and also links it to geometrical typesetting markers in the original PDF, such as paragraph and column breaks. The key aspect of the presented approach is that the rule set used relies on relative parameters derived from font and layout specifics of each article, rather than on a template-matching paradigm. The system thus obviates the need for domain- or layout-specific tuning or prior training, exploiting only typographical conventions inherent in scientific literature. Evaluated against a significantly varied corpus of articles from nearly 2000 different journals, PDFX gives a 77.45 F1 measure for top-level heading identification and 74.03 for extracting individual bibliographic items. The service is freely available for use at http://pdfx.cs.man.ac.uk/.
Recognising document components in XML-based academic articles BIBAFull-Text 181-184
  Angelo Di Iorio; Silvio Peroni; Francesco Poggi; Fabio Vitali; David Shotton
Recognising textual structures (paragraphs, sections, etc.) provides abstract and more general mechanisms for describing documents independent of the particular semantics of specific markup schemas, tools and presentation stylesheets. In this paper we propose an algorithm that allows us to identify the structural role of each element in a set of homogeneous scientific articles stored as XML files.

Metadata & annotation

Improving term extraction by utilizing user annotations BIBAFull-Text 185-188
  Jozef Harinek; Marián Šimko
Automated acquisition of relevant domain terms from educational documents available in social educational systems can benefit from processing the growing number of user-created annotations assigned to the content. Annotations provide potentially useful information about documents and can improve the results of base Automatic Term Recognition (ATR) algorithms. We propose a method for relevant domain term extraction based on processing user-created annotations. We consider three basic annotation types: tags, comments and highlights. The final term weight is computed by combining the relevant domain term weights obtained from the individual annotation types with those obtained from the text. The method was evaluated using data from a Principles of Software Engineering course in the adaptive educational system ALEF, and showed that enhancements based on annotation processing yield a significant improvement in results.
Using RDFS/OWL to ease semantic integration of structured documents BIBAFull-Text 189-192
  Jean-Yves Vion-Dury
This paper defines i/ an RDFS/OWL schema to capture the syntactic structure of marked-up documents and ii/ the (reversible) transposition of any XML/SGML/HTML document into a set of conformant RDF triples that convey the relevant tree information, be it meta-information (structure of the tree, attributes, XML comments...) or basic information (textual content).
   The translation we propose reuses the predefined semantics of the RDFS and OWL W3C standards, thus making tree manipulation and transformations homogeneous with common RDF semantic models; once translated, operations on XML/SGML/HTML trees can be much more easily integrated into Semantic Web applications (and this applies particularly well to emerging HTML notational systems such as RDFa or micro-formats).
   Where this makes sense for application areas, specific document schemas can be totally or partially translated into supplemental RDFS/OWL constraints manageable by inference engines complying with the W3C standards.
Reviewing the TEI ODD system BIBAFull-Text 193-196
  Sebastian Rahtz; Lou Burnard
For many years the Text Encoding Initiative (TEI) has maintained a specialised high-level XML vocabulary in the 'literate programming' paradigm to define its influential Guidelines, from which schemas or DTDs in other schema languages are derived. This paper reviews the development of this vocabulary, known as ODD (for 'One Document Does it all'). We discuss some problems with the language, and propose solutions to make it more complete and extensible.
Assisted editing in the biomedical domain: motivation and challenges BIBAFull-Text 197-200
  Fabio Rinaldi
One of the characteristics of biomedical scientific literature is the high ambiguity of the domain-specific terminology which can be used to describe technical concepts and specific objects of the domain. This is partly due to the very broad scope of the domain of interest and partly to inherent properties of the terminology itself. There are simply very large numbers of genes, proteins, organs, cell lines, cellular phenomena, experimental methods, and so on. For example, UniProt, the most authoritative protein database, currently contains more than 33 million entries. Clearly, the names which are typically used to refer to proteins are polysemic and might refer to hundreds of different entries in a reference database.
   Such a large and extensive terminology necessarily makes it difficult to derive from the literature a simplified representation of the entities and relationships described in the articles, despite considerable efforts by the text mining community. In this paper we propose to complement such efforts with editing tools that can assist the authors in efficiently adding to their publications a minimal semantic annotation so that much of the ambiguity is avoided.
Managing content, metadata and user-created annotations in web-based applications BIBAFull-Text 201-204
  Marián Šimko; Martin Franta; Martin Habdák; Petra Vrablecová
We introduce a tool aimed at facilitating the management of content, metadata and social annotations assigned to documents in semantic web-based applications. COME2T (COllaboration- and MEtadata-oriented COntent Management EnvironmenT) allows easy administration of lightweight semantics for the provided content and of user-created annotations, which are often created as a result of implicit collaboration between the users of a web-based application. We present the tool's most important features and briefly describe a pilot application of the tool to manage content for the adaptive learning portal ALEF.

Multimedia I

Multimedia authoring based on templates and semi-automatic generated wizards BIBAFull-Text 205-214
  Roberto Gerson de Albuquerque Azevedo; Rodrigo Costa Mesquita Santos; Eduardo Cruz Araújo; Luiz Fernando Gomes Soares; Carlos de Salles Soares Neto
Templates have been used to engage non-expert multimedia authors as content producers. In template-based authoring, templates with most of the relevant application logic and application constraints are developed by experts, who must also specify the template semantics, report which gaps need to be filled in, and explain how to do so. Filling a template's gaps is the only task left to inexperienced users to produce the final applications. To do that, they usually must understand the padding instructions reported by template authors and learn some specific padding language. An alternative is to use GUI components created specifically for each newly developed template. This paper proposes the semi-automatic generation of GUI wizards to guide end-user authors in creating multimedia applications. The wizard can be tuned to improve the communication between the template author and the template end user, and also when the template specification is incomplete. Many successful trial cases show that the generated wizards are usually simple enough to be used by non-experts. The contributions of this paper are not constrained to any specific template language or final-application format. Nevertheless, to test the proposal, it was instantiated to work with TAL (Template Authoring Language), whose template processors can generate applications in different target languages.
Content-based copy and paste from video documents BIBAFull-Text 215-218
  Laurent Denoue; Scott Carter; Matthew Cooper
Unlike text, copying and pasting parts of video documents is challenging. Yet, the abundance of video documents now available, including how-to tutorials, requires simpler tools that allow users to easily copy and paste fragments of video material into new documents. We describe new direct video manipulation techniques enabling users to quickly copy and paste content from video documents into their own multimedia documents. While the video plays, users interact with the video canvas to select text regions, scrollable regions, slide sequences built up across many frames, or semantically meaningful regions such as dialog boxes. Instead of relying on the timeline to accurately select sub-parts of the video document, users navigate using familiar selection techniques, such as using the mouse wheel to scroll back and forward over a video shot in which the content scrolls, double-clicking on rectangular regions to select them, or clicking and dragging over textual regions of the video canvas to select them. We describe the video processing techniques that run in real time in modern web browsers using HTML5 and JavaScript, and show how they help users quickly copy and paste video fragments into new documents, allowing them to efficiently reuse video documents for authoring or note-taking.
MoViA: a mobile video annotation tool BIBAFull-Text 219-222
  Bruna C. R. Cunha; Olibário J. Machado Neto; Maria da Graça Pimentel
User interaction with mobile devices has improved dramatically over the last few years, and we increasingly rely on smartphones and tablets for a wide range of tasks. Modern mobile devices enable users to access, manage and transmit multiple types of media in an easy, convenient and portable way. In this context, the playback of videos on mobile devices has become a usual activity. Much work on video annotation exists, but little of it is concerned with the mobile scenario. The ability to add annotations and to share them with others is a content-enriching process which can improve activities ranging from education to entertainment. In this paper, we present an intuitive tool that allows users to perform temporal video annotations on mobile devices. Using conventional tablets and smartphones equipped with the Android operating system, text, audio and digital ink annotations can be made on any video. It is possible to share text annotations with other users and to play multiple annotations at the same time. The variety of display sizes and the possibility of switching between portrait and landscape modes have also been considered.

Posters & demonstrations

Enterprise document system cloud deployment BIBAFull-Text 223-224
  Christopher Alan Wells; Joel Jirak; Steve Pruitt; Anthony J. Wiley
The software used by enterprise businesses for creating variable-data customer documents must be highly reliable, and vendors are increasingly distributing such software via the cloud as an online service. This means that vendors now assume responsibility for the IT resources hosting and supporting the software as well as the customer documents and data. Vendors also assume responsibility for pushing updates to all customers simultaneously. To support the test and release of new versions, software vendors must deploy and configure the software at an unprecedented rate.
   To reduce the time spent deploying and configuring software in the cloud, and to minimize the chance for human error, we present StackLauncher. By making it possible to automatically configure and launch software "stacks" with push-button simplicity, StackLauncher is a valuable addition to the software development lifecycle for cloud deployment of enterprise document software.
Bag of subjects: lecture videos multimodal indexing BIBAFull-Text 225-226
  Nhu Van Nguyen; Jean-Marc Ogier; Franck Charneau
In this paper, we address multimodal indexing and retrieval for videos of lectures or seminars. The paper proposes a combination of technologies issuing from image document analysis and text mining. Based on visual and textual information extracted from slide images, we investigate a Bag of mixed Words (visual words and textual words) model to represent the content of lecture slides. Lecture videos are indexed and retrieved using this extended Bag of Words model. The model assumes that a video may contain multiple subjects; it discovers the visual representation of these subjects automatically and indexes the video accordingly. We discuss mixed text/image queries and the proposed indexing approach for retrieving lecture videos, and report a quantitative evaluation on lecture videos from our lab.
tranScriptorium: a European project on handwritten text recognition BIBAFull-Text 227-228
  Joan Andreu Sánchez; Günter Mühlberger; Basilis Gatos; Philip Schofield; Katrien Depuydt; Richard M. Davis; Enrique Vidal; Jesse de Does
The tranScriptorium project aims to develop innovative, efficient and cost-effective solutions for annotating handwritten historical documents using modern, holistic Handwritten Text Recognition (HTR) technology. Three actions are planned in tranScriptorium: i) improve basic image preprocessing and holistic HTR techniques; ii) develop novel indexing and keyword searching approaches; and iii) capitalize on new, user-friendly interactive-predictive HTR approaches for computer-assisted operation.
On the performance of the position() XPath function BIBAFull-Text 229-230
  Luiz Augusto Matos da Silva; Luiz Laerte N. da Silva, Jr.; Marta Mattoso; Vanessa Braganholo
In very large XML documents or collections, query response times are not always satisfactory. To overcome this limitation, parallel processing can be applied. Data can be replicated on several processors and queries can be partitioned to run over different virtual data partitions on each processor, in an approach called virtual partitioning. PartiX-VP is a simple XML virtual partitioning approach that generates virtual data partitions by dividing the cardinality of the partitioning attribute by the number of allocated processors, resulting in intervals of equal size for each processor. In this approach, the XML query is rewritten and selection predicates are added to define the virtual partitions. These selection predicates use the position() XPath function, which addresses a set of elements at a given position in the document. In this paper, we present an experimental evaluation of the position() XPath function in five native XML DBMSs. We have identified differences in the processing time of the position() XPath function in large collections of XML documents. This may lead to load imbalance in simple virtual partitioning approaches; thus, this analysis opens space for improvements in virtual partitioning.
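   The query-rewriting idea behind virtual partitioning can be illustrated with a small Python/lxml sketch (assumptions only, not the PartiX-VP code): the same path expression is issued once per processor with a position() predicate that delimits its virtual partition.

```python
# Illustrative sketch of virtual partitioning with position(): each processor
# evaluates the same query rewritten with a different position() range.
from lxml import etree

xml = "<orders>" + "".join(f"<order id='{i}'/>" for i in range(1, 101)) + "</orders>"
tree = etree.fromstring(xml)

processors = 4
total = int(tree.xpath("count(/orders/order)"))
chunk = total // processors

for p in range(processors):
    lo = p * chunk + 1
    hi = (p + 1) * chunk if p < processors - 1 else total
    # rewritten query: a selection predicate on position() defines the partition
    query = f"/orders/order[position() >= {lo} and position() <= {hi}]"
    print(p, len(tree.xpath(query)))   # each processor would scan its own slice
```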
Incremental hierarchical text clustering with privileged information BIBAFull-Text 231-232
  Ricardo Marcondes Marcacini; Solange Oliveira Rezende
In many text clustering tasks, there is some valuable knowledge about the problem domain, in addition to the original textual data involved in the clustering process. Traditional text clustering methods are unable to incorporate such additional (privileged) information into data clustering. Recently, a new paradigm called LUPI -- Learning Using Privileged Information -- was proposed by Vapnik to incorporate privileged information in classification tasks. In this paper, we extend the LUPI paradigm to deal with text clustering tasks. In particular, we show that the LUPI paradigm is potentially promising for incremental hierarchical text clustering, being very useful for organizing large textual databases. In our method, the privileged information about the text documents is applied to refine an initial clustering model by means of consensus clustering. The initial model is used for incremental clustering of the remaining text documents. We carried out an experimental evaluation on two benchmark text collections and the results showed that our method significantly improves the clustering accuracy when compared to a traditional hierarchical clustering method.
Beyond term clusters: assigning Wikipedia concepts to scientific documents BIBAFull-Text 233-234
  Ozge Yeloglu; Evangelos Milios; A. Nur Zincir-Heywood
We propose a model for assigning Wikipedia Concepts as scientific category labels to scientific documents, in which document terms are first grouped together using the well-known topic modelling method Latent Dirichlet Allocation (LDA) and then assigned to Wikipedia Concepts by wikification. We wikify the terms of the topic model of a document to extract related concepts from Wikipedia. We experiment on two different datasets: the abstracts of documents from the ACM Digital Library and the full papers of the UvT Collection. The ACM dataset includes Computer Science publications, whereas UvT includes scientific publications from a range of topics. Domain-specific taxonomies are used for evaluation. Results show that our approach is able to assign Wikipedia Concepts to scientific publications in an automated manner, removing any need for human supervision.
Cross language indexing and retrieval of the Cypriot digital antiquities repository BIBAFull-Text 235-236
  Dayu Yuan; Prasenjit Mitra
We design and implement a cross-language retrieval system for the Cypriot Digital Antiquities Repository (cyDAR). Users can query in either English or Ancient Greek to search for documents written in Ancient Greek. Because of the lack of a dictionary and parallel corpus, we use machine translation to translate the documents. We index both the original Ancient Greek text and the translated English text to facilitate multi-language search.

Document layout & presentation generation II

No need to justify your choice: pre-compiling line breaks to improve eBook readability BIBAFull-Text 237-240
  Alexander J. Pinkney; Steven R. Bagley; David F. Brailsford
Implementations of eBooks have existed in one form or another for at least the past 20 years, but it is only in the past 5 years that dedicated eBook hardware has become a mass-market item.
   New screen technologies, such as e-paper, provide a reading experience similar to that of physical books, and even backlit LCD and OLED displays are beginning to have high enough pixel densities to render text crisply at small point sizes. Despite this, the major element of the physical book that has not yet made the transition to the eBook is high-quality typesetting.
   The great advantage of eBooks is that the presentation of the page can adapt, at rendering time, to the physical screen size and to the reading preferences of the user. Until now, simple first-fit line-breaking algorithms have had to be used in order to give acceptable rendering speed whilst conserving battery life.
   This paper describes a system for producing well-typeset, scalable document layouts for eBook readers, without the computational overhead normally associated with better-quality typesetting. We precompute many of the complex parts of the typesetting process, and perform the majority of the 'heavy lifting' at document compile-time, rather than at rendering time. Support is provided for floats (such as figures in an academic paper, or illustrations in a novel), for arbitrary screen sizes, and also for arbitrary point-size changes within the text.
Reflowing and annotating scientific papers on eBook readers BIBAFull-Text 241-244
  Simone Marinai
Working with scientific and technical papers on small screen devices, such as tablets and eBook readers, is difficult since these works are often typeset in multiple columns with a relatively small font size.
   On tablets, pan and zoom operations allow users to visualize the text at the desired size; however, tracing the text across multiple columns can be awkward and is not well suited to studying and working with scientific works. Moreover, these operations are slow on most e-ink eBook readers, which have limited computational resources. Document reflow is in this case one option, but it is difficult to provide a satisfactory visualization of scientific and technical papers.
   In this paper, we describe an off-line tool for scientific document reflow that adopts document image processing techniques to generate a modified version of the original PDF organized as single-column text that can be easily visualized on eBook readers. Moreover, the tool allows the user to make free-form annotations on the modified paper using the tools of the eBook reader. These annotations are faithfully reproduced in the original two-column document.
Automatic generation of limited-depth hyper-documents from clinical guidelines BIBAFull-Text 245-248
  Mark Truran; Jonathan Siddle; Gersende Georg; Marc Cavazza
Research suggests that browsing clinical guidelines in a linear format is difficult for users. One national producer of clinical guidelines (HAS, the French National Authority for Health) has recently developed a new document format designed to improve accessibility. It is a limited-depth hypertext structurally constrained so that all information lies within two clicks of a central index ('reco2clics'). In this paper, we introduce an authoring tool which converts full-length clinical guidelines to the 'reco2clics' format. Alongside routine editorial operations, this tool supports dynamic document restructuring, a complex operation using text segmentation algorithms and deontic analysis.
Splitting wide tables optimally BIBAFull-Text 249-252
  Mihai Bilauca; Patrick Healy
In this paper we discuss the problems that occur when splitting wide tables across multiple pages. We focus our attention on finding solutions that minimize the impact on the meaning of the data when the objective is to reorder the columns so that the number of pages used is minimal. Reordering the columns of a table raises a number of complex optimization problems that we study in this paper: minimizing the page count and, at the same time, the number of column position changes or the number of column groups split across pages. We show that, by using integer programming, the number of pages used when splitting wide tables can be reduced by up to 25%, and that this can be achieved in a short computational time.
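   A stripped-down version of the integer-programming idea can be written with an off-the-shelf solver; the sketch below (not the authors' model) only minimizes the number of pages subject to page-width constraints, omitting the column-order and column-group objectives discussed in the paper, and the column widths are hypothetical.

```python
# Illustrative sketch: assign table columns to pages so that each page's width
# is respected and the number of pages used is minimized.
import pulp

widths = [30, 45, 25, 60, 40, 35]        # hypothetical column widths
page_width = 100
max_pages = len(widths)

prob = pulp.LpProblem("wide_table_split", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (range(len(widths)), range(max_pages)), cat="Binary")
used = pulp.LpVariable.dicts("used", range(max_pages), cat="Binary")

prob += pulp.lpSum(used[p] for p in range(max_pages))            # minimize pages used
for c in range(len(widths)):                                      # each column on one page
    prob += pulp.lpSum(x[c][p] for p in range(max_pages)) == 1
for p in range(max_pages):                                        # fit within the page width
    prob += pulp.lpSum(widths[c] * x[c][p] for c in range(len(widths))) <= page_width * used[p]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(int(pulp.value(prob.objective)))   # minimal number of pages (3 for these widths)
```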

Multimedia II

NCL4WEB: translating NCL applications to HTML5 web pages BIBAFull-Text 253-262
  Esdras Caleb Oliveira Silva; Joel A. F. dos Santos; Débora C. Muchaluat-Saade
Testing Digital TV applications is not a simple task. DTV applications need to be transmitted either by a TV broadcaster or by someone with equipment capable of generating a DTV signal with the application embedded. Alternatively, an interactive TV application developer may use a virtual execution environment, like a virtual set-top box installed on a computer, which implements the digital TV middleware standard. In both cases, the application usually does not reach a large number of final users, and developers may not be motivated to continue working with digital TV interactive content. On the other hand, HTML5 support for multimedia content will certainly attract multimedia authors to web development. Considering this scenario, this work proposes an alternative way of presenting a digital TV application developed in NCL for the Ginga declarative middleware, translating it into HTML5 web pages so that it can be presented using a common web browser. The translation tool is called NCL4WEB. Like HTML, NCL is XML-based, so NCL4WEB is based on XSLT stylesheets. It transforms NCL elements into HTML5 elements and a set of JavaScript functions that implement synchronization relationships among media objects, including user interaction. Using NCL4WEB, NCL developers are able to publish their interactive TV applications on the web, and it is transparent to final users whether they are accessing HTML5 or NCL content through their web browser.
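   The XSLT-based translation can be illustrated with a toy stylesheet (not the actual NCL4WEB rules, and ignoring NCL namespaces): a hypothetical NCL <media> element is mapped onto an HTML5 <video> element.

```python
# Illustrative sketch of XSLT-based NCL-to-HTML5 mapping, run with lxml.
from lxml import etree

ncl = etree.fromstring('<body><media id="intro" src="intro.mp4"/></body>')
xslt = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/body">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <xsl:template match="media">
    <video id="{@id}" src="{@src}" controls="controls"/>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
print(etree.tostring(transform(ncl), pretty_print=True).decode())
```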
Go beyond boundaries of iTV applications BIBAFull-Text 263-272
  Caio Cesar Viel; Erick Lazaro Melo; Maria da Graça Campos Pimentel; Cesar Augusto Camillo Teixeira
The development of multimedia applications that require the manipulation and synchronization of multiple media and the handling of different types of user interaction usually requires specialized knowledge of imperative languages. Declarative languages have been proposed in order to make this task easier, especially when applications are restricted to certain classes, as is the case for Interactive TV applications, in which user interactions are restricted to a few simple models. However, those simple models may be too simple when documents are reused on other platforms: for instance, when watching a video most web users expect an interactive timeline to be available -- which is not the case for interactive TV videos. This paper presents a component-based approach to the enrichment of declarative languages for multimedia, so that desirable user-media interactions are made possible while the original ease of authoring is maintained. We detail the components and present a corresponding proof-of-concept prototype. We also discuss design decisions associated with the development of the components, which should be useful in further extensions.
Multimedia document synchronization in a distributed social context BIBAFull-Text 273-276
  Jack Jansen; Pablo Cesar; Dick Bulterman
Watching digital content together and commenting on it is becoming a social habit among friends and family members living apart. It is also becoming an important value-added activity for business video conferencing. In both cases, the video sharing experience can easily be spoiled if synchronization problems arise, since the context of the conversation will not be consistent across locations. In the past, research has treated the distributed synchronization problem as a technical one, mainly focusing on timestamps, frame accuracy, and protocol-dependent control messages. That work takes a content-agnostic approach which, we feel, does not adequately address the higher-level constraints of individual conversations.
   In this paper we postulate that the technical issues are just part of the problem, so solutions need to take into account the media being shared and the social setting. Therefore, we propose a framework that allows researchers to experiment with different synchronization policies tailored to specific settings. The framework allows the evaluation of these policies through user testing, for finding the most appropriate policies and strategies.
Interchanging and preserving presentation recordings BIBAFull-Text 277-280
  Kai Michael Höver; Max Mühlhäuser
The importance of presentation recordings is steadily increasing. This trend is indicated, for example, by the growing MOOC market. Many systems for the production of such recordings exist. However, the recordings produced are not exchangeable between systems due to different representation formats. In this paper, we present an ontology for the conceptual description of presentation recordings and describe the transformation process between different systems. Furthermore, we explain how this ontology can be used to preserve presentation recordings as ebooks.

Workshops

Document changes: modeling; detection; storing and visualization (DChanges) BIBAFull-Text 281-282
  Gioele Barabucci; Uwe M. Borghoff; Angelo Di Iorio; Sonja Maier
Many people have approached the problem of investigating the evolution of documents and data from different perspectives, e.g. by tracking changes, versioning and diffing. The goal of this workshop is to share ideas, common issues and principles, and to foster research collaboration on these topics.
Collaborative annotations in shared environments: metadata, vocabularies and techniques in the digital humanities (DH-CASE 2013) BIBAFull-Text 283-284
  Francesca Tomasi; Fabio Vitali
We present here the DH-CASE 2013 workshop, aimed at investigating the state of the art in the field of collaboration in text annotation by exploring methods, tools and techniques used in the domain of the Digital Humanities (DH).
Reimagining digital publishing for technical documents BIBAFull-Text 285-286
  Michael Wybrow
This workshop asks how we might reimagine digital publishing for technical documents and proposes to investigate new adaptive approaches to document reading, with flexible navigation and with contextual information -- figures, references, definitions, etc. -- displayed dynamically at the point where it is referred to. The workshop ultimately seeks to answer the question of what needs to happen for reading and annotating technical documents on digital devices to become more comfortable and productive than on paper.