
Proceedings of the 2010 ACM Symposium on Document Engineering

Fullname: DocEng'10 Proceedings of the 10th ACM Symposium on Document Engineering
Editors: Apostolos Antonacopoulos; Michael Gormish; Rolf Ingold
Location: Manchester, UK
Dates: 2010-Sep-21 to 2010-Sep-24
Standard No: ISBN 1-4503-0231-9, 978-1-4503-0231-9; hcibib: DocEng10
  1. Keynote address
  2. Systems
  3. Authoring
  4. Tools
  5. E-books
  6. Editing
  7. Document systems
  8. Analysis
  9. Creation/printing
  10. Document engineering I: posters
  11. Document engineering II: posters

Keynote address

Exploring the world's knowledge in the digital age 1-2
  Aly Kaloa Conteh
The advent of the digital age has brought about a dramatic change in the techniques for the dissemination of, and access to, historical documents. For today's researcher, the internet is a primary source of the tools and information that support their research. The British Library (BL) provides world class information services to the academic, business, research and scientific communities and offers unparalleled access to the world's largest and most comprehensive research collection. The British Library's collections include 150 million items from every era of written human history beginning with Chinese oracle bones dating from 300 BC, right up to the latest e-journals. For almost 20 years the British Library has been engaged in transforming physical collection items into digital form enabling access to that content over the World Wide Web.
   In this talk, I describe the depth and range of the collections that are held at the British Library. I will outline the digital conversion process we undertake, including our current solutions for digital formats, metadata standards, providing access to the content and preserving the digital outputs in perpetuity. Finally, as the British Library undertakes projects that will digitise millions of pages of historical text-based items, I will look at the challenges we face, such as storage requirements and enhancing resource discovery, and how we are addressing those challenges.

Systems

Document engineering for a digital library: PDF recompression using JBIG2 and other optimizations of PDF documents 3-12
  Petr Sojka; Radim Hatlapatka
This paper describes several innovative document transformations and tools that have been developed in the process of building the Digital Mathematical Library DML-CZ (http://dml.cz). The main result presented in this paper is our set of PDF re-compression tools, developed using the jbig2enc library. Together with other programs, especially pdfsizeopt.py by Péter Szabó, we have managed to decrease PDF storage size and transmission needs by 62%: using both programs we reduced the size of the original PDFs to 38%.
   This paper briefly describes other approaches and tools developed while creating the digital library. The batch digital signature stamper, the document similarity metrics, which use four different methods, a [meta]data validation process and some math OCR tools represent some of the main byproducts of this project. These ways of document engineering, together with Google Scholar indexing optimizations, have led to the success of serving digitized and born-digital scientific math documents to the public in DML-CZ, and will also be employed in the project of The European Digital Mathematics Library, EuDML.
Keywords: authoring tools and systems, categorization, character recognition, classification, digital mathematical library, digitisation workflow, document presentation (typography, formatting, layout), representations/standards, structure, layout and content analysis
Multilingual composite document management framework for the internet: an FRBR approach 13-16
  Jean-Marc Lecarpentier; Cyril Bazin; Hervé Le Crosnier
Most Web Content is nowadays published with Content Management Systems (CMS). As outlined in this paper, existing tools lack some functionalities to create and manage multilingual composite documents efficiently. In another domain, the International Federation of Library Associations and Institutions (IFLA) published the Functional Requirements for Bibliographic Records (FRBR) to lay the foundation for cataloguing documents and their various versions, translations and formats, setting the focus on the intellectual work.
   Using the FRBR concepts as guidelines, we introduce a tree-based model to describe relations between a digital document's various versions, translations and formats. Content negotiation and relationships between documents at the highest level of the tree allow composite documents to be rendered according to a user's preferences (e.g. language, user agent...). The proposed model has been implemented and validated within the Sydonie framework, a research and industrial project. Sydonie implements our model in a CMS-like tool to imagine new ways to create, edit and publish multilingual composite documents.
Keywords: composite documents, document management system, multilingual documents

Authoring

A social approach to authoring media annotations 17-26
  Roberto Fagá, Jr.; Vivian Genaro Motti; Renan Gonçalves Cattelan; Cesar Augusto Camillo Teixeira; Maria da Graça Campos Pimentel
End-user generated content is responsible for the success of several collaborative applications, as can be seen in the context of the web. The collaborative use of some of these applications is made possible, in many cases, by the availability of annotation features which allow users to include commentaries on each other's content. In this paper we first discuss the opportunity of defining vocabularies that allow third-party applications to integrate annotations to end-user generated documents, and present a proposal for such a vocabulary. We then illustrate the usefulness of our proposal by detailing a tool which allows users to add multimedia annotations to end-user generated video content.
Keywords: annotation, collaboration, multimodal, open vocabulary, video, watch-and-comment, YouTube
Creating and sharing personalized time-based annotations of videos on the web 27-36
  Rodrigo Laiola Guimarães; Pablo Cesar; Dick C. A. Bulterman
This paper introduces a multimedia document model that can structure community comments about media. In particular, we describe a set of temporal transformations for multimedia documents that allow end-users to create and share personalized timed-text comments on third party videos. The benefit over current approaches lies in the use of a rich captioning format that is not embedded into a specific video encoding format. Using a Web-based video annotation tool as an example, this paper describes the possibility of merging video clips from different video providers into a logical unit to be captioned, and tailoring the annotations to specific friends or family members. In addition, the described transformations allow for selective viewing and navigation through temporal links, based on end-users' comments. We also report on a predictive timing model for synchronizing unstructured comments with specific events within a video. The contributions described in this paper bring significant implications to be considered in the analysis of rich media social networking sites and the design of next generation video annotation tools.
Keywords: document transformations, smiltext, temporal hyperlinks, timed end-user comments, video annotation tools
"This conversation will be recorded": automatically generating interactive documents from captured media 37-40
  Didier Augusto Vega-Oliveros; Diogo Santana Martins; Maria da Graça Campos Pimentel
Synchronous communication tools allow remote users to collaborate by exchanging text, audio, images or video messages in synchronous sessions. In some scenarios, it is paramount that collaborative synchronous sessions be recorded for later review. In particular, in the case of web conferencing tools, the approach usually adopted for recording a meeting is to generate a linear video with the content of the exchanged media. Such an approach limits the review of a meeting to users watching a video using traditional timeline-based video controls. In this work we advocate that interactive multimedia documents can be generated automatically as a result of capturing a synchronous session. We outline our approach presenting a case study involving remote communication, and detail the generation of a multimedia document by means of operators focusing on the interaction among the collaborating users.
Keywords: automatic authoring, interactive video

Tools

Document imaging security and forensics ecosystem considerations 41-50
  Steven J. Simske; Margaret Sturgill; Guy Adams; Paul Everest
Much of the focus in document security tends to be on the deterrent -- the physical (printed, manufactured) item placed on a document, often used for routing in addition to security purposes. Hybrid (multiple) deterrents are not always reliably read by a single imaging device, and so a single device generally cannot simultaneously provide overall document security. We herein show how a relatively simple deterrent can be used in combination with multiple imaging devices to provide document security. In this paper, we show how these devices can be used to classify the printing technology used, a subject of importance for counterfeiter identification as well as printer quality control. Forensic-level imaging is also useful in preventing repudiation and forging, while mobile and/or simple scanning can be used to prevent tampering -- propitiously in addition to providing useful, non-security related, capabilities such as document routing (track and trace) and workflow association.
Keywords: 3D bar codes, color tiles, document fraud, forensics, high-resolution imaging, security
XUIB: XML to user interface binding 51-60
  Lendle Tseng; Yue-Sun Kuo; Hsiu-Hui Lee; Chuen-Liang Chen
Separated from GUI builders, existing GUI building tools for XML are complex systems that duplicate many functions of GUI builders. They are specialized in building GUIs for XML data only. This puts unnecessary burden on developers since an application usually has to handle both XML and non-XML data. In this paper, we propose a solution that separates the XML-to-GUI bindings from the construction of the GUIs for XML, and concentrates on the XML-to-GUI bindings only. Cooperating with a GUI builder, the proposed system can support the construction of the GUIs for XML as GUI building tools for XML can. Furthermore, the proposed mechanism is neutral to GUI builders and toolkits. As a result, multiple GUI builders and toolkits can be supported by the proposed solution with moderate effort. Our current implementation supports two types of GUI platforms: Java/Swing and Web/Html.
Keywords: GUI, GUI builders, XML, XML authoring, user interface
From templates to schemas: bridging the gap between free editing and safe data processing 61-64
  Vincent Quint; Cécile Roisin; Stéphane Sire; Christine Vanoirbeek
In this paper we present tools that provide an easy way to edit XML content directly on the web, with the usual benefit of valid XML content. These tools make it possible to create content targeted for lightweight web applications. Our approach uses (1) the XTiger template language, (2) the AXEL JavaScript library for authoring structured XML content and (3) XSLT transformations for generating XML schemas against which the XML content can be validated. Template-driven editing allows any web user to easily enter content while schemas make sure applications can safely process this content.
Keywords: XML, document authoring, document language, web editing
Lessons from the dragon: compiling PDF to machine code 65-68
  Steven R. Bagley
Page Description Languages, such as PDF or PostScript, describe the page as a series of graphical operators, which are then imaged to draw the page content. An interpreter executes these operators one-by-one every time the page is rendered into a viewable form. Typically, this interpreter takes the form of a tokenizer that splits the page description into the separate operators. Various subroutines are then called depending on which tokens are encountered. This process is analogous to instruction execution at the heart of a CPU: the CPU fetches machine code instructions from memory and dispatches them to the various parts of the chip as necessary.
   In this paper, we show that it is possible to compile a page description directly into machine code, bypassing the need to interpret the page description. This can bring a speed increase in PDF rendering -- particularly important on low-power devices -- and could also help increase document accessibility.
Keywords: PDF, compilation, document format, interpretation, page description languages

E-books

Transquotation in EBooks 69-72
  Steven A. Battle; Matthew Bernius
This paper describes the use of transquotation in eBooks to support collaborative publishing. Users are able to prepare content on a wiki and assemble this into a publishable eBook. The eBook content should remain connected to its origin so that as the wiki content changes, the eBook may be revised accordingly. The problems raised in creating a transquoting eBook editing environment include transforming wiki content into a suitable presentation, mapping selections back to plain-text, and mapping existing selections into the presentation space for review.
Keywords: eBooks, transclusion, transquotation
Table of contents recognition for converting PDF documents in e-book formats 73-76
  Simone Marinai; Emanuele Marino; Giovanni Soda
We describe a tool for Table of Contents (ToC) identification and recognition from PDF books. This task is part of ongoing research on the development of tools for the semi-automatic conversion of PDF documents into the Epub format that can be read on several E-book devices. Among various sub-tasks, the ToC extraction and recognition is particularly useful for an easy navigation of book contents.
   The proposed tool first identifies the ToC pages. The bounding boxes of ToC titles in the book body are subsequently found in order to add suitable links in the Epub ToC. The proposed approach is tolerant of discrepancies between the ToC text and the corresponding titles. We evaluated the tool on several open access books edited by University Presses that are partners of the OAPEN EcontentPlus project.
Keywords: PDF, e-book conversion, table of content

Editing

Using versioned tree data structure, change detection and node identity for three-way XML merging 77-86
  Cheng Thao; Ethan V. Munson
XML has become the standard document representation for many popular tools in various domains. When multiple authors collaborate to produce a document, they must be able to work in parallel and periodically merge their efforts into a single work. While there exist a small number of three-way XML merging tools, their performance could be improved in several areas and they lack any form of user interface for resolving conflicts.
   In this paper, we present an implementation of a three-way XML merge algorithm that is faster, uses less memory and is more precise than existing tools. It uses a specialized versioning tree data structure that supports node identity and change detection. The algorithm applies the traditional three-way merge found in GNU diff3 to the children of changed nodes. The editing operations it supports are addition, deletion, update, and move. A graphical interface for visualizing and resolving conflicts is also provided. An evaluation experiment was conducted comparing the proposed algorithm with three other tools on randomly generated XML data.
Keywords: XML, document trees, three-way merge, versioning system
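The three-way merge rule mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes every node already carries a stable identity, represents each version as a hypothetical flat mapping from node id to content, and uses `None` for a node that is absent (deleted or not yet added). The function names `merge_value` and `merge_nodes` are invented for this illustration, and moves and child ordering are not handled.

```python
from typing import Dict, List, Optional, Tuple


def merge_value(base: Optional[str], ours: Optional[str],
                theirs: Optional[str]) -> Tuple[Optional[str], bool]:
    """Classic three-way rule for one node; returns (merged value, conflict flag)."""
    if ours == theirs:      # both sides agree (including both unchanged)
        return ours, False
    if ours == base:        # only "theirs" changed the node
        return theirs, False
    if theirs == base:      # only "ours" changed the node
        return ours, False
    return ours, True       # both changed differently: keep ours, flag conflict


def merge_nodes(base: Dict[str, str], ours: Dict[str, str],
                theirs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Merge three versions keyed by node id; absent ids are treated as None."""
    merged: Dict[str, str] = {}
    conflicts: List[str] = []
    for node_id in sorted(set(base) | set(ours) | set(theirs)):
        value, conflict = merge_value(
            base.get(node_id), ours.get(node_id), theirs.get(node_id))
        if conflict:
            conflicts.append(node_id)
        if value is not None:   # None means the node is deleted in the result
            merged[node_id] = value
    return merged, conflicts
```

Keying on node identity is what lets an update on one side and a deletion on the other be detected as a conflict instead of being mistaken for an unrelated insertion.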
A model for editing operations on active temporal multimedia documents 87-96
  Jack Jansen; Pablo Cesar; Dick C. A. Bulterman
Inclusion of content with temporal behavior in a structured document leads to such a document gaining temporal semantics. If we then allow changes to the document during its presentation, this brings with it a number of fundamental issues that are related to those temporal semantics. In this paper we study modifications of active multimedia documents and the implications of those modifications for temporal consistency. Such modifications are becoming increasingly important as multimedia documents move from being primarily a standalone presentation format to being a building block in a larger application.
   We present a categorization of modification operations, where each category has distinct consistency and implementation implications for the temporal semantics. We validate the model by applying it to the SMIL language, categorizing all possible editing operations. Finally, we apply the model to the design of a teleconferencing application, where multimedia composition is only a small component of the whole application, and needs to be reactive to the rest of the system.
   The primary contribution of this paper is the development of a temporal editing model and a general analysis which we feel can help application designers to structure their applications such that the temporal impact of document modification can be minimized.
Keywords: application design, declarative languages, dynamic transformations, multimedia
Semantics-based change impact analysis for heterogeneous collections of documents 97-106
  Serge Autexier; Normen Müller
An overwhelming amount of documents is produced and changed every day in most areas of our everyday life, such as, for instance, business, education, research or administration. The documents are seldom isolated artifacts but are related to other documents. Therefore changing one document possibly requires adaptations to other documents.
   Although dedicated tools may provide some assistance when changing documents, they often ignore other documents or documents of a different type. To resolve that discontinuity, we present a framework that embraces existing document types and supports the declarative specification of semantic annotation and propagation rules inside and across documents of different types, and on which basis we define change impact analysis for heterogeneous collections of documents.
   The framework is implemented in the tool GMoC which can be used to semantically annotate collections of documents and to analyze the impacts of changes made in different documents of a collection.
Keywords: change impact analysis, document collections, document management, graph rewriting, semantics
Linking data and presentations: from mapping to active transformations 107-110
  Olivier Beaudoux; Arnaud Blouin
Modern GUI toolkits, and especially RIA ones, propose the concept of binding to dynamically link domain data and their presentations. Bindings are very simple to use for predefined graphical components. However, they remain dependent on the GUI platform, are not as expressive as transformation languages, and require specific coding when designing new graphical components. A solution to such issues is to use active transformations: an active transformation is a transformation that dynamically links source data to target data. Active transformations are however complex to write and/or to process. In this paper, we propose the use of the AcT framework that consists of: a platform-independent mapping language that masks the complexity of active transformations; a graphical mapping editor; and an implementation on the .NET platform.
Keywords: active transformation, mapping, model driven engineering
Blocked recursive image composition with exclusion zones 111-114
  Hui Chao; Daniel R. Tretter; Xuemei Zhang; C. Brian Atkins
Photo collages are a popular and powerful storytelling mechanism. They are often enhanced with background artwork that sets the theme for the story. However, layout algorithms for photo collage creation typically do not take this artwork into account, which can result in collages where photos overlay important artwork elements. To address this, we extend our previous Blocked Recursive Image Composition (BRIC) method to allow any number of photos to be automatically arranged around preexisting exclusion zones on a canvas (exBRIC). We first generate candidate binary splitting trees to partition the canvas into regions that accommodate both photos and exclusion zones. We use a Cassowary constraint solver to ensure that the desired exclusion zones are not covered by photos. Finally, photo areas, exclusion zones and layout symmetry are evaluated to select the best candidate. This method provides flexible, dynamic and integrated photo layout with background artwork.

Document systems

Differential access for publicly-posted composite documents with multiple workflow participants 115-124
  Helen Y. Balinsky; Steven J. Simske
A novel mechanism for providing and enforcing differential access control for publicly-posted composite documents is proposed. The concept of a document is rapidly changing: individual file-based, traditional formats can no longer accommodate the required mixture of differently formatted parts: individual images, video/audio clips, PowerPoint presentations, html-pages, Word documents, Excel spreadsheets, pdf files, etc. Multi-part composite documents are created and managed in complex workflows, with participants including external consultants, partners and customers distributed across the globe, with many no longer contained within one monolithic secure environment. Distributed over non-secure channels, these documents carry different types of sensitive information: examples include (a) an enterprise pricing strategy for new products, (b) employees' personal records, (c) government intelligence, and (d) individual medical records. A central server solution is often hard or impossible to create and maintain for ad-hoc workflows. Thus, the documents are often circulated between workflow participants over traditional, low security e-mails, placed on shared drives, or exchanged using CD/DVD or USB. The situation is more complicated when multiple workflow participants need to contribute to various parts of such a document with different access levels: for example, full editing rights, read-only, reading of some parts only, etc., for different users. We propose a full scale differential access control approach, enabling public posting of composite documents, to address these concerns.
Keywords: access control, composite document, document security, policy
Assessing the readability of clinical documents in a document engineering environment 125-134
  Mark Truran; Gersende Georg; Marc Cavazza; Dong Zhou
Previous work has established that specific linguistic markers present in specialised medical documents (clinical guidelines) can be used to support their automatic structuring within a document engineering environment. This technique is commonly used by the French Health Authority (la Haute Autorité de Santé) during elaboration of clinical guidelines to improve the quality of the final document. In this paper, we explore the readability of clinical guidelines. We discuss a structural measure of document readability that exploits the ratio between these linguistic markers (deontic structures) and the remainder of the text. We describe an experiment in which a corpus of 10 French clinical guidelines is scored for structural readability. We correlate these scores with measures of textual cohesion (computed using latent semantic analysis) and the results of a readability survey performed by a panel of domain experts. Our results suggest an association between the density of deontic structures in a clinical guideline and its overall readability. This implies that certain generic readability measures can henceforth be utilised in our document engineering environment.
Keywords: LSA, cohesion, latent semantic analysis, medical document processing, readability
Optimized reprocessing of documents using stored processor state 135-138
  James A. Ollis; David F. Brailsford; Steven R. Bagley
Variable Data Printing (VDP) allows customised versions of material such as advertising flyers to be readily produced. However, VDP is often extremely demanding of computing resources because, even when much of the material stays invariant from one document instance to the next, it is often simpler to re-evaluate the page completely rather than identifying just the portions that vary.
   In this paper we explore, in an XML/XSLT/SVG workflow and in an editing context, the reduction of the processing burden that can be realised by selectively reprocessing only the variant parts of the document. We introduce a method of partial re-evaluation that relies on re-engineering an existing XSLT parser to handle, at each XML tree node, both the storage and restoration of state for the underlying document processing framework. Quantitative results are presented for the magnitude of the speed-ups that can be achieved.
   We also consider how changes made through an appearance-based interactive editing scheme for VDP documents can be automatically reflected in the document view via optimised XSLT re-evaluation of sub-trees that are affected either by the changed script or by altered data.
Keywords: SVG, VDP, XSLT, document authoring, document editing, partial re-evaluation, variable data documents
APEX: automated policy enforcement eXchange 139-142
  Steven J. Simske; Helen Balinsky
The changing nature of document workflows, document privacy and document security merit a new approach to the enforcement of policy. We propose the use of automated means for enforcing policy, which provides advantages for compliance and auditing, adaptability to changes in policy, and compatibility with a cloud-based exchange. We describe the Automated Policy Enforcement eXchange (APEX) software system, which consists of: (1) a policy editor, (2) a policy server, (3) a local daemon on every PC/laptop to maintain local secure up-to-date storage and policy, and (4) local (policy-enforcing) wrappers to capture document-handling user actions such as document export, e-mail, print, edit and save. During the performance of relevant incremental change, or other user-elicited action, on a composite document, the document and its metadata are scanned for salient policy eliciting terms (PETs). The document is then partitioned based on relevant policies and the security policy for each part is determined. If the document contains no PETs, then the user-initiated actions are allowed; otherwise, alternative actions are suggested, including: (a) encryption, (b) redirecting to a secure printer and requiring authorization (e.g. PIN) for printing, and (c) disallowing printing until specific sensitive data is removed.
Keywords: document system components, document systems, policy, policy editor, policy server, security, text analysis

Analysis

Unsupervised font reconstruction based on token co-occurrence 143-150
  Michael Patrick Cutter; Joost van Beusekom; Faisal Shafait; Thomas Michael Breuel
High quality conversions of scanned documents into PDF usually either rely on full OCR or token compression. This paper describes an approach intermediate between those two: it is based on token clustering, but additionally groups tokens into candidate fonts. Our approach has the potential of yielding OCR-like PDFs when the inputs are high quality and degrading to token based compression when the font analysis fails, while preserving full visual fidelity. Our approach is based on an unsupervised algorithm for grouping tokens into candidate fonts. The algorithm constructs a graph based on token proximity and derives token groups by partitioning this graph. In initial experiments on scanned 300 dpi pages containing multiple fonts, this technique reconstructs candidate fonts with 100% accuracy.
Keywords: candidate fonts, font reconstruction, token co-occurrence graph partitioning, token compression
Document structure meets page layout: loopy random fields for web news content extraction 151-160
  Alex Spengler; Patrick Gallinari
Web content extraction is concerned with the automatic identification of semantically interesting web page regions. To generalize to pages from unknown sites, it is crucial to exploit not only the local characteristics of a particular web page region, but also the rich interdependencies that exist between the regions and their latent semantics. We therefore propose a loopy conditional random field which combines semantic intra-page dependencies derived from both document structure and page layout, uses a realistic set of local and relational features and is efficiently learnt in the tree-based reparameterization framework. The results of our empirical analysis on a corpus of real-world news web pages from 177 distinct sites with multiple annotations on DOM node level demonstrate that our combination of document structure and layout-driven interdependencies leads to a significant error reduction on the semantically interesting regions of a web page.
Keywords: loopy conditional random fields, news600 data set, tree-based reparameterization, web content extraction
Comparison of global and cascading recognition systems applied to multi-font Arabic text 161-164
  Fouad Slimane; Slim Kanoun; Adel M. Alimi; Jean Hennebert; Rolf Ingold
A known difficulty of Arabic text recognition is the large variability of printed representation from one font to another. In this paper, we present a comparative study between two strategies for the recognition of multi-font Arabic text. The first strategy is to use a global recognition system working independently on all the fonts. The second strategy is to use a so-called cascade built from a font identification system followed by font-dependent systems. In order to reach a fair comparison, the feature extraction and the modeling algorithms based on HMMs are kept as similar as possible between both approaches. The evaluation is carried out on the large and publicly available APTI (Arabic Printed Text Image) database with 10 different fonts. The results show a clear performance advantage for the cascading approach. However, the cascading system is more costly in terms of CPU and memory.
Keywords: APTI, GMM, HMM, font recognition, text recognition
Automatic selection of print-worthy content for enhanced web page printing experience 165-168
  Suk Hwan Lim; Liwei Zheng; Jianming Jin; Huiman Hou; Jian Fan; Jerry Liu
The user experience of printing web pages has not been very good. Web pages typically contain contents that are not print-worthy or informative such as side bars, footers, headers, advertisements, and auxiliary information for further browsing. Since the inclusion of such contents degrades the web printing experience, we have developed a tool that first selects the main part of the web page automatically and then allows users to make adjustments. In this paper, we describe the algorithm for selecting the main content automatically during the first pass. The web page is first segmented into several coherent areas or blocks using our web page segmentation method that clusters content based on the affinity values between basic elements. The relative importance values for the segmented blocks are computed using various features and the main content is extracted based on the constraint of one DOM (Document Object Model) sub-tree and high importance scores. We evaluated our algorithm on 65 web pages and computed the accuracy based on area of overlap between the ground truth and the extracted result of the algorithm.
Keywords: block importance, segmentation, web page layout analysis, web page printing
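The area-of-overlap evaluation mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact accuracy formula, so this example assumes intersection-over-union between two axis-aligned rectangles; the function names and the (left, top, right, bottom) rectangle convention are hypothetical.

```python
def area(rect):
    """Area of a (left, top, right, bottom) rectangle; 0 if it is empty."""
    left, top, right, bottom = rect
    return max(0, right - left) * max(0, bottom - top)


def overlap_ratio(truth, extracted):
    """Intersection-over-union of two rectangles (assumed accuracy measure)."""
    # The intersection is bounded by the innermost edges of both rectangles.
    intersection = (
        max(truth[0], extracted[0]),
        max(truth[1], extracted[1]),
        min(truth[2], extracted[2]),
        min(truth[3], extracted[3]),
    )
    inter_area = area(intersection)
    union_area = area(truth) + area(extracted) - inter_area
    return inter_area / union_area if union_area else 0.0
```

Under this assumption, a score of 1.0 means the extracted region matches the ground truth exactly, and 0.0 means the two regions do not overlap at all.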

Creation/printing

A new model for automated table layout BIBAKFull-Text 169-176
  Mihai Bilauca; Patrick Healy
In this paper we consider the table layout problem. We present a combinatorial optimization modeling method for the table layout optimization problem, the problem of minimizing a table's height subject to it fitting on a given page (width). We present two models of the problem and report on their evaluation.
Keywords: constrained optimization, table layout
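The problem statement above (minimize table height subject to the table fitting a given page width) can be sketched with a small exhaustive search. This is not one of the two models from the paper; it assumes a simplified setting in which each cell holds a number of characters, column widths are integers measured in characters, and a cell's height is the number of lines its text wraps to. All names are hypothetical.

```python
from itertools import product
from math import ceil


def table_height(cells, widths):
    """Total height in text lines: each row is as tall as its tallest cell."""
    return sum(
        max(max(1, ceil(chars / width)) for chars, width in zip(row, widths))
        for row in cells
    )


def best_layout(cells, page_width):
    """Try every integer width assignment that fits the page; keep the shortest."""
    n_cols = len(cells[0])
    best = None
    for widths in product(range(1, page_width + 1), repeat=n_cols):
        if sum(widths) > page_width:
            continue  # violates the page-width constraint
        height = table_height(cells, widths)
        if best is None or height < best[0]:
            best = (height, widths)
    return best
```

For example, `best_layout([[10, 2], [4, 8]], 6)` searches all pairs of column widths that sum to at most 6 characters. The search is exponential in the number of columns, which is why formulations suited to a constrained-optimization solver are of interest for realistic tables.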
PDF profiling for B&W versus color pages cost estimation for efficient on-demand book printing BIBAKFull-Text 177-180
  Fabio Giannetti; Gary Dispoto; Rafael Dueire Lins; Gabriel de França Pereira e Silva; Alexis Cabeda
Today, the way books, magazines and newspapers are published is undergoing a democratic revolution. Digital presses have enabled the on-demand model, which provides individuals with the opportunity to produce and publish their own books with very low upfront cost. With these new markets, opportunities and challenges have arisen. In a traditional environment, black-and-white and color pages were printed using different presses. Later on, the book was assembled combining the pages accordingly. In a digital workflow all the pages are printed with the same press, although the page cost varies significantly between color and b/w pages. Having an accurate printing cost profiler for PDF files is fundamental for the print-on-demand business, as jobs often have a mix of color and b/w pages. To meet the expectations of some of HP's customers in the large Print Service Providers (PSPs) business, a profiler was developed which yielded a reasonable cost estimate. The industrial use of such a tool showed some discrepancies between the estimate and the printer log, however. The new profiler presented herein provides a more accurate account of PDF jobs to be printed. Tested on 79 "real world" PDF jobs, totaling 7,088 pages, the new profiler made only one page misclassification, while the previous one yielded 54 classification errors.
Keywords: PDF profiling, digital presses, printing costs
Next generation typeface representations: revisiting parametric fonts BIBAKFull-Text 181-184
  Tamir Hassan; Changyuan Hu; Roger D. Hersch
Outline font technology has long been established as the standard way to represent typefaces, allowing characters to be represented independently of print size and resolution. Although outline font technologies are mature and produce results of sufficient quality for professional printing applications, they are inherently inflexible, which presents limitations in a number of document engineering applications. In the 1990s, finding a successor to outline fonts was a hot research topic. Unfortunately, none of the methods developed at the time succeeded in replacing outline font technology, and this field of research has since declined sharply in popularity.
   In this paper, we revisit a parametric font format developed between 1995 and 2001 by Hu and Hersch, where characters are built up from connected shape components. We extend this representation and use it to synthesize several characters from the Frutiger typeface and alter their weights by setting the relevant parameters. These settings are automatically propagated to the other characters of the font family.
   To conclude, we provide a discussion on next-generation font technologies in the light of today's Web-centric technologies and suggest applications that could greatly benefit from the use of flexible, parametric font representations.
Keywords: digital typography, font representation, font synthesis, parameterized fonts, parametric fonts, re-typesetting

Document engineering I: posters

DSMW: a distributed infrastructure for the cooperative edition of semantic wiki documents BIBAKFull-Text 185-186
  Hala Skaf-Molli; Gérôme Canals; Pascal Molli
DSMW is a distributed semantic wiki that offers new collaboration modes to semantic wiki users and supports dataflow-oriented processes.
   DSMW is an extension to Semantic MediaWiki (SMW) that allows the creation of a network of SMW servers sharing common semantic wiki pages. DSMW users can create communication channels between servers and use a publish-subscribe approach to manage change propagation. DSMW synchronizes concurrent updates of shared semantic pages to ensure their consistency.
Keywords: distribution, replication, semantic wiki
Open world classification of printed invoices BIBAKFull-Text 187-190
  Enrico Sorio; Alberto Bartoli; Giorgio Davanzo; Eric Medvet
A key step in the understanding of printed documents is their classification based on the nature of the information they contain and their layout. In this work we consider a dynamic scenario in which document classes are not known a priori and new classes can appear at any time. This open world setting is both realistic and highly challenging. We use an SVM-based classifier based only on image-level features and use a nearest-neighbor approach for detecting new classes. We assess our proposal on a real-world dataset composed of 562 invoices belonging to 68 different classes. These documents were digitized after being handled in a corporate environment and are therefore quite noisy, with, e.g., large stamps and handwritten signatures at unfortunate positions. The experimental results are highly promising.
Keywords: SVM, document image classification, machine learning, nearest-neighbor
Diffing, patching and merging XML documents: toward a generic calculus of editing deltas BIBAKFull-Text 191-194
  Jean-Yves Vion-Dury
This work addresses what we believe to be a central issue in the field of XML diff and merge computation: the mathematical modeling of the so-called "editing deltas" and the study of their formal abstract properties. We expect at least three outputs from this theoretical work: a common basis to compare the performance of the various algorithms through a structural normalization of deltas, a universal and flexible patch application model, and a clearer separation of patch and merge engine performance from delta generation performance. Moreover, this work could inspire technical approaches to combine heterogeneous engines thanks to sound delta transformations. This short paper reports current results, discusses key points and outlines some perspectives.
Keywords: XML, tree edit distance, tree transformation, tree-to-tree correction, version control
Contextual advertising for web article printing BIBAKFull-Text 195-198
  Shengwen Yang; Jianming Jin; Joshi Parag; Sam Liu
Advertisements provide the necessary revenue model supporting the Web ecosystem and its rapid growth. Targeted or contextual ad insertion plays an important role in optimizing the financial return of this model. Nearly all the current ad payment strategies such as "pay-per-impression" and "pay-per-click" on web pages are geared for electronic viewing purposes. Little attention, however, is focused on deriving additional ad revenues when the content is repurposed for an alternative means of presentation, e.g. being printed. Although more and more content is moving to the Web, there are still many occasions where printed output of web content or RSS feeds is desirable, such as maps and articles; thus printed ad insertion can potentially be lucrative.
   In this paper, we describe a cloud-based printing service that enables automatic contextual ad insertion, with respect to the main web page content, when a printout of the page is requested. To encourage service utilization, it would provide higher quality printouts than what is possible from current browser print drivers, which generally produce poor outputs -- ill formatted pages with lots of unwanted information, e.g. navigation icons. At this juncture we will limit the scope to only article-related web pages although the concept can be extended to arbitrary web pages. The key components of this system include (1) automatic extraction of article from web pages, (2) the ad service network for ad matching and delivering, and (3) joint content and ad printout creation.
Keywords: contextual advertisement, web printing
Table layout performance of document authoring tools BIBAKFull-Text 199-202
  Mihai Bilauca; Patrick Healy
In this paper we survey table creation in several popular document authoring programs and identify usability bottlenecks and inconsistencies between several of them. We discuss the user experience when drawing tables. We draw attention to the fact that authoring tables is still difficult and can be a frustrating and error-prone exercise, and that the development of high-quality table tools should be further pursued.
Keywords: table layout
Document product lines: variability-driven document generation BIBAKFull-Text 203-206
  Ma Carmen Penadés; José H. Canós; Marcos R. S. Borges; Manuel Llavador
In this paper, we propose a process model, which we call Document Product Lines, for the intensive generation of documents with variable content. Unlike current approaches, we identify the variability sources at the requirements level, including an explicit representation and management of these sources. The process model provides methodological guidance for the (semi)automated generation of customized editors following the principles, techniques, and available technologies of Software Product Line Engineering. We illustrate our proposal with its application to the intensive generation of Emergency Plans.
Keywords: document generation, emergency management, emergency plans, software product lines, variability management
Degraded dot matrix character recognition using CSM-based feature extraction BIBAKFull-Text 207-210
  Abderrahmane Namane; El Houssine Soubari; Patrick Meyrueis
This paper presents an OCR method for degraded character recognition applied to a reference number (RN) of 15 printed characters on an invoice document produced by a dot-matrix printer. First, the paper deals with the problem of reference number localization and extraction, in which the tops or bottoms of the characters may or may not touch a printed reference line of the electricity bill. When the RN touches the line, the extracted characters are severely degraded, with parts missing from their tops or bottoms. Secondly, a combined recognition method based on the complementary similarity measure (CSM) method and an MLP-based classifier is used. The CSM is used to accept or reject an incoming character. On acceptance, the CSM acts as a feature extractor and produces a feature vector of ten component features. The MLP is then trained using these feature vectors. The use of the CSM as a feature extractor tends to make the MLP very powerful and very well suited for rejection. Experimental results on electricity bills show the ability of the model to yield relevant and robust recognition on severely degraded printed characters.
Keywords: OCR, character recognition, dot matrix, feature extraction, multiple classification
Picture detection in document page images BIBAFull-Text 211-214
  Patrick Chiu; Francine Chen; Laurent Denoue
We present a method for picture detection in document page images, which can come from scanned or camera images, or rendered from electronic file formats. Our method uses OCR to separate out the text and applies the Normalized Cuts algorithm to cluster the non-text pixels into picture regions. A refinement step uses the captions found in the OCR text to deduce how many pictures are in a picture region, thereby correcting for under- and over-segmentation. A performance evaluation scheme is applied which takes into account the detection quality and fragmentation quality. We benchmark our method against the ABBYY application on page images from conference papers.
Down to the bone: simplifying skeletons BIBAKFull-Text 215-218
  Jannis Stoppe; Björn Gottfried
This paper is about off-line handwritten text comparison of historic documents. The long-term motivation is the support of palaeographic research, in particular to back up decisions as to whether two handwritings can be ascribed to the same author. In this paper, a first fundamental step is presented for extracting relevant structures from handwritten texts. Such structures are represented by skeletons, due to their resemblance to original writing movements.
   The core result is an approach to the simplification of skeleton structures. While skeletons represent constitutive structures for a wide variety of subsequent algorithms, simplification algorithms usually focus on pruning branches off the skeleton instead of simplifying the skeleton as a whole. By contrast, our approach reduces the number of elements in a skeleton based on a global error level, reducing the skeleton's complexity while keeping its structure as close to the original exemplar as possible. The results are much easier to analyse while relevant information is maintained.
Keywords: image processing, shapes, simplification, skeletons, text
Interactive layout analysis and transcription systems for historic handwritten documents BIBAKFull-Text 219-222
  Oriol Ramos-Terrades; Alejandro H. Toselli; Nicolas Serrano; Verónica Romero; Enrique Vidal; Alfons Juan
The number of digitized legacy documents has been rising dramatically over the last years, due mainly to the increasing number of on-line digital libraries publishing this kind of document, waiting to be classified and finally transcribed into a textual electronic format (such as ASCII or PDF). Nevertheless, most of the available fully-automatic applications addressing this task are far from perfect, and heavy, inefficient human intervention is often required to check and correct the results of such systems. In contrast, multimodal interactive-predictive approaches may allow the users to participate in the process, helping the system to improve the overall performance. With this in mind, two sets of recent advances are introduced in this work: a novel interactive method for text block detection and two multimodal interactive handwritten text transcription systems which use active learning and interactive-predictive technologies in the recognition process.
Keywords: handwriting recognition, interactive layout analysis, interactive predictive processing, partial supervision
Document conversion for cultural heritage texts: FrameMaker to HTML revisited BIBAKFull-Text 223-226
  Michael Piotrowski
Many large-scale digitization projects are currently under way that intend to preserve the cultural heritage contained in paper documents (in particular books) and make it available on the Web. Typically OCR is used to produce searchable electronic texts from books. For newer books, approximately from the late 1980s onwards, digital text may already exist in the form of typesetting data. For applications that require a higher level of accuracy than OCR can deliver, the conversion of typesetting data can thus be an alternative to manual keying. In this paper, we describe a tool for converting typesetting data in FrameMaker format to XHTML+CSS developed for a collection of source editions of medieval and early modern documents. Even though the books of the Collection are typeset in good quality and in modern typefaces, OCR is unusable, since the text is in various historical forms of German, French, Italian, Rhaeto-Romanic, and Latin. The conversion of typesetting data produces fully reliable text free from OCR errors and thus also provides a basis for the construction of language resources for the processing of historical texts.
Keywords: CSS, XHTML, cultural heritage data, document format conversion, FrameMaker
Glyph extraction from historic document images BIBAKFull-Text 227-230
  Lothar Meyer-Lerbs; Arne Schuldt; Björn Gottfried
This paper is about the reproduction of ancient texts with vectorised fonts. While for OCR only recognition rates count, a reproduction process does not necessarily require the recognition of characters. Our system aims at extracting all characters from printed historic documents without the employment of knowledge of language, font, or writing system. It searches for the best prototypes and creates a document-specific font from these glyphs. To reach this goal, many common OCR preprocessing steps are no longer adequate. We describe the necessary changes of our system that deals particularly with documents typeset in Fraktur. On the one hand, algorithms are described that extract glyphs accurately for the purpose of precise reproduction. On the other hand, classification results of extracted Fraktur glyphs are presented for different shape descriptors.
Keywords: document-specific font, glyph classification, glyph extraction, glyph shape, image enhancement
Style and branding elements extraction from business web sites BIBAKFull-Text 231-234
  Limei Jiao; Suk Hwan Lim; Nina Bhatti; Yuhong Xiong; Jerry Liu
We describe a method to extract style and branding elements from multiple web pages in a given site for content repurposing. Style and branding elements convey the values of the site owners effectively and connect with the target prospects. They are manifested through logos, graphical elements, background color, font styles, font colors and other illustrations. Our method automatically extracts color and image elements appearing frequently and prominently on multiple pages throughout the site. We rely on a DOM tree matching method to obtain the frequency of re-occurring elements and use relative sizes and positions of elements to determine the type of elements. Note that approximate locations of these elements provide an added clue to the content repurposing engine as to where to place the elements in the repurposed document. The obtained results show that the proposed method can efficiently extract style and branding elements with high accuracy.
Keywords: high frequent elements extraction, style and branding extraction, tree matching

Document engineering II: posters

FormCracker: interactive web-based form filling BIBAKFull-Text 235-238
  Laurent Denoue; John Adcock; Scott Carter; Patrick Chiu; Francine Chen
Filling out document forms distributed by email or hosted on the Web is still problematic and usually requires a printer and scanner. Users commonly download and print forms, fill them out by hand, scan and email them. Even if the document is form-enabled (PDFs with FDF information), to read the file users still have to launch a separate application which may not be available, especially on mobile devices.
   FormCracker simplifies this process by providing an interactive, fully web-based document viewer that lets users complete forms online. Document pages are rendered as images and presented in a simple HTML-based viewer. When a user clicks in a form-field, FormCracker identifies the correct form-field type using lightweight image processing and heuristics based on nearby text. Users can then seamlessly enter data in form-fields such as text boxes, check boxes, radio buttons, multiple text lines, and multiple single-box characters. FormCracker also provides useful auto-complete features based on the field type, for example a date picker, a drop-down menu for city names, state lists, and an auto-complete text box for first and last names. Finally, FormCracker allows users to save and print the completed document.
   In summary, with FormCracker a user can efficiently complete and reuse any electronic form.
Keywords: document processing, form filling, image processing, interactive
Semantics-enriched document exchange BIBAKFull-Text 239-242
  Jingzhi Guo; Ming Sang Ho
In e-business development, semantics-oriented document exchange is becoming important, because it can support cross-domain user connection, business transaction and collaboration. To provide this support, this paper proposes a DOC Mechanism to exchange semantically interoperable business documents between heterogeneous enterprise information systems. This mechanism is designed on a layered-sign network, which enables any exchanged e-business document to be independently interpretable without losing semantic consistency.
Keywords: XML product map, concept, document engineering, document exchange, electronic business, representation, semantics, sign
Document and item-based modeling: a hybrid method for a socio-semantic web BIBAKFull-Text 243-246
  Jean-Pierre Cahier; Xiaoyue Ma; L'Hédi Zaher
The paper discusses the challenges of categorising documents and "items of the world" to promote knowledge sharing in large communities of interest. We present the DOCMA method (Document and Item-based Model for Action) dedicated to end-users who have minimal or no knowledge of information science. Community members can elicit, structure and index business items stemming from their queries, including projects, actors, products, places of interest, and geo-situated objects. This hybrid method has been applied in a collaborative Web portal in the field of sustainability for the past two years.
Keywords: document, folksonomy, method, socio-semantic web, web2.0
Structure-aware topic clustering in social media BIBAFull-Text 247-250
  Julien Dubuc; Sabine Bergler
The rapid evolution and growth of social media software has enabled hundreds of millions to interact within on-line communities on a global scale. While they enable communication through a common set of metaphors, such as discussion threads and quoting text in replies, this software uses a variety of diverging ways of representing discussion. Since the meaning of a conversation is defined not only by the content of a piece of text, but also by the relationships between pieces of text, part of the meaning of the discussion is obscured from automated processing.
   Search engines, which act as gateways to outsiders into the social text in a community, are reduced to giving an incomplete picture. This paper proposes a model for representing both the content and the structure of social text in a consistent way, enabling automated processing of the structure of the discussion along with its text content.
   It also describes a method for indexing text that uses this structural information to provide meaningful contexts for paragraphs of interest. It then describes a method for clustering text content into topic groups, using this indexing method, and also using the social structure to make informed decisions about which pieces of text to compare meaningfully.
Pre-evaluation of invariant layout in functional variable-data documents BIBAKFull-Text 251-254
  John Lumley
Layout of content in variable data documents can be computationally expensive. When very large numbers of almost similar copies of a document are required, automated pre-evaluation of invariant sections may increase efficiency of final document generation. If the layout model is functional and combinatorial in nature (such as in the Document Description Framework), there are some generalised conservative techniques to do this that involve very modest changes to implementations, independent of details of the actual layouts. This paper describes these techniques and how they might be used with other similar document layout models.
Keywords: SVG, XSLT, document construction, functional programming
Towards a common evaluation strategy for table structure recognition algorithms BIBAKFull-Text 255-258
  Tamir Hassan
A number of methods for evaluating table structure recognition systems have been proposed in the literature, which have been used successfully for automatic and manual optimization of their respective algorithms. Unfortunately, the lack of standard, ground-truthed datasets coupled with the ambiguous nature of how humans interpret tabular data has made it difficult to compare the obtained results between different systems developed by different research groups.
   With reference to these approaches, we describe our experiences in comparing our algorithm for table detection and structure recognition to another recently published system using a freely available dataset of 75 PDF documents. Based on examples from this dataset, we define several classes of errors and propose how they can be treated consistently to eliminate ambiguities and ensure the repeatability of the results and their comparability between different systems from different research groups.
Keywords: evaluation, ground truth, precision, recall, table detection, table recognition, table structure recognition
Using feature models for creating families of documents BIBAKFull-Text 259-262
  Sven Karol; Martin Heinzerling; Florian Heidenreich; Uwe Aßmann
Variants in a family of office documents are usually created by ad-hoc copy and paste actions from a set of base documents. As a result, the set of variants is decoupled from the original documents and is difficult to manage. In this paper we present a novel approach that uses concepts from Feature Oriented Domain Analysis (FODA) to specify document families and generate variants. As a proof of concept, we implemented the Document Feature Mapper tool, which is based on our previous experience in Software Product Line Engineering (SPLE) with FODA. In our tool, variant spaces are precisely specified using feature models and mappings relating features to slices in the document family. Given a selection of features satisfying the feature model's constraints, a variant can be derived. To show the applicability of our approach and tool, we conducted two case studies with documents in the Open Document Format (ODF).
Keywords: ODF, XML, document families, feature models, variants
Two new aesthetic measures for item alignment BIBAKFull-Text 263-266
  Aline Duarte Riva; Alexandre Kazuo Seki; João Batista Souza de Oliveira; Isabel Harb Mansour; Ricardo Farias Piccoli
This paper introduces two methods for measuring the alignment of items on a page with respect to its left/right margins. The methods are based on the path followed by the eyes as they follow the items from top to bottom of the page.
   Examples are presented and both methods are analyzed with respect to the axioms presented in [2], which describe how a good alignment measure is supposed to behave.
Keywords: document aesthetics, page alignment, page layout
Term frequency dynamics in collaborative articles BIBAKFull-Text 267-270
  Sérgio Nunes; Cristina Ribeiro; Gabriel David
Documents on the World Wide Web are dynamic entities. Mainstream information retrieval systems and techniques are primarily focused on the latest version of a document, generally ignoring its evolution over time. In this work, we study the term frequency dynamics in web documents over their lifespan. We use Wikipedia as a document collection because it is a broad and public resource and, more importantly, because it provides access to the complete revision history of each document. We investigate the progression of similarity values over two projection variables, namely revision order and revision date. Based on this investigation we find that term frequency in encyclopedic documents -- i.e. comprehensive and focused on a single topic -- exhibits a rapid and steady progression towards the document's current version. The content in early versions quickly becomes very similar to the present version of the document.
Keywords: document dynamics, term frequency, wikipedia
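The idea of tracking how similar each revision is to the current version can be sketched as follows. The abstract does not name the similarity measure, so this example assumes cosine similarity over raw term-frequency vectors with whitespace tokenization; the function names and the sample revisions are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text):
    """Raw term counts using lowercase whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_to_latest(revisions):
    """Similarity of every revision, in revision order, to the last one."""
    latest = term_frequencies(revisions[-1])
    return [cosine_similarity(term_frequencies(rev), latest) for rev in revisions]


# Hypothetical revision history of one article, oldest first.
history = ["a b", "a b c", "a b c d"]
progression = similarity_to_latest(history)
```

Plotting such a progression against revision order or revision date gives the two projections described in the abstract.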
A file-type sensitive, auto-versioning file system BIBAKFull-Text 271-274
  Arthur Müller; Sebastian Rönnau; Uwe M. Borghoff
Auto-versioning file systems offer a simple and reliable interface to document change control. The implicit versioning of documents at each write access catches the whole evolution of a document, thus supporting regulatory compliance rules. Most existing file systems work on low abstraction levels and track the document evolution on their binary representation. Higher-level differencing tools allow for a far more meaningful change-tracking, though.
   In this paper, we present an auto-versioning file system that is able to handle files depending on their file type. This way, a suitable differencing tool can be assigned to each file type. Our approach supports regulatory compliant storage as well as the archiving of documents.
Keywords: auto-versioning, document management, file system, regulatory compliance, version control
Medieval manuscript layout model BIBAKFull-Text 275-278
  Micheal Baechler; Rolf Ingold
Medieval manuscript layouts are quite complex. In addition to their main text flow, which can spread over one or several columns, such manuscripts also contain other textual elements such as insertions, annotations, and corrections. They are often richly decorated with ornaments, illustrations, and drop capitals, making their layout even more complex. In this paper we propose a generic layout model to represent their physical structure.
   To achieve this goal we propose to use four layers in order to distinguish between the different graphical elements. In this paper we show how this model is used to represent automatic segmentation results and how it allows a quantitative measure of their accuracy.
Keywords: annotation, layout, layout model, manuscript, medieval, medieval manuscript, segmentation
Using model driven engineering technologies for building authoring applications BIBAKFull-Text 279-282
  Olivier Beaudoux; Arnaud Blouin; Jean-Marc Jézéquel
Building authoring applications is a tedious and complex task that requires a high programming effort. Document technologies, especially XML based ones, can help in reducing such an effort by providing common bases for manipulating documents. Still, the overall task consists mainly of writing the application's source code. Model Driven Engineering (MDE) focuses on generating the source code from an exhaustive model of the application. In this paper, we illustrate that MDE technologies can be used to automate the development of authoring application components, but fail in generating the code of graphical components. We present our framework, called Malai, that aims to solve this issue.
Keywords: MDE, Malai, Malan, authoring applications
On Helmholtz's principle for documents processing BIBAKFull-Text 283-286
  Alexander A. Balinsky; Helen Y. Balinsky; Steven J. Simske
Keyword extraction is a fundamental problem in text data mining and document processing. A large number of document processing applications directly depend on the quality and speed of keyword extraction algorithms. In this article, a novel approach to rapid change detection in data streams and documents is developed. It is based on ideas from image processing and especially on the Helmholtz Principle from the Gestalt Theory of human perception. Applied to the problem of keyword extraction, it delivers fast and effective tools to identify meaningful keywords using parameter-free methods. We also define a level of meaningfulness of the keywords which can be used to modify the set of keywords depending on application needs.
Keywords: gestalt, Helmholtz principle, keyword extraction, meaningful words, rapid change detection