| Exploring the world's knowledge in the digital age | | BIBA | Full-Text | 1-2 | |
| Aly Kaloa Conteh | |||
| The advent of the digital age has brought about a dramatic change in the
techniques for the dissemination of, and access to, historical documents. For
today's researcher, the internet is a primary source of the tools and
information that support their research. The British Library (BL) provides
world-class information services to the academic, business, research and
scientific communities and offers unparalleled access to the world's largest
and most comprehensive research collection. The British Library's collections
include 150 million items from every era of written human history beginning
with Chinese oracle bones dating from 300 BC, right up to the latest
e-journals. For almost 20 years the British Library has been engaged in
transforming physical collection items into digital form, enabling access to
that content over the World Wide Web.
In this talk, I describe the depth and range of the collections that are held at the British Library. I will outline the digital conversion process we undertake, including our current solutions for digital formats, metadata standards, providing access to the content and preserving the digital outputs in perpetuity. Finally, as the British Library undertakes projects that will digitise millions of pages of historical text-based items, I will look at the challenges we face, such as storage requirements and enhancing resource discovery, and how we are addressing those challenges. | |||
| Document engineering for a digital library: PDF recompression using JBIG2 and other optimizations of PDF documents | | BIBAK | Full-Text | 3-12 | |
| Petr Sojka; Radim Hatlapatka | |||
| This paper describes several innovative document transformations and tools
that have been developed in the process of building the Digital Mathematical
Library DML-CZ http://dml.cz. The main result presented in this paper is our
PDF re-compression tool developed using the jbig2enc library. Together with
other programs, especially pdfsizeopt.py by Péter Szabó, we have
managed to decrease PDF storage size and transmission needs by 62%: using both
programs we reduced the PDFs to 38% of their original size.
This paper briefly describes other approaches and tools developed while creating the digital library. The batch digital signature stamper, the document similarity metrics, which use four different methods, a [meta]data validation process and some math OCR tools represent some of the main byproducts of this project. These ways of document engineering, together with Google Scholar indexing optimizations, have led to the success of serving digitized and born-digital scientific math documents to the public in DML-CZ, and will also be employed in the project of the European Digital Mathematics Library, EuDML. Keywords: authoring tools and systems, categorization, character recognition,
classification, digital mathematical library, digitisation workflow, document
presentation (typography, formatting, layout), representations/standards,
structure, layout and content analysis | |||
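The abstract reports reducing PDFs to 38% of their original size by combining jbig2enc-based recompression with pdfsizeopt.py. As a rough illustration of that workflow (not the authors' actual DML-CZ pipeline), the Python sketch below assumes a `pdfsizeopt` executable is available on the PATH and simply runs it on one file, then reports the achieved size ratio; the file names are placeholders.

```python
import os
import subprocess

def recompress_and_report(src: str, dst: str) -> float:
    """Run pdfsizeopt on src, write dst, and return the size ratio dst/src.

    Assumes Péter Szabó's pdfsizeopt is installed as a `pdfsizeopt`
    command; when built with jbig2enc support it re-compresses bitonal
    images with JBIG2, which is where most of the reported savings come from.
    """
    subprocess.run(["pdfsizeopt", src, dst], check=True)
    ratio = os.path.getsize(dst) / os.path.getsize(src)
    print(f"{dst}: {ratio:.0%} of original size ({1 - ratio:.0%} saved)")
    return ratio

if __name__ == "__main__":
    # a ratio of 0.38 would correspond to the 62% reduction cited above
    recompress_and_report("article.pdf", "article.opt.pdf")
```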
| Multilingual composite document management framework for the internet: an FRBR approach | | BIBAK | Full-Text | 13-16 | |
| Jean-Marc Lecarpentier; Cyril Bazin; Hervé Le Crosnier | |||
Most web content is nowadays published with Content Management Systems
(CMS). As outlined in this paper, existing tools lack some of the functionality
needed to create and manage multilingual composite documents efficiently. In another
domain, the International Federation of Library Associations and Institutions
(IFLA) published the Functional Requirements for Bibliographic Records (FRBR)
to lay the foundation for cataloguing documents and their various versions,
translations and formats, setting the focus on the intellectual work.
Using the FRBR concepts as guidelines, we introduce a tree-based model to describe relations between a digital document's various versions, translations and formats. Content negotiation and relationships between documents at the highest level of the tree allow composite documents to be rendered according to a user's preferences (e.g. language, user agent...). The proposed model has been implemented and validated within the Sydonie framework, a research and industrial project. Sydonie implements our model in a CMS-like tool to imagine new ways to create, edit and publish multilingual composite documents. Keywords: composite documents, document management system, multilingual documents | |||
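As a minimal sketch of the FRBR-inspired idea (not the Sydonie implementation), the following Python fragment models a work with language-specific expressions and format-specific manifestations, and performs a simple content negotiation over a user's ordered language preferences; all class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Manifestation:          # a concrete file: format + location
    media_type: str
    url: str

@dataclass
class Expression:             # one translation/version of the work
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)

@dataclass
class Work:                   # the intellectual work, root of the tree
    title: str
    expressions: Dict[str, Expression] = field(default_factory=dict)

    def negotiate(self, accept_languages: List[str],
                  media_type: str) -> Optional[Manifestation]:
        """Pick the manifestation matching the first acceptable language."""
        for lang in accept_languages:
            expr = self.expressions.get(lang)
            if expr:
                for m in expr.manifestations:
                    if m.media_type == media_type:
                        return m
        return None

work = Work("Press release")
work.expressions["fr"] = Expression("fr", [Manifestation("text/html", "/fr/cp.html")])
work.expressions["en"] = Expression("en", [Manifestation("text/html", "/en/pr.html")])
print(work.negotiate(["de", "en"], "text/html").url)   # -> /en/pr.html
```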
| A social approach to authoring media annotations | | BIBAK | Full-Text | 17-26 | |
| Roberto Fagá, Jr.; Vivian Genaro Motti; Renan Gonçalves Cattelan; Cesar Augusto Camillo Teixeira; Maria da Graça Campos Pimentel | |||
| End-user generated content is responsible for the success of several
collaborative applications, as can be seen in the context of the web. The
collaborative use of some of these applications is made possible, in many
cases, by the availability of annotation features which allow users to include
commentaries on each other's content. In this paper we first discuss the
opportunity of defining vocabularies that allow third-party applications to
integrate annotations to end-user generated documents, and present a proposal
for such a vocabulary. We then illustrate the usefulness of our proposal by
detailing a tool which allows users to add multimedia annotations to end-user
generated video content. Keywords: annotation, collaboration, multimodal, open vocabulary, video,
watch-and-comment, YouTube | |||
| Creating and sharing personalized time-based annotations of videos on the web | | BIBAK | Full-Text | 27-36 | |
| Rodrigo Laiola Guimarães; Pablo Cesar; Dick C. A. Bulterman | |||
| This paper introduces a multimedia document model that can structure
community comments about media. In particular, we describe a set of temporal
transformations for multimedia documents that allow end-users to create and
share personalized timed-text comments on third-party videos. The benefit over
current approaches lies in the use of a rich captioning format that is not
embedded into a specific video encoding format. Using as example a Web-based
video annotation tool, this paper describes the possibility of merging video
clips from different video providers into a logical unit to be captioned, and
tailoring the annotations to specific friends or family members. In addition,
the described transformations allow for selective viewing and navigation
through temporal links, based on end-users' comments. We also report on a
predictive timing model for synchronizing unstructured comments with specific
events within a video. The contributions described in this paper carry
significant implications for the analysis of rich-media social
networking sites and the design of next-generation video annotation tools. Keywords: document transformations, smiltext, temporal hyperlinks, timed end-user
comments, video annotation tools | |||
| "This conversation will be recorded": automatically generating interactive documents from captured media | | BIBAK | Full-Text | 37-40 | |
| Didier Augusto Vega-Oliveros; Diogo Santana Martins; Maria da Graça Campos Pimentel | |||
| Synchronous communication tools allow remote users to collaborate by
exchanging text, audio, images or video messages in synchronous sessions. In
some scenarios, it is paramount that collaborative synchronous sessions be
recorded for later review. In particular in the case of web conferencing tools,
the approach usually adopted for recording a meeting is to generate a linear
video with the content of the exchanged media. Such an approach limits the review
of a meeting to users watching a video using traditional timeline-based video
controls. In this work we advocate that interactive multimedia documents can be
generated automatically as a result of capturing a synchronous session. We
outline our approach, presenting a case study involving remote communication,
and detail the generation of a multimedia document by means of operators
focusing on the interaction among the collaborating users. Keywords: automatic authoring, interactive video | |||
| Document imaging security and forensics ecosystem considerations | | BIBAK | Full-Text | 41-50 | |
| Steven J. Simske; Margaret Sturgill; Guy Adams; Paul Everest | |||
| Much of the focus in document security tends to be on the deterrent -- the
physical (printed, manufactured) item placed on a document, often used for
routing in addition to security purposes. Hybrid (multiple) deterrents are not
always reliably read by a single imaging device, and so a single device
generally cannot simultaneously provide overall document security. We herein
show how a relatively simple deterrent can be used in combination with multiple
imaging devices to provide document security. In this paper, we show how these
devices can be used to classify the printing technology used, a subject of
importance for counterfeiter identification as well as printer quality control.
Forensic-level imaging is also useful in preventing repudiation and forgery,
while mobile and/or simple scanning can be used to prevent tampering --
propitiously in addition to providing useful, non-security related,
capabilities such as document routing (track and trace) and workflow
association. Keywords: 3D bar codes, color tiles, document fraud, forensics, high-resolution
imaging, security | |||
| XUIB: XML to user interface binding | | BIBAK | Full-Text | 51-60 | |
| Lendle Tseng; Yue-Sun Kuo; Hsiu-Hui Lee; Chuen-Liang Chen | |||
| Separated from GUI builders, existing GUI building tools for XML are complex
systems that duplicate many functions of GUI builders. They are specialized in
building GUIs for XML data only. This puts an unnecessary burden on developers
since an application usually has to handle both XML and non-XML data. In this
paper, we propose a solution that separates the XML-to-GUI bindings from the
construction of the GUIs for XML, and concentrates on the XML-to-GUI bindings
only. Cooperating with a GUI builder, the proposed system can support the
construction of the GUIs for XML as GUI building tools for XML can.
Furthermore, the proposed mechanism is neutral to GUI builders and toolkits. As
a result, multiple GUI builders and toolkits can be supported by the proposed
solution with moderate effort. Our current implementation supports two types of
GUI platforms: Java/Swing and Web/HTML. Keywords: GUI, GUI builders, XML, XML authoring, user interface | |||
| From templates to schemas: bridging the gap between free editing and safe data processing | | BIBAK | Full-Text | 61-64 | |
| Vincent Quint; Cécile Roisin; Stéphane Sire; Christine Vanoirbeek | |||
| In this paper we present tools that provide an easy way to edit XML content
directly on the web, with the usual benefit of valid XML content. These tools
make it possible to create content targeted for lightweight web applications.
Our approach uses (1) the XTiger template language, (2) the AXEL Javascript
library for authoring structured XML content and (3) XSLT transformations for
generating XML schemas against which the XML content can be validated.
Template-driven editing allows any web user to easily enter content while
schemas make sure applications can safely process this content. Keywords: XML, document authoring, document language, web editing | |||
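The entry above describes validating template-authored content against generated schemas. As a hedged illustration of only the validation step (the XTiger/AXEL editing and the XSLT-based schema generation are not shown), the sketch below uses lxml to check a document against an XML Schema; file names are hypothetical.

```python
from lxml import etree

def is_valid(xml_path: str, xsd_path: str) -> bool:
    """Validate a document (e.g. produced by a template-driven editor)
    against an XML Schema generated from the template."""
    schema = etree.XMLSchema(etree.parse(xsd_path))
    doc = etree.parse(xml_path)
    ok = schema.validate(doc)
    if not ok:
        for err in schema.error_log:
            print(f"line {err.line}: {err.message}")
    return ok

if __name__ == "__main__":
    # hypothetical files; the schema would come from the XSLT generation step
    print(is_valid("entry.xml", "entry-generated.xsd"))
```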
| Lessons from the dragon: compiling PDF to machine code | | BIBAK | Full-Text | 65-68 | |
| Steven R. Bagley | |||
| Page Description Languages, such as PDF or PostScript, describe the page as
a series of graphical operators, which are then imaged to draw the page
content. An interpreter executes these operators one-by-one every time the page
is rendered into a viewable form. Typically, this interpreter takes the form of
a tokenizer that splits the page description into the separate operators.
Various subroutines are then called depending on which tokens are encountered.
This process is analogous to instruction execution at the heart of a CPU: the
CPU fetches machine code instructions from memory and dispatches them to the
various parts of the chip as necessary.
In this paper, we show that it is possible to compile a page description directly into machine code, bypassing the need to interpret the page description. This can bring a speed increase in PDF rendering -- particularly important on low-power devices -- and could also help increase document accessibility. Keywords: PDF, compilation, document format, interpretation, page description
languages | |||
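To make the interpreter-versus-compiler contrast concrete, here is a toy sketch (in Python, not the machine-code target the paper describes): a loosely PDF-like page description is either dispatched operator-by-operator on every render, or "compiled" once into a flat list of bound calls that can be replayed without any per-operator lookup. Operator names and the page content are invented.

```python
# Toy page description: (operator, args) pairs, loosely PDF-like.
PAGE = [("m", (72, 720)), ("l", (300, 720)), ("S", ())] * 3

def op_moveto(x, y): print(f"moveto {x} {y}")
def op_lineto(x, y): print(f"lineto {x} {y}")
def op_stroke():     print("stroke")

OPS = {"m": op_moveto, "l": op_lineto, "S": op_stroke}

def interpret(page):
    """Interpreter: look up and dispatch every operator on every render."""
    for op, args in page:
        OPS[op](*args)

def compile_page(page):
    """'Compile' once: resolve the dispatch ahead of time and return a
    replayable sequence of bound calls for cheap repeated rendering."""
    code = [(OPS[op], args) for op, args in page]
    def render():
        for fn, args in code:
            fn(*args)
    return render

render = compile_page(PAGE)
render()   # no tokenizing or operator lookup at render time
```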
| Transquotation in EBooks | | BIBAK | Full-Text | 69-72 | |
| Steven A. Battle; Matthew Bernius | |||
| This paper describes the use of transquotation in eBooks to support
collaborative publishing. Users are able to prepare content on a wiki and
assemble this into a publishable eBook. The eBook content should remain
connected to its origin so that as the wiki content changes, the eBook may be
revised accordingly. The problems raised in creating a transquoting eBook
editing environment include transforming wiki content into a suitable
presentation, mapping selections back to plain-text, and mapping existing
selections into the presentation space for review. Keywords: eBooks, transclusion, transquotation | |||
| Table of contents recognition for converting PDF documents in e-book formats | | BIBAK | Full-Text | 73-76 | |
| Simone Marinai; Emanuele Marino; Giovanni Soda | |||
We describe a tool for Table of Contents (ToC) identification and
recognition in PDF books. This task is part of ongoing research on the
development of tools for the semi-automatic conversion of PDF documents into the
Epub format, which can be read on several e-book devices. Among various
sub-tasks, the ToC extraction and recognition is particularly useful for an
easy navigation of book contents.
The proposed tool first identifies the ToC pages. The bounding boxes of ToC titles in the book body are subsequently found in order to add suitable links in the Epub ToC. The proposed approach is tolerant of discrepancies between the ToC text and the corresponding titles. We evaluated the tool on several open access books edited by University Presses that are partners of the OAPEN EcontentPlus project. Keywords: PDF, e-book conversion, table of content | |||
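One simple way to be "tolerant of discrepancies" between ToC entries and body titles is fuzzy string matching; the sketch below uses difflib for that purpose. It is only an illustration of the idea, not the authors' matching method, and the threshold and sample titles are assumptions.

```python
from difflib import SequenceMatcher

def best_title_match(toc_entry: str, candidate_titles: list[str],
                     threshold: float = 0.75):
    """Return the body title most similar to a ToC entry, or None.

    Normalises case and whitespace so small discrepancies (punctuation,
    hyphenation, OCR slips) do not prevent linking the Epub ToC entry
    to its target page.
    """
    norm = lambda s: " ".join(s.lower().split())
    scored = [(SequenceMatcher(None, norm(toc_entry), norm(t)).ratio(), t)
              for t in candidate_titles]
    score, title = max(scored)
    return title if score >= threshold else None

titles = ["Chapter 1. The Digital Library", "Chapter 2: Metadata standards"]
print(best_title_match("1 The digital library", titles))
```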
| Using versioned tree data structure, change detection and node identity for three-way XML merging | | BIBAK | Full-Text | 77-86 | |
| Cheng Thao; Ethan V. Munson | |||
| XML has become the standard document representation for many popular tools
in various domains. When multiple authors collaborate to produce a document,
they must be able to work in parallel and periodically merge their efforts into
a single work. While there exist a small number of three-way XML merging tools,
their performance could be improved in several areas and they lack any form of
user interface for resolving conflicts.
In this paper, we present an implementation of a three-way XML merge algorithm that is faster, uses less memory and is more precise than existing tools. It uses a specialized versioning tree data structure that supports node identity and change detection. The algorithm applies the traditional three-way merge found in GNU diff3 to the children of changed nodes. The editing operations it supports are addition, deletion, update, and move. A graphical interface for visualizing and resolving conflicts is also provided. An evaluation experiment was conducted comparing the proposed algorithm with three other tools on randomly generated XML data. Keywords: XML, document trees, three-way merge, versioning system | |||
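As a minimal illustration of three-way merging applied node-by-node (here to one node's attribute dictionary), the sketch below takes non-conflicting changes from each side relative to the base version, in the diff3 spirit mentioned above; the paper's algorithm additionally handles moves and relies on a versioned tree with node identity, which this fragment does not model.

```python
def merge_attrs(base: dict, ours: dict, theirs: dict):
    """Three-way merge of one node's attributes.

    A side's change is taken when the other side left the value at base;
    identical changes agree; diverging changes are reported as conflicts.
    """
    merged, conflicts = {}, []
    for key in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:                 # unchanged, or both made the same change
            value = o
        elif o == b:               # only "theirs" changed it
            value = t
        elif t == b:               # only "ours" changed it
            value = o
        else:                      # both changed it differently
            conflicts.append(key)
            value = o              # placeholder; a UI would let the user choose
        if value is not None:
            merged[key] = value
    return merged, conflicts

base   = {"align": "left", "width": "100"}
ours   = {"align": "center", "width": "100"}
theirs = {"align": "left", "width": "120", "id": "t1"}
print(merge_attrs(base, ours, theirs))
# -> ({'align': 'center', 'width': '120', 'id': 't1'}, [])
```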
| A model for editing operations on active temporal multimedia documents | | BIBAK | Full-Text | 87-96 | |
| Jack Jansen; Pablo Cesar; Dick C. A. Bulterman | |||
| Inclusion of content with temporal behavior in a structured document leads
to such a document gaining temporal semantics. If we then allow changes to the
document during its presentation, this brings with it a number of fundamental
issues that are related to those temporal semantics. In this paper we study
modifications of active multimedia documents and the implications of those
modifications for temporal consistency. Such modifications are becoming
increasingly important as multimedia documents move from being primarily a
standalone presentation format to being a building block in a larger
application.
We present a categorization of modification operations, where each category has distinct consistency and implementation implications for the temporal semantics. We validate the model by applying it to the SMIL language, categorizing all possible editing operations. Finally, we apply the model to the design of a teleconferencing application, where multimedia composition is only a small component of the whole application, and needs to be reactive to the rest of the system. The primary contribution of this paper is the development of a temporal editing model and a general analysis which we feel can help application designers to structure their applications such that the temporal impact of document modification can be minimized. Keywords: application design, declarative languages, dynamic transformations,
multimedia | |||
| Semantics-based change impact analysis for heterogeneous collections of documents | | BIBAK | Full-Text | 97-106 | |
| Serge Autexier; Normen Müller | |||
An overwhelming number of documents are produced and changed every day in
most areas of our everyday life, such as business, education,
research or administration. Documents are seldom isolated artifacts but are
related to other documents. Therefore, changing one document possibly requires
adaptations to other documents.
Although dedicated tools may provide some assistance when changing documents, they often ignore other documents or documents of a different type. To resolve that discontinuity, we present a framework that embraces existing document types and supports the declarative specification of semantic annotation and propagation rules inside and across documents of different types, and on which basis we define change impact analysis for heterogeneous collections of documents. The framework is implemented in the tool GMoC which can be used to semantically annotate collections of documents and to analyze the impacts of changes made in different documents of a collection. Keywords: change impact analysis, document collections, document management, graph
rewriting, semantics | |||
| Linking data and presentations: from mapping to active transformations | | BIBAK | Full-Text | 107-110 | |
| Olivier Beaudoux; Arnaud Blouin | |||
| Modern GUI toolkits, and especially RIA ones, propose the concept of binding
to dynamically link domain data and their presentations. Bindings are very
simple to use for predefined graphical components. However, they remain
dependent on the GUI platform, are not as expressive as transformation
languages, and require specific coding when designing new graphical components.
A solution to such issues is to use active transformations: an active
transformation is a transformation that dynamically links source data to target
data. Active transformations are however complex to write and/or to process. In
this paper, we propose the use of the AcT framework that consists of: a
platform-independent mapping language that masks the complexity of active
transformations; a graphical mapping editor; and an implementation on the .NET
platform. Keywords: active transformation, mapping, model driven engineering | |||
| Blocked recursive image composition with exclusion zones | | BIBA | Full-Text | 111-114 | |
| Hui Chao; Daniel R. Tretter; Xuemei Zhang; C. Brian Atkins | |||
| Photo collages are a popular and powerful storytelling mechanism. They are often enhanced with background artwork that sets the theme for the story. However, layout algorithms for photo collage creation typically do not take this artwork into account, which can result in collages where photos overlay important artwork elements. To address this, we extend our previous Blocked Recursive Image Composition (BRIC) method to allow any number of photos to be automatically arranged around preexisting exclusion zones on a canvas (exBRIC). We first generate candidate binary splitting trees to partition the canvas into regions that accommodate both photos and exclusion zones. We use a Cassowary constraint solver to ensure that the desired exclusion zones are not covered by photos. Finally, photo areas, exclusion zones and layout symmetry are evaluated to select the best candidate. This method provides flexible, dynamic and integrated photo layout with background artwork. | |||
| Differential access for publicly-posted composite documents with multiple workflow participants | | BIBAK | Full-Text | 115-124 | |
| Helen Y. Balinsky; Steven J. Simske | |||
| A novel mechanism for providing and enforcing differential access control
for publicly-posted composite documents is proposed. The concept of a document
is rapidly changing: individual file-based, traditional formats can no longer
accommodate the required mixture of differently formatted parts: individual
images, video/audio clips, PowerPoint presentations, html-pages, Word
documents, Excel spreadsheets, pdf files, etc. Multi-part composite documents
are created and managed in complex workflows, with participants including
external consultants, partners and customers distributed across the globe, with
many no longer contained within one monolithic secure environment. Distributed
over non-secure channels, these documents carry different types of sensitive
information: examples include (a) an enterprise pricing strategy for new
products, (b) employees' personal records, (c) government intelligence, and (d)
individual medical records. A central server solution is often hard or
impossible to create and maintain for ad-hoc workflows. Thus, the documents are
often circulated between workflow participants over traditional, low security
e-mails, placed on shared drives, or exchanged using CD/DVD or USB. The
situation is more complicated when multiple workflow participants need to
contribute to various parts of such a document with different access levels:
for example, full editing rights, read-only, reading of some parts only, etc.,
for different users. We propose a full scale differential access control
approach, enabling public posting of composite documents, to address these
concerns. Keywords: access control, composite document, document security, policy | |||
| Assessing the readability of clinical documents in a document engineering environment | | BIBAK | Full-Text | 125-134 | |
| Mark Truran; Gersende Georg; Marc Cavazza; Dong Zhou | |||
| Previous work has established that specific linguistic markers present in
specialised medical documents (clinical guidelines) can be used to support
their automatic structuring within a document engineering environment. This
technique is commonly used by the French Health Authority (la Haute Autorité de
Santé) during the elaboration of clinical guidelines to improve the quality of the
final document. In this paper, we explore the readability of clinical
guidelines. We discuss a structural measure of document readability that
exploits the ratio between these linguistic markers (deontic structures) and
the remainder of the text. We describe an experiment in which a corpus of 10
French clinical guidelines is scored for structural readability. We correlate
these scores with measures of textual cohesion (computed using latent semantic
analysis) and the results of a readability survey performed by a panel of
domain experts. Our results suggest an association between the density of
deontic structures in a clinical guideline and its overall readability. This
implies that certain generic readability measures can henceforth be utilised in
our document engineering environment. Keywords: LSA, cohesion, latent semantic analysis, medical document processing,
readability | |||
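The structural measure above exploits the ratio between deontic structures and the remainder of the text. The sketch below is only an illustrative density computation: it counts the share of sentences containing a deontic marker, with a tiny, hypothetical marker list standing in for the inventory derived from the French guideline corpus.

```python
import re

# Hypothetical French deontic markers; the real inventory comes from the
# linguistic analysis of the guideline corpus, not from this short list.
DEONTIC_MARKERS = [r"\bdoit\b", r"\bdoivent\b", r"\bne doit pas\b",
                   r"\bil est recommandé\b", r"\bil est nécessaire\b"]

def deontic_density(text: str) -> float:
    """Ratio of sentences containing a deontic marker to all sentences,
    a simple stand-in for a structural readability score."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    deontic = sum(1 for s in sentences
                  if any(re.search(m, s.lower()) for m in DEONTIC_MARKERS))
    return deontic / len(sentences) if sentences else 0.0

guideline = ("Le traitement doit être réévalué à trois mois. "
             "Il est recommandé de surveiller la fonction rénale. "
             "Les données sont issues d'essais randomisés.")
print(f"deontic density: {deontic_density(guideline):.2f}")   # ~0.67
```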
| Optimized reprocessing of documents using stored processor state | | BIBAK | Full-Text | 135-138 | |
| James A. Ollis; David F. Brailsford; Steven R. Bagley | |||
| Variable Data Printing (VDP) allows customised versions of material such as
advertising flyers to be readily produced. However, VDP is often extremely
demanding of computing resources because, even when much of the material stays
invariant from one document instance to the next, it is often simpler to
re-evaluate the page completely rather than identifying just the portions that
vary.
In this paper we explore, in an XML/XSLT/SVG workflow and in an editing context, the reduction of the processing burden that can be realised by selectively reprocessing only the variant parts of the document. We introduce a method of partial re-evaluation that relies on re-engineering an existing XSLT parser to handle, at each XML tree node, both the storage and restoration of state for the underlying document processing framework. Quantitative results are presented for the magnitude of the speed-ups that can be achieved. We also consider how changes made through an appearance-based interactive editing scheme for VDP documents can be automatically reflected in the document view via optimised XSLT re-evaluation of sub-trees that are affected either by the changed script or by altered data. Keywords: SVG, VDP, XSLT, document authoring, document editing, partial re-evaluation,
variable data documents | |||
| APEX: automated policy enforcement eXchange | | BIBAK | Full-Text | 139-142 | |
| Steven J. Simske; Helen Balinsky | |||
| The changing nature of document workflows, document privacy and document
security merit a new approach to the enforcement of policy. We propose the use
of automated means for enforcing policy, which provides advantages for
compliance and auditing, adaptability to changes in policy, and compatibility
with a cloud-based exchange. We describe the Automated Policy Enforcement
eXchange (APEX) software system, which consists of: (1) a policy editor, (2) a
policy server, (3) a local daemon on every PC/laptop to maintain local secure
up-to-date storage and policy, and (4) local (policy-enforcing) wrappers to
capture document-handling user actions such as document export, e-mail, print,
edit and save. When a relevant incremental change or other
user-elicited action is performed on a composite document, the document and its metadata
are scanned for salient policy-eliciting terms (PETs). The document is then
partitioned based on relevant policies and the security policy for each part is
determined. If the document contains no PETs, then the user-initiated actions
are allowed; otherwise, alternative actions are suggested, including: (a)
encryption, (b) redirecting to a secure printer and requiring authorization
(e.g. PIN) for printing, and (c) disallowing printing until specific sensitive
data is removed. Keywords: document system components, document systems, policy, policy editor, policy
server, security, text analysis | |||
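The scan-and-decide step described above can be sketched as follows: look for policy-eliciting terms in a document and its metadata, and either allow the requested action or return the suggested alternatives. The terms, the policy table, and the alternative actions are invented for illustration and are not APEX's actual configuration.

```python
# Hypothetical policy table: policy-eliciting term -> allowed alternatives.
POLICIES = {
    "pricing strategy": ["encrypt", "secure-print-with-pin"],
    "medical record":   ["encrypt", "remove-sensitive-data-before-print"],
    "ssn":              ["remove-sensitive-data-before-print"],
}

def check_action(document_text: str, metadata: str, action: str):
    """Return (allowed, suggestions) for a user-initiated action.

    If no policy-eliciting term (PET) occurs in the document or its
    metadata, the action is allowed; otherwise the union of the
    alternatives attached to the matching policies is suggested.
    """
    haystack = (document_text + " " + metadata).lower()
    hits = [term for term in POLICIES if term in haystack]
    if not hits:
        return True, []
    suggestions = sorted({alt for term in hits for alt in POLICIES[term]})
    return False, suggestions

allowed, alts = check_action("Q3 pricing strategy draft", "author: sales", "email")
print(allowed, alts)   # False ['encrypt', 'secure-print-with-pin']
```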
| Unsupervised font reconstruction based on token co-occurrence | | BIBAK | Full-Text | 143-150 | |
| Michael Patrick Cutter; Joost van Beusekom; Faisal Shafait; Thomas Michael Breuel | |||
| High quality conversions of scanned documents into PDF usually either rely
on full OCR or token compression. This paper describes an approach intermediate
between those two: it is based on token clustering, but additionally groups
tokens into candidate fonts. Our approach has the potential of yielding
OCR-like PDFs when the inputs are of high quality, degrading to token-based
compression when the font analysis fails, while preserving full visual
fidelity. Our approach is based on an unsupervised algorithm for grouping
tokens into candidate fonts. The algorithm constructs a graph based on token
proximity and derives token groups by partitioning this graph. In initial
experiments on scanned 300 dpi pages containing multiple fonts, this technique
reconstructs candidate fonts with 100% accuracy. Keywords: candidate fonts, font reconstruction, token co-occurrence graph
partitioning, token compression | |||
| Document structure meets page layout: loopy random fields for web news content extraction | | BIBAK | Full-Text | 151-160 | |
| Alex Spengler; Patrick Gallinari | |||
| Web content extraction is concerned with the automatic identification of
semantically interesting web page regions. To generalize to pages from unknown
sites, it is crucial to exploit not only the local characteristics of a
particular web page region, but also the rich interdependencies that exist
between the regions and their latent semantics. We therefore propose a loopy
conditional random field which combines semantic intra-page dependencies
derived from both document structure and page layout, uses a realistic set of
local and relational features and is efficiently learnt in the tree-based
reparameterization framework. The results of our empirical analysis on a corpus
of real-world news web pages from 177 distinct sites with multiple annotations
on DOM node level demonstrate that our combination of document structure and
layout-driven interdependencies leads to a significant error reduction on the
semantically interesting regions of a web page. Keywords: loopy conditional random fields, news600 data set, tree-based
reparameterization, web content extraction | |||
| Comparison of global and cascading recognition systems applied to multi-font Arabic text | | BIBAK | Full-Text | 161-164 | |
| Fouad Slimane; Slim Kanoun; Adel M. Alimi; Jean Hennebert; Rolf Ingold | |||
| A known difficulty of Arabic text recognition is in the large variability of
printed representation from one font to the other. In this paper, we present a
comparative study between two strategies for the recognition of multi-font
Arabic text. The first strategy is to use a global recognition system working
independently on all the fonts. The second strategy is to use a so-called
cascade built from a font identification system followed by font-dependent
systems. In order to reach a fair comparison, the feature extraction and the
modeling algorithms based on HMMs are kept as similar as possible between both
approaches. The evaluation is carried out on the large and publicly available
APTI (Arabic Printed Text Image) database with 10 different fonts. The results
show a clear performance advantage for the cascading approach.
However, the cascading system is more costly in terms of CPU and memory. Keywords: APTI, GMM, HMM, font recognition, text recognition | |||
| Automatic selection of print-worthy content for enhanced web page printing experience | | BIBAK | Full-Text | 165-168 | |
| Suk Hwan Lim; Liwei Zheng; Jianming Jin; Huiman Hou; Jian Fan; Jerry Liu | |||
| The user experience of printing web pages has not been very good. Web pages
typically contain content that is not print-worthy or informative, such as
sidebars, footers, headers, advertisements, and auxiliary information for
further browsing. Since the inclusion of such content degrades the web
printing experience, we have developed a tool that first selects the main part
of the web page automatically and then allows users to make adjustments. In
this paper, we describe the algorithm for selecting the main content
automatically during the first pass. The web page is first segmented into
several coherent areas or blocks using our web page segmentation method that
clusters content based on the affinity values between basic elements. The
relative importance values for the segmented blocks are computed using various
features and the main content is extracted based on the constraint of one DOM
(Document Object Model) sub-tree and high importance scores. We evaluated our
algorithm on 65 web pages and computed the accuracy based on the area of overlap
between the ground truth and the extracted result of the algorithm. Keywords: block importance, segmentation, web page layout analysis, web page printing | |||
| A new model for automated table layout | | BIBAK | Full-Text | 169-176 | |
| Mihai Bilauca; Patrick Healy | |||
| In this paper we consider the table layout problem. We present a
combinatorial optimization modeling method for the table layout optimization
problem, the problem of minimizing a table's height subject to it fitting on a
given page (width). We present two models of the problem and report on their
evaluation. Keywords: constrained optimization, table layout | |||
| PDF profiling for B&W versus color pages cost estimation for efficient on-demand book printing | | BIBAK | Full-Text | 177-180 | |
| Fabio Giannetti; Gary Dispoto; Rafael Dueire Lins; Gabriel de França Pereira e Silva; Alexis Cabeda | |||
| Today, the way books, magazines and newspapers are published is undergoing a
democratic revolution. Digital Presses have enabled the on-demand model, which
provides individuals with the opportunity to produce and publish their own
books with very low upfront cost. With these new markets, new opportunities and
challenges have arisen. In a traditional environment, black-and-white and color
pages were printed using different presses. Later on, the book was assembled
combining the pages accordingly. In a digital workflow all the pages are
printed with the same press, although the page cost varies significantly
between color and b/w pages. Having an accurate printing cost profiler for
pdf-files is fundamental for the print-on-demand business, as jobs often have a
mix of color and b/w pages. To meet the expectations of some of HP's customers in
the large Print Service Providers (PSPs) business, a profiler was developed
which yielded a reasonable cost estimate. The industrial use of such a tool
showed some discrepancies between the estimated costs and the printer logs, however. The new
profiler presented herein provides a more accurate account of pdf jobs to be
printed. Tested on 79 "real world" pdf jobs, totaling 7,088 pages, the new
profiler made only one page misclassification, while the previous one yielded
54 classification errors. Keywords: PDF profiling, digital presses, printing costs | |||
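As a hedged illustration of the underlying page-classification task (not HP's profiler, which works directly on the PDF content), the sketch below applies a naive test to a rendered page image: a page counts as color if any sampled pixel's RGB channels differ by more than a tolerance, which ignores slight tints from anti-aliasing or scanning. The file name and parameters are assumptions.

```python
from PIL import Image

def is_color_page(page_image: Image.Image, tolerance: int = 12,
                  step: int = 4) -> bool:
    """Classify a rendered page as color or B&W.

    Samples every `step`-th pixel and reports color as soon as the spread
    between the R, G and B channels exceeds `tolerance`.
    """
    rgb = page_image.convert("RGB")
    w, h = rgb.size
    px = rgb.load()
    for y in range(0, h, step):
        for x in range(0, w, step):
            r, g, b = px[x, y]
            if max(r, g, b) - min(r, g, b) > tolerance:
                return True
    return False

if __name__ == "__main__":
    # page images would come from rasterising the PDF (e.g. one PNG per page)
    page = Image.open("page-001.png")
    print("color" if is_color_page(page) else "b/w")
```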
| Next generation typeface representations: revisiting parametric fonts | | BIBAK | Full-Text | 181-184 | |
| Tamir Hassan; Changyuan Hu; Roger D. Hersch | |||
| Outline font technology has long been established as the standard way to
represent typefaces, allowing characters to be represented independently of
print size and resolution. Although outline font technologies are mature and
produce results of sufficient quality for professional printing applications,
they are inherently inflexible, which presents limitations in a number of
document engineering applications. In the 1990s, finding a
successor to outline fonts was a hot topic of research. Unfortunately, none of
the methods developed at the time were successful in replacing outline font
technology and this field of research has since then declined sharply in
popularity.
In this paper, we revisit a parametric font format developed between 1995 and 2001 by Hu and Hersch, where characters are built up from connected shape components. We extend this representation and use it to synthesize several characters from the Frutiger typeface and alter their weights by setting the relevant parameters. These settings are automatically propagated to the other characters of the font family. To conclude, we provide a discussion on next-generation font technologies in the light of today's Web-centric technologies and suggest applications that could greatly benefit from the use of flexible, parametric font representations. Keywords: digital typography, font representation, font synthesis, parameterized
fonts, parametric fonts, re-typesetting | |||
| DSMW: a distributed infrastructure for the cooperative edition of semantic wiki documents | | BIBAK | Full-Text | 185-186 | |
| Hala Skaf-Molli; Gérôme Canals; Pascal Molli | |||
| DSMW is a distributed semantic wiki that offers new collaboration modes to
semantic wiki users and supports dataflow-oriented processes.
DSMW is an extension to Semantic MediaWiki (SMW); it allows the creation of a network of SMW servers that share common semantic wiki pages. DSMW users can create communication channels between servers and use a publish-subscribe approach to manage change propagation. DSMW synchronizes concurrent updates of shared semantic pages to ensure their consistency. Keywords: distribution, replication, semantic wiki | |||
| Open world classification of printed invoices | | BIBAK | Full-Text | 187-190 | |
| Enrico Sorio; Alberto Bartoli; Giorgio Davanzo; Eric Medvet | |||
| A key step in the understanding of printed documents is their classification
based on the nature of information they contain and their layout. In this work
we consider a dynamic scenario in which document classes are not known a priori
and new classes can appear at any time. This open world setting is both
realistic and highly challenging. We use an SVM-based classifier based only on
image-level features and use a nearest-neighbor approach for detecting new
classes. We assess our proposal on a real-world dataset composed of 562
invoices belonging to 68 different classes. These documents were digitized
after being handled in a corporate environment, thus they are quite noisy --
e.g., big stamps and handwritten signatures at unfortunate positions and the like.
The experimental results are highly promising. Keywords: SVM, document image classification, machine learning, nearest-neighbor | |||
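The open-world combination described above can be sketched with scikit-learn: an SVM classifies documents into the known classes, while a nearest-neighbor distance threshold flags documents that probably belong to a new, unseen class. The features, threshold, and synthetic data below are placeholders, not the paper's image-level features or parameters.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Placeholder feature vectors for invoices of two known classes.
X_train = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(1, 0.1, (20, 8))])
y_train = np.array([0] * 20 + [1] * 20)

svm = SVC().fit(X_train, y_train)
nn = NearestNeighbors(n_neighbors=1).fit(X_train)

def classify_open_world(x: np.ndarray, new_class_threshold: float = 0.8):
    """Return a known class label, or 'new class' if the nearest training
    sample is farther away than the threshold."""
    dist, _ = nn.kneighbors(x.reshape(1, -1))
    if dist[0, 0] > new_class_threshold:
        return "new class"
    return int(svm.predict(x.reshape(1, -1))[0])

print(classify_open_world(rng.normal(1, 0.1, 8)))   # likely 1
print(classify_open_world(rng.normal(5, 0.1, 8)))   # likely 'new class'
```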
| Diffing, patching and merging XML documents: toward a generic calculus of editing deltas | | BIBAK | Full-Text | 191-194 | |
| Jean-Yves Vion-Dury | |||
| This work addresses what we believe to be a central issue in the field of
XML diff and merge computation: the mathematical modeling of the so-called
"editing deltas" and the study of their formal abstract properties. We expect
at least three outputs from this theoretical work: a common basis to compare
performances of the various algorithms through a structural normalization of
deltas, a universal and flexible patch application model and a clearer
separation of patch and merge engine performance from delta generation
performance. Moreover, this work could inspire technical approaches to combine
heterogeneous engines thanks to sound delta transformations. This short paper
reports current results, discusses key points and outlines some perspectives. Keywords: XML, tree edit distance, tree transformation, tree-to-tree correction,
version control | |||
| Contextual advertising for web article printing | | BIBAK | Full-Text | 195-198 | |
| Shengwen Yang; Jianming Jin; Joshi Parag; Sam Liu | |||
| Advertisements provide the necessary revenue model supporting the Web
ecosystem and its rapid growth. Targeted or contextual ad insertion plays an
important role in optimizing the financial return of this model. Nearly all the
current ad payment strategies such as "pay-per-impression" and "pay-per-click"
on web pages are geared for electronic viewing purposes. Little attention,
however, is focused on deriving additional ad revenues when the content is
repurposed for an alternative means of presentation, e.g. being printed. Although
more and more content is moving to the Web, there are still many occasions
where printed output of web content or RSS feeds is desirable, such as maps and
articles; thus printed ad insertion can potentially be lucrative.
In this paper, we describe a cloud-based printing service that enables automatic contextual ad insertion, with respect to the main web page content, when a printout of the page is requested. To encourage service utilization, it would provide higher-quality printouts than are possible from current browser print drivers, which generally produce poor outputs -- ill-formatted pages with lots of unwanted information, e.g. navigation icons. At this juncture we will limit the scope to only article-related web pages, although the concept can be extended to arbitrary web pages. The key components of this system include (1) automatic extraction of articles from web pages, (2) the ad service network for ad matching and delivery, and (3) joint content and ad printout creation. Keywords: contextual advertisement, web printing | |||
| Table layout performance of document authoring tools | | BIBAK | Full-Text | 199-202 | |
| Mihai Bilauca; Patrick Healy | |||
| In this paper we survey table creation in several popular document authoring
programs and identify usability bottlenecks and inconsistencies between several
of them. We discuss the user experience of drawing tables and draw
attention to the fact that authoring tables is still difficult and can be a
frustrating and error-prone exercise, and that the development of high-quality
table tools should be further pursued. Keywords: table layout | |||
| Document product lines: variability-driven document generation | | BIBAK | Full-Text | 203-206 | |
| Ma Carmen Penadés; José H. Canós; Marcos R. S. Borges; Manuel Llavador | |||
| In this paper, we propose a process model, which we call Document Product
Lines, for the intensive generation of documents with variable content. Unlike
current approaches, we identify the variability sources at the requirements
level, including an explicit representation and management of these sources.
The process model provides a methodological guidance to the (semi)automated
generation of customized editors following the principles, techniques, and
available technologies of Software Product Line Engineering. We illustrate our
proposal with its application to the intensive generation of Emergency Plans. Keywords: document generation, emergency management, emergency plans, software product
lines, variability management | |||
| Degraded dot matrix character recognition using CSM-based feature extraction | | BIBAK | Full-Text | 207-210 | |
| Abderrahmane Namane; El Houssine Soubari; Patrick Meyrueis | |||
| This paper presents an OCR method for degraded character recognition applied
to a reference number (RN) of 15 printed characters of an invoice document
produced by dot-matrix printer. First, the paper deals with the problem of the
reference number localization and extraction, in which the characters tops or
bottoms are or not touched with a printed reference line of the electrical
bill. In case of touched RN, the extracted characters are severely degraded
leading to missing parts in the characters tops or bottoms. Secondly, a
combined recognition method based on the complementary similarity measure (CSM)
method and MLP-based classifier is used. The CSM is used to accept or reject an
incoming character. In case of acceptation, the CSM acts as a feature extractor
and produces a feature vector of ten component features. The MLP is then
trained using these feature vectors. The use of the CSM as a feature extractor
tends to make the MLP very powerful and very well suited for rejection.
Experimental results on electrical bills show the ability of the model to yield
relevant and robust recognition on severely degraded printed characters. Keywords: OCR, character recognition, dot matrix, feature extraction, multiple
classification | |||
| Picture detection in document page images | | BIBA | Full-Text | 211-214 | |
| Patrick Chiu; Francine Chen; Laurent Denoue | |||
| We present a method for picture detection in document page images, which can come from scanned or camera images, or rendered from electronic file formats. Our method uses OCR to separate out the text and applies the Normalized Cuts algorithm to cluster the non-text pixels into picture regions. A refinement step uses the captions found in the OCR text to deduce how many pictures are in a picture region, thereby correcting for under- and over-segmentation. A performance evaluation scheme is applied which takes into account the detection quality and fragmentation quality. We benchmark our method against the ABBYY application on page images from conference papers. | |||
| Down to the bone: simplifying skeletons | | BIBAK | Full-Text | 215-218 | |
| Jannis Stoppe; Björn Gottfried | |||
| This paper is about off-line handwritten text comparison of historic
documents. The long-term motivation is the support of palaeographic research,
in particular to back up decisions as to whether two handwritings can be
ascribed to the same author. In this paper, a first fundamental step is
presented for extracting relevant structures from handwritten texts. Such
structures are represented by skeletons, due to their resemblance to original
writing movements.
The core result is an approach to the simplification of skeleton structures. While skeletons represent constitutive structures for a wide variety of subsequent algorithms, simplification algorithms usually focus on pruning branches off the skeleton instead of simplifying the skeleton as a whole. By contrast, our approach reduces the number of elements in a skeleton based on a global error level, reducing the skeleton's complexity while keeping its structure as close to the original exemplar as possible. The results are much easier to analyse while relevant information is maintained. Keywords: image processing, shapes, simplification, skeletons, text | |||
| Interactive layout analysis and transcription systems for historic handwritten documents | | BIBAK | Full-Text | 219-222 | |
| Oriol Ramos-Terrades; Alejandro H. Toselli; Nicolas Serrano; Verónica Romero; Enrique Vidal; Alfons Juan | |||
The amount of digitized legacy documents has been rising dramatically over
recent years, due mainly to the increasing number of on-line digital libraries
publishing this kind of document, waiting to be classified and finally
transcribed into a textual electronic format (such as ASCII or PDF).
Nevertheless, most of the available fully-automatic applications addressing
this task are far from perfect, and heavy, inefficient human
intervention is often required to check and correct the results of such
systems. In contrast, multimodal interactive-predictive approaches may allow
the users to participate in the process helping the system to improve the
overall performance. With this in mind, two sets of recent advances are
introduced in this work: a novel interactive method for text block detection
and two multimodal interactive handwritten text transcription systems which use
active learning and interactive-predictive technologies in the recognition
process. Keywords: handwriting recognition, interactive layout analysis, interactive predictive
processing, partial supervision | |||
| Document conversion for cultural heritage texts: FrameMaker to HTML revisited | | BIBAK | Full-Text | 223-226 | |
| Michael Piotrowski | |||
| Many large-scale digitization projects are currently under way that intend
to preserve the cultural heritage contained in paper documents (in particular
books) and make it available on the Web. Typically OCR is used to produce
searchable electronic texts from books. For newer books, approximately from the
late 1980s onwards, digital text may already exist in the form of typesetting
data. For applications that require a higher level of accuracy than OCR can
deliver, the conversion of typesetting data can thus be an alternative to
manual keying. In this paper, we describe a tool for converting typesetting
data in FrameMaker format to XHTML+CSS developed for a collection of source
editions of medieval and early modern documents. Even though the books of the
Collection are typeset in good quality and in modern typefaces, OCR is
unusable, since the text is in various historical forms of German, French,
Italian, Rhaeto-Romanic, and Latin. The conversion of typesetting data produces
fully reliable text free from OCR errors and thus also provides a basis for the
construction of language resources for the processing of historical texts. Keywords: CSS, XHTML, cultural heritage data, document format conversion, FrameMaker | |||
| Glyph extraction from historic document images | | BIBAK | Full-Text | 227-230 | |
| Lothar Meyer-Lerbs; Arne Schuldt; Björn Gottfried | |||
| This paper is about the reproduction of ancient texts with vectorised fonts.
While for OCR only recognition rates count, a reproduction process does not
necessarily require the recognition of characters. Our system aims at
extracting all characters from printed historic documents without employing
knowledge of language, font, or writing system. It searches for
the best prototypes and creates a document-specific font from these glyphs. To
reach this goal, many common OCR preprocessing steps are no longer adequate. We
describe the necessary changes of our system that deals particularly with
documents typeset in Fraktur. On the one hand, algorithms are described that
extract glyphs accurately for the purpose of precise reproduction. On the other
hand, classification results of extracted Fraktur glyphs are presented for
different shape descriptors. Keywords: document-specific font, glyph classification, glyph extraction, glyph shape,
image enhancement | |||
| Style and branding elements extraction from business web sites | | BIBAK | Full-Text | 231-234 | |
| Limei Jiao; Suk Hwan Lim; Nina Bhatti; Yuhong Xiong; Jerry Liu | |||
| We describe a method to extract style and branding elements from multiple
web pages in a given site for content repurposing. Style and branding elements
convey the values of the site owners effectively and connect with the target
prospects. They are manifested through logos, graphical elements, background
color, font styles, font colors and other illustrations. Our method
automatically extracts color and image elements appearing frequently and
prominently on multiple pages throughout the site. We rely on a DOM tree
matching method to obtain the frequency of re-occurring elements and use
relative sizes and positions of elements to determine the type of elements.
Note that approximate locations of these elements provide an added clue to the
content repurposing engine as to where to place the elements in the repurposed
document. The obtained results show that the proposed method can efficiently
extract style and branding elements with high accuracy. Keywords: high frequent elements extraction, style and branding extraction, tree
matching | |||
| FormCracker: interactive web-based form filling | | BIBAK | Full-Text | 235-238 | |
| Laurent Denoue; John Adcock; Scott Carter; Patrick Chiu; Francine Chen | |||
| Filling out document forms distributed by email or hosted on the Web is
still problematic and usually requires a printer and scanner. Users commonly
download and print forms, fill them out by hand, scan and email them. Even if
the document is form-enabled (PDFs with FDF information), to read the file
users still have to launch a separate application which may not be available,
especially on mobile devices.
FormCracker simplifies this process by providing an interactive, fully web-based document viewer that lets users complete forms online. Document pages are rendered as images and presented in a simple HTML-based viewer. When a user clicks in a form-field, FormCracker identifies the correct form-field type using lightweight image processing and heuristics based on nearby text. Users can then seamlessly enter data in form-fields such as text boxes, check boxes, radio buttons, multiple text lines, and multiple single-box characters. FormCracker also provides useful auto-complete features based on the field type, for example a date picker, a drop-down menu for city names, state lists, and an auto-complete text box for first and last names. Finally, FormCracker allows users to save and print the completed document. In summary, with FormCracker a user can efficiently complete and reuse any electronic form. Keywords: document processing, form filling, image processing, interactive | |||
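The "heuristics based on nearby text" mentioned above can be sketched as a simple rule table: given the label text found closest to a clicked region, guess the field type and a matching widget. The labels, keyword patterns, and widget names below are invented for illustration and are not FormCracker's actual rules.

```python
import re

# Hypothetical mapping from label keywords to field type / input widget.
RULES = [
    (r"\bdate\b|\bdob\b",             ("date",   "date picker")),
    (r"\bstate\b",                    ("choice", "state drop-down")),
    (r"\bcity\b",                     ("text",   "city auto-complete")),
    (r"\bsignature\b",                ("image",  "signature pad")),
    (r"\bcheck\b|\byes/no\b",         ("bool",   "check box")),
    (r"\bfirst name\b|\blast name\b", ("text",   "name auto-complete")),
]

def guess_field(nearby_label: str):
    """Guess a form-field type from the text found near the clicked spot."""
    label = nearby_label.lower()
    for pattern, result in RULES:
        if re.search(pattern, label):
            return result
    return ("text", "plain text box")   # fallback

for label in ["Date of birth", "State", "Employee first name", "Comments"]:
    print(label, "->", guess_field(label))
```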
| Semantics-enriched document exchange | | BIBAK | Full-Text | 239-242 | |
| Jingzhi Guo; Ming Sang Ho | |||
| In e-business development, semantics-oriented document exchange is becoming
important, because it can support cross-domain user connection, business
transaction and collaboration. To provide this support, this paper proposes a
DOC Mechanism to exchange semantically interoperable business documents between
heterogeneous enterprise information systems. This mechanism is designed on a
layered-sign network, which enables any exchanged e-business document to be
independently interpretable without losing semantic consistency. Keywords: XML product map, concept, document engineering, document exchange,
electronic business, representation, semantics, sign | |||
| Document and item-based modeling: a hybrid method for a socio-semantic web | | BIBAK | Full-Text | 243-246 | |
| Jean-Pierre Cahier; Xiaoyue Ma; L'Hédi Zaher | |||
| The paper discusses the challenges of categorising documents and "items of
the world" to promote knowledge sharing in large communities of interest. We
present the DOCMA method (Document and Item-based Model for Action) dedicated
to end-users who have minimal or no knowledge of information science. Community
members can elicit structure and indexed business items stemming from their
query including projects, actors, products, places of interest, and
geo-situated objects. This hybrid method was applied in a collaborative Web
portal in the field of sustainability for the past two years. Keywords: document, folksonomy, method, socio-semantic web, web2.0 | |||
| Structure-aware topic clustering in social media | | BIBA | Full-Text | 247-250 | |
| Julien Dubuc; Sabine Bergler | |||
| The rapid evolution and growth of social media software has enabled hundreds
of millions to interact within on-line communities on a global scale. While
they enable communication through a common set of metaphors, such as discussion
threads and quoting text in replies, this software uses a variety of diverging
ways of representing discussion. Since the meaning of a conversation is defined
not only by the content of a piece of text, but also by the relationships
between pieces of text, part of the meaning of the discussion is obscured from
automated processing.
Search engines, which act as gateways to outsiders into the social text in a community, are reduced to giving an incomplete picture. This paper proposes a model for representing both the content and the structure of social text in a consistent way, enabling automated processing of the structure of the discussion along with its text content. It also describes a method for indexing text that uses this structural information to provide meaningful contexts for paragraphs of interest. It then describes a method for clustering text content into topic groups, using this indexing method, and also using the social structure to make informed decisions about which pieces of text to compare meaningfully. | |||
| Pre-evaluation of invariant layout in functional variable-data documents | | BIBAK | Full-Text | 251-254 | |
| John Lumley | |||
| Layout of content in variable data documents can be computationally
expensive. When very large numbers of almost similar copies of a document are
required, automated pre-evaluation of invariant sections may increase
efficiency of final document generation. If the layout model is functional and
combinatorial in nature (such as in the Document Description Framework), there
are some generalised conservative techniques to do this that involve very
modest changes to implementations, independent of details of the actual
layouts. This paper describes these techniques and how they might be used with
other similar document layout models. Keywords: SVG, XSLT, document construction, functional programming | |||
| Towards a common evaluation strategy for table structure recognition algorithms | | BIBAK | Full-Text | 255-258 | |
| Tamir Hassan | |||
| A number of methods for evaluating table structure recognition systems have
been proposed in the literature, which have been used successfully for
automatic and manual optimization of their respective algorithms.
Unfortunately, the lack of standard, ground-truthed datasets coupled with the
ambiguous nature of how humans interpret tabular data has made it difficult to
compare the obtained results between different systems developed by different
research groups.
With reference to these approaches, we describe our experiences in comparing our algorithm for table detection and structure recognition to another recently published system using a freely available dataset of 75 PDF documents. Based on examples from this dataset, we define several classes of errors and propose how they can be treated consistently to eliminate ambiguities and ensure the repeatability of the results and their comparability between different systems from different research groups. Keywords: evaluation, ground truth, precision, recall, table detection, table
recognition, table structure recognition | |||
| Using feature models for creating families of documents | | BIBAK | Full-Text | 259-262 | |
| Sven Karol; Martin Heinzerling; Florian Heidenreich; Uwe Aßmann | |||
| Variants in a family of office documents are usually created by ad-hoc copy
and paste actions from a set of base documents. As a result, the set of
variants is decoupled from the original documents and is difficult to manage.
In this paper we present a novel approach that uses concepts from Feature
Oriented Domain Analysis (FODA) to specify document families to generate
variants. As a proof of concept, we implemented the Document Feature Mapper
tool, which is based on our previous experience in Software Product Line
Engineering (SPLE) with FODA. In our tool, variant spaces are precisely
specified using feature models and mappings relating features to slices in the
document family. Given a selection of features satisfying the feature model's
constraints, a variant can be derived. To show the applicability of our approach
and tool, we conducted two case studies with documents in the Open Document
Format (ODF). Keywords: ODF, XML, document families, feature models, variants | |||
| Two new aesthetic measures for item alignment | | BIBAK | Full-Text | 263-266 | |
| Aline Duarte Riva; Alexandre Kazuo Seki; João Batista Souza de Oliveira; Isabel Harb Mansour; Ricardo Farias Piccoli | |||
| This paper introduces two methods for measuring the alignment of items on a
page with respect to its left/right margins. The methods are based on the path
followed by the eyes as they follow the items from top to bottom of the page.
Examples are presented and both methods are analyzed with respect to the axioms presented in [2], which describe how a good alignment measure is supposed to behave. Keywords: document aesthetics, page alignment, page layout | |||
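As an illustration of the underlying idea only (not either of the paper's two measures), one can follow the left edges of items from top to bottom and penalise horizontal jumps of this "eye path"; a shorter path indicates better left alignment. The boxes below are made-up coordinates.

```python
def eye_path_penalty(items: list[tuple[float, float]]) -> float:
    """Sum of horizontal jumps between the left edges of successive items.

    `items` is a list of (top_y, left_x) boxes, visited from the top of
    the page downwards. Perfect left alignment gives a penalty of 0.
    """
    ordered = sorted(items)                      # top-to-bottom reading order
    lefts = [x for _, x in ordered]
    return sum(abs(b - a) for a, b in zip(lefts, lefts[1:]))

aligned    = [(10, 50), (40, 50), (70, 50)]
misaligned = [(10, 50), (40, 62), (70, 48)]
print(eye_path_penalty(aligned), eye_path_penalty(misaligned))   # 0 26
```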
| Term frequency dynamics in collaborative articles | | BIBAK | Full-Text | 267-270 | |
| Sérgio Nunes; Cristina Ribeiro; Gabriel David | |||
| Documents on the World Wide Web are dynamic entities. Mainstream information
retrieval systems and techniques are primarily focused on the latest version of a
document, generally ignoring its evolution over time. In this work, we study
the term frequency dynamics in web documents over their lifespan. We use the
Wikipedia as a document collection because it is a broad and public resource
and, more important, because it provides access to the complete revision
history of each document. We investigate the progression of similarity values
over two projection variables, namely revision order and revision date. Based
on this investigation we find that term frequency in encyclopedic documents --
i.e. comprehensive and focused on a single topic -- exhibits a rapid and steady
progression towards the document's current version. The content in early
versions quickly becomes very similar to the present version of the document. Keywords: document dynamics, term frequency, wikipedia | |||
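One simple way to reproduce the kind of progression studied above is to compare each revision's term-frequency vector with the current version using cosine similarity; the sketch below does this for a toy revision history (the revisions are invented, and the paper's exact similarity measure may differ).

```python
import math
import re
from collections import Counter

def tf_vector(text: str) -> Counter:
    return Counter(re.findall(r"[a-zà-ÿ]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

revisions = [
    "Lisbon is a city.",
    "Lisbon is the capital city of Portugal.",
    "Lisbon is the capital and largest city of Portugal, on the Atlantic coast.",
]
current = tf_vector(revisions[-1])
for i, rev in enumerate(revisions):
    sim = cosine(tf_vector(rev), current)
    print(f"revision {i}: similarity to current = {sim:.2f}")
```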
| A file-type sensitive, auto-versioning file system | | BIBAK | Full-Text | 271-274 | |
| Arthur Müller; Sebastian Rönnau; Uwe M. Borghoff | |||
| Auto-versioning file systems offer a simple and reliable interface to
document change control. The implicit versioning of documents at each write
access catches the whole evolution of a document, thus supporting regulatory
compliance rules. Most existing file systems work on low abstraction levels and
track the document evolution on their binary representation. Higher-level
differencing tools allow for a far more meaningful change-tracking, though.
In this paper, we present an auto-versioning file system that is able to handle files depending on their file type. This way, a suitable differencing tool can be assigned to each file type. Our approach supports regulatory compliant storage as well as the archiving of documents. Keywords: auto-versioning, document management, file system, regulatory compliance,
version control | |||
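The core dispatch idea of a file-type sensitive versioning layer can be sketched as follows: pick a differencing tool per file type and fall back to a byte-level comparison for everything else. The tool names and extension mapping are examples, not the paper's implementation, and the external commands are assumed to be installed.

```python
import pathlib
import subprocess

# Example mapping from file extension to an external differencing command.
DIFF_TOOLS = {
    ".xml": ["xmldiff"],        # structural XML diff (if installed)
    ".txt": ["diff", "-u"],     # line-based diff
    ".md":  ["diff", "-u"],
}
FALLBACK = ["cmp", "-l"]        # byte-level comparison for anything else

def diff_versions(old: str, new: str) -> int:
    """Run the differencing tool that matches the file type and return its
    exit status (0 means the two versions are identical)."""
    ext = pathlib.Path(new).suffix.lower()
    cmd = DIFF_TOOLS.get(ext, FALLBACK) + [old, new]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # hypothetical version snapshots kept by the auto-versioning file system
    print(diff_versions("report.v1.xml", "report.v2.xml"))
```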
| Medieval manuscript layout model | | BIBAK | Full-Text | 275-278 | |
| Micheal Baechler; Rolf Ingold | |||
Medieval manuscript layouts are quite complex. In addition to their main
text flow, which can spread over one or several columns, such manuscripts
also contain other textual elements such as insertions, annotations, and
corrections. They are often richly decorated with ornaments, illustrations, and
drop capitals making their layout even more complex. In this paper we propose a
generic layout model to represent their physical structure.
To achieve this goal we propose to use four layers in order to distinguish between the different graphical elements. In this paper we show how this model is used to represent automatic segmentation results and how it allows a quantitative measure of their accuracy. Keywords: annotation, layout, layout model, manuscript, medieval, medieval manuscript,
segmentation | |||
| Using model driven engineering technologies for building authoring applications | | BIBAK | Full-Text | 279-282 | |
| Olivier Beaudoux; Arnaud Blouin; Jean-Marc Jézéquel | |||
| Building authoring applications is a tedious and complex task that requires
a high programming effort. Document technologies, especially XML based ones,
can help in reducing such an effort by providing common bases for manipulating
documents. Still, the overall task consists mainly of writing the application's
source code. Model Driven Engineering (MDE) focuses on generating the source
code from an exhaustive model of the application. In this paper, we illustrate
that MDE technologies can be used to automate the development of authoring
application components, but fail in generating the code of graphical
components. We present our framework, called Malai, that aims to solve this
issue. Keywords: MDE, Malai, Malan, authoring applications | |||
| On Helmholtz's principle for documents processing | | BIBAK | Full-Text | 283-286 | |
| Alexander A. Balinsky; Helen Y. Balinsky; Steven J. Simske | |||
| Keyword extraction is a fundamental problem in text data mining and document
processing. A large number of document processing applications directly depend
on the quality and speed of keyword extraction algorithms. In this article, a
novel approach to rapid change detection in data streams and documents is developed. It is based on ideas from image processing and especially on the Helmholtz Principle from the Gestalt Theory of human perception. Applied to the problem of keyword extraction, it delivers fast and effective tools to identify meaningful keywords using parameter-free methods. We also define a level of meaningfulness of the keywords which can be used to modify the set of keywords depending on application needs. Keywords: gestalt, Helmholtz principle, keyword extraction, meaningful words, rapid
change detection | |||
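In the spirit of the Helmholtz principle, a word occurring in a part of a document is "meaningful" when its observed count is far above what a uniform background model would expect. The sketch below conveys only that intuition by comparing observed counts against the expectation; the paper works with a proper expected-number-of-occurrences computation, and the factor, minimum count, and sample text here are assumptions.

```python
import re
from collections import Counter

def meaningful_words(document: str, part: str, factor: float = 3.0,
                     min_count: int = 2):
    """Flag words whose frequency inside `part` is at least `factor` times
    higher than their frequency in the whole `document` would predict."""
    tokens = lambda s: re.findall(r"[a-z0-9]+", s.lower())
    doc_counts, part_counts = Counter(tokens(document)), Counter(tokens(part))
    doc_len, part_len = sum(doc_counts.values()), sum(part_counts.values())
    flagged = {}
    for word, observed in part_counts.items():
        expected = doc_counts[word] * part_len / doc_len
        if observed >= min_count and observed >= factor * expected:
            flagged[word] = (observed, round(expected, 2))
    return flagged

part = "jbig2 compression jbig2 encoder jbig2 stream"
doc = ("the library scans pages " * 30) + part
print(meaningful_words(doc, part))   # flags 'jbig2' as meaningful in this part
```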