| The future of documents | | BIBAK | Full-Text | 1 | |
| Tom Malloy | |||
| The quantity of the world's information held in the form of documents is
massive and growing. So, the need for a container of information that can be
transported between service participants (people or computers), and that is
decoupled from the operation of those services, is as important today as ever.
The basic properties of documents have been understood for over 5,000 years. While the first electronic documents translated many of these properties to the new medium, they did this in highly application-specific ways, with document formats specific to a particular content type. They also lost some important properties of paper (e.g. signing, annotations). PDF gave us the first universal document format. It supports various content types and extended the common properties supported, but is limited to static, page-oriented content. In the years since PDF was first developed there have been many technological and social changes. We now have the opportunity to take the universal document to the next level. We are calling this new item an Intelligent Document. Intelligent Documents provide a rich, interactive experience to their users. While some of their content is static and embedded, increasingly the document is an interface to live information sources and web services providing personalized and contextually aware content, caching the content as needed to enable on-line and off-line access. The documents can adapt their rendering and interaction to a broad range of devices and capabilities, enabling users to truly control how they want to receive information. The applications that create the documents' data or content may come and go, but Intelligent Documents contain substantial amounts of metadata (increasingly semantic in nature) enabling their reuse and repurposing well beyond their initially conceived lifetime. This leads us to a document structure that is modularized, providing componentization of content, presentation, logic, and universal attributes for all document objects regardless of type. Intelligent Documents may more closely resemble applications in their behavior than traditional documents. The blurring of the distinction between documents and applications is a trend that will only increase over time. Keywords: PDF, document structure | |||
| Structuring documents according to their table of contents | | BIBAK | Full-Text | 2-9 | |
| Hervé Déjean; Jean-Luc Meunier | |||
| In this paper, we present a method for structuring a document according to
the information present in its Table of Contents. The detection of the ToC as
well as the determination of the parts it refers to in the document body rely
on a series of generic properties characterizing any ToC, while its
hierarchization is achieved using clustering techniques. We also report on the
robustness and performance of the method before discussing it in light of
related work. Keywords: document structuring, table of contents recognition | |||
| Towards XML version control of office documents | | BIBAK | Full-Text | 10-19 | |
| Sebastian Rönnau; Jan Scheffczyk; Uwe M. Borghoff | |||
| Office applications such as OpenOffice and Microsoft Office are widely used
to edit the majority of today's business documents: office documents. Usually,
version control systems consider office documents as binary objects, thus
severely hindering collaborative work. Since XML has become a de facto standard
for office applications, we focus on versioning office documents by structured
XML version control approaches. This enables state-of-the-art version control
for office documents.
A basic prerequisite to XML version control is a diff algorithm, which detects structural changes between XML documents. In this paper, we evaluate state-of-the-art XML diff algorithms with respect to their suitability for OpenOffice XML documents and the future OASIS office document standard. It turns out that, due to the specific XML office format, a careful examination of the diff algorithm characteristics is necessary. Therefore, we identify important features for XML diff approaches to handle office documents. We have implemented a first OpenOffice versioning API that can be used in version control systems as a replacement for the line-based or binary diffs that are currently used. Keywords: XML diffing, office applications, version control | |||
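For readers unfamiliar with structural XML diffing, here is a minimal sketch of the idea (illustrative only; this is none of the algorithms the paper evaluates): recursively compare two element trees, reporting text, attribute, and child changes.

```python
import xml.etree.ElementTree as ET

def diff(old, new, path="/"):
    """Naively compare two elements; yield (kind, path, before, after).
    Real XML diff algorithms compute a minimum-cost edit script with
    moves; this sketch only aligns children positionally."""
    if old.tag != new.tag:
        yield ("rename", path, old.tag, new.tag)
    if (old.text or "").strip() != (new.text or "").strip():
        yield ("text", path, old.text, new.text)
    if old.attrib != new.attrib:
        yield ("attrs", path, old.attrib, new.attrib)
    for i, (o, n) in enumerate(zip(old, new)):
        yield from diff(o, n, f"{path}{o.tag}[{i}]/")
    for o in old[len(new):]:
        yield ("delete", path, o.tag, None)
    for n in new[len(old):]:
        yield ("insert", path, None, n.tag)

a = ET.fromstring("<p><b>old</b><i>x</i></p>")
b = ET.fromstring("<p><b>new</b></p>")
print(list(diff(a, b)))
```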
| Influence of fusion strategies on feature-based identification of low-resolution documents | | BIBAK | Full-Text | 20-22 | |
| Ardhendu Behera; Denis Lalanne; Rolf Ingold | |||
| The paper describes a method by which one could use the documents captured
from low-resolution handheld devices to retrieve the originals of those
documents from a document store. The method considers conjunctively two
complementary feature sets. First, the geometrical distribution of the color in
the document's 2D image plane is used. Second, shallow layout features are
considered, owing to the poor resolution of the captured documents. We
propose in this article to fuse those two complementary feature sets in order
to improve document identification performance. Finally, in order to test the
influence of merging strategies on document identification performance, a
synergic method is proposed and evaluated relative to a similar method in which
feature sets are simply considered sequentially. Keywords: document retrieval, document signature, geometrical color distribution,
shallow layout features | |||
| Enabling massive scale document transformation for the semantic web: the universal parsing agent | | BIBAK | Full-Text | 23-25 | |
| Mark A. Whiting; Wendy Cowley; Nick Cramer; Alex Gibson; Ryan Hohimer; Ryan Scott; Stephen Tratz | |||
| The Universal Parsing Agent (UPA) is a document analysis and transformation
program that supports massive scale conversion of information into forms
suitable for the semantic web. UPA provides reusable tools to analyze text
documents; identify and extract important information elements; enhance text
with semantically descriptive tags; and output the information that is needed
in the format and structure that is needed. Keywords: XML, XSLT, document transformation, natural language processing, parsing,
regular expressions | |||
| Exploiting XML technologies for intelligent document routing | | BIBAK | Full-Text | 26-28 | |
| Isaac Cheng; Savitha Srinivasan; Neil Boyette | |||
| Today, XML is increasingly becoming a standard for representation of
semi-structured information such as documents that combine content and
metadata. Typical document management applications include document
representation, authoring, validation, and document routing in support of a
business process. We propose a framework for intelligent document routing that
exploits and extends XML technologies to automate dynamic document routing and
real-time update of business routing logic. The document-routing logic is
stored in a secure repository and executed by a business rules engine. During
rule execution, the input parameters of each business rule are bound with the
data from each inbound XML document. This document routing framework is
validated in a real-world implementation with reduced development cost,
accelerated rule update cycle and simplified administration efforts. Keywords: B2B, DMS, IT, J2EE, MIS, MQ, ROI, SO, SOA, business transformation,
downsize, economy, global, logistics, organizational change, outsource, policy,
training, turnover, worldwide | |||
| A system of collecting domain-specific jargons | | BIBK | Full-Text | 29 | |
| Hiromi Oda | |||
Keywords: domain specific expressions, community jargons | |||
| Emphasis for highly customized documents | | BIBK | Full-Text | 30 | |
| Helen Balinsky; Maurizio Pilu | |||
Keywords: aesthetic measure, emphasis, high customization, publishing | |||
| The COG Scrapbook | | BIBAK | Full-Text | 31 | |
| Steven R. Bagley; David F. Brailsford | |||
| Creating truly dynamic documents from disparate components is not an easy
process, especially for untrained end users. Existing packages such as Adobe
InDesign or Quark XPress are designed for graphic-arts professionals and even
these are not optimised for the process of integrating disparate components.
The 'COG Scrapbook' builds on our PDF-based Component Object Graphics (COGs)
technology to create a radically new 'drag and drop' software approach to
creating dynamic documents for use in educational environments.
The initial focus of the COG Scrapbook is to enable groups of people to collaboratively create material ranging from smaller-sized (e.g. A4) personal scrapbooks through to large (e.g. A0) posters. Some example scenarios where the technology could be utilized are: university students producing posters for class assignments; school teachers and students jointly creating classroom wall displays; and the creation of a scrapbook of annotated digital photographs resulting from school field trips and museum visits. The poster illustrating the present state of this work will itself have been constructed using COG Scrapbook software. Keywords: COGs, FormXObject, PDF, graphic objects | |||
| A framework for structure, layout & function in documents | | BIBAK | Full-Text | 32-41 | |
| John Lumley; Roger Gimson; Owen Rees | |||
| The Document Description Framework (DDF) is a representation for
variable-data documents. It supports very high flexibility in the type and
extent of variation, considerably beyond the 'copy-hole' or
flow-based mechanisms of existing formats and tools. DDF is based on holding
application data, logical data structure and presentation as well as
constructional 'programs' together within a single document. DDF documents can
be merged with other documents, bound to variable values incrementally, combine
several types of layout and styling in the same document and support final
delivery to different devices and page-ready formats. The framework uses XML
syntax and fragments of XSLT to describe 'programmatic construction' of a bound
document. DDF is extensible, especially in the ability to add new types of
layout and interoperability between components in different formats. In this
paper we describe the motivation for DDF, the major design choices and how we
evaluate a DDF document with specific data values. We show through implemented
examples how it can be used to construct high-complexity and variability
presentations and how the framework complements and can use many existing
XML-based document formats, such as SVG and XSL-FO. Keywords: SVG, XML, XSLT, document construction, functional programming | |||
| An environment for maintaining computation dependency in XML documents | | BIBAK | Full-Text | 42-51 | |
| Dongxi Liu; Zhenjiang Hu; Masato Takeichi | |||
| In the domain of XML authoring, there have been many tools to help users to
edit XML documents. These tools make it easier to produce complex documents by
using such technologies as syntax-directed or presentation-oriented editing,
etc. However, when an XML document contains data with some computation
dependency among them, these tools cannot free users from the burden of
maintaining this dependency relationship. By computation dependency, we mean
that some data are computed from other data in the same document.
In this paper, we present an environment for authoring XML documents, in which users can express the data dependency relationship in one document explicitly rather than implicitly in their minds. Under this environment, the dependent parts of the document are represented as expressions, which in turn can be evaluated to generate the dependent data. Therefore, users need not compute the dependent data first and then input it manually, as required by current authoring tools. Keywords: XML, computation dependency, functional programming, lazy evaluation,
programmable structured document | |||
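The idea can be shown in miniature (a toy sketch; the paper's environment is editor-based and far richer, and the <calc> vocabulary here is invented for illustration): expressions embedded in the document are evaluated, in document order, to produce the dependent data.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<invoice>
  <price name="net">100</price>
  <price name="vat"><calc>net * 0.2</calc></price>
  <price name="total"><calc>net + vat</calc></price>
</invoice>""")

env = {}
for p in doc.iter("price"):
    calc = p.find("calc")
    if calc is not None:
        # eval() is only acceptable in a toy; a real system needs a
        # safe expression evaluator and proper dependency tracking.
        env[p.get("name")] = eval(calc.text, {}, env)
        p.remove(calc)
        p.text = str(env[p.get("name")])   # replace expression by its value
    else:
        env[p.get("name")] = float(p.text)
print(env)  # {'net': 100.0, 'vat': 20.0, 'total': 120.0}
```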
| Compiling XPath for streaming access policy | | BIBAK | Full-Text | 52-54 | |
| Pierre Genevès; Kristoffer Rose | |||
| We show how the full XPath language can be compiled into a minimal subset
suited for stream-based evaluation. Specifically, we show how XPath
normalization into a core language as proposed in the current W3C "Last Call"
draft of the XPath/XQuery Formal Semantics can be extended such that both the
context state and reverse axes can be eliminated from the core XPath (and
potentially XQuery) language. This allows execution of (almost) full XPath on
any of the emerging streaming subsets. Keywords: XPath, compilation, static rewriting, streaming | |||
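The flavour of rewrite involved can be seen in a classic equivalence (a hedged example of reverse-axis elimination in general, not the paper's core-language normalization; it assumes the third-party lxml package for XPath evaluation):

```python
from lxml import etree

doc = etree.fromstring(
    "<root><b id='1'/><a/><b id='2'/><b id='3'/><a/><c/></root>")

# Reverse-axis query: b elements that precede some a sibling.
reverse = doc.xpath("//a/preceding-sibling::b")
# Equivalent forward-only form, suitable for streaming evaluation:
# b is a preceding sibling of some a iff b has a following sibling a.
forward = doc.xpath("//b[following-sibling::a]")

assert set(reverse) == set(forward)
print([b.get("id") for b in forward])  # ['1', '2', '3']
```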
| Static optimization of XSLT stylesheets: template instantiation optimization and lazy XML parsing | | BIBAK | Full-Text | 55-57 | |
| Manaka Kenji; Sato Hiroyuki | |||
The increasing popularity of XSLT brings with it a demand for more
efficient processing. In this paper, we propose two optimization techniques
based on template caller-callee analysis. One is the template instantiation
optimization which analyzes a stylesheet and identifies the templates to be
instantiated before transformation. The other is the static lazy XML parsing
optimization that constructs a pruned XML tree by statically identifying the
nodes that are actually referenced. Furthermore, we have implemented both
optimizations on Saxon and evaluated their performance. In these
experiments, both proved practically useful and improved XSLT
performance. Keywords: XSLT, lazy evaluation, optimization, saxon | |||
| Generating form-based user interfaces for XML vocabularies | | BIBAK | Full-Text | 58-60 | |
| Y. S. Kuo; N. C. Shih; Lendle Tseng; Hsun-Cheng Hu | |||
| So far, many user interfaces for XML data (documents) have been constructed
from scratch for specific XML vocabularies and applications. The tool support
for user interfaces for XML data is inadequate. Forms-XML is an interactive
component invoked by applications for generating user interfaces for prescribed
XML vocabularies automatically. Based on a given XML schema, the component
generates a hierarchy of HTML forms for users to interact with and update XML
data compliant with the given schema. The user interface Forms-XML generates is
simple, offers abundant guidance and hints to the user, and can be
customized by user interface designers as well as developers. Keywords: XML, XML editing, user interface | |||
| Encapsulating and manipulating component object graphics (COGs) using SVG | | BIBAK | Full-Text | 61-63 | |
| Alexander J. Macdonald; David F. Brailsford; Steven R. Bagley | |||
| Scalable Vector Graphics (SVG) has an imaging model similar to that of
PostScript and PDF but the XML basis of SVG allows it to participate fully, via
namespaces, in generalised XML documents.
There is increasing interest in using SVG as a Page Description Language and we examine ways in which SVG document components can be encapsulated in contexts where SVG will be used as a rendering technology for conventional page printing. Our aim is to encapsulate portions of SVG content (SVG COGs) so that the COGs are mutually independent and can be moved around a page, while maintaining invariant graphic properties and with guaranteed freedom from side effects and mutual interference. Parallels are drawn between COG implementation within SVG's tree-based inheritance mechanisms and an earlier COG implementation using PDF. Keywords: PDF, SVG, XML, component object graphics, parameterization | |||
| Support for arbitrary regions in XSL-FO | | BIBAK | Full-Text | 64-73 | |
| Ana Cristina B. da Silva; Joao B. S. de Oliveira; Fernando T. M. Mano; Thiago B. Silva; Leonardo L. Meirelles; Felipe R. Meneguzzi; Fabio Giannetti | |||
| This paper proposes an extension of the XSL-FO standard which allows the
specification of an unlimited number of arbitrarily shaped page regions. These
extensions are built on top of XSL-FO 1.1 to enable flow content to be laid out
into arbitrary shapes and to allow for page layouts currently available only in
desktop publishing software. Such a proposal is expected to leverage XSL-FO
towards usage as an enabling technology in the generation of content intended
for personalized printing. Keywords: LaTeX, SVG, XML, XSL-FO, arbitrary shapes, digital publishing, typesetting | |||
| Toward tighter tables | | BIBAK | Full-Text | 74-83 | |
| Nathan Hurst; Kim Marriott; Peter Moulder | |||
| Tables are provided in virtually all document formatting systems and are one
of the most powerful and useful design elements in current web document
standards. Unfortunately, optimal layout of tables which contain text is
NP-hard for reasonable layout requirements such as minimizing table height for
a given width [1]. We present two new independently-applicable techniques for
table layout. The first technique is to solve a continuous approximation to the
original layout problem by using a constant-area approximation of the cell
content combined with a minimum width and height for the cell. The second
technique starts by setting each column to its narrowest possible width and
then iteratively reduces the height of the table by judiciously widening its
columns. This second technique uses the actual text and line-break rules rather
than the constant-area approximation used by the first technique. We also
investigate two hybrid approaches both of which use iterative column widening
to improve the quality of an initial solution found using a different
technique. In the first hybrid approach we use the continuous approximation
technique to compute the initial column widths while in the second hybrid
approach a modification of the HTML table layout algorithm is used to compute
the initial widths. We found that all four techniques are reasonably fast and
give significantly more compact layout than that of HTML layout engines. Keywords: conic programming, optimisation techniques, table layout | |||
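A rough sketch of the iterative column-widening idea (greedy, and using a character-count approximation of cell height rather than real line breaking; the names and the height model here are ours, not the paper's):

```python
import math

def table_height(cells, widths):
    """A cell of n characters wraps to ceil(n / width) lines; a row is
    as tall as its tallest cell; the table is the sum of its rows."""
    return sum(max(math.ceil(len(cells[r][c]) / widths[c])
                   for c in range(len(widths)))
               for r in range(len(cells)))

def widen(cells, max_total, step=1):
    # Start every column at its narrowest feasible width (longest word).
    widths = [max(max(len(w) for w in cells[r][c].split())
                  for r in range(len(cells)))
              for c in range(len(cells[0]))]
    # Greedily widen whichever column most reduces table height.
    while sum(widths) + step <= max_total:
        best, gain = None, 0
        for c in range(len(widths)):
            trial = widths[:]
            trial[c] += step
            g = table_height(cells, widths) - table_height(cells, trial)
            if g > gain:
                best, gain = c, g
        if best is None:
            break  # no widening reduces height any further
        widths[best] += step
    return widths

cells = [["a table layout example", "x"],
         ["short", "another long cell of text"]]
print(widen(cells, max_total=30))
```

Each round widens the single column whose widening most reduces total height, stopping when the width budget is spent or no widening helps.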
| Generative semantic clustering in spatial hypertext | | BIBAK | Full-Text | 84-93 | |
| Andruid Kerne; Eunyee Koh; Vikram Sundaram; J. Michael Mistrot | |||
| This paper presents an iterative method for generative semantic clustering
of related information elements in spatial hypertext documents. The goal is to
automatically organize them in ways that are meaningful to the user. We
consider a process in which elements are gradually added to a spatial
hypertext. The method for generating meaningful layout is based on a
quantitative model that measures and represents the mutual relatedness between
each new element and those already in the document. The measurement is based on
attributes such as metadata, term vectors, user interest expressions, and
document locations. We call this model relatedness potential, because it
represents how much the new element is related and thus attracted to existing
elements as a vector field across the space. Using this field as a gradient
potential, the new element will be placed near the most attracted elements,
forming clusters of related elements. The relative magnitude of contribution of
attributes to relatedness potential can be controlled through an interactive
interface.
Unlike prior clustering methods such as k-means and self-organizing-maps, relatedness potential works well in iterative systems, in which the collection of elements is not defined a priori. Further, users can invoke relatedness potential to re-cluster elements, as they engage in on-the-fly provisional acts of direct-manipulation reorganization and latching of a few of the most significant elements. A preliminary study indicates that users find this method generates spatial hypertext documents that are easier to read. Keywords: clustering, collections, document layout, generative hypermedia, information
triage, mixed-initiatives, spatial hypertext | |||
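A toy rendering of the relatedness-potential idea (our simplification: only term-vector and metadata attributes, and the new element is placed at the weighted centroid of its attractors rather than by following a gradient field across the space):

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n else 0.0

def relatedness(new, existing, weights):
    """Weighted sum of per-attribute similarities between the new
    element and each existing one (other attributes plug in alike)."""
    scores = []
    for e in existing:
        tv = cosine(new["terms"], e["terms"])
        md = len(new["meta"] & e["meta"]) / max(1, len(new["meta"] | e["meta"]))
        scores.append(weights["terms"] * tv + weights["meta"] * md)
    return np.array(scores)

def place(new, existing, positions, weights):
    # Put the new element at the weighted centroid of its attractors.
    p = relatedness(new, existing, weights)
    w = p / p.sum() if p.sum() else np.full(len(p), 1 / len(p))
    return (w[:, None] * np.asarray(positions, float)).sum(axis=0)

existing = [{"terms": [1, 0, 1], "meta": {"pdf"}},
            {"terms": [0, 1, 0], "meta": {"xml", "svg"}}]
new = {"terms": [1, 0, 0], "meta": {"pdf", "xml"}}
print(place(new, existing, [(0, 0), (10, 10)], {"terms": 0.7, "meta": 0.3}))
```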
| Engineering information in documents: leaving room for uncertainty | | BIBA | Full-Text | 94 | |
| Dick C. A. Bulterman | |||
| Much of the work in engineering complex documents has been on building a
path from a document model (or document abstraction) to one or more instances
of that document during a publication phase. While many approaches exist to
help create multiple instances from a single document, all of these instances
share the property that the content (once created) is fixed.
In this talk, we consider more flexible bindings of information packaging and publication. The main goal of this work is to allow a large degree of uncertainty in document content -- uncertainty that is motivated by incremental authoring processes (where the contents of a document are enriched over its lifetime), uncertainty that is motivated by continuous adaptation of content based on changes in the nature and architecture of the underlying rendering environment, and uncertainty that is motivated by temporal navigation through various sets of interwoven, conditionally-active content layers. Several use cases will be discussed (including entertainment, accessible documents and medical applications), and a flexible, non-monolithic document reader/annotator architecture will be shown based on the Ambulant renderer. | |||
| Constrained XSL formatting objects for adaptive documents | | BIBAK | Full-Text | 95-97 | |
| Gil Loureiro; Francisco Azevedo | |||
| The pagination strategy of XSL Formatting Objects (XSL:FO) is based on a
"break if no fit" approach that often produces one last page with only one
printable object due to a lack of space on the previous page. In a batch,
high-volume, personalized document production scenario, this can mean a high
cost in extra sheets of paper with substantial free space, and documents with
a poor look. In this paper, we describe a new approach to solve the pagination
problem of XSL:FO documents where space use efficiency and aesthetic aspects
are considered. The approach is based on constraint satisfaction using Mixed
Integer Linear Programming (MILP) models. The starting point was the FO part of
the XSL specification, to which we added a Constrained XSL:FO extension (referred to
as CXSL:FO) that delivers tags used to declare constraints on size and font
adjustments of target FO objects. This extension is added to our reengineered
FOP formatter that builds and solves an MILP model to find the global optimal
solution corresponding to a document with the minimum number of pages, each one
being maximally filled. We show its effectiveness in the generation of
personalized welcome letters. Keywords: MILP, XSL:FO, adaptive documents, constraints, pagination | |||
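The spirit of the MILP formulation can be sketched as a small ordered bin-packing model (a toy, not the paper's CXSL:FO model; it assumes the PuLP package for building and solving the MILP):

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

heights = [120, 300, 250, 180, 220]   # block heights in points
PAGE = 600                            # usable page height
P = range(len(heights))               # at most one page per block

prob = LpProblem("pagination", LpMinimize)
x = {(i, p): LpVariable(f"x_{i}_{p}", cat="Binary")
     for i in range(len(heights)) for p in P}
used = {p: LpVariable(f"used_{p}", cat="Binary") for p in P}

prob += lpSum(used[p] for p in P)                 # minimise pages used
for i in range(len(heights)):                     # place every block once
    prob += lpSum(x[i, p] for p in P) == 1
for p in P:                                       # never overflow a page
    prob += lpSum(heights[i] * x[i, p]
                  for i in range(len(heights))) <= PAGE * used[p]
for i in range(len(heights) - 1):                 # preserve reading order
    prob += lpSum(p * x[i + 1, p] for p in P) >= lpSum(p * x[i, p] for p in P)

prob.solve()
for p in P:
    blocks = [i for i in range(len(heights)) if x[i, p].value() == 1]
    if blocks:
        print(f"page {p}: blocks {blocks}")
```

The paper's model additionally lets the solver adjust object sizes and fonts within declared bounds, so pages can be maximally filled rather than merely not overflowed.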
| Content interaction and formatting for mobile devices | | BIBAK | Full-Text | 98-100 | |
| Tayeb Lemlouma; Nabil Layaïda | |||
| In this paper we present an experimental content adaptation system for
mobile devices. The system enables the presentation of multimedia content and
considers the problem of small screen display of mobile terminals. The approach
combines structural and media adaptation with content formatting, and
proposes a system that handles user interaction and content navigation. Keywords: content adaptation, content formatting, evaluation, mobile-devices, user
interaction | |||
| Schema matching for transforming structured documents | | BIBAK | Full-Text | 101-110 | |
| Aida Boukottaya; Christine Vanoirbeek | |||
| Structured document content reuse is the problem of restructuring and
translating data structured under a source schema into an instance of a target
schema. A notion closely tied with structured document reuse is that of
structure transformations. Schema matching is a critical step in structured
document transformations. Manual matching is expensive and error-prone. It is
therefore important to develop techniques to automate the matching process and
thus the transformation process. In this paper, we contribute to both
understanding the matching problem in the context of structured document
transformations and developing matching methods whose output serves as the
basis for the automatic generation of transformation scripts. Keywords: document structure transformations, schema matching | |||
| Textual indexation of ancient documents | | BIBA | Full-Text | 111-117 | |
| Yann Leydier; Frank LeBourgeois; Hubert Emptoz | |||
In recent years, many levels of indexation have been developed to allow fast retrieval of digitized documents. Among all the ways of indexing a document, textual indexation allows the finest queries on the documents' content. Usually, the plain-text transcription of a digitized document is obtained by applying OCR (Optical Character Recognition) software to it. What if the OCR fails? Indeed, OCR systems are inefficient on low-quality printed documents, and are unsuited to the processing of ancient fonts. Furthermore, OCR is not applicable to manuscript text recognition. In this paper we introduce two alternative methods of accessing text through the image: Computer Assisted Transcription and Word Spotting. | |||
| A fast orientation and skew detection algorithm for monochromatic document images | | BIBAK | Full-Text | 118-126 | |
| Bruno Tenório Ávila; Rafael Dueire Lins | |||
| Very often in the digitization process, documents are either not placed with
the correct orientation or are rotated by small angles relative to the
original image axis. These factors make the visualization of images by human
users more difficult, increase the complexity of any sort of automatic image
recognition, degrade the performance of OCR tools, increase the space needed
for image storage, etc. This paper presents a fast algorithm for orientation
and skew detection for complex monochromatic document images, which is capable
of detecting any document rotation at a high precision. Keywords: monochromatic document image, orientation and skew detection | |||
| A statistical method for binary classification of images | | BIBAK | Full-Text | 127-129 | |
| Steven J. Simske; Dalong Li; Jason S. Aronoff | |||
The classification of documents with sparse text, as well as video analysis, relies
on accurate image classification. We herein present a method for binary
classification that accommodates any number of individual classifiers. Each
individual classifier is defined by the critical point between its two means,
and its relative weighting is inversely proportional to its expected error
rate. Using 10 simple image analysis metrics, we distinguish a set of "natural"
and "city" scenes, providing a "semantically meaningful" classification. The
optimal combination of 5 of these 10 classifiers provides 85.8% accuracy on a
small (120 image) feasibility corpus. When this feasibility corpus is then
split into half training and half testing images, the mean accuracy of the
optimum set of classifiers was 81.7%. Accuracy as high as 90% was obtained for
the test set when training percentage was increased. These results demonstrate
that an accurate classifier can be constructed from a large pool of simple
classifiers through the use of the statistical ("Normal") classification method
described herein. Keywords: binary classification, classifier, combined classifiers, image
classification, normal | |||
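A sketch of the statistical combination described in the abstract (per-metric thresholds at the midpoint between class means, weights inversely proportional to error; we use training error as a stand-in for the expected error rate, and synthetic Gaussian data in place of the image metrics):

```python
import numpy as np

def train(X, y):
    """One threshold classifier per metric: the decision point is the
    midpoint between the two class means; its weight is 1 / error."""
    models = []
    for j in range(X.shape[1]):
        m0, m1 = X[y == 0, j].mean(), X[y == 1, j].mean()
        thr = (m0 + m1) / 2
        sign = 1 if m1 > m0 else -1          # which side is class 1
        pred = ((X[:, j] - thr) * sign > 0).astype(int)
        err = max(np.mean(pred != y), 1e-6)  # avoid division by zero
        models.append((j, thr, sign, 1.0 / err))
    return models

def predict(models, X):
    votes = np.zeros(X.shape[0])
    total = sum(w for *_, w in models)
    for j, thr, sign, w in models:
        votes += w * ((X[:, j] - thr) * sign > 0)
    return (votes / total > 0.5).astype(int)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(1, 1, (50, 10))])
y = np.r_[np.zeros(50, int), np.ones(50, int)]
print("train accuracy:", (predict(train(X, y), X) == y).mean())
```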
| A new rotation algorithm for monochromatic images | | BIBAK | Full-Text | 130-132 | |
| Bruno Tenório Ávila; Rafael Dueire Lins; Lamberto Oliveira | |||
| The classical rotation algorithm applied to monochromatic images introduces
white holes in black areas, making edges uneven and disconnecting neighboring
elements. Several algorithms in the literature address only the white hole
problem. This paper proposes a new algorithm that solves those three problems,
producing better quality images. Keywords: monochromatic image rotation, skew correction | |||
| Interaction between paper and electronic documents | | BIBAK | Full-Text | 133 | |
| Michael Gormish | |||
| Documents today almost always exist in two forms: paper and electronic. Many
documents, especially legacy documents start as paper, but are then scanned and
recognized. Other documents are started electronically but then printed for
easy reading, annotation, or distribution. Some documents are scanned, operated
on electronically, then printed. Often machine-readable information, e.g.
barcodes or RFID tags, is added to paper documents to allow association with
the electronic document, or with "meta-data" in some database. Sometimes the
ability to go back and forth between paper and electronic forms,
round-tripping, is important; at other times the two forms are fundamentally
different.
While the end of paper in the office has long been predicted, the actual volume of printed material continues to rise. Electronic documents have, in fact, greatly increased the use of paper. This panel addresses making paper more useful in an electronic document world, and making electronic databases deal with paper. There are obvious challenges, including scanning paper documents and printing electronic ones, but there are additional opportunities, including using paper to summarize and access multimedia documents, and using paper to control electronic actions. Keywords: classification, document analysis, document databases, enterprise document,
machine identifiers, model, paper manifestation, printing, scanning,
segmentation | |||
| Injecting information into atomic units of text | | BIBAK | Full-Text | 134-142 | |
| Yannis Haralambous; Gábor Bella | |||
| This paper presents a new approach to text processing, based on textemes.
These are atomic text units generalising the concepts of character and glyph by
merging them in a common data structure, together with an arbitrary number of
user-defined properties. In the first part, we give a survey of the notions of
character and glyph and their relation with Natural Language Processing models,
some visual text representation issues and strategies adopted by file formats
(SVG, PDF, DVI) and software (Uniscribe, Pango). In the second part we show
applications of textemes to various text processing problems: ligatures, variant
glyphs and other OpenType-related properties, hyphenation, color and other
presentation attributes, Arabic form and morphology, CJK spacing, metadata,
etc. Finally we describe how the Omega typesetting system implements texteme
processing as an example of a generalised approach to input character stream
parsing, internal representation of text, and modular typographic
transformations. In the data flow from input to output, whether in memory or
through serializations in auxiliary data files, textemes progressively
accumulate information that is used by Omega's paragraph builder engine and
included in the output DVI file. We show how this additional information
increases efficiency of conversions to other file formats such as PDF or SVG.
We conclude this paper by presenting interesting potential applications of
texteme methods in document engineering. Keywords: OpenType, PDF, SVG, Unicode, character, glyph, multilingual typesetting,
omega, texteme | |||
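A minimal data-structure reading of the texteme idea (a simplified sketch, not Omega's actual representation):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Texteme:
    """Atomic text unit merging character and glyph, plus arbitrary
    user-defined properties."""
    char: Optional[str] = None    # underlying character(s), if known
    glyph: Optional[int] = None   # font glyph index, if shaped
    props: dict = field(default_factory=dict)

# An 'fi' ligature: one glyph carrying two underlying characters, so
# search and hyphenation still see "fi" while rendering uses the glyph.
lig = Texteme(char="fi", glyph=0xFB01,
              props={"color": "black", "breakable": False})
print(lig.char, hex(lig.glyph), lig.props)
```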
| Classifying XML tags through "reading contexts" | | BIBA | Full-Text | 143-145 | |
| Xavier Tannier; Jean-Jacques Girardot; Mihaela Mathieu | |||
Some tags used in XML documents create arbitrary breaks in the natural flow of the text. This may constitute an impediment to the application of some methods of document engineering. This article introduces the concept of "reading contexts" and gives clues for handling it theoretically and in practice. This work should notably make it possible to recognize emphasis tags in a text, to define a new concept of term proximity in structured documents, to improve indexing techniques, and to open up the way to advanced linguistic analyses of XML corpora. | |||
| XML active transformation (eXAcT): transforming documents within interactive systems | | BIBAK | Full-Text | 146-148 | |
| Olivier Beaudoux | |||
| Stylesheets and batch transformations are the most widely used techniques to
transform "abstract" documents into target presentation documents. Despite the
recent introduction of incremental transformations, several important features
required by interactive systems are yet to be addressed, such as multiple
sources (e.g. preferences and resources), multiple targets (e.g. multiple
views), source-to-target linking (e.g. interacting with the source via the
target), and bidirectional linking (e.g. interacting directly with the
target). This paper proposes the use of XML Active Transformations (eXAcT) in
order to fulfil these requirements. The eXAcT specification is based on the
definition of two new DOM node types, active fragment and anchor, and on a
transformation process inspired by XSLT. Our jaXAT implementation toolkit
allows the active transformation of any DOM document into (but not limited to)
SVG presentations. Keywords: GUI, SVG, XML, active transformations, authoring tools | |||
| Prefiltering techniques for efficient XML document processing | | BIBAK | Full-Text | 149-158 | |
| Chia-Hsin Huang; Tyng-Ruey Chuang; Hahn-Ming Lee | |||
| Document Object Model (DOM) and Simple API for XML (SAX) are the two major
programming models for XML document processing. Each, however, has its own
efficiency limitation. DOM assumes an in-core representation of XML documents
which can be problematic for large documents. SAX needs to scan over the
document in a linear manner in order to locate the interesting fragments.
Previously, we have used tree-to-table mapping and indexing techniques to help
answer structural queries to large, or large collections of, XML documents. In
this paper, we generalize the previous techniques into a prefiltering framework
where repeated access to large XML documents can be efficiently carried out
within the existing DOM and SAX models. The prefiltering framework essentially
uses a tiny search engine to locate useful fragments in the target XML
documents by approximately executing the user's queries. Those fragments are
gathered into a candidate-set XML document, which is returned to the user's DOM-
or SAX-based applications for further processing. This results in a practical
and efficient model of XML processing, especially when the XML documents are
large and infrequently updated, but are frequently being queried. Keywords: DOM, SAX, prefiltering, structural query, two-phased XML processing model | |||
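The prefiltering idea in miniature (a toy stand-in for the paper's index-backed engine: one streaming pass collects matching fragments into a small candidate-set document, on which ordinary DOM code then runs cheaply):

```python
import io
import xml.etree.ElementTree as ET

def prefilter(xml_text, tag):
    """Stream once with iterparse, collect fragments whose element tag
    matches the query, and wrap them in a candidate-set document."""
    root = ET.Element("candidates")
    for _, elem in ET.iterparse(io.StringIO(xml_text), events=("end",)):
        if elem.tag == tag:
            root.append(elem)
    return ET.ElementTree(root)

doc = "<lib><book><t>A</t></book><cd/><book><t>B</t></book></lib>"
cands = prefilter(doc, "book")
# Ordinary DOM-style processing now touches only the candidates.
print([t.text for t in cands.getroot().iter("t")])  # ['A', 'B']
```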
| Event points: annotating XML documents for remote sharing | | BIBAK | Full-Text | 159-161 | |
| Olivier Beaudoux | |||
| Collaboration is heavily based on sharing documents. However, most groupware
toolkits do not directly support document sharing, but rather focus on
supporting mechanisms such as remote concurrent access to shared objects. We
propose the notion of event point as a single and unified concept for defining
sharing capabilities of XML documents and introduce four types of event points
for real-time groupware: replication, copy, echo, and synchronization. These
event points support such collaborative features as real-time sharing,
synchronization, telepointing, localization, and echo. The paper presents the
concept of event point, its implementation in the DoPIdom toolkit, and some
sample uses in our Sovigo drawing tool. Keywords: CSCW toolkit, XML documents, real-time groupware | |||
| Managing syntactic variation in text retrieval | | BIBAK | Full-Text | 162-164 | |
| Jesús Vilares; Carlos Gómez-Rodríguez; Miguel A. Alonso | |||
| Information Retrieval systems are limited by the linguistic variation of
language. The use of Natural Language Processing techniques to manage this
problem has been studied for a long time, but mainly focusing on English. In
this paper we deal with European languages, taking Spanish as a case in point.
Two different sources of syntactic information, queries and documents, are
studied in order to increase the performance of Information Retrieval systems. Keywords: information retrieval, natural language processing, shallow parsing | |||
| Integrating translation services within a structured editor | | BIBAK | Full-Text | 165-167 | |
| Ali Choumane; Hervé Blanchon; Cécile Roisin | |||
| Fully automatic machine translation cannot produce high quality translation;
Dialog-Based Machine Translation (DB-MT) is the only way to provide authors
with a means of translating documents in languages they have not mastered, or
do not even know. With such an environment, the author must help the system to
"understand" the document by means of an interactive disambiguation step. In
this paper we study the consequences of integrating the DB-MT services within a
structured document editor (Amaya). The source document (named edited document)
needs a companion document enriched with different data produced during the
interactive translation process (question trees, answers of the author,
translations). The edited document also needs to be enriched (annotated) in
order to enable access to the question trees. The enriched edited document and
the companion document have to be synchronized in case the edited document is
further updated. Keywords: DBMT, XML document, editing of structured documents, interactive
disambiguation, self-explaining document | |||
| Towards active web clients | | BIBAK | Full-Text | 168-176 | |
| Vincent Quint; Irène Vatton | |||
| Recent developments of document technologies have strongly impacted the
evolution of Web clients over the last fifteen years, but not all Web clients
have taken the same advantage of these advances. In particular, mainstream tools
have put the emphasis on accessing existing documents to the detriment of a
more cooperative usage of the Web. However, in the early days, Web users were
able to go beyond browsing and to get more actively involved. This paper
presents the main features needed to make Web clients more active and creative
tools, by taking advantage of the latest advances of document technology. These
features are implemented in Amaya, a user agent that supports several languages
from the XML family and integrates seamlessly such complementary
functionalities as browsing, editing, publishing, and annotating. Keywords: XML documents, authoring, compound documents, style languages, web user
agent | |||
| Enhancing composite digital documents using XML-based standoff markup | | BIBAK | Full-Text | 177-186 | |
| Peter L. Thomas; David F. Brailsford | |||
| Document representations can rapidly become unwieldy if they try to
encapsulate all possible document properties, ranging from abstract structure
to detailed rendering and layout.
We present a composite document approach wherein an XML-based document representation is linked via a 'shadow tree' of bi-directional pointers to a PDF representation of the same document. Using a two-window viewer any material selected in the PDF can be related back to the corresponding material in the XML, and vice versa. In this way the treatment of specialist material such as mathematics, music or chemistry (e.g. via 'read aloud' or 'play aloud') can be activated via standard tools working within the XML representation, rather than requiring that application-specific structures be embedded in the PDF itself. The problems of textual recognition and tree pattern matching between the two representations are discussed in detail. Comparisons are drawn between our use of a shadow tree of pointers to map between document representations and the use of a code-replacement shadow tree in technologies such as XBL. Keywords: MathML, MusicXML, PDF, XBL, XML, composite documents, standoff markup | |||
| Content publishing framework for interactive paper documents | | BIBAK | Full-Text | 187-196 | |
| Moira C. Norrie; Alexios Palinginis; Beat Signer | |||
| Paper persists as an important medium for documents and this has motivated
the development of new technologies for interactive paper that enable actions
on paper to be linked to digital actions. A major issue that remains is how to
integrate these technologies into the document life cycle and, specifically,
how to facilitate the authoring of links between printed documents and digital
documents and services. We describe how we have extended a general web
publishing framework to support the production of interactive paper documents,
thereby integrating paper as a new web channel in a platform for multi-channel
access. Keywords: interactive paper, publishing framework | |||
| Document digitization lifecycle for complex magazine collection | | BIBAK | Full-Text | 197-206 | |
| Sherif Yacoub; John Burns; Paolo Faraboschi; Daniel Ortega; Jose Abad Peiro; Vinay Saxena | |||
| The conversion of large collections of documents from paper to digital
formats that are suitable for electronic archival is a complex multi-phase
process. The creation of good quality images from paper documents is just one
phase. To extract the relevant information they contain, with an accuracy that
fits the purpose of target applications, an automated document analysis system
and a manual verification/review process are needed. The automated system needs
to perform a variety of analysis and recognition tasks in order to reach an
accuracy level that minimizes the manual correction effort downstream.
This paper describes the complete process and the associated technologies, tools, and systems needed for the conversion of a large collection of complex documents and deployment for online web access to its information-rich content. We used this process to recapture 80 years of Time magazine. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy. Keywords: document analysis and understanding, document digitization, document
engineering, preservation of historical content | |||
| A programming environment for demand-driven processing of network XML data and its performance evaluation | | BIBAK | Full-Text | 207-216 | |
| Masakazu Yamanaka; Kenji Niimura; Tomio Kamada | |||
| This paper proposes a programming environment for Java that processes
network XML data in a demand-driven manner to return quick initial responses.
Our system provides a data binding tool and a tree operation package, and the
programmer can easily handle network XML data as tree-based operations using
these facilities. For efficiency, demand-driven data binding allows the
application to start the processing of a network XML document before the
arrival of the whole data, and our tree operators are also designed to start
the calculation using the initially accessible part of the input data. Our
system uses multithreading in its implementation, with optimization
techniques to reduce runtime overheads. It can return initial responses
quickly, and often shortens the total execution time due to the effects of
latency hiding and the reduction of memory usage. Compared with an ordinary
tree-based approach, our system shows greatly improved response and a 1-28%
reduction in total execution time on the benchmark programs. It incurs only 1-4%
runtime overhead compared with the event-driven programs. Keywords: XML, data binding, demand-driven, multi-threading | |||
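The demand-driven flavour can be shown in miniature (the paper's system is Java, with data binding and multithreading; this only illustrates processing starting before the whole document has arrived):

```python
import xml.etree.ElementTree as ET

def chunks(data, n=16):          # stand-in for network packets
    for i in range(0, len(data), n):
        yield data[i:i + n]

parser = ET.XMLPullParser(events=("end",))
for chunk in chunks("<feed><item>a</item><item>b</item><item>c</item></feed>"):
    parser.feed(chunk)
    for _, elem in parser.read_events():
        if elem.tag == "item":
            print("handled", elem.text)   # responds before end of input
```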
| A case study on alternate representations of data structures in XML | | BIBAK | Full-Text | 217-219 | |
| Daniel Gruhl; Daniel Meredith; Jan Pieper | |||
| XML provides a universal and portable format for document and data exchange.
While the syntax and specification of XML makes documents both human readable
and machine parsable, it is often at the expense of efficiency when
representing simple data structures.
We investigate the "costs" associated with XML serialization from several resource perspectives: storage, transport, processing and human readability. These experiments are done within the context of a large text-centric service oriented architecture -- IBM's WebFountain project. We find that for several applications, human readable formats outperform binary equivalents, especially in the area of data size, and that the costs of processing encoded binary data often exceed those of processing terse human readable formats. Keywords: WebFountain, XML, compression, data structures, serialization | |||
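A quick experiment in the spirit of the paper's comparison (illustrative only; actual results depend heavily on the data and the binary codec, and the paper measured hand-tuned formats inside WebFountain):

```python
import pickle
import time
import xml.etree.ElementTree as ET

records = [{"id": i, "title": f"doc {i}", "score": i * 0.5}
           for i in range(10000)]

# Serialize the same records as XML text and as a binary pickle.
xml = "<docs>%s</docs>" % "".join(
    f'<d id="{r["id"]}" title="{r["title"]}" score="{r["score"]}"/>'
    for r in records)
binary = pickle.dumps(records)
print("xml bytes:", len(xml.encode()), " binary bytes:", len(binary))

t0 = time.perf_counter(); ET.fromstring(xml)
t1 = time.perf_counter(); pickle.loads(binary)
t2 = time.perf_counter()
print(f"parse xml: {t1 - t0:.3f}s  unpickle: {t2 - t1:.3f}s")
```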
| Eclipse modeling framework for document management | | BIBAK | Full-Text | 220-222 | |
| Neil Boyette; Vikas Krishna; Savitha Srinivasan | |||
| The lifecycle of document management applications typically comprises a set
of loosely coupled subsystems that provide capture, index, search, workflow,
fulfillment and archival features. However, there exists no standard model for
composing these elements together to instantiate a complete application.
Therefore, every application invariably incorporates custom application code to
provide the linkages between each of these loosely coupled subsystems. This
paper proposes a model-based approach to instantiating document management
applications. An Eclipse Modeling Framework (EMF) based model is used to
formalize the variable elements in the document management applications. The
modeling tool supports the instantiation of an EMF model for every new
application and supports the generation of runtime artifacts -- this includes
code, XML configurations, scripts and business logic. This approach to creating
new instances of document management applications with a formal EMF model has
been validated with a real-world document management application. Keywords: DMS, EMF, IT, ROI, business transformation, documents, eclipse, framework,
logistics, modeling, process | |||
| GroundTruth tools & technology: applications in real world | | BIBAK | Full-Text | 223-224 | |
| Vinay Saxena; Sherif Yacoub | |||
The process of creating digital archives from paper-based documents is
gaining popularity. Automated document analysis systems and frameworks have
been developed, but they still fall short of the required accuracy goals for
text and article identification. Rendering problems, such as missing
graphical components, wrong reading order in multi-column journals and
magazines, missing indentation, broken text lines, and hyphenation issues,
are mostly due to poor layout information extracted from the scanned document
during the OCR process. Also lacking are tools that take the output of these
processes and create highly accurate content, with associated metadata, from
the original. The term "Ground Truth" in the current context refers to the
process (automatic and manual collectively) by which we ensure that the end
result is highly accurate and complete rich-text content (articles, papers,
etc.) generated from the original scanned version of the content.
We present to the audience PerfectDoc -- a suite of tools for manual GroundTruthing. The suite consists of tools to create highly accurate GroundTruth, GT editors, and tools to take this data and deliver output suitable for web-based viewing. Keywords: document analysis and understanding, document digitization, document
engineering, preservation of historical content | |||
| A web-based document harmonization and annotation chain: from PDF to RDF | | BIBAK | Full-Text | 225-226 | |
| Thierry Jacquin; Olivier Fambon; Boris Chidlovskii | |||
| We propose a demonstration of a Web-based document harmonization and
annotation chain developed within the VIKEF integrated project. The chain
integrates a combination of Web Services in order to access, harmonize and
semantically annotate remote document collections. Annotations are then mapped
onto RDF descriptions that serve as a basis for building semantic-enabled
services to support community processes. Keywords: PDF, RDF, document annotation, web services | |||
| A demonstration of the document description framework | | BIBAK | Full-Text | 227-228 | |
| John Lumley; Roger Gimson; Owen Rees | |||
| The Document Description Framework (DDF) [1] is a representation for
variable-data documents, designed to support very high flexibility in the type
and extent of variation, considerably beyond 'copy-hole' or flow-based
mechanisms of existing formats and tools. This demonstration shows how i) DDF
documents can be evaluated and merged to construct complex multi-stage
documents, ii) the layout capabilities can be extended flexibly, and iii)
they may be created and edited within a GUI-based environment. Keywords: SVG, XML, XSLT, document construction, functional programming | |||
| Bringing the semantic web to the office desktop | | BIBAK | Full-Text | 229-230 | |
| Timothy Miles-Board; Arouna Woukeu; Leslie Carr; Gary Wills; Wendy Hall | |||
| Many Semantic Web applications address the needs of human readers of the Web
(e.g. searching, annotating), but these technologies can also address the needs
of human writers of the Web. The WiCK project has explored the application of
knowledge bases and services to the Office desktop, in order to assist document
production, culminating in the WiCKOffice environment. The aim of this
demonstration is to showcase the most recent offshoot of the WiCKOffice
development, WiCKLite: a lightweight component for connecting knowledge
services to document templates in order to deliver targeted assistance to end
users. Keywords: knowledge writing, semantic web, smart tags | |||
| XIS: an XML document integration system | | BIBAK | Full-Text | 231-232 | |
| Guangming Xing; Chaitanya R. Malla; Andrew Ernest | |||
| We describe XIS, an XML document integration system. The system is based on
an algorithm that computes the top-down edit distance between an XML document
and a schema. The complexity of the algorithm is t x s x log s, where t is the
size of the document and s is the size of the schema.
The system includes a GUI that allows the user to visualize the operations performed on the XML document. Synthesized and real data-sets will be used to show the efficiency and efficacy of the system. Keywords: XML, document integration, tree grammar | |||
| The COG scrapbook | | BIBAK | Full-Text | 233-234 | |
| Steven R. Bagley; David F. Brailsford | |||
The COG Scrapbook technology is an attempt by the authors to convert
their COG technology into a usable suite of software for typical end-users,
rather than Document Engineering specialists.
This demonstration illustrates the four major components of this software suite: the COG Manipulator, COG Encapsulator, COG Extractor, and COG Creator. These four components provide the user with the tools required to manipulate COG PDF documents within the Adobe Acrobat environment. Keywords: COGs, FormXObject, PDF, graphic objects | |||
| The MMiSS repository | | BIBAK | Full-Text | 235-236 | |
| Achim Mahnke | |||
The MMiSS repository is a system for storing and versioning structured
documents in a multi-author environment. In contrast to general purpose
versioning systems like CVS, versioning and merging is based on the logical
structure of documents. New functionalities are introduced, such as support
for developing ontologies alongside documents and providing variants as a
means of adaptation. These aim at higher-level functions on documents, such
as change management and consistency checking. Keywords: consistency, document management, ontology, repository, variants, version
control | |||
| Document editing and browsing in AKTiveDoc | | BIBAK | Full-Text | 237-238 | |
| Vitaveska Lanfranchi; Fabio Ciravegna; Phil Moore; Daniela Petrelli | |||
| In this demo paper, we present AKTiveDoc, a tool for supporting sharing and
reuse of knowledge in document production (e.g. writing) and use (e.g.
reading). Keywords: free-text annotation, interfaces, knowledge suggestion, ontology-driven
annotation, semantic web, semi-automatic annotation | |||
| BigBatch: a toolbox for monochromatic documents | | BIBAK | Full-Text | 239-240 | |
| Rafael Dueire Lins; Bruno Tenório Ávila | |||
| BigBatch is a tool designed to automatically process thousands of
monochromatic images of documents generated by production line scanners. It
removes noisy borders, checks and corrects orientation, calculates and
compensates for the skew angle, crops the image to standardize document sizes, and
finally compresses it according to a user-defined file format. BigBatch
incorporates the best recently developed algorithms for this kind of
document image. BigBatch may work either in standalone or operator-assisted
mode. In standalone mode it can also process images on clusters
of workstations. Keywords: border removal, document processing, image processing, monochromatic images,
orientation, skew detection | |||