
Proceedings of the 2005 ACM Symposium on Document Engineering

Fullname: DocEng '05, Proceedings of the 5th ACM Symposium on Document Engineering
Editors: Anthony Wiley; Peter R. King
Location: Bristol, United Kingdom
Dates: 2005-Nov-02 to 2005-Nov-04
Standard No: ISBN 1-59593-240-2; ACM DL: Table of Contents; hcibib: DocEng05
  1. Keynote
  2. Document structure and content analysis 1
  3. Posters
  4. Making use of document standards and models
  5. Document presentation
  6. Keynote
  7. Adaptive and variant documents
  8. Document structure and content analysis 2
  9. Panel Session
  10. Document authoring, markup and manipulation 1
  11. Document searching, document annotation, and document metadata
  12. Document authoring, markup and manipulation 2
  13. Techniques for document management and document engineering
  14. Demonstrations

Keynote


The future of documents BIBAKFull-Text 1
  Tom Malloy
The quantity of the world's information held in the form of documents is massive and growing. So, the need for a container of information that can be transported between service participants (people or computers), and that is decoupled from the operation of those services, is as important today as ever.
   The basic properties of documents have been understood for over 5,000 years. While the first electronic documents translated many of these properties to the new medium, they did this in highly application-specific ways, with document formats specific to a particular content type. They also lost some important properties of paper (e.g. signing, annotations). PDF gave us the first universal document format. It supports various content types and extends the common properties supported, but is limited to static, page-oriented content.
   In the years since PDF was first developed there have been many technological and social changes. We now have the opportunity to take the universal document to the next level. We are calling this new item an Intelligent Document.
   Intelligent Documents provide a rich, interactive experience to their users. While some of their content is static and embedded, increasingly the document is an interface to live information sources and web services providing personalized and contextually-aware content, caching the content as needed to enable on-line and off-line access. The documents can adapt their rendering and interaction to a broad range of devices and capabilities enabling users to truly control how they want to receive information. The applications that create the documents' data or content may come and go, but Intelligent Documents contain substantial amounts of metadata (increasingly semantic in nature) enabling their reuse and repurposing well beyond their initially conceived lifetime.
   This leads us to a document structure that is modularized, providing componentization of content, presentation, logic, and universal attributes for all document objects regardless of type. Intelligent Documents may more closely resemble applications in their behavior than traditional documents. The blurring of the distinction between documents and applications is a trend that will only increase over time.
Keywords: PDF, document structure

Document structure and content analysis 1

Structuring documents according to their table of contents BIBAKFull-Text 2-9
  Hervé Déjean; Jean-Luc Meunier
In this paper, we present a method for structuring a document according to the information present in its Table of Contents. The detection of the ToC, as well as the determination of the parts it refers to in the document body, relies on a series of generic properties characterizing any ToC, while its hierarchization is achieved using clustering techniques. We also report on the robustness and performance of the method before discussing it in light of related work.
Keywords: document structuring, table of contents recognition
Towards XML version control of office documents BIBAKFull-Text 10-19
  Sebastian Rönnau; Jan Scheffczyk; Uwe M. Borghoff
Office applications such as OpenOffice and Microsoft Office are widely used to edit the majority of today's business documents: office documents. Usually, version control systems consider office documents as binary objects, thus severely hindering collaborative work. Since XML has become a de-facto standard for office applications, we focus on versioning office documents by structured XML version control approaches. This enables state-of-the-art version control for office documents.
   A basic prerequisite to XML version control is a diff algorithm, which detects structural changes between XML documents. In this paper, we evaluate state-of-the-art XML diff algorithms w.r.t. their suitability to OpenOffice XML documents and the future OASIS office document standard. It turns out that, due to the specific XML office format, a careful examination of the diff algorithm characteristics is necessary. Therefore, we identify important features for XML diff approaches to handle office documents. We have implemented a first OpenOffice versioning API that can be used in version control systems as a replacement for line-based or binary diffs, which are currently used.
Keywords: XML diffing, office applications, version control
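As an illustration of the structural diffs the paper evaluates, a minimal XML tree comparison can be sketched in a few lines. This naive top-down walk is only a sketch of the general idea, not any of the surveyed algorithms: it reports changed text and attribute values with simple paths, and does not detect moves or insertions the way a real XML diff must.

```python
import xml.etree.ElementTree as ET

def xml_diff(a, b, path=""):
    # Naive top-down structural diff: compare tag, attributes and text,
    # then recurse into children paired by position. Paths carry the
    # parent's child index, e.g. "/doc[1]/p" is the second child of doc.
    here = f"{path}/{a.tag}"
    edits = []
    if a.tag != b.tag:
        edits.append(("rename", here, b.tag))
        return edits
    if a.attrib != b.attrib:
        edits.append(("attrs", here, b.attrib))
    if (a.text or "").strip() != (b.text or "").strip():
        edits.append(("text", here, (b.text or "").strip()))
    for i, (ca, cb) in enumerate(zip(list(a), list(b))):
        edits += xml_diff(ca, cb, f"{here}[{i}]")
    if len(a) != len(b):
        edits.append(("children", here, len(b) - len(a)))
    return edits

old = ET.fromstring("<doc><p style='bold'>Hello</p><p>world</p></doc>")
new = ET.fromstring("<doc><p style='bold'>Hello</p><p>there</p></doc>")
delta = xml_diff(old, new)
```

A real office-document diff must additionally cope with the deeply nested, namespace-heavy structure of OpenOffice XML, which is exactly the evaluation the paper reports.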
Influence of fusion strategies on feature-based identification of low-resolution documents BIBAKFull-Text 20-22
  Ardhendu Behera; Denis Lalanne; Rolf Ingold
The paper describes a method by which documents captured with low-resolution handheld devices can be used to retrieve their originals from a document store. The method considers two complementary feature sets conjunctively. First, the geometrical distribution of color in the document's 2D image plane is considered. Second, shallow layout features are used, owing to the poor resolution of the captured documents. We propose in this article to fuse these two complementary feature sets in order to improve document identification performance. Finally, in order to test the influence of merging strategies on document identification performance, a synergic method is proposed and evaluated relative to a similar method in which the feature sets are simply considered sequentially.
Keywords: document retrieval, document signature, geometrical color distribution, shallow layout features
Enabling massive scale document transformation for the semantic web: the universal parsing agent BIBAKFull-Text 23-25
  Mark A. Whiting; Wendy Cowley; Nick Cramer; Alex Gibson; Ryan Hohimer; Ryan Scott; Stephen Tratz
The Universal Parsing Agent (UPA) is a document analysis and transformation program that supports massive-scale conversion of information into forms suitable for the semantic web. UPA provides reusable tools to analyze text documents; identify and extract important information elements; enhance text with semantically descriptive tags; and output the needed information in the required format and structure.
Keywords: XML, XSLT, document transformation, natural language processing, parsing, regular expressions
Exploiting XML technologies for intelligent document routing BIBAKFull-Text 26-28
  Isaac Cheng; Savitha Srinivasan; Neil Boyette
Today, XML is increasingly becoming a standard for the representation of semi-structured information such as documents that combine content and metadata. Typical document management applications include document representation, authoring, validation, and document routing in support of a business process. We propose a framework for intelligent document routing that exploits and extends XML technologies to automate dynamic document routing and real-time updates of business routing logic. The document-routing logic is stored in a secure repository and executed by a business rules engine. During rule execution, the input parameters of each business rule are bound to the data from each inbound XML document. This document routing framework is validated in a real-world implementation, with reduced development cost, an accelerated rule-update cycle and simplified administration.
Keywords: B2B, DMS, IT, J2EE, MIS, MQ, ROI, SO, SOA, business transformation, downsize, economy, global, logistics, organizational change, outsource, policy, training, turnover, worldwide

Posters


A system of collecting domain-specific jargons BIBKFull-Text 29
  Hiromi Oda
Keywords: domain-specific expressions, community jargons
Emphasis for highly customized documents BIBKFull-Text 30
  Helen Balinsky; Maurizio Pilu
Keywords: aesthetic measure, emphasis, high customization, publishing
The COG Scrapbook BIBAKFull-Text 31
  Steven R. Bagley; David F. Brailsford
Creating truly dynamic documents from disparate components is not an easy process, especially for untrained end users. Existing packages such as Adobe InDesign or Quark XPress are designed for graphic-arts professionals and even these are not optimised for the process of integrating disparate components. The 'COG Scrapbook' builds on our PDF-based Component Object Graphics (COGs) technology to create a radically new 'drag and drop' software approach to creating dynamic documents for use in educational environments.
   The initial focus of the COG Scrapbook is to enable groups of people to collaboratively create material ranging from smaller-sized (e.g. A4) personal scrapbooks through to large (e.g. A0) posters. Some example scenarios where the technology could be utilized are:
  • university students producing posters for class assignments;
  • school teachers and students jointly creating classroom wall displays;
  • creation of a scrapbook of annotated digital photographs resulting from school field trips and museum visits.
   The poster illustrating the present state of this work will itself have been constructed using COG Scrapbook software.
Keywords: COGs, FormXObject, PDF, graphic objects

Making use of document standards and models

A framework for structure, layout & function in documents BIBAKFull-Text 32-41
  John Lumley; Roger Gimson; Owen Rees
The Document Description Framework (DDF) is a representation for variable-data documents. It supports very high flexibility in the type and extent of variation supported, considerably beyond the 'copy-hole' or flow-based mechanisms of existing formats and tools. DDF is based on holding application data, logical data structure and presentation as well as constructional 'programs' together within a single document. DDF documents can be merged with other documents, bound to variable values incrementally, combine several types of layout and styling in the same document and support final delivery to different devices and page-ready formats. The framework uses XML syntax and fragments of XSLT to describe 'programmatic construction' of a bound document. DDF is extensible, especially in the ability to add new types of layout and interoperability between components in different formats. In this paper we describe the motivation for DDF, the major design choices and how we evaluate a DDF document with specific data values. We show through implemented examples how it can be used to construct presentations of high complexity and variability, and how the framework complements and can use many existing XML-based document formats, such as SVG and XSL-FO.
Keywords: SVG, XML, XSLT, document construction, functional programming
An environment for maintaining computation dependency in XML documents BIBAKFull-Text 42-51
  Dongxi Liu; Zhenjiang Hu; Masato Takeichi
In the domain of XML authoring, there are many tools to help users edit XML documents. These tools make it easier to produce complex documents by using technologies such as syntax-directed or presentation-oriented editing. However, when an XML document contains data with some computation dependency among them, these tools cannot free users from the burden of maintaining this dependency relationship. By computation dependency, we mean that some data are computed from other data in the same document.
   In this paper, we present an environment for authoring XML documents, in which users can express the data dependency relationship in a document explicitly rather than implicitly in their minds. Under this environment, the dependent parts of the document are represented as expressions, which in turn can be evaluated to generate the dependent data. Therefore, users need not compute the dependent data first and then input them manually, as required by current authoring tools.
Keywords: XML, computation dependency, functional programming, lazy evaluation, programmable structured document
Compiling XPath for streaming access policy BIBAKFull-Text 52-54
  Pierre Genevès; Kristoffer Rose
We show how the full XPath language can be compiled into a minimal subset suited for stream-based evaluation. Specifically, we show how XPath normalization into a core language as proposed in the current W3C "Last Call" draft of the XPath/XQuery Formal Semantics can be extended such that both the context state and reverse axes can be eliminated from the core XPath (and potentially XQuery) language. This allows execution of (almost) full XPath on any of the emerging streaming subsets.
Keywords: XPath, compilation, static rewriting, streaming
Static optimization of XSLT stylesheets: template instantiation optimization and lazy XML parsing BIBAKFull-Text 55-57
  Manaka Kenji; Sato Hiroyuki
The increasing popularity of XSLT brings with it the requirement of more efficient performance. In this paper, we propose two optimization techniques based on template caller-callee analysis. One is template instantiation optimization, which analyzes a stylesheet and identifies the templates to be instantiated before transformation. The other is static lazy XML parsing optimization, which constructs a pruned XML tree by statically identifying the nodes that are actually referenced. Furthermore, we have implemented both of our optimizations on Saxon and have evaluated their performance. In these experiments, we have shown both of them to be practically useful and to improve XSLT performance.
Keywords: XSLT, lazy evaluation, optimization, saxon
Generating form-based user interfaces for XML vocabularies BIBAKFull-Text 58-60
  Y. S. Kuo; N. C. Shih; Lendle Tseng; Hsun-Cheng Hu
So far, many user interfaces for XML data (documents) have been constructed from scratch for specific XML vocabularies and applications. Tool support for user interfaces for XML data is inadequate. Forms-XML is an interactive component invoked by applications for generating user interfaces for prescribed XML vocabularies automatically. Based on a given XML schema, the component generates a hierarchy of HTML forms for users to interact with and update XML data compliant with the given schema. The user interface Forms-XML generates is very simple, with an abundance of guidance and hints for the user, and can be customized by user interface designers as well as developers.
Keywords: XML, XML editing, user interface
Encapsulating and manipulating component object graphics (COGs) using SVG BIBAKFull-Text 61-63
  Alexander J. Macdonald; David F. Brailsford; Steven R. Bagley
Scalable Vector Graphics (SVG) has an imaging model similar to that of PostScript and PDF, but the XML basis of SVG allows it to participate fully, via namespaces, in generalised XML documents.
   There is increasing interest in using SVG as a Page Description Language and we examine ways in which SVG document components can be encapsulated in contexts where SVG will be used as a rendering technology for conventional page printing.
   Our aim is to encapsulate portions of SVG content (SVG COGs) so that the COGs are mutually independent and can be moved around a page, while maintaining invariant graphic properties and with guaranteed freedom from side effects and mutual interference. Parallels are drawn between COG implementation within SVG's tree-based inheritance mechanisms and an earlier COG implementation using PDF.
Keywords: PDF, SVG, XML, component object graphics, parameterization

Document presentation

Support for arbitrary regions in XSL-FO BIBAKFull-Text 64-73
  Ana Cristina B. da Silva; Joao B. S. de Oliveira; Fernando T. M. Mano; Thiago B. Silva; Leonardo L. Meirelles; Felipe R. Meneguzzi; Fabio Giannetti
This paper proposes an extension of the XSL-FO standard which allows the specification of an unlimited number of arbitrarily shaped page regions. These extensions are built on top of XSL-FO 1.1 to enable flow content to be laid out into arbitrary shapes, allowing for page layouts currently available only in desktop publishing software. Such a proposal is expected to leverage XSL-FO towards usage as an enabling technology in the generation of content intended for personalized printing.
Keywords: LaTeX, SVG, XML, XSL-FO, arbitrary shapes, digital publishing, typesetting
Toward tighter tables BIBAKFull-Text 74-83
  Nathan Hurst; Kim Marriott; Peter Moulder
Tables are provided in virtually all document formatting systems and are one of the most powerful and useful design elements in current web document standards. Unfortunately, optimal layout of tables which contain text is NP-hard for reasonable layout requirements such as minimizing table height for a given width [1]. We present two new independently-applicable techniques for table layout. The first technique is to solve a continuous approximation to the original layout problem by using a constant-area approximation of the cell content combined with a minimum width and height for the cell. The second technique starts by setting each column to its narrowest possible width and then iteratively reduces the height of the table by judiciously widening its columns. This second technique uses the actual text and line-break rules rather than the constant-area approximation used by the first technique. We also investigate two hybrid approaches, both of which use iterative column widening to improve the quality of an initial solution found using a different technique. In the first hybrid approach we use the continuous approximation technique to compute the initial column widths, while in the second hybrid approach a modification of the HTML table layout algorithm is used to compute the initial widths. We found that all four techniques are reasonably fast and give significantly more compact layout than that of HTML layout engines.
Keywords: conic programming, optimisation techniques, table layout
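The second technique above — start every column at its narrowest width, then repeatedly widen whichever column most reduces the table's height — can be sketched as a simple greedy loop. This is an illustrative reconstruction, not the authors' implementation: the word-wrapping cost model, the fixed widening step, and the total-width budget are all assumptions.

```python
import textwrap

def table_height(cols, rows):
    # Height of a table = sum over rows of the tallest wrapped cell in the row.
    return sum(
        max(len(textwrap.wrap(cell, width=w)) or 1 for cell, w in zip(row, cols))
        for row in rows
    )

def widen_iteratively(rows, min_width=8, step=4, max_total=60):
    # Start each column at its narrowest width, then greedily widen the
    # column whose widening shrinks the table height the most.
    ncols = len(rows[0])
    cols = [min_width] * ncols
    height = table_height(cols, rows)
    while sum(cols) + step <= max_total:
        best = None
        for i in range(ncols):
            trial = cols[:]
            trial[i] += step
            h = table_height(trial, rows)
            if best is None or h < best[0]:
                best = (h, i)
        if best[0] >= height:   # no widening reduces the height any further
            break
        height = best[0]
        cols[best[1]] += step
    return cols, height

rows = [
    ["a fairly long first cell of text", "short"],
    ["short", "another fairly long cell of text here"],
]
cols, h = widen_iteratively(rows)
```

The paper's version uses real line-break rules and also explores hybrid starts (continuous approximation, HTML-style widths); the greedy inner loop is the shared skeleton.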
Generative semantic clustering in spatial hypertext BIBAKFull-Text 84-93
  Andruid Kerne; Eunyee Koh; Vikram Sundaram; J. Michael Mistrot
This paper presents an iterative method for generative semantic clustering of related information elements in spatial hypertext documents. The goal is to automatically organize them in ways that are meaningful to the user. We consider a process in which elements are gradually added to a spatial hypertext. The method for generating meaningful layout is based on a quantitative model that measures and represents the mutual relatedness between each new element and those already in the document. The measurement is based on attributes such as metadata, term vectors, user interest expressions, and document locations. We call this model relatedness potential, because it represents how much the new element is related and thus attracted to existing elements as a vector field across the space. Using this field as a gradient potential, the new element will be placed near the most attracted elements, forming clusters of related elements. The relative magnitude of contribution of attributes to relatedness potential can be controlled through an interactive interface.
   Unlike prior clustering methods such as k-means and self-organizing maps, relatedness potential works well in iterative systems, in which the collection of elements is not defined a priori. Further, users can invoke relatedness potential to re-cluster elements, as they engage in on-the-fly provisional acts of direct manipulation reorganization and latching of a few most significant elements. A preliminary study indicates that users find this method generates spatial hypertext documents that are easier to read.
Keywords: clustering, collections, document layout, generative hypermedia, information triage, mixed-initiatives, spatial hypertext

Keynote


Engineering information in documents: leaving room for uncertainty BIBAFull-Text 94
  Dick C. A. Bulterman
Much of the work in engineering complex documents has been on building a path from a document model (or document abstraction) to one or more instances of that document during a publication phase. While many approaches exist to help create multiple instances from a single document, all of these instances share the property that the content (once created) is fixed.
   In this talk, we consider more flexible bindings of information packaging and publication. The main goal of this work is to allow a large degree of uncertainty in document content -- uncertainty that is motivated by incremental authoring processes (where the contents of a document are enriched over its lifetime), uncertainty that is motivated by continuous adaptation of content based on changes in the nature and architecture of the underlying rendering environment, and uncertainty that is motivated by temporal navigation through various sets of interwoven, conditionally-active content layers.
   Several use cases will be discussed (including entertainment, accessible documents and medical applications) and a flexible, non-monolithic document reader/annotator architecture will be shown based on the Ambulant renderer.

Adaptive and variant documents

Constrained XSL formatting objects for adaptive documents BIBAKFull-Text 95-97
  Gil Loureiro; Francisco Azevedo
The pagination strategy of XSL Formatting Objects (XSL:FO) is based on a "break if no fit" approach that often produces a last page with only one printable object, due to a lack of space on the previous page. In a batch, high-volume, personalized document production scenario, this can represent a high cost in extra sheets of paper with a lot of free space, and a document with a poor look. In this paper, we describe a new approach to the pagination problem of XSL:FO documents in which space-use efficiency and aesthetic aspects are considered. The approach is based on constraint satisfaction using Mixed Integer Linear Programming (MILP) models. The starting point was the FO part of the XSL specification, to which we added a Constrained XSL:FO extension (referred to as CXSL:FO) that delivers tags used to declare constraints on size and font adjustments of target FO objects. This extension is added to our reengineered FOP formatter, which builds and solves an MILP model to find the global optimal solution corresponding to a document with the minimum number of pages, each one being maximally filled. We show its effectiveness in the generation of personalized welcome letters.
Keywords: MILP, XSL:FO, adaptive documents, constraints, pagination
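The "break if no fit" failure mode and the constrained alternative can be illustrated with a toy model. The block heights, page height, and the uniform shrink factors below are invented for illustration; the paper's actual MILP formulation constrains the size and font adjustment of individual FO objects rather than scaling everything uniformly.

```python
def paginate(heights, page_height, shrink=1.0):
    # Greedy "break if no fit": start a new page whenever the next
    # (possibly shrunken) block does not fit on the current page.
    pages, used = 1, 0.0
    for h in heights:
        h *= shrink
        if used + h > page_height:
            pages, used = pages + 1, 0.0
        used += h
    return pages

heights = [40, 35, 30, 45, 38]   # block heights (arbitrary units)
PAGE = 100

plain = paginate(heights, PAGE)  # plain break-if-no-fit pagination
# Search a few allowed shrink factors (a crude stand-in for the MILP's
# size/font-adjustment variables) for the smallest page count.
best = min(paginate(heights, PAGE, s) for s in (1.0, 0.97, 0.94, 0.91))
```

With these numbers the plain strategy ends with a nearly empty last page, while a small permitted shrink removes it — the orphan-page waste the CXSL:FO extension is designed to eliminate.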
Content interaction and formatting for mobile devices BIBAKFull-Text 98-100
  Tayeb Lemlouma; Nabil Layaïda
In this paper we present an experimental content adaptation system for mobile devices. The system enables the presentation of multimedia content and considers the problem of the small-screen displays of mobile terminals. The approach combines structural and media adaptation with content formatting, and proposes a system that handles user interaction and content navigation.
Keywords: content adaptation, content formatting, evaluation, mobile-devices, user interaction

Document structure and content analysis 2

Schema matching for transforming structured documents BIBAKFull-Text 101-110
  Aida Boukottaya; Christine Vanoirbeek
Structured document content reuse is the problem of restructuring and translating data structured under a source schema into an instance of a target schema. A notion closely tied to structured document reuse is that of structure transformations. Schema matching is a critical step in structured document transformations. Manual matching is expensive and error-prone. It is therefore important to develop techniques to automate the matching process, and thus the transformation process. In this paper, we contribute both to understanding the matching problem in the context of structured document transformations and to developing matching methods whose output serves as the basis for the automatic generation of transformation scripts.
Keywords: document structure transformations, schema matching
Textual indexation of ancient documents BIBAFull-Text 111-117
  Yann Leydier; Frank LeBourgeois; Hubert Emptoz
In past years, many levels of indexation have been developed to allow fast retrieval of digitized documents. Among all the ways of indexing a document, textual indexation allows the finest queries on the documents' content. Usually, the plain-text transcription of a digitized document is obtained by applying OCR (Optical Character Recognition) software to it. What if the OCR fails? Indeed, OCR systems are inefficient on low-quality printed documents and are unsuited to the processing of ancient fonts. Furthermore, OCR is not applicable to manuscript text recognition. In this paper we introduce two alternative methods of accessing text through the image: Computer Assisted Transcription and Word Spotting.
A fast orientation and skew detection algorithm for monochromatic document images BIBAKFull-Text 118-126
  Bruno Tenório Ávila; Rafael Dueire Lins
Very often in the digitization process, documents are either not placed with the correct orientation or are rotated by small angles relative to the original image axis. These factors make the visualization of images by human users more difficult, increase the complexity of any sort of automatic image recognition, degrade the performance of OCR tools, increase the space needed for image storage, etc. This paper presents a fast algorithm for orientation and skew detection in complex monochromatic document images, which is capable of detecting any document rotation with high precision.
Keywords: monochromatic document image, orientation and skew detection
A statistical method for binary classification of images BIBAKFull-Text 127-129
  Steven J. Simske; Dalong Li; Jason S. Aronoff
The classification of documents with sparse text, and video analysis, relies on accurate image classification. We herein present a method for binary classification that accommodates any number of individual classifiers. Each individual classifier is defined by the critical point between its two means, and its relative weighting is inversely proportional to its expected error rate. Using 10 simple image analysis metrics, we distinguish a set of "natural" and "city" scenes, providing a "semantically meaningful" classification. The optimal combination of 5 of these 10 classifiers provides 85.8% accuracy on a small (120 image) feasibility corpus. When this feasibility corpus is then split into half training and half testing images, the mean accuracy of the optimum set of classifiers was 81.7%. Accuracy as high as 90% was obtained for the test set when the training percentage was increased. These results demonstrate that an accurate classifier can be constructed from a large pool of simple classifiers through the use of the statistical ("Normal") classification method described herein.
Keywords: binary classification, classifier, combined classifiers, image classification, normal
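The combination rule described above — each single-feature classifier thresholds at the critical point midway between its two class means, and votes with a weight inversely proportional to its error rate — can be sketched as follows. The feature values and error rates are made up for illustration; the paper's 10 image-analysis metrics are not reproduced here.

```python
def make_classifier(mean_neg, mean_pos):
    # Threshold at the critical point between the two class means;
    # `sign` handles metrics where the positive class has the lower mean.
    threshold = (mean_neg + mean_pos) / 2.0
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda x: 1 if sign * (x - threshold) > 0 else 0

def combine(classifiers, error_rates, features):
    # Weighted vote: each classifier's weight is inversely proportional
    # to its expected error rate; output 1 if the weighted majority says 1.
    weights = [1.0 / e for e in error_rates]
    score = sum(w * c(x) for w, c, x in zip(weights, classifiers, features))
    return 1 if score > sum(weights) / 2.0 else 0

# Three hypothetical single-feature classifiers ("natural" = 1, "city" = 0).
clfs = [make_classifier(0.2, 0.8), make_classifier(0.9, 0.1), make_classifier(0.3, 0.7)]
errs = [0.10, 0.25, 0.40]
label = combine(clfs, errs, [0.75, 0.2, 0.4])
```

Here the two low-error classifiers outvote the noisy third one, which is the point of the inverse-error weighting.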
A new rotation algorithm for monochromatic images BIBAKFull-Text 130-132
  Bruno Tenório Ávila; Rafael Dueire Lins; Lamberto Oliveira
The classical rotation algorithm applied to monochromatic images introduces white holes in black areas, making edges uneven and disconnecting neighboring elements. Several algorithms in the literature address only the white-hole problem. This paper proposes a new algorithm that solves all three problems, producing better-quality images.
Keywords: monochromatic image rotation, skew correction

Panel Session

Interaction between paper and electronic documents BIBAKFull-Text 133
  Michael Gormish
Documents today almost always exist in two forms: paper and electronic. Many documents, especially legacy documents, start as paper, but are then scanned and recognized. Other documents are started electronically but then printed for easy reading, annotation, or distribution. Some documents are scanned, operated on electronically, then printed. Often machine-readable information, e.g. barcodes or RFID tags, is added to paper documents to allow association with the electronic document, or with "meta-data" in some database. Sometimes the ability to go back and forth between paper and electronic forms, round-tripping, is important; other times the two forms are fundamentally different.
   While the end of paper in the office has long been predicted, the actual volume of printed materials continues to rise. Electronic documents have, in fact, greatly increased the use of paper. This panel addresses making paper more useful in an electronic document world, and making electronic databases deal with paper. There are obvious challenges, including scanning paper documents and printing electronic ones, but there are additional opportunities, including using paper to summarize and access multimedia documents, and using paper to control electronic actions.
Keywords: classification, document analysis, document databases, enterprise document, machine identifiers, model, paper manifestation, printing, scanning, segmentation

Document authoring, markup and manipulation 1

Injecting information into atomic units of text BIBAKFull-Text 134-142
  Yannis Haralambous; Gábor Bella
This paper presents a new approach to text processing, based on textemes. These are atomic text units generalising the concepts of character and glyph by merging them in a common data structure, together with an arbitrary number of user-defined properties. In the first part, we give a survey of the notions of character and glyph and their relation to Natural Language Processing models, some visual text representation issues, and strategies adopted by file formats (SVG, PDF, DVI) and software (Uniscribe, Pango). In the second part we show applications of textemes to various text processing issues: ligatures, variant glyphs and other OpenType-related properties, hyphenation, color and other presentation attributes, Arabic form and morphology, CJK spacing, metadata, etc. Finally we describe how the Omega typesetting system implements texteme processing as an example of a generalised approach to input character stream parsing, internal representation of text, and modular typographic transformations. In the data flow from input to output, whether in memory or through serializations in auxiliary data files, textemes progressively accumulate information that is used by Omega's paragraph builder engine and included in the output DVI file. We show how this additional information increases the efficiency of conversions to other file formats such as PDF or SVG. We conclude this paper by presenting interesting potential applications of texteme methods in document engineering.
Keywords: OpenType, PDF, SVG, Unicode, character, glyph, multilingual typesetting, omega, texteme
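The texteme idea — one record merging character and glyph, plus an open-ended set of user-defined properties — can be sketched as a small data type. The field names and the example property are invented for illustration; the Omega implementation defines its own property model.

```python
from dataclasses import dataclass, field

@dataclass
class Texteme:
    # An atomic text unit merging the character and glyph concepts,
    # carrying an arbitrary number of user-defined properties.
    char: str                        # underlying Unicode character(s)
    glyph: int                       # font glyph index used to render them
    props: dict = field(default_factory=dict)

# An "fi" ligature: two characters rendered by one glyph, with a
# property recording the ligature so later stages can undo or keep it.
fi = Texteme(char="fi", glyph=0xFB01, props={"ligature": True})
```

Because properties accumulate rather than replace one another, downstream stages (hyphenation, color, format conversion) can each attach their own keys without a schema change.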
Classifying XML tags through "reading contexts" BIBAFull-Text 143-145
  Xavier Tannier; Jean-Jacques Girardot; Mihaela Mathieu
Some tags used in XML documents create arbitrary breaks in the natural flow of the text. This may constitute an impediment to the application of some methods of document engineering. This article introduces the concept of "reading contexts", and gives clues for handling it theoretically and in practice. This work should notably make it possible to recognize emphasis tags in a text, to define a new concept of term proximity in structured documents, to improve indexing techniques, and also to open up the way to advanced linguistic analyses of XML corpora.

    Document searching, document annotation, and document metadata

    XML active transformation (eXAcT): transforming documents within interactive systems BIBAKFull-Text 146-148
      Olivier Beaudoux
    Stylesheets and batch transformations are the most widely used techniques to transform "abstract" documents into target presentation documents. Despite the recent introduction of incremental transformations, several important features required by interactive systems are yet to be addressed, such as multiple sources (e.g. preferences and resources), multiple targets (e.g. multiple views), source-to-target linking (e.g. interacting with the source via the target), and bidirectional linking (e.g. interacting directly with the target). This paper proposes the use of XML Active Transformations (eXAcT) in order to fulfil these requirements. The eXAcT specification is based on the definition of two new DOM node types, active fragment and anchor, and on a transformation process inspired from XSLT. Our jaXAT implementation toolkit allows the active transformation of any DOM document into (but not limited to) SVG presentations.
    Keywords: GUI, SVG, XML, active transformations, authoring tools
    Prefiltering techniques for efficient XML document processing BIBAKFull-Text 149-158
      Chia-Hsin Huang; Tyng-Ruey Chuang; Hahn-Ming Lee
    Document Object Model (DOM) and Simple API for XML (SAX) are the two major programming models for XML document processing. Each, however, has its own efficiency limitations. DOM assumes an in-core representation of XML documents, which can be problematic for large documents. SAX needs to scan over the document in a linear manner in order to locate the interesting fragments. Previously, we have used tree-to-table mapping and indexing techniques to help answer structural queries to large, or large collections of, XML documents. In this paper, we generalize the previous techniques into a prefiltering framework where repeated access to large XML documents can be efficiently carried out within the existing DOM and SAX models. The prefiltering framework essentially uses a tiny search engine to locate useful fragments in the target XML documents by approximately executing the user's queries. Those fragments are gathered into a candidate-set XML document, which is returned to the user's DOM- or SAX-based application for further processing. This results in a practical and efficient model of XML processing, especially when the XML documents are large and infrequently updated, but frequently queried.
    Keywords: DOM, SAX, prefiltering, structural query, two-phased XML processing model
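    The two-phase model the abstract describes can be illustrated with a small sketch (this is not the authors' implementation; the tag-based approximate filter and the sample document are invented for illustration): an approximate first pass gathers candidate fragments into a small candidate-set document, and the precise query then runs over that candidate set instead of the full tree.

```python
import xml.etree.ElementTree as ET

def prefilter(xml_text, tag_of_interest):
    """Phase 1 (approximate): gather every fragment whose tag matches
    the query's target into a small candidate-set document."""
    root = ET.fromstring(xml_text)
    candidates = ET.Element("candidates")
    for elem in root.iter(tag_of_interest):
        candidates.append(elem)
    return ET.tostring(candidates, encoding="unicode")

doc = """<library>
  <book lang="en"><title>Document Engineering</title></book>
  <book lang="fr"><title>Genie documentaire</title></book>
  <journal lang="en"><title>Not a book</title></journal>
</library>"""

# Phase 2 (precise): the exact query runs on the much smaller candidate set.
small = prefilter(doc, "book")
root = ET.fromstring(small)
hits = [b.find("title").text for b in root.iter("book") if b.get("lang") == "en"]
print(hits)  # ['Document Engineering']
```

    The pay-off comes when the original document is large and queried repeatedly: the expensive full scan is amortized across many precise queries over the small candidate set.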
    Event points: annotating XML documents for remote sharing BIBAKFull-Text 159-161
      Olivier Beaudoux
    Collaboration is heavily based on sharing documents. However, most groupware toolkits do not directly support document sharing, but rather focus on supporting mechanisms such as remote concurrent access to shared objects. We propose the notion of event point as a single and unified concept for defining sharing capabilities of XML documents and introduce four types of event points for real-time groupware: replication, copy, echo, and synchronization. These event points support such collaborative features as real-time sharing, synchronization, telepointing, localization, and echo. The paper presents the concept of event point, its implementation in the DoPIdom toolkit, and some sample uses in our Sovigo drawing tool.
    Keywords: CSCW toolkit, XML documents, real-time groupware
    Managing syntactic variation in text retrieval BIBAKFull-Text 162-164
      Jesús Vilares; Carlos Gómez-Rodríguez; Miguel A. Alonso
    Information Retrieval systems are limited by the linguistic variation of language. The use of Natural Language Processing techniques to manage this problem has been studied for a long time, but mainly focusing on English. In this paper we deal with European languages, taking Spanish as a case in point. Two different sources of syntactic information, queries and documents, are studied in order to increase the performance of Information Retrieval systems.
    Keywords: information retrieval, natural language processing, shallow parsing
    Integrating translation services within a structured editor BIBAKFull-Text 165-167
      Ali Choumane; Hervé Blanchon; Cécile Roisin
    Fully automatic machine translation cannot produce high-quality translation; Dialog-Based Machine Translation (DBMT) is the only way to provide authors with a means of translating documents into languages they have not mastered, or do not even know. In such an environment, the author must help the system to "understand" the document through an interactive disambiguation step. In this paper we study the consequences of integrating DBMT services within a structured document editor (Amaya). The source document (called the edited document) needs a companion document enriched with the different data produced during the interactive translation process (question trees, the author's answers, translations). The edited document also needs to be enriched (annotated) in order to enable access to the question trees. The enriched edited document and the companion document have to be kept synchronized when the edited document is further updated.
    Keywords: DBMT, XML document, editing of structured documents, interactive disambiguation, self-explaining document

    Document authoring, markup and manipulation 2

    Towards active web clients BIBAKFull-Text 168-176
      Vincent Quint; Irène Vatton
    Recent developments in document technologies have strongly shaped the evolution of Web clients over the last fifteen years, but not all Web clients have taken the same advantage of these advances. In particular, mainstream tools have put the emphasis on accessing existing documents to the detriment of a more cooperative usage of the Web. In the early days, however, Web users were able to go beyond browsing and to get more actively involved. This paper presents the main features needed to make Web clients more active and creative tools by taking advantage of the latest advances in document technology. These features are implemented in Amaya, a user agent that supports several languages from the XML family and seamlessly integrates such complementary functionalities as browsing, editing, publishing, and annotating.
    Keywords: XML documents, authoring, compound documents, style languages, web user agent
    Enhancing composite digital documents using XML-based standoff markup BIBAKFull-Text 177-186
      Peter L. Thomas; David F. Brailsford
    Document representations can rapidly become unwieldy if they try to encapsulate all possible document properties, ranging from abstract structure to detailed rendering and layout.
       We present a composite document approach wherein an XML-based document representation is linked via a 'shadow tree' of bi-directional pointers to a PDF representation of the same document. Using a two-window viewer any material selected in the PDF can be related back to the corresponding material in the XML, and vice versa. In this way the treatment of specialist material such as mathematics, music or chemistry (e.g. via 'read aloud' or 'play aloud') can be activated via standard tools working within the XML representation, rather than requiring that application-specific structures be embedded in the PDF itself.
       The problems of textual recognition and tree pattern matching between the two representations are discussed in detail.
       Comparisons are drawn between our use of a shadow tree of pointers to map between document representations and the use of a code-replacement shadow tree in technologies such as XBL.
    Keywords: MathML, MusicXML, PDF, XBL, XML, composite documents, standoff markup
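    The shadow tree of bi-directional pointers described above amounts, at its simplest, to a two-way map between positions in the two representations. The toy sketch below illustrates the idea only; the class name, the XPath-style keys, and the (page, object-index) encoding of PDF positions are all invented for illustration and are not the authors' actual format.

```python
class ShadowTree:
    """Toy bidirectional map between an XML representation and a PDF
    rendering of the same document."""

    def __init__(self):
        self._xml_to_pdf = {}
        self._pdf_to_xml = {}

    def link(self, xml_path, pdf_pos):
        # Each link is stored in both directions, so a selection in either
        # representation can be mapped back to the other.
        self._xml_to_pdf[xml_path] = pdf_pos
        self._pdf_to_xml[pdf_pos] = xml_path

    def to_pdf(self, xml_path):
        return self._xml_to_pdf[xml_path]

    def to_xml(self, pdf_pos):
        return self._pdf_to_xml[pdf_pos]

shadow = ShadowTree()
shadow.link("/article/sec[1]/math[1]", (3, 17))  # page 3, content object 17
print(shadow.to_xml((3, 17)))  # /article/sec[1]/math[1]
```

    Keeping the pointers outside both documents is what makes the markup "standoff": neither the XML nor the PDF needs to embed application-specific structures for the other.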
    Content publishing framework for interactive paper documents BIBAKFull-Text 187-196
      Moira C. Norrie; Alexios Palinginis; Beat Signer
    Paper persists as an important medium for documents and this has motivated the development of new technologies for interactive paper that enable actions on paper to be linked to digital actions. A major issue that remains is how to integrate these technologies into the document life cycle and, specifically, how to facilitate the authoring of links between printed documents and digital documents and services. We describe how we have extended a general web publishing framework to support the production of interactive paper documents, thereby integrating paper as a new web channel in a platform for multi-channel access.
    Keywords: interactive paper, publishing framework

    Techniques for document management and document engineering

    Document digitization lifecycle for complex magazine collection BIBAKFull-Text 197-206
      Sherif Yacoub; John Burns; Paolo Faraboschi; Daniel Ortega; Jose Abad Peiro; Vinay Saxena
    The conversion of large collections of documents from paper to digital formats suitable for electronic archival is a complex multi-phase process. The creation of good-quality images from paper documents is just one phase. To extract the relevant information the documents contain, with an accuracy that fits the purpose of the target applications, an automated document analysis system and a manual verification/review process are needed. The automated system needs to perform a variety of analysis and recognition tasks in order to reach an accuracy level that minimizes the manual correction effort downstream.
       This paper describes the complete process and the associated technologies, tools, and systems needed for the conversion of a large collection of complex documents and deployment for online web access to its information rich content. We used this process to recapture 80 years of Time magazines. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted in a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.
    Keywords: document analysis and understanding, document digitization, document engineering, preservation of historical content
    A programming environment for demand-driven processing of network XML data and its performance evaluation BIBAKFull-Text 207-216
      Masakazu Yamanaka; Kenji Niimura; Tomio Kamada
    This paper proposes a programming environment for Java that processes network XML data in a demand-driven manner to return quick initial responses. Our system provides a data binding tool and a tree operation package, and the programmer can easily handle network XML data through tree-based operations using these facilities. For efficiency, demand-driven data binding allows the application to start processing a network XML document before the whole data has arrived, and our tree operators are also designed to start their calculation on the initially accessible part of the input data. Our system uses multithreading for its implementation, with optimization techniques to reduce runtime overheads. It returns initial responses quickly, and often shortens the total execution time thanks to latency hiding and reduced memory usage. Compared with an ordinary tree-based approach, our system shows a greatly improved response and a 1-28% reduction in total execution time on the benchmark programs, while incurring only 1-4% runtime overhead relative to event-driven programs.
    Keywords: XML, data binding, demand-driven, multi-threading
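    The core demand-driven idea, starting to process an XML document before the whole stream has arrived, can be sketched in a few lines using Python's incremental pull parser as a stand-in for the paper's Java system (the chunk boundaries simulate network arrival; this is an illustration of the principle, not the authors' framework):

```python
import xml.etree.ElementTree as ET

# Simulate network arrival: the document is fed chunk by chunk, with an
# element split across chunk boundaries.
chunks = [
    "<feed><item>first</item><it",
    "em>second</item>",
    "<item>third</item></feed>",
]

parser = ET.XMLPullParser(events=("end",))
seen = []
for chunk in chunks:
    parser.feed(chunk)                  # data arrives incrementally
    for event, elem in parser.read_events():
        if elem.tag == "item":
            seen.append(elem.text)      # processed as soon as complete
parser.close()
print(seen)  # ['first', 'second', 'third']
```

    Each completed `<item>` is handled as soon as its closing tag arrives, which is what allows an application to show initial results while the rest of the document is still in transit.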
    A case study on alternate representations of data structures in XML BIBAKFull-Text 217-219
      Daniel Gruhl; Daniel Meredith; Jan Pieper
    XML provides a universal and portable format for document and data exchange. While the syntax and specification of XML makes documents both human readable and machine parsable, it is often at the expense of efficiency when representing simple data structures.
       We investigate the "costs" associated with XML serialization from several resource perspectives: storage, transport, processing and human readability. These experiments are done within the context of a large text-centric service oriented architecture -- IBM's WebFountain project.
       We find that for several applications, human-readable formats outperform binary equivalents, especially in terms of data size, and that the cost of processing encoded binary data often exceeds that of processing terse human-readable formats.
    Keywords: WebFountain, XML, compression, data structures, serialization
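    A minimal sketch of the kind of measurement such a comparison rests on, using invented sample data rather than the WebFountain workload: serialize the same simple data structure as XML text and as packed binary, then compare raw and compressed sizes.

```python
import gzip
import struct
import xml.etree.ElementTree as ET

# Invented sample data: 100 (x, y) integer pairs.
points = [(i, i * i) for i in range(100)]

# Human-readable XML serialization.
root = ET.Element("points")
for x, y in points:
    ET.SubElement(root, "p", x=str(x), y=str(y))
xml_bytes = ET.tostring(root)

# Binary serialization: two little-endian 32-bit ints per point.
bin_bytes = b"".join(struct.pack("<ii", x, y) for x, y in points)

# Compare raw and gzip-compressed sizes along the 'storage' axis.
for name, data in [("xml", xml_bytes), ("binary", bin_bytes)]:
    print(name, len(data), len(gzip.compress(data)))
```

    The raw XML is several times larger than the 800-byte binary form, but its repetitive markup compresses very well, which is one reason size comparisons between "verbose" and "compact" formats can invert once compression and processing cost are taken into account.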
    Eclipse modeling framework for document management BIBAKFull-Text 220-222
      Neil Boyette; Vikas Krishna; Savitha Srinivasan
    The lifecycle of document management applications typically comprises a set of loosely coupled subsystems that provide capture, index, search, workflow, fulfillment and archival features. However, there exists no standard model for composing these elements together to instantiate a complete application. Therefore, every application invariably incorporates custom application code to provide the linkages between each of these loosely coupled subsystems. This paper proposes a model-based approach to instantiating document management applications. An Eclipse Modeling Framework (EMF) based model is used to formalize the variable elements in the document management applications. The modeling tool supports the instantiation of an EMF model for every new application and supports the generation of runtime artifacts -- this includes code, XML configurations, scripts and business logic. This approach to creating new instances of document management applications with a formal EMF model has been validated with a real-world document management application.
    Keywords: DMS, EMF, IT, ROI, business transformation, documents, eclipse, framework, logistics, modeling, process

    Demonstrations

    GroundTruth tools & technology: applications in real world BIBAKFull-Text 223-224
      Vinay Saxena; Sherif Yacoub
    The process of creating digital archives from paper-based documents is gaining popularity. Automated systems and frameworks for document analysis have been developed, but they still fall short of the required accuracy goals for text recognition, article identification, and similar tasks. Rendering problems, such as missing graphical components, incorrect reading order in multi-column journals and magazines, missing indentation, broken text lines, and hyphenation issues, are basically due to poor layout information extracted from the scanned document during the OCR process. Also lacking are tools that take the output of these processes and create highly accurate content, with associated metadata, from the original. The term "Ground Truth" in the current context refers to the process (automatic and manual collectively) by which we ensure that the end result is highly accurate and complete rich-text content (articles, papers, etc.) generated from the original scanned version of the content.
       We present PerfectDoc, a suite of tools for manual GroundTruthing. The suite consists of tools to create highly accurate GroundTruth, GT editors, and tools to take this data and deliver output suitable for web-based viewing.
    Keywords: document analysis and understanding, document digitization, document engineering, preservation of historical content
    A web-based document harmonization and annotation chain: from PDF to RDF BIBAKFull-Text 225-226
      Thierry Jacquin; Olivier Fambon; Boris Chidlovskii
    We propose a demonstration of a Web-based document harmonization and annotation chain developed within the VIKEF integrated project. The chain integrates a combination of Web Services in order to access, harmonize and semantically annotate remote document collections. Annotations are then mapped onto RDF descriptions that serve as a basis for building semantic-enabled services to support community processes.
    Keywords: PDF, RDF, document annotation, web services
    A demonstration of the document description framework BIBAKFull-Text 227-228
      John Lumley; Roger Gimson; Owen Rees
    The Document Description Framework (DDF) [1] is a representation for variable-data documents, designed to support very high flexibility in the type and extent of variation, considerably beyond the 'copy-hole' or flow-based mechanisms of existing formats and tools. This demonstration shows (i) how DDF documents can be evaluated and merged to construct complex multi-stage documents, (ii) how the layout capabilities can be extended flexibly, and (iii) how documents may be created and edited within a GUI-based environment.
    Keywords: SVG, XML, XSLT, document construction, functional programming
    Bringing the semantic web to the office desktop BIBAKFull-Text 229-230
      Timothy Miles-Board; Arouna Woukeu; Leslie Carr; Gary Wills; Wendy Hall
    Many Semantic Web applications address the needs of human readers of the Web (e.g. searching, annotating), but these technologies can also address the needs of human writers of the Web. The WiCK project has explored the application of knowledge bases and services to the Office desktop in order to assist document production, culminating in the WiCKOffice environment. The aim of this demonstration is to showcase the most recent offshoot of the WiCKOffice development, WiCKLite: a lightweight component for connecting knowledge services to document templates in order to deliver targeted assistance to end users.
    Keywords: knowledge writing, semantic web, smart tags
    XIS: an XML document integration system BIBAKFull-Text 231-232
      Guangming Xing; Chaitanya R. Malla; Andrew Ernest
    We describe XIS, an XML document integration system. The system is based on an algorithm that computes the top-down edit distance between an XML document and a schema. The complexity of the algorithm is O(t × s × log s), where t is the size of the document and s is the size of the schema.
       The system includes a GUI that allows the user to visualize the operations performed on the XML document. Synthesized and real data-sets will be used to show the efficiency and efficacy of the system.
    Keywords: XML, document integration, tree grammar
    The COG scrapbook BIBAKFull-Text 233-234
      Steven R. Bagley; David F. Brailsford
    The COG Scrapbook technology is an attempt by the authors to convert their COG technology into a usable suite of software for typical end-users, rather than Document Engineering specialists.
       This demonstration illustrates the four major components of this software suite: the COG Manipulator, COG Encapsulator, COG Extractor, and COG Creator. These four components provide the user with the tools required to manipulate COG PDF documents within the Adobe Acrobat environment.
    Keywords: COGs, FormXObject, PDF, graphic objects
    The MMiSS repository BIBAKFull-Text 235-236
      Achim Mahnke
    The MMiSS repository is a system for storing and versioning structured documents in a multi-author environment. In contrast to general-purpose versioning systems like CVS, versioning and merging are based on the logical structure of the documents. New functionalities are introduced, such as support for developing ontologies along with documents and for providing variants as a means of adaptation. They aim at the development of higher-level functions on documents, like change management and consistency checking.
    Keywords: consistency, document management, ontology, repository, variants, version control
    Document editing and browsing in AKTiveDoc BIBAKFull-Text 237-238
      Vitaveska Lanfranchi; Fabio Ciravegna; Phil Moore; Daniela Petrelli
    In this demo paper, we present AKTiveDoc, a tool for supporting sharing and reuse of knowledge in document production (e.g. writing) and use (e.g. reading).
    Keywords: free-text annotation, interfaces, knowledge suggestion, ontology-driven annotation, semantic web, semi-automatic annotation
    BigBatch: a toolbox for monochromatic documents BIBAKFull-Text 239-240
      Rafael Dueire Lins; Bruno Tenório Ávila
    BigBatch is a tool designed to automatically process thousands of monochromatic images of documents generated by production-line scanners. It removes noisy borders, checks and corrects orientation, calculates and compensates for the skew angle, crops the image to standardize document sizes, and finally compresses it into a user-defined file format. BigBatch incorporates the best recently developed algorithms for this kind of document image. BigBatch may work either in standalone or operator-assisted mode. In standalone mode, BigBatch can also process images on clusters of workstations.
    Keywords: border removal, document processing, image processing, monochromatic images, orientation, skew detection