Proceedings of the 2002 ACM Symposium on Document Engineering

Fullname: DocEng'02 Proceedings of the 2nd ACM Symposium on Document Engineering
Editors: Ethan Munson; Richard Furuta; Jonathan I. Maletic
Location: McLean, Virginia, USA
Dates: 2002-Nov-08 to 2002-Nov-09
Standard No: ISBN: 1-58113-594-7; hcibib: DocEng02
  1. Keynote
  2. Managing multimedia in documents
  3. Software and document engineering
  4. Linking documents
  5. XML manipulations
  6. Structure and transformation of documents
  7. Document reuse and semantics
  8. Document analysis and reconstruction


Keynote

Engineering broad-spectrum document software: lessons from Ghostscript (p. 1)
  L. Peter Deutsch
Almost 14 years after its first public release, Ghostscript is still a commercially thriving, actively evolving, nearly-Open Source PDL interpreter. It has successfully evolved from its original function as a PostScript Level 1 previewer running on MS-DOS on IBM-compatible PCs to cover new input PDLs (PDF, PCL5, and PCL XL), output formats (bitmaps, many printers, and low- and high-level PDLs, including PostScript-to-PDF "distilling"), development platforms (Unix, Linux, MS Windows, Macintosh, VMS, ...), deployment platforms (desktop, server, and embedded), and graphics capabilities (including CIE-based and CMYK color, RasterOp, anti-aliasing, and partial transparency), while maintaining or improving performance and output quality. This talk will examine, within the available time, the design choices (some more successful than others) relating specifically to this broad-spectrum coverage.
   Ghostscript repeatedly uses several design patterns to achieve functional and performance coverage. One example is "least common denominator plus specialized but functionally equivalent modules selected at build time". This handles differences in build environments, C language dialects, and operating system interfaces, as well as some user-selectable options. A second example is virtual functions, which handle output formats, color representations, and many other points of variability within Ghostscript. A third is "pipelineable interface", used in the memory manager and also for back ends. A fourth is "mixed-level interface", which places both high-level and low-level operations in a single interface with default implementation of the former in terms of the latter. The talk will give some specific examples of each of these, with an assessment of how well they have worked.
   Until mid-2000, nearly all the development of Ghostscript was done by a single developer with little input. This produced code with extremely high internal consistency, but spotty documentation and some idiosyncratic coding and design choices (such as extremely heavy use of C preprocessor macros). With the transition of development and maintenance to a mixed team of both commercial and public developers working in a much more open process, many processes formerly happening within a single person's head must be documented (e.g., necessary coding rules), automated (e.g., checking for conformity with those rules), modified (e.g., accepting a lower level of quality in intermediate releases), and/or augmented (e.g., public code review, allowing the existence of multiple code branches). The talk will assess some of these changes and their likely impact on future evolution of the code.
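
The "mixed-level interface" pattern named in this abstract can be illustrated with a minimal sketch. It is written in Python rather than C, and every class and method name in it is hypothetical; it is not Ghostscript's actual device API. It only shows one interface holding both a low-level operation and a high-level operation whose default is written in terms of the low-level one.

```python
class Device:
    """Hypothetical output device with a mixed-level interface."""

    def fill_rectangle(self, x, y, width, height, color):
        # Low-level operation: every concrete device must provide it.
        raise NotImplementedError

    def draw_hline(self, x, y, length, color):
        # High-level operation with a default written in terms of the
        # low-level one; a device may override it with something better.
        self.fill_rectangle(x, y, length, 1, color)


class BitmapDevice(Device):
    """Implements only the low-level operation and inherits the default."""

    def __init__(self, width, height):
        self.pixels = [[0] * width for _ in range(height)]

    def fill_rectangle(self, x, y, width, height, color):
        for row in range(y, y + height):
            for col in range(x, x + width):
                self.pixels[row][col] = color


class RecordingDevice(Device):
    """Overrides the high-level operation to emit a single command."""

    def __init__(self):
        self.commands = []

    def fill_rectangle(self, x, y, width, height, color):
        self.commands.append(("rect", x, y, width, height, color))

    def draw_hline(self, x, y, length, color):
        self.commands.append(("hline", x, y, length, color))


bitmap = BitmapDevice(4, 2)
bitmap.draw_hline(1, 0, 2, 7)      # falls back to fill_rectangle

recorder = RecordingDevice()
recorder.draw_hline(1, 0, 2, 7)    # uses the specialized override
```

In this sketch a new device only has to supply `fill_rectangle` to be usable, while a device that can do better at the high level overrides `draw_hline` without changing callers.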

Managing multimedia in documents

A presentation language for controlling the formatting process in multimedia presentations (pp. 2-9)
  Frédéric Bes; Cécile Roisin
Multimedia information encapsulated inside documents is increasingly specific because its content is specified using domain vocabularies. Integrating this information in space and time to form a document requires transformation steps that produce "presentation structures".
   In this context, presentation languages and formatters must be enhanced to cover new rendering needs, such as multiple outputs of the same information or dynamic changes in the reader's context. These new document models and processing architectures call for new editing and formatting services to be offered to the author.
   This paper describes new presentation properties that can be added to existing presentation languages and that allow the author to express priorities, more abstract properties, and fall-back positions. These properties are used by our formatter in order to provide more adaptive renderings. The architecture of this formatting service is open so that it can be used for different presentation systems with different presentation languages. In this paper, we describe our experiment using priorities and optimization requests for temporal formatting.
Keywords: constraints, formatting control, multimedia presentation
Applying caT's programmable browsing semantics to specify world-wide web documents that reflect place, time, reader, and community (pp. 10-17)
  Richard Furuta; Jin-Cheon Na
In this paper we discuss application of caT, which extends the Trellis Petri-net-based model of document/hypertext, towards specification of Web-browsable documents that respond to their reader's characteristics, browsing activities, use environment, and interactions with other readers. The Petri net basis provides both a graphical representation of the nodes and links in the hypertext and also an automaton-based specification of the browsing behaviors encountered by readers examining the hypertext. Providing Web-browsable responsive hypertexts in the caT context requires consideration of the structures that might be designed in support of the application and also of the mechanism for translating from caT's custom interfaces' multi-window presentation to a composite that can be viewed using a standard Web browser.
Keywords: Petri-net-based hypertext, Trellis, caT, context-aware hypertext
Multimedia document engineering in MCF (pp. 18-25)
  Peter King; Jocelyne Nanard; Marc Nanard
This article demonstrates how several of the general-purpose principles which have proved successful in the area of large-scale software development and maintenance are relevant to multimedia design. We present the Media Construction Formalism, MCF, whose high level concepts encompass the principles of abstraction, modularity, encapsulation, and reuse, which facilitate formal specification during the initial steps of multimedia document engineering. MCF has been implemented as a user-friendly design environment, which includes a special-purpose structured editor providing a visual representation of partial designs. The MCF system promotes the capture of constructs which may emerge during a design, and which can be manipulated by the multimedia designer and executed by the machine at different levels of granularity and detail. MCF uses the metaphor of roles, players and actors to provide generic design descriptions of multimedia scenarios, and encompasses a powerful temporal and reactive model.
Keywords: abstraction, design, engineering, modularity, encapsulation, multimedia, reuse, role, scenario

Software and document engineering

The relevance of software documentation, tools and technologies: a survey (pp. 26-33)
  Andrew Forward; Timothy C. Lethbridge
This paper highlights the results of a survey of software professionals. One of the goals of this survey was to uncover the perceived relevance (or lack thereof) of software documentation, and the tools and technologies used to maintain, verify and validate such documents. The survey results highlight the preferences for and aversions against software documentation tools. Participants agree that documentation tools should seek to better extract knowledge from core resources. These resources include the system's source code, test code and changes to both. Resulting technologies could then help reduce the effort required for documentation maintenance, something that is shown to rarely occur. Our data provide compelling evidence that software professionals value technologies that improve automation of the documentation process and facilitate its maintenance.
Keywords: documentation relevance, documentation survey, documentation technologies, program comprehension, software documentation, software engineering, software maintenance
Supporting document and data views of source code (pp. 34-41)
  Michael L. Collard; Jonathan I. Maletic; Andrian Marcus
The paper describes the use of an XML format to store and represent program source code. A new XML application, srcML (SouRCe Markup Language), is presented. srcML presumes a document view of source code where information about the syntactic structure is layered over the original source code document. The resultant multi-layered document has a base layer of all the original text (and formatting). The second layer is the syntactic information, derived from the grammar of the programming language, and is encoded in XML. This multi-layered view supports both the creation and viewing of the source code in its original form and the use of XML technologies (for tasks such as analysis and transformation of the source). Although directed at source code documents (particularly C++), srcML is also applicable to other programming languages and to languages with a strict syntax. srcML represents a departure from the compiler-centric manner in which source code is commonly stored; instead, a document point of view is taken, thus better supporting the manipulation and management of the large numbers of source documents typical in modern software systems.
Keywords: XML, abstract syntax tree, markup language, program analysis, source code
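
The layered view described in this abstract can be sketched in a few lines of Python. The element names below are illustrative assumptions and are not claimed to match the srcML schema; the sketch only shows that stripping the markup layer returns the original text, while standard XML tooling can query the syntactic layer.

```python
import xml.etree.ElementTree as ET

# Base layer: the original source text, including its formatting.
source = "int add(int a, int b) { return a + b; }"

# Hypothetical syntactic layer wrapped around that same text. The element
# names are illustrative and are not claimed to match the srcML schema.
marked_up = (
    "<unit><function><type>int</type> <name>add</name>"
    "(<param>int a</param>, <param>int b</param>) "
    "<block>{ return a + b; }</block></function></unit>"
)

root = ET.fromstring(marked_up)

# Document view: dropping the tags yields the original text unchanged.
recovered = "".join(root.itertext())

# Data view: ordinary XML tooling can query the syntactic layer.
function_names = [f.findtext("name") for f in root.iter("function")]
parameters = [p.text for p in root.iter("param")]
```

Here `recovered` equals `source` because all original characters, including whitespace and punctuation, are kept as element text or tail text rather than being discarded by a parser.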
Document engineering for e-business (pp. 42-48)
  Robert J. Glushko; Tim McGrath
It can be said that "document exchange" is the "mother of all patterns" for business (and for e-business). Yet, by itself this view isn't sufficiently prescriptive. In this paper, we present additional perspectives or frameworks that make this abstraction more rigorous and useful. We describe an approach to artifact-driven analysis, model refinement, and implementation for document-intensive systems that unifies the "document analysis" approach from publishing and the "data analysis" approach from information systems. These traditionally contrasting approaches to understanding documents are unified in an "Analysis Spectrum" in which presentational, structural, and content components assume different weights or status. Our methodology emphasizes reuse with a "Reuse Matrix," in which both business process (or document exchange) patterns and document schema patterns are organized by different levels of abstraction and scope. Enterprise-level patterns like "supply chain" and "marketplace" can fit into this matrix along with process patterns like "RosettaNet PIP" and document patterns like the "XML Common Business Library." Taken together, these concepts form the foundation of a new discipline: "Document Engineering for e-Business."
Keywords: XML, business process modeling, document analysis, document engineering, e-Business, patterns, reuse

Linking documents

XConnector: extending XLink to provide multimedia synchronization (pp. 49-56)
  Débora C. Muchaluat-Saade; Rogério F. Rodrigues; Luiz Fernando G. Soares
This paper proposes XConnector, a language for the creation of complex hypermedia relations with causal or constraint semantics. XConnector allows the definition of relations independently of which resources are related. Another feature is the specification of relation libraries, providing reuse in relationship definition. The main goal is to improve linking languages or the linking modules of hypermedia authoring languages in order to provide multimedia synchronization capabilities using links. Following this direction, an extension to W3C XLink is proposed, incorporating XConnector facilities.
Keywords: XConnector, XLink, hypermedia connector, links, multimedia synchronization
XLinkProxy: external linkbases with XLink (pp. 57-65)
  Paolo Ciancarini; Federico Folli; Davide Rossi; Fabio Vitali
In the linking model of the World Wide Web each link is stored in the referring document within an attribute of the A tag. All the hyperlinks defined this way can reference a single resource or a single fragment. With the evolution of Web technologies more powerful linking languages (XLink and XPointer) have been proposed.
   Here we introduce XLinkProxy, a Web application that allows sophisticated hyperlinks (defined using XLink and XPointer) to be defined outside referring documents, giving users the chance to build dynamic multidestination, multidirectional link databases.
Keywords: XLink, XPointer, external linkbases
An open linking service supporting the authoring of web documents (pp. 66-73)
  Renato Bulcao Neto; Claudia Akemi Izeki; Maria da Graça Pimentel; Renata Pontin Fortes; Khai Nhut Truong
Both content-driven web authors and application designers may have their attention diverted from their main task when they have to be concerned with the generation of elaborate linking structures. This work aims to demonstrate how a metadata-enhanced web-based open linking service can be exploited to support content-driven authors in their tasks. The following results are presented in this paper: (a) the Web Linking Service (WLS), a novel open hypermedia system that stores and exchanges metadata in RDF standard syntax for hypertext structures across the wire and (b) two case studies in which applications offer their users the ability to create linking structures upon existing contents by making use of the WLS service.
Keywords: RDF-based metadata, content driven authoring, document engineering, open linking service, web engineering

XML manipulations

Managing and querying multi-version XML data with update logging (pp. 74-81)
  Raymond K. Wong; Nicole Lam
With the increasing popularity of storing content on the WWW and intranet in XML form, there arises the need for the control and management of this data. As this data is constantly evolving, users want to be able to query previous versions, query changes in documents, as well as to retrieve a particular document version efficiently. This paper proposes a version management system for XML data that can manage and query changes in an effective and meaningful manner.
Keywords: XML, path expression, versioning
Experimenting with the circus language for XML modeling and transformation (pp. 82-87)
  Jean-Yves Vion-Dury; Veronika Lux; Emmanuel Pietriga
After a brief introduction to the Circus programming language, we present a simple type set to model XML structures. We then describe a transformation that takes a mail as input and produces a reply, showing how subtyping is used in order to refine the type control and specialize the transformation. Conclusions are drawn both on our (easy to use but clearly limited) XML data model and on Circus itself; expected qualities of the language are verified; the need for some new features is expressed. Finally, we sketch some language extensions, a richer model for XML structures, and explain our choices and expectations.
Keywords: XML, XSLT, circus, document model, programming language, typed document transformation
Lazy XML processing (pp. 88-94)
  Markus L. Noga; Steffen Schott; Welf Löwe
This paper formalizes the domain of tree-based XML processing and classifies several implementation approaches. The lazy approach, an original contribution, is presented in depth. Proceeding from experimental measurements, we derive a selection strategy for implementation approaches to maximize performance.
Keywords: XML, document object model, lazy evaluation, parsing

Structure and transformation of documents

Mapping and displaying structural transformations between XML and PDF (pp. 95-102)
  Matthew R. B. Hardy; David F. Brailsford
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving.
   Until recently PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure but recent releases have now introduced an internal structure tree to create the so-called 'Tagged PDF'.
   This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. In one window is shown an XML document original and in the other its Tagged PDF counterpart is seen, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window.
   Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document.
Keywords: PDF, XML, document structure transformation
Towards automating of document structure transformations (pp. 103-110)
  Eila Kuikka; Paula Leinonen; Martti Penttonen
In this paper we develop a syntax-directed approach to transformation of documents from one structure to another. The aim is to automate a transformation between two grammars that have common parts, although the grammars and names of elements may differ. In an important case, called local transformations, the transformation can be performed by a finite state tree transducer. We propose a system that can generate a transformation semi-automatically if the user defines a matching between the elements containing the text of the document. Multiple possible translations from the target grammar can be restricted using a suitable heuristic function so that the transformation can be completed in a reasonable time period. From the generated transformation specification, it is possible to construct rules for a tree transducer or XSLT script automatically.
Keywords: document structure transformation
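
The idea of a local transformation performed by a finite state tree transducer can be illustrated with a deliberately simplified sketch. It is not the system proposed in the paper: the rule table is written by hand rather than generated, it only relabels elements, and all element names and states are hypothetical. The state lets the same source element (`title`) map to different target elements depending on where it occurs.

```python
import xml.etree.ElementTree as ET

# (state, source label) -> (target label, state passed to the children).
# All labels and states here are hypothetical.
RULES = {
    ("top", "article"): ("doc", "in_doc"),
    ("in_doc", "title"): ("heading1", "leaf"),
    ("in_doc", "section"): ("div", "in_section"),
    ("in_section", "title"): ("heading2", "leaf"),
    ("in_section", "para"): ("p", "leaf"),
}


def transduce(node, state="top"):
    """Rewrite one node top-down, keeping its text and tail."""
    target_label, child_state = RULES[(state, node.tag)]
    result = ET.Element(target_label)
    result.text = node.text
    result.tail = node.tail
    for child in node:
        result.append(transduce(child, child_state))
    return result


source = ET.fromstring(
    "<article><title>Report</title>"
    "<section><title>Intro</title><para>Hello.</para></section></article>"
)
target_xml = ET.tostring(transduce(source), encoding="unicode")
```

A lookup that fails with `KeyError` in this sketch corresponds to a source element for which no rule exists in the current state; a fuller system would need to handle that case explicitly.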
Simple and accurate feature selection for hierarchical categorisation (pp. 111-118)
  Wahyu Wibowo; Hugh E. Williams
Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models -- which typically require different feature selection parameters for categories at different hierarchical levels -- our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
Keywords: categorisation, error reduction, hierarchical categorisation, web hierarchies

Document reuse and semantics

Towards a semantics for XML markup (pp. 119-126)
  Allen Renear; David Dubin; C. M. Sperberg-McQueen
Although XML Document Type Definitions provide a mechanism for specifying, in machine-readable form, the syntax of an XML markup language, there is no comparable mechanism for specifying the semantics of an XML vocabulary. That is, there is no way to characterize the meaning of XML markup so that the facts and relationships represented by the occurrence of XML constructs can be explicitly, comprehensively, and mechanically identified. This has serious practical and theoretical consequences. On the positive side, XML constructs can be assigned arbitrary semantics and used in application areas not foreseen by the original designers. On the less positive side, both content developers and application engineers must rely upon prose documentation, or, worse, conjectures about the intention of the markup language designer -- a process that is time-consuming, error-prone, incomplete, and unverifiable, even when the language designer properly documents the language. In addition, the lack of a substantial body of research in markup semantics means that digital document processing is undertheorized as an engineering application area. Although there are some related projects underway (XML Schema, RDF, the Semantic Web) which provide relevant results, none of these projects directly and comprehensively address the core problems of XML markup semantics. This paper (i) summarizes the history of the concept of markup meaning, (ii) characterizes the specific problems that motivate the need for a formal semantics for XML and (iii) describes an ongoing research project -- the BECHAMEL Markup Semantics Project -- that is attempting to develop such a semantics.
Keywords: SGML, XML, knowledge representation, markup, semantics
Generation of images of historical documents by composition (pp. 127-133)
  Carlos A. B. Mello; Rafael D. Lins
This paper describes a system for efficient storage, indexing and network transmission of images of historical documents. The documents are first decomposed into their features such as paper texture, colours, typewritten parts, pictures, etc. Document retrieval requires re-assembling the document, synthesising an image visually close to the original document. The information needed to build the final image occupies, on average, 2 Kbytes, making for a very efficient compression scheme.
Keywords: historical documents, segmentation, synthesis, texture
A dynamic user interface for document assembly (pp. 134-141)
  Miro Lehtonen; Renaud Petit; Oskari Heinonen; Greger Lindén
Document assembly has turned out to be a convenient approach to corporate publishing and reuse of large collections of documents. Automated assembly of a document reduces the amount of human effort when creating customized documents consisting of document fragments from a collection.
   However, most methods used require a number of parameters to be defined prior to the assembly process, and providing these parameters in the correct format is seen to be too demanding for an average user. We have designed and implemented a graphical user interface that provides the user with a simple way to specify the parameters of the assembly process. The interface, which is dynamically generated based on a given document configuration, lets the user create and customize documents such as technical manuals.
   In our example assembly case, the user can select the product, the manual type, the language of the manual as well as the optional components to be included in the manual.
Keywords: BML, XML, XSLT, document assembly, dynamic user interfaces, structured documents

Document analysis and reconstruction

Degraded character image restoration using active contours: a first approach (pp. 142-148)
  Bénédicte Allier; Hubert Emptoz
In this paper, we describe an example of the use of active contours for the reconstruction of degraded character images. Active contours were introduced fifteen years ago by Kass et al. [8], and have been widely used since then for segmentation purposes or for object detection in all kinds of natural images, but they have never been used on document images. The aim of this paper is to study whether active contours can recover character width in degraded character images suffering from discontinuities. This is done using two original kinds of energies, the first based on point attraction forces plotted on the degraded areas, and the second based on an external "ideal" image.
Keywords: active contours, character segmentation, reconstruction of degraded character images
Recognizing records from the extracted cells of microfilm tables (pp. 149-156)
  Kenneth M. Tubbs; David W. Embley
Microfilm documents contain a wealth of information, but extracting and organizing this information by hand is slow, error-prone, and tedious. As an initial step toward automating access to this information, we describe in this paper an algorithmic process to automatically identify record patterns found in microfilm tables for pre-specified application domains. Our table-processing algorithm accepts an XML input file describing the individual cells of a table taken from a microfilm document, and finds for each record in the document the cells that together comprise the record. Two key features drive the algorithm: (1) geometric layout and (2) label matching with respect to a given domain-specific application ontology. The algorithm achieved an accuracy of 92% on our test corpus of genealogical microfilm tables.
Keywords: automated recognition of record patterns, geometric layout, microfilm tables, ontology matching