| Engineering broad-spectrum document software: lessons from Ghostscript | | BIBA | Full-Text | 1 | |
| L. Peter Deutsch | |||
| Almost 14 years after its first public release, Ghostscript is still a
commercially thriving, actively evolving, nearly-Open Source PDL interpreter.
It has successfully evolved from its original function as a PostScript Level 1
previewer running on MS-DOS on IBM-compatible PCs to cover new input PDLs (PDF,
PCL5, and PCL XL), output formats (bitmaps, many printers, and low- and
high-level PDLs, including PostScript-to-PDF "distilling"), development
platforms (Unix, Linux, MS Windows, Macintosh, VMS, ...), deployment platforms
(desktop, server, and embedded), and graphics capabilities (including CIE-based
and CMYK color, RasterOp, anti-aliasing, and partial transparency), while
maintaining or improving performance and output quality. This talk will
examine, within the available time, the design choices (some more successful
than others) relating specifically to this broad-spectrum coverage.
Ghostscript repeatedly uses several design patterns to achieve functional and performance coverage. One example is "least common denominator plus specialized but functionally equivalent modules selected at build time". This handles differences in build environments, C language dialects, and operating system interfaces, as well as some user-selectable options. A second example is virtual functions, which handle output formats, color representations, and many other points of variability within Ghostscript. A third is "pipelineable interface", used in the memory manager and also for back ends. A fourth is "mixed-level interface", which places both high-level and low-level operations in a single interface with default implementation of the former in terms of the latter. The talk will give some specific examples of each of these, with an assessment of how well they have worked.
Until mid-2000, nearly all the development of Ghostscript was done by a single developer with little input. This produced code with extremely high internal consistency, but spotty documentation and some idiosyncratic coding and design choices (such as extremely heavy use of C preprocessor macros). With the transition of development and maintenance to a mixed team of both commercial and public developers working in a much more open process, many processes formerly happening within a single person's head must be documented (e.g., necessary coding rules), automated (e.g., checking for conformity with those rules), modified (e.g., accepting a lower level of quality in intermediate releases), and/or augmented (e.g., public code review, allowing the existence of multiple code branches). The talk will assess some of these changes and their likely impact on future evolution of the code. | |||
| A presentation language for controlling the formatting process in multimedia presentations | | BIBAK | Full-Text | 2-9 | |
| Frédéric Bes; Cécile Roisin | |||
| Multimedia information encapsulated inside documents is more and more
specific because its content is specified using domain vocabularies. Their
integration in space and time to form a document implies transformation steps
to produce "presentation structures".
In this context, presentation languages and formatters must be enhanced to cover new rendering needs such as multiple outputs of the same information or dynamic changes in the reader context. These new document models and processing architectures call for new editing and formatting services to be offered to the author. This paper describes new presentation properties that can be added to existing presentation languages and that allow the author to express priorities, more abstract properties, and fall-back positions. These properties are used by our formatter in order to provide more adaptive renderings. The architecture of this formatting service is open so that it can be used for different presentation systems with different presentation languages. In this paper, we describe our experiment using priorities and optimization requests for temporal formatting. Keywords: constraints, formatting control, multimedia presentation | |||
| Applying caT's programmable browsing semantics to specify world-wide web documents that reflect place, time, reader, and community | | BIBAK | Full-Text | 10-17 | |
| Richard Furuta; Jin-Cheon Na | |||
| In this paper we discuss application of caT, which extends the Trellis
Petri-net-based model of document/hypertext, towards specification of
Web-browsable documents that respond to their reader's characteristics,
browsing activities, use environment, and interactions with other readers. The
Petri net basis provides both a graphical representation of the nodes and links
in the hypertext and also an automaton-based specification of the browsing
behaviors encountered by readers examining the hypertext. Providing
Web-browsable responsive hypertexts in the caT context requires consideration
of the structures that might be designed in support of the application and also
of the mechanism for translating from caT's custom interfaces' multi-window
presentation to a composite that can be viewed using a standard Web browser. Keywords: Petri-net-based hypertext, Trellis, caT, context-aware hypertext | |||
| Multimedia document engineering in MCF | | BIBAK | Full-Text | 18-25 | |
| Peter King; Jocelyne Nanard; Marc Nanard | |||
| This article demonstrates how several of the general-purpose principles
which have proved successful in the area of large-scale software development
and maintenance are relevant to multimedia design. We present the Media
Construction Formalism, MCF, whose high level concepts encompass the principles
of abstraction, modularity, encapsulation, and reuse, which facilitate formal
specification during the initial steps of multimedia document engineering. MCF
has been implemented as a user-friendly design environment, which includes a
special-purpose structured editor providing a visual representation of partial
designs. The MCF system promotes the capture of constructs which may emerge
during a design, and which can be manipulated by the multimedia designer and
executed by the machine at different levels of granularity and detail. MCF uses
the metaphor of roles, players and actors to provide generic design
descriptions of multimedia scenarios, and encompasses a powerful temporal and
reactive model. Keywords: abstraction, design, engineering, modularity, encapsulation, multimedia,
reuse, role, scenario | |||
| The relevance of software documentation, tools and technologies: a survey | | BIBAK | Full-Text | 26-33 | |
| Andrew Forward; Timothy C. Lethbridge | |||
| This paper highlights the results of a survey of software professionals. One
of the goals of this survey was to uncover the perceived relevance (or lack
thereof) of software documentation, and the tools and technologies used to
maintain, verify and validate such documents. The survey results highlight the
preferences for and aversions against software documentation tools.
Participants agree that documentation tools should seek to better extract
knowledge from core resources. These resources include the system's source
code, test code and changes to both. Resulting technologies could then help
reduce the effort required for documentation maintenance, something that is
shown to rarely occur. Our data provide compelling evidence that software
professionals value technologies that improve automation of the documentation
process and facilitate its maintenance. Keywords: documentation relevance, documentation survey, documentation technologies,
program comprehension, software documentation, software engineering, software
maintenance | |||
| Supporting document and data views of source code | | BIBAK | Full-Text | 34-41 | |
| Michael L. Collard; Jonathan I. Maletic; Andrian Marcus | |||
| The paper describes the use of an XML format to store and represent program
source code. A new XML application, srcML (SouRCe Markup Language), is
presented. srcML presumes a document view of source code where information
about the syntactic structure is layered over the original source code
document. The resultant multi-layered document has a base layer of all the
original text (and formatting). The second layer is the syntactic information,
derived from the grammar of the programming language, and is encoded in XML.
This multi-layered view supports both the creation and viewing of the source
code in its original form and the use of XML technologies (for tasks such as
analysis and transformation of the source). Although directed at source code
documents (particularly C++), srcML is also applicable to other programming
languages and to languages with a strict syntax. srcML represents a departure
from the compiler-centric manner in which source code is commonly stored;
instead, a document point of view is taken, thus better supporting the
manipulation and management of the large numbers of source documents typical in
modern software systems. Keywords: XML, abstract syntax tree, markup language, program analysis, source code | |||
| Document engineering for e-business | | BIBAK | Full-Text | 42-48 | |
| Robert J. Glushko; Tim McGrath | |||
| It can be said that "document exchange" is the "mother of all patterns" for
business (and for e-business). Yet, by itself this view isn't sufficiently
prescriptive. In this paper, we present additional perspectives or frameworks
that make this abstraction more rigorous and useful. We describe an approach to
artifact-driven analysis, model refinement, and implementation for
document-intensive systems that unifies the "document analysis" approach from
publishing and the "data analysis" approach from information systems. These
traditionally contrasting approaches to understanding documents are unified in
an "Analysis Spectrum" in which presentational, structural, and content
components assume different weights or status. Our methodology emphasizes reuse
with a "Reuse Matrix," in which both business process (or document exchange)
patterns and document schema patterns are organized by different levels of
abstraction and scope. Enterprise-level patterns like "supply chain" and
"marketplace" can fit into this matrix along with process patterns like
"RosettaNet PIP" and document patterns like the "XML Common Business Library."
Taken together, these concepts form the foundation of a new discipline:
"Document Engineering for e-Business." Keywords: XML, business process modeling, document analysis, document engineering,
e-Business, patterns, reuse | |||
| XConnector: extending XLink to provide multimedia synchronization | | BIBAK | Full-Text | 49-56 | |
| Débora C. Muchaluat-Saade; Rogério F. Rodrigues; Luiz Fernando G. Soares | |||
| This paper proposes XConnector, a language for the creation of complex
hypermedia relations with causal or constraint semantics. XConnector allows the
definition of relations independently of which resources are related. Another
feature is the specification of relation libraries, providing reuse in
relationship definition. The main goal is to improve linking languages or the
linking modules of hypermedia authoring languages in order to provide
multimedia synchronization capabilities using links. Following this direction,
an extension to W3C XLink is proposed, incorporating XConnector facilities. Keywords: XConnector, XLink, hypermedia connector, links, multimedia synchronization | |||
| XLinkProxy: external linkbases with XLink | | BIBAK | Full-Text | 57-65 | |
| Paolo Ciancarini; Federico Folli; Davide Rossi; Fabio Vitali | |||
| In the linking model of the World Wide Web each link is stored in the
referring document within an attribute of the A tag. All hyperlinks defined
this way can reference a single resource or a single fragment. With the
evolution of Web technologies more powerful linking languages (XLink and
XPointer) have been proposed.
Here we introduce XLinkProxy, a Web application that allows sophisticated hyperlinks (defined using XLink and XPointer) to be defined outside referring documents, giving users the chance to build dynamic multidestination, multidirectional link databases. Keywords: XLink, XPointer, external linkbases | |||
| An open linking service supporting the authoring of web documents | | BIBAK | Full-Text | 66-73 | |
| Renato Bulcao Neto; Claudia Akemi Izeki; Maria da Graça Pimentel; Renata Pontin Fortes; Khai Nhut Truong | |||
| Both content driven web authors and application designers may have their
attention diverted from their main task when they have to be concerned with the
generation of elaborate linking structures. This work aims to demonstrate how
a metadata-enhanced web-based open linking service can be exploited towards
supporting content driven authors in their tasks. The following results are
presented in this paper: (a) the Web Linking Service (WLS), a novel open
hypermedia system that stores and exchanges metadata in RDF standard syntax for
hypertext structures across the wire and (b) two case studies in which
applications offer to their users the ability to create linking structures upon
existing contents by making use of the WLS service. Keywords: RDF-based metadata, content driven authoring, document engineering, open
linking service, web engineering | |||
| Managing and querying multi-version XML data with update logging | | BIBAK | Full-Text | 74-81 | |
| Raymond K. Wong; Nicole Lam | |||
| With the increasing popularity of storing content on the WWW and intranets in
XML form, there arises the need for the control and management of this data. As
this data is constantly evolving, users want to be able to query previous
versions, query changes in documents, as well as to retrieve a particular
document version efficiently. This paper proposes a version management system
for XML data that can manage and query changes in an effective and meaningful
manner. Keywords: XML, path expression, versioning | |||
| Experimenting with the circus language for XML modeling and transformation | | BIBAK | Full-Text | 82-87 | |
| Jean-Yves Vion-Dury; Veronika Lux; Emmanuel Pietriga | |||
| After a brief introduction to the Circus programming language, we present a
simple type set to model XML structures. We then describe a transformation that
takes a mail as input and produces a reply, showing how subtyping is used in
order to refine the type control and specialize the transformation. Conclusions
are drawn both on our (easy to use but clearly limited) XML data model and on
Circus itself; expected qualities of the language are verified; the need for
some new features is expressed. Finally, we sketch some language extensions, a
richer model for XML structures, and explain our choices and expectations. Keywords: XML, XSLT, circus, document model, programming language, typed document
transformation | |||
| Lazy XML processing | | BIBAK | Full-Text | 88-94 | |
| Markus L. Noga; Steffen Schott; Welf Löwe | |||
| This paper formalizes the domain of tree-based XML processing and classifies
several implementation approaches. The lazy approach, an original contribution,
is presented in depth. Proceeding from experimental measurements, we derive a
selection strategy for implementation approaches to maximize performance. Keywords: XML, document object model, lazy evaluation, parsing | |||
| Mapping and displaying structural transformations between XML and PDF | | BIBAK | Full-Text | 95-102 | |
| Matthew R. B. Hardy; David F. Brailsford | |||
| Documents are often marked up in XML-based tagsets to delineate major
structural components such as headings, paragraphs, figure captions and so on,
without much regard to their eventual displayed appearance. And yet these same
abstract documents, after many transformations and 'typesetting' processes,
often emerge in the popular format of Adobe PDF, either for dissemination or
archiving.
Until recently, PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure, but recent releases have introduced an internal structure tree to create the so-called 'Tagged PDF'. This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. One window shows the original XML document and the other shows its Tagged PDF counterpart, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window. Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small-screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document. Keywords: PDF, XML, document structure transformation | |||
| Towards automating of document structure transformations | | BIBAK | Full-Text | 103-110 | |
| Eila Kuikka; Paula Leinonen; Martti Penttonen | |||
| In this paper we develop a syntax-directed approach to transformation of
documents from one structure to another. The aim is to automate a
transformation between two grammars that have common parts, although the
grammars and names of elements may differ. In an important case, called local
transformations, the transformation can be performed by a finite state tree
transducer. We propose a system that can generate a transformation
semi-automatically if the user defines a matching between the elements
containing the text of the document. Multiple possible translations from the
target grammar can be restricted using a suitable heuristic function so that
the transformation can be completed in a reasonable time period. From the
generated transformation specification, it is possible to construct rules for a
tree transducer or XSLT script automatically. Keywords: document structure transformation | |||
| Simple and accurate feature selection for hierarchical categorisation | | BIBAK | Full-Text | 111-118 | |
| Wahyu Wibowo; Hugh E. Williams | |||
| Categorisation of digital documents is useful for organisation and
retrieval. While document categories can be a set of unstructured category
labels, some document categories are hierarchically structured. This paper
investigates automatic hierarchical categorisation and, specifically, the role
of features in the development of more effective categorisers. We show that a
good hierarchical machine learning-based categoriser can be developed using
small numbers of features from pre-categorised training documents. Overall, we
show that by using a few terms, categorisation accuracy can be improved
substantially: unstructured leaf level categorisation can be improved by up to
8.6%, while top-down hierarchical categorisation accuracy can be improved by up
to 12%. In addition, unlike other feature selection models -- which typically
require different feature selection parameters for categories at different
hierarchical levels -- our technique works equally well for all categories in a
hierarchical structure. We conclude that, in general, more accurate
hierarchical categorisation is possible by using our simple feature selection
technique. Keywords: categorisation, error reduction, hierarchical categorisation, web
hierarchies | |||
| Towards a semantics for XML markup | | BIBAK | Full-Text | 119-126 | |
| Allen Renear; David Dubin; C. M. Sperberg-McQueen | |||
| Although XML Document Type Definitions provide a mechanism for specifying,
in machine-readable form, the syntax of an XML markup language, there is no
comparable mechanism for specifying the semantics of an XML vocabulary. That
is, there is no way to characterize the meaning of XML markup so that the facts
and relationships represented by the occurrence of XML constructs can be
explicitly, comprehensively, and mechanically identified. This has serious
practical and theoretical consequences. On the positive side, XML constructs
can be assigned arbitrary semantics and used in application areas not foreseen
by the original designers. On the less positive side, both content developers
and application engineers must rely upon prose documentation, or, worse,
conjectures about the intention of the markup language designer -- a process
that is time-consuming, error-prone, incomplete, and unverifiable, even when
the language designer properly documents the language. In addition, the lack of
a substantial body of research in markup semantics means that digital document
processing is undertheorized as an engineering application area. Although there
are some related projects underway (XML Schema, RDF, the Semantic Web) which
provide relevant results, none of these projects directly and comprehensively
addresses the core problems of XML markup semantics. This paper (i) summarizes
the history of the concept of markup meaning, (ii) characterizes the specific
problems that motivate the need for a formal semantics for XML and (iii)
describes an ongoing research project -- the BECHAMEL Markup Semantics Project
-- that is attempting to develop such a semantics. Keywords: SGML, XML, knowledge representation, markup, semantics | |||
| Generation of images of historical documents by composition | | BIBAK | Full-Text | 127-133 | |
| Carlos A. B. Mello; Rafael D. Lins | |||
| This paper describes a system for efficient storage, indexing and network
transmission of images of historical documents. The documents are first
decomposed into their features such as paper texture, colours, typewritten
parts, pictures, etc. Document retrieval forces the re-assembly of the
document, synthesising an image visually close to the original document. The
information needed to build the final image occupies, on average, 2 Kbytes,
yielding a very efficient compression scheme. Keywords: historical documents, segmentation, synthesis, texture | |||
| A dynamic user interface for document assembly | | BIBAK | Full-Text | 134-141 | |
| Miro Lehtonen; Renaud Petit; Oskari Heinonen; Greger Lindén | |||
| Document assembly has turned out to be a convenient approach to corporate
publishing and reuse of large collections of documents. Automated assembly of a
document reduces the amount of human effort when creating customized documents
consisting of document fragments from a collection.
However, most methods used require a number of parameters to be defined prior to the assembly process, and providing these parameters in the correct format is seen to be too demanding for an average user. We have designed and implemented a graphical user interface that provides the user with a simple way to specify the parameters of the assembly process. The interface, which is dynamically generated based on a given document configuration, lets the user create and customize documents such as technical manuals. In our example assembly case, the user can select the product, the manual type, the language of the manual as well as the optional components to be included in the manual. Keywords: BML, XML, XSLT, document assembly, dynamic user interfaces, structured
documents | |||
| Degraded character image restoration using active contours: a first approach | | BIBAK | Full-Text | 142-148 | |
| Bénédicte Allier; Hubert Emptoz | |||
| In this paper, we describe an example of the use of active contours for the
reconstruction of degraded character images. Active contours were introduced
fifteen years ago by Kass et al. [8], and have been widely used since then for
segmentation purposes or for object detection in all kinds of natural images,
but they have never been used on document images. The aim of this paper is to
study whether active contours can enable the recovery of
character width in degraded character images suffering from discontinuities.
This is done using two original kinds of energies, the first one based on the
use of point attraction forces plotted on the degraded areas, and the second
one based on the use of an external "ideal" image. Keywords: active contours, character segmentation, reconstruction of degraded
character images | |||
| Recognizing records from the extracted cells of microfilm tables | | BIBAK | Full-Text | 149-156 | |
| Kenneth M. Tubbs; David W. Embley | |||
| Microfilm documents contain a wealth of information, but extracting and
organizing this information by hand is slow, error-prone, and tedious. As an
initial step toward automating access to this information, we describe in this
paper an algorithmic process to automatically identify record patterns found in
microfilm tables for pre-specified application domains. Our table-processing
algorithm accepts an XML input file describing the individual cells of a table
taken from a microfilm document, and finds for each record in the document the
cells that together comprise the record. Two key features drive the algorithm:
(1) geometric layout and (2) label matching with respect to a given
domain-specific application ontology. The algorithm achieved an accuracy of 92%
on our test corpus of genealogical microfilm tables. Keywords: automated recognition of record patterns, geometric layout, microfilm
tables, ontology matching | |||