
Proceedings of the 2009 ACM Symposium on Document Engineering

Fullname: DocEng'09 Proceedings of the 9th ACM Symposium on Document Engineering
Editors: Uwe M. Borghoff; Boris Chidlovskii
Location: Munich, Germany
Dates: 2009-Sep-16 to 2009-Sep-18
Publisher: ACM
Standard No: ISBN 978-1-60558-575-8
Papers: 43
Pages: 254
  1. Keynote
  2. Algorithms and theory
  3. Experiments and methodology
  4. Document analysis (I)
  5. Document analysis (II)
  6. Keynote
  7. Document presentation (I) -- formatting, printing and layout
  8. Document presentation (II) -- imaging
  9. Interacting with documents
  10. Modeling documents
  11. Document and linguistics (I)
  12. Document and linguistics (II)
  13. Document and programming
  14. Demos and posters

Keynote

Document engineering: a preferred partner discipline in knowledge management BIBAKFull-Text 1-2
  Josef Hofer-Alfeis
After 20 years of investigation and application of Knowledge Management (KM) there are still various views of and expectations about it, resulting from its trans-disciplinary character. It is a kind of meta-discipline with a lot of partner disciplines, e.g. personnel development, organization, process and quality management, information management, document engineering and communication. The reason is the complex character of knowledge itself, which is defined in KM as the capability for effective action. A major dimension of this capability is naturally the content dimension, i.e. which knowledge area or object-activity domain it is about, e.g. "document engineering". In any knowledge area the knowledge has three types of carriers: individuals, with their experiences, education and inherent capabilities; groups, like teams and communities, with their compound capabilities based on joint understanding and networked complementary capabilities; and finally information, carrying more or less codified and documented knowledge. Across all three knowledge carriers, three questions or dimensions of knowledge quality are interesting in any knowledge area that is important, e.g. for a business: "How deep or profound is it, e.g. the level of expertise of a subject matter expert or a best practice description?" "How much is it distributed and inter-connected, e.g. which experts, groups and documents are involved and how?" "How is it codified and documented, e.g. the quality of defining, structuring and documenting the content?"
   This is the starting point for KM: it provides adequate processes or instruments to improve or adjust the knowledge quality to the needs, e.g., of a business. But the various partner disciplines of KM are already active in support of, e.g., learning and training, inter-connection by collaboration, and information formalization and distribution -- so why do we still need KM? The partner disciplines may have profound capabilities in their fields, but they are driving a kind of one-dimensional KM. The full power of KM lies in combining their solutions into more powerful multi-dimensional approaches.
Keywords: document engineering, documented knowledge, knowledge, knowledge codification, knowledge management, knowledge management process, knowledge networking, meta-discipline, partner disciplines

Algorithms and theory

Efficient change control of XML documents BIBAKFull-Text 3-12
  Sebastian Rönnau; Geraint Philipp; Uwe M. Borghoff
XML-based documents play a major role in modern information architectures and their corresponding workflows. In this context, the ability to identify and represent differences between two versions of a document is essential. Several approaches to finding the differences between XML documents have already been proposed. Typically, they are based on tree-to-tree correction or sequence alignment. Most of these algorithms, however, are too slow and do not support the subsequent merging of changes. In this paper, we present a differencing algorithm tailored to ordered XML documents, called DocTreeDiff. It relies on our context-oriented XML versioning model, presented in earlier work, which allows for document merging. An empirical evaluation demonstrates the efficiency of our approach as well as the high quality of the generated deltas.
Keywords: XML diff, XML merge, office documents, tree-to-tree correction, version control
Differential synchronization BIBAKFull-Text 13-20
  Neil Fraser
This paper describes the Differential Synchronization (DS) method for keeping documents synchronized. The key feature of DS is that it is simple and well suited for use in both novel and existing state-based applications without requiring application redesign. DS uses deltas to make efficient use of bandwidth, and is fault-tolerant, allowing copies to converge in spite of occasional errors. We consider practical implementation of DS and describe some techniques to improve its performance in a browser environment.
Keywords: collaboration, synchronization
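The DS cycle can be illustrated with a minimal sketch: each peer diffs its live text against a shadow copy, ships the edit script, and the receiver patches both its live text and its shadow. The sketch below assumes Python's difflib and exact (non-fuzzy) patching; Fraser's method uses fuzzy patching so edits still apply after concurrent changes, and all names here are illustrative.
```python
# A minimal sketch of one Differential Synchronization half-cycle, assuming
# exact (non-fuzzy) patching for brevity; the real method applies patches
# fuzzily so they survive concurrent edits to the receiving copy.
import difflib

def make_delta(shadow: str, live: str):
    """Diff the live text against the shadow and return an edit script."""
    matcher = difflib.SequenceMatcher(a=shadow, b=live)
    return [(tag, i1, i2, live[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

def apply_delta(text: str, delta):
    """Apply an edit script from make_delta (right-to-left keeps offsets valid)."""
    for tag, i1, i2, repl in sorted(delta, key=lambda op: op[1], reverse=True):
        text = text[:i1] + repl + text[i2:]
    return text

# One half-cycle: the client pushes its edits to the server.
client_text = "the quick brown fox jumps over the lazy dog"
client_shadow = server_shadow = server_text = "the quick brown fox"

delta = make_delta(client_shadow, client_text)      # client: diff live vs. shadow
client_shadow = client_text                          # client: shadow catches up
server_text = apply_delta(server_text, delta)        # server: patch live copy
server_shadow = apply_delta(server_shadow, delta)    # server: patch shadow exactly
print(server_text)  # -> "the quick brown fox jumps over the lazy dog"
```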
On the analysis of queries with counting constraints BIBAKFull-Text 21-24
  Everardo Bárcenas; Pierre Genevès; Nabil Layaïda
We study the analysis problem for XPath expressions with counting constraints. Such expressions are commonly used in document transformations or programs in which they select portions of documents subject to transformation. We explore how recent results on the static analysis of navigational aspects of XPath can be extended to counting constraints. The static analysis of this combined XPath fragment makes it possible to detect bugs in transformations and to perform many kinds of optimizations of document transformations. More precisely, we study how a logic for finite trees, capable of expressing upward and downward recursive navigation, can be equipped with a counting operator along regular path expressions.
Keywords: counting constraints, modal logics, type checking, xml, xpath
Modelling composite document behaviour with concurrent hierarchical state machines BIBAKFull-Text 25-28
  Steve Battle; Helen Balinsky
This paper addresses the need for a modular approach to document composition and life-cycle, enabling mixed content to be used and re-used within documents. Each content item may bring with it its own workflow. Documents are conventionally considered to be the passive subjects of workflow, but when a document presents a complex mix of components it becomes harder for a centralized workflow to cater for this variety of needs. Our solution is to apply object-oriented concepts to documents, expressing process definitions alongside the content they apply to. We are interested in describing document life-cycles, and use Finite State Machines to describe the way that the individual components of a document change over time. A framework for composing these functional document components must first consider their hierarchical nesting for which we use Hierarchical State Machines. Furthermore, to accommodate the composition of independent sibling components under a common parent we use Concurrent Hierarchical State Machines. This theoretical framework provides practical guidelines for modelling composite document behaviour.
Keywords: composite documents, document-centric process, finite state machines

Experiments and methodology

Automated re-typesetting, indexing and content enhancement for scanned marriage registers BIBAKFull-Text 29-38
  David F. Brailsford
For much of England and Wales, marriage registers began to be kept in 1537. The marriage details were recorded locally, and in longhand, until 1st July 1837, when central records began. All registers were kept in the local parish church. In the period from 1896 to 1922 an attempt was made, by the Phillimore company of London, using volunteer help, to transcribe marriage registers for as many English parishes as possible and to have them printed. This paper describes an experiment in the automated re-typesetting of Volume 2 of the 15-volume Phillimore series relating to the county of Derbyshire. The source material was plain text derived from running Optical Character Recognition (OCR) on a set of page scans taken from the original printed volume. The aim of the experiment was to avoid any idea of labour-intensive page-by-page rebuilding with tools such as Acrobat Capture. Instead, it proved possible to capitalise on the regular, tabular structure of the Register pages as a means of automating the re-typesetting process, using UNIX troff software and its tbl preprocessor. A series of simple software tools helped to bring about the OCR-to-troff transformation. However, the re-typesetting of the text was not just an end in itself but, additionally, a step on the way to content enhancement and content repurposing. This included the indexing of the marriage entries and their potential transformation into XML and GEDCOM notations. The experiment has shown, for highly regular material, that the efforts of one programmer, with suitable low-level tools, can be far more effective than attempting to recreate the printed material using WYSIWYG software.
Keywords: GEDCOM, OCR, genealogy, hyper-linking, indexing, re-typesetting, troff
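As a rough illustration of the OCR-to-troff step, the following sketch (with invented field names and sample register rows) emits a tbl table that troff can typeset; it is not the paper's toolchain, only the general idea of driving tbl from tabular OCR output.
```python
# Hypothetical fragment: turn tab-separated register fields into tbl(1) markup.
# Field names and sample data are invented for illustration.
def to_tbl(rows, header=("Groom", "Bride", "Date")):
    out = [".TS", "tab(|) allbox;", "c c c", "l l l.", "|".join(header)]
    out += ["|".join(fields) for fields in rows]
    out.append(".TE")
    return "\n".join(out)

ocr_rows = [
    ("John Smith", "Mary Brown", "12 Jun 1791"),
    ("Thomas Hall", "Ann Walker", "03 Feb 1802"),
]
print(to_tbl(ocr_rows))   # pipe the result through: tbl | troff -ms
```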
Test collection management and labeling system BIBAKFull-Text 39-42
  Eunyee Koh; Andruid Kerne; Sarah Berry
In order to evaluate the performance of information retrieval and extraction algorithms, we need test collections. A test collection consists of a set of documents, a clearly formed problem that an algorithm is supposed to provide solutions to, and the answers that the algorithm should produce when executed on the documents. Defining the association between elements in the test collection and answers is known as labeling. For mainstream information retrieval problems, there are publicly available test collections which have been maintained for years. However, the scope of these problems, and thus of the associated test collections, is limited. In other cases, researchers need to build, label, and manage their own test collections, which can be a tedious and error-prone task. We were building test collections of HTML documents for problems in which the answer that the algorithm supplies is a sub-tree of the DOM (Document Object Model). To lighten the burden of this task, we developed a test collection management and labeling system (TCMLS) to facilitate usability in the process of building test collections, applying them to validate algorithms, and potentially sharing them across the research community.
Keywords: document object model, test collection, xml schema
A platform to automatically generate and incorporate documents into an ontology-based content repository BIBAKFull-Text 43-46
  Matthias Heinrich; Antje Boehm-Peters; Martin Knechtel
In order to access large information pools efficiently, data has to be structured and categorized. Recently, applying ontologies to formalize information has become an established approach. In particular, ontology-based search and navigation are promising solutions which are capable of significantly improving state-of-the-art systems (e.g. full-text search engines). However, ontology roll-out and maintenance are costly tasks. Therefore, we propose a documentation generation platform that automatically derives content and incorporates the generated content into an existing ontology. The demanding task of classifying content as concept instances and setting data type and object properties is accomplished by the documentation generation platform. Eventually, our approach results in a semantically enriched content base. Note that no manual effort is required to establish links between content objects and the ontology.
Keywords: ontology completion, semantic annotation, software documentation, text generation

Document analysis (I)

Object-level document analysis of PDF files BIBAKFull-Text 47-55
  Tamir Hassan
The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many systems for processing PDF files use algorithms designed for scanned documents, which analyse a page based on its bitmap representation. We believe this approach to be inefficient. Not only does the rasterization step cost processing time, but information is also lost and errors can be introduced.
   Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods.
Keywords: document analysis, pdf
Aesthetic measure of alignment and regularity BIBAKFull-Text 56-65
  Helen Y. Balinsky; Anthony J. Wiley; Matthew C. Roberts
To be effective as communications or sales tools, documents that are personalized and customized for each customer must be visually appealing and aesthetically pleasing. Producing perhaps millions of unique versions of essentially the same document not only presents challenges to the printing process but also disrupts the standard quality control procedures. The quality of the alignment in a document can easily distinguish professional-looking documents from amateur designs and some computer-generated layouts. A multicomponent measure of document alignment and regularity, derived directly from designer knowledge, is developed and presented in computable form. The measure includes: edge quality, page connectivity, grid regularity and alignment statistics. It is clear that these components may have different levels of importance, relevance and acceptability for various document types and classes; thus the proposed measure should always be evaluated against the requirements of the desired class of documents.
Keywords: aesthetic rules, alignment, automatic layout evaluation, designer grid, regularity, hough transform
Web article extraction for web printing: a DOM+visual based approach BIBAKFull-Text 66-69
  Ping Luo; Jian Fan; Sam Liu; Fen Lin; Yuhong Xiong; Jerry Liu
This work studies the problem of extracting articles from Web pages for better printing. Different from existing approaches of article extraction, Web printing poses several unique requirements: 1) Identifying just the boundary surrounding the text-body is not the ideal solution for article extraction. It is highly desirable to filter out some uninformative links and advertisements within this boundary. 2) It is necessary to identify paragraphs, which may not be readily separated as DOM nodes, for the purpose of better layout of the article. 3) Its performance should be independent of content domains, written languages, and Web page templates. Toward these goals we propose a novel method of article extraction using both DOM (Document Object Model) and visual features. The main components of our method include: 1) a text segment/paragraph identification algorithm based on line-breaking features, 2) a global optimization method, Maximum Scoring Subsequence, based on text segments for identifying the boundary of the article body, 3) an outlier elimination step based on left or right alignment of text segments with the article body. Our experiments showed the proposed method is effective in terms of precision and recall at the level of text segments.
Keywords: article extraction, maximal scoring subsequence
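The Maximum Scoring Subsequence step can be sketched as follows, assuming one hypothetical score per text segment (positive for content-like segments, negative for link- or ad-like ones); the scoring itself is not reproduced here.
```python
# A minimal sketch: given per-segment scores, find the contiguous run of
# segments with the highest total score as the candidate article body.
def max_scoring_subsequence(scores):
    best_sum, best_span = float("-inf"), (0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:               # a non-positive prefix never helps
            run_sum, run_start = s, i
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i + 1)
    return best_span, best_sum

segment_scores = [-2.0, 1.5, 3.0, -0.5, 2.0, -4.0, 0.5]
span, score = max_scoring_subsequence(segment_scores)
print(span, score)   # -> (1, 5) 6.0, i.e. segments 1..4 form the article body
```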
Indexing by permeability in block structured web pages BIBAKFull-Text 70-73
  Emmanuel Bruno; Nicolas Faessel; Hervé Glotin; Jacques Le Maitre; Michel Scholl
We present in this paper a model that we have developed for indexing and querying web pages based on their visual rendering. In this model, pages are split up into a set of visual blocks. The indexing of a block takes into account its content, its visual importance and, by permeability, the indexing of neighboring blocks. A page is modeled as a directed acyclic graph. Each node is associated with a block and labeled by the coefficient of importance of this block. Each edge is labeled by the coefficient of permeability of the target node content to the source node content. Importance and permeability coefficients cannot be manually quantified. In the second part of this paper, we present an experiment consisting of learning optimal permeability coefficients by gradient descent for indexing images of a web page from the text blocks of this page. The dataset is drawn from real web pages of the train and test set of the ImagEval task 2 corpus. Results demonstrate an improvement of the indexing using non-uniform block permeabilities.
Keywords: block importance, block permeability, content based image retrieval, document indexing, document retrieval

Document analysis (II)

Getting the most out of social annotations for web page classification BIBAKFull-Text 74-83
  Arkaitz Zubiaga; Raquel Martínez; Víctor Fresno
User-generated annotations on social bookmarking sites can provide interesting and promising metadata for web document management tasks like web page classification. These user-generated annotations include diverse types of information, such as tags and comments. Nonetheless, each kind of annotation has a different nature and popularity level. In this work, we analyze and evaluate the usefulness of each of these social annotations for classifying web pages over a taxonomy like that proposed by the Open Directory Project. We compare them separately to content-based classification, and also combine the different types of data to augment performance. Our experiments show encouraging results with the use of social annotations for this purpose, and we found that combining these metadata with web page content further improves the classifier's performance.
Keywords: social annotations, social bookmarking, web page classification
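A toy version of this comparison, assuming scikit-learn and an invented two-page corpus, might look like the sketch below; it only shows the "content only" versus "content plus tags" setup, not the paper's taxonomy or experimental protocol.
```python
# Toy sketch: classify pages from (a) content only and (b) content plus social
# tags. Pages, tags, labels and the concatenation strategy are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

pages = ["tutorial on python decorators", "latest football transfer rumours"]
tags = ["programming python tutorial", "sports football news"]
labels = ["Computers", "Sports"]

content_only = make_pipeline(TfidfVectorizer(), LinearSVC())
content_only.fit(pages, labels)

combined = make_pipeline(TfidfVectorizer(), LinearSVC())
combined.fit([p + " " + t for p, t in zip(pages, tags)], labels)

query = ["python news and tutorials programming"]
print(content_only.predict(query), combined.predict(query))  # expected: ['Computers'] twice
```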
Deriving image-text document surrogates to optimize cognition BIBAKFull-Text 84-93
  Eunyee Koh; Andruid Kerne
The representation of information collections needs to be optimized for human cognition. While documents often include rich visual components, collections, including personal collections and those generated by search engines, are typically represented by lists of text-only surrogates. By concurrently invoking complementary components of human cognition, combined image-text surrogates will help people to more effectively see, understand, think about, and remember an information collection. This research develops algorithmic methods that use the structural context of images in HTML documents to associate meaningful text and thus derive combined image-text surrogates. Our algorithm first recognizes which documents consist essentially of informative and multimedia content. Then, the algorithm recognizes the informative sub-trees within each such document, discards advertisements and navigation, and extracts images with contextual descriptions. Experimental results demonstrate the algorithm's efficacy. An implementation of the algorithm is provided in combinFormation, a creativity support tool for collection authoring. The enhanced image-text surrogates enhance the experiences of users finding and collecting information as part of developing new ideas.
Keywords: information extraction, search representation, surrogates
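In the same spirit, a naive sketch of pairing images with contextual text from an HTML DOM is shown below; it assumes BeautifulSoup and uses alt text plus the nearest enclosing block, rather than the paper's informative-subtree analysis, and the sample markup is invented.
```python
# Rough sketch: build image-text surrogates from alt text and the text of the
# nearest enclosing block; empty alt text is treated as decorative noise.
from bs4 import BeautifulSoup

html = """<div class="article">
  <p>The new observatory opened in March.
     <img src="dome.jpg" alt="Telescope dome at dusk"></p>
  <div class="ads"><img src="banner.gif" alt=""></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
surrogates = []
for img in soup.find_all("img"):
    alt = (img.get("alt") or "").strip()
    if not alt:                      # crude noise filter: skip decorative images
        continue
    block = img.find_parent(["p", "figure", "div"])
    context = " ".join(block.get_text(" ", strip=True).split()) if block else ""
    surrogates.append({"src": img["src"], "caption": alt, "context": context})

print(surrogates)
```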
HCX: an efficient hybrid clustering approach for XML documents BIBAKFull-Text 94-97
  Sangeetha Kutty; Richi Nayak; Yuefeng Li
This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.
Keywords: clustering, frequent mining, structure and content, subtree mining, xml documents

Keynote

From system requirements documents to integrated system modeling artifacts BIBAKFull-Text 98
  Manfred H. B. Broy
In the development of embedded systems, which starts from high-level requirements and proceeds to system specification and further to architecture, various aspects and issues have to be elicited, collected, analyzed and documented. These start with early-phase content such as goals and high-level requirements and go on to more concrete requirements and finally to system specifications and architecture design documents on which the final implementation of the system is based.
   Traditionally these contents have to be captured in documents such as product specification documents (in German: Lastenheft) and system specification documents (in German: Pflichtenheft). Typically, in the early phases of system development a large number of different documents are produced that all address different issues and aspects of the system and of the development. Unavoidably, many of these documents carry similar information and sometimes contain the same information in many different copies. Typically these documents are under continuous change due to new insights and changing constraints. As a result, configuration and version management and, in particular, change management of these documents becomes a nightmare. Every time an individual requirement, a goal or an aspect is modified, this modification has to be carried out consistently in all the documents. The changes produce new versions of the documents, and configuration management of such documents is nearly impossible. As a result, information sometimes contained in more than 20 documents tends to become inconsistent. After a while there is a tendency not to update existing documents anymore and simply to accept that, at the end of the project, many of the documents are no longer up-to-date and no longer consistent. In the best case, an updated documentation is produced at the end of the project in a step of reverse engineering. In the worst case, a final consistent documentation is not produced at all, so that the documentation of the system is completely lost and a complicated and time-consuming reconstruction of the documentation has to be carried out later, in a step of re-engineering, by the team that has to maintain the system -- by engineers who are often not involved in the development and therefore not familiar with the contents of the project.
   A different approach aims at the use of content models, called artifact models (or meta-models), in which the information about the system is captured in a structured way using modeling techniques. All of this information is organized in terms of comprehensive product models that describe the relevant contents of a system in a structured way and trace the relationships between these contents, so that there is no redundancy in the model but only relationships between its different parts. For this, a model-based development technique is most appropriate, in which substantial parts of the content are captured not by text and natural language but by specific modeling concepts. In the end such an approach results in a life-cycle product-modeling management system that supports all phases of system development and contains all relevant information about a product and its development, such that any kind of documentation about the system can be generated from the artifact model.
   In order to turn this vision of structured product models with high automation into reality, we need an integrated engineering environment that offers support for creating and managing models within well-defined process steps. The integrated development environment should comprise the following four blocks: 1) a model repository that maintains the different artifacts including their dependencies, 2) advanced tools for editing models that directly support their users in building up models, 3) tools for analyzing the product model and synthesizing new artifacts out of the product model, and 4) a workflow engine to guide the engineers through the steps defined by the development process.
Keywords: integrated artifact models, tool support

Document presentation (I) -- formatting, printing and layout

Review of automatic document formatting BIBAKFull-Text 99-108
  Nathan Hurst; Wilmot Li; Kim Marriott
We review the literature on automatic document formatting with an emphasis on recent work in the field. One common way to frame document formatting is as a constrained optimization problem where decision variables encode element placement, constraints enforce required geometric relationships, and the objective function measures layout quality. We present existing research using this framework, describing the kind of optimization problem being solved and the basic optimization techniques used to solve it. Our review focuses on the formatting of primarily textual documents, including both micro- and macro-typographic concerns. We also cover techniques for automatic table layout. Related problems such as widget and diagram layout, as well as temporal layout issues that arise in multimedia documents are outside the scope of this review.
Keywords: adaptive layout, optimization techniques, typography
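The optimization framing used in this survey can be illustrated with a toy line-breaking example: the decision variables are the breakpoints, the constraint is the line width, and the objective is total squared slack, minimized by dynamic programming. This is a deliberate simplification of Knuth-Plass-style formatting, not a description of any particular system reviewed above.
```python
# Toy illustration: line breaking as minimizing total "badness"
# (squared slack per line) by dynamic programming.
def break_lines(words, width):
    n = len(words)
    INF = float("inf")

    def badness(i, j):                       # cost of putting words[i:j] on one line
        length = sum(len(w) for w in words[i:j]) + (j - i - 1)
        return INF if length > width else (width - length) ** 2

    best = [0.0] + [INF] * n                 # best[j]: minimal cost of words[:j]
    split = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + badness(i, j)
            if cost < best[j]:
                best[j], split[j] = cost, i

    lines, j = [], n                         # recover the optimal breakpoints
    while j > 0:
        lines.append(" ".join(words[split[j]:j]))
        j = split[j]
    return list(reversed(lines))

print(break_lines("automatic document formatting as constrained optimization".split(), 20))
```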
Job profiling in high performance printing BIBAKFull-Text 109-118
  Thiago Nunes; Fabio Giannetti; Mariana Kolberg; Rafael Nemetz; Alexis Cabeda; Luiz Gustavo Fernandes
Digital presses have consistently improved their speed in the past ten years. Meanwhile, the need for document personalization and customization has increased. As a consequence of these two facts, the traditional RIP (Raster Image Processing) process has become a highly demanding computational step in the print workflow. Print Service Providers (PSPs) are now using multiple RIP engines and parallelization strategies to speed up the whole ripping process, which is currently performed on a per-page basis. Nevertheless, these strategies are not optimized in terms of assuring the best Return On Investment (ROI) for the RIP engines. Depending on the characteristics of the input document jobs, the ripping step may not achieve the print-engine speed, creating an unwanted bottleneck. The aim of this paper is to present a way to improve the ROI of PSPs by proposing a profiling strategy which enables the optimal usage of RIPs for specific job features, ensuring that jobs are always consumed at least at engine speed. The profiling strategy is based on a per-page analysis of input PDF jobs identifying their key components. This work introduces a profiler tool to extract information from jobs and some metrics to predict a job's ripping cost based on its profile. This information is extremely useful during the job splitting step, since jobs can be split in a clever way. This improves the load balance of the allocated RIP engines and makes the overall process faster. Finally, experimental results are presented in order to evaluate both the profiler and the proposed metrics.
Keywords: digital printing, job profiling, parallel processing, pdf, performance evaluation, print, print queue, raster image processing
Aesthetically-driven layout engine BIBAKFull-Text 119-122
  Helen Y. Balinsky; Jonathan R. Howes; Anthony J. Wiley
A novel Aesthetically-Driven Layout (ADL) engine for automatic production of highly customized, non-flow documents is proposed. In a non-flow document, where each page is composed of separable images and text blocks, aesthetic considerations may take precedence over the sequencing of the content. Such layout methods are most suitable for the construction of personalized catalogues, advertising flyers and sales and marketing material, all of which rely heavily on their aesthetics in order to successfully reach their intended audience. The non-flow algorithm described here permits the dynamic creation of page layouts around pre-existing static page content. Pages pre-populated with static content may include reserved areas which are filled at run-time. The remainder of a page, which is neither convex nor simply connected, is automatically filled with customer-relevant content by following the professional manual design strategy of multiple levels of layout resolution. The page designer's preferences, style and aesthetic rules are taken into account at every stage, with the highest-scoring layout being selected.
Keywords: alignment, fixed content, high customization and personalization, non-flow documents, regularity
Automated extensible XML tree diagrams BIBAKFull-Text 123-126
  John Lumley
XML is a tree-oriented meta-language and understanding XML structures can often involve the construction of visual trees. These trees may use a variety of graphics for chosen elements and often condense or elide sections of the tree to aid focus, as well as adding extra explanatory graphical material such as callouts and cross-tree links. We outline an automated approach for building such trees with great flexibility, based on the use of XSLT, SVG and a functional layout package. This paper concentrates on techniques to declare and implement such flexible decoration, rather than the layout of the tree itself.
Keywords: functional programming, svg, xml trees, xslt

Document presentation (II) -- imaging

Effect of copying and restoration on color barcode payload density BIBAKFull-Text 127-130
  Steven J. Simske; Margaret Sturgill; Jason S. Aronoff
2D barcodes are taking on increasing significance as the ubiquity of high-resolution cameras, combined with the availability of variable data printing, drives increasing amounts of "click and connect" applications. Barcodes therefore serve as an increasingly significant connection between physical and electronic portions, or versions, of documents. The use of color provides many additional advantages, including increased payload density and security. In this paper, we consider four factors affecting the readable payload in a color barcode: (1) number of print-scan (PS), or copy, cycles, (2) image restoration to offset PS-induced degradation, (3) the authentication algorithm used, and (4) the use of spectral pre-compensation (SPC) to optimize the color settings for the color barcodes. The PS cycle was shown to consistently reduce payload density by approximately 55% under all tested conditions. SPC nearly doubled the payload density, and selecting the better authentication algorithm increased payload density by roughly 50% in the mean. Restoration, however, was found to increase payload density less substantially (~30%), and only when combined with the optimized settings for SPC. These results are also discussed in light of optimizing payload density for the generation of document security deterrents.
Keywords: 3d bar codes, color compensation, color tiles, image restoration, payload density, security printing
Layout-aware limiarization for readability enhancement of degraded historical documents BIBAKFull-Text 131-134
  Flávio Bertholdo; Eduardo Valle; Arnaldo de A. Araújo
In this paper we propose a technique of limiarization (also known as thresholding or binarization) tailored to improve the readability of degraded historical documents. Limiarization is a simple image processing technique which is employed in many complex tasks like image compression, object segmentation and character recognition. The technique also finds applications in its own right: since it results in a high-contrast image, in which the foreground is clearly separated from the background, it can greatly improve the readability of a document, provided that other attributes (like character shape) do not suffer. Our technique exploits statistical characteristics of textual documents and applies both global and local thresholding. Under visual inspection in experiments made on a collection of severely degraded historical documents, it compares favorably with the state of the art.
Keywords: binarization, historical documents, image enhancement, limiarization, readability improvement
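A much-simplified sketch of combining a global and a local threshold is shown below (NumPy only); the paper's method additionally exploits layout and text statistics, and the window size and weighting constant here are arbitrary choices.
```python
# Simplified sketch: a pixel is foreground if it is dark both globally and
# relative to its local neighbourhood mean.
import numpy as np

def binarize(gray, window=15, k=0.95):
    """gray: 2-D uint8 array; returns a boolean foreground mask."""
    global_t = gray.mean() - gray.std()          # crude global threshold
    pad = window // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    local_mean = np.empty_like(gray, dtype=float)
    h, w = gray.shape
    for y in range(h):                           # sliding-window local mean
        for x in range(w):
            local_mean[y, x] = padded[y:y + window, x:x + window].mean()
    return (gray < global_t) & (gray < k * local_mean)

page = (np.random.rand(64, 64) * 255).astype(np.uint8)
mask = binarize(page)
print(mask.sum(), "foreground pixels")
```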
Geometric consistency checking for local-descriptor based document retrieval BIBAKFull-Text 135-138
  Eduardo Valle; David Picard; Matthieu Cord
In this paper, we evaluate different geometric consistency schemes, which can be used in tandem with an efficient architecture, based on voting and local descriptors, to retrieve multimedia documents. In many contexts, enforcing geometric consistency is essential to boost retrieval performance. Our empirical results show, however, that geometric consistency alone is unable to guarantee high-quality results in databases that contain too many non-discriminating descriptors.
Keywords: cbir, geometric consistency, image retrieval, local descriptors, retrieval by voting

Interacting with documents

A REST protocol and composite format for interactive web documents BIBAKFull-Text 139-148
  John M. Boyer; Charles F. Wiecha; Rahul P. Akolkar
Documents allow end-users to encapsulate information related to a collaborative business process into a package that can be saved, emailed, digitally signed, and used as the basis of interaction in an activity or an ad hoc workflow. While documents are used incidentally today in web applications, for example in HTML presentations of content stored otherwise in back-end systems, they are not yet the central artifact for developers of dynamic, data intensive web applications. This paper unifies the storage and management of the various artifacts of web applications into an Interactive Web Document (IWD). Data content, presentation, behavior, attachments, and digital signatures collected throughout the business process are unified into a single composite web resource. We describe a REST-based protocol for interacting with IWDs and a standards-based approach to packaging their multiple constituent artifacts into IWD archives based on the Open Document Format standard.
Keywords: collaboration, document-centric, html, odf, rich internet application, scxml, web application, workflow, xforms
Adding dynamic visual manipulations to declarative multimedia documents BIBAKFull-Text 149-152
  Fons Kuijk; Rodrigo Laiola Guimarães; Pablo Cesar; Dick C. A. Bulterman
The objective of this work is to define a document model extension that enables complex spatial and temporal interactions within multimedia documents. As an example we describe an authoring interface of a photo sharing system that can be used to capture stories in an open, declarative format. The document model extension defines visual transformations for synchronized navigation driven by dynamic associated content. Due to the open declarative format, the presentation content can be targeted to individuals, while maintaining the underlying data model. The impact of this work is reflected in its recent standardization in the W3C SMIL language. Multimedia players such as Ambulant and RealPlayer support the extension described in this paper.
Keywords: animation, content enrichment, declarative language, media annotation, pan and zoom, photo sharing, smil
Enriching the interactive user experience of open document format BIBAKFull-Text 153-156
  John M. Boyer; Charles F. Wiecha
The typical user experience of office documents is geared to the passive recording of user content creation. In this paper, we describe how to provide more active content within such documents, based on elaborating the integration between the Open Document Format (ODF) and the W3C standard for rich interactivity and data management in web pages (XForms). This includes the assignment of more comprehensive behaviors to single form controls, better control over collections of form controls, read-only and conditional sections of mixed content and form controls, and dynamically repeated sections automatically responsive to data changes, including data obtained from web services invoked during user interaction with the document.
Keywords: accessibility, interactive documents, odf, xforms
An e-writer for documents plus strokes BIBAKFull-Text 157-160
  Michael J. Gormish; Kurt Piersol; Ken Gudan; John Barrus
This paper describes the hardware, software, and document model for a prototype E-Writer. Paper-like displays have proved useful in E-Readers like the Kindle, in part because of low power usage and the ability to read indoors and out. We focus on emulating other properties of paper in the E-Writer: everyone knows how to use it, and users can write anywhere on the page. By focusing on a simple document model consisting primarily of images and strokes, we enabled rapid application development that integrates easily with current paper-based document workflows. This paper includes preliminary reports on usage of the E-Writer and its software by a small test group.
Keywords: electronic paper, paper-like, paperless, pen strokes, workflow

Modeling documents

Movie script markup language BIBAKFull-Text 161-170
  Dieter Van Rijsselbergen; Barbara Van De Keer; Maarten Verwaest; Erik Mannens; Rik Van de Walle
This paper introduces the Movie Script Markup Language (MSML), a document specification for the structural representation of screenplay narratives for television and feature film drama production. Its definition was motivated by a lack of available structured and open formats that describe dramatic narrative but also support IT-based production methods for audiovisual drama. The MSML specification fully supports contemporary screenplay templates in a structured fashion, and adds provisions for drama manufacturing methods that allow drama crew to define how narrative can be translated to audiovisual material. A timing model based on timed Petri nets is included to enable fine-grained event synchronization. Finally, MSML comprises an animation module through which narrative events can drive production elements like 3-D previsualization, content repurposing or studio automation. MSML is currently serialized into XML documents and is formally described by an XML Schema complemented by an ISO Schematron schema. The specification has been developed in close collaboration with actual drama production crew and has been implemented in a number of proof-of-concept demonstrators.
Keywords: drama production, narratives, screenplay, xml
Annotations with EARMARK for arbitrary, overlapping and out-of-order markup BIBAKFull-Text 171-180
  Silvio Peroni; Fabio Vitali
In this paper we propose a novel approach to markup, called Extreme Annotational RDF Markup (EARMARK), using RDF and OWL to annotate features in text content that cannot be mapped with usual markup languages. EARMARK provides a unifying framework to handle tree-based XML features as well as more complex markup for non-XML scenarios such as overlapping elements, repeated and non-contiguous ranges and structured attributes. EARMARK includes and expands the principles of XML markup, RDFa inline annotations and existing approaches to overlapping markup such as LMNL and TexMecs. EARMARK documents can also be linearized into plain XML by choosing any of a number of strategies to express a tree-based subset of the annotations as an XML structure and fitting in the remaining annotations through a number of "tricks", markup expedients for hierarchical linearization of non-hierarchical features. EARMARK provides a solid platform for providing vocabulary-independent declarative support to advanced document features such as transclusion, overlapping and out-of-order annotations within a conceptually insensitive environment such as XML, and does so by exploiting recent semantic web concepts and languages.
Keywords: earmark, markup, overlapping markup, owl, xpointer
Creation and maintenance of multi-structured documents BIBAKFull-Text 181-184
  Pierre-Édouard Portier; Sylvie Calabretto
In this article, we introduce a new problem: the construction of multi-structured documents. We first offer an overview of existing solutions to the representation of such documents. We then notice that none of them consider the problem of their construction. In this context, we use our experience with philosophers who are building a digital edition of the work of Jean-Toussaint Desanti, in order to present a methodology for the construction of multi-structured documents. This methodology is based on the MSDM model in order to represent such documents. Moreover each step of the methodology has been implemented in the Haskell functional programming language.
Keywords: digital libraries, haskell, overlapping hierarchies, xml

Document and linguistics (I)

From rhetorical structures to document structure: shallow pragmatic analysis for document engineering BIBAKFull-Text 185-192
  Gersende Georg; Hugo Hernault; Marc Cavazza; Helmut Prendinger; Mitsuru Ishizuka
In this paper, we extend previous work on the automatic structuring of medical documents using content analysis. Our long-term objective is to take advantage of specific rhetorical markers encountered in specialized medical documents (clinical guidelines) to automatically structure free text according to its role in the document. This should make it possible to generate multiple views of the same document depending on the target audience and to generate document summaries, as well as facilitating knowledge extraction from text. We established in previous work that the structure of clinical guidelines could be refined through the identification of a limited set of deontic operators. We now propose to extend this approach by analyzing the text delimited by these operators using Rhetorical Structure Theory (RST). The emphasis on causality and time in RST proves a powerful complement to the recognition of deontic structures while retaining the same philosophy of high-level recognition of sentence structure, which can be converted into application-specific mark-ups. Throughout the paper, we illustrate our findings with results produced by the automatic processing of English guidelines for the management of hypertension and Alzheimer's disease.
Keywords: medical document processing, natural language processing
On lexical resources for digitization of historical documents BIBAKFull-Text 193-200
  Annette Gotscharek; Ulrich Reffle; Christoph Ringlstetter; Klaus U. Schulz
Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online on the Internet. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR recognition of historical documents, which currently does not lead to satisfactory results. Second, even assuming perfect OCR recognition, since historical language differs considerably from modern language, the matching process between queries submitted to search engines and variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing. There are no hints about what optimal lexical resources for historical documents should look like, and the real benefit achieved by optimized lexical resources is unclear. Both questions are rather complex since the answers depend on the point in history when the documents were born. We present a series of experiments which illuminate these points. For our evaluations we collected a large corpus covering German historical documents from before 1500 to 1950 and constructed various types of dictionaries. We present the coverage reached with each dictionary for ten subperiods of time. Additional experiments illuminate the improvements in OCR accuracy and Information Retrieval that can be reached, again looking at distinct dictionaries and periods of time. For both OCR and IR, our lexical resources lead to substantial improvements.
Keywords: electronic lexica, historical spelling variants, information retrieval
A panlingual anomalous text detector BIBAKFull-Text 201-204
  Ashok C. Popat
In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
Keywords: garbage strings, language identification, mixture models, ppm, text quality, witten-bell
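A toy character-bigram detector in the spirit of this approach might look as follows; real systems mix several higher-order models with smoothing such as Witten-Bell, whereas this sketch uses add-one smoothing, an invented training snippet and an ad hoc threshold.
```python
# Toy sketch: train character bigrams on presumed-clean text, then flag
# segments whose average per-character log-probability is unusually low.
import math
from collections import Counter

def train_bigrams(text):
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = len(set(text)) + 1
    return pairs, unigrams, vocab

def avg_logprob(segment, model):
    pairs, unigrams, vocab = model
    if len(segment) < 2:
        return 0.0
    lp = sum(math.log((pairs[(a, b)] + 1) / (unigrams[a] + vocab))
             for a, b in zip(segment, segment[1:]))
    return lp / (len(segment) - 1)

clean = "the marriage registers were kept in the parish church " * 50
model = train_bigrams(clean)
for seg in ["the register of the parish", "t#e r3g!st@r 0f"]:
    score = avg_logprob(seg, model)
    print(f"{score:7.2f}  {'ANOMALOUS' if score < -3.0 else 'ok'}  {seg}")
```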

Document and linguistics (II)

Update summarization based on novel topic distribution BIBAKFull-Text 205-213
  Josef Steinberger; Karel Ježek
This paper deals with our recent research in text summarization. The field has moved from multi-document summarization to update summarization. When producing an update summary of a set of topic-related documents, the summarizer assumes prior knowledge of the reader determined by a set of older documents on the same topic. The update summarizer thus must solve a novelty vs. redundancy problem. We describe the development of our summarizer, which is based on Iterative Residual Rescaling (IRR) that creates the latent semantic space of a set of documents under consideration. IRR generalizes Singular Value Decomposition (SVD) and makes it possible to control the influence of major and minor topics in the latent space. Our sentence-extractive summarization method computes the redundancy, novelty and significance of each topic. These values are finally used in the sentence selection process. The sentence selection component prevents inner summary redundancy. The results of our participation in the TAC evaluation seem promising.
Keywords: iterative residual rescaling, latent semantic analysis, summary evaluation, text summarization
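As a bare-bones illustration of latent-topic scoring, the sketch below uses plain SVD in place of IRR and a crude, invented novelty heuristic: topics already well covered by the "old" documents are damped before scoring sentences from the "new" ones. It is not the paper's method, only the general shape of it.
```python
# Bare-bones latent-topic scoring with NumPy: rank new sentences by how much
# they load on topics that are weak in the already-read (old) documents.
import numpy as np

def term_sentence_matrix(sentences, vocab):
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            if w in vocab:
                A[vocab[w], j] += 1.0
    return A

old = ["the flood damaged the old bridge", "the bridge was closed after the flood"]
new = ["engineers began repairing the bridge", "a new ferry service carries commuters"]

vocab = {w: i for i, w in enumerate(sorted({w for s in old + new for w in s.lower().split()}))}
U, S, Vt = np.linalg.svd(term_sentence_matrix(old + new, vocab), full_matrices=False)

k = 2                                             # number of latent topics kept
old_load = np.abs(Vt[:k, :len(old)]).sum(axis=1)  # topic weight in old documents
novelty = S[:k] / (1.0 + old_load)                # damp topics already covered
scores = novelty @ np.abs(Vt[:k, len(old):])      # score the new sentences
for s, sc in sorted(zip(new, scores), key=lambda p: -p[1]):
    print(f"{sc:5.2f}  {s}")
```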
Linguistic editing support BIBAKFull-Text 214-217
  Michael Piotrowski; Cerstin Mahlow
Unlike programmers, authors only get very little support from their writing tools, i.e., their word processors and editors. Current editors are unaware of the objects and structures of natural languages and only offer character-based operations for manipulating text. Writers thus have to execute complex sequences of low-level functions to achieve their rhetoric or stylistic goals while composing. Software requiring long and complex sequences of operations causes users to make slips. In the case of editing and revising, these slips result in typical revision errors, such as sentences without a verb, agreement errors, or incorrect word order. In the LingURed project, we are developing language-aware editing functions to prevent errors. These functions operate on linguistic elements, not characters, thus shortening the command sequences writers have to execute. This paper describes the motivation and background of the LingURed project and shows some prototypical language-aware functions.
Keywords: action slips, authoring, cognitive load, computational linguistics, language-aware editing, revising
Web document text and images extraction using DOM analysis and natural language processing BIBAKFull-Text 218-221
  Parag Mulendra Joshi; Sam Liu
The Web has emerged as the most important source of information in the world. This has resulted in a need for automated software components to analyze web pages and harvest useful information from them. However, in typical web pages the informative content is surrounded by a very high degree of noise in the form of advertisements, navigation bars, links to other content, etc. Often the noisy content is interspersed with the main content, leaving no clean boundaries between them. This noisy content makes the problem of information harvesting from web pages much harder. Therefore, it is essential to be able to identify the main content of a web page and automatically isolate it from noisy content before any further analysis. Most existing approaches rely on prior knowledge of website-specific templates and hand-crafted rules for the extraction of relevant content. We propose a generic approach that does not require prior knowledge of website templates. While HTML DOM analysis and visual layout analysis approaches have sometimes been used, we believe that for higher accuracy in content extraction, the analyzing software needs to mimic a human user and understand content in natural language, similar to the way humans intuitively do, in order to eliminate noisy content.
   In this paper, we describe a combination of HTML DOM analysis and Natural Language Processing (NLP) techniques for automated extraction of the main article with associated images from web pages.
Keywords: dom trees, html documents, image extraction, natural language processing, web page text extraction

Document and programming

Relating declarative hypermedia objects and imperative objects through the NCL glue language BIBAKFull-Text 222-230
  Luiz Fernando Gomes Soares; Marcelo Ferreira Moreno; Francisco Sant'Anna
This paper focuses on the support provided by NCL (Nested Context Language) to relate objects with imperative code content and declarative hypermedia-objects (objects with declarative code content specifying hypermedia documents). NCL is the declarative language of the Brazilian Terrestrial Digital TV System (SBTVD) supported by its middleware called Ginga. NCL and Ginga are part of ISDB standards and also of ITU-T Recommendations for IPTV services.
   The main contribution of this paper is the seamless way NCL integrates imperative and declarative language paradigms without intrusion, maintaining a clear boundary between embedded objects, independent of their coding content, and defining a behavior model that avoids side effects from the use of one paradigm on the other.
Keywords: declarative and imperative code content, digital tv, glue language, intermedia synchronization, middleware, ncl
Using DITA for documenting software product lines BIBAKFull-Text 231-240
  Oscar Díaz; Felipe I. Anfurrutia; Jon Kortabitarte
Aligning the software process and the documentation process is a recipe for keeping software and documentation in synchrony, where changes in software seamlessly ripple through to its documentation counterpart. This paper focuses on documentation for Software Product Lines (SPLs). An SPL is not intended to build one application, but a number of them: a product family. In contrast to single-product software development, SPL development is based on the idea that the distinct products of the family share a significant amount of assets. This forces a change in the software process. Likewise, software documentation development should now mimic its code counterpart: product documentation should also be produced out of a common set of assets. Specifically, the paper shows how the DITA process and documents are recast using a feature-oriented approach, a realization mechanism for SPLs. In so doing, documentation artifacts are produced at the same pace and using similar variability mechanisms as those used for code artifacts. This accounts for three main advantages: uniformity, separation of concerns, and timely and accurate delivery of the documentation.
Keywords: dita, documentation, feature oriented programming, software product lines
Declarative interfaces for dynamic widgets communications BIBAKFull-Text 241-244
  Cyril Concolato; Jean Le Feuvre; Jean-Claude Dufourd
Widgets are small and focused multimedia applications that can be found on desktop computers, mobile devices or even TV sets. Widgets rely on structured documents to describe their spatial, temporal and interactive behavior but also to communicate with remote data sources. However, these sources have to be known at authoring time and the communication process relies heavily on scripting. In this paper, we describe a mechanism enabling the communication between widgets and their dynamic environment (other widgets, remote data sources). The proposed declarative mechanism is compatible with existing Widgets technologies, usable with script-based widgets as well as with fully declarative widgets. A description of an implementation is also provided.
Keywords: communication interface, declarative languages, rich media, scripting interface, widget

Demos and posters

XSL-FO 2.0: automated publishing for graphic documents BIBAKFull-Text 245-246
  Fabio Giannetti
The W3C (World Wide Web Consortium) is in the process of developing the second major version of XSL-FO (eXtensible Stylesheet Language -- Formatting Objects) [1], the formatting specification component of XSL. XSL-FO is widely deployed in industry and academia where multiple output forms (typically print and online) are needed from single-source XML. It is used in many diverse applications and countries, on a large number of implementations, to create technical documentation, reports and contracts, terms and conditions, invoices and other forms processing, such as driver's licenses, postal forms, etc. XSL-FO is also widely used for heavy multilingual work because of the internationalization aspects provided in 1.0 to accommodate multiple and mixed writing modes (writing directions such as left-to-right, top-to-bottom, right-to-left, etc.) of the world's languages. The primary goals of the W3C XSL Working Group in developing XSL 2.0 are to provide more sophisticated formatting and layout, enhanced internationalization to provide special formatting objects for Japanese and other Asian and non-Western languages and scripts, and to improve integration with other technologies such as SVG (Scalable Vector Graphics) [2] and MathML (Mathematical Markup Language) [3]. A number of XSL 1.0 implementations already support dynamic inclusion of vector graphics using W3C SVG. The XSL and SVG WGs want to define a tighter interface between XSL-FO and SVG to provide enhanced functionality. Experiments [4] with the use of SVG paths to create non-rectangular text regions, or "run-arounds", have helped to motivate further work on deeper integration of SVG graphics inside XSL-FO documents, and to work with the SVG WG on specifying the meaning of XSL-FO markup inside SVG graphics. A similar level of integration with MathML is contemplated.
Keywords: content driven pagination, graphic design, layout, math ml, print, svg, template, transactional printing, variable data print, xml, xsl-fo
GraphWrap: a system for interactive wrapping of pdf documents using graph matching techniques BIBAKFull-Text 247-248
  Tamir Hassan
We present GraphWrap, a novel and innovative approach to wrapping PDF documents. The PDF format is often used to publish large amounts of structured data, such as product specifications, measurements, prices or contact information. As the PDF format is unstructured, it is very difficult to use this data in machine processing applications. Wrapping is the process of navigating the data source, semi-automatically extracting the data and transforming it into a structured form.
   GraphWrap enables a non-expert user to create such data extraction programs for almost any PDF file in an intuitive and interactive manner. We show how a wrapper can be created by selecting an example instance and interacting with the graph representation to set conditions and choose which data items to extract. In the background, the corresponding instances are found using an algorithm based on subgraph isomorphism. The resulting wrapper can then be run on other pages and documents which exhibit a similar visual structure.
Keywords: pdf documents, wrapping
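The instance-finding step can be illustrated with networkx: the user's example instance becomes a small labelled pattern graph, and further instances are located as label-preserving subgraph isomorphisms in the page graph. This toy does not reproduce GraphWrap's actual graph representation; node names, roles and edges are invented.
```python
# Toy sketch: find all occurrences of a labelled pattern graph inside a larger
# page graph via subgraph isomorphism with a role-matching predicate.
import networkx as nx
from networkx.algorithms import isomorphism

def labelled(edges, labels):
    g = nx.Graph()
    g.add_edges_from(edges)
    nx.set_node_attributes(g, labels, "role")
    return g

# page graph: two product records plus an unrelated footer node
page = labelled(
    [("n1", "p1"), ("n1", "img1"), ("n2", "p2"), ("n2", "img2"), ("footer", "n2")],
    {"n1": "name", "p1": "price", "img1": "image",
     "n2": "name", "p2": "price", "img2": "image", "footer": "other"},
)
# pattern from the user's example: a name linked to a price and an image
pattern = labelled([("name", "price"), ("name", "image")],
                   {"name": "name", "price": "price", "image": "image"})

matcher = isomorphism.GraphMatcher(
    page, pattern, node_match=lambda a, b: a["role"] == b["role"])
for mapping in matcher.subgraph_isomorphisms_iter():
    print(mapping)   # e.g. {'n1': 'name', 'p1': 'price', 'img1': 'image'}
```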
A web-based version editor for XML documents BIBAKFull-Text 249-250
  Luis Arévalo Rosado; Antonio Polo Márquez; Miryam Salas Sánchez
The goal of this demonstration is to show a web-based editor for versioned XML documents. The user interface is an Ajax-based application characterized by its friendliness, its simplicity and its intuitive editing of XML documents as well as their versions, thereby sparing users the complexity of a versioning system. In order to store the XML documents, an XML native database is used, which has been extended to support versioning features as shown in [3].
Keywords: ajax editor, branch versioning, historical xml information, xml native databases, xml versions
Logic-based verification of technical documentation BIBAKFull-Text 251-252
  Christian Schönberg; Franz Weitl; Mirjana Jaksic; Burkhard Freitag
Checking the content coherence of digital documents is the purpose of the Verdikt system which can be applied to different domains and document types including technical documentation, e-learning documents, and web pages. An expressive temporal description logic allows for the specification of content consistency criteria along document paths. Whether the document conforms to the specification can then be verified by applying a model checker. In case of specification violations, the model checker provides counterexamples, locating errors in the document precisely. Based on a sample technical documentation in the form of a web document, the general verification process and its effectiveness, efficiency, and usability are demonstrated.
Keywords: document verification, model checking