
Proceedings of the 2003 ACM Symposium on Document Engineering

Fullname: DocEng'03 Proceedings of the 3rd ACM Symposium on Document Engineering
Editors: Cécile Roisin; Ethan V. Munson; Christine Vanoirbeek
Location: Grenoble, France
Dates: 2003-Nov-20 to 2003-Nov-22
Standard No: ISBN 1-58113-724-9; hcibib: DocEng03
  1. Document querying and transformation
  2. Keynote
  3. Multimedia and hypermedia
  4. Document formatting
  5. Document and images analysis
  6. Document management
  7. Document access and understanding
  8. Optimizing document format
  9. Editing and authoring
  10. Document based architecture & applications

Document querying and transformation

Extending xQuery with transformation operators 1-8
  Emmanuel Bruno; Jacques Le Maitre; Elisabeth Murisasco
In this paper, we propose to extend XQuery -- the XML query language -- with a set of transformation operators which will produce a copy of an XML tree in which some subtrees will be inserted, replaced or deleted. These operators -- very similar to the ones proposed for updating an XML document -- greatly simplify the expression of some queries in making it possible to express only the modified part of a tree instead of its whole reconstruction. We compare the expressivity of XQuery extended in this way with XSLT.
Keywords: XML, transformations, xQuery
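To make the idea of a transformation operator concrete, here is a minimal Python sketch of a "delete" operator that returns a modified copy of an XML tree and leaves the source untouched. It uses the standard `xml.etree.ElementTree` module rather than XQuery; the function name `delete_copy` is hypothetical and is not the syntax or semantics proposed in the paper.

```python
import copy
import xml.etree.ElementTree as ET


def delete_copy(root: ET.Element, tag: str) -> ET.Element:
    """Return a copy of the tree with every descendant element named `tag` removed.

    Hypothetical helper for illustration only: the source tree is not modified,
    and only the part to change (the tag to delete) has to be stated.
    """
    result = copy.deepcopy(root)
    # Snapshot all elements first so that removals do not disturb iteration.
    for parent in list(result.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return result


def demo():
    source = ET.fromstring("<a><b/><c><b/></c></a>")
    result = delete_copy(source, "b")
    return source, result
```

Calling `demo()` returns the unchanged source together with a copy in which both `b` elements are gone; the rest of the tree is carried over without being restated.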
Lazy XSL transformations 9-18
  Steffen Schott; Markus L. Noga
We introduce a lazy XSLT interpreter that provides random access to the transformation result. This allows efficient pipelining of transformation sequences. Nodes of the result tree are computed only upon initial access. As these computations have limited fan-in, sparse output coverage propagates backwards through the pipeline.
   In comparative measurements with traditional eager implementations, our approach is on par for complete coverage and excels as coverage becomes sparser. In contrast to eager evaluation, lazy evaluation also admits infinite intermediate results, thus extending the design space for transformation sequences.
   To demonstrate that lazy evaluation preserves the semantics of XSLT, we reduce XSLT to the lambda calculus via a functional language. While this is possible for all languages, most imperative languages cannot profit from the confluence of lambda as only one reduction applies at a time.
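As a rough illustration of the lazy evaluation idea in this abstract (result nodes are computed only on first access, so sparse access downstream propagates back through a pipeline), here is a minimal Python sketch. The names `LazyNode`, `leaf`, `lazy_map` and `demo` are hypothetical; this is not the authors' interpreter and it does not model XSLT semantics.

```python
from typing import Callable, List, Optional


class LazyNode:
    """A tree node whose children are computed only on first access."""

    def __init__(self, label: str, make_children: Callable[[], List["LazyNode"]]):
        self.label = label
        self._make_children = make_children
        self._children: Optional[List["LazyNode"]] = None  # not computed yet

    @property
    def children(self) -> List["LazyNode"]:
        if self._children is None:  # first access: compute and cache
            self._children = self._make_children()
        return self._children


def leaf(label: str) -> LazyNode:
    return LazyNode(label, lambda: [])


def lazy_map(node: LazyNode, f: Callable[[str], str], log: List[str]) -> LazyNode:
    """One pipeline stage: relabel a tree lazily, logging each input node visited."""
    log.append(node.label)
    return LazyNode(f(node.label),
                    lambda: [lazy_map(c, f, log) for c in node.children])


def demo():
    # Source tree: a(b(d), c)
    source = LazyNode("a", lambda: [LazyNode("b", lambda: [leaf("d")]), leaf("c")])
    log1: List[str] = []
    log2: List[str] = []
    stage1 = lazy_map(source, str.upper, log1)
    stage2 = lazy_map(stage1, lambda s: s + "!", log2)
    first = stage2.children[0].label  # touch only the first child of the result
    return first, log1, log2
```

In `demo()`, reading one child of the final result forces only the root and its two children in the first stage; node `d` is never visited by either stage.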
XPath on left and right sides of rules: toward compact XML tree rewriting through node patterns 19-25
  Jean-Yves Vion-Dury
XPath [3, 5] is a powerful and quite successful language able to perform complex node selection in trees through compact specifications. As such, it plays a growing role in many areas ranging from schema specifications, designation and transformation languages to XML query languages. Moreover, researchers have proposed elegant and tractable formal semantics [8, 9, 10, 14], fostering various works on mathematical properties and theoretical tools [10, 13, 12, 14].
   We propose here a novel way to consider XPath, not only for selecting nodes, but also for tree rewriting using rules. In the rule semantics we explore, XPath expressions (noted p,p') are used both on the left and on the right side (i.e. rules have the form p ⇒ p'). We believe that this proposal opens new perspectives toward building highly concise XML transformation languages on widely accepted basis.
Automating XML document structure transformations 26-28
  Paula Leinonen
This paper describes an implementation for syntax-directed transformation of XML documents from one structure to another. The system is based on the method which we have introduced in our earlier work. That work characterized certain general conditions under which a semi-automatic transformation is possible.
   The system generates semi-automatically a transformation between two structures of the same document class. The system gets source and target DTDs as an input. There is a tool for a user to define a label association between the elements of the DTDs. From the two DTDs and from the label association, the system generates the transformation specification semi-automatically. The system has a tool to help the user to select a correct translation if the target DTD produces several possible structures.
   Implementation of the transformation is based on the top-down tree transducer. From the transformation specification the system produces an XSLT script automatically.
Keywords: XML, XSLT, document structure transformation


Keynote

XML five years on: a review of the achievements so far and the challenges ahead 29-31
  Michael H. Kay
This is an extended abstract of the talk given by Michael Kay in the keynote address of the DocEng2003 symposium.
Keywords: XML, XQuery, XSLT

Multimedia and hypermedia

Using SMIL to encode interactive, peer-level multimedia annotations 32-41
  Dick C. A. Bulterman
This paper discusses applying facilities in SMIL 2.0 to the problem of annotating multimedia presentations. Rather than viewing annotations as collections of (abstract) meta-information for use in indexing, retrieval or semantic processing, we view annotations as a set of peer-level content with temporal and spatial relationships that are important in presenting a coherent story to a user. The composite nature of the collection of media is essential to the nature of peer-level annotations: you would typically annotate a single media item much differently than that same media item in the context of a total presentation.
   This paper focuses on the document engineering aspects of the annotation system. We do not consider any particular user interface for creating the annotations or any back-end storage architecture to save/search the annotations. Instead, we focus on how annotations can be represented within a common document architecture and we consider means of providing document facilities that meet the requirements of our user model. We present our work in the context of a medical patient dossier example.
Keywords: SMIL, annotation, horses, medical systems
Structuring interactive TV documents 42-51
  Rudinei Goularte; Edson dos Santos Moreira; Maria da Graça C. Pimentel
Interactive video technology is meant to support user-interaction with video in scene objects associated with navigation in video segments and access to text-based metadata. Interactive TV is one of the most important applications of this area, which has required the development of standards, techniques and tools, such as MPEG-4 and MPEG-7, to create, to describe, to deliver and to present interactive content.
   In this scenario, the structure and organization of documents containing multimedia metadata play an important role. However, the structuring and organization of Interactive TV documents have not been properly explored during the development of advanced Interactive TV services.
   This work presents a model to structure and to organize documents describing Interactive TV programs and their related media objects, as well as the links between them. This model supports the representation of contextual information, and makes it possible to use relevant metadata to implement advanced services such as object-based searches, in-movie (scenes, frames, in-frame regions) navigation, and personalization. To demonstrate the functionalities of our model, we have developed an application which uses the document descriptions of an Interactive TV program to present information about in-frame video objects.
Keywords: MPEG-7, XLink, interactive TV, media descriptions, metadata
Thematic alignment of recorded speech with documents 52-54
  Dalila Mekhaldi; Denis Lalanne; Rolf Ingold
We present in this article a method for detecting similarity links between documents' content and speech recordings' content. This process, further called thematic alignment, is a novel research area that combines both document and speech analysis. This alignment will a) provide temporal indexes to documents, which are non-temporal data, and b) help discover hidden thematic structures. This article first introduces a multi-layered document structure and briefly introduces the traditional speech structure. Further, it presents a simple similarity measure and various multi-level simple alignments between those two structures. Later, the meeting corpus is presented, as well as an evaluation of the implemented alignments. Finally, we present our future work on multi-alignments and thematic structure discovery.
Keywords: document indexing and retrieval, meeting recordings, multi-layered structure, multimodal analysis, thematic alignment
Digitizing cultural heritage manuscripts: the Bovary project 55-57
  Stéphane Nicolas; Thierry Paquet; Laurent Heutte
In this paper we describe the Bovary Project, a project to digitize the manuscripts of the first great work of the famous French writer Gustave Flaubert. The project began at the end of 2002 and should end in 2006 by providing online access to a hypertextual edition of the set of drafts of "Madame Bovary". We present the global context of the project, its main objectives, the first studies, and the outlook for carrying it out.
Keywords: digital libraries, document image analysis, genetic edition, hypermedia, indexation

Document formatting

Creating reusable well-structured PDF as a sequence of component object graphic (COG) elements 58-67
  Steven R. Bagley; David F. Brailsford; Matthew R. B. Hardy
Portable Document Format (PDF) is a page-oriented, graphically rich format based on PostScript semantics and it is also the format interpreted by the Adobe Acrobat viewers. Although each of the pages in a PDF document is an independent graphic object this property does not necessarily extend to the components (headings, diagrams, paragraphs etc.) within a page. This, in turn, makes the manipulation and extraction of graphic objects on a PDF page into a very difficult and uncertain process.
   The work described here investigates the advantages of a model wherein PDF pages are created from assemblies of COGs (Component Object Graphics) each with a clearly defined graphic state. The relative positioning of COGs on a PDF page is determined by appropriate 'spacer' objects and a traversal of the tree of COGs and spacers determines the rendering order. The enhanced revisability of PDF documents within the COG model is discussed, together with the application of the model in those contexts which require easy revisability coupled with the ability to maintain and amend PDF document structure.
Keywords: PDF, form Xobjects, graphic objects, tagged PDF
Creating personalized documents: an optimization approach 68-77
  Lisa Purvis; Steven Harrington; Barry O'Sullivan; Eugene C. Freuder
The digital networked world is enabling and requiring a new emphasis on personalized document creation. The new, more dynamic digital environment demands tools that can reproduce both the contents and the layout automatically, tailored to personal needs and transformed for the presentation device, and can enable novices to easily create such documents. In order to achieve such automated document assembly and transformation, we have formalized custom document creation as a multiobjective optimization problem, and use a genetic algorithm to assemble and transform compound personalized documents. While we have found that such an automated process for document creation opens new possibilities and new workflows, we have also found several areas where further research would enable the approach to be more broadly and practically applied. This paper reviews the current system and outlines several areas where future research will broaden its current capabilities.
Keywords: automated layout, constrained optimization, constraint-based reasoning, document design, genetic algorithm, multiobjective optimization
Inter and intra media-object QoS provisioning in adaptive formatters 78-87
  Rogério Ferreira Rodrigues; Luiz Fernando Gomes Soares
The development of hypermedia/multimedia systems requires the implementation of an element, usually known as a formatter, which is in charge of receiving the specification of a document (structure, media-object relationships and presentation descriptions) and controlling its presentation. The process of controlling and maintaining the presentation of a hyperdocument with an output of acceptable quality is a QoS orchestration problem, which needs to be treated by formatters at two related levels: the inter media-object and the intra media-object orchestration. This paper discusses the issues associated with QoS provisioning in hypermedia systems, focusing on the design and implementation of formatters. We propose a QoS framework for hypermedia formatters based on a generic quality of service model for communication environments. The paper also comments on the experience obtained in instantiating the framework for the HyperProp system formatter.
Keywords: hyperProp system, hypermedia formatter, media synchronization, quality of service
Using SVG as the rendering model for structured and graphically complex web material 88-91
  Julius C. Mong; David F. Brailsford
This paper reports some experiments in using SVG (Scalable Vector Graphics), rather than the browser default of (X)HTML/CSS, as a potential Web-based rendering technology, in an attempt to create an approach that integrates the structural and display aspects of a Web document in a single XML-compliant envelope.
   Although the syntax of SVG is XML based, the semantics of the primitive graphic operations more closely resemble those of page description languages such as PostScript or PDF. The principal usage of SVG, so far, is for inserting complex graphic material into Web pages that are predominantly controlled via (X)HTML and CSS.
   The conversion of structured and unstructured PDF into SVG is discussed. It is found that unstructured PDF converts into pages of SVG with few problems, but difficulties arise when one attempts to map the structural components of a Tagged PDF into an XML skeleton underlying the corresponding SVG. These difficulties are not fundamentally syntactic; they arise largely because browsers are innately bound to (X)HTML/CSS as their default rendering model. Some suggestions are made for ways in which SVG could be more totally integrated into browser functionality, with the possibility that future browsers might be able to use SVG as their default rendering paradigm.
Keywords: PDF, SVG, XML, vector graphics
Improving formatting documents by coupling formatting systems 92-94
  Fateh Boulmaiz; Cécile Roisin; Frédéric Bes
In this paper, we present a framework for coupling an existing formatting system such as SMIL[7] and Madeus[13] with a formatting control system XEF[10]. This framework allows the coupling process to be performed at two levels: 1) the language level, which is concerned with how to link the control features of XEF and the elements of an existing formatting system, and 2) the formatter level, which deals with the creation of a new formatter by formatter composition.
   The overall objective is to provide more powerful and flexible formatting services to cover new needs such as adaptive and/or generated presentations.
Keywords: language coupling, presentation language, software coupling

Document and images analysis

INFTY: an integrated OCR system for mathematical documents 95-104
  Masakazu Suzuki; Fumikazu Tamari; Ryoji Fukuda; Seiichi Uchida; Toshihiro Kanahori
An integrated OCR system for mathematical documents, called INFTY, is presented. INFTY consists of four procedures, i.e., layout analysis, character recognition, structure analysis of mathematical expressions, and manual error correction. In those procedures, several novel techniques are utilized for better recognition performance. Experimental results on about 500 pages of mathematical documents showed high character recognition rates on both mathematical expressions and ordinary texts, and sufficient performance on the structure analysis of the mathematical expressions.
Keywords: character and symbol recognition, mathematical OCR, structure analysis of mathematical expressions
Information encoding into and decoding from dot texture for active forms 105-114
  Bilan Zhu; Masaki Nakagawa
We describe here information encoding and decoding methods applied to dot texture for active forms. We employ dot texture made of tiny dots and looking like gray color to print various forms. This facilitates the separation of handwriting from its input frame even under monochrome printing/reading environments. It also makes the forms determine how to process filled-in handwriting according to the information embedded in the dot texture. The embedded information results in an improved recognition rate of handwriting, and allows the form processing to be directed by the form itself rather than by the form reading machine. Thus, the form-reading machine becomes a general-purpose machine allowing different forms inputted into it to be processed differently as specified by each form. We compare various dot shapes and information encoding/decoding methods for those shapes. Then, we present how to locate input frames, separate handwriting from input frames and segment handwriting into characters. We also present preliminary evaluation of the described methods.
Keywords: form processing, form recognition, labeling, morphology, paper-based UI
Effective text extraction and recognition for WWW images 115-117
  Jun Sun; Zhulong Wang; Hao Yu; Fumihito Nishino; Yutaka Katsuyama; Satoshi Naoi
Images play a very important role in web content delivery. Many WWW images contain text information that can be used for web indexing and searching. A new text extraction and recognition algorithm is proposed in this paper. The character strokes in the image are first extracted by color clustering and connected component analysis. A novel stroke verification algorithm is used to effectively remove non-character strokes. The verified strokes are then used to build the binary text line image, which is segmented and recognized by dynamic programming. Since text in a WWW image usually has a close relationship with the webpage content, approximate string matching is used to revise the recognition result by matching the content in the webpage with the content in the image. This effective post-processing not only improves the recognition performance, but can also be used in other applications such as finding the correspondence between an image and a webpage paragraph.
Keywords: approximate matching, text extraction, text recognition
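The post-processing step mentioned in this abstract (revising a recognition result by approximate matching against the surrounding webpage text) can be sketched roughly as follows. The abstract does not say which matching method is used; Levenshtein edit distance is an assumption here, and the names `edit_distance`, `revise` and the `max_dist` threshold are hypothetical.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def revise(recognized: str, page_words: List[str], max_dist: int = 2) -> str:
    """Replace an OCR result with the closest page word if it is close enough.

    `max_dist` is an assumed threshold, not a value from the paper.
    """
    if not page_words:
        return recognized
    best = min(page_words, key=lambda w: edit_distance(recognized, w))
    return best if edit_distance(recognized, best) <= max_dist else recognized
```

For example, `revise("d0cument", ["document", "engineering"])` snaps the misread string to `"document"`, while a string with no close candidate on the page is returned unchanged.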
Accuracy improvement of automatic text classification based on feature transformation 118-120
  Guowei Zu; Wataru Ohyama; Tetsushi Wakabayashi; Fumitaka Kimura
In this paper, we describe a comparative study on techniques of feature transformation and classification to improve the accuracy of automatic text classification. The normalization to the relative word frequency, the principal component analysis (K-L transformation) and the power transformation were applied to the feature vectors, which were classified by the Euclidean distance, the linear discriminant function, the projection distance, the modified projection distance and the SVM.
Keywords: automatic text classification, principal component analysis, variable transformation
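Two of the feature transformations named in this abstract, normalization to relative word frequency and the power transformation, are simple enough to sketch together with the Euclidean distance classifier. This is a minimal illustration in plain Python, not the authors' experimental setup: the exponent `v = 0.5`, the class labels and the class mean vectors are assumptions, and the principal component analysis step is omitted.

```python
import math
from typing import Dict, List, Sequence


def relative_frequency(counts: Sequence[float]) -> List[float]:
    """Divide each word count by the document's total word count."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def power_transform(x: Sequence[float], v: float = 0.5) -> List[float]:
    """Apply x_i -> x_i ** v to each feature (v = 0.5 is an assumed value)."""
    return [xi ** v for xi in x]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def classify(x: Sequence[float], class_means: Dict[str, Sequence[float]]) -> str:
    """Assign x to the class whose mean vector is nearest in Euclidean distance."""
    return min(class_means, key=lambda label: euclidean(x, class_means[label]))


def demo() -> str:
    # Hypothetical class means, already expressed as transformed features.
    means = {
        "sports": power_transform([0.7, 0.2, 0.1]),
        "finance": power_transform([0.1, 0.2, 0.7]),
    }
    doc = power_transform(relative_frequency([8, 1, 1]))
    return classify(doc, means)
```

The power transformation compresses large relative frequencies and stretches small ones, which is one reason it can help distance-based classifiers on skewed word-count data.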

Document management

Context representation, transformation and comparison for ad hoc product data exchange 121-130
  Jingzhi Guo; Chengzheng Sun
Product data exchange is the precondition of business interoperation between Web-based firms. However, millions of small and medium sized enterprises (SMEs) encode their Web product data in ad hoc formats for electronic product catalogues. This prevents product data exchange between business partners for business interoperation. To solve this problem, this paper has proposed a novel concept-centric catalogue engineering approach for representing, transforming and comparing semantic contexts in ad hoc product data exchange. In this approach, concepts and contexts of product data are specified along data exchange chain and are mapped onto several novel XML product map (XPM) documents by utilizing XML hierarchical structure and its syntax. The designed XPM has overcome the semantic limitations of XML markup and has achieved the semantic interoperation for ad hoc product data exchange.
Keywords: XML product map, XPM, ad hoc product data exchange, concept, context comparison, context representation, context transformation, electronic commerce, electronic product catalogue, product data integration, semantics
Preservation of digital publications: an OAIS extension and implementation 131-139
  Peter Rödig; Uwe M. Borghoff; Jan Scheffczyk; Lothar Schmitz
Over the last decades, the amount of digital documents has increased exponentially. Nevertheless, traditional document engineering methods are applied. Even worse, the long-term preservation issues have been neglected in standard document life cycle implementations.
   Our digital (cultural) heritage is, therefore, highly endangered by the silent obsolescence of data formats, software and hardware. Severe losses of information already happened. It is high time to implement concrete solutions.
   Fortunately numerous institutions already target these issues. Moreover, with the OAIS reference model a rich standardized conceptual framework is available, which already serves as implementation basis. This paper discusses an extension to the OAIS reference model and illustrates a prototype implementation of a document life cycle that is enriched by functions for long-term preservation.
   More precisely, this paper aims to provide first solutions to the following three problem areas:
  1. Detachment: OAIS defines no functions for the process of detaching digital documents prior to the ingest function. This detachment function is modeled in great detail and implemented for the provision of the so-called OAIS's submission information packages (SIP).
  2. DBMS: OAIS defines a very complex functionality. We show how a standard database management system (DBMS) can support a wide variety of required functionalities in an integrated and homogenous way. Among others OAIS's data management, archival storage, and access are supported.
  3. Metadata: So far, OAIS does not cover any aspects of the metadata generation. Here, we briefly discuss the (semi-)automatic generation of a metadata set.
   In order to evaluate the feasibility of our approach, we built a first prototype. We carried out our experiments in close cooperation with the Bavarian State Library, Munich, which is engaged in numerous international initiatives dealing with the problem of long-term preservation. Our University Library also supported us by delivering a representative test set of digital publications.
   We conclude our paper by presenting some lessons learned from our conceptual work and from our real world experiments.
Keywords: OAIS, archival systems, database management, detachment of digital publications, digital libraries, document management, long-term preservation, metadata
    Consistent document engineering: formalizing type-safe consistency rules for heterogeneous repositories 140-149
      Jan Scheffczyk; Uwe M. Borghoff; Peter Rödig; Lothar Schmitz
    When a group of authors collaboratively edits interrelated documents, consistency problems occur almost immediately. Current document management systems (DMS) provide useful mechanisms such as document locking and version control, but often lack consistency management facilities.
       If at all, consistency is "defined" via informal guidelines, which do not support automatic consistency checks.
       In this paper, we propose to use explicit formal consistency rules for heterogeneous repositories that are managed by traditional DMS. Rules are formalized in a variant of first-order temporal logic. Functions and predicates, implemented in a full programming language, provide complex (even higher-order) functionality. A static type system supports rule formalization, where types also define (formal) document models. In the presence of types, the challenge is to smoothly combine a first-order logic with a useful type system including subtyping. In implementing a tolerant view of consistency, we do not expect that repositories satisfy consistency rules. Instead, a novel semantics precisely pinpoints inconsistent document parts and indicates when, where, and why a repository is inconsistent.
       Our major contributions are (1) the use of explicit formal rules giving a precise (and still comprehensible) notion of consistency, (2) a static type system securing the formalization process, (3) a novel semantics pinpointing inconsistent document (parts) precisely, and (4) a design of how to automatically check consistency for document engineering projects that use existing DMS. We have implemented a prototype of a consistency checker. Applied to real world content, it shows that our contributions can significantly improve consistency in document engineering processes.
    Keywords: consistency in document engineering, document management, temporal logic
    A ground-truthing engine for proofsetting, publishing, re-purposing and quality assurance 150-152
      Steven J. Simske; Margaret Sturgill
    We present design strategies, implementation preferences and throughput results obtained in deploying a UI-based ground truthing engine as the last step in the quality assurance (QA) for the conversion of a large out-of-print book collection into digital form. A series of automated QA steps were first performed on the document. Five distinct zoning analysis options were deployed and the PDF output thence generated was used to regenerate TIFF files for comparison to the originals. Regenerated TIFFs failing automated QA or a separate visual QA were tagged for ground truthing. Less than 3% of the pages in a 1.2×10^6-page corpus required ground truthing, resulting in a throughput rate of "fully-proofed" pages of 2×10^5 pages/man-week. Among the design advantages crucial for this throughput rate was the use of the identical zoning engine for the original production workflow and for the ground truthing engine.
    Keywords: layout, print-on-demand, region management, templates

    Document access and understanding

    Structured multimedia document classification 153-160
      Ludovic Denoyer; Jean-Noël Vittaut; Patrick Gallinari; Sylvie Brunessaux; Stephan Brunessaux
    We propose a new statistical model for the classification of structured documents and consider its use for multimedia document classification. Its main originality is its ability to simultaneously take into account the structural and the content information present in a structured document, and also to cope with different types of content (text, image, etc). We present experiments on the classification of multilingual pornographic HTML pages using text and image data. The system accurately classifies porn sites from 8 European languages. This corpus has been developed by EADS company in the context of a large Web site filtering application.
    Keywords: Bayesian networks, categorization, generative model, multimedia document, statistical machine, structured document, web page filtering
    Methods for the semantic analysis of document markup 161-170
      Petra Saskia Bayerl; Harald Lüngen; Daniela Goecke; Andreas Witt; Daniel Naber
    We present an approach on how to investigate what kind of semantic information is regularly associated with the structural markup of scientific articles. This approach addresses the need for an explicit formal description of the semantics of text-oriented XML-documents. The domain of our investigation is a corpus of scientific articles from psychology and linguistics from both English and German online available journals.
       For our analyses, we provide XML-markup representing two kinds of semantic levels: the thematic level (i.e., topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles' document structure also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain, the different description levels might conflict so that it is impossible to model them within a single XML document.
       For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses.
    Keywords: XML, information extraction, prolog, semantic analysis
    Interactive information retrieval from XML documents represented by attribute grammars 171-174
      Alda Lopes Gançarski; Pedro Rangel Henriques
    In this paper, we describe a system for interactively accessing XML documents represented by attribute grammars. The system has two main components: (1) the query editor/processor, where the user interactively specifies his needs; (2) the document analyzer, which performs the query evaluation operations that access the documents directly. The interactive construction of queries is based on the manipulation of intermediate results during query construction and evaluation. We believe this helps the user to achieve the desired result.
    Keywords: XML representation, interactive retrieval

    Optimizing document format

    Two diet plans for fat PDF 175-184
      Thomas A. Phelps; Robert Wilensky
    As Adobe's Portable Document Format has exploded in popularity so too has the number of PDF generators, and predictably the quality of generated PDF varies considerably. This paper surveys a range of PDF optimizations for space, and reports the results of a tool that can postprocess existing PDFs to reduce file sizes by 20 to 70% for large classes of PDFs. (Further reduction can often be obtained by recoding images to lower resolutions or with newer compression methods such as JBIG2 or JPEG2000, but those operations are independent of PDF per se and not a component of the results reported here.) A new PDF storage format called "Compact PDF" is introduced that achieves for many classes of PDF an additional reduction of 30 to 60% beyond what is possible in the latest PDF specification (version 1.5, corresponding to Acrobat 6); for example, the PDF 1.5 Reference manual shrinks from 12.2MB down to 4.2MB. The changes required by Compact PDF to the PDF specification and to PDF readers are easily understood and straightforward to implement.
    Keywords: PDF, compact PDF, compression, multivalent
    Compression of scan-digitized Indian language printed text: a soft pattern matching technique 185-192
      U. Garain; S. Debnath; A. Mandal; B. B. Chaudhuri
    In this paper, a new compression scheme is presented for Indian Language (IL) textual document images. Since OCR technology for IL scripts is not yet mature enough, transcription of these documents into the digital domain needs new techniques that achieve a high degree of compression as well as suitable methods to perform various operations like document indexing, retrieval, etc. The proposed method is essentially based on a symbolic compression technique, which has been realized with an efficient segmentation-based clustering approach. A soft pattern-matching technique has been implemented using two different feature sets that co-operate with each other to build an efficient prototype library. Experiments have been done for documents printed in Devnagari (Hindi) and Bangla scripts, the two most widely used scripts in the Indian sub-continent. Test results show that the proposed technique outperforms several standard methods like CCITT Group-4, JBIG, etc. which are frequently used for compression of document images.
    Keywords: data compression, Indian language, pattern matching, textual image

    Editing and authoring

    Semantically-based text authoring and the concurrent documentation of experimental protocols BIBAKFull-Text 193-202
      Caroline Brun; Marc Dymetman; Eric Fanchon; Stanislas Lhomme; Sylvain Pogodalla
    We describe an application of controlled text authoring to biological experiment reports. This work is the result of a collaboration between a computational linguistics team and biologists specializing in protein production studies. We start by presenting our semantically-controlled authoring system, MDA (Multilingual Document Authoring), an expressive model for specifying well-formedness conditions both at the level of the document content and at the level of its textual realization. We then discuss the practical needs of experiment documentation in bioengineering. We go on to describe the prototype we have developed for this application domain, along with a preliminary evaluation. Finally we discuss a promising new idea emerging from the experimentation but which seems of wider applicability: how the authoring system represents a step towards integrating the formalization of an experimental protocol with its associated textual documentation.
    Keywords: XML, XML-schemas, concurrent documentation, constrained document specification, document authoring, experimental protocols, logic programming, natural language generation
    A structural adviser for the XML document authoring BIBAKFull-Text 203-211
      Boris Chidlovskii
    Since the XML format became a de facto standard for structured documents, IT research and industry have developed a number of XML editors to help users produce structured documents in XML format. However, the manual generation of structured documents in XML format remains a tedious and time-consuming process because of the excessive verbosity and length of XML code. In this paper, we design a structural adviser for XML document authoring. The adviser intervenes at any step of the authoring process to suggest the tag or entire tree-like pattern the user is most likely to use next. Adviser suggestions are based on finding analogies between the currently edited fragment and sample data, which are either previously generated documents in the collection or the history of the current document authoring. The adviser is beneficial in cases when no schema is provided for XML documents, or when the schema associated with the document is too general and the sample data contain specific patterns not captured in the schema. We design the adviser architecture and develop a method for efficient indexing and retrieval of optimal suggestions at any step of the document authoring.
    Keywords: XML markup, data mining, structural pattern, suggestion
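    As a rough illustration of suggesting the next tag from sample data, the sketch below counts which tag most often follows another in previously authored tag sequences. This is a deliberately simplified stand-in (plain successor frequencies), not the indexing and retrieval method of the paper; the names build_model and suggest and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sample_sequences):
    """Count, for each tag, how often each other tag directly follows it."""
    model = defaultdict(Counter)
    for seq in sample_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            model[prev][nxt] += 1
    return model

def suggest(model, last_tag, k=2):
    """Return up to k tags that most frequently followed last_tag in the samples."""
    return [tag for tag, _ in model[last_tag].most_common(k)]

# Hypothetical sample data: sibling tag sequences from earlier documents.
samples = [
    ["title", "author", "abstract"],
    ["title", "author", "author", "abstract"],
    ["title", "abstract"],
]
model = build_model(samples)
top_after_title = suggest(model, "title", k=1)
```

    With no schema available, the counts alone determine the ranking; a tag never seen in the samples simply yields no suggestions.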
    User-directed analysis of scanned images BIBAKFull-Text 212-221
      Steven J. Simske; Jordi Arnabat
    Digital capture (scanning in all its forms, and digital photography/video recording), in providing virtually free temporary memory of captured information, allows users to "over-gather" information during capture, and then to discard unwanted material later. For cameras and video recorders, such editing largely consists of discarding images or frames in their entirety. For scanners (and high-resolution camera/video), such editing benefits from a preview capability that provides quick and reliable user-interface tools for selecting, filtering and saving specific portions of the input. Appropriate preview user interface (UI) tools ease the accessing, editing and dispatch to desired destination (archive, application, webpage, etc.) of captured information (text, tables, drawings, photos, etc.). In this paper, we present several different means for the user-directed "rapid capture" of portions of a scanned image. Specifically, we review past, present and future preview-based UI tools that allow efficient and accurate means of capture to the user. The bases of these tools, as described herein, are user-directed zoning analysis, known as "click and select", which incorporates a bottom-up zoning analysis engine; and statistics-based region classification, which allows rapid reconfiguration of region identification and clustering. We conclude with our view of the future of UI-directed capture.
    Keywords: bottom-up analysis, classification, click and select, preview display, scanning, segmentation, user interface, zoning
    Handling syntactic constraints in a DTD-compliant XML editor BIBAKFull-Text 222-224
      Y. S. Kuo; Jaspher Wang; N. C. Shih
    By exploiting the theories of automata and graphs, we propose algorithms and a process for editing valid XML documents [4][5]. The editing process avoids syntactic violations altogether, thus freeing the user from any syntactic concerns. Based on the proposed algorithms and process, we build an XML editor with forms as its user interface.
    Keywords: XML editor, automata theory, regular expression
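    To make the automaton idea concrete, the following minimal sketch encodes one assumed DTD content model, (title, author+, (para | list)*), as a hand-built deterministic automaton and asks which child elements may legally come next. The content model, state names, and function names are assumptions for illustration only; the paper's own algorithms are not reproduced here.

```python
# Hypothetical content model: (title, author+, (para | list)*)
TRANSITIONS = {
    "start": {"title": "after_title"},
    "after_title": {"author": "body"},
    "body": {"author": "body", "para": "tail", "list": "tail"},
    "tail": {"para": "tail", "list": "tail"},
}
# States in which the child sequence already satisfies the content model.
ACCEPTING = {"body", "tail"}

def state_after(children):
    """Run the automaton over the child tags entered so far."""
    state = "start"
    for tag in children:
        # A KeyError here means the prefix is already invalid.
        state = TRANSITIONS[state][tag]
    return state

def allowed_next(children):
    """Tags an editor could offer next without violating the content model."""
    return sorted(TRANSITIONS[state_after(children)])

def is_complete(children):
    """True if the element could be closed now and still be valid."""
    return state_after(children) in ACCEPTING
```

    An editor that only ever offers the tags returned by allowed_next cannot produce a sequence that violates this content model, which is the sense in which syntactic violations are avoided rather than detected afterwards.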

    Document based architecture & applications

    Set-at-a-time access to XML through DOM BIBAKFull-Text 225-233
      Hai Chen; Frank Wm. Tompa
    To support the rapid growth of the web and e-commerce, W3C developed DOM as an application programming interface that provides the abstract, logical tree structure of an XML document. In this paper, we propose ordered-set-at-a-time extensions for DOM while maintaining its tightly managed navigational nature. In particular, we define the NodeSequence interface with functions that filter, navigate, and transform sequences of nodes simultaneously. The extended DOM greatly simplifies writing some application code, and it can reduce the communications overhead and response time between a client application and the DOM server to provide applications with more efficient processing. As validation of our proposals, we present application examples that compare the convenience and efficiency of DOM with and without extensions.
    Keywords: DOM, XML, application program interface, navigation, set-at-a-time
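    The flavor of a set-at-a-time interface can be sketched on top of Python's standard xml.dom.minidom. The NodeSequence class below is a hypothetical approximation: its method names (children, filter, map) and the sample document are assumptions, not the interface defined in the paper.

```python
from xml.dom import minidom

class NodeSequence:
    """Hypothetical ordered sequence of DOM nodes processed as a unit."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def children(self, tag=None):
        # Navigate: step from every node to its element children in one call.
        out = []
        for node in self.nodes:
            for child in node.childNodes:
                if child.nodeType == child.ELEMENT_NODE and (
                    tag is None or child.tagName == tag
                ):
                    out.append(child)
        return NodeSequence(out)

    def filter(self, predicate):
        # Filter: keep the nodes satisfying the predicate, preserving order.
        return NodeSequence(n for n in self.nodes if predicate(n))

    def map(self, fn):
        # Transform: apply fn to every node and collect the results.
        return [fn(n) for n in self.nodes]

def text_of(node):
    """Concatenate the direct text children of an element."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)

doc = minidom.parseString(
    "<lib><book year='2003'><title>A</title></book>"
    "<book year='1999'><title>B</title></book></lib>"
)
root = NodeSequence([doc.documentElement])
titles = (
    root.children("book")
    .filter(lambda b: b.getAttribute("year") == "2003")
    .children("title")
    .map(text_of)
)
```

    Each chained call handles a whole sequence, so a client talking to a remote DOM server would issue one request per step instead of one per node, which is where the reduced communication overhead would come from.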
    UpLib: a universal personal digital library system BIBAKFull-Text 234-242
      William C. Janssen; Kris Popat
    We describe the design and use of a personal digital library system, UpLib. The system consists of a full-text indexed repository accessed through an active agent via a Web interface. It is suitable for personal collections comprising tens of thousands of documents (including papers, books, photos, receipts, email, etc.), and provides for ease of document entry and access as well as high levels of security and privacy. Unlike many other systems of the sort, user access to the document collection is assured even if the UpLib system is unavailable. It is "universal" in the sense that documents are canonically represented as projections into the text and image domains, and uses a predominantly visual user interface based on page images. UpLib can thus handle any document format which can be rendered as pages. Provision is made for alternative representations existing alongside the text-domain and image-domain representation, either stored or generated on demand. The system is highly extensible through user scripting, and is intended to be used as a platform for further work in document engineering. UpLib is assembled largely from open-source components (the current exception being the OCR engine, which is proprietary).
    Keywords: document management, document repository, page image, personal digital library, thumbnail interfaces, web interfaces
    Management of trusted citations BIBAKFull-Text 243-245
      Christer Fernstrom
    We discuss how references and citations within a document to particular sources can be verified and guaranteed. When a document refers through a quotation to another document, the reader should be able to verify that the reference is correct and that any quotation correctly represents the original text. The mechanism we describe enables the authentication of such quotations. It consists of: (1) a notation to be used when expressing quotations, which allows a controlled degree of freedom to make alterations from the original text; and (2) different means to check the correctness of such quotations with respect to the cited documents and to quotation rules.
    Keywords: citation, information trust
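    One way to picture checking a quotation against its source is sketched below. It assumes, purely for illustration, that the only permitted alteration is an elision marked "[...]" and that whitespace differences are ignored; the abstract does not specify the actual notation or quotation rules, and the function names are hypothetical.

```python
import re

def normalize(text):
    """Collapse runs of whitespace so line breaks do not affect matching."""
    return re.sub(r"\s+", " ", text).strip()

def quotation_matches(quote, source, elision="[...]"):
    """Check that the quoted fragments occur in the source, in order.

    Assumed rule: text may be omitted only where the elision marker appears.
    """
    src = normalize(source)
    pos = 0
    for fragment in quote.split(elision):
        fragment = normalize(fragment)
        if not fragment:
            continue
        found = src.find(fragment, pos)
        if found < 0:
            return False
        pos = found + len(fragment)
    return True

source = "The mechanism we describe enables the authentication of such quotations."
ok = quotation_matches("The mechanism [...] enables the authentication", source)
bad = quotation_matches("enables the mechanism", source)
```

    A real verifier would also need to authenticate the cited document itself; this sketch covers only the textual comparison.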
    Model driven architecture based XML processing BIBAKFull-Text 246-248
      Ivan Kurtev; Klaas van den Berg
    A number of applications that process XML documents interpret them as objects of application specific classes in a given domain. Generic interfaces such as SAX and DOM leave this interpretation completely to the application. Data binding provides some automation but it is not powerful enough to express complex relations between the application model and the document syntax. Since document schemas play the role of models of documents we can define document processing as model-to-model transformation in the context of Model Driven Architecture (MDA). We define a transformation language for specifying transformations from XML schemas to application models. Transformation execution is an interpretation of a document that results in a set of application objects.
    Keywords: MDA, XML processing, transformations