| Extending XQuery with transformation operators | | BIBAK | Full-Text | 1-8 | |
| Emmanuel Bruno; Jacques Le Maitre; Elisabeth Murisasco | |||
| In this paper, we propose to extend XQuery -- the XML query language -- with
a set of transformation operators which will produce a copy of an XML tree in
which some subtrees will be inserted, replaced or deleted. These operators --
very similar to the ones proposed for updating an XML document -- greatly
simplify the expression of some queries by making it possible to express only
the modified part of a tree instead of reconstructing it wholesale. We compare the
expressivity of XQuery extended in this way with XSLT. Keywords: XML, transformations, XQuery | |||
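As a rough illustration of the kind of operator the paper proposes (the authors define these inside XQuery; the Python sketch below, including the function name `replace`, is ours), the key point is that the operator yields a modified copy of the tree, so a query states only the changed part:

```python
# Hypothetical sketch of a "replace" transformation operator: build a
# modified copy of an XML tree instead of reconstructing it wholesale.
# Standard library only; the operator name is not the paper's.
import copy
import xml.etree.ElementTree as ET

def replace(tree, match, new_subtree):
    """Return a deep copy of `tree` in which every element matching the
    predicate `match` is replaced by a copy of `new_subtree`."""
    result = copy.deepcopy(tree)
    for parent in result.iter():
        for i, child in enumerate(list(parent)):
            if match(child):
                parent.remove(child)
                parent.insert(i, copy.deepcopy(new_subtree))
    return result

doc = ET.fromstring("<book><title>Old</title><chapter/></book>")
updated = replace(doc, lambda e: e.tag == "title",
                  ET.fromstring("<title>New</title>"))
print(ET.tostring(updated).decode())  # title subtree swapped; `doc` untouched
```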
| Lazy XSL transformations | | BIBA | Full-Text | 9-18 | |
| Steffen Schott; Markus L. Noga | |||
| We introduce a lazy XSLT interpreter that provides random access to the
transformation result. This allows efficient pipelining of transformation
sequences. Nodes of the result tree are computed only upon initial access. As
these computations have limited fan-in, sparse output coverage propagates
backwards through the pipeline.
In comparative measurements with traditional eager implementations, our approach is on par for complete coverage and excels as coverage becomes sparser. In contrast to eager evaluation, lazy evaluation also admits infinite intermediate results, thus extending the design space for transformation sequences. To demonstrate that lazy evaluation preserves the semantics of XSLT, we reduce XSLT to the lambda calculus via a functional language. While this is possible for all languages, most imperative languages cannot profit from the confluence of lambda as only one reduction applies at a time. | |||
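A minimal sketch of the demand-driven strategy described (not the authors' interpreter; the stage structure and names are invented): each output node is a memoized thunk, so sparse access at the end of a pipeline touches only the inputs it actually needs, and an unbounded intermediate result is unproblematic:

```python
# Each pipeline stage exposes its result nodes as memoized thunks,
# computed on first access only.
from functools import lru_cache

def lazy_stage(source, transform):
    @lru_cache(maxsize=None)
    def node(i):
        print(f"computing node {i}")      # visible side effect for the demo
        return transform(source(i))
    return node

source = lambda i: i                      # conceptually infinite input
stage1 = lazy_stage(source, lambda x: x * 2)
stage2 = lazy_stage(stage1, lambda x: x + 1)
print(stage2(5))   # computes node 5 of each stage only; prints 11
print(stage2(5))   # memoized: no recomputation
```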
| XPath on left and right sides of rules: toward compact XML tree rewriting through node patterns | | BIBA | Full-Text | 19-25 | |
| Jean-Yves Vion-Dury | |||
| XPath [3, 5] is a powerful and quite successful language able to perform
complex node selection in trees through compact specifications. As such, it
plays a growing role in many areas ranging from schema specifications,
designation and transformation languages to XML query languages. Moreover,
researchers have proposed elegant and tractable formal semantics [8, 9, 10,
14], fostering various works on mathematical properties and theoretical tools
[10, 13, 12, 14].
We propose here a novel way to consider XPath, not only for selecting nodes, but also for tree rewriting using rules. In the rule semantics we explore, XPath expressions (denoted p, p') are used both on the left and on the right side (i.e. rules have the form p ⇒ p'). We believe that this proposal opens new perspectives toward building highly concise XML transformation languages on a widely accepted basis. | |||
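A deliberately simplified sketch of the p ⇒ p' idea (the paper's rule semantics is much richer; here the left path selects nodes and the right side is read only as a replacement tag name):

```python
import xml.etree.ElementTree as ET

def apply_rule(root, lhs_path, rhs_tag):
    # left side: an XPath-like pattern selecting the nodes to rewrite;
    # right side (radically simplified here): the new tag for those nodes
    for node in root.findall(lhs_path):
        node.tag = rhs_tag

doc = ET.fromstring("<p>see <b>this</b> and <b>that</b></p>")
apply_rule(doc, ".//b", "strong")
print(ET.tostring(doc).decode())
# <p>see <strong>this</strong> and <strong>that</strong></p>
```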
| Automating XML document structure transformations | | BIBAK | Full-Text | 26-28 | |
| Paula Leinonen | |||
| This paper describes an implementation for syntax-directed transformation of
XML documents from one structure to another. The system is based on the method
which we have introduced in our earlier work. That work characterized certain
general conditions under which a semi-automatic transformation is possible.
The system semi-automatically generates a transformation between two structures of the same document class. The system takes source and target DTDs as input. A tool lets the user define a label association between the elements of the DTDs. From the two DTDs and from the label association, the system generates the transformation specification semi-automatically. The system also has a tool to help the user select the correct translation if the target DTD admits several possible structures. Implementation of the transformation is based on a top-down tree transducer. From the transformation specification the system produces an XSLT script automatically. Keywords: XML, XSLT, document structure transformation | |||
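For readers unfamiliar with the device, a top-down tree transducer can be sketched in a few lines (the rule table below is invented, not taken from the paper): rules map a (state, label) pair to an output label plus the states in which to process the children:

```python
def transduce(state, node, rules):
    label, children = node                      # node = (label, [children])
    out_label, child_states = rules[(state, label)]
    return (out_label,
            [transduce(s, c, rules) for s, c in zip(child_states, children)])

rules = {
    ("q0", "article"): ("doc",  ["q1", "q1"]),
    ("q1", "sec"):     ("part", []),
}
src = ("article", [("sec", []), ("sec", [])])
print(transduce("q0", src, rules))   # ('doc', [('part', []), ('part', [])])
```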
| XML five years on: a review of the achievements so far and the challenges ahead | | BIBAK | Full-Text | 29-31 | |
| Michael H. Kay | |||
| This is an extended abstract of the talk given by Michael Kay in the keynote
address of the DocEng2003 symposium. Keywords: XML, XQuery, XSLT | |||
| Using SMIL to encode interactive, peer-level multimedia annotations | | BIBAK | Full-Text | 32-41 | |
| Dick C. A. Bulterman | |||
| This paper discusses applying facilities in SMIL 2.0 to the problem of
annotating multimedia presentations. Rather than viewing annotations as
collections of (abstract) meta-information for use in indexing, retrieval or
semantic processing, we view annotations as a set of peer-level content with
temporal and spatial relationships that are important in presenting a coherent
story to a user. The composite nature of the collection of media is essential
to the nature of peer-level annotations: you would typically annotate a single
media item quite differently from that same media item in the context of a total
presentation.
This paper focuses on the document engineering aspects of the annotation system. We do not consider any particular user interface for creating the annotations or any back-end storage architecture to save/search the annotations. Instead, we focus on how annotations can be represented within a common document architecture and we consider means of providing document facilities that meet the requirements of our user model. We present our work in the context of a medical patient dossier example. Keywords: SMIL, annotation, horses, medical systems | |||
| Structuring interactive TV documents | | BIBAK | Full-Text | 42-51 | |
| Rudinei Goularte; Edson dos Santos Moreira; Maria da Graça C. Pimentel | |||
Interactive video technology is meant to support user interaction with video:
selecting objects in a scene, navigating among video segments, and accessing
text-based metadata. Interactive TV is one of the most important applications
of this area, which has required the development of standards, techniques and
tools, such as MPEG-4 and MPEG-7, to create, describe, deliver and
present interactive content.
In this scenario, the structure and organization of documents containing multimedia metadata play an important role. However, the structuring and organization of Interactive TV documents have not been properly explored during the development of advanced Interactive TV services. This work presents a model to structure and organize documents describing Interactive TV programs and their related media objects, as well as the links between them. This model supports the representation of contextual information and makes it possible to use relevant metadata in order to implement advanced services like object-based searches, in-movie (scenes, frames, in-frame regions) navigation, and personalization. To demonstrate the functionalities of our model, we have developed an application which uses an Interactive TV program's document descriptions to present information about in-frame video objects. Keywords: MPEG-7, XLink, interactive TV, media descriptions, metadata | |||
| Thematic alignment of recorded speech with documents | | BIBAK | Full-Text | 52-54 | |
| Dalila Mekhaldi; Denis Lalanne; Rolf Ingold | |||
| We present in this article a method for detecting similarity links between
documents' content and speech recordings' content. This process, further called
thematic alignment, is a novel research area that combines both document and
speech analysis. This alignment will a) provide temporal indexes to documents,
which are non-temporal data, and b) help discover hidden thematic
structures. This article first introduces a multi-layered document structure
and briefly reviews the traditional speech structure. It then presents a
simple similarity measure and various multi-level alignments between
those two structures. Later, the meeting corpus is presented, along with an
evaluation of the implemented alignments. Finally, we present our future work
on multi-alignments and thematic structure discovery. Keywords: document indexing and retrieval, meeting recordings, multi-layered
structure, multimodal analysis, thematic alignment | |||
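One plausible reading of the "simple similarity measure" (the abstract does not give its exact formula; the measure, the data, and the block names below are illustrative): cosine similarity over bags of words, aligning each speech segment to the most similar document block:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc_blocks = {"intro": "project budget overview",
              "plan":  "schedule milestones budget"}
speech_segments = ["we discuss the budget overview today",
                   "milestones and the schedule"]

for seg in speech_segments:      # align each segment to its best block
    sv = Counter(seg.split())
    best = max(doc_blocks,
               key=lambda k: cosine(sv, Counter(doc_blocks[k].split())))
    print(f"{seg!r} -> {best}")
```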
| Digitizing cultural heritage manuscripts: the Bovary project | | BIBAK | Full-Text | 55-57 | |
| Stéphane Nicolas; Thierry Paquet; Laurent Heutte | |||
In this paper we describe the Bovary Project, a project to digitize the
manuscripts of the first great work of the famous French writer Gustave
Flaubert. The project began at the end of 2002 and should end in 2006 by
providing online access to a hypertextual edition of the set of drafts of
"Madame Bovary". We develop the global context of this project, its main
objectives, the first studies, and the directions envisaged for carrying out the project. Keywords: digital libraries, document image analysis, genetic edition, hypermedia,
indexation | |||
| Creating reusable well-structured PDF as a sequence of component object graphic (COG) elements | | BIBAK | Full-Text | 58-67 | |
| Steven R. Bagley; David F. Brailsford; Matthew R. B. Hardy | |||
| Portable Document Format (PDF) is a page-oriented, graphically rich format
based on PostScript semantics and it is also the format interpreted by the
Adobe Acrobat viewers. Although each of the pages in a PDF document is an
independent graphic object this property does not necessarily extend to the
components (headings, diagrams, paragraphs etc.) within a page. This, in turn,
makes the manipulation and extraction of graphic objects on a PDF page into a
very difficult and uncertain process.
The work described here investigates the advantages of a model wherein PDF pages are created from assemblies of COGs (Component Object Graphics) each with a clearly defined graphic state. The relative positioning of COGs on a PDF page is determined by appropriate 'spacer' objects and a traversal of the tree of COGs and spacers determines the rendering order. The enhanced revisability of PDF documents within the COG model is discussed, together with the application of the model in those contexts which require easy revisability coupled with the ability to maintain and amend PDF document structure. Keywords: PDF, form Xobjects, graphic objects, tagged PDF | |||
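A rough sketch of the model's core mechanics (class and field names are this sketch's, not the paper's): leaves are self-contained graphic objects, interior spacer nodes only offset their children, and a traversal of the tree fixes the rendering order:

```python
class COG:                                   # a self-contained graphic object
    def __init__(self, name):
        self.name = name

class Spacer:                                # positions its children relatively
    def __init__(self, dx, dy, children):
        self.dx, self.dy, self.children = dx, dy, children

def render(node, x=0.0, y=0.0):
    if isinstance(node, COG):
        print(f"draw {node.name} at ({x}, {y})")
    else:
        for child in node.children:          # traversal order = render order
            render(child, x + node.dx, y + node.dy)

page = Spacer(0, 0, [Spacer(72, 720, [COG("heading")]),
                     Spacer(72, 640, [COG("paragraph"), COG("figure")])])
render(page)
```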
| Creating personalized documents: an optimization approach | | BIBAK | Full-Text | 68-77 | |
| Lisa Purvis; Steven Harrington; Barry O'Sullivan; Eugene C. Freuder | |||
| The digital networked world is enabling and requiring a new emphasis on
personalized document creation. The new, more dynamic digital environment
demands tools that can reproduce both the contents and the layout
automatically, tailored to personal needs and transformed for the presentation
device, and can enable novices to easily create such documents. In order to
achieve such automated document assembly and transformation, we have formalized
custom document creation as a multiobjective optimization problem, and use a
genetic algorithm to assemble and transform compound personalized documents.
While we have found that such an automated process for document creation opens
new possibilities and new workflows, we have also found several areas where
further research would enable the approach to be more broadly and practically
applied. This paper reviews the current system and outlines several areas where
future research will broaden its current capabilities. Keywords: automated layout, constrained optimization, constraint-based reasoning,
document design, genetic algorithm, multiobjective optimization | |||
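A toy sketch of the formulation (objectives, weights, and encoding are invented; a weighted sum stands in for genuine multiobjective handling): a document variant is a chromosome, and a genetic loop searches for the best-scoring assembly:

```python
import random

TARGET_PAGES, TARGET_IMAGES = 4, 3           # hypothetical personalization goals

def fitness(chrom):                          # chrom = (pages, images, font_size)
    pages, images, font = chrom
    return -(abs(pages - TARGET_PAGES)       # fit the requested length
             + abs(images - TARGET_IMAGES)   # match content preferences
             + 0.5 * abs(font - 11))         # readability on the target device

def mutate(chrom):
    genes = list(chrom)
    genes[random.randrange(3)] += random.choice([-1, 1])
    return tuple(genes)

population = [(random.randint(1, 9),) * 3 for _ in range(20)]
for _ in range(100):
    population = sorted(population, key=fitness, reverse=True)[:10]   # select
    population += [mutate(random.choice(population)) for _ in range(10)]
print(max(population, key=fitness))          # best variant found, e.g. (4, 3, 11)
```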
| Inter and intra media-object QoS provisioning in adaptive formatters | | BIBAK | Full-Text | 78-87 | |
| Rogério Ferreira Rodrigues; Luiz Fernando Gomes Soares | |||
| The development of hypermedia/multimedia systems requires the implementation
of an element, usually known as formatter, which is in charge of receiving the
specification of a document (structure, media-object relationships and
presentation descriptions) and controlling its presentation. The process of
controlling and maintaining the presentation of a hyperdocument with an output
of acceptable quality is a QoS orchestration problem, which needs to be treated
by formatters in two related levels: the inter media-object and the intra
media-object orchestration. This paper discusses the issues associated
with QoS provisioning in hypermedia systems, focusing on the design and
implementation of formatters. We propose a QoS framework for hypermedia
formatters based on a generic quality of service model for communication
environments. The paper also comments on the experience gained in instantiating
the framework for the HyperProp system formatter. Keywords: HyperProp system, hypermedia formatter, media synchronization, quality of
service | |||
| Using SVG as the rendering model for structured and graphically complex web material | | BIBAK | Full-Text | 88-91 | |
| Julius C. Mong; David F. Brailsford | |||
| This paper reports some experiments in using SVG (Scalable Vector Graphics),
rather than the browser default of (X)HTML/CSS, as a potential Web-based
rendering technology, in an attempt to create an approach that integrates the
structural and display aspects of a Web document in a single XML-compliant
envelope.
Although the syntax of SVG is XML-based, the semantics of the primitive graphic operations more closely resemble those of page description languages such as PostScript or PDF. The principal usage of SVG, so far, is for inserting complex graphic material into Web pages that are predominantly controlled via (X)HTML and CSS. The conversion of structured and unstructured PDF into SVG is discussed. It is found that unstructured PDF converts into pages of SVG with few problems, but difficulties arise when one attempts to map the structural components of a Tagged PDF into an XML skeleton underlying the corresponding SVG. These difficulties are not fundamentally syntactic; they arise largely because browsers are innately bound to (X)HTML/CSS as their default rendering model. Some suggestions are made for ways in which SVG could be more fully integrated into browser functionality, with the possibility that future browsers might be able to use SVG as their default rendering paradigm. Keywords: PDF, SVG, XML, vector graphics | |||
| Improving formatting documents by coupling formatting systems | | BIBAK | Full-Text | 92-94 | |
| Fateh Boulmaiz; Cécile Roisin; Frédéric Bes | |||
In this paper, we present a framework for coupling an existing formatting
system such as SMIL [7] or Madeus [13] with the XEF formatting control system [10].
This framework allows the coupling process to be performed at two levels: 1)
the language level, which is concerned with how to link the control features of
XEF with the elements of an existing formatting system, and 2) the formatter
level, which deals with the creation of a new formatter by formatter
composition.
The overall objective is to provide more powerful and flexible formatting services to cover new needs such as adaptive and/or generated presentations. Keywords: language coupling, presentation language, software coupling | |||
| INFTY: an integrated OCR system for mathematical documents | | BIBAK | Full-Text | 95-104 | |
| Masakazu Suzuki; Fumikazu Tamari; Ryoji Fukuda; Seiichi Uchida; Toshihiro Kanahori | |||
| An integrated OCR system for mathematical documents, called INFTY, is
presented. INFTY consists of four procedures, i.e., layout analysis, character
recognition, structure analysis of mathematical expressions, and manual error
correction. In those procedures, several novel techniques are utilized for
better recognition performance. Experimental results on about 500 pages of
mathematical documents showed high character recognition rates on both
mathematical expressions and ordinary texts, and sufficient performance on the
structure analysis of the mathematical expressions. Keywords: character and symbol recognition, mathematical OCR, structure analysis of
mathematical expressions | |||
| Information encoding into and decoding from dot texture for active forms | | BIBAK | Full-Text | 105-114 | |
| Bilan Zhu; Masaki Nakagawa | |||
We describe here information encoding and decoding methods applied to dot
texture for active forms. We employ a dot texture made of tiny dots, which
appears gray, to print various forms. This facilitates the separation of
handwriting from its input frame even in monochrome printing/reading
environments. It also lets a form determine how its filled-in handwriting is
processed, according to the information embedded in the dot texture. The
embedded information results in an improved recognition rate of handwriting,
and allows the form processing to be directed by the form itself rather than by
the form-reading machine. Thus, the form-reading machine becomes a
general-purpose machine, allowing different forms input into it to be
processed differently as specified by each form. We compare various dot shapes
and information encoding/decoding methods for those shapes. Then, we present
how to locate input frames, separate handwriting from input frames and segment
handwriting into characters. We also present preliminary evaluation of the
described methods. Keywords: form processing, form recognition, labeling, morphology, paper-based UI | |||
| Effective text extraction and recognition for WWW images | | BIBAK | Full-Text | 115-117 | |
| Jun Sun; Zhulong Wang; Hao Yu; Fumihito Nishino; Yutaka Katsuyama; Satoshi Naoi | |||
| Images play a very important role in web content delivery. Many WWW images
contain text information that can be used for web indexing and searching. A new
text extraction and recognition algorithm is proposed in this paper. The
character strokes in the image are first extracted by color clustering and
connected component analysis. A novel stroke verification algorithm is used to
effectively remove non-character strokes. The verified strokes are then used to
build the binary text line image, which is segmented and recognized by dynamic
programming. Since text in a WWW image usually has a close relationship with
the webpage content, approximate string matching is used to revise the recognition
result by matching the content in the webpage against the content in the image.
This effective post-processing not only improves the recognition performance,
but can also be used in other applications, such as matching an image to the
webpage paragraph it belongs to. Keywords: approximate matching, text extraction, text recognition | |||
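One plausible reading of the post-processing step (difflib stands in for the paper's approximate string matching algorithm; the phrases and threshold are invented): match the noisy OCR string against candidate phrases from the surrounding webpage and substitute the closest one:

```python
import difflib

def revise(ocr_text, page_phrases, cutoff=0.6):
    hits = difflib.get_close_matches(ocr_text, page_phrases, n=1, cutoff=cutoff)
    return hits[0] if hits else ocr_text   # keep the OCR result if nothing is close

page_phrases = ["Summer Sale 2003", "Contact Us", "Free Shipping"]
print(revise("5ummer 5ale 2OO3", page_phrases))   # -> "Summer Sale 2003"
```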
| Accuracy improvement of automatic text classification based on feature transformation | | BIBAK | Full-Text | 118-120 | |
| Guowei Zu; Wataru Ohyama; Tetsushi Wakabayashi; Fumitaka Kimura | |||
| In this paper, we describe a comparative study on techniques of feature
transformation and classification to improve the accuracy of automatic text
classification. The normalization to the relative word frequency, the principal
component analysis (K-L transformation) and the power transformation were
applied to the feature vectors, which were classified by the Euclidean
distance, the linear discriminant function, the projection distance, the
modified projection distance and the SVM. Keywords: automatic text classification, principal component analysis, variable
transformation | |||
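A compact sketch of one pipeline from the comparison (the data, the 0.5 exponent, and the nearest-mean variant of the classifier are illustrative choices): relative-frequency normalization, a power transformation, PCA (the K-L transformation), then Euclidean-distance classification:

```python
import numpy as np

def transform(counts, power=0.5):
    rel = counts / counts.sum(axis=1, keepdims=True)   # relative word frequency
    return rel ** power                                # power transformation

def pca(X, k):
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)   # K-L transformation
    return (X - mu) @ Vt[:k].T, mu, Vt[:k]

train  = np.array([[9., 1, 0], [8, 2, 0], [0, 1, 9], [1, 0, 9]])
labels = np.array([0, 0, 1, 1])
Z, mu, P = pca(transform(train), k=2)
means = np.stack([Z[labels == c].mean(axis=0) for c in (0, 1)])

test = transform(np.array([[7., 2, 1]]))
z = (test - mu) @ P.T
print(np.argmin(np.linalg.norm(means - z, axis=1)))    # Euclidean distance -> class 0
```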
| Context representation, transformation and comparison for ad hoc product data exchange | | BIBAK | Full-Text | 121-130 | |
| Jingzhi Guo; Chengzheng Sun | |||
| Product data exchange is the precondition of business interoperation between
Web-based firms. However, millions of small and medium sized enterprises (SMEs)
encode their Web product data in ad hoc formats for electronic product
catalogues. This prevents product data exchange between business partners for
business interoperation. To solve this problem, this paper proposes a novel
concept-centric catalogue engineering approach for representing, transforming
and comparing semantic contexts in ad hoc product data exchange. In this
approach, concepts and contexts of product data are specified along the data
exchange chain and are mapped onto several novel XML product map (XPM)
documents by exploiting XML's hierarchical structure and syntax. The designed
XPM overcomes the semantic limitations of XML markup and achieves
semantic interoperation for ad hoc product data exchange. Keywords: XML product map, XPM, ad hoc product data exchange, concept, context
comparison, context representation, context transformation, electronic
commerce, electronic product catalogue, product data integration, semantics | |||
| Preservation of digital publications: an OAIS extension and implementation | | BIBAK | Full-Text | 131-139 | |
| Peter Rödig; Uwe M. Borghoff; Jan Scheffczyk; Lothar Schmitz | |||
Over the last decades, the number of digital documents has increased
exponentially. Nevertheless, traditional document engineering methods are
still applied. Even worse, long-term preservation issues have been neglected in
standard document life cycle implementations.
Our digital (cultural) heritage is, therefore, highly endangered by the silent obsolescence of data formats, software and hardware. Severe losses of information have already happened. It is high time to implement concrete solutions. Fortunately, numerous institutions already target these issues. Moreover, with the OAIS reference model a rich standardized conceptual framework is available, which already serves as an implementation basis. This paper discusses an extension to the OAIS reference model and illustrates a prototype implementation of a document life cycle that is enriched by functions for long-term preservation. More precisely, this paper aims to provide first solutions to the following three problem areas: 1. Detachment: OAIS defines no functions for the process of detaching digital documents prior to the ingest function. This detachment function is modeled in great detail and implemented for the provision of the so-called OAIS submission information packages (SIP). 2. DBMS: OAIS defines a very complex functionality. We show how a standard database management system (DBMS) can support a wide variety of required functionalities in an integrated and homogeneous way. Among others, OAIS's data management, archival storage, and access are supported. 3. Metadata: So far, OAIS does not cover any aspects of metadata generation. Here, we briefly discuss the (semi-)automatic generation of a metadata set. In order to evaluate the feasibility of our approach, we built a first prototype. We carried out our experiments in close cooperation with the Bavarian State Library, Munich, which is engaged in numerous international initiatives dealing with the problem of long-term preservation. Our University Library also supported us by delivering a representative test set of digital publications. We conclude our paper by presenting some lessons learned from our conceptual work and from our real-world experiments. Keywords: OAIS, archival systems, database management, detachment of digital
publications, digital libraries, document management, long-term preservation,
metadata | |||
| Consistent document engineering: formalizing type-safe consistency rules for heterogeneous repositories | | BIBAK | Full-Text | 140-149 | |
| Jan Scheffczyk; Uwe M. Borghoff; Peter Rödig; Lothar Schmitz | |||
| When a group of authors collaboratively edits interrelated documents,
consistency problems occur almost immediately. Current document management
systems (DMS) provide useful mechanisms such as document locking and version
control, but often lack consistency management facilities.
If at all, consistency is "defined" via informal guidelines, which do not support automatic consistency checks. In this paper, we propose to use explicit formal consistency rules for heterogeneous repositories that are managed by traditional DMS. Rules are formalized in a variant of first-order temporal logic. Functions and predicates, implemented in a full programming language, provide complex (even higher-order) functionality. A static type system supports rule formalization, where types also define (formal) document models. In the presence of types, the challenge is to smoothly combine a first-order logic with a useful type system including subtyping. In implementing a tolerant view of consistency, we do not expect that repositories satisfy consistency rules. Instead, a novel semantics precisely pinpoints inconsistent document parts and indicates when, where, and why a repository is inconsistent. Our major contributions are (1) the use of explicit formal rules giving a precise (and still comprehensible) notion of consistency, (2) a static type system securing the formalization process, (3) a novel semantics pinpointing inconsistent document (parts) precisely, and (4) a design of how to automatically check consistency for document engineering projects that use existing DMS. We have implemented a prototype of a consistency checker. Applied to real world content, it shows that our contributions can significantly improve consistency in document engineering processes. Keywords: consistency in document engineering, document management, temporal logic | |||
| A ground-truthing engine for proofsetting, publishing, re-purposing and quality assurance | | BIBAK | Full-Text | 150-152 | |
| Steven J. Simske; Margaret Sturgill | |||
| We present design strategies, implementation preferences and throughput
results obtained in deploying a UI-based ground truthing engine as the last
step in the quality assurance (QA) for the conversion of a large out-of-print
book collection into digital form. A series of automated QA steps were first
performed on the document. Five distinct zoning analysis options were deployed,
and the PDF output thus generated was used to regenerate TIFF files for
comparison to the originals. Regenerated TIFFs failing automated QA or a
separate visual QA were tagged for ground truthing. Less than 3% of the pages
in a 1.2×10⁶-page corpus required ground truthing, resulting in a
throughput rate of "fully-proofed" pages of 2×10⁵ pages/man-week. Among
the design advantages crucial for this throughput rate was the use of the
identical zoning engine for the original production workflow and for the ground
truthing engine. Keywords: layout, print-on-demand, region management, templates | |||
| Structured multimedia document classification | | BIBAK | Full-Text | 153-160 | |
| Ludovic Denoyer; Jean-Noël Vittaut; Patrick Gallinari; Sylvie Brunessaux; Stephan Brunessaux | |||
| We propose a new statistical model for the classification of structured
documents and consider its use for multimedia document classification. Its main
originality is its ability to simultaneously take into account the structural
and the content information present in a structured document, and also to cope
with different types of content (text, image, etc.). We present experiments on
the classification of multilingual pornographic HTML pages using text and image
data. The system accurately classifies porn sites in 8 European languages.
This corpus has been developed by the EADS company in the context of a large Web
site filtering application. Keywords: Bayesian networks, categorization, generative model, multimedia document,
statistical machine, structured document, web page filtering | |||
| Methods for the semantic analysis of document markup | | BIBAK | Full-Text | 161-170 | |
| Petra Saskia Bayerl; Harald Lüngen; Daniela Goecke; Andreas Witt; Daniel Naber | |||
| We present an approach on how to investigate what kind of semantic
information is regularly associated with the structural markup of scientific
articles. This approach addresses the need for an explicit formal description
of the semantics of text-oriented XML documents. The domain of our
investigation is a corpus of scientific articles from psychology and
linguistics, drawn from English- and German-language journals available online.
For our analyses, we provide XML-markup representing two kinds of semantic levels: the thematic level (i.e., topics in the text world that the article is about) and the functional or rhetorical level. Our hypothesis is that these semantic levels correlate with the articles' document structure also represented in XML. Articles have been annotated with the appropriate information. Each of the three informational levels is modelled in a separate XML document, since in our domain, the different description levels might conflict so that it is impossible to model them within a single XML document. For comparing and mining the resulting multi-layered XML annotations of one article, a Prolog-based approach is used. It focusses on the comparison of XML markup that is distributed among different documents. Prolog predicates have been defined for inferring relations between levels of information that are modelled in separate XML documents. We demonstrate how the Prolog tool is applied in our corpus analyses. Keywords: XML, information extraction, prolog, semantic analysis | |||
| Interactive information retrieval from XML documents represented by attribute grammars | | BIBAK | Full-Text | 171-174 | |
| Alda Lopes Gançarski; Pedro Rangel Henriques | |||
In this paper, we describe a system for interactively accessing XML documents
represented by attribute grammars. The system has two main components: (1) the
query editor/processor, where the user interactively specifies his needs; (2)
the document analyzer, which performs the query-evaluation operations that
access the documents directly. The interactive construction of queries is
based on the manipulation of intermediate results during query construction and
evaluation. We believe this helps the user achieve the desired result. Keywords: XML representation, interactive retrieval | |||
| Two diet plans for fat PDF | | BIBAK | Full-Text | 175-184 | |
| Thomas A. Phelps; Robert Wilensky | |||
As Adobe's Portable Document Format has exploded in popularity, so too has
the number of PDF generators, and predictably the quality of generated PDF varies
considerably. This paper surveys a range of PDF optimizations for space, and
reports the results of a tool that can postprocess existing PDFs to reduce file
sizes by 20 to 70% for large classes of PDFs. (Further reduction can often be
obtained by recoding images to lower resolutions or with newer compression
methods such as JBIG2 or JPEG2000, but those operations are independent of PDF
per se and not a component of the results reported here.) A new PDF storage
format called "Compact PDF" is introduced that achieves for many classes of PDF
an additional reduction of 30 to 60% beyond what is possible in the latest PDF
specification (version 1.5, corresponding to Acrobat 6); for example, the PDF
1.5 Reference manual shrinks from 12.2MB down to 4.2MB. The changes required by
Compact PDF to the PDF specification and to PDF readers are easily understood
and straightforward to implement. Keywords: PDF, compact PDF, compression, multivalent | |||
| Compression of scan-digitized Indian language printed text: a soft pattern matching technique | | BIBAK | Full-Text | 185-192 | |
| U. Garain; S. Debnath; A. Mandal; B. B. Chaudhuri | |||
| In this paper, a new compression scheme is presented for Indian Language
(IL) textual document images. Since OCR technology for IL scripts is not yet
mature, transcribing these documents into the digital domain needs new
techniques that achieve a high degree of compression, as well as suitable methods
to perform various operations like document indexing, retrieval, etc. The
proposed method is essentially based on a symbolic compression technique, which
has been realized with an efficient segmentation-based clustering approach. A
soft pattern-matching technique has been implemented using two different
feature sets that cooperate with each other to build an efficient prototype
library. Experiments have been done on documents printed in the Devnagari (Hindi)
and Bangla scripts, the two most widely used scripts in the Indian subcontinent. Test
results show that the proposed technique outperforms several standard methods,
like CCITT Group-4 and JBIG, that are frequently used for compression of
document images. Keywords: data compression, Indian language, pattern matching, textual image | |||
| Semantically-based text authoring and the concurrent documentation of experimental protocols | | BIBAK | Full-Text | 193-202 | |
| Caroline Brun; Marc Dymetman; Eric Fanchon; Stanislas Lhomme; Sylvain Pogodalla | |||
| We describe an application of controlled text authoring to biological
experiment reports. This work is the result of a collaboration between a
computational linguistics team and biologists specializing in protein
production studies. We start by presenting our semantically-controlled
authoring system, MDA (Multilingual Document Authoring), an expressive model
for specifying well-formedness conditions both at the level of the document
content and at the level of its textual realization. We then discuss the
practical needs of experiment documentation in bioengineering. We go on to
describe the prototype we have developed for this application domain, along
with a preliminary evaluation. Finally, we discuss a promising new idea that emerged
from this experimentation but seems of wider applicability: how the
authoring system represents a step towards integrating the formalization of an
experimental protocol with its associated textual documentation. Keywords: XML, XML-schemas, concurrent documentation, constrained document
specification, document authoring, experimental protocols, logic programming,
natural language generation | |||
| A structural adviser for the XML document authoring | | BIBAK | Full-Text | 203-211 | |
| Boris Chidlovskii | |||
Since the XML format became a de facto standard for structured documents,
IT research and industry have developed a number of XML editors to help
users produce structured documents in XML format. However, the manual
generation of structured documents in XML format remains a tedious and
time-consuming process because of the excessive verbosity and length of XML
code. In this paper, we design a structural adviser for XML document
authoring. The adviser intervenes at any step of the authoring process to
suggest the tag or entire tree-like pattern the user is most likely to use
next. The adviser's suggestions are based on finding analogies between the currently
edited fragment and sample data, which are either previously generated documents in
the collection or the history of the current authoring session. The adviser is
beneficial in cases when no schema is provided for the XML documents, or when the schema
associated with the document is too general and the sample data contain specific
patterns not captured in the schema. We design the adviser architecture and
develop a method for efficient indexing and retrieval of optimal suggestions at
any step of the document authoring. Keywords: XML markup, data mining, structural pattern, suggestion | |||
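A much-reduced sketch of the adviser's core idea (the real system retrieves whole tree-like patterns, not single tags; the corpus and function names here are invented): mine previously authored documents for which child tag tends to follow the current one, and suggest the most frequent continuation:

```python
from collections import Counter, defaultdict

def train(docs):                          # doc = (parent tag, [child tags])
    follow = defaultdict(Counter)
    for parent, children in docs:
        for a, b in zip(children, children[1:]):
            follow[(parent, a)][b] += 1
    return follow

def suggest(follow, parent, last_tag):
    counts = follow.get((parent, last_tag))
    return counts.most_common(1)[0][0] if counts else None

corpus = [("article", ["title", "author", "abstract", "section"]),
          ("article", ["title", "author", "section"])]
model = train(corpus)
print(suggest(model, "article", "author"))   # -> "abstract"
```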
| User-directed analysis of scanned images | | BIBAK | Full-Text | 212-221 | |
| Steven J. Simske; Jordi Arnabat | |||
| Digital capture (scanning in all its forms, and digital photography/video
recording), in providing virtually free temporary memory of captured
information, allows users to "over-gather" information during capture, and then
to discard unwanted material later. For cameras and video recorders, such
editing largely consists of discarding images or frames in their entirety. For
scanners (and high-resolution camera/video), such editing benefits from a
preview capability that provides quick and reliable user-interface tools for
selecting, filtering and saving specific portions of the input. Appropriate
preview user interface (UI) tools ease the accessing, editing and dispatch to
desired destination (archive, application, webpage, etc.) of captured
information (text, tables, drawings, photos, etc.). In this paper, we present
several different means for the user-directed "rapid capture" of portions of a
scanned image. Specifically, we review past, present and future preview-based
UI tools that provide the user with efficient and accurate means of capture. The
bases of these tools, as described herein, are user-directed zoning analysis,
known as "click and select", which incorporates a bottom-up zoning analysis
engine; and statistics-based region classification, which allows rapid
reconfiguration of region identification and clustering. We conclude with our
view of the future of UI-directed capture. Keywords: bottom-up analysis, classification, click and select, preview display,
scanning, segmentation, user interface, zoning | |||
| Handling syntactic constraints in a DTD-compliant XML editor | | BIBAK | Full-Text | 222-224 | |
| Y. S. Kuo; Jaspher Wang; N. C. Shih | |||
By exploiting the theories of automata and graphs, we propose algorithms and
a process for editing valid XML documents [4, 5]. The editing process avoids
syntactic violations altogether, thus freeing the user from any syntactic
concerns. Based on the proposed algorithms and process, we build an XML editor
with forms as its user interface. Keywords: XML editor, automata theory, regular expression | |||
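An illustrative fragment of the automata idea (the DFA below is written by hand for the content model (title, author+, section*); the cited algorithms would derive it from the DTD): at any editing point the editor offers only tags with an outgoing transition, so a syntactic violation can never be entered:

```python
dfa = {                                   # state -> {tag: next state}
    0: {"title": 1},
    1: {"author": 2},
    2: {"author": 2, "section": 3},
    3: {"section": 3},
}
accepting = {2, 3}

def valid_next(children):
    state = 0
    for tag in children:                  # replay the children typed so far
        state = dfa[state][tag]
    # tags the editor may offer here, and whether the element may be closed
    return sorted(dfa[state]), state in accepting

print(valid_next(["title", "author"]))    # -> (['author', 'section'], True)
```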
| Set-at-a-time access to XML through DOM | | BIBAK | Full-Text | 225-233 | |
| Hai Chen; Frank Wm. Tompa | |||
| To support the rapid growth of the web and e-commerce, W3C developed DOM as
an application programming interface that provides the abstract, logical tree
structure of an XML document. In this paper, we propose ordered-set-at-a-time
extensions for DOM while maintaining its tightly managed navigational nature.
In particular, we define the NodeSequence interface with functions that filter,
navigate, and transform sequences of nodes simultaneously. The extended DOM
greatly simplifies writing some application code, and it can reduce the
communications overhead and response time between a client application and the
DOM server to provide applications with more efficient processing. As
validation of our proposals, we present application examples that compare the
convenience and efficiency of DOM with and without extensions. Keywords: DOM, XML, application program interface, navigation, set-at-a-time | |||
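A speculative Python rendering of the NodeSequence idea (the paper defines the interface as a DOM extension; the method names here are invented for this sketch): operations act on a whole ordered sequence of DOM nodes at once instead of node-at-a-time loops:

```python
from xml.dom.minidom import parseString

class NodeSequence:
    def __init__(self, nodes):
        self.nodes = list(nodes)
    def children(self):                    # navigate all nodes simultaneously
        return NodeSequence(c for n in self.nodes for c in n.childNodes
                            if c.nodeType == c.ELEMENT_NODE)
    def filter(self, pred):                # keep matching nodes, preserving order
        return NodeSequence(n for n in self.nodes if pred(n))
    def tags(self):
        return [n.tagName for n in self.nodes]

doc = parseString("<catalog><item id='1'/><note/><item id='2'/></catalog>")
seq = NodeSequence([doc.documentElement]).children()
print(seq.filter(lambda n: n.tagName == "item").tags())   # ['item', 'item']
```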
| UpLib: a universal personal digital library system | | BIBAK | Full-Text | 234-242 | |
| William C. Janssen; Kris Popat | |||
| We describe the design and use of a personal digital library system, UpLib.
The system consists of a full-text indexed repository accessed through an
active agent via a Web interface. It is suitable for personal collections
comprising tens of thousands of documents (including papers, books, photos,
receipts, email, etc.), and provides for ease of document entry and access as
well as high levels of security and privacy. Unlike many other systems of the
sort, user access to the document collection is assured even if the UpLib
system is unavailable. The system is "universal" in the sense that documents are
canonically represented as projections into the text and image domains, and
it uses a predominantly visual user interface based on page images. UpLib can thus
handle any document format which can be rendered as pages. Provision is made
for alternative representations existing alongside the text-domain and
image-domain representation, either stored or generated on demand. The system
is highly extensible through user scripting, and is intended to be used as a
platform for further work in document engineering. UpLib is assembled largely
from open-source components (the current exception being the OCR engine, which
is proprietary). Keywords: document management, document repository, page image, personal digital
library, thumbnail interfaces, web interfaces | |||
| Management of trusted citations | | BIBAK | Full-Text | 243-245 | |
| Christer Fernstrom | |||
| We discuss how references and citations within a document to particular
sources can be verified and guaranteed. When a document refers through a
quotation to another document, the reader should be able to verify that the
reference is correct and that any quotation correctly represents the original
text. The mechanism we describe enables the authentication of such quotations.
It consists of: (1) a notation to be used when expressing quotations, which
allows a controlled degree of freedom to make alterations from the original
text; and (2) different means to check the correctness of such quotations with respect to the cited documents and to quotation rules. Keywords: citation, information trust | |||
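One simple way to make a quotation checkable (the paper's notation and checking rules are richer; the digest scheme below is our assumption): the citing document carries a digest of the quoted span, and a checker recomputes it against the cited source:

```python
import hashlib

def quote_token(text):
    # digest over whitespace-normalized text, so line breaks don't matter
    return hashlib.sha256(" ".join(text.split()).encode()).hexdigest()[:16]

def verify(quotation, token, source):
    # the quotation must carry a valid token and actually occur in the source
    return quote_token(quotation) == token and quotation in source

source = "Document engineering is the study of document life cycles."
q = "the study of document life cycles"
tok = quote_token(q)
print(verify(q, tok, source))                     # True
print(verify("a different claim", tok, source))   # False
```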
| Model driven architecture based XML processing | | BIBAK | Full-Text | 246-248 | |
| Ivan Kurtev; Klaas van den Berg | |||
| A number of applications that process XML documents interpret them as
objects of application specific classes in a given domain. Generic interfaces
such as SAX and DOM leave this interpretation completely to the application.
Data binding provides some automation but it is not powerful enough to express
complex relations between the application model and the document syntax. Since
document schemas play the role of models of documents, we can define document
processing as a model-to-model transformation in the context of Model Driven
Architecture (MDA). We define a transformation language for specifying
transformations from XML schemas to application models. Transformation
execution is an interpretation of a document that results in a set of
application objects. Keywords: MDA, XML processing, transformations | |||