| A three-way merge for XML documents | | BIBAK | Full-Text | 1-10 | |
| Tancred Lindholm | |||
| Three-way merging is a technique that may be employed for reintegrating
changes to a document in cases where multiple independently modified copies
have been made. While tools for three-way merge of ASCII text files exist in
the form of the ubiquitous diff and patch tools, these are of limited
applicability to XML documents.
| We present a method for three-way merging of XML which is targeted at merging XML formats that model human-authored documents as ordered trees (e.g. rich text formats, structured text, drawings, etc.). To this end we investigate a number of use cases on XML merging (collaborative editing, propagating changes across document variants), from which we derive a set of high-level merge rules. Our merge is based on these rules. We propose that our merge is easy to both understand and implement, yet sufficiently expressive to handle several important cases of merging on document structure that are beyond the capabilities of traditional text-based tools. In order to justify these claims, we applied our merging method to the merging tasks contained in the use cases. The overall performance of the merge was found to be satisfactory. The key contributions of this work are: a set of merge rules derived from use cases on XML merging, a compact and versatile XML merge in accordance with these rules, and a classification of conflicts in the context of that merge. Keywords: XML, collaborative editing, conflict, structured text, three-way merge | |||
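
A minimal sketch of the three-way rule "a one-sided change wins, changes on both sides conflict", applied here to flat XML attribute dictionaries for brevity. This illustrates the general technique only, not Lindholm's merge, which operates on ordered document trees; the element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

def merge3(base, left, right):
    """Three-way merge of flat dicts; returns (merged, conflicting_keys)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(left) | set(right)):
        b, l, r = base.get(key), left.get(key), right.get(key)
        if l == r:                 # both sides agree (including both deleted)
            value = l
        elif l == b:               # only the right copy changed: it wins
            value = r
        elif r == b:               # only the left copy changed: it wins
            value = l
        else:                      # both changed differently: conflict
            conflicts.append(key)
            value = l              # keep one side and flag it for the user
        if value is not None:
            merged[key] = value
    return merged, conflicts

base  = ET.fromstring('<p class="body" lang="en"/>').attrib
left  = ET.fromstring('<p class="quote" lang="en"/>').attrib
right = ET.fromstring('<p class="body" lang="fi"/>').attrib
print(merge3(base, left, right))  # ({'class': 'quote', 'lang': 'fi'}, [])
```
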
| Fast structural query with application to Chinese treebank sentence retrieval | | BIBAK | Full-Text | 11-20 | |
| Chia-Hsin Huang; Tyng-Ruey Chuang; Hahn-Ming Lee | |||
| In natural language processing, a huge amount of structured data is
constantly used for the extraction and presentation of grammatical structures
in sentences. For example, the Chinese Treebank corpus developed at the
Institute of Information Science, Academia Sinica, Taiwan, is a semantically
annotated corpus that has been used to help parse and study Chinese sentences.
In this setting, users usually use structured tree patterns instead of keywords
to query the corpus.
| In this paper we present an online prototype system that provides exploratory search ability. The system implements two flexible and efficient structural query methods and employs a user-friendly web-based interface. Although the system adopts the XML format to present the corpora and search results, it does not use conventional XML query languages. As searching the Chinese Treebank corpora is structural in nature and often deals with structural similarities, conventional XML query languages such as XPath and XQuery are inflexible and inefficient. We propose and implement a query algorithm called Parent-Child Relationship Filter (PCRF) which provides flexible and efficient structural search. PCRF is sufficiently flexible to provide several similarity-matching options, such as wildcards, unordered sibling sub-trees, ancestor-descendant matching, and their combinations. In addition, PCRF supports stream-based matching to help users query their XML documents online. We also present three accelerating rules that achieve a 1.5- to 8-fold performance improvement in query time. Our experimental results show that our method achieves a 10- to 1000-fold performance improvement compared to the usual text-based XPath query method. Keywords: XML, structural query, treebank | |||
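
The toy matcher below suggests the kind of parent-child structural matching, with wildcards and order-insensitive siblings, that the abstract describes. It is not the PCRF algorithm, and its greedy assignment of pattern children to data children is a simplification that happens to suffice here.

```python
def matches(pattern, node):
    """True if the pattern tree matches the data tree at this node.
    Trees are (label, children) pairs; '*' is a wildcard label, and
    pattern children may match data children in any order."""
    plabel, pchildren = pattern
    nlabel, nchildren = node
    if plabel not in ('*', nlabel):
        return False
    used = set()
    for pc in pchildren:           # greedily claim a distinct data child
        for i, nc in enumerate(nchildren):
            if i not in used and matches(pc, nc):
                used.add(i)
                break
        else:
            return False
    return True

# Treebank-like tree: S -> (NP, VP -> (V, NP))
tree    = ('S', [('NP', []), ('VP', [('V', []), ('NP', [])])])
pattern = ('S', [('VP', [('NP', []), ('*', [])])])  # unordered, with wildcard
print(matches(pattern, tree))  # True
```
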
| Querying XML documents by dynamic shredding | | BIBAK | Full-Text | 21-30 | |
| Hui Zhang; Frank Wm. Tompa | |||
| With the wide adoption of XML as a standard data representation and exchange
format, querying XML documents becomes increasingly important. However,
relational database systems constitute a much more mature technology than what
is available for native storage of XML. To bridge the gap, one way to manage XML
data is to use a commercial relational database system. In this approach, users
typically first "shred" their documents by isolating what they predict to be
meaningful fragments, then store the individual fragments according to some
relational schema, and later translate each XML query (e.g. expressed in W3C's
XQuery) to SQL queries expressed against the shredded documents.
| In this paper we propose an alternative approach that builds on relational database technology but shreds XML documents dynamically. This avoids many of the problems in maintaining document order and reassembling compound data from its fragments. We then present an algorithm to translate a significant subset of XQuery into an extended relational algebra that includes operators defined for the structured text datatype. This algorithm can be used as the basis of a sound translation from XQuery to SQL and the starting point for query optimization, which is required for XML to be supported by relational database technology. Keywords: XML, XQuery, dynamic shredding, relational algebra, text ADT | |||
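
For contrast with the dynamic approach the paper proposes, here is a minimal sketch of static shredding into the classic edge-table encoding; the schema and the sample query translation are generic illustrations, not the paper's mapping.

```python
import sqlite3
from itertools import count
import xml.etree.ElementTree as ET

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (id INT, parent INT, tag TEXT, text TEXT)")
_ids = count(1)

def shred(elem, parent_id=None):
    """Store an element as an edge row, then recurse over its children.
    The ascending id doubles as a document-order rank, which static
    shredding schemes must record explicitly."""
    eid = next(_ids)
    con.execute("INSERT INTO edge VALUES (?,?,?,?)",
                (eid, parent_id, elem.tag, (elem.text or "").strip()))
    for child in elem:
        shred(child, eid)

shred(ET.fromstring("<book><title>XML</title><author>Zhang</author></book>"))
# A path query such as /book/title becomes a self-join on the edge table:
for (text,) in con.execute(
        "SELECT c.text FROM edge p JOIN edge c ON c.parent = p.id "
        "WHERE p.tag = 'book' AND c.tag = 'title'"):
    print(text)  # XML
```
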
| Presenting the results of relevance-oriented search over XML documents | | BIBAK | Full-Text | 31-33 | |
| Alda Lopes Gançarski; Pedro Rangel Henriques | |||
| In this paper we discuss how to present the result of searching elements of
any type from XML documents relevant to some information need
(relevance-oriented search). As the resulting elements can contain each other,
we show an intuitive way of organizing the resulting list of elements into
several ranked lists at different levels, such that each element is presented
only once. Depending on the size of such ranked lists, their presentation is
given by a structure tree for small lists or by a sequence of pointers for
large lists. In both cases the textual content of the elements involved is
given. We also analyse the size of ranked lists in a real collection of XML
documents. Keywords: user interface for XML retrieval | |||
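
A rough sketch of splitting overlapping result elements into per-level ranked lists so that each element appears exactly once. The level rule used here, nesting depth among the results themselves, is an assumption for illustration, not necessarily the authors' exact procedure.

```python
def contains(a, b):
    """Strict containment of elements given as (start, end) offsets."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b

def by_level(results):
    """results: ranked list of (score, (start, end)) pairs.
    An element's level is the number of other results that contain it."""
    spans = [span for _, span in results]
    levels = {}
    for score, span in results:
        depth = sum(contains(other, span) for other in spans)
        levels.setdefault(depth, []).append((score, span))
    return levels

ranked = [(0.9, (0, 100)), (0.8, (10, 40)), (0.7, (200, 250)), (0.6, (12, 20))]
print(by_level(ranked))
# {0: [(0.9, (0, 100)), (0.7, (200, 250))],
#  1: [(0.8, (10, 40))],
#  2: [(0.6, (12, 20))]}
```
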
| The XML world view | | BIBA | Full-Text | 34 | |
| Kristoffer H. Rose | |||
| XML is unique in its very broad acceptance throughout both the document engineering and data processing communities. This creates a unique opportunity for unifying the traditionally separate worlds and asking questions such as "What are the data relations in my document?" and "How can I read a textual version of my data?", all within the single framework provided by XML. In this talk I'll speculate on how one could view the whole world as a single XML document from which both relational "table" and textual "report" queries are possible. | |||
| Supporting virtual documents in just-in-time hypermedia systems | | BIBAK | Full-Text | 35-44 | |
| Li Zhang; Michael Bieber; David Millard; Vincent Oria | |||
| Many analytical or computational applications, especially legacy systems,
create documents and display screens in response to user queries "dynamically"
or in "real time". These "virtual documents" do not exist in advance, and thus
hypermedia features must be generated "just in time" - automatically and
dynamically. Additionally, the hypermedia features may have to cause target
documents to be generated or re-generated. This paper focuses on the specific
challenges faced in hypermedia support for virtual documents: dynamic
hypermedia functionality, dynamic regeneration, and dynamic anchor
re-identification and re-location. It presents a prototype called JHE
(Just-in-time Hypermedia Engine) to support just-in-time hypermedia across
third-party applications with dynamic content and discusses issues prompted by
this research. Keywords: dynamic hypermedia functionality, dynamic regeneration, integration
architecture, just-in-time hypermedia, re-identification, re-location, virtual
documents | |||
| A document-based approach to the generation of web applications | | BIBAK | Full-Text | 45-47 | |
| Andrea R. de Andrade; Ethan V. Munson; Maria da G. Pimentel | |||
| wVIEW is an automated system for generating Web applications that relies
extensively on document representations and transformations. wVIEW adopts the
widely accepted hypermedia design principle that content, navigation, and
presentation are separate concerns. Each of these aspects of the design process
is controlled by separate declarative specifications. Only the first
specification, the content structure specification, which is described using UML,
must be provided. However, the wVIEW user is free to add extensions and
customizations to both the data and navigation models in order to make the
final application suit specific needs. This paper describes the wVIEW approach
and the current prototype, which focuses on the data and navigation modelling
aspects. The paper discusses experiences in using XSLT as the primary
development tool and shows examples of how the enhancements planned for XSLT
address some limitations of the application generation process. Keywords: XML, XSLT, cocoon, design, web applications | |||
| Assisting artifact retrieval in software engineering projects | | BIBAK | Full-Text | 48-50 | |
| Mirjana Andric; Wendy Hall; Leslie Carr | |||
| The research presented in this paper focuses on the issue of how a
recommender system can support the task of searching documents and artifacts
constructed in a software development project. The "A LA" (Associative Linking
of Attributes) system is a recommender facility built on top of a
document management system. The facility provides assistance in finding items
by utilising hypertextually connected metadata. In order to determine metadata
relationships, "A LA" employs content analysis techniques together with
user-generated metadata and usage logs. An evaluation study was conducted that
compares querying using a full-text search approach with the "A LA" method for
finding relevant documents. Keywords: links, metadata, recommender systems, zigzag | |||
| Lightweight integration of documents and services | | BIBAK | Full-Text | 51-53 | |
| Nkechi Nnadi; Michael Bieber | |||
| This research's primary contribution is providing a relatively
straightforward, sustainable infrastructure for integrating documents and
services. Users see a totally integrated environment. The integration
infrastructure generates supplemental link anchors. Selecting one generates a
list of relevant links automatically through the use of relationship rules. Keywords: automatic link generation, metainformation, relationship rules, service
integration | |||
| Personal glossaries on the WWW: an exploratory study | | BIBAK | Full-Text | 54-56 | |
| James Blustein; Mona Noor | |||
| We examine basic issues of glossary tools as part of a suite of annotational
tools to help users make meaning from documents from unfamiliar realms of
discourse. We specifically evaluated the performance of glossary tools for
reading medical information about common diseases by users with no formal
medical education.
| We developed both an automatic and an editable glossary tool. Both of them extracted definitions from the text of articles. Only the editable glossary tool allowed users to add, delete, and change entries. Both tools were evaluated to find out how useful they were to users reading technical articles online. The analytical results showed that user performance improved without increasing total reading time. The glossary tools were effective and pleasing to users, with no decrease in efficiency. This experiment points the way for longer-term studies with adaptable tools, particularly to help users unfamiliar with technical documents. We also discuss the rôle of glossaries as part of a suite of annotational tools to help users make personal (and therefore meaningful) hypertextual document collections. Keywords: annotation support, evaluation experiment, hyperlinked glossaries, user
interfaces | |||
| Behavioral reactivity and real time programming in XML: functional programming meets SMIL animation | | BIBAK | Full-Text | 57-66 | |
| Peter King; Patrick Schmitz; Simon Thompson | |||
| XML and its associated languages are emerging as powerful authoring tools
for multimedia and hypermedia web content. Furthermore, intelligent presentation
generation engines have begun to appear, as have models and platforms for
adaptive presentations. However, XML-based models are limited by their lack of
expressiveness in presentation and animation. As a result, authors of dynamic
adaptive web content must often use considerable amounts of script or code. The
use of such script or code has two serious drawbacks. First, such code
undermines the declarative description possible in the original presentation
language; second, the scripting/coding approach does not readily lend itself
to authoring by non-programmers. In this paper we describe a set of XML
language extensions, inspired by features from the functional programming world,
which are designed to widen the class of reactive systems that can be
described in languages such as SMIL. The described features extend the power of
declarative modeling for the web by allowing the introduction of web media
items which may dynamically react to continuously varying inputs, both in a
continuous way and by triggering discrete user-defined events. The two
extensions described herein are discussed in the context of SMIL Animation and
SVG, but could be applied to many XML-based languages. Keywords: DOM, SMIL, SVG, XML, animation, behaviors, continuous, declarative, events,
expressions, functional programming, modeling, time | |||
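
As a hedged illustration of the functional-programming flavor of such extensions, the sketch below models a "behavior" as a function of time, combines behaviors pointwise, and derives a discrete event from a continuously varying value. It mimics the idea only, not the SMIL/SVG syntax the paper defines.

```python
import math

def lift(op, *behaviors):
    """Combine time-varying values pointwise into a new behavior."""
    return lambda t: op(*(b(t) for b in behaviors))

wiggle = lambda t: math.sin(2 * math.pi * t)   # continuously varying input
scaled = lift(lambda v: 50 + 40 * v, wiggle)   # e.g. an animated x-coordinate

def first_trigger(behavior, predicate, times):
    """Sample a behavior; report the first time the predicate holds."""
    return next((t for t in times if predicate(behavior(t))), None)

samples = [i / 100 for i in range(100)]
print(scaled(0.25))                                      # 90.0 at the sine's peak
print(first_trigger(scaled, lambda x: x > 85, samples))  # 0.17, a discrete event
```
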
| A question answer system using mails posted to a mailing list | | BIBAK | Full-Text | 67-73 | |
| Yasuhiko Watanabe; Kazuya Sono; Kazuya Yokomizo; Yoshihiro Okada | |||
| The most serious difficulty in developing a QA system is knowledge. In this
paper we first discuss three problems of developing a knowledge base by which a
QA system answers how-type questions. Then we propose a method of developing a
knowledge base by using mails posted to a mailing list. Next we describe a QA
system which can answer how-type questions based on the knowledge base. Our
system finds question mails which are similar to the user's question and shows the
answers to the user. The similarity between the user's question and a question mail
is calculated by matching the user's question against a significant sentence in the
question mail. Finally we show that mails posted to a mailing list can be used
as a knowledge base by which a QA system answers how-type questions. Keywords: mailing list, question answer system, sentence extraction | |||
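
A generic bag-of-words cosine similarity, shown only to suggest the kind of question-to-sentence matching involved; the authors' actual similarity calculation may weight terms quite differently.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two strings as word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

question = "how do I configure sendmail on linux"
significant_sentences = [          # one significant sentence per question mail
    "how can I configure sendmail under linux",
    "the meeting is scheduled for friday",
]
print(max(significant_sentences, key=lambda s: cosine(question, s)))
# -> "how can I configure sendmail under linux"
```
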
| A look at some issues during textual linking of homogeneous web repositories | | BIBAK | Full-Text | 74-83 | |
| José Antonio Camacho-Guerrero; Alessandra Alaniz Macedo; Maria da Graça Campos Pimentel | |||
| By interacting with services that create links automatically via the Web, users
are able to identify relationships among documents stored in different
repositories. The fact that automatic linking services do not use queries
performed by a human user has an impact on the use of information retrieval
techniques for the identification of relationships. Information retrieval
techniques can lead to the identification of relationships that should not have
been generated (generating non-relevant links) while at the same time failing to
identify all relevant relationships (poor recall). Towards improving the
quality of the relationships identified, we have investigated some design issues
considered during the automatic linking of textual repositories. The
investigations have used a collection of documents from online Brazilian
newspapers and the Cystic Fibrosis Collection. The results of the
investigations have defined procedures, infrastructures, and consequently the
requirements for a configurable linking service, also made available as a
contribution of this work. Keywords: homogeneous repositories, information retrieval, linking, semantic
structures, web | |||
| Interactive multimedia annotations: enriching and extending content | | BIBAK | Full-Text | 84-86 | |
| Rudinei Goularte; Renan G. Cattelan; José A. Camacho-Guerrero; Valter R. Inácio, Jr.; Maria da Graça C. Pimentel | |||
| This paper discusses an approach to the problem of annotating multimedia
content. Our approach provides annotation as metadata for indexing, retrieval,
and semantic processing, as well as content enrichment. We use an underlying
model for structured multimedia descriptions and annotations, allowing the
establishment of spatial, temporal, and linking relationships. We discuss aspects
related to documents and annotations used to guide the design of an
application that allows annotations to be made with pen-based interaction on
Tablet PCs. As a result, a video stream can be annotated during capture. The
annotation can be further edited, extended, or played back synchronously. Keywords: MPEG-7, annotation, multimodal interfaces | |||
| A reduced yet extensible audio-visual description language | | BIBAK | Full-Text | 87-89 | |
| Raphaël Troncy; Jean Carrive | |||
| Enabling intelligent access to multimedia data requires a powerful
description language. In this paper we demonstrate why the MPEG-7 standard
fails to fulfill this task. We then introduce our proposal: an audio-visual
specific description language that is modular and reduced, but designed to be
extensible. This language is centered on the notions of descriptor and structure
with well-defined semantics. A descriptor can be a low-level feature automatically
extracted from the signal, or a higher-level semantic concept that will be used to
annotate the video documents. The descriptors can be combined into structures
according to defined models that provide description patterns. Keywords: MPEG-7, audio-visual description language, descriptor, knowledge
representation, semantic web, semantics, structure | |||
| The case for explicit knowledge in documents | | BIBAK | Full-Text | 90-98 | |
| Leslie Carr; Timothy Miles-Board; Arouna Woukeu; Gary Wills; Wendy Hall | |||
| The Web is full of documents which must be interpreted by human readers and
by software agents (search engines, recommender systems, clustering processes,
etc.). Although Web standards have addressed format obfuscation by using XML
schemas and stylesheets to specify unambiguous structure and presentation
semantics, interpretation is still hampered by the fundamental ambiguity of
information in PCDATA text. Even the most easily distinguishable kinds of
knowledge, such as article citations and proper nouns (referring to people,
organisations, projects, products, technical concepts), have to be identified by
fallible post-hoc extraction processes. The WiCK project has investigated the
writing process in a Semantic Web environment where knowledge services exist
and actively assist the author. In this paper we discuss the need to make
knowledge an explicit part of the document representation, and the advantages
and disadvantages of this step. Keywords: document structure, knowledge writing, semantic web | |||
| Creating structured PDF files using XML templates | | BIBAK | Full-Text | 99-108 | |
| Matthew R. B. Hardy; David F. Brailsford; Peter L. Thomas | |||
| This paper describes a tool for recombining the logical structure from an
XML document with the typeset appearance of the corresponding PDF document. The
tool uses the XML representation as a template for the insertion of the logical
structure into the existing PDF document, thereby creating a Structured/Tagged
PDF. The addition of logical structure adds value to the PDF in three ways: the
accessibility is improved (PDF screen readers for visually impaired users
perform better), media options are enhanced (the ability to reflow PDF documents
using structure as a guide makes PDF viable for use on hand-held devices), and
the re-usability of the PDF documents benefits greatly from the presence of an
XML-like structure tree to guide the process of text retrieval in reading order
(e.g. when interfacing to XML applications and databases). Keywords: PDF, XML, logical structure insertion | |||
| Aesthetic measures for automated document layout | | BIBAK | Full-Text | 109-111 | |
| Steven J. Harrington; J. Fernando Naveda; Rhys Price Jones; Paul Roetling; Nishant Thakkar | |||
| A measure of aesthetics that has been used in automated layout is described.
The approach combines heuristic measures of attributes that degrade the
aesthetic quality. The combination is nonlinear, so that one bad aesthetic
feature can harm the overall score. Example heuristic measures are described
for the features of alignment, regularity, separation, balance, white-space
fraction, white-space free flow, proportion, uniformity, and page security. Keywords: aesthetics, document, layout | |||
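
One standard way to obtain the described nonlinearity is a power mean with a negative exponent, i.e. a soft minimum, so that a single bad feature dominates the combined score. The formula below is an assumption for illustration, not the authors' published measure.

```python
def combine(scores, p=-4.0):
    """Soft-minimum combination of per-feature scores in (0, 1]."""
    n = len(scores)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)

balanced = [0.8, 0.8, 0.8, 0.8]
one_bad  = [0.95, 0.95, 0.95, 0.2]
print(round(combine(balanced), 3))  # 0.8
print(round(combine(one_bad), 3))   # 0.282: one flaw drags the page down
```
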
| Creation of topic map by identifying topic chain in Chinese | | BIBAK | Full-Text | 112-114 | |
| Ching-Long Yeh; Yi-Chun Chen | |||
| XML Topic Maps enable multiple concurrent views of sets of information
objects and can be used in different applications: for example, thesaurus-like
interfaces to corpora, navigational tools for cross-references or citation
systems, information filtering or delivery depending on user profiles, etc.
However, enriching the information of a topic map or connecting it with a
document's URI is very labor-intensive and time-consuming. To solve this
problem, we propose an approach based on natural language processing techniques
to identify and extract useful information in raw Chinese text. Unlike most
traditional approaches to parsing sentences, based on the integration of complex
linguistic information and domain knowledge, we work on the output of a
part-of-speech tagger and use shallow parsing instead of complex parsing to
identify the topics of sentences. The key elements of the centering model of
local discourse coherence are employed to extract the structures of discourse
segments. We use the local discourse structure to solve the problem of zero
anaphora in Chinese and then identify the topic, which is the most salient
element in a sentence. After we obtain all the topics of a document, we may
assign the document to a topic node of the topic map and simultaneously add
the document's information to the topic element. Keywords: centering model, shallow parsing, topic identification, topic maps, zero
anaphora resolution | |||
| Techniques for authoring complex XML documents | | BIBAK | Full-Text | 115-123 | |
| Vincent Quint; Irène Vatton | |||
| This paper reviews the main innovations of XML and considers their impact on
editing techniques for structured documents. Namespaces open the way to
compound documents; well-formedness brings more freedom in the editing task;
CSS allows style to be associated easily with structured documents. In addition
to these innovative features, the wide deployment of XML introduces structured
documents into many new applications, including applications where text is not
the dominant content type. In languages such as SVG or SMIL, for instance, XML is
used to represent vector graphics or multimedia presentations.
| This is a challenging situation for authoring tools. Traditional methods for editing structured documents are not sufficient to address the new requirements. New techniques must be developed or adapted to allow more users to efficiently create advanced XML documents. These techniques include multiple views, semantic-driven editing, direct manipulation, concurrent manipulation of style and structure, and integrated multi-language editing. They have been implemented and experimented with in the Amaya editor and in some other tools. Keywords: CSS, XML, authoring tools, compound documents, direct manipulation,
structured editing, style languages | |||
| Instructional information in adaptive spatial hypertext | | BIBAK | Full-Text | 124-133 | |
| Luis Francisco-Revilla; Frank Shipman | |||
| Spatial hypertext is an effective medium for the delivery of help and
instructional information on the Web. Spatial hypertext's intrinsic features
allow documents to visually reflect the inherent structure of the information
space and represent implicit relationships between information objects. This
work presents a study of the effectiveness of spatial hypertext as a medium for
the delivery of instructional information. Results were gathered from direct
observation of people reading a spatial hypertext document which was used
as informational support for a complex task. Two versions of the spatial
hypertext document were used: a non-adaptive and an adaptive one. The document was
adapted based upon the inferred relevance of information to the user's
knowledge and task requirements. The study produced insights into emergent
reading strategies, such as informed link traversals and the use of collections
as bookmarks. Observations and evaluation of how people interacted with both
document versions showed that the spatial layout and the use of collections as
a way to encapsulate information allowed people to read, browse, and navigate
very large information spaces while maintaining a clear understanding of the
structure of the information. Finally, several differences between the adaptive
and non-adaptive versions were identified, showing that adaptation alters not
only the display of information but also the way that people read a spatial
hypertext document. Keywords: adaptation, information delivery, spatial hypertext | |||
| Page composition using PPML as a link-editing script | | BIBAK | Full-Text | 134-136 | |
| Steven R. Bagley; David F. Brailsford | |||
| The advantages of a COG (Component Object Graphic) approach to the
composition of PDF pages have been set out in a previous paper [1]. However, if
pages are to be composed in this way, then the individual graphic objects must
have known bounding boxes and must be correctly placed on the page in a process
that resembles the link editing of a multi-module computer program. Ideally, the
linker should be able to utilize all declared resource information attached to
each COG.
| We have investigated the use of an XML application called Personalized Print Markup Language (PPML) to control the link-editing process for PDF COGs. Our experiments, though successful, have shown up the shortcomings of PPML's resource-handling capabilities, which are currently active at the document and page levels but cannot be elegantly applied to individual graphic objects at a sub-page level. Proposals are put forward for modifications to PPML that would make any COG-based approach to page composition easier. Keywords: PDF, PPML, form Xobjects, graphic objects, link editing | |||
| Managing inconsistent repositories via prioritized repairs | | BIBAK | Full-Text | 137-146 | |
| Jan Scheffczyk; Peter Rödig; Uwe M. Borghoff; Lothar Schmitz | |||
| Whenever a group of authors collaboratively edits interrelated documents,
semantic consistency is a major goal. Current document management systems (DMS)
lack adequate consistency management facilities. We propose the liberal use of
formal consistency rules, which permits inconsistencies. In this paper we focus
on deriving repairs for inconsistencies. Our major contributions are: (1)
deriving (common) repairs for multiple rules, (2) resolving conflicts between
repairs, (3) prioritizing repairs, and (4) support for partial inconsistency
resolution, which resolves the most troubling inconsistencies and leaves less
important inconsistencies for later handling. The novel aspect of our
approach is that we derive repairs from DAGs (directed acyclic graphs) and not
from documents directly. That way, the repository is locked only during DAG
generation, which is performed incrementally. Keywords: consistency maintenance, document management, repair | |||
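
A minimal sketch of prioritized, conflict-aware repair selection under the assumptions the abstract suggests: candidate repairs with priorities and pairwise conflicts between some of them. The greedy policy and all names are illustrative; the paper derives its repairs from DAGs.

```python
def choose_repairs(repairs, conflicts):
    """repairs: {name: priority}; conflicts: set of frozenset pairs.
    Admit repairs in descending priority, dropping any repair that
    conflicts with one already chosen (partial resolution falls out:
    whatever is dropped can be handled later)."""
    chosen = []
    for name in sorted(repairs, key=repairs.get, reverse=True):
        if all(frozenset((name, c)) not in conflicts for c in chosen):
            chosen.append(name)
    return chosen

repairs   = {"fix-broken-link": 3, "rename-section": 2,
             "delete-section": 2, "retitle": 1}
conflicts = {frozenset(("rename-section", "delete-section"))}
print(choose_repairs(repairs, conflicts))
# ['fix-broken-link', 'rename-section', 'retitle']
```
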
| The lifecycle of a digital historical document: structure and content | | BIBAK | Full-Text | 147-154 | |
| A. Antonacopoulos; D. Karatzas; H. Krawczyk; B. Wiszniewski | |||
| This paper describes the lifecycle of a digital historical document, from
template-based structure definition through to content extraction from the
scanned pages and its final reconstitution as an electronic document (combining
content and semantic information) along with the tools that have been created
to realise each stage in the lifecycle. The whole approach is described in the
context of different types of typewritten documents relating to prisoners in
World-War II concentration camps and is the result of a multinational
collaboration under the MEMORIAL project funded (€1.5M) by the European
Union (www.memorial-project.info). Extensive tests with historians/archivists
and evaluation of the content extraction results indicate the superior
performance of the whole semantics-driven approach, both over manual
transcription and over the semi-automated application of off-the-shelf OCR and
the use of a conventional (text and layout) document format. Keywords: digital libraries, document analysis, document architecture, document
engineering, historical documents, text enhancement | |||
| Accommodating paper in document databases | | BIBAK | Full-Text | 155-162 | |
| Majed AbuSafiya; Subhasish Mazumdar | |||
| Although the paperless office has been imminent for decades, documents in
paper form continue to be used extensively in almost all organizations.
Present-day information systems are designed on the premise that any paper
document in use will be either converted into electronic form or merely printed
from electronic file(s) accessible to the system. Yet, paper is the medium of
choice in many situations, mainly owing to its portability and usability, and
the medium of necessity in others, especially where external communication or
the traditional notion of authenticity is involved. Humans, who find unique
attractive features in both paper and electronic forms of documents, must
survive this tension between the de-jure banishment of paper and its de-facto
prevalence. In this paper, we propose to make paper documents first-class
citizens by including them in the model underlying the information system.
Specifically, we extend the schema of a document database with the notion of
paper documents, physical locations, and the organizational hierarchy. This
leads to an overall enhancement of document integrity and the ability to answer
queries such as "where are the customer complaint letters we have received
today?" and "which documents are in this filing cabinet?". Recent technological
advances such as sensors have made the implementation of such a model very
realistic. Keywords: RFID, document databases, document management, enterprise document model,
paper documents, paper manifestation | |||
| Strategies for document optimization in digital publishing | | BIBAK | Full-Text | 163-170 | |
| Felipe Rech Meneguzzi; Leonardo Luceiro Meirelles; Fernando Tarlá Martins Mano; Joao Batista de Souza Oliveira; Ana Cristina Benso da Silva | |||
| Recent advances in digital press technology have enabled the creation of
high-quality personalized documents, with the potential of generating an entire
batch of one-of-a-kind documents. Even though digital presses are capable of
printing such document sets as fast as they would print regular press jobs,
raster image processing might possibly be performed for every different page in
the job. Such a process demands a large computational effort, and it is therefore
interesting to gather repeated images that are used throughout all documents
and rasterize them as few times as possible. Moreover, performing such processing
separately from document production in the publishing workflow allows
optimization to be performed prior to final printing, thus allowing it to take
press hardware specifics into account, and reducing the time taken for it to
produce the final output. This paper describes techniques to perform this task
using PPML as the document description language, as well as the main issues
concerning this kind of document optimization. Several gathering policies are
described along with explanatory examples. We also provide and discuss
experimental data supporting the use of such a strategy. Keywords: PPML, digital press, personalized printing, variable data printing, variable
information documents | |||
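
An illustrative gathering pass: recurring images are detected by content hash so that each unique image is rasterized once and referenced thereafter. This shows the general idea only; the paper's PPML-specific gathering policies are more elaborate.

```python
import hashlib

def gather(pages):
    """pages: list of pages, each a list of image byte-strings.
    Returns (unique resources keyed by hash, per-page reference lists)."""
    resources, layout = {}, []
    for page in pages:
        refs = []
        for image in page:
            key = hashlib.sha256(image).hexdigest()
            resources.setdefault(key, image)  # rasterize once per unique image
            refs.append(key)
        layout.append(refs)
    return resources, layout

pages = [[b"logo-png-bytes", b"photo-1"], [b"logo-png-bytes", b"photo-2"]]
resources, layout = gather(pages)
print(len(resources))  # 3 unique images instead of 4 placements
```
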
| Digital capture for automated scanner workflows | | BIBAK | Full-Text | 171-177 | |
| Steven J. Simske; Scott C. Baggs | |||
| The use of scanners and other capture devices to incorporate film- and
paper-based materials into digital workflows is an important part of "digital
convergence", or the bringing of paper-based and electronic documents together
into the same electronic workflows. The diversity of captured information - from
text and mixed-type documents to photos, negatives, slides and
transparencies - requires a combination of document analysis techniques to
perform, automatically, the segmentation, classification and workflow
assignment of the scanned images. We herein present technologies that provide
fast (< 1.0 sec) and reliable (> 95% job accuracy) capture solutions for
all of these input content types. These solutions offer near real-time capture
that provides automated workflow capabilities to a repertoire of scanning
hardware: scanners, all-in-one devices, copiers and multifunctional printers.
The techniques used to categorize the documents, perform zoning analysis on the
documents, and then perform closed loop quality assurance on the documents are
presented. Keywords: classification, negatives, photos, scanning, segmentation, slides, user
interface, zoning | |||
| Visual signature based identification of low-resolution document images | | BIBAK | Full-Text | 178-187 | |
| Ardhendu Behera; Denis Lalanne; Rolf Ingold | |||
| In this paper, we present (a) a method for identifying documents captured
from low-resolution devices such as web-cams, digital cameras or mobile phones
and (b) a technique for extracting their textual content without performing
OCR. The first method associates a hierarchically structured visual signature
to the low-resolution document image and further matches it with the visual
signatures of the original high-resolution document images, stored in PDF form
in a repository. The matching algorithm follows the signature hierarchy, which
speeds up the search by guiding it towards fruitful solution spaces. In a
second step, the content of the original PDF document is extracted, structured,
and matched with its corresponding high-resolution visual signature. Finally,
the matched content is attached to the low-resolution document image's visual
signature, which greatly enriches the document's content and indexing. We
present in this article both these identification and extraction methods and
evaluate them on various documents, resolutions and lighting conditions, using
different capture devices. Keywords: document visual signature, document-based meeting retrieval, documents'
content extraction, low-resolution document image identification | |||
| NCL 2.0: integrating new concepts to XML modular languages | | BIBAK | Full-Text | 188-197 | |
| Heron V. O. Silva; Rogério F. Rodrigues; Luiz Fernando G. Soares; Débora C. Muchaluat Saade | |||
| This paper presents the main new features of Nested Context Language (NCL)
version 2.0. NCL 2.0 is a modular and declarative hypermedia language, whose
modules can be combined to other languages, such as SMIL, to provide new
facilities. Among the NCL 2.0 new features, we can highlight the support for
handling hypermedia relations as first-class entities, through the definition
of hypermedia connectors, and the possibility of specifying any semantics for a
hypermedia composition, using the concept of composition templates. Another
important goal of this paper is to describe a framework to facilitate the
development of NCL parsing and processing tools. Based on this framework, the
paper comments on several implemented compilers, which allow, for instance, the
conversion of NCL documents into SMIL specifications. Keywords: NCL, SMIL, XConnector, XTemplate, composition template, framework for
parsing and processing XML, hypermedia connector | |||
| Document capture using stereo vision | | BIBAK | Full-Text | 198-200 | |
| Adrian Ulges; Christoph H. Lampert; Thomas Breuel | |||
| Capturing images of documents using handheld digital cameras has a variety
of applications in academia, research, knowledge management, retail, and office
settings. The ultimate goal of such systems is to achieve image quality
comparable to that currently achieved with flatbed scanners even for curved,
warped, or curled pages. This can be achieved by high-accuracy 3D modeling of
the page surface, followed by a "flattening" of the surface. A number of
previous systems have either assumed only perspective distortions, or used
techniques like structured lighting, shading, or side-imaging for obtaining 3D
shape. This paper describes a system for handheld camera-based document capture
using general purpose stereo vision methods followed by a new document
dewarping technique. Examples of shape modeling and dewarping of book images are
shown. Keywords: camera based document capture, dewarping, stereo vision | |||
| On modular transformation of structural content | | BIBAK | Full-Text | 201-210 | |
| Tyng-Ruey Chuang; Jan-Li Lin | |||
| We show that an XML DTD (Document Type Definition) can be viewed as the
fixed point of a parametric content model. Based on the parametric content
model, we develop a model of modular transformation of XML documents. A fold
operator is used to capture a class of functions that consume valid XML
document trees in a bottom-up manner. Similarly, an unfold operator is used to
generate valid XML document trees in a top-down fashion. We then show that
DTD-aware XML document transformation, which consumes a document of one DTD and
generates a document of another DTD, can be thought of as both a fold operation
and an unfold operation.
This leads us to model certain DTD-aware document transformations by mappings from the source content models to the target content models. From these mappings, we derive DTD-aware XML document transformational programs. Benefits of such derived programs include automatic validation of the target documents (no invalid document will be generated) and a modularity property in the composition of these programs (intermediate results from successive transformations can be eliminated). Keywords: ML, XML, Bird-Meertens formalism, document transformation and validation,
functional programming, modules | |||
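
A small fold over XML element trees in the spirit of the paper: the bottom-up traversal is written once and the per-element consumers are supplied as plain functions. The DTD-parametric typing central to the paper is beyond this sketch, and the example algebra is invented.

```python
import xml.etree.ElementTree as ET

def fold(elem, algebra):
    """Fold children bottom-up, then apply algebra[tag] to this element."""
    kids = [fold(child, algebra) for child in elem]
    return algebra[elem.tag](elem, kids)

doc = ET.fromstring("<sec><title>Intro</title><p>one</p><p>two</p></sec>")

word_count = {                       # one consumer per element type
    "title": lambda e, ks: 0,        # titles are not counted
    "p":     lambda e, ks: len((e.text or "").split()),
    "sec":   lambda e, ks: sum(ks),
}
print(fold(doc, word_count))  # 2
```
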
| Logic-based XPath optimization | | BIBAK | Full-Text | 211-219 | |
| Pierre Genevès; Jean-Yves Vion-Dury | |||
| XPath [5] was introduced by the W3C as a standard language for specifying
node selection, matching conditions, and for computing values from an XML
document. XPath is now used in many XML standards such as XSLT [4] and the
forthcoming XQuery [10] database access language. Since efficient XML content
querying is crucial for the performance of almost all XML processing
architectures, a growing need for studying high performance XPath-based
querying has emerged. Our approach aims at optimizing XPath performance through
static analysis and syntactic transformation of XPath expressions. Keywords: XML, XPath, axiomatization, containment, efficiency, optimization, query | |||
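
Two standard XPath equivalences, applied as string-rewriting rules until a fixed point, illustrate the kind of syntactic transformation such optimization rests on. These toy rules and their regex encoding are an illustration only, not the paper's rewriting system.

```python
import re

RULES = [
    (re.compile(r"/\.(?=/|$)"), ""),                  # p/.    is equivalent to p
    (re.compile(r"/([\w:]+)/\.\.(?=/|$)"), r"[\1]"),  # p/x/.. is equivalent to p[x]
]

def normalize(xpath: str) -> str:
    """Apply the rewrite rules until no rule changes the expression."""
    changed = True
    while changed:
        changed = False
        for pattern, repl in RULES:
            new = pattern.sub(repl, xpath)
            if new != xpath:
                xpath, changed = new, True
    return xpath

print(normalize("/doc/./chapter/title/.."))  # /doc/chapter[title]
```
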
| Supervised learning for legacy document conversion | | BIBAK | Full-Text | 220-228 | |
| Boris Chidlovskii; Jérôme Fuselier | |||
| We consider the problem of document conversion from the rendering-oriented
HTML markup into a semantic-oriented XML annotation defined by user-specific
DTDs or XML Schema descriptions. We represent both source and target documents
as rooted ordered trees so the conversion can be achieved by applying a set of
tree transformations. We apply the supervised learning framework to the
conversion task according to which the tree transformations are learned from a
set of training examples. Because of the complexity of tree-to-tree
transformations, we develop a two-step approach to the conversion problem that
first labels leaves in the source trees and then recomposes target trees from
the leaf labels. We present two solutions based on the leaf classification with
the target terminals and paths. Moreover, we develop three methods for the leaf
classification. All methods and solutions have been tested on two real
collections. Keywords: XML markup, legacy document conversion, machine learning | |||
| Chart-parsing techniques and the prediction of valid editing moves in structured document authoring | | BIBAK | Full-Text | 229-238 | |
| Marc Dymetman | |||
| We present an approach to controlled document authoring that significantly
extends the functionality of existing methods by allowing bottom-up and
top-down specifications to be freely mixed. A finite-state automaton is used to
represent the partial, evolving, description of the document during authoring.
Using a generalization of chart-parsing techniques to FSAs rather than fixed
input strings, we show how the authoring system is able to automatically detect
the consequences of the choices already made by the author, so as to propose
for the next authoring steps only those choices which may provably lead to a
globally valid document.
We start by considering the case of authoring purely textual documents controlled by a context-free grammar, then show a generalization of this approach to structured documents controlled by a specification whose formal expressive power is at least that of Regular Hedge Grammars (closely related to RELAX NG Schemas) and therefore greater than that of DTDs. Keywords: XML, computational linguistics, document authoring tools and systems,
parsing | |||
| Towards efficient implementation of XML schema content models | | BIBAK | Full-Text | 239-241 | |
| Pekka Kilpeläinen; Rauno Tuhkanen | |||
| XML Schema uses an extension of traditional regular expressions for
describing allowed contents of document elements. Iteration is described
through numeric attributes minOccurs and maxOccurs attached to
content-describing elements such as sequence, choice, and element. These
numeric occurrence indicators are a challenge to standard automata-based
solutions. Straightforward solutions require space that is exponential with
respect to the length of the expressions.
We describe a strategy to implement unambiguous content model expressions as counter automata, which are of linear size only. Keywords: XML schema, automaton, regular expression | |||
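
A minimal counter-automaton sketch for a single numeric occurrence constraint such as minOccurs="2" maxOccurs="4": one state plus a counter replaces the up-to-maxOccurs states a naive expansion would need. The single-element content model is a simplifying assumption for brevity.

```python
def accepts(events, name="item", lo=2, hi=4):
    """Recognize name{lo,hi} with a counter instead of expanded states."""
    counter = 0
    for event in events:
        if event != name:        # only one element type in this tiny model
            return False
        counter += 1
        if counter > hi:         # upper bound is checked on the fly
            return False
    return lo <= counter         # lower bound is checked at the end

print(accepts(["item"] * 3))  # True
print(accepts(["item"] * 5))  # False: exceeds maxOccurs
print(accepts(["item"]))      # False: below minOccurs
```
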