| Document engineering: a preferred partner discipline in knowledge management | | BIBAK | Full-Text | 1-2 | |
| Josef Hofer-Alfeis | |||
After 20 years of investigation and application of Knowledge Management (KM)
there are still various views of and expectations for it, resulting from its
trans-disciplinary character. It is a kind of meta-discipline with many
partner disciplines, e.g. personnel development, organization, process and
quality management, information management, document engineering and
communication. The reason is the complex character of knowledge itself, which
is defined in KM as the capability for effective action. A major dimension of
this capability is naturally the content dimension, i.e. which knowledge area
or object-activity domain it is about, e.g. "document engineering". In any
knowledge area the knowledge has three types of carriers: individuals, with
their experiences, education and inherent capabilities; groups such as teams and
communities, with their compound capabilities based on joint understanding and
networked complementary capabilities; and finally information, carrying more or
less codified and documented knowledge. Across all three knowledge carriers,
three questions or dimensions of knowledge quality are of interest in any
knowledge area that is important, e.g. to a business: "How deep or profound
is it, e.g. the level of expertise of a subject matter expert or a best
practice description?" "How much is it distributed and inter-connected, e.g.
which experts, groups and documents are involved and how?" "How is it codified
and documented, e.g. what is the quality of defining, structuring and documenting
the content?"
This is the starting point for KM: it provides adequate processes or instruments to improve or adjust the knowledge quality to the needs, e.g. of a business. But the various partner disciplines of KM are already active in support, e.g. for learning and training, inter-connection through collaboration, and the formalization and distribution of information -- so why do we still need KM? The partner disciplines may have profound capabilities in their fields, but on their own they drive a kind of one-dimensional KM. The full power of KM is to combine their solutions into more powerful multi-dimensional approaches. Keywords: document engineering, documented knowledge, knowledge, knowledge
codification, knowledge management, knowledge management process, knowledge
networking, meta-discipline, partner disciplines | |||
| Efficient change control of XML documents | | BIBAK | Full-Text | 3-12 | |
| Sebastian Rönnau; Geraint Philipp; Uwe M. Borghoff | |||
| XML-based documents play a major role in modern information architectures
and their corresponding workflows. In this context, the ability to identify and
represent differences between two versions of a document is essential. Several
approaches to finding the differences between XML documents have already been
proposed. Typically, they are based on tree-to-tree correction, or sequence
alignment. Most of these algorithms, however, are too slow and do not support
the subsequent merging of changes. In this paper, we present a differencing
algorithm tailored to ordered XML documents, called DocTreeDiff. It relies on
our context-oriented XML versioning model which allows for document merging,
presented in earlier work. An empiric evaluation demonstrates the efficiency of
our approach as well as the high quality of the generated deltas. Keywords: XML diff, XML merge, office documents, tree-to-tree correction, version
control | |||
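To make the notion of an XML delta concrete, here is a deliberately naive recursive comparison of two ordered element trees in Python. It aligns children purely by position and is in no way DocTreeDiff or its context-oriented versioning model; it only illustrates the kind of edit operations such a differencing algorithm has to produce.

```python
import xml.etree.ElementTree as ET

def naive_diff(old, new, path="/"):
    """Yield (operation, path, detail) tuples for two ordered element trees."""
    if old.tag != new.tag:
        yield ("rename", path, f"{old.tag} -> {new.tag}")
    if (old.text or "").strip() != (new.text or "").strip():
        yield ("update-text", path, (old.text, new.text))
    old_kids, new_kids = list(old), list(new)
    # Children are aligned purely by position; a real differ aligns by similarity.
    for i in range(max(len(old_kids), len(new_kids))):
        child_path = f"{path}{i}/"
        if i >= len(old_kids):
            yield ("insert", child_path, ET.tostring(new_kids[i], encoding="unicode"))
        elif i >= len(new_kids):
            yield ("delete", child_path, old_kids[i].tag)
        else:
            yield from naive_diff(old_kids[i], new_kids[i], child_path)

v1 = ET.fromstring("<doc><p>Hello</p><p>world</p></doc>")
v2 = ET.fromstring("<doc><p>Hello</p><p>there</p><p>world</p></doc>")
for op in naive_diff(v1, v2):
    print(op)
```

Note how the positional alignment mis-reports the insertion of a middle paragraph as an update plus a trailing insert; avoiding exactly this kind of noise is what similarity-based tree-to-tree correction is for.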
| Differential synchronization | | BIBAK | Full-Text | 13-20 | |
| Neil Fraser | |||
| This paper describes the Differential Synchronization (DS) method for
keeping documents synchronized. The key feature of DS is that it is simple and
well suited for use in both novel and existing state-based applications without
requiring application redesign. DS uses deltas to make efficient use of
bandwidth, and is fault-tolerant, allowing copies to converge in spite of
occasional errors. We consider practical implementation of DS and describe some
techniques to improve its performance in a browser environment. Keywords: collaboration, synchronization | |||
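As a rough orientation only, the loop below sketches one client-to-server half-cycle of a Differential Synchronization-style exchange for plain text. It uses Python's difflib and exact patching; Fraser's DS relies on fuzzy patching (e.g. diff-match-patch) precisely so that concurrent edits on the receiving side survive, which this toy version does not attempt.

```python
import difflib

def make_delta(shadow, text):
    """Edits that turn `shadow` into `text`: (start, end, replacement) triples."""
    ops = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, shadow, text).get_opcodes():
        if op != "equal":
            ops.append((i1, i2, text[j1:j2]))   # replace shadow[i1:i2] with this string
    return ops

def apply_delta(base, delta):
    """Apply the edits right-to-left so earlier offsets stay valid."""
    for i1, i2, repl in reversed(delta):
        base = base[:i1] + repl + base[i2:]
    return base

# Client and server start in sync; each keeps a shadow of the last synced state.
client_text = server_text = client_shadow = server_shadow = "The quick brown fox."

client_text = "The quick red fox."                  # local edit on the client
delta = make_delta(client_shadow, client_text)      # 1. diff against the client shadow
client_shadow = client_text                         # 2. advance the client shadow
server_shadow = apply_delta(server_shadow, delta)   # 3. patch the server shadow (exact)
server_text = apply_delta(server_text, delta)       # 4. patch the live server text
assert server_text == server_shadow == "The quick red fox."
```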
| On the analysis of queries with counting constraints | | BIBAK | Full-Text | 21-24 | |
| Everardo Bárcenas; Pierre Genevès; Nabil Layaïda | |||
| We study the analysis problem of XPath expressions with counting
constraints. Such expressions are commonly used in document transformations or
programs in which they select portions of documents subject to transformations.
We explore how recent results on the static analysis of navigational aspects of
XPath can be extended to counting constraints. The static analysis of this
combined XPath fragment makes it possible to detect bugs in transformations and to perform
many kinds of optimizations of document transformations. More precisely, we
study how a logic for finite trees capable of expressing upward and downward
recursive navigation, can be equipped with a counting operator along regular
path expressions. Keywords: counting constraints, modal logics, type checking, xml, xpath | |||
| Modelling composite document behaviour with concurrent hierarchical state machines | | BIBAK | Full-Text | 25-28 | |
| Steve Battle; Helen Balinsky | |||
| This paper addresses the need for a modular approach to document composition
and life-cycle, enabling mixed content to be used and re-used within documents.
Each content item may bring with it its own workflow. Documents are
conventionally considered to be the passive subjects of workflow, but when a
document presents a complex mix of components it becomes harder for a
centralized workflow to cater for this variety of needs. Our solution is to
apply object-oriented concepts to documents, expressing process definitions
alongside the content they apply to. We are interested in describing document
life-cycles, and use Finite State Machines to describe the way that the
individual components of a document change over time. A framework for composing
these functional document components must first consider their hierarchical
nesting for which we use Hierarchical State Machines. Furthermore, to
accommodate the composition of independent sibling components under a common
parent we use Concurrent Hierarchical State Machines. This theoretical
framework provides practical guidelines for modelling composite document
behaviour. Keywords: composite documents, document-centric process, finite state machines | |||
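As a toy illustration of the composition idea only (none of this is the authors' framework), the sketch below treats a composite document as a parent that broadcasts events to the life-cycle machines of its concurrent child components; because parent and children expose the same handle() interface, composites can nest hierarchically.

```python
class StateMachine:
    """A flat finite state machine for one document component."""
    def __init__(self, name, transitions, initial):
        self.name, self.transitions, self.state = name, transitions, initial

    def handle(self, event):
        self.state = self.transitions.get((self.state, event), self.state)

class CompositeDocument:
    """Concurrent composition: every event is offered to every child machine."""
    def __init__(self, name, *children):
        self.name, self.children = name, children

    def handle(self, event):
        for child in self.children:
            child.handle(event)

    def snapshot(self):
        return {c.name: c.snapshot() if isinstance(c, CompositeDocument) else c.state
                for c in self.children}

review = {("draft", "submit"): "in-review", ("in-review", "approve"): "final"}
doc = CompositeDocument("report",
                        StateMachine("body", review, "draft"),
                        CompositeDocument("annexes",
                                          StateMachine("appendix-a", review, "draft")))
doc.handle("submit")
print(doc.snapshot())   # {'body': 'in-review', 'annexes': {'appendix-a': 'in-review'}}
```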
| Automated re-typesetting, indexing and content enhancement for scanned marriage registers | | BIBAK | Full-Text | 29-38 | |
| David F. Brailsford | |||
| For much of England and Wales marriage registers began to be kept in 1537.
The marriage details were recorded locally, and in longhand, until 1st July
1837, when central records began. All registers were kept in the local parish
church. In the period from 1896 to 1922 an attempt was made, by the Phillimore
company of London, using volunteer help, to transcribe marriage registers for
as many English parishes as possible and to have them printed. This paper
describes an experiment in the automated re-typesetting of Volume 2 of the
15-volume Phillimore series relating to the county of Derbyshire. The source
material was plain text derived from running Optical Character Recognition
(OCR) on a set of page scans taken from the original printed volume. The aim of
the experiment was to avoid any idea of labour-intensive page-by-page
rebuilding with tools such as Acrobat Capture. Instead, it proved possible to
capitalise on the regular, tabular, structure of the Register pages as a means
of automating the re-typesetting process, using UNIX troff software and its tbl
preprocessor. A series of simple software tools helped to bring about the
OCR-to-troff transformation. However, the re-typesetting of the text was not
just an end in itself but, additionally, a step on the way to content
enhancement and content repurposing. This included the indexing of the marriage
entries and their potential transformation into XML and GEDCOM notations. The
experiment has shown, for highly regular material, that the efforts of one
programmer, with suitable low-level tools, can be far more effective than
attempting to recreate the printed material using WYSIWYG software. Keywords: GEDCOM, OCR, genealogy, hyper-linking, indexing, re-typesetting, troff | |||
| Test collection management and labeling system | | BIBAK | Full-Text | 39-42 | |
| Eunyee Koh; Andruid Kerne; Sarah Berry | |||
| In order to evaluate the performance of information retrieval and extraction
algorithms, we need test collections. A test collection consists of a set of
documents, a clearly formed problem that an algorithm is supposed to provide
solutions to, and the answers that the algorithm should produce when executed
on the documents. Defining the association between elements in the test
collection and answers is known as labeling. For mainstream information
retrieval problems, there are publicly available test collections which have
been maintained for years. However, the scope of these problems, and thus the
associated test collections, is limited. In other cases, researchers need to
build, label, and manage their own test collections, which can be a tedious and
error-prone task. We were building test collections of HTML documents for
problems in which the answer that the algorithm supplies is a sub-tree of the
DOM (Document Object Model). To lighten the burden of this task, we developed a
test collection management and labeling system (TCMLS), to facilitate usability
in the process of building test collections, applying them to validate
algorithms, and potentially sharing them across the research community. Keywords: document object model, test collection, xml schema | |||
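For orientation, the snippet below shows the kind of bookkeeping such labeling implies when the answer is a DOM sub-tree: a label pairs a document with the path of the correct sub-tree, and evaluation checks whether an algorithm returns exactly that node. The schema, paths and the stand-in algorithm are invented for illustration and are not the TCMLS design; ElementTree also assumes well-formed XHTML rather than arbitrary HTML.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body><div>ads</div><div><p>main article</p></div></body></html>")
labels = [(doc, "./body/div[2]")]     # the second div is the labelled answer sub-tree

def take_largest_div(root):           # a stand-in "algorithm" under evaluation
    return max(root.iter("div"), key=lambda e: len(ET.tostring(e)))

def accuracy(algorithm, labels):
    hits = sum(algorithm(d) is d.find(path) for d, path in labels)
    return hits / len(labels)

print(accuracy(take_largest_div, labels))   # 1.0
```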
| A platform to automatically generate and incorporate documents into an ontology-based content repository | | BIBAK | Full-Text | 43-46 | |
| Matthias Heinrich; Antje Boehm-Peters; Martin Knechtel | |||
In order to access large information pools efficiently, data has to be
structured and categorized. Recently, applying ontologies to formalize
information has become an established approach. In particular, ontology-based
search and navigation are promising solutions which are capable of
significantly improving state-of-the-art systems (e.g. full-text search engines).
However, ontology roll-out and maintenance are costly tasks. Therefore, we
propose a documentation generation platform that automatically derives content
and incorporates the generated content into an existing ontology. The demanding
task of classifying content as concept instances and setting data type and object
properties is accomplished by the documentation generation platform.
Eventually, our approach results in a semantically enriched content base. Note
that no manual effort is required to establish links between content objects
and the ontology. Keywords: ontology completion, semantic annotation, software documentation, text
generation | |||
| Object-level document analysis of PDF files | | BIBAK | Full-Text | 47-55 | |
| Tamir Hassan | |||
| The PDF format is commonly used for the exchange of documents on the Web and
there is a growing need to understand and extract or repurpose data held in PDF
documents. Many systems for processing PDF files use algorithms designed for
scanned documents, which analyse a page based on its bitmap representation. We
believe this approach to be inefficient. Not only does the rasterization step
cost processing time, but information is also lost and errors can be
introduced.
Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods. Keywords: document analysis, pdf | |||
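The following sketch conveys the general flavour of bottom-up grouping over content-stream text fragments (it is not the paper's perception-based algorithm): fragments that share a baseline are merged into lines, and lines separated by a small vertical gap are merged into blocks. Coordinates and thresholds are illustrative.

```python
def group_lines(fragments, y_tol=2.0):
    """fragments: list of dicts with 'x', 'y' (baseline) and 'text'."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (-f["y"], f["x"])):
        if lines and abs(lines[-1]["y"] - frag["y"]) <= y_tol:
            lines[-1]["text"] += " " + frag["text"]
        else:
            lines.append({"y": frag["y"], "text": frag["text"]})
    return lines

def group_blocks(lines, gap=14.0):
    blocks, current = [], [lines[0]]
    for prev, line in zip(lines, lines[1:]):
        if prev["y"] - line["y"] > gap:      # larger vertical jump -> new block
            blocks.append(current)
            current = []
        current.append(line)
    blocks.append(current)
    return [" ".join(l["text"] for l in b) for b in blocks]

frags = [{"x": 72, "y": 700, "text": "Object-level"},
         {"x": 150, "y": 700, "text": "analysis"},
         {"x": 72, "y": 688, "text": "of PDF files."},
         {"x": 72, "y": 660, "text": "New block."}]
print(group_blocks(group_lines(frags)))
# ['Object-level analysis of PDF files.', 'New block.']
```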
| Aesthetic measure of alignment and regularity | | BIBAK | Full-Text | 56-65 | |
| Helen Y. Balinsky; Anthony J. Wiley; Matthew C. Roberts | |||
| To be effective as communications or sales tools, documents that are
personalized and customized for each customer must be visually appealing and
aesthetically pleasing. Producing perhaps millions of unique versions of
essentially the same document not only presents challenges to the printing
process but also disrupts the standard quality control procedures. The quality
of the alignment in each document can easily distinguish professional-looking
documents from amateur designs and some computer-generated layouts. A
multicomponent measure of document alignment and regularity, derived directly
from designer knowledge, is developed and presented in computable form. The
measure includes: edge quality, page connectivity, grid regularity and
alignment statistics. It is clear that these components may have different
levels of importance, relevance and acceptability for various document types
and classes, thus the proposed measure should always be evaluated against the
requirements of the desired class of documents. Keywords: aesthetic rules, alignment, automatic layout evaluation, designer grid,
regularity, the hough transform | |||
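To give a feel for the "alignment statistics" component alone (edge quality, page connectivity and grid regularity are not covered, and the weighting is invented), the toy measure below counts how many block edges on a page share an x-position with another edge, within a small tolerance.

```python
from collections import Counter

def alignment_score(blocks, tol=2):
    """blocks: list of (x_left, y_top, width, height) in points."""
    edges = Counter()
    for x, y, w, h in blocks:
        edges[round(x / tol)] += 1          # left edges
        edges[round((x + w) / tol)] += 1    # right edges
    shared = sum(c for c in edges.values() if c > 1)
    return shared / (2 * len(blocks))       # 1.0 = every edge aligned with another

tidy  = [(50, 50, 200, 40), (50, 100, 200, 300), (50, 420, 200, 60)]
messy = [(50, 50, 200, 40), (63, 100, 187, 300), (41, 420, 219, 60)]
print(alignment_score(tidy), alignment_score(messy))   # 1.0 vs ~0.33
```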
| Web article extraction for web printing: a DOM+visual based approach | | BIBAK | Full-Text | 66-69 | |
| Ping Luo; Jian Fan; Sam Liu; Fen Lin; Yuhong Xiong; Jerry Liu | |||
| This work studies the problem of extracting articles from Web pages for
better printing. Different from existing approaches of article extraction, Web
printing poses several unique requirements: 1) Identifying just the boundary
surrounding the text-body is not the ideal solution for article extraction. It
is highly desirable to filter out some uninformative links and advertisements
within this boundary. 2) It is necessary to identify paragraphs, which may not
be readily separated as DOM nodes, for the purpose of better layout of the
article. 3) Its performance should be independent of content domains, written
languages, and Web page templates. Toward these goals we propose a novel method
of article extraction using both DOM (Document Object Model) and visual
features. The main components of our method include: 1) a text
segment/paragraph identification algorithm based on line-breaking features, 2)
a global optimization method, Maximum Scoring Subsequence, based on text
segments for identifying the boundary of the article body, 3) an outlier
elimination step based on left or right alignment of text segments with the
article body. Our experiments showed the proposed method is effective in terms
of precision and recall at the level of text segments. Keywords: article extraction, maximal scoring subsequence | |||
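The boundary-finding step named in the abstract, Maximum Scoring Subsequence, is a standard algorithm; a generic version is sketched below under the assumption that each text segment has already been scored (positive for dense prose, negative for link- or ad-like segments). The scores themselves are made up.

```python
def article_boundary(scores):
    """Return inclusive (start, end) indices of the maximum-scoring run."""
    best_sum, best = float("-inf"), (0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:                 # a non-positive prefix never helps
            run_sum, run_start = s, i
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i)
    return best

# One score per segment: nav bar, headline, paragraphs, ad, paragraph, footer.
scores = [-2.0, 1.5, 3.0, 2.5, -0.5, 2.0, -3.0]
print(article_boundary(scores))   # (1, 5)
```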
| Indexing by permeability in block structured web pages | | BIBAK | Full-Text | 70-73 | |
| Emmanuel Bruno; Nicolas Faessel; Hervé Glotin; Jacques Le Maitre; Michel Scholl | |||
| We present in this paper a model that we have developed for indexing and
querying web pages based on their visual rendering. In this model pages are
split up into a set of visual blocks. The indexing of a block takes into
account its content, its visual importance and, by permeability, the indexing
of neighboring blocks. A page is modeled as a directed acyclic graph. Each node
is associated with a block and labeled by the coefficient of importance of this
block. Each edge is labeled by the coefficient of permeability of the target
node content to the source node content. Importance and permeability
coefficients cannot be manually quantified. In the second part of this paper, we
present an experiment consisting of learning optimal permeability coefficients
by gradient descent for indexing the images of a web page from the text blocks of
this page. The dataset is drawn from real web pages of the train and test set
of the ImagEval task2 corpus. Results demonstrate an improvement of the
indexing using non-uniform block permeabilities. Keywords: block importance, block permeability, content based image retrieval,
document indexing, document retrieval | |||
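A hedged reading of the indexing model, with invented block names, weights and coefficients: a block's index combines its own terms, scaled by its importance, with its neighbours' terms scaled additionally by the permeability of each incoming edge.

```python
from collections import defaultdict

blocks = {
    "caption": {"importance": 0.6, "terms": {"eiffel": 1.0, "tower": 1.0}},
    "image":   {"importance": 1.0, "terms": {}},          # no text of its own
}
edges = [("caption", "image", 0.8)]   # (source, target, permeability)

def index_block(name):
    idx = defaultdict(float)
    for term, w in blocks[name]["terms"].items():
        idx[term] += blocks[name]["importance"] * w
    for src, dst, perm in edges:       # terms absorbed from neighbouring blocks
        if dst == name:
            for term, w in blocks[src]["terms"].items():
                idx[term] += perm * blocks[src]["importance"] * w
    return dict(idx)

print(index_block("image"))   # eiffel and tower each weighted 0.8 * 0.6 = 0.48
```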
| Getting the most out of social annotations for web page classification | | BIBAK | Full-Text | 74-83 | |
| Arkaitz Zubiaga; Raquel Martínez; Víctor Fresno | |||
| User-generated annotations on social bookmarking sites can provide
interesting and promising metadata for web document management tasks like web
page classification. These user-generated annotations include diverse types of
information, such as tags and comments. Nonetheless, each kind of annotation
has a different nature and popularity level. In this work, we analyze and
evaluate the usefulness of each of these social annotations to classify web
pages over a taxonomy like that proposed by the Open Directory Project. We
compare them separately to the content-based classification, and also combine
the different types of data to augment performance. Our experiments show
encouraging results with the use of social annotations for this purpose, and we
found that combining these metadata with web page content further improves the
classifier's performance. Keywords: social annotations, social bookmarking, web page classification | |||
| Deriving image-text document surrogates to optimize cognition | | BIBAK | Full-Text | 84-93 | |
| Eunyee Koh; Andruid Kerne | |||
| The representation of information collections needs to be optimized for
human cognition. While documents often include rich visual components,
collections, including personal collections and those generated by search
engines, are typically represented by lists of text-only surrogates. By
concurrently invoking complementary components of human cognition, combined
image-text surrogates will help people to more effectively see, understand,
think about, and remember an information collection. This research develops
algorithmic methods that use the structural context of images in HTML documents
to associate meaningful text and thus derive combined image-text surrogates.
Our algorithm first recognizes which documents consist essentially of
informative and multimedia content. Then, the algorithm recognizes the
informative sub-trees within each such document, discards advertisements and
navigation, and extracts images with contextual descriptions. Experimental
results demonstrate the algorithm's efficacy. An implementation of the
algorithm is provided in combinFormation, a creativity support tool for
collection authoring. The combined image-text surrogates enhance the
experiences of users finding and collecting information as part of developing
new ideas. Keywords: information extraction, search representation, surrogates | |||
| HCX: an efficient hybrid clustering approach for XML documents | | BIBAK | Full-Text | 94-97 | |
| Sangeetha Kutty; Richi Nayak; Yuefeng Li | |||
| This paper proposes a novel Hybrid Clustering approach for XML documents
(HCX) that first determines the structural similarity in the form of frequent
subtrees and then uses these frequent subtrees to represent the constrained
content of the XML documents in order to determine the content similarity. The
empirical analysis reveals that the proposed method is scalable and accurate. Keywords: clustering, frequent mining, structure and content, subtree mining, xml
documents | |||
| From system requirements documents to integrated system modeling artifacts | | BIBAK | Full-Text | 98 | |
| Manfred H. B. Broy | |||
In the development of embedded systems, starting from high-level requirements
and moving on to the system specification and further to the architecture,
various aspects and issues have to be elicited, collected, analyzed and documented.
These range from early-phase content such as goals and high-level requirements
to more concrete requirements and finally to system specifications and
architecture design documents on which the final implementation of the system
is based.
Traditionally these contents have to be captured in documents such as product specification documents (in German: Lastenheft) and system specification documents (in German: Pflichtenheft). Typically, in the early phases of system development a large number of different documents is produced, all of which talk about different issues and aspects of the system and of its development. Unavoidably, many of these documents carry similar information and sometimes contain the same information in many different copies. Typically these documents are under continuous change due to new insights and changing constraints. As a result, configuration and version management and, in particular, change management of these documents become a nightmare. Every time an individual requirement, a goal or an aspect is modified, this modification has to be carried out consistently in all the documents. The changes produce new versions of the documents, and a configuration management of such documents is nearly impossible. As a result, information sometimes contained in more than 20 documents tends to become inconsistent. After a while there is a tendency not to update existing documents anymore and simply to accept that at the end of the project many of the documents are no longer up-to-date and no longer consistent. In the best case, an updated documentation is finally produced at the end of the project in a step of reverse engineering. In the worst case, a final consistent documentation is not produced at all, so that the documentation of the system is completely lost and a complicated and time-consuming reconstruction of the documentation has to be carried out later, in a step of reengineering, by the team that has to maintain the system -- by engineers who were often not involved in the development and are therefore not familiar with the contents of the project.
A different approach is based on content models, called artifact models (or also meta-models), in which the information about the system is captured in a structured way using modeling techniques: comprehensive product models describe all the relevant contents of a system and trace the relationships between these contents, so that there is no redundancy in the model, only relationships between its different parts. For this, a model-based development technique is most appropriate, in which substantial parts of the content are captured not by text in natural language but by specific modeling concepts. In the end, such an approach results in a life-cycle product-modeling management system that supports all the phases of system development and contains all relevant information about a product and its development, such that any kind of documentation about the system can be generated from the artifact model. In order to turn this vision of structured product models with high automation into reality, we need an integrated engineering environment that offers support for creating and managing models within well-defined process steps.
The integrated development environment should comprise the following four blocks: 1) a model repository that maintains the different artifacts including their dependencies, 2) advanced tools for editing models that directly support their users in building up models, 3) tools for analyzing the product model and synthesizing new artifacts out of it, and 4) a workflow engine to guide the engineers through the steps defined by the development process. Keywords: integrated artifact models, tool support | |||
| Review of automatic document formatting | | BIBAK | Full-Text | 99-108 | |
| Nathan Hurst; Wilmot Li; Kim Marriott | |||
| We review the literature on automatic document formatting with an emphasis
on recent work in the field. One common way to frame document formatting is as
a constrained optimization problem where decision variables encode element
placement, constraints enforce required geometric relationships, and the
objective function measures layout quality. We present existing research using
this framework, describing the kind of optimization problem being solved and
the basic optimization techniques used to solve it. Our review focuses on the
formatting of primarily textual documents, including both micro- and
macro-typographic concerns. We also cover techniques for automatic table
layout. Related problems such as widget and diagram layout, as well as temporal
layout issues that arise in multimedia documents are outside the scope of this
review. Keywords: adaptive layout, optimization techniques, typography | |||
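As one self-contained instance of the "formatting as constrained optimization" framing the review uses (the example is ours, not drawn from the surveyed papers), the sketch below breaks a word sequence into lines by dynamic programming, minimizing the summed squared slack of all lines except the last.

```python
def break_lines(words, width):
    n = len(words)
    best = [0.0] + [float("inf")] * n   # best[j] = minimal cost of formatting words[:j]
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):              # try words[i:j] as the last line
            line_len = sum(len(w) for w in words[i:j]) + (j - i - 1)  # inter-word spaces
            if line_len > width:
                continue
            slack = 0 if j == n else (width - line_len)   # the last line is free
            cost = best[i] + slack ** 2
            if cost < best[j]:
                best[j], prev[j] = cost, i
    lines, j = [], n
    while j > 0:                        # recover the chosen break points
        lines.append(" ".join(words[prev[j]:j]))
        j = prev[j]
    return list(reversed(lines))

print(break_lines("the quick brown fox jumps over the lazy dog".split(), 16))
# ['the quick brown', 'fox jumps over', 'the lazy dog']
```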
| Job profiling in high performance printing | | BIBAK | Full-Text | 109-118 | |
| Thiago Nunes; Fabio Giannetti; Mariana Kolberg; Rafael Nemetz; Alexis Cabeda; Luiz Gustavo Fernandes | |||
| Digital presses have consistently improved their speed in the past ten
years. Meanwhile, the need for document personalization and customization has
increased. As a consequence of these two facts, the traditional RIP (Raster
Image Processing) process has become a highly demanding computational step in
the print workflow. Print Service Providers (PSPs) are now using multiple RIP
engines and parallelization strategies to speed up the whole ripping process,
which is currently carried out on a per-page basis. Nevertheless, these strategies are
not optimized in terms of assuring the best Return On Investment (ROI) for the
RIP engines. Depending on the characteristics of the input document jobs, the ripping
step may not achieve the print-engine speed, creating an unwanted bottleneck. The
aim of this paper is to present a way to improve the ROI of PSPs by proposing a
profiling strategy which enables the optimal usage of RIPs for specific job
features, ensuring that jobs are always consumed at least at engine speed. The
profiling strategy is based on a per-page analysis of input PDF jobs that
identifies their key components. This work introduces a profiler tool to
extract information from jobs and some metrics to predict a job's ripping cost
based on its profile. This information is extremely useful during the job
splitting step, since jobs can then be split in a clever way. This improves the load
balance of the allocated RIP engines and makes the overall process faster.
Finally, experimental results are presented in order to evaluate both the
profiler and the proposed metrics. Keywords: digital printing, job profiling, parallel processing, pdf, performance
evaluation, print, print queue, raster image processing | |||
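The splitting idea can be pictured with a small sketch: predict a per-page cost from profiled features and assign pages to RIP engines greedily, always giving the next most expensive page to the least loaded engine. The feature names and weights are hypothetical, not the paper's metrics.

```python
import heapq

WEIGHTS = {"images": 3.0, "transparency": 5.0, "text_runs": 0.2}  # hypothetical

def page_cost(features):
    return sum(WEIGHTS[k] * v for k, v in features.items())

def split_job(pages, n_rips):
    """pages: list of per-page feature dicts from the profiler."""
    heap = [(0.0, rip, []) for rip in range(n_rips)]     # (load, rip id, pages)
    costs = sorted(enumerate(map(page_cost, pages)), key=lambda x: -x[1])
    for page_no, cost in costs:                          # most expensive page first
        load, rip, assigned = heapq.heappop(heap)        # least loaded RIP
        assigned.append(page_no)
        heapq.heappush(heap, (load + cost, rip, assigned))
    return {rip: sorted(assigned) for _, rip, assigned in heap}

job = [{"images": 4, "transparency": 1, "text_runs": 10},
       {"images": 0, "transparency": 0, "text_runs": 40},
       {"images": 2, "transparency": 3, "text_runs": 5}]
print(split_job(job, 2))   # {0: [2], 1: [0, 1]}
```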
| Aesthetically-driven layout engine | | BIBAK | Full-Text | 119-122 | |
| Helen Y. Balinsky; Jonathan R. Howes; Anthony J. Wiley | |||
| A novel Aesthetically-Driven Layout (ADL) engine for automatic production of
highly customized, non-flow documents is proposed. In a non-flow document,
where each page is composed of separable images and text blocks, aesthetic
considerations may take precedence over the sequencing of the content. Such
layout methods are most suitable for the construction of personalized
catalogues, advertising flyers and sales and marketing material, all of which
rely heavily on their aesthetics in order to successfully reach their intended
audience. The non-flow algorithm described here permits the dynamic creation of
page layouts around pre-existing static page content. Pages pre-populated with
static content may include reserved areas which are filled at run-time. The
remainder of a page, which is neither convex, nor simply-connected, is
automatically filled with customer-relevant content by following the
professional manual design strategy of multiple levels of layout resolution.
The page designer's preferences, style and aesthetic rules are taken into account
at every stage, with the highest-scoring layout being selected. Keywords: alignment, fixed content, high customization and personalization, non-flow
documents, regularity | |||
| Automated extensible XML tree diagrams | | BIBAK | Full-Text | 123-126 | |
| John Lumley | |||
| XML is a tree-oriented meta-language and understanding XML structures can
often involve the construction of visual trees. These trees may use a variety
of graphics for chosen elements and often condense or elide sections of the
tree to aid focus, as well as adding extra explanatory graphical material such
as callouts and cross-tree links. We outline an automated approach for building
such trees with great flexibility, based on the use of XSLT, SVG and a
functional layout package. This paper concentrates on techniques to declare and
implement such flexible decoration, rather than the layout of the tree itself. Keywords: functional programming, svg, xml trees, xslt | |||
| Effect of copying and restoration on color barcode payload density | | BIBAK | Full-Text | 127-130 | |
| Steven J. Simske; Margaret Sturgill; Jason S. Aronoff | |||
| 2D barcodes are taking on increasing significance as the ubiquity of
high-resolution cameras, combined with the availability of variable data
printing, drives an increasing number of "click and connect" applications.
Barcodes therefore serve as an increasingly significant connection between
physical and electronic portions, or versions, of documents. The use of color
provides many additional advantages, including increased payload density and
security. In this paper, we consider four factors affecting the readable
payload in a color barcode: (1) number of print-scan (PS), or copy, cycles, (2)
image restoration to offset PS-induced degradation, (3) the authentication
algorithm used, and (4) the use of spectral pre-compensation (SPC) to optimize
the color settings for the color barcodes. The PS cycle was shown to
consistently reduce payload density by approximately 55% under all tested
conditions. SPC nearly doubled the payload density, and selecting the better
authentication algorithm increased payload density by roughly 50% in the mean.
Restoration, however, was found to increase payload density less substantially
(~30%), and only when combined with the optimized settings for SPC. These
results are also discussed in light of optimizing payload density for the
generation of document security deterrents. Keywords: 3d bar codes, color compensation, color tiles, image restoration, payload
density, security printing | |||
| Layout-aware limiarization for readability enhancement of degraded historical documents | | BIBAK | Full-Text | 131-134 | |
| Flávio Bertholdo; Eduardo Valle; Arnaldo de A. Araújo | |||
| In this paper we propose a technique of limiarization (also known as
thresholding or binarization) tailored to improve the readability of degraded
historical documents. Limiarization is a simple image processing technique,
which is employed in many complex tasks like image compression, object
segmentation and character recognition. The technique also finds applications
in its own right: since it results in a high-contrast image, in which the foreground
is clearly separated from the background, it can greatly improve the
readability of a document, provided that other attributes (like character
shape) do not suffer. Our technique exploits statistical characteristics of
textual documents and applies both global and local thresholding. Under visual
inspection, in experiments on a collection of severely degraded historical
documents, it compares favorably with the state of the art. Keywords: binarization, historical documents, image enhancement, limiarization,
readability improvement | |||
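For readers unfamiliar with the combination, here is a generic global-plus-local thresholding sketch in the spirit of what the abstract describes: an Otsu-style global threshold, refined by requiring each candidate pixel to also be darker than a fraction of its local neighbourhood mean. The parameters and the combination rule are illustrative, not the paper's statistics.

```python
import numpy as np

def binarize(gray, window=15, k=0.95):
    """gray: 2-D uint8 array; returns a boolean foreground (ink) mask."""
    # Global step: Otsu's threshold from the grey-level histogram.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    w = np.cumsum(hist)
    mu = np.cumsum(hist * np.arange(256))
    w_f = w[-1] - w
    mean_b = np.divide(mu, w, out=np.zeros_like(mu), where=w > 0)
    mean_f = np.divide(mu[-1] - mu, w_f, out=np.zeros_like(mu), where=w_f > 0)
    t_global = int(np.argmax(w * w_f * (mean_b - mean_f) ** 2))
    # Local step: keep a pixel only if it is also darker than k times the
    # mean of its neighbourhood, which helps with unevenly lit backgrounds.
    pad = window // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    local_mean = np.array([[padded[i:i + window, j:j + window].mean()
                            for j in range(gray.shape[1])]
                           for i in range(gray.shape[0])])
    return (gray <= t_global) & (gray < k * local_mean)

page = np.full((40, 40), 200, dtype=np.uint8)
page[5:35, 18:21] = 60                     # a thin dark stroke on a light page
print(binarize(page).sum())                # 90 stroke pixels recovered
```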
| Geometric consistency checking for local-descriptor based document retrieval | | BIBAK | Full-Text | 135-138 | |
| Eduardo Valle; David Picard; Matthieu Cord | |||
| In this paper, we evaluate different geometric consistency schemes, which
can be used in tandem with an efficient architecture, based on voting and local
descriptors, to retrieve multimedia documents. In many contexts the geometric
consistency enforcement is essential to boost the retrieval performance. Our
empirical results show, however, that geometric consistency alone is unable to
guarantee high-quality results in databases that contain too many
non-discriminating descriptors. Keywords: cbir, geometric consistency, image retrieval, local descriptors, retrieval
by voting | |||
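One very simple consistency scheme of the general kind being compared (purely illustrative, not one of the paper's evaluated schemes): matched local descriptors vote for the translation between query and candidate image, and only the largest bucket of mutually consistent votes counts toward the retrieval score.

```python
from collections import Counter

def consistent_votes(matches, bucket=10):
    """matches: list of ((xq, yq), (xc, yc)) matched keypoint coordinates."""
    votes = Counter()
    for (xq, yq), (xc, yc) in matches:
        votes[(round((xc - xq) / bucket), round((yc - yq) / bucket))] += 1
    return max(votes.values()) if votes else 0

good = [((10, 10), (60, 34)), ((40, 80), (91, 104)), ((120, 5), (170, 29))]
noise = [((10, 10), (200, 7)), ((40, 80), (12, 300)), ((120, 5), (99, 180))]
print(consistent_votes(good), consistent_votes(noise))   # 3 vs 1
```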
| A REST protocol and composite format for interactive web documents | | BIBAK | Full-Text | 139-148 | |
| John M. Boyer; Charles F. Wiecha; Rahul P. Akolkar | |||
| Documents allow end-users to encapsulate information related to a
collaborative business process into a package that can be saved, emailed,
digitally signed, and used as the basis of interaction in an activity or an ad
hoc workflow. While documents are used incidentally today in web applications,
for example in HTML presentations of content stored otherwise in back-end
systems, they are not yet the central artifact for developers of dynamic, data
intensive web applications. This paper unifies the storage and management of
the various artifacts of web applications into an Interactive Web Document
(IWD). Data content, presentation, behavior, attachments, and digital
signatures collected throughout the business process are unified into a single
composite web resource. We describe a REST-based protocol for interacting with
IWDs and a standards-based approach to packaging their multiple constituent
artifacts into IWD archives based on the Open Document Format standard. Keywords: collaboration, document-centric, html, odf, rich internet application,
scxml, web application, workflow, xforms | |||
| Adding dynamic visual manipulations to declarative multimedia documents | | BIBAK | Full-Text | 149-152 | |
| Fons Kuijk; Rodrigo Laiola Guimarães; Pablo Cesar; Dick C. A. Bulterman | |||
| The objective of this work is to define a document model extension that
enables complex spatial and temporal interactions within multimedia documents.
As an example we describe an authoring interface of a photo sharing system that
can be used to capture stories in an open, declarative format. The document
model extension defines visual transformations for synchronized navigation
driven by dynamic associated content. Due to the open declarative format, the
presentation content can be targeted to individuals, while maintaining the
underlying data model. The impact of this work is reflected in its recent
standardization in the W3C SMIL language. Multimedia players such as Ambulant and
RealPlayer support the extension described in this paper. Keywords: animation, content enrichment, declarative language, media annotation, pan
and zoom, photo sharing, smil | |||
| Enriching the interactive user experience of open document format | | BIBAK | Full-Text | 153-156 | |
| John M. Boyer; Charles F. Wiecha | |||
| The typical user experience of office documents is geared to the passive
recording of user content creation. In this paper, we describe how to provide
more active content within such documents based on elaborating the integration
between Open Document Format (ODF) and the W3C standard for rich interactivity
and data management in web pages (XForms). This includes assignment of more
comprehensive behaviors to single form controls, better control over
collections of form controls, readonly and conditional sections of mixed
content and form controls, and dynamically repeated sections automatically
responsive to data changes, including data obtained from web services invoked
during user interaction with the document. Keywords: accessibility, interactive documents, odf, xforms | |||
| An e-writer for documents plus strokes | | BIBAK | Full-Text | 157-160 | |
| Michael J. Gormish; Kurt Piersol; Ken Gudan; John Barrus | |||
| This paper describes the hardware, software, and a document model for a
prototype E-Writer. Paper-like displays have proved useful in E-Readers like
the Kindle in part because of low power usage and the ability to read indoors
and out. We focus on emulating other properties of paper in the E-Writer:
everyone knows how to use it, and users can write anywhere on the page. By
focusing on a simple document model consisting primarily of images and strokes
we enabled rapid application development that integrates easily with current
paper-based document workflows. This paper includes preliminary reports on
usage of the E-Writer and its software by a small test group. Keywords: electronic paper, paper-like, paperless, pen strokes, workflow | |||
| Movie script markup language | | BIBAK | Full-Text | 161-170 | |
| Dieter Van Rijsselbergen; Barbara Van De Keer; Maarten Verwaest; Erik Mannens; Rik Van de Walle | |||
| This paper introduces the Movie Script Markup Language (MSML), a document
specification for the structural representation of screenplay narratives for
television and feature film drama production. Its definition was motivated by a
lack of available structured and open formats that describe dramatic narrative
but also support IT-based production methods of audiovisual drama. The MSML
specification fully supports contemporary screenplay templates in a structured
fashion, and adds provisions for drama manufacturing methods that allow drama
crew to define how narrative can be translated to audiovisual material. A
timing model based on timed Petri nets is included to enable fine-grained event
synchronization. Finally, MSML comprises an animation module through which
narrative events can drive production elements like 3-D previsualization,
content repurposing or studio automation. MSML is currently serialized into XML
documents and is formally described by a combination of an XML Schema and an ISO
Schematron schema. The specification has been developed in close collaboration
with actual drama production crew and has been implemented in a number of
proof-of-concept demonstrators. Keywords: drama production, narratives, screenplay, xml | |||
| Annotations with EARMARK for arbitrary, overlapping and out-of-order markup | | BIBAK | Full-Text | 171-180 | |
| Silvio Peroni; Fabio Vitali | |||
| In this paper we propose a novel approach to markup, called Extreme
Annotational RDF Markup (EARMARK), using RDF and OWL to annotate features in
text content that cannot be mapped with usual markup languages. EARMARK
provides a unifying framework to handle tree-based XML features as well as more
complex markup for non-XML scenarios such as overlapping elements, repeated and
non-contiguous ranges and structured attributes. EARMARK includes and expands
the principles of XML markup, RDFa inline annotations and existing approaches
to overlapping markup such as LMNL and TexMecs. EARMARK documents can also be
linearized into plain XML by choosing any of a number of strategies to express
a tree-based subset of the annotations as an XML structure and fitting in the
remaining annotations through a number of "tricks", markup expedients for
hierarchical linearization of non-hierarchical features. EARMARK provides a
solid platform for providing vocabulary-independent declarative support to
advanced document features such as transclusion, overlapping and out-of-order
annotations within a conceptually insensitive environment such as XML, and does
so by exploiting recent semantic web concepts and languages. Keywords: earmark, markup, overlapping markup, owl, xpointer | |||
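The underlying idea, stripped of EARMARK's RDF/OWL machinery, is stand-off annotation: annotations point into the text by offsets, so ranges can overlap, nest or arrive out of order without having to fit a single XML hierarchy. The labels and offsets below are invented purely to show the shape of such data.

```python
text = "He said hello and left"

annotations = [                       # (start, end, label) character ranges
    (0, 13, "clause"),
    (8, 22, "overlapping-range"),     # overlaps the first range
    (3, 13, "verb-phrase"),           # nested inside the first range
]

for start, end, label in sorted(annotations):
    print(f"{label:18s} {text[start:end]!r}")
```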
| Creation and maintenance of multi-structured documents | | BIBAK | Full-Text | 181-184 | |
| Pierre-Édouard Portier; Sylvie Calabretto | |||
| In this article, we introduce a new problem: the construction of
multi-structured documents. We first offer an overview of existing solutions to
the representation of such documents. We then notice that none of them consider
the problem of their construction. In this context, we use our experience with
philosophers who are building a digital edition of the work of Jean-Toussaint
Desanti, in order to present a methodology for the construction of
multi-structured documents. This methodology is based on the MSDM model in
order to represent such documents. Moreover each step of the methodology has
been implemented in the Haskell functional programming language. Keywords: digital libraries, haskell, overlapping hierarchies, xml | |||
| From rhetorical structures to document structure: shallow pragmatic analysis for document engineering | | BIBAK | Full-Text | 185-192 | |
| Gersende Georg; Hugo Hernault; Marc Cavazza; Helmut Prendinger; Mitsuru Ishizuka | |||
| In this paper, we extend previous work on the automatic structuring of
medical documents using content analysis. Our long-term objective is to take
advantage of specific rhetoric markers encountered in specialized medical
documents (clinical guidelines) to automatically structure free text according
to its role in the document. This should enable to generate multiple views of
the same document depending on the target audience, generate document
summaries, as well as facilitating knowledge extraction from text. We have
established in previous work that the structure of clinical guidelines could be
refined through the identification of a limited set of deontic operators. We
now propose to extend this approach by analyzing the text delimited by these
operators using Rhetorical Structure Theory. The emphasis on causality and time
in RST proves a powerful complement to the recognition of deontic structures
while retaining the same philosophy of high-level recognition of sentence
structure, which can be converted into application-specific mark-ups.
Throughout the paper, we illustrate our findings through results produced by
the automatic processing of English guidelines for the management of
hypertension and Alzheimer disease. Keywords: medical document processing, natural language processing | |||
| On lexical resources for digitization of historical documents | | BIBAK | Full-Text | 193-200 | |
| Annette Gotscharek; Ulrich Reffle; Christoph Ringlstetter; Klaus U. Schulz | |||
| Many European libraries are currently engaged in mass digitization projects
that aim to make historical documents and corpora available online.
In this context, appropriate lexical resources play a double role.
They are needed to improve OCR recognition of historical documents, which
currently does not lead to satisfactory results. Second, even assuming a
perfect OCR recognition, since historical language differs considerably from
modern language, the matching process between queries submitted to search
engines and variants of the search terms found in historical documents needs
special support. While the usefulness of special dictionaries for both problems
seems undisputed, concrete knowledge and experience are still missing. There
are no hints about what optimal lexical resources for historical documents
should look like. The real benefit reached by optimized lexical resources is
unclear. Both questions are rather complex since answers depend on the point in
history when documents were born. We present a series of experiments which
illuminate these points. For our evaluations we collected a large corpus
covering German historical documents from before 1500 to 1950 and constructed
various types of dictionaries. We present the coverage reached with each
dictionary for ten subperiods of time. Additional experiments illuminate the
improvements for OCR accuracy and Information Retrieval that can be reached,
again looking at distinct dictionaries and periods of time. For both OCR and
IR, our lexical resources lead to substantial improvements. Keywords: electronic lexica, historical spelling variants, information retrieval | |||
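The coverage figures the abstract mentions boil down to a simple ratio; the snippet below computes it for a toy early-modern German sentence against a modern and a historically extended lexicon (both stand-ins, not the authors' resources).

```python
import re

def coverage(corpus_text, dictionary):
    """Share of corpus tokens that the given lexical resource recognises."""
    tokens = re.findall(r"\w+", corpus_text.lower())
    known = sum(1 for t in tokens if t in dictionary)
    return known / len(tokens) if tokens else 0.0

historical_text = "Vnd der Herr sprach zu jm"          # early modern spelling
modern_lexicon  = {"und", "der", "herr", "sprach", "zu", "ihm"}
hist_lexicon    = modern_lexicon | {"vnd", "jm"}
print(coverage(historical_text, modern_lexicon))        # 0.666...
print(coverage(historical_text, hist_lexicon))          # 1.0
```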
| A panlingual anomalous text detector | | BIBAK | Full-Text | 201-204 | |
| Ashok C. Popat | |||
| In a large-scale book scanning operation, material can vary widely in
language, script, genre, domain, print quality, and other factors, giving rise
to a corresponding variability in the OCRed text. It is often desirable to
automatically detect errorful and otherwise anomalous text segments, so that
they can be filtered out or appropriately flagged, for such applications as
indexing, mining, analyzing, displaying, and selectively re-processing such
data. Moreover, it is advantageous to require that the automated detector be
independent of the underlying OCR engine (or engines), that it work over a
broad range of languages, that it seamlessly handle mixed-language material,
and that it accommodate documents that contain domain-specific and otherwise
rare terminology. A technique is presented that satisfies these requirements,
using an adaptive mixture of character-level N-gram language models. Its
design, training, implementation, and evaluation are described within the
context of high-volume book scanning. Keywords: garbage strings, language identification, mixture models, ppm, text quality,
witten-bell | |||
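In much-reduced form, the ingredient named in the abstract looks like this: character-level n-gram models (here trained on two tiny samples), mixed uniformly with add-one smoothing, give each segment an average log-probability that drops for garbled OCR output. The training data, the smoothing and the fixed mixture weights are simplifications, not the paper's adaptive design.

```python
import math
from collections import Counter

def train(text, n=3):
    model = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return model, sum(model.values())

def avg_logprob(text, models, n=3, vocab=10000):
    grams = [text[i:i + n] for i in range(len(text) - n + 1)] or [text]
    total = 0.0
    for g in grams:
        # uniform mixture of the component models, each with add-one smoothing
        p = sum((m[g] + 1) / (tot + vocab) for m, tot in models) / len(models)
        total += math.log(p)
    return total / len(grams)

models = [train("the quick brown fox jumps over the lazy dog and runs home "),
          train("der schnelle braune fuchs springt ueber den faulen hund ")]
print(avg_logprob("the brown dog runs over", models))   # higher (ordinary text)
print(avg_logprob("q7#xx@zq9 zz!!", models))            # lower (anomalous text)
```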
| Update summarization based on novel topic distribution | | BIBAK | Full-Text | 205-213 | |
| Josef Steinberger; Karel Ježek | |||
| This paper deals with our recent research in text summarization. The field
has moved from multi-document summarization to update summarization. When
producing an update summary of a set of topic-related documents the summarizer
assumes prior knowledge of the reader determined by a set of older documents of
the same topic. The update summarizer thus must solve a novelty vs. redundancy
problem. We describe the development of our summarizer which is based on
Iterative Residual Rescaling (IRR) that creates the latent semantic space of a
set of documents under consideration. IRR generalizes Singular Value
Decomposition (SVD) and makes it possible to control the influence of major and minor
topics in the latent space. Our sentence-extractive summarization method
computes the redundancy, novelty and significance of each topic. These values
are finally used in the sentence selection process. The sentence selection
component prevents inner summary redundancy. The results of our participation
in TAC evaluation seem to be promising. Keywords: iterative residual rescaling, latent semantic analysis, summary evaluation,
text summarization | |||
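To make the ranking idea tangible, here is the plain SVD/LSA sentence scoring that IRR generalizes, combined with a crude novelty factor against the older documents; the toy sentences, the novelty formula and the absence of IRR's topic control are all simplifications of ours, not the authors' system.

```python
import numpy as np

def bow(sentences, vocab):
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[vocab[w], j] += 1
    return A

old = ["the storm hit the coast", "the storm caused floods"]
new = ["the storm weakened overnight",
       "rescue teams reached the flooded villages",
       "the storm caused floods"]                 # already known, hence redundant

vocab = {w: i for i, w in enumerate(sorted({w for s in old + new for w in s.lower().split()}))}
A, B = bow(new, vocab), bow(old, vocab)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
significance = np.sqrt(((S[:, None] * Vt) ** 2).sum(axis=0))   # per-sentence latent weight

An = A / np.linalg.norm(A, axis=0)                # cosine similarity to old documents
Bn = B / np.linalg.norm(B, axis=0)
novelty = 1 - (An.T @ Bn).max(axis=1)             # 0 means identical to an old document

best = int(np.argmax(significance * novelty))
print(new[best])    # a sentence that is both significant and novel is selected
```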
| Linguistic editing support | | BIBAK | Full-Text | 214-217 | |
| Michael Piotrowski; Cerstin Mahlow | |||
| Unlike programmers, authors only get very little support from their writing
tools, i.e., their word processors and editors. Current editors are unaware of
the objects and structures of natural languages and only offer character-based
operations for manipulating text. Writers thus have to execute complex
sequences of low-level functions to achieve their rhetoric or stylistic goals
while composing. Software requiring long and complex sequences of operations
causes users to make slips. In the case of editing and revising, these slips
result in typical revision errors, such as sentences without a verb, agreement
errors, or incorrect word order. In the LingURed project, we are developing
language-aware editing functions to prevent errors. These functions operate on
linguistic elements, not characters, thus shortening the command sequences
writers have to execute. This paper describes the motivation and background of
the LingURed project and shows some prototypical language-aware functions. Keywords: action slips, authoring, cognitive load, computational linguistics,
language-aware editing, revising | |||
| Web document text and images extraction using DOM analysis and natural language processing | | BIBAK | Full-Text | 218-221 | |
| Parag Mulendra Joshi; Sam Liu | |||
The Web has emerged as the most important source of information in the world.
This has resulted in a need for automated software components to analyze web
pages and harvest useful information from them. However, in typical web pages
the informative content is surrounded by a very high degree of noise in the
form of advertisements, navigation bars, links to other content, etc. Often the
noisy content is interspersed with the main content leaving no clean boundaries
between them. This noisy content makes the problem of information harvesting
from web pages much harder. Therefore, it is essential to be able to identify
main content of a web page and automatically isolate it from noisy content for
any further analysis. Most existing approaches rely on prior knowledge of
website specific templates and hand-crafted rules specific to websites for
extraction of relevant content. We propose a generic approach that does not
require prior knowledge of website templates. While HTML DOM analysis and
visual layout analysis approaches have sometimes been used, we believe that for
higher accuracy in content extraction, the analyzing software needs to mimic a
human user and understand content in natural language similar to the way humans
intuitively do in order to eliminate noisy content.
In this paper, we describe a combination of HTML DOM analysis and Natural Language Processing (NLP) techniques for automated extractions of main article with associated images from web pages. Keywords: dom trees, html documents, image extraction, natural language processing,
web page text extraction | |||
| Relating declarative hypermedia objects and imperative objects through the NCL glue language | | BIBAK | Full-Text | 222-230 | |
| Luiz Fernando Gomes Soares; Marcelo Ferreira Moreno; Francisco Sant'Anna | |||
| This paper focuses on the support provided by NCL (Nested Context Language)
to relate objects with imperative code content and declarative
hypermedia-objects (objects with declarative code content specifying hypermedia
documents). NCL is the declarative language of the Brazilian Terrestrial
Digital TV System (SBTVD) supported by its middleware called Ginga. NCL and
Ginga are part of ISDB standards and also of ITU-T Recommendations for IPTV
services.
The main contribution of this paper is the seamless way NCL integrates imperative and declarative language paradigms with no intrusion, maintaining a clear limit between embedded objects, independent of their coding content, and defining a behavior model that avoids side effects from one paradigm use to another. Keywords: declarative and imperative code content, digital tv, glue language,
intermedia synchronization, middleware, ncl | |||
| Using DITA for documenting software product lines | | BIBAK | Full-Text | 231-240 | |
| Oscar Díaz; Felipe I. Anfurrutia; Jon Kortabitarte | |||
| Aligning the software process and the documentation process is a recipe for
having both software and documentation in synchrony where changes in software
seamlessly ripple along its documentation counterpart. This paper focuses on
documentation for Software Product Lines (SPLs). An SPL is not intended to build
one application, but a number of them: a product family. In contrast to
single-software product development, SPL development is based on the idea that
the distinct products of the family share a significant amount of assets. This
forces a change in the software process. Likewise, software documentation
development should now mimic their code counterpart: product documentation
should also be produced out of a common set of assets. Specifically, the paper
shows how the DITA process and documents are recast using a feature-oriented
approach, a realization mechanism for SPLs. In so doing, documentation
artifacts are produced at the same pace and with similar variability
mechanisms to those used for code artifacts. This brings three main
advantages: uniformity, separation of concerns, and timely and accurate
delivery of the documentation. Keywords: dita, documentation, feature oriented programming, software product lines | |||
| Declarative interfaces for dynamic widgets communications | | BIBAK | Full-Text | 241-244 | |
| Cyril Concolato; Jean Le Feuvre; Jean-Claude Dufourd | |||
| Widgets are small and focused multimedia applications that can be found on
desktop computers, mobile devices or even TV sets. Widgets rely on structured
documents to describe their spatial, temporal and interactive behavior but also
to communicate with remote data sources. However, these sources have to be
known at authoring time and the communication process relies heavily on
scripting. In this paper, we describe a mechanism enabling the communication
between widgets and their dynamic environment (other widgets, remote data
sources). The proposed declarative mechanism is compatible with existing
Widgets technologies, usable with script-based widgets as well as with fully
declarative widgets. A description of an implementation is also provided. Keywords: communication interface, declarative languages, rich media, scripting
interface, widget | |||
| XSL-FO 2.0: automated publishing for graphic documents | | BIBAK | Full-Text | 245-246 | |
| Fabio Giannetti | |||
| The W3C (World Wide Web Consortium) is in the process of developing the
second major version of XSL-FO (eXtensible Stylesheet Language -- Formatting
Objects) [1], the formatting specification component of XSL. XSL-FO is widely
deployed in industry and academia where multiple output forms (typically print
and online) are needed from single source XML. It is used in many diverse
applications and countries on a large number of implementations to create
technical documentation, reports and contracts, terms and conditions, invoices
and other forms processing, such as driver's licenses, postal forms, etc.
XSL-FO is also widely used for heavy multilingual work because of the
internationalization aspects provided in 1.0 to accommodate multiple and mixed
writing modes (writing directions such as left-to-right, top-to-bottom,
right-to-left, etc.) of the world's languages. The primary goals of the W3C XSL
Working Group in developing XSL 2.0 are to provide more sophisticated
formatting and layout, enhanced internationalization to provide special
formatting objects for Japanese and other Asian and non-Western languages and
scripts and to improve integration with other technologies such as SVG
(Scalable Vector Graphics) [2] and MathML (Mathematical Markup Language) [3]. A
number of XSL 1.0 implementations already support dynamic inclusion of vector
graphics using W3C SVG. The XSL and SVG WGs want to define a tighter interface
between XSL-FO and SVG to provide enhanced functionality. Experiments [4] with
the use of SVG paths to create non-rectangular text regions, or "run-arounds",
have helped to motivate further work on deeper integration of SVG graphics
inside XSL-FO documents, and to work with the SVG WG on specifying the meaning
of XSL-FO markup inside SVG graphics. A similar level of integration with
MathML is contemplated. Keywords: content driven pagination, graphic design, layout, math ml, print, svg,
template, transactional printing, variable data print, xml, xsl-fo | |||
| GraphWrap: a system for interactive wrapping of PDF documents using graph matching techniques | | BIBAK | Full-Text | 247-248 | |
| Tamir Hassan | |||
| We present GraphWrap, a novel and innovative approach to wrapping PDF
documents. The PDF format is often used to publish large amounts of structured
data, such as product specifications, measurements, prices or contact
information. As the PDF format is unstructured, it is very difficult to use
this data in machine processing applications. Wrapping is the process of
navigating the data source, semi-automatically extracting the data and
transforming it into a structured form.
GraphWrap enables a non-expert user to create such data extraction programs for almost any PDF file in an intuitive and interactive manner. We show how a wrapper can be created by selecting an example instance and interacting with the graph representation to set conditions and choose which data items to extract. In the background, the corresponding instances are found using an algorithm based on subgraph isomorphism. The resulting wrapper can then be run on other pages and documents which exhibit a similar visual structure. Keywords: pdf documents, wrapping | |||
| A web-based version editor for XML documents | | BIBAK | Full-Text | 249-250 | |
| Luis Arévalo Rosado; Antonio Polo Márquez; Miryam Salas Sánchez | |||
| The goal of this demonstration is to show a web-based editor for versioned
XML documents. The user interface is an Ajax-based application characterized by
its friendliness, its simplicity and its intuitive editing of XML documents as well
as their versions, thereby sparing users the complexity of a versioning
system. To store the XML documents, a native XML database is used, which has
been extended to support versioning features as shown in [3]. Keywords: ajax editor, branch versioning, historical xml information, xml native
databases, xml versions | |||
| Logic-based verification of technical documentation | | BIBAK | Full-Text | 251-252 | |
| Christian Schönberg; Franz Weitl; Mirjana Jaksic; Burkhard Freitag | |||
| Checking the content coherence of digital documents is the purpose of the
Verdikt system, which can be applied to different domains and document types
including technical documentation, e-learning documents, and web pages. An
expressive temporal description logic allows for the specification of content
consistency criteria along document paths. Whether the document conforms to the
specification can then be verified by applying a model checker. In case of
specification violations, the model checker provides counterexamples, locating
errors in the document precisely. Based on a sample technical documentation in
the form of a web document, the general verification process and its
effectiveness, efficiency, and usability are demonstrated. Keywords: document verification, model checking | |||