| Document engineering education | | BIBAK | Full-Text | 1 | |
| Ethan V. Munson | |||
| This working session will be a roundtable discussion of document engineering
education. The working session's goal is to allow educators and researchers to
share their experiences in teaching topics related to document engineering. The
hope is that these discussions will stimulate the development of common
resources, including syllabi, reading lists, and exercises in order to
facilitate the spread of document engineering as a viable topic for study. Keywords: curriculum, document engineering, education | |||
| Navigating documents using ontologies, taxonomies and folksonomies | | BIBAK | Full-Text | 2 | |
| Margaret-Anne D. Storey | |||
| Navigating computer-based information landscapes can be a challenging task
for humans in almost any knowledge domain. Most documentation spaces are large,
complex and ever-changing, which creates a significant cognitive burden on the
end-user. Effective tool support can help orient the user and guide them to an
appropriate place in the information space. In our research, we have been
investigating how visualization tools can support navigation by leveraging the
standard and folk classification systems that are embedded in information
spaces. We have focused on two specific domains where navigating information
can pose challenges: medical informatics and software engineering.
Within the domain of medical informatics, we have designed a visualization tool that supports the exploration and comparison of a set of clinical trials. The navigational support offered to the user is customized according to an ontology that describes the trial designs. For software engineers, we have developed a tool that generates "navigational waypoints" from informal tagging in software documents. These waypoints provide a way for the software engineer to create "tours" through the space of software documents. In our current work, we are now exploring how adaptive visualization tools may leverage both structured and unstructured information in providing navigational support. We believe that both kinds of information when presented in a coherent visual manner will lead to more effective cognitive support for users as they browse, query and search integrated knowledge spaces. Keywords: document navigation, folksonomies, ontologies, taxonomies, visualization,
waypoints | |||
| Thresholding of badly illuminated document images through photometric correction | | BIBAK | Full-Text | 3-8 | |
| Shijian Lu; Chew Lim Tan | |||
| This paper presents a document image thresholding technique that binarizes
badly illuminated document images through photometric correction. Based on the
observation that illumination normally varies smoothly and document images
often contain a uniformly colored background, the global shading variation is
estimated by using a two-dimensional Savitzky-Golay filter that fits a least-squares
polynomial surface to the luminance of a badly illuminated document
image. With the knowledge of the global shading variation, shading degradation
is then corrected through a compensation process that produces an image with
roughly uniform illumination. Badly illuminated document images are accordingly
binarized through the global thresholding of the compensated ones. Experiments
show that the proposed thresholding technique is fast, robust, and efficient
for the binarization of badly illuminated document images. Keywords: badly-illuminated document images, document image analysis, document image
thresholding | |||
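As an illustration of the pipeline sketched in the abstract, the following Python sketch fits a least-squares polynomial surface to the luminance (standing in for the paper's two-dimensional Savitzky-Golay filter), compensates the shading, and applies a plain Otsu global threshold. The function names, polynomial order and the choice of Otsu are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def estimate_shading(gray, order=3):
    """Fit a least-squares polynomial surface to the luminance image.

    A stand-in for the paper's 2-D Savitzky-Golay filtering; `order` is an
    assumed parameter, not taken from the paper.
    """
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x = xx.ravel() / w
    y = yy.ravel() / h
    # Polynomial basis x^i * y^j with i + j <= order.
    basis = [x**i * y**j for i in range(order + 1) for j in range(order + 1 - i)]
    A = np.column_stack(basis)
    coeffs, *_ = np.linalg.lstsq(A, gray.ravel().astype(float), rcond=None)
    return (A @ coeffs).reshape(h, w)

def compensate(gray, shading):
    """Divide out the estimated shading to get roughly uniform illumination."""
    corrected = gray.astype(float) * (shading.mean() / np.maximum(shading, 1e-6))
    return np.clip(corrected, 0, 255).astype(np.uint8)

def otsu_threshold(gray):
    """Plain Otsu global threshold applied to the compensated image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)
    mu = np.cumsum(p * np.arange(256))
    sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1 - omega) + 1e-12)
    return int(np.argmax(sigma_b))

def binarize(gray):
    corrected = compensate(gray, estimate_shading(gray))
    # True = background (paper), False = ink.
    return corrected > otsu_threshold(corrected)
```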
| A system for understanding imaged infographics and its applications | | BIBAK | Full-Text | 9-18 | |
| Weihua Huang; Chew Lim Tan | |||
| Information graphics, or infographics, are visual representations of
information, data or knowledge. Understanding of infographics in documents is a
relatively new research problem, which becomes more challenging when
infographics appear as raster images. This paper describes technical details
and practical applications of the system we built for recognizing and
understanding imaged infographics located in document pages. To recognize
infographics in raster form, both graphical symbol extraction and text
recognition need to be performed. The two kinds of information are then
auto-associated to capture and store the semantic information carried by the
infographics. Two practical applications of the system are introduced in this
paper: supplementing a traditional optical character recognition (OCR) system
and providing enriched information for question answering (QA). To test
the performance of our system, we conducted experiments using a collection of
downloaded and scanned infographic images. Another set of scanned document
pages from the University of Washington document image database was used to
demonstrate how the system output can be used by other applications. The
results obtained confirm the practical value of the system. Keywords: applications, association of text and graphics, document image
understanding, infographics | |||
| A model for mapping between printed and digital document instances | | BIBAK | Full-Text | 19-28 | |
| Nadir Weibel; Moira C. Norrie; Beat Signer | |||
| The first steps towards bridging the paper-digital divide have been achieved
with the development of a range of technologies that allow printed documents to
be linked to digital content and services. However, the static nature of paper
and limited structural information encoded in classical paginated formats make
it difficult to map between parts of a printed instance of a document and
logical elements of a digital instance of the same document, especially taking
document revisions into account. We present a solution to this problem based on
a model that combines metadata of the digital and printed instances to enable a
seamless mapping between digital documents and their physical counterparts on
paper. We also describe how the model was used to develop iDoc, a framework
that supports the authoring and publishing of interactive paper documents. Keywords: document integration, document model, interactive paper, page description
languages, structured documents | |||
| Data model and architecture of a paper-digital document management system | | BIBAK | Full-Text | 29-31 | |
| Kosuke Konishi; Naohiro Furukawa; Hisashi Ikeda | |||
| We propose a document management system called "iJITinOffice," which manages
paper documents, including those with handwriting, and integrates them with
electronic documents. By digitizing and managing handwriting on paper, we
provide document management and retrieval capabilities that utilize the
thinking process and memory that occur with handwriting. The system was
implemented using Anoto digital pen technology. Previous papers [2, 3, 4]
introduced the concept and a summary of our system. In this paper we describe
the design of a data model and the architecture of the system. The data model
links information from paper, handwriting, and electronic documents together.
It makes it possible to interweave searches for electronic documents and
handwriting on paper documents. Keywords: digital pen, handwritten annotation, paper document management | |||
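Purely to illustrate the kind of data model the abstract refers to (paper pages, digital-pen strokes, and electronic documents linked together), here is a minimal sketch using Python dataclasses; every class and field name is a hypothetical placeholder, not the iJITinOffice schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ElectronicDocument:
    doc_id: str
    uri: str                        # location of the electronic original

@dataclass
class Stroke:
    pen_id: str
    timestamp: float
    points: List[Tuple[int, int]]   # (x, y) coordinates captured by the digital pen

@dataclass
class PaperPage:
    pattern_address: str            # Anoto dot-pattern address printed on the page
    source: ElectronicDocument      # the electronic document this page was printed from
    page_number: int
    strokes: List[Stroke] = field(default_factory=list)

    def add_stroke(self, stroke: Stroke) -> None:
        """Attach digitized handwriting to the printed page."""
        self.strokes.append(stroke)
```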
| A new Tsallis entropy-based thresholding algorithm for images of historical documents | | BIBAK | Full-Text | 32-34 | |
| Carlos A. B. Mello | |||
| This paper presents an algorithm for thresholding images of
historical documents. The main objective is to generate high-quality
monochromatic images in order to make them easily accessible through the Internet and
to achieve high recognition rates with Optical Character Recognition algorithms. Our
new algorithm is based on the classical entropy concept and a variation defined
by the Tsallis entropy, and it proved to be more efficient than classical
thresholding algorithms. The images generated are analyzed using precision,
recall, accuracy and specificity. Keywords: document processing, entropy, historical documents, image segmentation,
thresholding | |||
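The sketch below shows the standard Tsallis-entropy threshold selection that the abstract builds on; the entropic index q and the scan over grey levels are generic assumptions, and Mello's algorithm additionally combines this with the classical entropy concept.

```python
import numpy as np

def tsallis_threshold(gray, q=0.8):
    """Select a global threshold by maximizing the Tsallis entropy sum.

    `q` is the entropic index; the value here is an assumption, not the
    paper's tuning for historical documents.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    best_t, best_s = 0, -np.inf
    for t in range(1, 255):
        p_a, p_b = p[:t].sum(), p[t:].sum()
        if p_a == 0 or p_b == 0:
            continue
        s_a = (1.0 - np.sum((p[:t] / p_a) ** q)) / (q - 1.0)
        s_b = (1.0 - np.sum((p[t:] / p_b) ** q)) / (q - 1.0)
        # Pseudo-additivity of the Tsallis entropy for the two classes.
        s = s_a + s_b + (1.0 - q) * s_a * s_b
        if s > best_s:
            best_t, best_s = t, s
    return best_t
```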
| Presenting in HTML | | BIBAK | Full-Text | 35-36 | |
| Erik Wilde; Philippe Cattin | |||
| The management and publishing of complex presentations is poorly supported
by available presentation software. This makes it hard to publish usable and
accessible presentation material, and to reuse that material for continuously
evolving events. XSLidy provides an XSLT-based approach to generate
presentations out of a mix of HTML and structural elements. Using XSLidy, the
management and reuse of complex presentations becomes easier, and the results
are more user-friendly in terms of usability and accessibility. Keywords: XSLidy, presentation | |||
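XSLidy itself is an XSLT stylesheet; the sketch below only illustrates the surrounding workflow of applying such a stylesheet to a source that mixes HTML with structural elements, using lxml. The file names are placeholders, not part of the XSLidy distribution.

```python
from lxml import etree

def build_presentation(source_path: str, stylesheet_path: str, output_path: str) -> None:
    """Apply an XSLT stylesheet (e.g. a slide generator) to an XML/HTML source."""
    source = etree.parse(source_path)                  # slides written as HTML plus structural elements
    transform = etree.XSLT(etree.parse(stylesheet_path))
    result = transform(source)                         # one self-contained HTML presentation
    with open(output_path, "wb") as out:
        out.write(etree.tostring(result, pretty_print=True, encoding="utf-8"))

# Hypothetical usage:
# build_presentation("talk.xml", "xslidy.xsl", "talk.html")
```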
| A multi-format variable data template wrapper extending PODi's PPML-T standard | | BIBAK | Full-Text | 37-43 | |
| Fabio Giannetti | |||
| Variable Data Print (VDP) has fueled the need for increasingly sophisticated
tools and capabilities with every solution vendor providing different
approaches and techniques.
Nevertheless, it is possible to provide a unified wrapper around these different XML formats that will facilitate the exchange of templates and/or import/export into other formats. The proposed solution compares favourably with the Simple Object Access Protocol (SOAP) provision for Web Services. The SOAP wrapper provides the protocol that allows the contained message to be expressed in different formats (described using XML Schemas). This enables interoperability between services, encapsulating their specific implementations and exposing only the methods and their parameters. This proposal builds on similar concepts, separating a template into three parts: the template, the binding and the data. Each part has its own format and "embedded" semantics. Keywords: PPML, PPMLT, SOAP, SVG, XML, XSL-FO, XSLT, document exchange, template,
variable data print | |||
| Extracting reusable document components for variable data printing | | BIBAK | Full-Text | 44-52 | |
| Steven R. Bagley; David F. Brailsford; James A. Ollis | |||
| Variable Data Printing (VDP) has brought new flexibility and dynamism to the
printed page. Every printed instance of a specific class of document can now
have different degrees of customized content within the document template.
This flexibility comes at a cost. If every printed page is potentially different from all others, it must be rasterized separately, which is a time-consuming process. Technologies such as PPML (Personalized Print Markup Language) attempt to address this problem by dividing the bitmapped page into components that can be cached at the raster level, thereby speeding up the generation of page instances. A large number of documents are stored in Page Description Languages at a higher level of abstraction than the bitmapped page. Much of this content could be reused within a VDP environment provided that separable document components can be identified and extracted. These components then need to be individually rasterisable so that each high-level component can be related to its low-level (bitmap) equivalent. Unfortunately, the unstructured nature of most Page Description Languages makes it difficult to extract content easily. This paper outlines the problems encountered in extracting component-based content from existing page description formats, such as PostScript, PDF and SVG, and how the differences between the formats affect the ease with which content can be extracted. The techniques are illustrated with reference to a tool called COG Extractor, which extracts content from PDF and SVG and prepares it for reuse. Keywords: PDF, SVG, content extraction, graphic objects, PostScript, variable data
printing | |||
| VDP templates with theme-driven layer variants | | BIBAK | Full-Text | 53-55 | |
| Royston Sellman | |||
| Many graphic artists and designers have adapted their skills to the use of
tools that extend static layout applications, allowing the creation of Variable
Data Print template documents. These connect text and image placeholders to
database fields, allowing creation of a set of instances at job time. Much of
the VDP work flowing through digital presses originates this way. However, in
the field we have observed limitations to simple approaches which make it hard
to create templates that do much more than can be achieved with mail merge and
variable backgrounds. In this paper we describe two examples which illustrate
the problems. Solutions have been developed but a frequent drawback is that
they move the graphic artist out of the loop, either because they do not
support the fine layout control creative professionals expect to use, or
because they are aimed at programmers and database professionals. Agencies and
PSPs, however, are so keen to keep creative professionals in the loop that we
have seen ingenious but fragile and inefficient in-house solutions which
support complex VDP outputs while still keeping the designer in the team. We
have developed tools for creative professionals that extend standard layout
applications and allow designers to go a step beyond simple VDP. This paper
describes the application of these tools to real use cases. We show that the
tools can replace custom solutions giving improvements in VDP job creation,
database simplification and resilience to changing requirements. Keywords: PPML-T, VDP | |||
| Speculative document evaluation | | BIBAK | Full-Text | 56-58 | |
| Alexander Macdonald; David Brailsford; Steven Bagley; John Lumley | |||
| Optimisation of real-world Variable Data printing (VDP) documents is a
difficult problem because the interdependencies between layout functions may
drastically reduce the number of invariant blocks that can be factored out for
pre-rasterisation.
This paper examines how speculative evaluation at an early stage in a document-preparation pipeline provides a generic and effective method of optimising VDP documents that contain such interdependencies. Speculative evaluation will be at its most effective in speeding up print runs if sets of layout invariances can either be discovered automatically or designed into the document at an early stage. In either case, the expertise of the layout designer needs to be supplemented by expertise in exploiting potential invariances and also in predicting the effects of speculative evaluation on the caches used at various stages in the print production pipeline. Keywords: PPML, SVG, VDP, document layout, optimisation, speculative evaluation | |||
| A document object modeling method to retrieve data from a very large XML document | | BIBAK | Full-Text | 59-68 | |
| Seung Min Kim; Suk I. Yoo; Eunji Hong; Tae Gwon Kim; Il Kon Kim | |||
| Document Object Modeling (DOM) is a widely used approach for retrieving data
from an XML document. If the size of the XML document is very large, however,
using the DOM approach for retrieving data from the XML document may suffer
from a lack of memory space for building the associated XML tree in the main
memory. To alleviate this problem, we propose a method that allows the very
large XML document to be split into small XML documents, retrieves data from
the XML tree built from each of these small XML documents, and combines the
results from all of the n XML trees to generate the final result. With this
proposed approach, the memory space and processing time required to retrieve
data from the very large XML document using DOM are reduced so that they can be
managed by one single general-purpose personal computer. Keywords: DOM, DOM API, XML, very large XML documents | |||
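A minimal sketch of the split-and-combine idea described above, using the standard-library iterparse so that each record is handled as a small tree and released before the next one is read. The element names, the `id` attribute and the aggregation step are invented for illustration and are not the authors' method.

```python
import xml.etree.ElementTree as ET

def query_in_chunks(path, record_tag, predicate):
    """Stream a huge XML file, build a small tree per record, and combine results.

    Instead of loading the whole document as one DOM tree, each `record_tag`
    element is treated as a small document of its own and discarded once queried.
    """
    matches = []
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == record_tag:
            if predicate(elem):           # run the query on the small tree
                matches.append(elem.attrib.get("id"))
            elem.clear()                  # free memory before the next chunk
    return matches                        # combined result over all the sub-trees

# Hypothetical usage: ids of <patient> records with an <age> over 65.
# result = query_in_chunks("huge.xml", "patient",
#                          lambda e: int(e.findtext("age", "0")) > 65)
```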
| A document engineering environment for clinical guidelines | | BIBAK | Full-Text | 69-78 | |
| Gersende Georg; Marie-Christine Jaulent | |||
| In this paper, we present a document engineering environment for Clinical
Guidelines (G-DEE), which are standardized medical documents developed to
improve the quality of medical care. The computerization of Clinical Guidelines
has attracted much interest in recent years, as it could support the
knowledge-based process through which they are produced. Early work on
guideline computerization has been based on document engineering techniques
using mark-up languages to produce structured documents. We propose to extend
the document-based approach by introducing some degree of automatic content
processing, dedicated to the recognition of linguistic markers, signaling
recommendations through the use of "deontic operators". Such operators are
identified by shallow parsing using Finite-State Transition Networks, and are
further used to automatically generate mark-up structuring the documents. We
also show that several guideline manipulation tasks can be formalized as
XSL-based transformations of the original marked-up document. The automatic
processing component, which underlies the marking-up process, has been
evaluated using two complete clinical guidelines (corresponding to over 300
recommendations). As a result, precision of marker identification varied
between 88 and 98% and recall between 81 and 99%. Keywords: GEM, XML, clinical guidelines, deontic operators | |||
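G-DEE recognizes deontic markers with finite-state transition networks; as a rough stand-in, the sketch below uses regular expressions to show how recognized markers can drive automatic mark-up of recommendations. The operator lexicon and tag names are illustrative assumptions.

```python
import re

# A few illustrative deontic markers; G-DEE's actual lexicon and
# finite-state networks are far richer than this.
DEONTIC_PATTERNS = {
    "obligation":     r"\b(must|is required to|il faut)\b",
    "recommendation": r"\b(should|is recommended|il est recommandé)\b",
    "prohibition":    r"\b(must not|should not|ne doit pas)\b",
}

def mark_up_recommendations(sentences):
    """Wrap sentences containing a deontic marker in an XML-like element."""
    marked = []
    for sentence in sentences:
        for operator, pattern in DEONTIC_PATTERNS.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                marked.append(
                    f'<recommendation operator="{operator}">{sentence}</recommendation>')
                break
        else:
            marked.append(sentence)       # no marker found: leave the sentence untouched
    return marked
```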
| XML version detection | | BIBAK | Full-Text | 79-88 | |
| Deise de Brum Saccol; Nina Edelweiss; Renata de Matos Galante; Carlo Zaniolo | |||
| The problem of version detection is critical in many important application
scenarios, including software clone identification, Web page ranking,
plagiarism detection, and peer-to-peer searching. A natural and commonly used
approach to version detection relies on analyzing the similarity between files.
Most of the techniques proposed so far rely on the use of hard thresholds for
similarity measures. However, defining a threshold value is problematic for
several reasons: in particular (i) the threshold value is not the same when
considering different similarity functions, and (ii) it is not semantically
meaningful for the user. To overcome this problem, our work proposes a version
detection mechanism for XML documents based on Naïve Bayesian classifiers.
Thus, our approach turns the detection problem into a classification problem.
In this paper, we present the results of various experiments on synthetic data
that show that our approach produces very good results, both in terms of recall
and precision measures. Keywords: XML, classification, similarity functions, versioning | |||
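To make the "detection as classification" idea concrete, here is a hedged sketch: a pair of XML documents is reduced to a few similarity features and a Naïve Bayes classifier decides whether they are versions of each other. The specific features and the Gaussian variant are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def similarity_features(doc_a, doc_b):
    """Turn a pair of XML documents into a small feature vector.

    The three measures here (tag overlap, text overlap, size ratio) are
    illustrative; the paper's feature set may differ.
    """
    def jaccard(s, t):
        return len(s & t) / max(len(s | t), 1)
    tags_a, tags_b = set(doc_a["tags"]), set(doc_b["tags"])
    words_a, words_b = set(doc_a["text"].split()), set(doc_b["text"].split())
    size_ratio = min(doc_a["size"], doc_b["size"]) / max(doc_a["size"], doc_b["size"], 1)
    return [jaccard(tags_a, tags_b), jaccard(words_a, words_b), size_ratio]

def train_detector(pairs, labels):
    """pairs: (doc_a, doc_b) tuples; labels: 1 = versions, 0 = unrelated."""
    X = np.array([similarity_features(a, b) for a, b in pairs])
    clf = GaussianNB()          # the classifier replaces a hard similarity threshold
    clf.fit(X, np.array(labels))
    return clf

def are_versions(clf, doc_a, doc_b):
    return bool(clf.predict([similarity_features(doc_a, doc_b)])[0])
```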
| Declarative extensions of XML languages | | BIBAK | Full-Text | 89-91 | |
| Simon Thompson; Peter R. King; Patrick Schmitz | |||
| We present a set of XML language extensions that bring notions from
functional programming to web authors, extending the power of declarative
modelling for the web. Our previous work discussed expressions and user-defined
events. In this paper, we discuss how one may extend XML by adding definitions
and parameterization; complex data and data types; and reactivity, events and
continuous "behaviours". We consider these extensions in the light of World
Wide Web Consortium standards, and illustrate their utility by a variety of use
cases. Keywords: XML, behaviour, data type, declarative, event, functional, type | |||
| Bank notes: extreme DocEng | | BIBAK | Full-Text | 92 | |
| Sara Church | |||
| Most people handle bank notes every day without giving them a thought, let
alone pondering their complexity. Yet every aspect of a bank note is highly
engineered to serve its purpose. Every facet of a bank note's existence, from
the materials that make it up to the equipment that produces it, from the
machines that handle it to the shredders that destroy it, is carefully
considered and designed. Layered on these functional requirements are human
factors and the need to verify their authenticity, to be able to distinguish
them from any other printed documents that clever would-be, ill-intentioned
imitators might produce.
In the context of today's print-on-demand environment and the glitter-and-glow appeal of craft and display products to all segments of society, the requirements for achieving this differentiation from the counterfeiters' best products are increasingly challenging. This presentation addresses how real bank notes are made, the practical factors that drive their function and form requirements and the interplay of these factors with their security requirements, to inhibit the manufacture of counterfeit bank notes. Keywords: bank notes, counterfeiting, document engineering "in the large", document
security, security documents, security printing, security substrates, variable
data printing | |||
| Anvil next generation: a multi-format variable data print template based on PPML-T | | BIBAK | Full-Text | 93-94 | |
| Fabio Giannetti | |||
| Anvil Next Generation is a toolset enabling the use of multiple formats
as templates. It is mainly based on the Personalized Print Markup Language
Template (PPML-T) workflow. The possibility of supporting several template
formats within the same workflow enables more flexibility, whilst maintaining
the data merge and binding operations unchanged. Keywords: PPML, PPMLT, XSL-FO, XSLT, template, variable data print | |||
| Intention driven multimedia document production | | BIBAK | Full-Text | 95-96 | |
| Ludovic Gaillard; Marc Nanard; Peter R. King; Jocelyne Nanard | |||
| We demonstrate a system supporting intention-driven multimedia document
series production. We present mechanisms which build specifications of
genre-compliant document series and which produce documents conforming to those
specifications from existing finely indexed multimedia data sources. Keywords: genre, meta-structure, multimedia, series, transformation | |||
| Touch scan-n-search: a touchscreen interface to retrieve online versions of scanned documents | | BIBAK | Full-Text | 97-98 | |
| Fabrice Matulic | |||
| The system described in this paper attempts to tackle the problem of finding
online content based on paper documents through an intuitive touchscreen
interface designed for modern scanners and multifunction printers. Touch
Scan-n-Search allows the user to select elements of a scanned document (e.g. a
newspaper article) and to seamlessly connect to common web search services in
order to retrieve the online version of the document along with related
content. This is achieved by automatically extracting keyphrases from text
elements in the document (obtained by OCR) and creating "tappable" GUI widgets
to allow the user to control and fine-tune the search requests. The retrieved
content can then be printed, sent, or used to compose new documents. Keywords: GUI, keyword extraction, online news retrieval, scanned document | |||
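A toy sketch of the OCR-text-to-query step: pick frequent terms as keyphrases and turn the user's selection into a web-search URL. The scoring, stopword list and search URL are placeholders; the system's actual keyphrase extraction is certainly more sophisticated.

```python
import re
import urllib.parse
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "is", "that", "with"}

def extract_keyphrases(ocr_text, k=5):
    """Pick the k most frequent non-stopword terms from the OCR'd text."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]{3,}", ocr_text)]
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

def build_search_url(keyphrases, base="https://www.example-search.com/search?q="):
    """Build a web-search request from the selected (tappable) keyphrases."""
    return base + urllib.parse.quote(" ".join(keyphrases))
```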
| The SALT triple: framework, editor, publisher | | BIBAK | Full-Text | 99-100 | |
| Tudor Groza; Alexander Schutz; Siegfried Handschuh | |||
| In this paper we present the SALT (Semantically Annotated LATEX) Triple, a
set of tools built to demonstrate a complete annotation workflow from creation
to usage. The Triple set contains the authoring and annotation framework, an
editor, and a web publisher that supports the generation of metadata or uses the generated
metadata for a specific purpose. The demos show the three phases of the
workflow: (i) authoring -- first we introduce the way in which concurrent
annotations can be created during the authoring process by using the iSALT
editor as a front-end for the SALT framework; (ii) generation -- then we show
how the metadata is generated and embedded into the final result of the
authoring and annotation process, i.e. a semantically enriched PDF document;
(iii) usage -- and finally we demonstrate how the metadata can be used
for generating a set of rich online workshop proceedings. Keywords: LATEX, semantic authoring, semantic document | |||
| An efficient, streamable text format for multimedia captions and subtitles | | BIBAK | Full-Text | 101-110 | |
| Dick C. A. Bulterman; A. J. Jansen; Pablo Cesar; Samuel Cruz-Lara | |||
| In spite of the high profile of media types such as video, audio and images,
many multimedia presentations rely extensively on text content. Text can be
used for incidental labels, or as subtitles or captions that accompany other
media objects. In a multimedia document, text content is not only constrained
by the need to support presentation styles and layout, it is also constrained
by the temporal context of the presentation. This involves intra-text and
extra-text timing synchronization with other media objects. This paper describes a
new timed-text representation language that is intended to be embedded in a
non-text host language. Our format, which we call aText (for the Ambulant Text
Format), balances the need for text styling with the requirement for an
efficient representation that can be easily parsed and scheduled at runtime.
aText, which can also be streamed, is defined as an embeddable text format for
use within declarative XML languages. The paper presents a discussion of the
requirements for the format, a description of the format and a comparison with
other existing and emerging text formats. We also provide examples for aText
when embedded within the SMIL and MLIF languages and discuss our implementation
experiences of aText with the Ambulant Player. Keywords: DFXP, SMIL, ambulant, realtext, streaming text, timed text | |||
| Genre driven multimedia document production by means of incremental transformation | | BIBAK | Full-Text | 111-120 | |
| Marc Nanard; Jocelyne Nanard; Peter R. King; Ludovic Gaillard | |||
| Genre, like layout, is an important factor in effective communication, and
automated tools which assist in genre compliance are thus of considerable
value. Genres are reusable meta-structures, which exist independently of
specific documents. This paper focuses on that part of the document production
process which involves genre, and discusses a specific example in order to
present the design rationale of mechanisms which assist in producing documents
compliant with specific genre rules.
The mechanisms we have developed are based on automated incremental, iterative transformations, which convert a draft document elaborated by the author into a genre compliant final document. The approach mimics the manner in which a human expert would transform the document. Transformation rules constitute a reusable and constructive expression of certain aspects of genre. The rules identify situations which appear inappropriate for the genre in question, and propose corrective action, so that the document becomes increasingly more compliant with the genre in question. This process of genre conformance iterates until no further corrective action is possible. This mechanism has been fully implemented. The implementation comprises both a work environment and a rule based language. The implementation relies internally on a general purpose tree transformation engine designed originally for use in natural language processing applications, which we have adapted to handle XML documents. Keywords: genre, meta-structure, multimedia, series, transformation | |||
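The core mechanism, applying corrective rules repeatedly until no rule fires, can be sketched in a few lines; the example rule and element names below are invented, and the real system uses a dedicated rule language and tree-transformation engine rather than ElementTree.

```python
import xml.etree.ElementTree as ET

def rule_untitled_section(root):
    """Example rule: a <section> without a <title> gets a placeholder title.

    Returns True when it changed something, mimicking one corrective action.
    """
    changed = False
    for section in root.iter("section"):
        if section.find("title") is None:
            title = ET.Element("title")
            title.text = "Untitled section"
            section.insert(0, title)
            changed = True
    return changed

def make_genre_compliant(root, rules):
    """Apply corrective rules repeatedly until none fires (a fixpoint)."""
    while True:
        changed = False
        for rule in rules:
            changed = rule(root) or changed
        if not changed:
            return root

# Hypothetical usage on a draft document:
# doc = ET.fromstring("<article><section><p>text</p></section></article>")
# make_genre_compliant(doc, [rule_untitled_section])
```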
| Timed-fragmentation of SVG documents to control the playback memory usage | | BIBAK | Full-Text | 121-124 | |
| Cyril Concolato; Jean Le Feuvre; Jean-Claude Moissinac | |||
| The Scalable Vector Graphics (SVG) language allows in its version 1.2 the
description of multimedia scenes including audio, video, vector graphics,
interactivity and animations. This standard has been selected by the mobile
industry as the format for vector graphics and rich media content. For this
purpose, additional tools were introduced in the language to solve the problem
of the playback of long-running SVG sequences on memory-constrained devices
like mobile phones. However, the proposed tools are not entirely sufficient and
solutions outside the scope of SVG are needed.
This paper proposes a method, complementary to the SVG tools, to control the memory consumption while playing back long running SVG sequences. This method relies on the use of an auxiliary XML document to describe the timed-fragmentation of the SVG document and the storage and streaming properties of each SVG fragment. Using this method, this paper shows that some SVG documents can be stored, delivered and played as streams, and that their playback as streams brings an important memory consumption reduction while using a standard SVG 1.2 Tiny player. Keywords: fragmentation, memory usage, scalable vector graphics, streaming, timing | |||
| Automatic float placement in multi-column documents | | BIBAK | Full-Text | 125-134 | |
| Kim Marriott; Peter Moulder; Nathan Hurst | |||
| Multi-column layout with horizontal scrolling has a number of advantages
over the standard model (single column with vertical scrolling) for on-line
document layout. However, one difficulty with the multi-column model is the
need for good automatic placement of floating figures. We identify reasonable
aesthetic criteria for their placement, and then give a
dynamic-programming-like algorithm for finding an optimal layout with respect
to these criteria. We also investigate an A* based approach and give two
variants differing in the choice of heuristic. We find that one of the A* based
approaches is faster than the dynamic programming approach and, if a "window"
of optimization is used, fast enough for moderately sized documents. Keywords: floating figure, multi-column layout, optimization techniques | |||
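The paper's cost model and algorithms are richer than this, but the following deliberately simplified dynamic-programming sketch conveys the flavour, under the assumptions that each column holds at most one float and that the only aesthetic cost is the distance between a figure and the column of its callout.

```python
from functools import lru_cache

def place_floats(callouts, num_columns):
    """Assign each float (in reading order) to a column at or after its callout.

    `callouts[i]` is the column where figure i is first referenced.
    Returns (total_cost, placements) or (inf, None) if no layout exists.
    """
    n = len(callouts)

    @lru_cache(maxsize=None)
    def best(i, next_free):
        if i == n:
            return 0.0, ()
        best_cost, best_plan = float("inf"), None
        for col in range(max(next_free, callouts[i]), num_columns):
            rest_cost, rest_plan = best(i + 1, col + 1)
            cost = (col - callouts[i]) + rest_cost   # distance from the callout
            if cost < best_cost:
                best_cost, best_plan = cost, (col,) + rest_plan
        return best_cost, best_plan

    cost, plan = best(0, 0)
    return cost, (list(plan) if plan is not None else None)

# Hypothetical usage: three figures called out in columns 0, 0 and 3 of a
# six-column spread.
# print(place_floats([0, 0, 3], 6))   # -> (1.0, [0, 1, 3])
```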
| Logical document conversion: combining functional and formal knowledge | | BIBAK | Full-Text | 135-143 | |
| Hervé Déjean; Jean-Luc Meunier | |||
| We present in this paper a method for document layout analysis based on
identifying the function of document elements (what they do). This approach is
orthogonal and complementary to the traditional view based on the form of
document elements (how they are constructed). One key advantage of such
functional knowledge is that the functions of some document elements are very
stable from document to document and over time. Relying on the stability of
such functions, the method is not impacted by layout variability, a key issue
in logical document analysis and is thus very robust and versatile. The method
starts the recognition process by using functional knowledge and uses in a
second step formal knowledge as a source of feedback in order to correct some
errors. This allows the method to adapt to specific documents by using formal
specificities. Keywords: combination of knowledge, feedback, functional analysis, logical document
analysis, methodology | |||
| Preserving the aesthetics during non-fixed aspect ratio scaling of the digital border | | BIBAK | Full-Text | 144-146 | |
| Hui Chao; Prasad Gabbur; Anthony Wiley | |||
| To enhance the visual effect of a photo, various digital borders or frames
are provided for photo decoration at photo sharing websites. Even though
multiple versions of the same border design may be prepared manually for
several "standard" page or photo sizes, difficulty arises when the user's page
or photo sizes are not one of the standards. Forcing a photo into an ill-fitting
border will result in a cropped photo. This limits the use of digital borders
and therefore the art designs. In this paper, we propose a method that
automatically resizes the digital border for different paper sizes while
preserving the look and feel of the original design. It analyzes the geometric
layout and semantic structure of the digital border and then, based on the
nature of the structures, scales and moves them to the right place to
reconstruct the digital border at the new page size. Keywords: document layout, document scaling, image segmentation and reconstruction | |||
| Approximating text by its area | | BIBAK | Full-Text | 147-150 | |
| Nathan Hurst; Kim Marriott | |||
| Given possibly non-rectangular shapes, S1, ..., Sn, and some English text,
T, we give methods based on approximating T by its area that determine for each
Si whether T definitely fits in Si, definitely does not fit in Si, or probably
fits in Si. These methods have complexity linear in the size of Si, assuming it
is represented as a trapezoid list, but do not depend on the size of T. They
require a linear time shape independent pre-processing of the text. Keywords: continuous approximation | |||
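A minimal sketch of the area-based fit test: the shape area comes from its trapezoid list, the text area from a per-character estimate obtained in a pre-processing pass, and a slack margin separates the three verdicts. The `char_area` and `slack` values are invented parameters.

```python
def trapezoid_area(trapezoids):
    """Area of a shape given as a list of trapezoids (y_top, y_bot, w_top, w_bot)."""
    return sum((y_bot - y_top) * (w_top + w_bot) / 2.0
               for y_top, y_bot, w_top, w_bot in trapezoids)

def fit_verdict(text, trapezoids, char_area=45.0, slack=0.15):
    """Classify whether `text` fits in the shape, using only areas.

    `char_area` is the average area of one set character (same units as the
    trapezoids) and `slack` an uncertainty margin; both are assumptions.
    """
    shape_area = trapezoid_area(trapezoids)
    text_area = len(text) * char_area
    if text_area <= shape_area * (1.0 - slack):
        return "definitely fits"
    if text_area >= shape_area * (1.0 + slack):
        return "definitely does not fit"
    return "probably fits"
```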
| Editing with style | | BIBAK | Full-Text | 151-160 | |
| Vincent Quint; Irène Vatton | |||
| HTML has popularized the use of style sheets, and the advent of XML has
stressed the importance of style as a key area complementing document structure
and content. A number of tools are now available for producing HTML and XML
documents, but very few are addressing style issues. In this paper we analyze
the requirements for style manipulation tools, based on the main features of
the CSS language. We discuss methods and techniques that meet these
requirements and that can be used to efficiently support web authors in style
sheet manipulation. The discussion is illustrated by the recent developments
made in the Amaya web authoring environment. Keywords: CSS, document authoring, style languages, web editing | |||
| The Mars project: PDF in XML | | BIBAK | Full-Text | 161-170 | |
| Matthew R. B. Hardy | |||
| The Portable Document Format (PDF) is a page-oriented, graphically rich
document format based on PostScript semantics. It is the file format underlying
the Adobe® Acrobat® viewers and is used throughout the publishing
industry for final form documents and document interchange. Beyond document
layout, PDF provides enhanced capabilities, which include logical structure,
forms, 3D, movies and a number of other rich features.
Developers and system integrators face challenges manipulating PDF and its data. They are looking for solutions that allow them to more easily create and operate on documents, as well as to integrate with modern XML-based document processing workflows. The Mars document format is based on the fundamental structures of PDF, but uses an XML syntax to represent the document. Mars uses XML to represent the underlying data structures of PDF, as well as incorporating additional industry standards such as SVG, PNG, JPG, JPG2000 and OpenType. Mars combines all of these components into a ZIP-based document container. The use of open standards in Mars means that Mars documents can be used with a large range of off-the-shelf tools and that a larger population of developers will be very familiar with its underlying technology. Using these standards, publishers gain access to all of the richness of PDF, but can now tightly integrate Mars into their document workflows. Keywords: Mars, PDF, SVG, XML, package, zip | |||
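One practical consequence of a ZIP-based container is that ordinary tools can inspect it; the sketch below merely groups the parts of such a package by file type. The `.mars` file name and the grouping are illustrative, not the actual Mars package layout.

```python
import zipfile

def list_document_parts(path):
    """Group the parts of a ZIP-based document container by file extension."""
    groups = {}
    with zipfile.ZipFile(path) as package:
        for name in package.namelist():
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
            groups.setdefault(ext, []).append(name)
    return groups

# Hypothetical usage:
# for ext, parts in list_document_parts("report.mars").items():
#     print(ext, len(parts))
```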
| SALT: a semantic approach for generating document representations | | BIBAK | Full-Text | 171-173 | |
| Tudor Groza; Alexander Schutz; Siegfried Handschuh | |||
| The structure of a document has an important influence on the perception of
its content. Considering scientific publications, we can affirm that, by making
use of the ordinary linear layout, a well-organized publication, following a
clear "red thread", will always be better understood and analyzed than one having a
poor or chaotic structure, but not necessarily poor content. Reading a
publication in a linear way, from the first page to the last page, means a lot
of unnecessary information processing for the reader. Looking at a publication
from another perspective by accessing the key-points or argumentative structure
directly can give better insights into the author's thoughts, and for certain
tasks (i.e. getting a first impression of an article) a representation of the
document reduced to its core could be more important than its linear structure.
In this paper, we will show how one can build different representations of the
same document, by exploiting the semantics captured in the text. The focus will
be on scientific publications, and as a foundation we use the SALT
(Semantically Annotated LATEX) annotation framework for creating Semantic PDF
Documents. Keywords: LATEX, PDF, semantic annotation, semantic document | |||
| Endless documents: a publication as a continual function | | BIBAK | Full-Text | 174-176 | |
| John Lumley; Roger Gimson; Owen Rees | |||
| Variable data can be considered as functions of their bindings to values.
The Document Description Framework (DDF) treats documents in this manner, using
XSLT semantics to describe document functionality and a variety of related
mechanisms to support layout, reference and so forth. But the result of
evaluation of a function could itself be a function: can variable data
documents behave likewise? We show that documents can be treated as simple
continuations within that framework with minor modifications. We demonstrate
this on a perpetual diary. Keywords: SVG, XSLT, document construction, functional programming | |||
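The idea of a document whose evaluation yields another document can be illustrated with a continuation-style sketch of the perpetual diary mentioned above; this is plain Python rather than the XSLT-based DDF machinery the paper uses.

```python
import datetime

def diary_page(date):
    """Evaluate one page of a perpetual diary and return the rest as a continuation.

    Evaluating the 'document' yields some concrete content plus a new function
    that stands for the remainder of the endless publication.
    """
    content = f"Diary page for {date.isoformat()}"
    def rest():
        return diary_page(date + datetime.timedelta(days=1))
    return content, rest

# Hypothetical usage: materialize the first three pages of the endless diary.
page, rest = diary_page(datetime.date(2007, 8, 28))
for _ in range(2):
    print(page)
    page, rest = rest()
print(page)
```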
| Authors vs. readers: a comparative study of document metadata and content in the WWW | | BIBAK | Full-Text | 177-186 | |
| Michael G. Noll; Christoph Meinel | |||
| Collaborative tagging describes the process by which many users add metadata
in the form of unstructured keywords to shared content. The recent practical
success of web services with such a tagging component like Flickr or
del.icio.us has provided a plethora of user-supplied metadata about web content
for everyone to leverage.
In this paper, we conduct a quantitative and qualitative analysis of metadata and information provided by the authors and publishers of web documents compared with metadata supplied by end users for the same content. Our study is based on a random sample of 100,000 web documents from the Open Directory, for which we examined the original documents from the World Wide Web in addition to data retrieved from the social bookmarking service del.icio.us, the content rating system ICRA, and the search engine Google. To the best of our knowledge, this is the first study to compare user tags with the metadata and actual content of documents in the WWW on a larger scale and to integrate document popularity information in the observations. The data set of our experiments is freely available for research. Keywords: authoring, del.icio.us, dmoz, dmoz100k06, document engineering, Google,
ICRA, metadata, PageRank, social bookmarking, tagging, www | |||
| Elimination of junk document surrogate candidates through pattern recognition | | BIBAK | Full-Text | 187-195 | |
| Eunyee Koh; Daniel Caruso; Andruid Kerne; Ricardo Gutierrez-Osuna | |||
| A surrogate is an object that stands for a document and enables navigation
to that document. Hypermedia is often represented with textual surrogates, even
though studies have shown that image and text surrogates facilitate the
formation of mental models and overall understanding. Surrogates may be formed
by breaking a document down into a set of smaller elements, each of which is a
surrogate candidate. While processing these surrogate candidates from an HTML
document, relevant information may appear together with less useful junk
material, such as navigation bars and advertisements.
This paper develops a pattern recognition based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements, and uses classification algorithms to make selection decisions based on these features. For the purpose of defining features in surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined Document Object Model (DOM)-like representation of semantic structure. Using a quadratic classifier, we were able to eliminate junk surrogate candidates with an average classification rate of 80%. By using this technique, semi-autonomous agents can be developed to more effectively generate surrogate collections for users. We end by describing a new approach for hypermedia and the semantic web, which uses the DSM to define value-added surrogates for a document. Keywords: document surrogate model, mixed-initiatives, navigation, pattern
recognition, principal components analysis, quadratic classifier,
semi-autonomous agents, surrogate | |||
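A hedged sketch of the classification step: each surrogate candidate is reduced to a small feature vector and a quadratic discriminant decides whether it is junk. The features below are invented stand-ins for the paper's DSM-based features.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def candidate_features(candidate):
    """Features of a surrogate candidate; illustrative, not the paper's feature set."""
    text = candidate.get("text", "")
    num_links = candidate.get("num_links", 0)
    return [
        len(text),                                  # amount of text
        num_links,                                  # navigation bars are link-heavy
        candidate.get("num_images", 0),
        len(text.split()) / (num_links + 1),        # words per link
    ]

def train_junk_filter(candidates, labels):
    """labels: 1 for junk (navigation, ads), 0 for informative candidates."""
    X = np.array([candidate_features(c) for c in candidates], dtype=float)
    clf = QuadraticDiscriminantAnalysis()
    clf.fit(X, np.array(labels))
    return clf

def keep_informative(clf, candidates):
    X = np.array([candidate_features(c) for c in candidates], dtype=float)
    return [c for c, junk in zip(candidates, clf.predict(X)) if junk == 0]
```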
| Filtering product reviews from web search results | | BIBAK | Full-Text | 196-198 | |
| Tun Thura Thet; Jin-Cheon Na; Christopher S. G. Khoo | |||
| This study seeks to develop an automatic method to identify product reviews
on the Web using the snippets (summary information) returned by search engines.
Determining whether a snippet is a review or non-review is a challenging task,
since the snippet usually does not contain many useful features for identifying
review documents. Firstly we applied a common machine learning technique, SVM
(Support Vector Machine), to investigate which features of snippets are useful
for the classification. Then we employed a heuristic approach utilizing domain
knowledge and found that the heuristic approach performs as well as the
machine learning approach. A hybrid approach that combines the machine
learning technique and domain knowledge performs slightly better than the
machine learning approach alone. Keywords: genre classification, product review documents, snippets | |||
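The three approaches compared in the abstract can be sketched as follows: a TF-IDF plus linear SVM classifier, a cue-word heuristic, and a hybrid of the two. The cue words and the way the hybrid combines the two decisions (a logical OR) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Domain-knowledge cue words; a stand-in for the paper's heuristics.
REVIEW_CUES = ("review", "rating", "stars", "pros", "cons", "verdict")

def heuristic_is_review(snippet: str) -> bool:
    text = snippet.lower()
    return any(cue in text for cue in REVIEW_CUES)

def train_svm(snippets, labels):
    """labels: 1 for review snippets, 0 for non-reviews."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(snippets, labels)
    return model

def hybrid_is_review(model, snippet: str) -> bool:
    """Combine the learned classifier with the domain heuristic."""
    return bool(model.predict([snippet])[0]) or heuristic_is_review(snippet)
```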
| Structure and content analysis for HTML medical articles: a hidden Markov model approach | | BIBAK | Full-Text | 199-201 | |
| Jie Zou; Daniel Le; George R. Thoma | |||
| We describe ongoing research on segmenting and labeling HTML medical journal
articles. In contrast to existing approaches in which HTML tags usually serve
as strong indicators, we seek to minimize dependence on HTML tags. Designing
logical component models for general Web pages is a challenging task. However,
in the narrow domain of online journal articles, we show that the HTML
document, modeled with a Hidden Markov Model, can be accurately segmented into
logical zones. Keywords: HTML document labeling, HTML document segmentation, document layout
analysis, document object model (DOM), text mining, web information retrieval | |||
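To make the HMM formulation concrete, here is a generic Viterbi decoder over a toy three-zone model (title, abstract, body) with discretized block observations; the states, observation symbols and probabilities are invented and much simpler than the paper's model.

```python
import numpy as np

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely sequence of logical zones for a sequence of text-block observations."""
    n, m = len(observations), len(states)
    log_delta = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    log_delta[0] = np.log(start_p) + np.log(emit_p[:, observations[0]])
    for t in range(1, n):
        for j in range(m):
            scores = log_delta[t - 1] + np.log(trans_p[:, j])
            back[t, j] = int(np.argmax(scores))
            log_delta[t, j] = scores[back[t, j]] + np.log(emit_p[j, observations[t]])
    path = [int(np.argmax(log_delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[i] for i in reversed(path)]

# Hypothetical 3-zone model with 2 observation symbols (0 = short block, 1 = long block).
states = ["title", "abstract", "body"]
start_p = np.array([0.8, 0.1, 0.1])
trans_p = np.array([[0.10, 0.80, 0.10],
                    [0.10, 0.30, 0.60],
                    [0.05, 0.05, 0.90]])
emit_p = np.array([[0.9, 0.1],      # titles are usually short
                   [0.4, 0.6],
                   [0.2, 0.8]])
print(viterbi([0, 1, 1, 1], states, start_p, trans_p, emit_p))
```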
| Exclusion-inclusion based text categorization of biomedical articles | | BIBAK | Full-Text | 202-204 | |
| Nadia Zerida; Nadine Lucas; Bruno Crémilleux | |||
| In this paper, we propose a new approach based on two original principles to
categorize biomedical articles. On the one hand, we combine linguistic,
structural and metric descriptors to build patterns stemming from data mining
techniques. On the other hand, we take into account the importance of the
absence of patterns to the categorization task by using an exclusion-inclusion
method. To avoid a crisp effect between the absence and the presence of a
pattern, the exclusion-inclusion method uses two regret measures to quantify
the interest of a weak pattern with respect to the other classes and among
patterns from the same class. The global decision is based on the generalization
of the local patterns, firstly by using patterns excluding classes, then
according to the regret ratios. Experiments show the effectiveness of the
approach. Keywords: categorization, characterisation, text mining | |||
| Adapting associative classification to text categorization | | BIBAK | Full-Text | 205-208 | |
| Baoli Li; Neha Sugandh; Ernest V. Garcia; Ashwin Ram | |||
| Associative classification, which originates from numerical data mining, has
been applied to deal with text data recently. Text data is first digitized
into a database of transactions, and training and prediction are then
conducted on the derived numerical dataset. This intuitive strategy has
demonstrated quite good performance. However, it does not take the inherent
characteristics of text data into consideration as much as it could,
even though it has to deal with text-specific problems such as
lemmatization and stemming during digitization. In this paper, we propose a
bottom-up strategy to adapt associative classification to text categorization,
in which we take into account structure information of text. Experiments on
Reuters-21578 dataset show that the proposed strategy can make use of text
structure information and achieve better performance. Keywords: associative classification, text categorization | |||
| Towards automatic document migration: semantic preservation of embedded queries | | BIBAK | Full-Text | 209-218 | |
| Thomas Triebsees; Uwe M. Borghoff | |||
| Archivists and librarians face an ever increasing amount of digital
material. Their task is to preserve its authentic content. In the long run,
this requires periodic migrations (from one format to another or from one
hardware/software platform to another). Document migrations are challenging
tasks where tool-support and a high degree of automation are important. A
central aspect is that documents are often mutually related and, hence, a
document's semantics has to be considered in its whole context. References
between documents are usually formulated in graph- or tree-based query
languages like URL or XPath. A typical scenario is web-archiving where websites
are stored inside a server infrastructure that can be queried from HTML-files
using URLs. Migrating websites will often require link adaptation in order to
preserve link consistency. Although automated and "trustworthy" preservation of
link consistency is easy to postulate, it is hard to carry out, in particular,
if "trustworthy" means "provably working correct". In this paper, we propose a
general approach to semantically evaluating and constructing graph queries,
which at the same time conform to a regular grammar, appear as part of a
document's content, and access a graph structure that is specified using
First-Order Predicate Logic (FOPL). In order to do so, we adapt model checking
techniques by constructing suitable query automata. We integrate these
techniques into our preservation framework [12] and show the feasibility of
this approach using an example. We migrate a website to a specific archiving
format and demonstrate the automated preservation of link-consistency. The
approach shown in this paper mainly contributes to a higher degree of
automation in document migration while still maintaining a high degree of
"trustworthiness", namely "provable correctness". Keywords: automated document migration, digital preservation, link consistency, query
processing | |||
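As a toy stand-in for the link-adaptation step (without the paper's provably correct query evaluation), the sketch below rewrites href and src references in an HTML page according to a mapping from old URLs to archive-local paths.

```python
from html.parser import HTMLParser

class LinkRewriter(HTMLParser):
    """Re-emit an HTML page while rewriting href/src attributes via a URL map.

    The mapping from old URLs to archive-local paths is just a dictionary here;
    comments and doctypes are dropped, which is acceptable only for a sketch.
    """
    def __init__(self, url_map):
        super().__init__()
        self.url_map = url_map
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if name in ("href", "src") and value is not None:
                value = self.url_map.get(value, value)   # preserve link consistency
            parts.append(f" {name}" if value is None else f' {name}="{value}"')
        self.pieces.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        self.pieces.append(data)

def migrate_html(html_text, url_map):
    rewriter = LinkRewriter(url_map)
    rewriter.feed(html_text)
    return "".join(rewriter.pieces)
```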
| Mapping paradigm for document transformation | | BIBAK | Full-Text | 219-221 | |
| Arnaud Blouin; Olivier Beaudoux | |||
| Since the advent of XML, the ability to transform documents using
transformation languages such as XSLT has become an important challenge.
However, writing a transformation script (e.g. an XSLT stylesheet) is still an
expert task. This paper proposes a simpler way to transform documents: first by
defining a relation between two schemas, expressed through our mapping language,
and then by using a transformation process that applies the mapping to instances
of the schemas. Thus, a user only needs to focus on the mapping without having
any knowledge about how a transformation language and its processor work. This
paper outlines our mapping approach and language, and illustrates them with an
example. Keywords: XML, XSLT, document transformation, mapping | |||
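A minimal sketch of the idea that a user states only a mapping and a generic process performs the transformation; the dictionary syntax and element names below are placeholders and bear no relation to the authors' actual mapping language.

```python
import xml.etree.ElementTree as ET

# A declarative mapping from source paths to target element names; the user
# states what corresponds to what, not how to transform it.
MAPPING = {
    "book/title":  "heading",
    "book/author": "byline",
}

def apply_mapping(source_root, mapping, target_tag="article"):
    """Generic transformation process driven entirely by the mapping."""
    target = ET.Element(target_tag)
    for source_path, target_name in mapping.items():
        for node in source_root.findall(source_path):
            child = ET.SubElement(target, target_name)
            child.text = node.text
    return target

# Hypothetical usage:
# src = ET.fromstring(
#     "<library><book><title>XML</title><author>A. N. Author</author></book></library>")
# print(ET.tostring(apply_mapping(src, MAPPING), encoding="unicode"))
```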
| Combination of transformation and schema languages described by a complete formal semantics | | BIBAK | Full-Text | 222-224 | |
| Catherine Pugin; Rolf Ingold | |||
| XML and its associated languages, namely DTD, XML Schema and XSLT, are of
tremendous importance for many applications, even if their semantics is often
incomplete and hard to understand. In this paper, we concentrate on
transformation languages and propose a new one, with an XML syntax, that focuses on
strong specifications. Since our language is completely defined by formal
semantics, conceptual drawbacks have been avoided and complexity has been
reduced. Thus, static type checking could easily be provided. Finally, we
combine our transformation language with our own schema language in order to
perform static typing. Keywords: XML, integration, schema, static type checking, transformation | |||