| XSLT working session | | BIBAK | Full-Text | 1 | |
| John Lumley; Jeni Tennison | |||
| Document engineering has been transformed by the development and adoption of
XML as a common form (a 'meta-language') for defining document representations
of many types. Recently, XSLT has matured significantly as a transformational
technology for XML-based documents that could have a similar effect on the design
of document processing tools. XSLT brings technologies from functional
programming to bear on producing robust programs firmly embedded in the context
of XML. We have new ideas and emphases on solution design, from very localised
techniques, tricks and algorithms right up to large-scale program suites.
XSLT's functional programming basis, its ability to mix push- and pull-driven
effects, lack of reassignable state and definition within XML, all present
challenges and opportunities to those both learning and exploiting its powers.
Many of the attendees at DocEng'06 will have some experience of using XSLT in their research and practice. In this working session we aim to share interesting and useful insights and exploitations of XSLT, through an extended panel session and 'audience participation'. The session leaders (John Lumley, Jeni Tennison) have extensive experience of using XSLT within document engineering. The session starts with subjects suggested by the presenters and continues with a variety of topics raised by attendees. Keywords: XSLT, functional programming | |||
| Every page is different: a new document type for commercial printing | | BIBAK | Full-Text | 2 | |
| Keith Moore | |||
| The Web has certainly demonstrated the power of personalization with offers
based on interests and prior purchasing behavior. Most studies indicate a 10-to-1
improvement in sales conversion when some degree of personalization is used in
printed marketing collateral. Until recently, the cost of providing
personalization in high-quality printed collateral has been prohibitive.
Delivering a combined campaign of web and print is a logistics nightmare
requiring parallel workflows, parallel content management, and custom
synchronization mechanisms.
While digital offset presses (such as the HP Indigo) can address fulfillment of high-quality personalized pages, the upstream work processes including layout, typography, content merge, proofing and color are struggling to capture and represent these new types of jobs. In this talk, I'll describe the capabilities of the new digital offset presses, some of the market forces that are driving the new types of pages, and highlight the challenges being faced in creating these new workflows. Keywords: augmented paper, campaign management, digital commercial printing,
personalization, variable data printing | |||
| Minimum sized text containment shapes | | BIBAK | Full-Text | 3-12 | |
| Nathan Hurst; Kim Marriott; Peter Moulder | |||
| In many text-processing applications, we would like shapes that expand (or
shrink) in size to fit their textual content. We address how to efficiently
compute the minimum size for such text shapes. A variant of this problem is to
take a fixed shape and determine the maximum font size that will still allow
the content to fit into it. Our approach is to model the problem as a
constrained optimisation problem with a single variable that controls the
geometry of the text shape. We use a variant of secant search to determine the
minimum area for the shape, guided by the area of the text. We represent the
shape by regions that are composed of trapezoids whose coordinates are a linear
function of the unknown variable. This allows us to use a novel linear time
algorithm (based on computing a Minkowski difference) that takes a trapezoid list
and text height and determines the region in which a line of text of that
height and some minimum width can start and still remain inside the shape. Keywords: adaptive layout, constrained optimization, textbox | |||
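The abstract above describes a secant search over a single variable that controls the shape's geometry. Below is a minimal sketch of that idea, not the authors' implementation: `available_area` is a hypothetical stand-in for the area computed from the trapezoid decomposition at scale `s`, and the text area is an invented figure.

```python
def secant_search(f, s0, s1, tol=1e-6, max_iter=50):
    """Find s with f(s) ~= 0 using the secant method."""
    f0, f1 = f(s0), f(s1)
    for _ in range(max_iter):
        if abs(f1 - f0) < 1e-12:
            break
        s2 = s1 - f1 * (s1 - s0) / (f1 - f0)
        if abs(s2 - s1) < tol:
            return s2
        s0, f0, s1, f1 = s1, f1, s2, f(s2)
    return s1

# Hypothetical example: a shape whose usable area grows quadratically with the
# scale parameter s; we search for the smallest s whose area matches the area
# required by the text.
text_area = 5000.0                       # e.g. square points of glyph coverage
available_area = lambda s: 80.0 * s * s  # stand-in for the trapezoid-list area

s_min = secant_search(lambda s: available_area(s) - text_area, 1.0, 2.0)
print(f"minimum scale: {s_min:.3f}")
```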
| Measuring aesthetic distance between document templates and instances | | BIBAK | Full-Text | 13-21 | |
| Alexis Cabeda Faria; Joao B. S. de Oliveira | |||
| Adaptive documents undergo many transformations during their generation,
including insertion and deletion of content. One major problem in this scenario
is the preservation of the aesthetic qualities of the document during those
transformations.
As adaptive documents are instances of a template, the aesthetic quality of an instance with respect to the template could be evaluated by aesthetic measures providing scores to any desired quality parameters. These parameters measure the deviation of the instance from the desired template. This evaluation could assure the quality of instances during their generation and final output. This paper introduces the use of document templates to support aesthetic measures of document instances. A score is assigned to a document instance according to the differences detected from the original template. Considering the original template as an ideal result, the quality of a document instance will decrease according to the number and severity of the changes applied to produce it. So, documents that are below a given threshold can be sent for further (possibly human) review, and any others are accepted. The amount of change with respect to the template will reflect the document quality, and in such a model the quality of instances can be considered as a distance from that original. Keywords: aesthetics, document, layout, measures | |||
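As a rough illustration of the threshold-based acceptance model the abstract describes, the sketch below combines hypothetical per-parameter deviations of an instance from its template into a single score; the parameter names, weights and threshold are invented, not taken from the paper.

```python
# Hypothetical aesthetic parameters: each maps to a weight and a measured
# deviation of the instance from the template (0 = identical, 1 = maximal).
weights = {"alignment": 0.4, "whitespace_balance": 0.3, "overflow": 0.3}
deviation = {"alignment": 0.10, "whitespace_balance": 0.25, "overflow": 0.0}

def quality_score(weights, deviation):
    """Score starts at 1.0 (identical to the template) and decreases with each change."""
    penalty = sum(weights[p] * deviation[p] for p in weights)
    return max(0.0, 1.0 - penalty)

score = quality_score(weights, deviation)
THRESHOLD = 0.8  # illustrative acceptance threshold
print("accept" if score >= THRESHOLD else "send for human review", round(score, 3))
```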
| Text block geometric shape analysis | | BIBAK | Full-Text | 22-24 | |
| Hui Chao | |||
| When a graphic artist designs a page, they envision a set of text blocks of
arbitrary shapes constrained by the page size and by image and graphics blocks
with wrap-around properties. We call this the intended shape. What is seen on
an actual page depends on the particular text content and on typographical
constraints such as natural text line breaking and justification. We call this
the apparent shape. Our goal is to create document templates by extracting the
text blocks' intended shapes from the apparent shapes. The main difficulty is
that when line justification is jagged, the intended block shape is obscured.
We solve this problem by analyzing the layout relation of all blocks on a page
and applying an iterative process to find the maximum likelihood of the
intended shapes. Keywords: document geometric layout analysis, page segmentation, template creation | |||
| Evaluating invariances in document layout functions | | BIBAK | Full-Text | 25-27 | |
| Alexander J. Macdonald; David F. Brailsford; John Lumley | |||
| With the development of variable-data-driven digital presses, where each
document printed is potentially unique, there is a need for pre-press
optimization to identify material that is invariant from document to document.
In this way rasterisation can be confined solely to those areas which change
between successive documents thereby alleviating a potential performance
bottleneck.
Given a template document specified in terms of layout functions, where actual data is bound at the last possible moment before printing, we look at deriving and exploiting the invariant properties of layout functions from their formal specifications. We propose future work on generic extraction of invariance from such properties for certain classes of layout functions. Keywords: SVG, XML, XSLT, document layout, optimisation | |||
| Solving the simple continuous table layout problem | | BIBAK | Full-Text | 28-30 | |
| Nathan Hurst; Kim Marriott; David Albrecht | |||
| Automatic table layout is required in web applications. Unfortunately, this
is NP-hard for reasonable layout requirements such as minimizing table height
for a given width. One approach is to solve a continuous relaxation of the
layout problem in which each cell must be large enough to contain the area of
its content. The solution to this relaxed problem can then guide the solution
to the original problem. We give a simple and efficient algorithm for solving
this continuous relaxation for the case that cells do not span multiple columns
or rows. The algorithm is not only interesting in its own right but also
because it provides insight into the geometry of table layout. Keywords: automatic table layout, constrained optimization | |||
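The relaxation described above replaces discrete layout with the requirement that each cell be large enough to hold the area of its content; with no spanning cells, this becomes a small constrained optimisation over column widths and row heights. The sketch below solves a toy instance numerically with SciPy rather than with the authors' specialised algorithm; the cell areas and table width are made up.

```python
import numpy as np
from scipy.optimize import minimize

# areas[i][j]: minimum area needed by the cell in row i, column j (illustrative)
areas = np.array([[120.0,  40.0, 200.0],
                  [ 60.0, 180.0,  90.0]])
TABLE_WIDTH = 60.0
n_rows, n_cols = areas.shape

# Decision variables: x = [column widths..., row heights...]
def total_height(x):
    return np.sum(x[n_cols:])

constraints = [{"type": "eq", "fun": lambda x: np.sum(x[:n_cols]) - TABLE_WIDTH}]
for i in range(n_rows):
    for j in range(n_cols):
        # Each cell must be big enough for its content: width_j * height_i >= area_ij
        constraints.append({"type": "ineq",
                            "fun": lambda x, i=i, j=j: x[j] * x[n_cols + i] - areas[i, j]})

x0 = np.concatenate([np.full(n_cols, TABLE_WIDTH / n_cols), np.full(n_rows, 10.0)])
result = minimize(total_height, x0, method="SLSQP",
                  bounds=[(1e-3, None)] * (n_cols + n_rows), constraints=constraints)
widths, heights = result.x[:n_cols], result.x[n_cols:]
print("column widths:", np.round(widths, 2), "row heights:", np.round(heights, 2))
```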
| COG Extractor | | BIBAK | Full-Text | 31 | |
| Steven R. Bagley | |||
| The Component Object Graphic (COG) model describes documents as a series of
distinct, encapsulated graphical blocks (termed COGs) that are positioned on
the page rather than the traditional approach (taken by formats such as
PostScript, PDF and SVG) of describing each page as one monolithic block of
drawing operators that create marks on the page to form the content. Previous
work [1, 2] has demonstrated how this paradigm can be implemented on top of
both PDF and SVG.
Tools have previously been created which allow the creation of COG documents [1, 3] and manipulation of existing COGs to form new documents [1, 4]. Missing from the COG toolkit has been a method of producing COGs from already existing content. The proposed solution is a new tool entitled 'COG Extractor' that allows the user to select an area of an existing document to be extracted and converted into a COG in PDF or SVG form. The first step is to get the document into a format that can easily be understood. The most sensible choice is PDF since there are several tools that can convert other formats into PDF. It is then necessary to parse the PDF content stream for the page to be extracted and to derive the meaning of the operators in the page's content stream. PDF's operators can be divided into two classes: those that image content, and those that define the state (such as fill colour, line width) that content is imaged with. This causes problems since each imaging operator depends on the cumulative effect of all previous state operators. It is therefore necessary to work out exactly which operators are responsible for drawing the content. This is achieved by combining consecutive state operators together to form a state-change object, which encapsulates all changes in state at that point. Everything following the state-change object in the PDF content stream then becomes a child of that state-change object. This produces a tree representation of the document where the drawing operators form the leaf nodes. The state of any drawing operator can then be calculated by walking back along its parents in the tree. From this tree it is then possible to calculate the bounding box of each drawing operator and see if it intersects with the area to be extracted. If they do not intersect, then that drawing operator can be pruned from the tree. An optimization phase can remove any state-change nodes that are no longer required (e.g. nodes for which all children have been removed). This tree can then be exported as a COG in PDF or SVG format for future use. Keywords: COGs, FormXObject, PDF, SVG, graphic objects | |||
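The abstract above outlines a concrete pipeline: fold consecutive state operators into state-change nodes, hang the following drawing operators under them, then prune any drawing operator whose bounding box misses the selected region. The sketch below illustrates that tree-building and pruning step only; the operator records and the intersection test are simplified stand-ins, not a real PDF content-stream parser.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                         # "state" or "draw"
    data: dict
    children: list = field(default_factory=list)

def build_tree(operators):
    """Fold runs of state operators into one state-change node; the operators
    that follow become its children (a flattened version of the idea above)."""
    root = Node("state", {})
    current = root
    pending_state = {}
    for op in operators:
        if op["type"] == "state":
            pending_state.update(op["changes"])
        else:
            if pending_state:
                current = Node("state", pending_state)
                root.children.append(current)
                pending_state = {}
            current.children.append(Node("draw", op))
    return root

def intersects(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def prune(node, region):
    """Drop drawing leaves outside the region, then drop emptied state nodes."""
    node.children = [c for c in node.children
                     if (c.kind == "draw" and intersects(c.data["bbox"], region))
                     or (c.kind == "state" and prune(c, region).children)]
    return node

# Illustrative content stream: two state runs and three drawing operators.
ops = [
    {"type": "state", "changes": {"fill": "black"}},
    {"type": "draw", "bbox": (0, 0, 50, 20)},
    {"type": "state", "changes": {"fill": "red", "linewidth": 2}},
    {"type": "draw", "bbox": (200, 200, 260, 240)},
    {"type": "draw", "bbox": (40, 10, 120, 60)},
]
tree = prune(build_tree(ops), region=(0, 0, 100, 100))
```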
| Carrot and stick: combining information retrieval models | | BIBK | Full-Text | 32 | |
| Hans Friedrich Witschel | |||
Keywords: language models, weighting schemes | |||
| Standards based high resolution document editing using low resolution proxy images | | BIBAK | Full-Text | 33 | |
| Michael Gormish; Edward L. Schwartz | |||
| This poster presents an implementation of the JPIP-JPM standard which allows
low bandwidth access to remote high resolution documents. The implementation
goes beyond the standard by allowing simple editing operations at the client
without obtaining all of the compressed data. These edits form a new version of
the document that can be easily uploaded over low bandwidth connections. Keywords: JPEG2000, JPIP, JPM, partial image transmission | |||
| Print-n-link: weaving the paper web | | BIBAK | Full-Text | 34-43 | |
| Moira C. Norrie; Beat Signer; Nadir Weibel | |||
| Citations form the basis for a web of scientific publications. Search
engines, embedded hyperlinks and digital libraries all simplify the task of
finding publications of interest on the web and navigating to cited
publications or web sites. However, the actual reading of publications often
takes place on paper and frequently on the move. We present a system,
Print-n-Link, that uses technologies for interactive paper to enhance the
reading process by enabling users to access digital information and/or search
for cited documents from a printed version of a publication using a digital pen
for interaction. A special virtual printer driver automatically generates links
from paper to digital services during the printing process based on an analysis
of PDF documents. Depending on the user setting and interaction gesture, the
system may retrieve metadata about the citation and inform the user through an
audio channel or directly display the cited document on the user's screen. Keywords: citation management, digital library, document integration, interactive
paper | |||
| Knowledge engineering from frontline support to preliminary design | | BIBAK | Full-Text | 44-52 | |
| Sylvia C. Wong; Richard M. Crowder; Gary B. Wills; Nigel R. Shadbolt | |||
| The design and maintenance of complex engineering systems such as a jet
engine generates a significant amount of documentation. Increasingly, aerospace
manufacturers are shifting their focus from selling products to providing
services. As a result, when designing new engines, engineers must increasingly
consider the life-cycle requirements in addition to design parameters. To
identify possible areas of concern, engineers must obtain knowledge gained from
the entire life of an engine. However, because of the size and distributed
nature of a company's operation, engineers often do not have access to
front-line maintenance data. In addition, the sheer number of documents accrued
makes thorough examination impossible. This paper presents a prototype
knowledge-based document repository for such an application. It searches and
analyzes distributed document resources, and provides engineers with a summary
view of the underlying knowledge. The aim is to aid engineers in creating
design requirement documents that incorporate aftermarket issues. Unlike
existing document repositories and digital libraries, our approach is
knowledge-based, where users browse summary reports instead of following
suggested links. To test the validity of our proposed architecture, we have
developed and deployed a working prototype. The prototype has been demonstrated
to engineers and received positive reviews. Keywords: intelligent documents, semantic web, service-oriented architecture | |||
| An XML interaction service for workflow applications | | BIBAK | Full-Text | 53-55 | |
| Y. S. Kuo; Lendle Tseng; Hsun-Cheng Hu; N. C. Shih | |||
| Interactions with human users are a crucial part of many workflow
applications. In workflow or business process management specifications, such
as WSBPEL, all data are represented as XML. As a consequence, many human tasks
simply require users to create or update XML data compliant with a schema. We
propose the XML interaction service as a generic Web service for use by human
tasks that provides HTML form-based Web interfaces for users to interact with
and update schema-compliant XML data. In addition, a visual interface design
tool is provided for interface designers to create and customize the Web
interfaces an XML interaction service supports. Keywords: XML, user interface, workflow management | |||
| Engineering better voting systems | | BIBAK | Full-Text | 56-58 | |
| Bertrand Haas | |||
| We consider here an election ballot as a document, a document that works as
the carrier of a voter's choice in an election's accounting system for
determining a winning candidate. And we consider a voting system as a way to
manage both the document and its flow in compliance with the requirements of
the election. Trustworthy elections are the core of a democratic spirit and
engineering a voting system with requirements of convenience, privacy,
integrity and reliability lies at the core of trustworthy elections. We define
here some clear requirements for a trustworthy voting system and analyze the
most popular classes of voting systems according to some of these requirements.
We draw conclusions on the engineering of better voting systems and show two
efforts in this direction. Keywords: ballots, electronic, mail, security, voting | |||
| Preservation-centric and constraint-based migration of digital documents | | BIBAK | Full-Text | 59-61 | |
| Thomas Triebsees; Uwe M. Borghoff | |||
| We introduce a framework that supports archivists in planning and running
migrations. The central idea is that -- once relevant information pieces of
digital documents are modeled -- desired migration results can be specified by
means of preservation constraints. From these constraint specifications we are
able to derive migration algorithms that provably respect a set of document
properties before (pre-conditions) and after migration (post-conditions).
The underlying concept is that of Abstract State Machines (ASMs) modeling archival
states. Migrations are modeled as sequences of basic operations that change the
archive's state while respecting user-defined constraints. Among others, our
target scenarios comprise legal and medical documents where considerable
property changes cannot be tolerated and where constraint preservation must
hold over a long period of time. Keywords: digital archive, migration, model, preservation | |||
| Combining linguistic and structural descriptors for mining biomedical literature | | BIBAK | Full-Text | 62-64 | |
| Nadia Zerida; Nadine Lucas; Bruno Crémilleux | |||
| This work proposes an original combination of linguistic and structural
descriptors to represent the content of biomedical papers. The objective is to
show the effectiveness of descriptors taking into account the structure of
documents to characterise three kinds of biomedical texts (reviews, research
and clinical papers). The description of text is made at various levels, from
the global level to the local one. These contexts make it possible to
characterise the three classes. The characterisation of the textual resources
is carried out quantitatively by using the discriminating capacity of
techniques of data mining based on emerging patterns. Keywords: categorisation, characterisation, text mining preprocessing | |||
| Comparing XML path expressions | | BIBAK | Full-Text | 65-74 | |
| Pierre Genevès; Nabil Layaïda | |||
| XPath is the standard declarative language for navigating XML data and
returning a set of matching nodes. In the context of XSLT/XQuery analysis,
query optimization, and XML type checking, XPath decision problems arise
naturally. They notably include XPath comparisons such as equivalence (whether
two queries always return the same result), and containment (whether for any
tree the result of a particular query is included in the result of a second
one).
XPath decision problems have attracted a lot of research attention, especially for studying the computational complexity of various XPath fragments. However, what is missing at present is the constructive use of an expressive logic which would allow capturing these decision problems, while providing practically effective decision procedures. In this paper, we propose a logic-based framework for the static analysis of XPath. Specifically, we propose the alternation free modal μ-calculus with converse as the appropriate logic for effectively solving XPath decision problems. We present a translation of a large XPath fragment into μ-calculus, together with practical experiments on the containment using a state-of-the-art EXPTIME decision procedure for μ-calculus satisfiability. These preliminary experiments shed light, for the first time, on the cost of checking the containment in practice. We believe they reveal encouraging results for further static analysis of XML transformations. Keywords: XPath, analysis, experimentation | |||
| Fast and simple XML tree differencing by sequence alignment | | BIBAK | Full-Text | 75-84 | |
| Tancred Lindholm; Jaakko Kangasharju; Sasu Tarkoma | |||
| With the advent of XML we have seen a renewed interest in methods for
computing the difference between trees. Methods that include heuristic elements
play an important role in practical applications due to the inherent complexity
of the problem. We present a method for differencing XML as ordered trees based
on mapping the problem to the domain of sequence alignment, applying simple and
efficient heuristics in this domain, and transforming back to the tree domain.
Our approach provides a method to quickly compute changes that are meaningful
transformations on the XML tree level, and includes subtree move as a primitive
operation. We evaluate the feasibility of our approach and benchmark it against
a selection of existing differencing tools. The results show our approach to be
feasible and to have the potential to perform on par with tools of a more
complex design in terms of both output size and execution time. Keywords: XML, differencing, move, ordered tree, sequence alignment | |||
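The general mapping described above (serialise each ordered tree to a sequence, align the sequences, and read edits off the alignment) can be illustrated with Python's standard difflib. This is a minimal sketch and does not reproduce the paper's heuristics; in particular, it omits the subtree-move primitive.

```python
import difflib
import xml.etree.ElementTree as ET

def to_sequence(elem, depth=0):
    """Flatten an ordered tree into (depth, tag, text) tokens in document order."""
    tokens = [(depth, elem.tag, (elem.text or "").strip())]
    for child in elem:
        tokens.extend(to_sequence(child, depth + 1))
    return tokens

old = ET.fromstring("<doc><p>one</p><p>two</p><p>three</p></doc>")
new = ET.fromstring("<doc><p>one</p><p>2</p><p>three</p><p>four</p></doc>")

a, b = to_sequence(old), to_sequence(new)
matcher = difflib.SequenceMatcher(a=a, b=b)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, a[i1:i2], "->", b[j1:j2])
```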
| Filtering XML documents using XPath expressions and aspect-oriented programming | | BIBAK | Full-Text | 85-87 | |
| Ermir Qeli; Bernd Freisleben | |||
| In this paper, we present the design and implementation of a filtering
approach for XML documents which is based on XPath expressions and
Aspect-Oriented Programming (AOP). The class of XPath expressions used allows
for branching, wildcards and descendant relationships between nodes. For the
embedding of simple paths into XPath expressions, a dynamic programming
approach is proposed. The AOP paradigm, which provides a means for
encapsulating crosscutting concerns in software, is introduced to integrate the
filtering approach in the broader context of event-based parsing of XML
documents using SAX. Keywords: SAX, XML, XPath, aspect-oriented programming | |||
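As a plain-Python illustration of the event-based filtering the paper combines with SAX (without the aspect-oriented weaving), the handler below tracks the current element path and emits only fragments matched by a simple descendant-style pattern. The sample document and the path steps are invented for the example.

```python
import xml.sax

class PathFilter(xml.sax.ContentHandler):
    """Print text of elements whose path ends with the target steps: a crude
    stand-in for the descendant ('//') patterns discussed in the paper."""
    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.stack = []
        self.buffer = None

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack[-len(self.steps):] == self.steps:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.buffer is not None and self.stack[-len(self.steps):] == self.steps:
            print("/".join(self.stack), "->", "".join(self.buffer).strip())
            self.buffer = None
        self.stack.pop()

doc = b"<orders><order><item><name>pen</name></item></order></orders>"
xml.sax.parseString(doc, PathFilter(["item", "name"]))
```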
| Customizable detection of changes for XML documents using XPath expressions | | BIBAK | Full-Text | 88-90 | |
| Ermir Qeli; Julinda Gllavata; Bernd Freisleben | |||
| Change detection in XML documents is an important task in the context of
query systems. In this paper, we present CustX-Diff, a customizable change
detection approach for XML documents based on X-Diff [6]. CustX-Diff performs
the change detection operation simultaneously with the XPath-based filtering of
XML document parts. The class of XPath expressions used is the tree patterns
subset of XPath. For the embedding of simple paths into XPath expressions
during the difference operation, a dynamic programming approach is proposed.
Comparative performance results with respect to the original X-Diff [6]
approach demonstrate the efficiency of the proposed method. Keywords: XML, XPath, change detection | |||
| Processing XML documents with pipelines | | BIBAK | Full-Text | 91 | |
| Jeni Tennison | |||
| We live in a world where documents do not simply exist in the form we need
them. Getting from stored form to final presentation is often a complex
process. Documents have to be generated, validated and transformed, queried,
split, merged, filtered, annotated, restructured, translated and rendered. We
need it to be easy to define the set of processes that the document has to go
through. And we need it to be fast, at run time, to do this processing.
The XML Processing Model Working Group is developing a markup language for defining pipelines of processes that XML documents may go through. In this talk I shall describe how pipelines can be put together using the XProc language and why even simple processing can benefit from being organised in that way. Keywords: XML, pipeline, processing, transformation | |||
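XProc itself is an XML vocabulary, so the example below does not show XProc markup; it merely illustrates the underlying idea of a declarative chain of steps a document flows through, expressed in Python with lxml. The stylesheet and input document are toy examples invented for the sketch.

```python
from lxml import etree

XSLT = etree.XSLT(etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/doc">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <xsl:template match="para">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>"""))

# Each step takes the previous step's output and returns the next stage's input.
pipeline = [
    lambda text: etree.fromstring(text),                 # parse
    lambda doc: XSLT(doc),                               # transform
    lambda doc: etree.tostring(doc, pretty_print=True),  # serialise
]

def run(pipeline, document):
    for step in pipeline:
        document = step(document)
    return document

print(run(pipeline, "<doc><para>Hello, pipelines.</para></doc>").decode())
```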
| Modeling context information for capture and access applications | | BIBAK | Full-Text | 92-94 | |
| Maria G. Pimentel; Laercio Baldochi; Ethan V. Munson | |||
| The Contextractor is an XSLT-based transformation system that gathers
information from extended UML models to produce XML Schemas that model the
information captured by an application and define a query language that allows
the submission of queries over the captured content. Keywords: UML models, XMI, XML schema, XSLT, capture and access applications,
ubiquitous computing | |||
| Resolving layout interdependency with presentational variables | | BIBAK | Full-Text | 95-97 | |
| John Lumley; Roger Gimson; Owen Rees | |||
| In the construction of variable data documents, the layout of component
parts to build a composite section with heterogeneous layout functions can be
implemented by a tree-evaluating layout processor. This handles many cases with
well-scoped structure very smoothly but becomes complex when layout
relationships between components cut across a strict tree. We present an
approach for XML-described layouts based on a post-rendering set of
single-assignment variables, analogous to XSLT, that can make this much easier,
does not compromise layout extensibility and can be a target for automated
interdependency analysis and generation. This is the approach used in the
layout processor associated with the Document Description Framework (DDF). Keywords: SVG, XSLT, document construction, functional programming | |||
| Meta-algorithmic systems for document classification | | BIBAK | Full-Text | 98-106 | |
| Steven J. Simske; David W. Wright; Margaret Sturgill | |||
| To address cost and regulatory concerns, many businesses are converting
paper-based elements of their workflows into fully electronic flows that use
the content of the documents. Scanning the document contents into workflows,
however, is a manual, error-prone, and costly process especially when the data
extraction process requires high accuracy. These manual costs are a primary
barrier to widespread adoption of distributed capture solutions for business
critical workflows such as insurance claims, medical records, or loan
applications. Software solutions using artificial intelligence and natural
language processing techniques are emerging to address these needs, but each
has its individual strengths and weaknesses, and none has demonstrated a
high level of accuracy across the many unstructured document types included in
these business critical workflows. This paper describes how to overcome many of
these limitations by intelligently combining multiple approaches for document
classification using meta-algorithmic design patterns. These patterns explore
the error space in multiple engines, and provide improved and "emergent"
results in comparison to voting schemes and to the output of any of the
individual engines. This paper considers the results of the individual engines
along with traditional combinatorial techniques such as voting, before
describing prototype results for a variety of novel meta-algorithmic patterns
that reduce individual document error rates by up to 13% and reduce system
error rates by up to 38%. Keywords: confusion matrix, document classification, document indexing, engine
combination, meta-algorithmics | |||
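As a simple point of comparison with the patterns the paper evaluates, the sketch below combines several hypothetical classification engines by weighting each engine's vote by its historical per-class precision (derived offline from a confusion matrix). This is ordinary weighted voting, not one of the paper's meta-algorithmic patterns, and the engine names and figures are invented.

```python
from collections import defaultdict

# Hypothetical per-engine precision by predicted class, taken from each
# engine's confusion matrix on a validation set.
precision = {
    "engine_a": {"invoice": 0.92, "claim": 0.60, "record": 0.75},
    "engine_b": {"invoice": 0.70, "claim": 0.88, "record": 0.65},
    "engine_c": {"invoice": 0.80, "claim": 0.72, "record": 0.90},
}

def combine(predictions):
    """predictions: {engine_name: predicted_class}. Weight each vote by the
    engine's precision for the class it predicted and return the top class."""
    scores = defaultdict(float)
    for engine, label in predictions.items():
        scores[label] += precision[engine][label]
    return max(scores, key=scores.get)

print(combine({"engine_a": "invoice", "engine_b": "claim", "engine_c": "claim"}))
```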
| Content based SMS spam filtering | | BIBAK | Full-Text | 107-114 | |
| José María Gómez Hidalgo; Guillermo Cajigas Bringas; Enrique Puertas Sánz; Francisco Carrero García | |||
| In recent years, we have witnessed a dramatic increase in the volume of
spam email. Other related forms of spam are increasingly emerging as important
problems, especially spam on Instant Messaging services (the so-called SPIM)
and Short Message Service (SMS) or mobile spam.
Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of message representation techniques and machine learning algorithms in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam. Keywords: Bayesian filter, junk, receiver operating characteristic, spam | |||
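A minimal sketch of the kind of Bayesian filtering the paper transfers from email to SMS, using scikit-learn's multinomial Naive Bayes on a handful of invented messages; the paper's actual test collections are far larger and the feature engineering richer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "WIN a FREE prize now, text YES to claim",   # spam
    "Congratulations! You have been selected",   # spam
    "are we still meeting for lunch today?",     # ham
    "running late, see you at the station",      # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free prize text now", "see you at lunch"]))
```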
| Application of syntactic properties to three-level recognition of Polish hand-written medical texts | | BIBAK | Full-Text | 115-121 | |
| Grzegorz Godlewski; Maciej Piasecki; Jerzy Sas | |||
| This paper presents three-level hand-writing recognition that uses syntactic
properties of the language on the upper level. Isolated characters are recognized
on the lowest level. The character classification from the lowest level is used
in word recognition. Words are recognized using a combined classifier based on a
possibly incomplete unigram lexicon. The word classifier builds a ranking of the
most likely words. The rankings created for subsequent words are input to the
syntactic classifier, which recognizes whole sentences. Here, local syntactic
constraints are used to build a syntactically consistent sentence. The method
has been applied to the recognition of hand-written medical texts describing fixed
aspects of patient treatment. Because of the narrow range of topics covered in the
texts and the peculiarities of physicians' writing style, the syntax of the
expected sentences is relatively simple, which makes checking syntactic
consistency easier. Keywords: HMM, OCR, PoS tagging, hand-written text, Polish | |||
| Quality enhancement in information extraction from scanned documents | | BIBAK | Full-Text | 122-124 | |
| Atsuhiro Takasu; Kenro Aihara | |||
| When constructing a large document archive, an important element is the
digitizing of printed documents. Although various techniques for document image
analysis such as Optical Character Recognition (OCR) have been developed, error
handling is required in constructing real document archive systems. This paper
discusses the problem from the quality enhancement perspective and proposes a
robust reference extraction method for academic articles scanned with OCR
mark-up. We applied the proposed method to articles appearing in various
journals, and these experiments showed that the proposed method achieved a
recognition accuracy of more than 94%. This paper also discusses manual
correction and investigates experimentally the relationship between extraction
accuracy and cost reduction. Keywords: digital libraries, document capturing, statistical model | |||
| Document annotation by active learning techniques | | BIBAK | Full-Text | 125-127 | |
| Loïc Lecerf; Boris Chidlovskii | |||
| We present a system for the semantic annotation of layout-oriented
documents, with an integrated learning component. We introduce probabilistic
learning methods on tree-like documents and we present different active
learning techniques for training document annotation models. We report some
preliminary results of deploying such active learning techniques on an
important case of document collection annotation. Keywords: active learning, maximum entropy, semantic annotation | |||
| NEWPAR: an automatic feature selection and weighting schema for category ranking | | BIBAK | Full-Text | 128-137 | |
| Fernando Ruiz-Rico; Jose Luis Vicedo; María-Consuelo Rubio-Sánchez | |||
| Category ranking provides a way to classify plain text documents into a
pre-determined set of categories. This work proposes to have a look at typical
document collections and analyze which measures and peculiarities can help us
to represent documents so that the resulting features are as
discriminative and representative as possible. Considerations such as selecting
only nouns and adjectives, taking expressions rather than words, and using
measures like term length are combined into a simple feature selection and
weighting method to extract, select and weight special n-grams. Several
experiments are performed to prove the usefulness of the new schema with
different data sets (Reuters and OHSUMED) and two different algorithms (SVM and
a simple sum of weights). After evaluation, the new approach outperforms some
of the best known and most widely used categorization methods. Keywords: SVM, category ranking, machine learning, text categorization, text
classification | |||
| Extending the single words-based document model: a comparison of bigrams and 2-itemsets | | BIBAK | Full-Text | 138-146 | |
| Roman Tesar; Vaclav Strnad; Karel Jezek; Massimo Poesio | |||
| The basic approach in text categorization is to represent documents by
single words. However, often other features are utilized to achieve better
classification results. In this paper, our attention is focused on bigrams and
2-itemsets. We compare the performance improvement in terms of classification
accuracy when these features are used to extend the single words-based document
representation on two standard text corpora: Reuters-21578 and 20 Newsgroups.
For this comparison we use the multinomial Naive Bayes classifier and five
different feature selection approaches. Algorithms for bigrams and 2-itemsets
discovery are presented as well. Our results show a statistically significant
improvement when bigrams and also 2-itemsets are incorporated. However, in the
case of 2-itemsets it is important to use an appropriate feature selection
method. On the other hand, even when a simple feature selection approach is
applied to discover bigrams the classification accuracy improves. The
conclusion is that, in our case, it is not very effective to extend document
representation with 2-itemsets because bigrams achieve better results and
discovering them is less resource-consuming. Keywords: bigrams, comparison, document model, feature selection, itemsets, machine
learning, n-grams, text categorization | |||
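The comparison above hinges on how the document representation is built. The snippet below shows, in scikit-learn terms, the difference between a single-word representation and one extended with bigrams, plus a chi-squared ranking of the extended features. It is an illustration of the feature sets only, not a reproduction of the paper's experiments; the corpus and class labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["interest rates rise again",
        "central bank holds interest rates",
        "the team won the cup final",
        "cup final goes to extra time"]
labels = [0, 0, 1, 1]   # 0 = finance, 1 = sport (toy classes)

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
mixed = CountVectorizer(ngram_range=(1, 2)).fit(docs)   # single words plus bigrams
print(len(unigrams.get_feature_names_out()), "unigram features")
print(len(mixed.get_feature_names_out()), "unigram+bigram features")

# Rank the extended feature set by how well it separates the two classes.
X = mixed.transform(docs)
scores, _ = chi2(X, labels)
top = sorted(zip(mixed.get_feature_names_out(), scores), key=lambda p: -p[1])[:5]
print(top)
```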
| Describing and querying hierarchical XML structures defined over the same textual data | | BIBAK | Full-Text | 147-154 | |
| Emmanuel Bruno; Elisabeth Murisasco | |||
| Our work aims at representing and querying hierarchical XML structures
defined over the same textual data. We call such data "multistructured textual
documents".
Our objectives are twofold. First, we shall define a suitable -- XML compatible -- data model enabling (1) to describe several independent hierarchical structures over the same textual data (represented by several XML structured documents) (2) to consider user annotations added in each structured document. Our proposal is based on the use of hedges (the foundation of the grammar language RelaxNG). Secondly, we shall propose an extension of XQuery in order to query structures and content in a concurrent way. We shall apply our proposals using a literary text written in old French. Keywords: XML, XQuery, multistructure, textual documents, tree-like structure | |||
| Describing multistructured XML documents by means of delay nodes | | BIBAK | Full-Text | 155-164 | |
| Jacques Le Maitre | |||
| Multistructured documents are documents whose structure is composed of a set
of concurrent hierarchical structures. In this paper, we propose a new model of
multistructured documents and we show how to translate its instances into XML
using a new kind of nodes: delay nodes, which we propose to add to the XDM
model on which XPath and XQuery are based. A delay node is the virtual
representation, by an XQuery query, of some of the children of its parent.
The interest of delay nodes for managing multistructured documents is that they allow
several nodes to virtually share their child nodes. In this way, it is
possible to query, with XPath or XQuery, multistructured documents described in
XML as if their different structures were really concurrent. Finally, we
compare our model with the GODDAG-based model and the multicolored trees (MCT)
model. Keywords: XML, XQuery, lazy evaluation, multistructured documents | |||
| Live editing of hypermedia documents | | BIBAK | Full-Text | 165-172 | |
| Romualdo Monteiro de Resende Costa; Márcio Ferreira Moreno; Rogério Ferreira Rodrigues; Luiz Fernando Gomes Soares | |||
| In some hypermedia system applications, like interactive digital TV
applications, authoring and presentation of documents may have to be done
concomitantly. This is the case of live programs, where not only some contents
are not known a priori, but also some temporal and spatial relationships, among
program media objects, may have to be established after the unknown content
definition. This paper proposes a method for hypermedia document live editing,
preserving not only the presentation semantics but also the logical structure
semantics defined by an author. To validate this proposal, an implementation
has been done for the Brazilian Digital TV System, which is also presented. Keywords: NCL, SBTVD, declarative middleware, ginga, interactive digital TV | |||
| The LimSee3 multimedia authoring model | | BIBAK | Full-Text | 173-175 | |
| Romain Deltour; Cécile Roisin | |||
| For most users, authoring multimedia documents remains a complex task. One
solution to deal with this problem is to provide template-based authoring tools
but with the drawback of limited functionality. In this paper we propose a
document model dedicated to the creation of authoring tools using templates
while keeping rich composition capabilities. It is based on a
component-oriented approach that homogeneously integrates logical, temporal and spatial
structures. Templates are defined as constraints on these structures. Keywords: document authoring, document models, multimedia documents, template-based
editing | |||
| Benefits of structured multimedia documents in IDTV: the end-user enrichment system | | BIBAK | Full-Text | 176-178 | |
| Pablo Cesar; D. C. A. Bulterman; A. J. Jansen | |||
| This paper presents a system that exploits the benefits of modelling
multimedia presentations as structured documents within the context of
interactive digital television systems. Our work permits end-users to easily
enrich multimedia content at viewing time (e.g., add images and delete scenes).
Because the document is structured, the system can expose to the user the
possible enrichment alternatives depending on the current state of the
presentation (e.g., current story). Moreover, because the base content is
wrapped as a structured document, the enrichments can be modelled as overlying
layers that do not alter the original content. Finally, the user can share the
enriched content (or parts of it) with specific peers within a P2P network. Keywords: SMIL, content enrichment, structured multimedia documents | |||
| From video to photo albums: digital publishing workflow for automatic album creation | | BIBAK | Full-Text | 179-181 | |
| Parag Mulendra Joshi; C. Brian Atkins; Tong Zhang | |||
| The revolution in consumer electronics for capturing video has been followed
by an explosion of video content. However, meaningful consumption models of
such rich media for nonprofessional users are still emerging. In contrast to
those of video cameras, the consumption models for output of still cameras have
been long established and are considerably simpler. The output of a still
camera is an image of sufficiently high quality and high resolution for a good
quality production on paper. Due to their ease of use, mobility, high quality and
simplicity, paper photographs are still incomparable in terms of overall human
experience. Video content, on the other hand, is not as easy to use.
Consuming video content requires computers and/or video display devices,
so it cannot be instantaneously displayed or shared. Rendition on paper is
much more complex for video content than for still camera images. In
contrast with the simplicity of using still cameras, video camera output has
to be edited on a computer: key frames with good visual quality have to be
manually extracted, digitally edited and prepared for printing before usable,
good-quality photographs are obtained. Because of the complexity of video content, users
often prefer to take still pictures instead of recording video clips. In this
paper we describe an approach to construct an end-to-end digital publishing
workflow system that automatically composes visually appealing photo albums
with high quality photographic images from video content input. Keywords: WSBPEL, digital publishing, multimedia, photo album, video, web services,
web-to-print, workflow | |||
| SmartPublisher: document creation on pen-based systems via document element reuse | | BIBAK | Full-Text | 182-183 | |
| Fabrice Matulic | |||
| SmartPublisher is a powerful, all-in-one application for pen-based devices
with which users can quickly and intuitively create new documents by reusing
individual image and text elements acquired from analogue and/or digital
documents. The application is especially targeted at scanning devices with
touch screen operating panels or tablet PCs connected to them (e.g. modern
multifunction printers with large touch screen displays), as one of its main
purposes is reuse of material obtained from scanned paper documents. Keywords: GUI, document creation and editing, reuse, scanned document | |||
| ALDAI: active learning documents annotation interface | | BIBA | Full-Text | 184-185 | |
| Boris Chidlovskii; Jérôme Fuselier; Loïc Lecerf | |||
| In the framework of the LegDoC project at XRCE, we present a document annotation interface with an integrated active learning component. The interface is designed for automating the annotation of layout-oriented documents with semantic labels. We describe the core functionalities of the interface, paying particular attention to the learning component, including feature management and different strategies for deploying the active learner in document annotation. | |||
| The ambulant annotator: empowering viewer-side enrichment of multimedia content | | BIBAK | Full-Text | 186-187 | |
| Pablo Cesar; Dick C. A. Bulterman; A. J. Jansen | |||
| This paper presents a set of demos that allow viewer-side enrichment of
multimedia content in a home setting. The most relevant features of our system
are the following: passive authoring of content, as opposed to the
traditional active authoring on a PC; preservation of the base content; and
collaborative authoring (e.g., to share the enriched material with a peer
group). These requirements are met by modelling television content as
structured multimedia documents using SMIL 2.1. Keywords: SMIL, content enrichment, structured multimedia documents | |||
| Templates, microformats and structured editing | | BIBAK | Full-Text | 188-197 | |
| Francesc Campoy Flores; Vincent Quint; Irène Vatton | |||
| Microformats and semantic XHTML add semantics to web pages while taking
advantage of the existing (X)HTML infrastructure. This approach enables new
applications that can be deployed smoothly on the web. But there is currently
no way to describe rigorously this type of markup and authors of web pages have
very little help for creating and encoding semantic markup. A language that
addresses these issues is presented in this paper. Its role is to specify
semantically rich XML languages in terms of other XML languages, such as XHTML.
The language is versatile enough to represent templates that can capture the
overall structure of large documents as well as the fine details of a
microformat. It is supported by an editing tool for producing documents encoded
in a semantically rich markup language, still fully compatible with XHTML. Keywords: document authoring, document models, document templates, microformats,
semantic XHTML, structure editing, world wide web | |||
| The portrait of a common HTML web page | | BIBAK | Full-Text | 198-204 | |
| Ryan Levering; Michal Cutler | |||
| Web pages are not purely text, nor are they solely HTML. This paper surveys
HTML web pages, focusing not only on textual content but also on higher-order
visual features and supplementary technology. Using a crawler with an
in-house developed rendering engine, data on a pseudo-random sample of web
pages is collected. First, several basic attributes are collected to verify the
collection process and confirm certain assumptions on web page text. Next, we
take a look at the distribution of different types of page content (text,
images, plug-in objects, and forms) in terms of rendered visual area. Those
different types of content are broken down into a detailed view of the ways in
which the content is used. This includes a look at the prevalence and usage of
scripts and styles. We conclude that more complex page elements play a
significant and underestimated role in the visually attractive, media rich, and
highly interactive web pages that are currently being added to the World Wide
Web. Keywords: CSS, HTML, feature, JavaScript, script, style, survey, visual, world wide
web | |||
| Mash-o-matic | | BIBAK | Full-Text | 205-214 | |
| Sudarshan Murthy; David Maier; Lois Delcambre | |||
| Web applications called mash-ups combine information of varying granularity
from different, possibly disparate, sources. We describe Mash-o-matic, a
utility that can extract, clean, and combine disparate information fragments,
and automatically generate data for mash-ups and the mash-ups themselves. As an
illustration, we generate a mash-up that displays a map of a university campus,
and outline the potential benefits of using Mash-o-matic. Mash-o-matic exploits
superimposed information (SI), which is new information and structure created
in reference to fragments of existing information. Mash-o-matic is implemented
using middleware called the Superimposed Pluggable Architecture for Contexts
and Excerpts (SPARCE), and a query processor for SI and referenced information,
both parts of our infrastructure to support SI management. We present a
high-level description of the mash-up production process and discuss in detail
how Mash-o-matic accelerates that process. Keywords: SPARCE, bi-level information, document transformation, mash-up, sidepad,
superimposed information | |||
| Profile-based web document delivery | | BIBAK | Full-Text | 215-217 | |
| Roger Stone; Jatinder Dhiensa; Colin Machin | |||
| This work originated from considering the needs of visually impaired users but
may have wider application. A profile captures some key descriptors or
preferences of a user and their browsing device. Individual users may maintain
any number of profiles which they can edit for use in different situations, for
different tasks or with different devices. A profile is described in terms of
essentiality and proficiency. Essentiality is used to control the quantity of
information that is transmitted and proficiency is used to control the format.
Various levels of essentiality are introduced into a document by the technique
known as microformatting. Proficiency (for the visually impaired) includes a
description of minimum acceptable font size, preferred font face and preferred
text and background colours. A key feature of the proficiency profile is the
accessibility component which captures the user's tolerance of accessibility
issues in a document, for example the presence of images or the markup of
tables. The document delivery tool works as a kind of filter to reduce the
content to the level of essentiality requested, to make the various
presentation changes and to warn of accessibility issues as specified in the
user's profile. Encouraging preliminary results have been obtained from testing
the prototype with subjects from the local RNIB college. Keywords: usability, visual impairment, web accessibility, web-documents | |||
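A rough sketch of the filtering step the abstract describes, assuming (hypothetically) that essentiality levels are microformatted as a CSS class such as "ess-2" and that the user profile carries a maximum level, a font preference and a set of accessibility warnings; the real markup and profiles used in the paper will differ.

```python
from lxml import html

PROFILE = {"max_essentiality": 1, "min_font_pt": 14, "warn_on": {"img", "table"}}

page = html.fromstring("""
<html><body>
  <p class="ess-1">Opening hours: 9 to 5, Monday to Friday.</p>
  <p class="ess-3">A longer historical note about the building.</p>
  <img src="map.png" alt="map"/>
</body></html>""")

# Drop any element marked with an essentiality level above the user's setting.
for el in page.xpath("//*[@class]"):
    cls = el.get("class", "")
    if cls.startswith("ess-") and int(cls.split("-")[1]) > PROFILE["max_essentiality"]:
        el.drop_tree()

# Warn about accessibility issues the profile flags, and enforce the font size.
warnings = [el.tag for el in page.iter() if el.tag in PROFILE["warn_on"]]
page.body.set("style", f"font-size: {PROFILE['min_font_pt']}pt")
print(html.tostring(page, pretty_print=True).decode(), "warnings:", warnings)
```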
| An OAI data provider for JEMS | | BIBAK | Full-Text | 218-220 | |
| Diego Fraga Contessa; José Palazzo Moreira de Oliveira | |||
| Effective scientific activity requires the open publication and diffusion of
scientific works and easy access to the research results. Digital libraries aim
to simplify the publication process by using the Open Archives Initiative
associated with the Dublin Core metadata system. The Brazilian Computer Society
-- SBC -- supports the JEMS system for managing the submission and evaluation
of conference and journal articles. JEMS collects considerable data about
papers, authors and conferences and SBC is seeking ways to provide wider access
to this information. This article describes an OAI-compatible data provider
that publishes metadata in both the Simple and Qualified Dublin Core formats. Keywords: JEMS, OAI, XML, data provider, digital libraries, metadata generation | |||