
Proceedings of the 2006 ACM Symposium on Document Engineering

Fullname: DocEng'06: Proceedings of the 6th ACM Symposium on Document Engineering
Editors: David F. Brailsford
Location: Amsterdam, The Netherlands
Dates: 2006-Oct-10 to 2006-Oct-13
Standard No: ISBN 1-59593-515-0
  1. Working session
  2. Keynote
  3. Document layout
  4. Poster presentations
  5. Document management
  6. XML-based document structure and analysis
  7. Keynote
  8. XSLT and beyond
  9. Document recognition and classification
  10. Text-based document models
  11. Documents with multiple markup
  12. Multimedia and hypermedia authoring
  13. Demonstrations
  14. Document editing for the web
  15. Novel web applications

Working session

XSLT working session BIBAKFull-Text 1
  John Lumley; Jeni Tennison
Document engineering has been transformed by the development and adoption of XML as a common form (a 'meta-language') for defining document representations of many types. Recently XSLT has matured significantly as a transformational technology for XML-based documents that could have a similar effect in the design of document processing tools. XSLT brings technologies from functional programming to bear on producing robust programs firmly embedded in the context of XML. We have new ideas and emphases on solution design, from very localised techniques, tricks and algorithms right up to large-scale program suites. XSLT's functional programming basis, its ability to mix push- and pull-driven effects, its lack of reassignable state, and its definition within XML all present challenges and opportunities to those both learning and exploiting its powers.
   Many of the attendees at DocEng'06 will have some experience of using XSLT in their research and practice. In this working session we aim to share interesting and useful insights and exploitations of XSLT, through an extended panel session and 'audience participation'. The session leaders (John Lumley, Jeni Tennison) have extensive experience of using XSLT within document engineering. The session starts with subjects suggested by the presenters and continues with a variety of topics raised by attendees.
Keywords: XSLT, functional programming

Keynote

Every page is different: a new document type for commercial printing BIBAKFull-Text 2
  Keith Moore
The Web has certainly demonstrated the power of personalization with offers based on interests and prior purchasing behavior. Most studies indicate a 10 to 1 improvement in sales conversion when some degree of personalization is used in printed marketing collateral. Until recently, the cost of providing personalization in high-quality printed collateral has been prohibitive. Delivering a combined campaign of web and print is a logistics nightmare requiring parallel workflows, parallel content management, and custom synchronization mechanisms.
   While digital offset presses (such as the HP Indigo) can address fulfillment of high-quality personalized pages, the upstream work processes including layout, typography, content merge, proofing and color are struggling to capture and represent these new types of jobs. In this talk, I'll describe the capabilities of the new digital offset presses, some of the market forces that are driving the new types of pages, and highlight the challenges being faced in creating these new workflows.
Keywords: augmented paper, campaign management, digital commercial printing, personalization, variable data printing

Document layout

Minimum sized text containment shapes BIBAKFull-Text 3-12
  Nathan Hurst; Kim Marriott; Peter Moulder
In many text-processing applications, we would like shapes that expand (or shrink) in size to fit their textual content. We address how to efficiently compute the minimum size for such text shapes. A variant of this problem is to take a fixed shape and determine the maximal size font that will still allow the content to fit into it. Our approach is to model the problem as a constrained optimisation problem with a single variable that controls the geometry of the text shape. We use a variant of secant search to determine the minimum area for the shape, guided by the area of the text. We represent the shape by regions that are composed of trapezoids whose coordinates are a linear function of the unknown variable. This allows us to use a novel linear time algorithm (based on computing Minkowski difference) that takes a trapezoid list and text height and determines the region in which a line of text of that height and some minimum width can start and still remain inside the shape.
Keywords: adaptive layout, constrained optimization, textbox
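The abstract above searches a single geometry variable for the smallest shape that still contains its text. The following is a minimal illustrative sketch, not the authors' algorithm: it replaces the trapezoid-list shape model with a plain rectangle of fixed aspect ratio and uses bisection rather than the paper's secant-search variant; all names and the feasibility model are assumptions for illustration.

```python
def fits(scale, text_len, line_h, aspect):
    """Toy feasibility test: can text of total length text_len (measured in
    width units at line height line_h) be typeset in a rectangle of width
    scale*aspect and height scale?  Monotone in scale."""
    lines = int(scale // line_h)          # whole text lines that fit vertically
    return lines * (scale * aspect) >= text_len

def min_scale(text_len, line_h, aspect, lo=0.0, hi=1000.0, tol=1e-6):
    """Bisection on the single geometry variable: the smallest scale whose
    rectangle still contains the text."""
    assert fits(hi, text_len, line_h, aspect), "upper bound must be feasible"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(mid, text_len, line_h, aspect):
            hi = mid                      # still fits: shrink from above
        else:
            lo = mid                      # too small: grow from below
    return hi
```

For 100 width-units of text at line height 1 in a square shape, the minimum scale is 10 (ten lines of width 10 each); the real algorithm performs the analogous search over trapezoid regions whose coordinates are linear in the unknown.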
Measuring aesthetic distance between document templates and instances BIBAKFull-Text 13-21
  Alexis Cabeda Faria; Joao B. S. de Oliveira
Adaptive documents undergo many transformations during their generation, including insertion and deletion of content. One major problem in this scenario is the preservation of the aesthetic qualities of the document during those transformations.
   As adaptive documents are instances of a template, the aesthetic quality of an instance with respect to the template could be evaluated by aesthetic measures providing scores to any desired quality parameters. These parameters measure the deviation of the instance from the desired template. This evaluation could assure the quality of instances during their generation and final output.
   This paper introduces the use of document templates to support aesthetic measures of document instances. A score is assigned to a document instance according to the differences detected from the original template. Considering the original template as an ideal result, the quality of a document instance will decrease according to the number and severity of the changes applied to produce it. So, documents that are below a given threshold can be sent for further (possibly human) review, and any others are accepted.
   The amount of change with respect to the template will reflect the document quality, and in such a model the quality of instances can be considered as a distance from that original.
Keywords: aesthetics, document, layout, measures
Text block geometric shape analysis BIBAKFull-Text 22-24
  Hui Chao
When a graphic artist designs a page, they envision a set of text blocks of arbitrary shapes constrained by the page size and by image and graphics blocks with wrap-around properties. We call this the intended shape. What is seen on an actual page depends on the particular text content and typographical constraints such as natural text line breaking and justification. We call this the apparent shape. Our goal is to create document templates by extracting the text blocks' intended shapes from the apparent shapes. The main difficulty is that when the line justification is jagged, the intended block shape is obfuscated. We solve this problem by analyzing the layout relations of all blocks on a page and applying an iterative process to find the maximum-likelihood intended shapes.
Keywords: document geometric layout analysis, page segmentation, template creation
Evaluating invariances in document layout functions BIBAKFull-Text 25-27
  Alexander J. Macdonald; David F. Brailsford; John Lumley
With the development of variable-data-driven digital presses, where each document printed is potentially unique, there is a need for pre-press optimization to identify material that is invariant from document to document. In this way rasterisation can be confined solely to those areas which change between successive documents, thereby alleviating a potential performance bottleneck.
   Given a template document specified in terms of layout functions, where actual data is bound at the last possible moment before printing, we look at deriving and exploiting the invariant properties of layout functions from their formal specifications. We propose future work on generic extraction of invariance from such properties for certain classes of layout functions.
Keywords: SVG, XML, XSLT, document layout, optimisation
Solving the simple continuous table layout problem BIBAKFull-Text 28-30
  Nathan Hurst; Kim Marriott; David Albrecht
Automatic table layout is required in web applications. Unfortunately, this is NP-hard for reasonable layout requirements such as minimizing table height for a given width. One approach is to solve a continuous relaxation of the layout problem in which each cell must be large enough to contain the area of its content. The solution to this relaxed problem can then guide the solution to the original problem. We give a simple and efficient algorithm for solving this continuous relaxation for the case that cells do not span multiple columns or rows. The algorithm is not only interesting in its own right but also because it provides insight into the geometry of table layout.
Keywords: automatic table layout, constrained optimization
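The continuous relaxation described above requires each cell to be large enough for the area of its content; a row is then as tall as its tallest cell. As a hedged illustration only (the paper gives an efficient algorithm; this sketch merely evaluates the relaxation and brute-forces the single free variable of a two-column table):

```python
def table_height(areas, col_widths):
    """Height of the relaxed table: cell (i, j) with content area areas[i][j]
    needs height areas[i][j] / col_widths[j]; each row takes the max over
    its cells, and the table height is the sum over rows."""
    return sum(max(a / w for a, w in zip(row, col_widths)) for row in areas)

def min_height_two_cols(areas, total_width, steps=10000):
    """Naive grid search over the one free column split (illustrative only)."""
    best = None
    for k in range(1, steps):
        w1 = total_width * k / steps
        h = table_height(areas, (w1, total_width - w1))
        if best is None or h < best[0]:
            best = (h, (w1, total_width - w1))
    return best
```

For a one-row table with cell areas 4 and 1 in total width 2, balancing 4/w1 = 1/(2-w1) gives w1 = 1.6 and height 2.5, which the grid search recovers; the insight the paper develops is the geometry that makes this solvable efficiently for many rows and columns.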

Poster presentations

COG Extractor BIBAKFull-Text 31
  Steven R. Bagley
The Component Object Graphic (COG) model describes documents as a series of distinct, encapsulated graphical blocks (termed COGs) that are positioned on the page rather than the traditional approach (taken by formats such as PostScript, PDF and SVG) of describing each page as one monolithic block of drawing operators that create marks on the page to form the content. Previous work [1, 2] has demonstrated how this paradigm can be implemented on top of both PDF and SVG.
   Tools have previously been created which allow the creation of COG documents [1, 3] and manipulation of existing COGs to form new documents [1, 4]. Missing from the COG toolkit has been a method of producing COGs from already existing content. The proposed solution is a new tool entitled 'COG Extractor' that allows the user to select an area of an existing document to be extracted and converted to be a COG in PDF or SVG form.
   The first process is to get the documents into a format that can easily be understood. The most sensible choice is PDF since there are several tools that can convert other formats into PDF.
   It is then necessary to parse the PDF content stream for the page to be extracted and to derive the meaning of the operators in the page's content stream. PDF's operators can be divided into two classes: those that image content, and those that define the state (such as fill colour or line width) with which content is imaged. This causes problems since each imaging operator depends on the cumulative effect of all previous state operators. It is therefore necessary to work out exactly which operators are responsible for drawing the content.
   This is achieved by combining consecutive state operators to form a state-change object, which encapsulates all changes in state at that point. Everything following the state-change object in the PDF content stream then becomes a child of that state-change object. This produces a tree representation of the document in which the drawing operators form the leaf nodes. The state of any drawing operator can then be calculated by walking back along its parents in the tree.
   From this tree it is then possible to calculate the bounding box of each drawing operator and see if it intersects with the area to be extracted. If they do not intersect, then that drawing operator can be pruned from the tree.
   An optimization phase can remove any state-change nodes that are no longer required (e.g. nodes for which all children have been removed). This tree can then be exported as a COG in PDF or SVG format for future use.
Keywords: COGs, FormXObject, PDF, SVG, graphic objects
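The state-change tree and pruning steps described above can be sketched in miniature. This is an assumption-laden toy, not the COG Extractor itself: operators are modelled as tuples rather than a real PDF content stream, and drawing operators carry a precomputed bounding box.

```python
def intersects(a, b):
    """Axis-aligned bounding-box overlap test, boxes as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def build_tree(ops):
    """Fold each run of consecutive state operators into one state-change
    node; everything that follows in the stream becomes its child, so
    effective state is recovered by walking back up the parents."""
    root = {"state": {}, "children": []}
    node = root
    i = 0
    while i < len(ops):
        if ops[i][0] == "state":
            state = {}
            while i < len(ops) and ops[i][0] == "state":
                state[ops[i][1]] = ops[i][2]
                i += 1
            child = {"state": state, "children": []}
            node["children"].append(child)
            node = child
        else:  # ("draw", bbox): an imaging operator leaf
            node["children"].append({"bbox": ops[i][1]})
            i += 1
    return root

def prune(node, region):
    """Drop drawing leaves outside the extraction region, then any
    state-change node left with no children (the optimization phase)."""
    kept = []
    for c in node["children"]:
        if "bbox" in c:
            if intersects(c["bbox"], region):
                kept.append(c)
        else:
            prune(c, region)
            if c["children"]:
                kept.append(c)
    node["children"] = kept
    return node
```

Pruning a stream with one in-region and one out-of-region drawing operator leaves a single state-change branch, mirroring the extraction-then-optimization pipeline in the poster.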
Carrot and stick: combining information retrieval models BIBKFull-Text 32
  Hans Friedrich Witschel
Keywords: language models, weighting schemes
Standards-based high-resolution document editing using low-resolution proxy images BIBAKFull-Text 33
  Michael Gormish; Edward L. Schwartz
This poster presents an implementation of the JPIP-JPM standard which allows low bandwidth access to remote high resolution documents. The implementation goes beyond the standard by allowing simple editing operations at the client without obtaining all of the compressed data. These edits form a new version of the document that can be easily uploaded over low bandwidth connections.
Keywords: JPEG2000, JPIP, JPM, partial image transmission

Document management

Print-n-link: weaving the paper web BIBAKFull-Text 34-43
  Moira C. Norrie; Beat Signer; Nadir Weibel
Citations form the basis for a web of scientific publications. Search engines, embedded hyperlinks and digital libraries all simplify the task of finding publications of interest on the web and navigating to cited publications or web sites. However, the actual reading of publications often takes place on paper and frequently on the move. We present Print-n-Link, a system that uses technologies for interactive paper to enhance the reading process by enabling users to access digital information and/or search for cited documents from a printed version of a publication, using a digital pen for interaction. A special virtual printer driver automatically generates links from paper to digital services during the printing process, based on an analysis of PDF documents. Depending on the user setting and interaction gesture, the system may retrieve metadata about the citation and inform the user through an audio channel, or directly display the cited document on the user's screen.
Keywords: citation management, digital library, document integration, interactive paper
Knowledge engineering from frontline support to preliminary design BIBAKFull-Text 44-52
  Sylvia C. Wong; Richard M. Crowder; Gary B. Wills; Nigel R. Shadbolt
The design and maintenance of complex engineering systems such as a jet engine generates a significant amount of documentation. Increasingly, aerospace manufacturers are shifting their focus from selling products to providing services. As a result, when designing new engines, engineers must increasingly consider the life-cycle requirements in addition to design parameters. To identify possible areas of concern, engineers must obtain knowledge gained from the entire life of an engine. However, because of the size and distributed nature of a company's operation, engineers often do not have access to front-line maintenance data. In addition, the large number of documents accrued makes thorough examination impossible. This paper presents a prototype knowledge-based document repository for such an application. It searches and analyzes distributed document resources, and provides engineers with a summary view of the underlying knowledge. The aim is to aid engineers in creating design requirement documents that incorporate aftermarket issues. Unlike existing document repositories and digital libraries, our approach is knowledge-based, where users browse summary reports instead of following suggested links. To test the validity of our proposed architecture, we have developed and deployed a working prototype. The prototype has been demonstrated to engineers and received positive reviews.
Keywords: intelligent documents, semantic web, service-oriented architecture
An XML interaction service for workflow applications BIBAKFull-Text 53-55
  Y. S. Kuo; Lendle Tseng; Hsun-Cheng Hu; N. C. Shih
Interactions with human users are a crucial part of many workflow applications. In workflow or business process management specifications, such as WSBPEL, all data are represented as XML. As a consequence, many human tasks simply require users to create or update XML data compliant with a schema. We propose the XML interaction service, a generic Web service for use by human tasks, which provides HTML form-based Web interfaces for users to interact with and update schema-compliant XML data. In addition, a visual interface design tool is provided for interface designers to create and customize the Web interfaces an XML interaction service supports.
Keywords: XML, user interface, workflow management
Engineering better voting systems BIBAKFull-Text 56-58
  Bertrand Haas
We consider here an election ballot as a document, a document that works as the carrier of a voter's choice in an election's accounting system for determining a winning candidate. And we consider a voting system as a way to manage both the document and its flow in compliance with the requirements of the election. Trustworthy elections are the core of a democratic spirit and engineering a voting system with requirements of convenience, privacy, integrity and reliability lies at the core of trustworthy elections. We define here some clear requirements for a trustworthy voting system and analyze the most popular classes of voting systems according to some of these requirements. We draw conclusions on the engineering of better voting systems and show two efforts in this direction.
Keywords: ballots, electronic, mail, security, voting
Preservation-centric and constraint-based migration of digital documents BIBAKFull-Text 59-61
  Thomas Triebsees; Uwe M. Borghoff
We introduce a framework that supports archivists in planning and running migrations. The central idea is that -- once relevant information pieces of digital documents are modeled -- desired migration results can be specified by means of preservation constraints. From these constraint specifications we are able to derive migration algorithms that provably respect a set of document properties before (pre-conditions) and after migration (post-conditions). Underlying is the concept of Abstract State Machines (ASM) modeling archival states. Migrations are modeled as sequences of basic operations that change the archive's state while respecting user-defined constraints. Among others, our target scenarios comprise legal and medical documents where considerable property changes cannot be tolerated and where constraint preservation must hold over a long period of time.
Keywords: digital archive, migration, model, preservation
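The pre-/post-condition discipline in the abstract above can be illustrated with a deliberately small sketch. It is not the paper's ASM-based framework: documents are plain dictionaries, constraints are predicates, and all names are invented for illustration.

```python
def migrate(doc, steps, pre, post):
    """Run a migration only if every named preservation constraint holds
    before (pre-conditions) and after (post-conditions); otherwise abort."""
    for name, check in pre.items():
        if not check(doc):
            raise ValueError(f"pre-condition failed: {name}")
    for step in steps:
        doc = step(doc)          # basic operations changing the archive state
    for name, check in post.items():
        if not check(doc):
            raise ValueError(f"post-condition failed: {name}")
    return doc
```

A format conversion that preserves the title passes; a step that drops the title trips the post-condition, which is the behaviour the derived migration algorithms are meant to guarantee by construction.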
Combining linguistic and structural descriptors for mining biomedical literature BIBAKFull-Text 62-64
  Nadia Zerida; Nadine Lucas; Bruno Crémilleux
This work proposes an original combination of linguistic and structural descriptors to represent the content of biomedical papers. The objective is to show the effectiveness of descriptors that take into account the structure of documents to characterise three kinds of biomedical texts (reviews, research and clinical papers). The description of the text is made at various levels, from the global level to the local one. These contexts make it possible to characterise the three classes. The characterisation of the textual resources is carried out quantitatively using the discriminating capacity of data-mining techniques based on emerging patterns.
Keywords: categorisation, characterisation, text mining preprocessing

XML-based document structure and analysis

Comparing XML path expressions BIBAKFull-Text 65-74
  Pierre Genevès; Nabil Layaïda
XPath is the standard declarative language for navigating XML data and returning a set of matching nodes. In the context of XSLT/XQuery analysis, query optimization, and XML type checking, XPath decision problems arise naturally. They notably include XPath comparisons such as equivalence (whether two queries always return the same result), and containment (whether for any tree the result of a particular query is included in the result of a second one).
   XPath decision problems have attracted a lot of research attention, especially for studying the computational complexity of various XPath fragments. However, what is missing at present is the constructive use of an expressive logic which would allow capturing these decision problems, while providing practically effective decision procedures.
   In this paper, we propose a logic-based framework for the static analysis of XPath. Specifically, we propose the alternation free modal μ-calculus with converse as the appropriate logic for effectively solving XPath decision problems. We present a translation of a large XPath fragment into μ-calculus, together with practical experiments on the containment using a state-of-the-art EXPTIME decision procedure for μ-calculus satisfiability. These preliminary experiments shed light, for the first time, on the cost of checking the containment in practice. We believe they reveal encouraging results for further static analysis of XML transformations.
Keywords: XPath, analysis, experimentation
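Containment, as defined above, quantifies over all trees, so it cannot be decided by testing; that is exactly why the paper reaches for μ-calculus satisfiability. Still, a sample-based check can falsify containment with a witness document. The sketch below is an assumption: it uses the limited XPath subset of Python's ElementTree, and a True result means only "not refuted", never a proof.

```python
import xml.etree.ElementTree as ET

def maybe_contained(q1, q2, sample_docs):
    """Empirical containment check for ElementTree-style path expressions:
    returns (False, witness) if some sample document refutes q1 <= q2,
    else (True, None), which is merely 'not refuted by these samples'."""
    for xml in sample_docs:
        root = ET.fromstring(xml)
        r1 = set(root.iterfind(q1))
        r2 = set(root.iterfind(q2))
        if not r1 <= r2:
            return False, xml
    return True, None
```

On the sample `<a><b/><c><b/></c></a>`, the query `.//b` is (unsurprisingly) not refuted as contained in `.//*`, while the converse is refuted immediately; a decision procedure replaces this sampling with an exhaustive logical argument.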
Fast and simple XML tree differencing by sequence alignment BIBAKFull-Text 75-84
  Tancred Lindholm; Jaakko Kangasharju; Sasu Tarkoma
With the advent of XML we have seen a renewed interest in methods for computing the difference between trees. Methods that include heuristic elements play an important role in practical applications due to the inherent complexity of the problem. We present a method for differencing XML as ordered trees based on mapping the problem to the domain of sequence alignment, applying simple and efficient heuristics in this domain, and transforming back to the tree domain. Our approach provides a method to quickly compute changes that are meaningful transformations on the XML tree level, and includes subtree move as a primitive operation. We evaluate the feasibility of our approach and benchmark it against a selection of existing differencing tools. The results show our approach to be feasible and to have the potential to perform on par with tools of a more complex design in terms of both output size and execution time.
Keywords: XML, differencing, move, ordered tree, sequence alignment
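The core move of the paper above, mapping ordered-tree differencing to sequence alignment, can be sketched with standard-library pieces. This is a hedged toy, not the authors' tool: it flattens trees to open/close/text token sequences and aligns them with `difflib`; the hard part the paper addresses (mapping alignments back to meaningful tree edits, including subtree moves) is left out.

```python
import difflib
import xml.etree.ElementTree as ET

def to_sequence(elem):
    """Flatten an ordered tree into open/close/text tokens so that a
    plain sequence aligner can operate on it."""
    seq = [("open", elem.tag)]
    if elem.text and elem.text.strip():
        seq.append(("text", elem.text.strip()))
    for child in elem:
        seq.extend(to_sequence(child))
    seq.append(("close", elem.tag))
    return seq

def xml_diff(xml_a, xml_b):
    """Align the two token sequences and keep the non-equal opcodes;
    each is (op, removed_tokens, inserted_tokens)."""
    a = to_sequence(ET.fromstring(xml_a))
    b = to_sequence(ET.fromstring(xml_b))
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]
```

Replacing `<b/>` by `<c/>` inside a shared parent yields a single replace opcode over the open/close pair, which a tree-level back-transformation would report as one element rename.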
Filtering XML documents using XPath expressions and aspect-oriented programming BIBAKFull-Text 85-87
  Ermir Qeli; Bernd Freisleben
In this paper, we present the design and implementation of a filtering approach for XML documents which is based on XPath expressions and Aspect-Oriented Programming (AOP). The class of XPath expressions used allows for branching, wildcards and descendant relationships between nodes. For the embedding of simple paths into XPath expressions, a dynamic programming approach is proposed. The AOP paradigm, which provides a means for encapsulating crosscutting concerns in software, is introduced to integrate the filtering approach in the broader context of event-based parsing of XML documents using SAX.
Keywords: SAX, XML, XPath, aspect-oriented programming
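Event-based filtering of the kind described above can be illustrated with a plain SAX handler. This sketch is an assumption-heavy miniature, not the paper's system: it supports only two pattern forms, an exact absolute path like `/a/c/b` and a single-descendant form like `//b`, and it omits branching, wildcards, the dynamic-programming embedding, and the AOP weaving entirely.

```python
import xml.sax

class PathFilter(xml.sax.ContentHandler):
    """Stream the document with SAX, tracking the element path on a stack,
    and collect the paths of elements matching one simple pattern."""
    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.stack = []
        self.matches = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        path = "/" + "/".join(self.stack)
        if self.pattern.startswith("//"):
            hit = name == self.pattern[2:]      # match at any depth
        else:
            hit = path == self.pattern          # exact absolute path
        if hit:
            self.matches.append(path)

    def endElement(self, name):
        self.stack.pop()

def filter_paths(xml_text, pattern):
    handler = PathFilter(pattern)
    xml.sax.parseString(xml_text.encode(), handler)
    return handler.matches
```

Because matching happens per event, the document is never built in memory, which is the attraction of combining XPath-style filtering with SAX in the first place.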
Customizable detection of changes for XML documents using XPath expressions BIBAKFull-Text 88-90
  Ermir Qeli; Julinda Gllavata; Bernd Freisleben
Change detection in XML documents is an important task in the context of query systems. In this paper, we present CustX-Diff, a customizable change detection approach for XML documents based on X-Diff [6]. CustX-Diff performs the change detection operation simultaneously with the XPath-based filtering of XML document parts. The class of XPath expressions used is the tree patterns subset of XPath. For the embedding of simple paths into XPath expressions during the difference operation, a dynamic programming approach is proposed. Comparative performance results with respect to the original X-Diff [6] approach demonstrate the efficiency of the proposed method.
Keywords: XML, XPath, change detection

Keynote

Processing XML documents with pipelines BIBAKFull-Text 91
  Jeni Tennison
We live in a world where documents do not simply exist in the form we need them. Getting from stored form to final presentation is often a complex process. Documents have to be generated, validated and transformed, queried, split, merged, filtered, annotated, restructured, translated and rendered. We need it to be easy to define the set of processes that the document has to go through. And we need it to be fast, at run time, to do this processing.
   The XML Processing Model Working Group is developing a markup language for defining pipelines of processes that XML documents may go through. In this talk I shall describe how pipelines can be put together using the XProc language and why even simple processing can benefit from being organised in that way.
Keywords: XML, pipeline, processing, transformation
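The generate/validate/transform chains in the talk abstract above are what XProc expresses declaratively. As a hedged illustration only (XProc pipelines are XML documents, not Python; the stage names below are invented), the same composition idea looks like this:

```python
import xml.etree.ElementTree as ET

def pipeline(*stages):
    """Compose document-processing stages: each stage takes an ElementTree
    element and returns one, like steps in an XProc pipeline."""
    def run(doc):
        for stage in stages:
            doc = stage(doc)
        return doc
    return run

def require_root(tag):
    """A validation stage: fail the pipeline on an unexpected root element."""
    def stage(doc):
        if doc.tag != tag:
            raise ValueError(f"expected root <{tag}>, got <{doc.tag}>")
        return doc
    return stage

def drop(tag):
    """A filtering stage: remove every element with the given tag."""
    def stage(doc):
        for parent in doc.iter():
            for child in list(parent):
                if child.tag == tag:
                    parent.remove(child)
        return doc
    return stage
```

Running `pipeline(require_root("doc"), drop("draft"))` over `<doc><p/><draft/></doc>` validates then filters in one pass over the definition, which is the "easy to define, fast to run" combination the talk argues for.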

XSLT and beyond

Modeling context information for capture and access applications BIBAKFull-Text 92-94
  Maria G. Pimentel; Laercio Baldochi; Ethan V. Munson
The Contextractor is an XSLT-based transformation system that gathers information from extended UML models to produce XML Schemas that model the information captured by an application and define a query language that allows the submission of queries over the captured content.
Keywords: UML models, XMI, XML schema, XSLT, capture and access applications, computing, ubiquitous
Resolving layout interdependency with presentational variables BIBAKFull-Text 95-97
  John Lumley; Roger Gimson; Owen Rees
In the construction of variable data documents, the layout of component parts to build a composite section with heterogeneous layout functions can be implemented by a tree-evaluating layout processor. This handles many cases with well-scoped structure very smoothly but becomes complex when layout relationships between components cut across a strict tree. We present an approach for XML-described layouts based on a post-rendering set of single-assignment variables, analogous to XSLT, that can make this much easier, does not compromise layout extensibility and can be a target for automated interdependency analysis and generation. This is the approach used in the layout processor associated with the Document Description Framework (DDF).
Keywords: SVG, XSLT, document construction, functional programming
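The single-assignment presentational variables described above can be mimicked with a tiny resolver. This is an illustrative assumption, not DDF's layout processor: variable definitions are numbers or functions of an environment, and unresolved dependencies are retried until a fixed point (a real processor would analyse the interdependencies and order or reject them up front).

```python
def resolve(defs):
    """Resolve single-assignment layout variables: each definition is a
    constant or a function of the environment of already-resolved values.
    Iterate until everything resolves; no progress means a cycle."""
    env = {}
    pending = dict(defs)
    while pending:
        progressed = False
        for name, expr in list(pending.items()):
            try:
                env[name] = expr(env) if callable(expr) else expr
            except KeyError:
                continue          # depends on a not-yet-resolved variable
            del pending[name]
            progressed = True
        if not progressed:
            raise ValueError("cyclic layout dependency: " + ", ".join(pending))
    return env
```

A column width derived from the page width, and an image position derived from the column, resolve in dependency order even though they cut across any single layout tree; a genuine cycle is detected rather than looping forever.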

Document recognition and classification

Meta-algorithmic systems for document classification BIBAKFull-Text 98-106
  Steven J. Simske; David W. Wright; Margaret Sturgill
To address cost and regulatory concerns, many businesses are converting paper-based elements of their workflows into fully electronic flows that use the content of the documents. Scanning the document contents into workflows, however, is a manual, error-prone, and costly process, especially when the data extraction process requires high accuracy. These manual costs are a primary barrier to widespread adoption of distributed capture solutions for business critical workflows such as insurance claims, medical records, or loan applications. Software solutions using artificial intelligence and natural language processing techniques are emerging to address these needs, but each has its individual strengths and weaknesses, and none has demonstrated a high level of accuracy across the many unstructured document types included in these business critical workflows. This paper describes how to overcome many of these limitations by intelligently combining multiple approaches for document classification using meta-algorithmic design patterns. These patterns explore the error space in multiple engines, and provide improved and "emergent" results in comparison to voting schemes and to the output of any of the individual engines. This paper considers the results of the individual engines along with traditional combinatorial techniques such as voting, before describing prototype results for a variety of novel meta-algorithmic patterns that reduce individual document error rates by up to 13% and reduce system error rates by up to 38%.
Keywords: confusion matrix, document classification, document indexing, engine combination, meta-algorithmics
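As context for the abstract above, the simplest engine-combination baseline it improves upon is a weighted vote. This sketch shows only that baseline, with invented engine names and weights; the paper's meta-algorithmic patterns go further, using per-engine confusion matrices rather than a single scalar weight.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine several classification engines: each engine's predicted
    label scores its (e.g. validation-set accuracy) weight; the label with
    the highest total wins, ties broken by first occurrence."""
    scores = defaultdict(float)
    for engine, label in predictions.items():
        scores[label] += weights.get(engine, 1.0)
    return max(scores, key=scores.get)
```

With equal weights, two weak engines outvote one strong one; weighting by measured accuracy lets the strong engine prevail, and confusion-matrix patterns refine this further by weighting per class rather than per engine.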
Content based SMS spam filtering BIBAKFull-Text 107-114
  José María Gómez Hidalgo; Guillermo Cajigas Bringas; Enrique Puertas Sánz; Francisco Carrero García
In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly emerging as an important problem, especially spam on Instant Messaging services (so-called SPIM), and Short Message Service (SMS) or mobile spam.
   Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent the Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of message representation techniques and machine learning algorithms in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.
Keywords: Bayesian filter, junk, receiver operating characteristic, spam
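A minimal version of the Bayesian filtering the paper transfers from email to SMS can be written in a few lines. This is a generic Laplace-smoothed multinomial naive Bayes sketch, not the authors' filter: their work compares many message representations and learners, while this assumes whitespace-tokenized bags of words.

```python
import math
from collections import Counter

def train(msgs):
    """msgs: iterable of (text, is_spam). Returns per-class word counts
    and per-class message counts for a multinomial naive Bayes model."""
    counts = {True: Counter(), False: Counter()}
    n = {True: 0, False: 0}
    for text, spam in msgs:
        counts[spam].update(text.lower().split())
        n[spam] += 1
    return counts, n

def is_spam(text, counts, n):
    """Classify by log-probability with add-one (Laplace) smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    score = {}
    for cls in (True, False):
        total = sum(counts[cls].values())
        score[cls] = math.log(n[cls] / (n[True] + n[False]))  # class prior
        for w in text.lower().split():
            score[cls] += math.log((counts[cls][w] + 1) / (total + len(vocab)))
    return score[True] > score[False]
```

Trained on a handful of toy messages, the filter separates "win a free prize" from "lunch meeting tomorrow"; the paper's contribution is showing that this family of techniques, tuned for email, still works on short SMS text in two languages.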
Application of syntactic properties to three-level recognition of Polish hand-written medical texts BIBAKFull-Text 115-121
  Grzegorz Godlewski; Maciej Piasecki; Jerzy Sas
In this paper, three-level hand-writing recognition using syntactic properties of the language at the upper level is presented. Isolated characters are recognized at the lowest level. The character classification from the lowest level is used in word recognition. Words are recognized using a combined classifier based on a possibly incomplete unigram lexicon. The word classifier builds a ranking of the most likely words. The rankings created for subsequent words are input to the syntactic classifier, which recognizes whole sentences. Here, local syntactic constraints are used to build a syntactically consistent sentence. The method has been applied to the recognition of hand-written medical texts describing fixed aspects of patient treatment. Due to the narrow range of topics covered in the texts and the style peculiarities characteristic of physicians' writing, the syntax of the expected sentences is relatively simple, which makes checking syntactic consistency easier.
Keywords: HMM, OCR, PoS tagging, hand-written text, Polish
Quality enhancement in information extraction from scanned documents BIBAKFull-Text 122-124
  Atsuhiro Takasu; Kenro Aihara
When constructing a large document archive, an important element is the digitizing of printed documents. Although various techniques for document image analysis such as Optical Character Recognition (OCR) have been developed, error handling is required in constructing real document archive systems. This paper discusses the problem from the quality enhancement perspective and proposes a robust reference extraction method for academic articles scanned with OCR mark-up. We applied the proposed method to articles appearing in various journals, and these experiments showed that the proposed method achieved a recognition accuracy of more than 94%. This paper also discusses manual correction and investigates experimentally the relationship between extraction accuracy and cost reduction.
Keywords: digital libraries, document capturing, statistical model
Document annotation by active learning techniques BIBAKFull-Text 125-127
  Loïc Lecerf; Boris Chidlovskii
We present a system for the semantic annotation of layout-oriented documents, with an integrated learning component. We introduce probabilistic learning methods on tree-like documents and we present different active learning techniques for training document annotation models. We report some preliminary results of deploying such active learning techniques on an important case of document collection annotation.
Keywords: active learning, maximum entropy, semantic annotation

Text-based document models

NEWPAR: an automatic feature selection and weighting schema for category ranking BIBAKFull-Text 128-137
  Fernando Ruiz-Rico; Jose Luis Vicedo; María-Consuelo Rubio-Sánchez
Category ranking provides a way to classify plain text documents into a pre-determined set of categories. This work looks at typical document collections and analyzes which measures and peculiarities can help us represent documents so that the resulting features are as discriminative and representative as possible. Considerations such as selecting only nouns and adjectives, taking expressions rather than words, and using measures like term length are combined into a simple feature selection and weighting method to extract, select and weight special n-grams. Several experiments are performed to demonstrate the usefulness of the new schema with different data sets (Reuters and OHSUMED) and two different algorithms (SVM and a simple sum of weights). After evaluation, the new approach outperforms some of the best known and most widely used categorization methods.
Keywords: SVM, category ranking, machine learning, text categorization, text classification
Extending the single words-based document model: a comparison of bigrams and 2-itemsets BIBAKFull-Text 138-146
  Roman Tesar; Vaclav Strnad; Karel Jezek; Massimo Poesio
The basic approach in text categorization is to represent documents by single words. However, other features are often utilized to achieve better classification results. In this paper, our attention is focused on bigrams and 2-itemsets. We compare the improvement in classification accuracy when these features are used to extend the single-words-based document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for discovering bigrams and 2-itemsets are presented as well. Our results show a statistically significant improvement when bigrams and also 2-itemsets are incorporated. However, in the case of 2-itemsets it is important to use an appropriate feature selection method, whereas classification accuracy improves even when only a simple feature selection approach is applied to discover bigrams. The conclusion is that, in our case, extending the document representation with 2-itemsets is not very effective, because bigrams achieve better results and are less expensive to discover.
Keywords: bigrams, comparison, document model, feature selection, itemsets, machine learning, n-grams, text categorization
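The distinction at the heart of this comparison can be sketched in a few lines: a bigram is an ordered pair of adjacent words, while a 2-itemset is an unordered pair of words co-occurring anywhere in the document. This is an illustrative sketch only, not the paper's discovery algorithms.

```python
# Bigrams vs. 2-itemsets over a tokenized document (illustrative).
from itertools import combinations

def bigrams(tokens):
    """Ordered pairs of adjacent tokens."""
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def two_itemsets(tokens):
    """Unordered pairs of distinct tokens co-occurring in the document."""
    return {frozenset(p) for p in combinations(sorted(set(tokens)), 2)}

doc = "the cat sat on the mat".split()
# "cat" and "mat" are never adjacent, so {cat, mat} is a 2-itemset
# but neither ("cat", "mat") nor ("mat", "cat") is a bigram.
```

The quadratic growth of `two_itemsets` with vocabulary size also illustrates why the paper finds itemset discovery more resource-consuming than bigram discovery.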

Documents with multiple markup

Describing and querying hierarchical XML structures defined over the same textual data BIBAKFull-Text 147-154
  Emmanuel Bruno; Elisabeth Murisasco
Our work aims at representing and querying hierarchical XML structures defined over the same textual data. We call such data "multistructured textual documents".
   Our objectives are twofold. First, we define a suitable, XML-compatible data model that makes it possible (1) to describe several independent hierarchical structures over the same textual data (represented by several XML structured documents) and (2) to take into account user annotations added to each structured document. Our proposal is based on hedges (the foundation of the RelaxNG grammar language). Second, we propose an extension of XQuery in order to query structures and content concurrently. We apply our proposals to a literary text written in Old French.
Keywords: XML, XQuery, multistructure, textual documents, tree-like structure
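The idea of several independent structures over one piece of text can be sketched with stand-off offsets: each layer annotates (start, end) spans of the same character data, and a query may cut across layers. This is a minimal illustration with hypothetical names; the paper's actual model is hedge-based and queried through an XQuery extension.

```python
# Two independent annotation hierarchies over the same textual data,
# represented as (tag, start, end) character spans (illustrative only).

text = "Ceste chanson est vielle"  # the shared textual data

# Layer 1: a linguistic structure; layer 2: an editorial structure.
linguistic = [("word", 0, 5), ("word", 6, 13), ("word", 14, 17), ("word", 18, 24)]
editorial = [("line", 0, 24), ("emph", 6, 13)]

def covering(layer, pos):
    """All annotations in a layer covering character position pos --
    the kind of cross-structure question a multistructure query asks."""
    return [tag for tag, start, end in layer if start <= pos < end]

# Position 7 sits inside the second word, and also inside both the
# line and the emphasized span of the editorial layer.
```

Neither layer owns the text; both reference it, which is what lets independent hierarchies coexist without one being forced inside the other.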
Describing multistructured XML documents by means of delay nodes BIBAKFull-Text 155-164
  Jacques Le Maitre
Multistructured documents are documents whose structure is composed of a set of concurrent hierarchical structures. In this paper, we propose a new model of multistructured documents and we show how to translate its instances into XML using a new kind of node: delay nodes, which we propose to add to the XDM model on which XPath and XQuery are based. A delay node is the virtual representation, by an XQuery query, of some of the children of its parent. The interest of delay nodes for managing multistructured documents is that they allow several nodes to virtually share their children. In this way, it is possible to query, with XPath or XQuery, multistructured documents described in XML as if their different structures were truly concurrent. Finally, we compare our model with the GODDAG-based model and the multicolored trees (MCT) model.
Keywords: XML, XQuery, lazy evaluation, multistructured documents

Multimedia and hypermedia authoring

Live editing of hypermedia documents BIBAKFull-Text 165-172
  Romualdo Monteiro de Resende Costa; Márcio Ferreira Moreno; Rogério Ferreira Rodrigues; Luiz Fernando Gomes Soares
In some hypermedia system applications, such as interactive digital TV applications, documents may have to be authored and presented concomitantly. This is the case for live programs, where not only is some content not known a priori, but some temporal and spatial relationships among program media objects may also have to be established after the unknown content is defined. This paper proposes a method for live editing of hypermedia documents that preserves not only the presentation semantics but also the logical structure semantics defined by an author. To validate the proposal, an implementation has been done for the Brazilian Digital TV System, which is also presented.
Keywords: NCL, SBTVD, declarative middleware, ginga, interactive digital TV
The LimSee3 multimedia authoring model BIBAKFull-Text 173-175
  Romain Deltour; Cécile Roisin
For most users, authoring multimedia documents remains a complex task. One solution to this problem is to provide template-based authoring tools, but at the cost of limited functionality. In this paper we propose a document model dedicated to the creation of template-based authoring tools that retain rich composition capabilities. It is based on a component-oriented approach that homogeneously integrates logical, temporal and spatial structures. Templates are defined as constraints on these structures.
Keywords: document authoring, document models, multimedia documents, template-based editing
Benefits of structured multimedia documents in IDTV: the end-user enrichment system BIBAKFull-Text 176-178
  Pablo Cesar; D. C. A. Bulterman; A. J. Jansen
This paper presents a system that exploits the benefits of modelling multimedia presentations as structured documents within the context of interactive digital television systems. Our work permits end-users to easily enrich multimedia content at viewing time (e.g., add images and delete scenes). Because the document is structured, the system can expose to the user the possible enrichment alternatives depending on the current state of the presentation (e.g., current story). Moreover, because the base content is wrapped as a structured document, the enrichments can be modelled as overlying layers that do not alter the original content. Finally, the user can share the enriched content (or parts of it) with specific peers within a P2P network.
Keywords: SMIL, content enrichment, structured multimedia documents
From video to photo albums: digital publishing workflow for automatic album creation BIBAKFull-Text 179-181
  Parag Mulendra Joshi; C. Brian Atkins; Tong Zhang
The revolution in consumer electronics for capturing video has been followed by an explosion of video content. However, meaningful consumption models of such rich media for non-professional users are still emerging. In contrast, the consumption models for the output of still cameras have long been established and are considerably simpler. The output of a still camera is an image of sufficiently high quality and resolution for good reproduction on paper, and thanks to their ease of use, mobility, quality and simplicity, paper photographs remain unmatched in terms of overall human experience. Video content, on the other hand, is not as easy to use: its consumption requires computers and/or video display devices, so it cannot be instantly displayed or shared, and rendition on paper is much more complex than for still camera images. Video camera output has to be edited on a computer, and key frames with good visual quality have to be manually extracted, digitally edited and prepared for printing before usable, good-quality photographs are obtained. Because of this complexity, users often prefer to take still pictures instead of recording video clips. In this paper we describe an approach to constructing an end-to-end digital publishing workflow system that automatically composes visually appealing photo albums with high-quality photographic images from video content input.
Keywords: WSBPEL, digital publishing, multimedia, photo album, video, web services, web-to-print, workflow

Demonstrations

SmartPublisher: document creation on pen-based systems via document element reuse BIBAKFull-Text 182-183
  Fabrice Matulic
SmartPublisher is a powerful, all-in-one application for pen-based devices with which users can quickly and intuitively create new documents by reusing individual image and text elements acquired from analogue and/or digital documents. The application is especially targeted at scanning devices with touch screen operating panels or tablet PCs connected to them (e.g. modern multifunction printers with large touch screen displays), as one of its main purposes is reuse of material obtained from scanned paper documents.
Keywords: GUI, document creation and editing, reuse, scanned document
ALDAI: active learning documents annotation interface BIBAFull-Text 184-185
  Boris Chidlovskii; Jérôme Fuselier; Loïc Lecerf
In the framework of the LegDoC project at XRCE, we present a document annotation interface with an integrated active learning component. The interface is designed for automating the annotation of layout-oriented documents with semantic labels. We describe the core functionalities of the interface, paying particular attention to the learning component, including feature management and different strategies for deploying the active learner in document annotation.
The ambulant annotator: empowering viewer-side enrichment of multimedia content BIBAKFull-Text 186-187
  Pablo Cesar; Dick C. A. Bulterman; A. J. Jansen
This paper presents a set of demos that allow viewer-side enrichment of multimedia content in a home setting. The most relevant features of our system are the following: passive authoring of content, as opposed to traditional active PC-based authoring; preservation of the base content; and collaborative authoring (e.g., sharing the enriched material with a peer group). These requirements are met by modelling television content as structured multimedia documents using SMIL 2.1.
Keywords: SMIL, content enrichment, multimedia documents, structured

Document editing for the web

Templates, microformats and structured editing BIBAKFull-Text 188-197
  Francesc Campoy Flores; Vincent Quint; Irène Vatton
Microformats and semantic XHTML add semantics to web pages while taking advantage of the existing (X)HTML infrastructure. This approach enables new applications that can be deployed smoothly on the web. But there is currently no way to rigorously describe this type of markup, and authors of web pages have very little help for creating and encoding semantic markup. A language that addresses these issues is presented in this paper. Its role is to specify semantically rich XML languages in terms of other XML languages, such as XHTML. The language is versatile enough to represent templates that can capture the overall structure of large documents as well as the fine details of a microformat. It is supported by an editing tool for producing documents encoded in a semantically rich markup language, still fully compatible with XHTML.
Keywords: document authoring, document models, document templates, microformats, semantic XHTML, structure editing, world wide web

Novel web applications

The portrait of a common HTML web page BIBAKFull-Text 198-204
  Ryan Levering; Michal Cutler
Web pages are not purely text, nor are they solely HTML. This paper surveys HTML web pages, focusing not only on textual content but also on higher-order visual features and supplementary technology. Using a crawler with an in-house rendering engine, we collect data on a pseudo-random sample of web pages. First, several basic attributes are collected to verify the collection process and confirm certain assumptions about web page text. Next, we examine the distribution of different types of page content (text, images, plug-in objects, and forms) in terms of rendered visual area. These content types are then broken down into a detailed view of the ways in which they are used, including the prevalence and usage of scripts and styles. We conclude that complex page elements play a significant and underestimated role in the visually attractive, media-rich, and highly interactive web pages currently being added to the World Wide Web.
Keywords: CSS, HTML, feature, JavaScript, script, style, survey, visual, world wide web
Mash-o-matic BIBAKFull-Text 205-214
  Sudarshan Murthy; David Maier; Lois Delcambre
Web applications called mash-ups combine information of varying granularity from different, possibly disparate, sources. We describe Mash-o-matic, a utility that can extract, clean, and combine disparate information fragments, and automatically generate data for mash-ups and the mash-ups themselves. As an illustration, we generate a mash-up that displays a map of a university campus, and outline the potential benefits of using Mash-o-matic. Mash-o-matic exploits superimposed information (SI), which is new information and structure created in reference to fragments of existing information. Mash-o-matic is implemented using middleware called the Superimposed Pluggable Architecture for Contexts and Excerpts (SPARCE), and a query processor for SI and referenced information, both parts of our infrastructure to support SI management. We present a high-level description of the mash-up production process and discuss in detail how Mash-o-matic accelerates that process.
Keywords: SPARCE, bi-level information, document transformation, mash-up, sidepad, superimposed information
Profile-based web document delivery BIBAKFull-Text 215-217
  Roger Stone; Jatinder Dhiensa; Colin Machin
This work originated from considering the needs of visually impaired users but may have wider application. A profile captures key descriptors or preferences of a user and their browsing device. Individual users may maintain any number of profiles, which they can edit for use in different situations, for different tasks or with different devices. A profile is described in terms of essentiality and proficiency: essentiality controls the quantity of information that is transmitted, and proficiency controls the format. Various levels of essentiality are introduced into a document by the technique known as microformatting. Proficiency (for the visually impaired) includes a description of minimum acceptable font size, preferred font face and preferred text and background colours. A key feature of the proficiency profile is the accessibility component, which captures the user's tolerance of accessibility issues in a document, for example the presence of images or the markup of tables. The document delivery tool works as a kind of filter to reduce the content to the level of essentiality requested, to make the various presentation changes and to warn of accessibility issues as specified in the user's profile. Encouraging preliminary results have been obtained from testing the prototype with subjects from the local RNIB college.
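The essentiality mechanism described above amounts to a filter over levelled document fragments. The following is a minimal sketch under assumed names (the paper's actual filter operates on microformatted markup, not tuples):

```python
# Sketch of essentiality-based content reduction (hypothetical names).
# Each fragment carries an essentiality level; level 1 is most essential.

def filter_by_essentiality(fragments, profile_level):
    """Keep only fragments whose essentiality level meets the level
    requested in the user's profile."""
    return [body for level, body in fragments if level <= profile_level]

page = [(1, "headline"), (2, "summary"), (3, "sidebar")]
# A profile requesting only the essentials receives just the headline;
# a more permissive profile also receives the summary.
```

Proficiency would then be applied as a second pass over the surviving fragments, rewriting presentation (font size, colours) rather than dropping content.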
Keywords: usability, visual impairment, web accessibility, web-documents
An OAI data provider for JEMS BIBAKFull-Text 218-220
  Diego Fraga Contessa; José Palazzo Moreira de Oliveira
Effective scientific activity requires the open publication and diffusion of scientific works and easy access to the research results. Digital libraries aim to simplify the publication process by using the Open Archives Initiative associated with the Dublin Core metadata system. The Brazilian Computer Society -- SBC -- supports the JEMS system for managing the submission and evaluation of conference and journal articles. JEMS collects considerable data about papers, authors and conferences and SBC is seeking ways to provide wider access to this information. This article describes an OAI-compatible data provider that publishes metadata in both the Simple and Qualified Dublin Core formats.
Keywords: JEMS, OAI, XML, data provider, digital libraries, metadata generation