Proceedings of the 2008 ACM Symposium on Document Engineering

Fullname:DocEng'08 Proceedings of the 8th ACM Symposium on Document Engineering
Editors:Dick C. A. Bulterman; Luiz Fernando G. Soares; Maria da Graça C. Pimentel
Location:São Paulo, Brazil
Dates:2008-Sep-16 to 2008-Sep-19
Standard No:ISBN: 1-60558-081-3, 978-1-60558-081-4
  1. Keynote
  2. Scalable documents
  3. Structured documents
  4. Variable documents
  5. Demo session A
  6. Finding, mashing and mixing
  7. Document/image layout
  8. Modelling documents
  9. Information extraction in documents
  10. Demo session B
  11. Generation and printing
  12. Content processing
  13. Closing keynote
  14. Recognizing characters
  15. Modeling, editing, adaptation


Aggregate documents: making sense of a patchwork of topical documents (pp. 3-7)
  Michael Shilman
This working session will be an interactive discussion about multimedia content transformation. The basic assumption is that content transformation activities should be provided as non-destructive operations. The final goal of the panel is to gather researchers within the community interested in manipulating multimedia content for providing rich user experiences. The organizers of the panel will moderate and shape the discussion; nevertheless, position papers from the participants are expected.
Keywords: content adaptation, content transformation, multimedia content, structured multimedia

Scalable documents

Interactive office documents: a new face for web 2.0 applications (pp. 8-17)
  John M. Boyer
As the world wide web transforms from a vehicle of information dissemination and e-commerce transactions into a writable nexus of human collaboration, the Web 2.0 technologies at the forefront of the transformation may be seen as special cases of a more general shift in the conceptual application model of the web. This paper recognizes the conceptual transition and explores the connections to a new class of interactive office documents that become possible by tighter integration of the Open Document Format with the W3C's next generation web forms technology (XForms). The connections transcend simple provisioning of office document editing and persistence capabilities on the web. Rather, the advantages of office documents as self-contained entities that flow through a collaborative network or business process are combined with web application qualities such as intelligent behavioral interaction, in-process web service access, and control of server submission content. An office document mashup called 'Dual Forms' is presented to demonstrate the feasibility of office document centric web applications.
Keywords: ODF, SOA, XForms, XML signature, business process, office document, user interaction, web service
Enabling adaptive time-based web applications with SMIL state (pp. 18-27)
  Jack Jansen; Dick C. A. Bulterman
In this paper we examine adaptive time-based web applications (or presentations). These are interactive presentations where time dictates the major structure, and that require interactivity and other dynamic adaptation. We investigate the current technologies available to create such presentations and their shortcomings, and suggest a mechanism for addressing these shortcomings. This mechanism, SMIL State, can be used to add user-defined state to declarative time-based languages such as SMIL or SVG animation, thereby enabling the author to create control flows that are difficult to realize within the temporal containment model of the host languages. In addition, SMIL State can be used as a bridging mechanism between languages, enabling easy integration of external components into the web application.
Keywords: SMIL, declarative languages, delayed ad viewing, multimedia web applications
An export architecture for a multimedia authoring environment (pp. 28-31)
  Jan Mikác; Cécile Roisin; Bao Le Duc
In this paper, we propose an export architecture that provides a clear separation of multimedia authoring services from publication services. We illustrate this architecture with the LimSee3 authoring tool and several standard publication formats: Timesheets, SMIL, and XHTML.
Keywords: SMIL, export, multimedia document, publishing format, timesheets
Adaptation of scalable multimedia documents (pp. 32-41)
  Benoît Pellan; Cyril Concolato
Several scalable media codecs have been standardized in recent years to cope with heterogeneous usage conditions and to aim at always providing audio, video and image content in the best possible quality. Today, interactive multimedia presentations are becoming accessible on handheld terminals and face the same adaptation challenges as the media elements they present: quite diversified screen, memory and processing power capabilities. In this paper, we address the adaptation of multimedia documents by applying the concept of scalability to their presentation.
   The Scalable MSTI document model introduced in this paper has been designed with two main requirements in mind. First, the adaptation process must be simple to execute because it may be performed on limited terminals in broadcast scenarios. Second, the adaptation process must be simple to describe so that authored adaptation directives can be transported along with the document with a limited bandwidth overhead. The Scalable MSTI model achieves both objectives by specifying Spatial, Temporal and Interactive scalability axes on which incremental authoring can be performed to create progressive presentation layers.
   Our experiments are conducted on scalable multimedia documents designed for Digital Radio services on DMB channels using MPEG-4 BIFS and also for web services using XHTML, SVG, SMIL and Flash. A scalable image gallery is described throughout this article and illustrates the features offered by our document model in a rich multimedia example.
Keywords: document adaptation, document model, multimedia scalability

Structured documents

Automated repurposing of implicitly structured documents (pp. 42-51)
  Helen Balinsky; Anthony Wiley; Michael Rhodes; Alfie Abdul-Rahman
The different visual cues present in a document -- such as spatial intervals and positions, contrast in font families, sizes and weights -- combine to form the document's visual hierarchy. This hierarchy is essential to the reader, allowing scanning and comprehension; in contrast, this information is often ignored by machine processing. At the same time, the document structure is often not available in a machine readable form due to the ways documents were originally created or later transformed. This paper addresses the challenge of automatic document repurposing -- applying styling and formatting from one 'implicitly' structured document to another, whilst preserving the underlying visual hierarchy. Using visual perception analysis, the proportionality mapping is established, according to which the original document content is transformed into the new style without breaking the original hierarchical structure. Spatial relationships, location and frequency analysis are then used to fine-tune the transformation.
Keywords: cap-height, document repurposing, hierarchical metrics and structure, injective mapping, x-height
Merging changes in XML documents using reliable context fingerprints (pp. 52-61)
  Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff
Different dialects of XML have emerged as ubiquitous document exchange formats. For effective collaboration based on such documents, the capability to propagate edit operations performed on a document is indispensable. In order to avoid the transmission of whole documents, deltas are used to describe these edit operations, allowing the construction of a new version of a document. However, patching a document with a delta it was not generated for is error-prone, and any insert or delete operations performed on the document are likely to affect all subsequent paths within that document.
   In this paper, we present a delta format for XML documents that uses context-aware fingerprints to identify edit operations. This allows our XML patch procedure to find the correct position of an edit operation, even if the document was updated in the meantime. Possible conflicts are detected. Experimental results show the reliability of the presented fingerprinting technique and prove the high quality of the resulting patched documents.
Keywords: CSCW, XML diff, XML patch, fingerprint, office applications, version control
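The idea of locating an edit by its surrounding context, as described in the abstract above, can be illustrated with a minimal sketch. It is not the authors' delta format: it assumes a document flattened to a list of node strings, and the names `fingerprint`, `locate`, and the `radius` parameter are hypothetical.

```python
import hashlib


def fingerprint(nodes, index, radius=2):
    """Hash the node at `index` together with `radius` neighbours on each side.

    Positions outside the document are padded with empty strings, so the
    anchor node always sits in the middle of the hashed window.
    """
    window = [
        nodes[i] if 0 <= i < len(nodes) else ""
        for i in range(index - radius, index + radius + 1)
    ]
    return hashlib.sha256("\x1f".join(window).encode("utf-8")).hexdigest()


def locate(nodes, expected_fp, hint, radius=2):
    """Search outward from `hint` for a node whose context fingerprint matches.

    Returns the matching index, or None when no position matches, which a
    patch tool could report as a conflict.
    """
    for offset in range(len(nodes)):
        for candidate in (hint - offset, hint + offset):
            if 0 <= candidate < len(nodes):
                if fingerprint(nodes, candidate, radius) == expected_fp:
                    return candidate
    return None
```

In this sketch a delta would store the fingerprint and the original position of each edited node; `locate` then recovers the node even after unrelated insertions have shifted it.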
A concise XML binding framework facilitates practical object-oriented document engineering (pp. 62-65)
  Andruid Kerne; Zachary O. Toups; Blake Dworaczyk; Madhur Khandelwal
Semantic web researchers tend to assume that XML Schema and OWL-S are the correct means for representing the types, structure, and semantics of XML data used for documents and interchange between programs and services. These technologies separate information representation from implementation. The separation may seem like a benefit, because it is platform-agnostic. The problem is that the separation interferes with writing correct programs for practical document engineering, because it violates a primary principle of object-oriented programming: integration of data structures and algorithms. We develop an XML binding framework that connects Java object declarations with serialized XML representation. A basis of the framework is a metalanguage, embedded in Java object and field declarations, designed to be particularly concise, to facilitate the authoring and maintenance of programs that generate and manipulate XML documents. The framework serves as the foundation for a layered software architecture that includes meta-metadata descriptions for multimedia information extraction, modeling, and visualization; Lightweight Semantic Distributed Computing Services; interaction logging services; and a user studies framework.
Keywords: Java, XML, binding framework, metalanguage, object-oriented programming, translation
Malan: a mapping language for the data manipulation (pp. 66-75)
  Arnaud Blouin; Olivier Beaudoux; Stéphane Loiseau
Malan is a MApping LANguage that allows the generation of transformation programs by specifying a schema mapping between a source and target data schema. By working at the schema level, Malan remains independent of any transformation process; it also naturally guarantees the correctness of the transformation target relative to its schema. Moreover, by expressing schemas as UML class diagrams, Malan schema mappings can be written on top of UML modellers. This paper describes the overall approach by focusing on the Malan language itself, and its use within a transformation process.
Keywords: UML, data manipulation, malan, mapping, schema transformation, schema translation

Variable documents

Configurable editing of XML-based variable-data documents (pp. 76-85)
  John Lumley; Roger Gimson; Owen Rees
Variable data documents can be considered as functions of their bindings to values, and this function can be arbitrarily complex in order to build strongly customised but high-value documents. We outline an approach for editing such documents from example instances. The approach is highly configurable in terms of controlling exactly what is editable and how; it can be used with a wide variety of XML-based document formats and processing pipelines, provided certain reasonable properties are supported; and it can generate appropriate editors automatically, including web-service deployment.
Keywords: SVG, XSLT, document construction, document editing, functional programming
Tracking sub-page components in document workflows (pp. 86-89)
  James A. Ollis; Steven R. Bagley; David F. Brailsford
Documents go through numerous transformations and intermediate formats as they are processed, in a workflow, from abstract markup into final printable form. Unfortunately, it is common to find that ideas about document components, which might exist in the source code for the document, become completely lost within an amorphous, unstructured, page of PDF prior to being rendered. Given the importance of a component-based approach in Variable Data Printing (VDP) we have developed a collection of tools that allow information about the various transformations to be embedded at each stage in the workflow, together with a visualization tool that uses this embedded information to display the relationships between the various intermediate documents.
   We demonstrate these tools in the context of an example workflow using DocBook markup but the techniques described are widely applicable and would be easily adaptable to other workflows and for use in teaching tools to illustrate document component and VDP concepts.
Keywords: COGs, DocBook, PDF, VDP, XSL-FO, XSLT, document components, document workflows, education
Higher-level layout through topological abstraction (pp. 90-99)
  Angelo Di Iorio; Luca Furini; Fabio Vitali; John Lumley; Tony Wiley
Existing layout languages provide support for geometric properties allowing -- and in a sense forcing -- users to give a complete geometric description of the desired output: if the characteristics of the output medium change, the layout of the whole document has to be reworked completely, as the properties set by the user are no longer appropriate for the modified context.
   In this paper we propose a different paradigm which allows users to produce layouts by describing their topological and abstract properties, rather than geometric ones. We first define and detail topological properties as abstract relationships between the document components, independent from the output characteristics, and then describe an XML-based layout language based on these concepts, called TALL.
   A running engine able to transform topological layouts into actual PDF files, based on XSLT and the DDF framework, is presented as well.
Keywords: DDF, TALL, XSLT, automatic layouts, topological layouts

Demo session A

An office document mashup for document-centric business processes (pp. 100-101)
  John M. Boyer; Eric Dunn; Maureen Kraft; Jun S. H. Liu; Mihir R. Shah; He Feng Su; Saurabh Tiwari
An office document mashup called 'Dual Forms' is presented to demonstrate the feasibility and advantages of imbuing an office document with intelligent interaction capabilities, access to web services of a service-oriented architecture (SOA), digital signatures for legally binding contractual agreements, and a self-submission capability that allows the document to flow through a collaborative network or business process.
Keywords: ODF, SOA, XForms, XML signature, business process, office document, user interaction, web service
Image collection taxonomies for photo-book auto-population with intuitive interaction (pp. 102-103)
  Pere Obrador; Nathan Moroney; Ian MacDowell; Eamonn O'Brien-Strain
We demonstrate a system for automatic image selection for photobook creation, along with an intuitive user interface for fine tuning of the selection results. A versatile image collection representation is introduced, which allows for automatic scalable selection in order to target a specific image count for a predetermined size photobook. The images are selected based on their relevance, while preserving a good coverage of the event (time plus people) in order to maintain the storytelling potential of the selection. The selected images are laid out and presented to the user through an Adobe Flex user interface, which allows them to select images and swap them for semantically related ones in an intuitive manner. The final result is output to a PDF file.
Keywords: automatic photo selection, hierarchy, image appeal, image collection, near-duplicate detection, scalability, time clustering
A prototype documenter system for medical grand rounds (pp. 104-105)
  Renato de Freitas Bulcão-Neto; José Antonio Camacho-Guerrero; Alessandra Alaniz Macedo
This paper demonstrates our ongoing experience on a documenter system for medical grand rounds. The system captures and synchronizes the set of material presented and corresponding physicians' interactions, automatically relates clinical cases of patients, and then generates web-accessible documents with all information captured. The resulting documentation can be used for several purposes such as teaching, research and presurgical decision taking.
Keywords: documentation, extension, pervasive healthcare

Finding, mashing and mixing

A content-based approach for document representation and retrieval (pp. 106-109)
  Antonio M. Rinaldi
In the last few years, the problem of defining efficient techniques for knowledge representation has become a challenging topic in both the academic and industrial communities. The large amount of available data creates several problems in terms of information overload. In this framework, we assume that new approaches for knowledge definition and representation may be useful, in particular the ones based on the concept of ontology. In this paper we propose a suitable model for knowledge representation purposes using linguistic concepts and properties. We implement our model in a system which, using novel techniques and metrics, analyzes documents from a semantic point of view, using the Web as the context of interest. Experiments are performed on a test set built using a directory service to have information about analyzed documents. The obtained results, compared with other similar systems, show an effective improvement.
Keywords: WordNet, ontologies, semantic relatedness metrics
No mining, no meaning: relating documents across repositories with ontology-driven information extraction (pp. 110-118)
  Víctor Codocedo; Hernán Astudillo
Far from eliminating documents as some expected, the Internet has led to a proliferation of digital documents, without a centralized control or indexing. Thus, identifying relevant documents becomes simultaneously more important and much harder, since what users require may be dispersed across many documents and many repositories. This paper describes Ontologic Anchoring, a technique to relate documents in domain ontologies, using named entity recognition (a natural-language processing approach) and semantic annotation to relate individual documents to elements in ontologies. This approach allows document retrieval using domain-level inferences, and integration of repositories with heterogeneous media, languages and structure. Ontological anchoring is a two-way street: ontologies allow semantic indexing of documents, and simultaneously new documents enrich ontologies. The approach is illustrated with an initial deployment for heritage documents in Spanish.
Keywords: NLP, human-in-the-loop, information extraction, metadata creation, ontological anchoring, ontology
Document logs: a distributed approach to metadata for better security and flexibility (pp. 119-122)
  Michael Gormish; Greg Wolff; Kurt Piersol; Peter Hart
A document log is an ordered list of entries providing a history for any sort of media or file, just as a logfile provides a history of a computer program and a logbook provides a history of a journey. The history of a document may consist of copyright information, approvals, annotations, or any sort of metadata. This paper describes a metadata architecture using Content Based Identifiers and Document Logs that facilitates location of metadata from distributed sources, caching, ordering of log entries, and detection of changes in metadata or documents. The techniques used complement existing metadata format standards and are contrasted with storage of metadata in a file or document management system.
Keywords: hash chain, time-stamp, uuid
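The combination of content-based identifiers and an ordered, tamper-evident log, as described in the abstract above, can be sketched as a simple hash chain. This is an illustrative assumption rather than the authors' architecture: SHA-256 as the identifier function, JSON as the entry encoding, and the names `content_id`, `append_entry`, and `verify` are all hypothetical.

```python
import hashlib
import json


def content_id(data: bytes) -> str:
    """Content-based identifier: here simply the SHA-256 digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


def _entry_hash(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same entry always hashes identically.
    return content_id(json.dumps(body, sort_keys=True).encode("utf-8"))


def append_entry(log: list, document_id: str, metadata: str) -> list:
    """Append a metadata entry that commits to the previous entry's hash."""
    prev = log[-1]["entry_hash"] if log else ""
    body = {"document": document_id, "metadata": metadata, "prev": prev}
    log.append({**body, "entry_hash": _entry_hash(body)})
    return log


def verify(log: list) -> bool:
    """Check the ordering links and that no entry was altered after the fact."""
    prev = ""
    for entry in log:
        body = {key: entry[key] for key in ("document", "metadata", "prev")}
        if entry["prev"] != prev or _entry_hash(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

Because each entry hashes its predecessor, changing or reordering an earlier entry invalidates every later one, which is how a chain of this kind can expose changes in metadata.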
The CONCUR framework for community maintenance of curated resources (pp. 123-126)
  Patrick Schmitz
The increasing use of computational linguistics for semantic search and discovery tools requires much work on development and maintenance of associated ontologies. Related applications depend upon curated resources like dictionaries, gazetteers, etc. In order to scale these application models and leverage the respective communities of interest, a new set of tools is needed that facilitate community development and extension of these resources while retaining the curatorial model to ensure a reliable, high quality resource. We describe the requirements and principles for such a system, and present the CONCUR framework that addresses these needs. CONCUR defines a reputation model and a set of reusable infrastructure services to maintain the resource. The reputation model combines correctness as well as utility of participants' contributions, tracked over time and by sub-domain within the resource. We describe the architectural issues of the model, potential applications, and continuing research on the model.
Keywords: SOA, community, curation, ontology, structured information
Online ancient documents: Armarius (pp. 127-130)
  Reim Doumat; Elöd Egyed-Zsigmond; Jean-Marie Pinon; Emese Csiszar
Many museums and libraries digitize their collections of historical manuscripts to preserve the historic documents and to facilitate their browsing. The collections are available as digital images and they need annotation to be accessible and exploitable. The annotations can be created manually, automatically or semi-automatically. Manual annotation is expensive and tedious; hence the reuse of users' experiences, by tracing their actions during the annotation process, helps other users to accomplish repetitive tasks in a semi-automatic manner. In this article we present a digital archive model and a prototype of a collaborative system for the management of online ancient manuscripts. The application offers an online annotation service, an assistant for semi-automatic annotation, and a tracing system that saves traces of important actions in order to reuse them in a recommender system afterward.
Keywords: document categorization and classification, integrating documents with other digital artifacts, system

Document/image layout

Satisficing scrolls: a shortcut to satisfactory layout (pp. 131-140)
  Nathan Hurst; Kim Marriott
We present a new approach to finding aesthetically pleasing page layouts. We do not aim to find an optimal layout; rather, the aim is to find a layout which is not obviously wrong. We consider vertical scroll-like layout with floating figures referenced within the text where floats can have alternate sizes, may be optional, move from one side to the other and change their order. We also allow pagination. Our approach is to use a randomised local search algorithm to explore different configurations of floats, i.e. choice of floats and relative ordering. For a particular float configuration we use an efficient gradient projection-like continuous optimization algorithm. The resulting system is fast and provides an efficient warm start option to improve interactive support.
Keywords: floating figure, multi-column layout, optimisation techniques
Two algorithms for automatic document page layout (pp. 141-149)
  João Batista S. de Oliveira
This paper describes two approaches to the problem of automatically placing document items on pages of some output device. Both solutions partition the page into regions where each item is to be placed, but work on different input data according to the application: One approach assumes that previously defined rectangular items are to be placed freely on the page (as in a sales brochure), whereas the second approach places free-form items on pages divided into columns (as in a newspaper). Moreover, both approaches try to preserve the reading order provided by the input and use all available area on the page. The algorithms implementing those approaches and based on recursive page division are presented, as well as test results, possible changes and research directions.
Keywords: automatic page layout, packing, placement algorithms
PDF document restoration and optimization during image enhancement (pp. 150-153)
  Hui Chao; Carl Staelin; Sagi Schein; Marie Vans; John Lumley
We present a document processing method that addresses some of the practical challenges in image enhancement for digital photo albums in PDF documents. With the advent of digital offset presses, consumer photo books are becoming increasingly popular, and most such workflows convert the consumer's photos and layout into PDF documents. In order to produce appealing photo albums from consumer photographs, some form of automatic enhancement is usually required, and this enhancement is often done late in the workflow just before printing, and therefore it is done on the PDF file. If each and every PDF generation tool simply inserted a single complete image each time an image appeared in the document, then the process of opening a PDF document, iterating through the document, extracting, enhancing, and replacing images, and then saving the enhanced document would be relatively easy. Unfortunately, PDF generation tools often violate that assumption in two ways. Firstly, large images are often written as a set of small images in strips or tiles, which visually appear to be a single image. Secondly, an image in a PDF document may be reused in the document at different positions and on different pages; directly enhancing images without considering the reuse model could result in a great increase in document size and poor system performance. Therefore, image reconstruction and document optimization were performed in our PDF photo album enhancement solution.
Keywords: PDF optimization, document enhancement, image stitching
Authoring adaptive diagrams (pp. 154-163)
  Cameron McCormack; Kim Marriott; Bernd Meyer
The web and digital media require intelligent, adaptive documents whose appearance and content adapt to the viewing context and which support user interaction. While previous research has focussed on textual and multimedia content, this is also true for diagrammatic content. We have designed and implemented an authoring tool which supports the construction of adaptive diagrams. Adaptive layout behaviour is specified by using constraint-based placement tools as well as by allowing the author to specify more radical layout changes using alternate layout configurations. As well as specifying alternate layouts, the author can specify alternate representations for an object, alternate styles and alternate textual content. The resulting space of different versions of the diagram is the cross product of these different alternatives. At display time the version is constructed dynamically, taking into account the author-specified preference order on the alternatives, the current viewing environment, and user interaction.
Keywords: adaptive layout, authoring, diagrams

Modelling documents

Towards extending and using SPARQL for modular document generation (pp. 164-172)
  Faisal Alkhateeb; Sébastien Laborie
RDF is one of the most used languages for resource description and SPARQL has become its standard query language. Nonetheless, SPARQL remains limited for automatically generating documents from RDF repositories, as it can be used to construct only RDF documents. We propose in this paper an extension to SPARQL that allows the generation of any kind of XML document from multiple RDF data and a given XML template. Thanks to this extension, an XML template can itself contain SPARQL queries that can import template instances. Such an approach makes it possible to reuse templates, divide related information into various templates and avoid templates containing mixed languages. Moreover, reasoning capabilities can be exploited using RDF Schema or simply RDFS.
Keywords: RDF, SPARQL, XML document generation, semantic web, template
Fast identification of visual documents using local descriptors (pp. 173-176)
  Eduardo Valle; Matthieu Cord; Sylvie Philipp-Foliguet
In this paper we introduce a system for the identification of visual documents. Since it stems from content-based document indexing and retrieval, our system does not need to rely on textual annotations, watermarks or other metadata, which can be missing or incorrect. Our retrieval system is based on local descriptors, which have been shown to provide accurate and robust description. Because of the high computational costs associated with the matching of local descriptors, we propose Projection KD-Forest: an indexing technique which allows efficient approximate k nearest neighbors search. Experiments demonstrate that the Projection KD-Forest allows the system to provide prompt results with negligible loss of accuracy. The Projection KD-Forest also compares well when contrasted with other strategies of k nearest neighbors search.
Keywords: copy detection, document identification, image retrieval, k nearest neighbors search, local descriptors, multidimensional indexing
Improving query performance on XML documents: a workload-driven design approach (pp. 177-186)
  Rebeca Schroeder; Ronaldo dos Santos Mello
As XML has emerged as a data representation format and as great quantities of data have been stored in the XML format, XML document design has become an important and evident issue in several application contexts. Methodologies based on conceptual modeling are being tightly applied for designing XML documents. However, the conversion of a conceptual schema to an XML schema is a complex process. In many cases, conceptual relationships cannot be represented in a hierarchy so that they have to be represented by reference relationships in the XML schema. The problem is that reference relationships generate a disconnected XML structure and, consequently, produce an overhead cost for query processing on XML documents.
   This paper presents a design approach for generating XML schemas from conceptual schemas considering the expected workload of the XML applications. Query workload is used to produce XML schemas which minimize the impact of the reference relationships on query performance. We evaluate our approach through a case study where a set of XML documents are redesigned by our methodology. The results demonstrate that query performance is improved in terms of the number of accesses generated by the queries on the XML documents designed by our approach.
Keywords: XML schemas, conceptual schemas, query performance
Similarity of XML schema definitions (pp. 187-190)
  Irena Mlýnková
In this paper we propose a technique for evaluating similarity of XML Schema fragments. Firstly, we define classes of structurally and semantically equivalent XSD constructs. Then we propose a similarity measure that is based on the idea of edit distance applied to XSD constructs and enables one to involve various additional similarity aspects. In particular, we exploit the equivalence classes and semantic similarity of element/attribute names. Using experiments we show the behavior and advantages of the proposal.
Keywords: XML schema, equivalence of XSD constructs, similarity
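For readers unfamiliar with edit-distance similarity, the general idea behind the abstract above can be sketched with the classic Levenshtein distance over sequences. This is not the paper's measure: it ignores equivalence classes and name semantics, treats a schema fragment as a flat list of construct names, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x by y
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalise the distance into [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For example, `similarity(["element", "sequence", "attribute"], ["element", "choice", "attribute"])` counts one substitution out of three positions. A measure in the spirit of the abstract would lower the substitution cost when two constructs fall into the same equivalence class.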
Matching XML documents in highly dynamic applications (pp. 191-198)
  Adrovane M. Kade; Carlos A. Heuser
Highly dynamic applications like the Web and peer-to-peer systems require a great deal of effort in document management. Documents from different sources may contain parts that, although having different structure or different contents, may be considered as representing the same conceptual information. One essential task in this scenario is the identification of complementary or overlapping documents that need to be integrated. In this paper, we deal specifically with documents represented in the XML format. XML document integration is an important process in highly dynamic applications, for the volume of data available in this format is constantly growing. XML integration is also a challenging task, due to the flexible nature of XML, which may lead to structure divergences and content conflicts between the documents. In this work, we present a novel approach to the matching problem, i.e., the problem of defining which parts of two documents contain the same information. Matching is usually the first step of an integration process. Our approach is novel in the sense it combines similarity information from the content of the elements with information from the structure of the documents. This feature, as our experiments confirm, makes our approach capable of dealing with content as well as structural divergences.
Keywords: XML, document management, matching, similarity measure

Information extraction in documents

Automatic keyphrase extraction from scientific documents using N-gram filtration technique BIBAKFull-Text 199-208
  Niraj Kumar; Kannan Srinathan
In this paper we present an automatic keyphrase extraction technique for English documents in the scientific domain. The devised algorithm uses an n-gram filtration technique, which filters sophisticated n-grams {1 ≤ n ≤ 4}, along with their weights, from the words of the input document. To develop the n-gram filtration technique, we have used (1) an LZ78 data compression based technique, (2) a simple refinement step, (3) a simple pattern filtration algorithm, and (4) a term weighting scheme. In the term weighting scheme, we have introduced the importance of the position of the sentence (where a given phrase first occurs) in the document and the position of the phrase in the sentence for documents in the scientific domain (which are more organized than those of other domains). The entire system is based upon statistical observations, simple grammatical facts, heuristics, and lexical information of the English language. We remark that the devised system does not require a learning phase. Our experimental results with a publicly available text dataset show that the devised system is comparable with other known algorithms.
Keywords: information extraction, information retrieval, keyphrase extraction, scientific domain
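To make the notion of candidate n-grams concrete, the following minimal sketch enumerates and counts all word n-grams of length one to four in a text. It is only an assumption-laden illustration: it does not implement the LZ78-based filtration, the refinement step, or the term weighting described in the abstract, and the function name `candidate_ngrams` is hypothetical.

```python
import re
from collections import Counter


def candidate_ngrams(text, max_n=4):
    """Count every word n-gram with 1 <= n <= max_n.

    Tokenization here is a simplifying assumption: lowercase
    alphanumeric runs, with all punctuation discarded.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts
```

A real keyphrase extractor would then filter and weight these candidates, for example by the position of their first occurrence.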
Semantic impact graphs for information valuation BIBAKFull-Text 209-212
  Sinan al-Saffar; Gregory L. Heileman
Information valuation has typically been carried out implicitly in question-answering and document retrieval systems. We argue that explicit information valuation is needed to move away from the system and process-centric nature of implicit valuation which has also hindered the theoretical study of information value under a unified and explicit framework. In this paper we present a graphical-based model for explicit information valuation. Our model caters to the subjective nature of information quality by measuring the impact a candidate piece of information may have on a knowledge base representing the recipient's world view. Our model is capable of evaluating information semantically at the statement level and is in effect basing information-valuation on information-understanding. However, information value can be computed and predicted using our causal graph model without requiring full logical inference typically needed for information-understanding.
Keywords: document ranking, information retrieval, information valuation, semantic web search
Identifying and expanding titles in web texts BIBAKFull-Text 213-216
  Clémentine Adam; Estelle Delpech; Patrick Saint-Dizier
In this paper, we present an analysis based on linguistic and typographic features that allows for the identification of titles in web documents. We focus in particular on procedural texts. Identifying titles is a difficult task because the ways of encoding them are very diverse. A number of titles are also incomplete because of context; we therefore propose a way to retrieve the missing elements, in particular predicates, so that titles are fully intelligible.
Keywords: structure analysis, text semantics, text titles

Demo session B

A demonstration of a configurable editing framework BIBAKFull-Text 217-218
  John Lumley; Roger Gimson; Owen Rees
XML-based variable data documents are special cases of XML documents subjected to processing before final visualisation. We demonstrate how such 'templates' can be edited from specific instances in a generalised manner and that this can be supported by a highly extensible and configurable editing framework. The demonstration covers simple authoring actions, higher-level authoring control (altering the editability within a document), reconfiguring the overall editor capability, using alternative 'views' of documents and exploiting the framework to modify generalised XML 'files', including some of those that define the editor itself.
Keywords: SVG, XSLT, document construction, document editing, functional programming
Playback of mixed multimedia document BIBAKFull-Text 219-220
  Cyril Concolato; Jean Le Feuvre
Many multimedia languages exist today to describe animated, interactive, 2D or 3D graphics and media elements, and each language has its merits. We studied the problems underlying the integration of all these languages in a single player. We present here the result of this work, and in particular, we demonstrate the mixed playback of SVG, BIFS, LASeR, Flash or VRML/X3D content.
Keywords: BIFS, SVG, VRML, mixed documents, multimedia player
Scalable multimedia documents for digital radio BIBAKFull-Text 221-222
  Benoit Pellan; Cyril Concolato
In this paper, we demonstrate the adaptation of multimedia digital radio services in broadcast environments based on scalable multimedia documents. The authoring of our multimedia services relies on the Scalable MSTI model that decomposes multimedia documents into three ordered dimensions: Spatial, Temporal and Interactive descriptions. Our demonstration shows Scalable MSTI multimedia documents that can be adapted to typical T-DMB digital radio usage scenarios.
Keywords: DMB digital radio, digital radio, document adaptation, multimedia radio services, multimedia scalability

Generation and printing

An exploratory mapping strategy for web-driven magazines BIBAKFull-Text 223-229
  Fabio Giannetti
"There will always (I hope) be print books, but just as the advent of photography changed the role of painting or film changed the role of theater in our culture, electronic publishing is changing the world of print media. To look for a one-to-one transposition to the new medium is to miss the future until it has passed you by." -- Tim O'Reilly [1].
   It is not hard to envisage that publishers will leverage subscribers' information, interest groups' shared knowledge and other sources to enhance their publications. While this enhances the value of the publication through more accurate and personalized content, it also brings a new set of challenges to the publisher. Content is now driven by the web, and in a truly automated system no designer "re-touch" intervention can be envisaged. The paper introduces an exploratory mapping strategy to allocate web-driven content in a highly graphical publication like a traditional magazine. Two major aspects of the mapping are covered, which enable different levels of flexibility and address different content flowing strategies. The last contribution is an evaluation of existing standards, which potentially can leverage this work to incorporate more flexible mapping and, subsequently, composition capabilities.
Keywords: SVG, XML, XPS, XSL-FO, content driven pagination, layout, print, template, transactional printing, variable data print
PrintMonkey: giving users a grip on printing the web BIBAKFull-Text 230-239
  Jennifer Baldwin; James A. Rowson; Yvonne Coady
Web content is notoriously difficult to capture on a printed page due to inconsistent and undesired results. Items that users may not want to print, such as media, navigation menus and more show up on their page. Other items that they may care about are truncated or spread across several pages. Some tools exist to help users with what is printed, but they often are cumbersome to use or are costly for a company to maintain. Therefore, we introduce PrintMonkey, which allows users to write their own printing templates and share them with others on the web. No modifications to the original webpages are required and users with less development experience can use and develop templates. A comparison with four alternative solutions reveals the concrete ways in which PrintMonkey improves upon existing approaches in terms of functionality, customizability and scalability.
Keywords: JavaScript, customized browsing, print templates, printing the web, screen scraping

Content processing

Towards Brazilian Portuguese automatic text simplification systems BIBAKFull-Text 240-248
  Sandra M. Aluísio; Lucia Specia; Thiago A. S. Pardo; Erick G. Maziero; Renata P. M. Fortes
In this paper we investigate the main linguistic phenomena that can make texts complex and how they could be simplified. We focus on a corpus analysis of simple account texts available on the web for Brazilian Portuguese and propose simplification strategies for this language. This study illustrates the need for text simplification to facilitate access to information by poor literacy readers and potentially by people with other cognitive disabilities. It also highlights characteristics of simplification for Portuguese, which may differ from other languages. This study is the first step towards building Brazilian Portuguese text simplification systems. One of the scenarios in which these systems could be used is that of reading electronic texts produced, e.g., by the Brazilian government or by relevant news agencies.
Keywords: Brazilian Portuguese, corpus analysis, natural language processing, poor literacy readers, text simplification
Constructing a know-how repository of advices and warnings from procedural texts BIBAKFull-Text 249-252
  Lionel Fontan; Patrick Saint-Dizier
In this paper, we show how a domain-dependent know-how textual database of advice and warnings can be constructed from procedural texts. We show how arguments of the warning and advice types can be annotated and extracted from procedural texts, and propose a format and a strategy to automatically generate a know-how textual database.
Keywords: automatically generated document, structure and content analysis, text semantics
Summarizing and referring: towards cohesive extracts BIBAKFull-Text 253-256
  Patricia Nunes Gonçalves; Lucia Rino; Renata Vieira
In this paper we propose and evaluate a system for summary post-edition, which aims at replacing referential expressions, trying to avoid referential cohesion problems. To propose expressions that best represent the evoked entity, the system uses knowledge about coreference chains. We evaluate the system with knowledge provided by both manual and automatic annotation of coreference chains.
Keywords: automatic summarization, coreference chains, referential cohesion

Closing keynote

Keeping a digital library clean: new solutions to old problems BIBAKFull-Text 257-262
  Alberto H. F. Laender; Marcos André Gonçalves; Ricardo G. Cota; Anderson A. Ferreira; Rodrygo L. T. Santos; Allan J. C. Silva
Digital Libraries are complex information systems that involve rich sets of digital objects and their respective metadata, along with multiple organizational structures and services (e.g., searching, browsing, and personalization), and are normally built for a target community of users with specific interests. Central to the success of this type of system is the quality of its services and content. In the context of DLs of scientific literature, among the many problems faced in sustaining their information quality, two specific ones, related to information consistency, have received a lot of attention from the research community: name disambiguation and lack of information to access the full text of cataloged documents. In this paper, we examine these two problems and describe the solutions we have proposed to solve them.
Keywords: citation management, digital libraries, full-text management, information quality, name disambiguation

Recognizing characters

An optical character recognition approach to qualifying thresholding algorithms BIBAKFull-Text 263-266
  Margaret Sturgill; Steven J. Simske
Pre-processing for raster image based document segmentation begins with image thresholding, which is a binarization process separating foreground from background. In this paper, we compare an existing (Otsu), modified existing (Kittler-Illingworth) and simple peak-based thresholding approach on a set of 982 documents for which existing ground truth (full text) is available. We use the output of an open source OCR engine which incorporates an adaptive/dynamic thresholder that can be bypassed by one of the three global thresholds we tested. This allowed comparison of these three approaches in the aggregate. We then used an independently-generated dictionary as a means of characterizing thresholder efficacy. Such an approach, if successful, will provide the means for selecting an optimal thresholder in the absence of a large set of ground truthed documents. Our preliminary findings here indicate that this approach may provide a reliable means for thresholder comparison and eventually preclude the need for time-intensive human ground truthing.
Keywords: Kittler-Illingworth, OCR, accuracy, meta-algorithms, Otsu, testing, threshold
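For readers unfamiliar with global thresholding, the sketch below shows a plain-Python version of Otsu's method, which picks the gray level that maximizes the between-class variance of the histogram. It is a textbook formulation under simplifying assumptions (a flat list of integer gray levels in 0..255), not the implementation evaluated in the paper; the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t maximizing between-class variance,
    where one class is pixels <= t and the other is pixels > t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 1 if it is above the threshold, else 0."""
    return [1 if p > t else 0 for p in pixels]
```

When several thresholds give the same variance, this sketch keeps the lowest one; other implementations may break ties differently.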
A rotation method for binary document images using DDA algorithm BIBAKFull-Text 267-270
  Duc Thanh Nguyen
DDA (Digital Differential Analyzer) is a well-known algorithm commonly used in computer graphics to interpolate the integer-coordinate pixels of a straight line. In this paper, we introduce a method of image rotation for binary document images using the DDA algorithm, with the assumption that the true skew angles of the documents have already been computed. The proposed method applies the main idea of the DDA algorithm, with some modifications, for the skew scanning lines along the inverse direction of the skew angle. In this method the ratios between the length of black runs and the whole scan line are guaranteed. Thus the algorithm can overcome disadvantages of mathematical rotation such as white holes and over-segmentation. Moreover, using the DDA algorithm to approximate integer points helps this method reduce the number of rotation operations.
Keywords: DDA, rotation algorithm, skew correction
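The classic DDA line interpolation mentioned in the abstract can be sketched as follows. This is only the standard line-drawing form of the algorithm, not the modified variant the paper uses for rotation; the function name `dda_line` and the round-half-up convention are assumptions made for the example.

```python
import math


def dda_line(x0, y0, x1, y1):
    """Return the integer pixels approximating the segment from
    (x0, y0) to (x1, y1), stepping one unit along the major axis."""
    dx, dy = x1 - x0, y1 - y0
    steps = max(abs(dx), abs(dy))
    if steps == 0:
        return [(x0, y0)]  # degenerate segment: a single pixel
    x_inc, y_inc = dx / steps, dy / steps
    x, y = float(x0), float(y0)
    points = []
    for _ in range(steps + 1):
        # Round half up, avoiding Python's round-half-to-even behavior.
        points.append((math.floor(x + 0.5), math.floor(y + 0.5)))
        x += x_inc
        y += y_inc
    return points
```

Because each step advances exactly one pixel along the longer axis, the result contains one pixel per column (or per row) and so leaves no gaps along the line.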
Segmentation of overlapping cursive handwritten digits BIBAKFull-Text 271-274
  Carlos A. B. Mello; Edward Roe; Everton B. Lacerda
In this paper, we describe an approach to the problem of segmenting overlapping characters. We are working with digit segmentation for bank check processing. Our method is based on the idea of a hypothetical ball traversing the number. The inertia of the movement segments the overlapping digits. Rules are defined for this movement. Our initial proposal achieved very good results with O(n^2) complexity.
Keywords: document processing, overlapping digits, segmentation

Modeling, editing, adaptation

Multimedia adaptation in ubiquitous environments: benefits of structured multimedia documents BIBAKFull-Text 275-284
  Pablo Cesar; Ishan Vaishnavi; Ralf Kernchen; Stefan Meissner; Cristian Hesselman; Matthieu Boussard; Antonietta Spedalieri; Dick C. A. Bulterman; Bo Gao
This paper demonstrates the advantages of using structured multimedia documents for session management and media distribution in ubiquitous environments. We show how document manipulations can be used to perform powerful operations such as content to context adaptation and presentation continuity. When consuming media in ubiquitous environments, where the set of devices surrounding a user may change, dynamic media adaptation and session transfer become primary requirements. This paper presents a working system, based on a representative scenario, in which multimedia content is distributed and adapted to a movable user to best suit his/her contextual situation. The implemented scenario includes the following scenes: content selection using a personal mobile phone, content distribution to the most suitable device according to the user's context, and presentation continuity when the user moves to another location. This paper introduces the underlying document manipulations that turn the scenario into a working system.
Keywords: SMIL, multimedia adaptation, session continuity, structured multimedia documents
A visual approach for modeling spatiotemporal relations BIBAKFull-Text 285-288
  Rodrigo Laiola Guimarães; Carlos de Salles Soares Neto; Luiz Fernando Gomes Soares
Textual programming languages have proven to be difficult for many people to learn and to use effectively. For this reason, visual tools can be useful to abstract the complexity of such textual languages, minimizing the specification effort. In this paper we present a visual approach for high-level specification of spatiotemporal relations. To accomplish this task, our visual representation provides an intuitive way to specify complex synchronization events amongst media. Finally, to validate our work, the visual specification is mapped to NCL (Nested Context Language), the standard declarative language of the Brazilian Terrestrial Digital TV System.
Keywords: NCL, SBTVD, connector, spatiotemporal relations, synchronization, visual representation, visual specification
Intermedia synchronization management in DTV systems BIBAKFull-Text 289-297
  Romualdo Monteiro de Resende Costa; Marcelo Ferreira Moreno; Luiz Fernando Gomes Soares
Intermedia synchronization is related to the spatial and temporal relationships among the media objects that compose a DTV application. From the server side (usually a broadcaster's server or a Web server) to receivers, end-to-end intermedia synchronization support must be provided. Based on application specifications, several abstract data structures should be created to guide all synchronization control processes. A special data structure, a labeled digraph called HTG (Hypermedia Temporal Graph), is proposed in this paper as the basis of all other data structures. From HTG, receivers derive a presentation plan to orchestrate the media content presentations that make up a DTV application. From this plan other data structures are derived to estimate when media players should be instantiated and when data contents should be retrieved from a DSM-CC carousel or from a return channel. If the return channel provides QoS support, another data structure is derived from the presentation plan in order to determine when resource reservation should take place. For content pushed by broadcasters, HTG is used on the server side as the basis for building the carousel plan, a data structure that guides the order and frequency with which media objects should be broadcast.
   The paper's proposals were partially put into practice in the current open source reference implementation of the standard middleware of the Brazilian Terrestrial Digital TV System. However, this reference implementation is used just as a proof of concept. The ideas presented can be extended to any multimedia document presentation player (user agent) and content distribution server.
Keywords: NCL, digital TV, intermedia synchronization, middleware, temporal graph
End-user editing of interactive multimedia documents BIBAKFull-Text 298-301
  Maria da Graça C. Pimentel; Renan G. Cattelan; Erick L. Melo; Cesar A. C. Teixeira
The problem of allowing user-centric control within multimedia presentations is important to document engineering when the presentations are specified as structured multimedia documents. In this paper we investigate the problem in the context of end-user "real-time" editing of interactive video programs.
Keywords: interactive multimedia, interactive video