
Proceedings of the 2007 ACM Symposium on Document Engineering

Fullname: DocEng'07 Proceedings of the 7th ACM Symposium on Document Engineering
Editors: Peter King; Steven Simske
Location: Winnipeg, Manitoba, Canada
Dates: 2007-Aug-28 to 2007-Aug-31
Standard No: ISBN 1-59593-776-5, 978-1-59593-776-6; hcibib: DocEng07
  1. Working session
  2. Keynote address
  3. Paper documents: capture and physical-digital-coexistence
  4. Poster session
  5. Variable data printing
  6. XML documents
  7. Keynote address
  8. Demonstrations
  9. Multimedia
  10. Layout and aesthetics
  11. Extending document engineering formats
  12. Classification and machine learning
  13. Document transformation

Working session

Document engineering education (p. 1)
  Ethan V. Munson
This working session will be a roundtable discussion of document engineering education. The working session's goal is to allow educators and researchers to share their experiences in teaching topics related to document engineering. The hope is that these discussions will stimulate the development of common resources, including syllabi, reading lists, and exercises in order to facilitate the spread of document engineering as a viable topic for study.
Keywords: curriculum, document engineering, education

Keynote address

Navigating documents using ontologies, taxonomies and folksonomies (p. 2)
  Margaret-Anne D. Storey
Navigating computer-based information landscapes can be a challenging task for humans in almost any knowledge domain. Most documentation spaces are large, complex and ever-changing, which creates a significant cognitive burden on the end-user. Effective tool support can help orient the user and guide them to an appropriate place in the information space. In our research, we have been investigating how visualization tools can support navigation by leveraging the standard and folk classification systems that are embedded in information spaces. We have focused on two specific domains where navigating information can pose challenges: medical informatics and software engineering.
   Within the domain of medical informatics, we have designed a visualization tool that supports the exploration and comparison of a set of clinical trials. The navigational support offered to the user is customized according to an ontology that describes the trial designs. For software engineers, we have developed a tool that generates "navigational waypoints" from informal tagging in software documents. These waypoints provide a way for the software engineer to create "tours" through the space of software documents. In our current work, we are now exploring how adaptive visualization tools may leverage both structured and unstructured information in providing navigational support. We believe that both kinds of information when presented in a coherent visual manner will lead to more effective cognitive support for users as they browse, query and search integrated knowledge spaces.
Keywords: document navigation, folksonomies, ontologies, taxonomies, visualization, waypoints

Paper documents: capture and physical-digital-coexistence

Thresholding of badly illuminated document images through photometric correction (pp. 3-8)
  Shijian Lu; Chew Lim Tan
This paper presents a document image thresholding technique that binarizes badly illuminated document images through photometric correction. Based on the observation that illumination normally varies smoothly and document images often contain a uniformly colored background, the global shading variation is estimated using a two-dimensional Savitzky-Golay filter that fits a least-squares polynomial surface to the luminance of a badly illuminated document image. With the knowledge of the global shading variation, shading degradation is then corrected through a compensation process that produces an image with roughly uniform illumination. Badly illuminated document images are accordingly binarized through global thresholding of the compensated images. Experiments show that the proposed thresholding technique is fast, robust, and efficient for the binarization of badly illuminated document images.
Keywords: badly-illuminated document images, document image analysis, document image thresholding
A system for understanding imaged infographics and its applications (pp. 9-18)
  Weihua Huang; Chew Lim Tan
Information graphics, or infographics, are visual representations of information, data or knowledge. Understanding infographics in documents is a relatively new research problem, which becomes more challenging when infographics appear as raster images. This paper describes technical details and practical applications of the system we built for recognizing and understanding imaged infographics located in document pages. To recognize infographics in raster form, both graphical symbol extraction and text recognition need to be performed. The two kinds of information are then auto-associated to capture and store the semantic information carried by the infographics. Two practical applications of the system are introduced in this paper: supplementing a traditional optical character recognition (OCR) system and providing enriched information for question answering (QA). To test the performance of our system, we conducted experiments using a collection of downloaded and scanned infographic images. Another set of scanned document pages from the University of Washington document image database was used to demonstrate how the system output can be used by other applications. The results obtained confirm the practical value of the system.
Keywords: applications, association of text and graphics, document image understanding, infographics
A model for mapping between printed and digital document instances (pp. 19-28)
  Nadir Weibel; Moira C. Norrie; Beat Signer
The first steps towards bridging the paper-digital divide have been achieved with the development of a range of technologies that allow printed documents to be linked to digital content and services. However, the static nature of paper and limited structural information encoded in classical paginated formats make it difficult to map between parts of a printed instance of a document and logical elements of a digital instance of the same document, especially taking document revisions into account. We present a solution to this problem based on a model that combines metadata of the digital and printed instances to enable a seamless mapping between digital documents and their physical counterparts on paper. We also describe how the model was used to develop iDoc, a framework that supports the authoring and publishing of interactive paper documents.
Keywords: document integration, document model, interactive paper, page description languages, structured documents
Data model and architecture of a paper-digital document management system (pp. 29-31)
  Kosuke Konishi; Naohiro Furukawa; Hisashi Ikeda
We propose a document management system called "iJITinOffice," which manages paper documents, including those with handwriting, and integrates them with electronic documents. By digitizing and managing handwriting on paper, we provide document management and retrieval capabilities that utilize the thinking process and memory that occurs with handwriting. The system was implemented using Anoto digital pen technology. Previous papers [2, 3, 4] introduced the concept and a summary of our system. In this paper we describe the design of a data model and the architecture of the system. The data model links information from paper, handwriting, and electronic documents together. It makes it possible to interweave searches for electronic documents and handwriting on paper documents.
Keywords: digital pen, handwritten annotation, paper document management
A new Tsallis entropy-based thresholding algorithm for images of historical documents (pp. 32-34)
  Carlos A. B. Mello
This paper presents an algorithm for thresholding images of historical documents. The main objective is to generate high-quality monochromatic images that are easily accessible through the Internet and achieve high recognition rates with Optical Character Recognition algorithms. Our new algorithm is based on the classical entropy concept and a variation defined by the Tsallis entropy, and it proved to be more efficient than classical thresholding algorithms. The generated images are analyzed using precision, recall, accuracy and specificity.
Keywords: document processing, entropy, historical documents, image segmentation, thresholding
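The abstract above does not spell out the new algorithm, so the following is only a minimal sketch of the classical Tsallis-entropy thresholding criterion that such work builds on, not the author's method. The function names, the default q = 0.5, and the 256-level grayscale assumption are illustrative choices.

```python
def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (1 - sum(p_i^q)) / (q - 1), defined for q != 1."""
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


def tsallis_threshold(pixels, q=0.5, levels=256):
    """Return the gray level t maximizing S_A + S_B + (1 - q) * S_A * S_B.

    Pixels with value <= t form class A; the remaining pixels form class B.
    `pixels` is a flat iterable of integers in range(levels).
    """
    hist = [0] * levels
    total = 0
    for value in pixels:
        hist[value] += 1
        total += 1

    best_t, best_score = None, float("-inf")
    below = 0
    for t in range(levels - 1):
        below += hist[t]
        above = total - below
        if below == 0 or above == 0:
            continue  # one class is empty, so t does not split the image
        # Normalize each class histogram so it sums to one.
        s_a = tsallis_entropy((h / below for h in hist[: t + 1]), q)
        s_b = tsallis_entropy((h / above for h in hist[t + 1:]), q)
        score = s_a + s_b + (1.0 - q) * s_a * s_b
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

On a clearly bimodal image, the returned level falls between the dark and bright modes; when several levels tie, the lowest one is returned.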

Poster session

Presenting in HTML (pp. 35-36)
  Erik Wilde; Philippe Cattin
The management and publishing of complex presentations is poorly supported by available presentation software. This makes it hard to publish usable and accessible presentation material, and to reuse that material for continuously evolving events. XSLidy provides an XSLT-based approach to generating presentations from a mix of HTML and structural elements. Using XSLidy, the management and reuse of complex presentations becomes easier, and the results are more user-friendly in terms of usability and accessibility.
Keywords: XSLidy, presentation

Variable data printing

A multi-format variable data template wrapper extending PODi's PPML-T standard (pp. 37-43)
  Fabio Giannetti
Variable Data Print (VDP) has fueled the need for increasingly sophisticated tools and capabilities with every solution vendor providing different approaches and techniques.
   Nevertheless, it is possible to provide a unified wrapper around these different XML formats that will facilitate the exchange of templates and/or import/export into other formats.
   The proposed solution compares favourably with Simple Object Access Protocol (SOAP) provision for WebServices. The SOAP wrapper provides the protocol which allows the contained message to be expressed in different formats (described using XML Schemas). This enables the interoperability between services encapsulating their specific implementations and only exposes the methods and their parameters. This proposal builds on similar concepts separating a template into three parts: the template, the binding and the data. Each part has its own format and "embedded" semantics.
Keywords: PPML, PPMLT, SOAP, SVG, XML, XSL-FO, XSLT, document exchange, template, variable data print
Extracting reusable document components for variable data printing (pp. 44-52)
  Steven R. Bagley; David F. Brailsford; James A. Ollis
Variable Data Printing (VDP) has brought new flexibility and dynamism to the printed page. Every printed instance of a specific class of document can now have different degrees of customized content within the document template.
   This flexibility comes at a cost. If every printed page is potentially different from all others it must be rasterized separately, which is a time-consuming process. Technologies such as PPML (Personalized Print Markup Language) attempt to address this problem by dividing the bitmapped page into components that can be cached at the raster level, thereby speeding up the generation of page instances.
   A large number of documents are stored in Page Description Languages at a higher level of abstraction than the bitmapped page. Much of this content could be reused within a VDP environment provided that separable document components can be identified and extracted. These components then need to be individually rasterisable so that each high-level component can be related to its low-level (bitmap) equivalent. Unfortunately, the unstructured nature of most Page Description Languages makes it difficult to extract content easily.
   This paper outlines the problems encountered in extracting component-based content from existing page description formats, such as PostScript, PDF and SVG, and how the differences between the formats affect the ease with which content can be extracted. The techniques are illustrated with reference to a tool called COG Extractor, which extracts content from PDF and SVG and prepares it for reuse.
Keywords: PDF, SVG, content extraction, graphic objects, PostScript, variable data printing
VDP templates with theme-driven layer variants (pp. 53-55)
  Royston Sellman
Many graphic artists and designers have adapted their skills to the use of tools that extend static layout applications, allowing the creation of Variable Data Print template documents. These connect text and image placeholders to database fields, allowing creation of a set of instances at job time. Much of the VDP work flowing through digital presses originates this way. However, in the field we have observed limitations to simple approaches which make it hard to create templates that do much more than can be achieved with mail merge and variable backgrounds. In this paper we describe two examples which illustrate the problems. Solutions have been developed, but a frequent drawback is that they move the graphic artist out of the loop, either because they do not support the fine layout control creative professionals expect to use, or because they are aimed at programmers and database professionals. Agencies and PSPs, however, are so keen to keep creative professionals in the loop that we have seen ingenious but fragile and inefficient in-house solutions which support complex VDP outputs while still keeping the designer in the team. We have developed tools for creative professionals that extend standard layout applications and allow designers to go a step beyond simple VDP. This paper describes the application of these tools to real use cases. We show that the tools can replace custom solutions, giving improvements in VDP job creation, database simplification and resilience to changing requirements.
Keywords: PPML-T, VDP
Speculative document evaluation (pp. 56-58)
  Alexander Macdonald; David Brailsford; Steven Bagley; John Lumley
Optimisation of real-world Variable Data Printing (VDP) documents is a difficult problem because the interdependencies between layout functions may drastically reduce the number of invariant blocks that can be factored out for pre-rasterisation.
   This paper examines how speculative evaluation at an early stage in a document-preparation pipeline provides a generic and effective method of optimising VDP documents that contain such interdependencies.
   Speculative evaluation will be at its most effective in speeding up print runs if sets of layout invariances can either be discovered automatically, or designed into the document at an early stage. In either case the expertise of the layout designer needs to be supplemented by expertise in exploiting potential invariances and also in predicting the effects of speculative evaluation on the caches used at various stages in the print production pipeline.
Keywords: PPML, SVG, VDP, document layout, optimisation, speculative evaluation

XML documents

A document object modeling method to retrieve data from a very large XML document (pp. 59-68)
  Seung Min Kim; Suk I. Yoo; Eunji Hong; Tae Gwon Kim; Il Kon Kim
Document Object Modeling (DOM) is a widely used approach for retrieving data from an XML document. If the XML document is very large, however, the DOM approach may run out of memory when building the associated XML tree in main memory. To alleviate this problem, we propose a method that allows the very large XML document to be split into n small XML documents, retrieves data from the XML tree built from each of these small XML documents, and combines the results from all n XML trees to generate the final result. With this proposed approach, the memory space and processing time required to retrieve data from the very large XML document using DOM are reduced so that they can be managed by a single general-purpose personal computer.
Keywords: DOM, DOM API, XML, very large XML documents
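The split, query and combine idea in this abstract can be illustrated with Python's standard `xml.etree.ElementTree`. This is a sketch under stated assumptions rather than the authors' method: it assumes the large document is a flat sequence of record elements directly under the root, and the names `iter_chunks`, `find_text`, `record_tag` and `chunk_size` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, record_tag, chunk_size):
    """Yield small in-memory trees holding at most chunk_size records each."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    chunk = ET.Element("chunk")
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            chunk.append(elem)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = ET.Element("chunk")
                root.clear()  # release records that were already processed
    if len(chunk):
        yield chunk


def find_text(source, record_tag, path, chunk_size=1000):
    """Run the same path query on every small tree and combine the results."""
    results = []
    for chunk in iter_chunks(source, record_tag, chunk_size):
        results.extend(node.text for node in chunk.findall(path))
    return results


if __name__ == "__main__":
    data = b"<items><item><name>a</name></item><item><name>b</name></item></items>"
    print(find_text(io.BytesIO(data), "item", "item/name", chunk_size=1))
```

Only one small tree is alive at a time, so peak memory depends on `chunk_size` rather than on the size of the whole document.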
A document engineering environment for clinical guidelines (pp. 69-78)
  Gersende Georg; Marie-Christine Jaulent
In this paper, we present a document engineering environment for Clinical Guidelines (G-DEE), which are standardized medical documents developed to improve the quality of medical care. The computerization of Clinical Guidelines has attracted much interest in recent years, as it could support the knowledge-based process through which they are produced. Early work on guideline computerization has been based on document engineering techniques using mark-up languages to produce structured documents. We propose to extend the document-based approach by introducing some degree of automatic content processing, dedicated to the recognition of linguistic markers, signaling recommendations through the use of "deontic operators". Such operators are identified by shallow parsing using Finite-State Transition Networks, and are further used to automatically generate mark-up structuring the documents. We also show that several guidelines manipulation tasks can be formalized as XSL-based transformations of the original marked-up document. The automatic processing component, which underlies the marking-up process, has been evaluated using two complete clinical guidelines (corresponding to over 300 recommendations). As a result, precision of marker identification varied between 88 and 98% and recall between 81 and 99%.
Keywords: GEM, XML, clinical guidelines, deontic operators
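To make the idea of marking up deontic operators concrete, here is a small sketch that uses a regular expression as a stand-in for the Finite-State Transition Networks mentioned in the abstract. The marker list, the `deontic` element name and the `op` attribute are all hypothetical; the actual G-DEE grammar and markup are not described here.

```python
import re
from xml.sax.saxutils import escape

# Illustrative marker list only; a real guideline grammar would be far richer.
DEONTIC_MARKERS = ["must", "should", "ought to", "is recommended", "are recommended"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(marker) for marker in DEONTIC_MARKERS) + r")\b",
    re.IGNORECASE,
)


def mark_deontic(sentence):
    """Escape the sentence for XML, then wrap each marker in a <deontic> element."""
    def wrap(match):
        text = match.group(1)
        return f'<deontic op="{text.lower()}">{text}</deontic>'

    return _PATTERN.sub(wrap, escape(sentence))
```

The output is an XML fragment that an XSL transformation could then select on, for example by matching every `deontic` element.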
XML version detection (pp. 79-88)
  Deise de Brum Saccol; Nina Edelweiss; Renata de Matos Galante; Carlo Zaniolo
The problem of version detection is critical in many important application scenarios, including software clone identification, Web page ranking, plagiarism detection, and peer-to-peer searching. A natural and commonly used approach to version detection relies on analyzing the similarity between files. Most of the techniques proposed so far rely on the use of hard thresholds for similarity measures. However, defining a threshold value is problematic for several reasons: in particular (i) the threshold value is not the same when considering different similarity functions, and (ii) it is not semantically meaningful for the user. To overcome this problem, our work proposes a version detection mechanism for XML documents based on Naïve Bayesian classifiers. Thus, our approach turns the detection problem into a classification problem. In this paper, we present the results of various experiments on synthetic data that show that our approach produces very good results, both in terms of recall and precision measures.
Keywords: XML, classification, similarity functions, versioning
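Turning threshold-based detection into classification can be sketched with a tiny Gaussian naive Bayes classifier over similarity scores. The abstract does not say which naive Bayes variant or which similarity functions were used, so the Gaussian likelihood, the function names and the two-feature toy data below are assumptions made purely for illustration.

```python
import math


def fit(samples):
    """Estimate per-class prior, feature means and variances.

    `samples` is a list of (feature_vector, label) pairs, where each feature
    is the output of some similarity function applied to a document pair.
    """
    model = {}
    for label in {y for _, y in samples}:
        rows = [x for x, y in samples if y == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model


def predict(model, features):
    """Return the label with the highest log posterior for `features`."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=log_posterior)


if __name__ == "__main__":
    # Made-up toy scores: (content similarity, structure similarity).
    training = [([0.90, 0.85], True), ([0.80, 0.90], True), ([0.95, 0.80], True),
                ([0.10, 0.20], False), ([0.20, 0.10], False), ([0.15, 0.30], False)]
    print(predict(fit(training), [0.88, 0.90]))
```

No similarity threshold is chosen by hand: the decision boundary comes from the labeled examples, which is the point the abstract makes.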
Declarative extensions of XML languages (pp. 89-91)
  Simon Thompson; Peter R. King; Patrick Schmitz
We present a set of XML language extensions that bring notions from functional programming to web authors, extending the power of declarative modelling for the web. Our previous work discussed expressions and user-defined events. In this paper, we discuss how one may extend XML by adding definitions and parameterization; complex data and data types; and reactivity, events and continuous "behaviours". We consider these extensions in the light of World Wide Web Consortium standards, and illustrate their utility by a variety of use cases.
Keywords: XML, behaviour, data type, declarative, event, functional, type

Keynote address

Bank notes: extreme DocEng (p. 92)
  Sara Church
Most people handle bank notes every day without giving them a thought, let alone pondering their complexity. Yet every aspect of a bank note is highly engineered to serve its purpose. Every facet of the bank note's existence, from the materials that comprise them to the equipment that produces them, from the machines that handle them to the shredders that destroy them, is carefully considered and designed. Layered on these functional requirements are human factors and the need to verify their authenticity, to be able to distinguish them from any other printed documents that clever would-be, ill-intentioned imitators might produce.
   In the context of today's print-on-demand environment and the glitter-and-glow appeal of craft and display products to all segments of society, the requirements for achieving this differentiation from the counterfeiters' best products are increasingly challenging.
   This presentation addresses how real bank notes are made, the practical factors that drive their function and form requirements and the interplay of these factors with their security requirements, to inhibit the manufacture of counterfeit bank notes.
Keywords: bank notes, counterfeiting, document engineering "in the large", document security, security documents, security printing, security substrates, variable data printing

Demonstrations

Anvil next generation: a multi-format variable data print template based on PPML-T (pp. 93-94)
  Fabio Giannetti
Anvil Next Generation is a toolset enabling the use of multiple formats as templates. It is mainly based on the Personalized Print Markup Language Template (PPML-T) workflow. The possibility of supporting several template formats within the same workflow enables more flexibility, whilst leaving the data merge and binding operations unchanged.
Keywords: PPML, PPMLT, XSL-FO, XSLT, template, variable data print
Intention driven multimedia document production (pp. 95-96)
  Ludovic Gaillard; Marc Nanard; Peter R. King; Jocelyne Nanard
We demonstrate a system supporting intention-driven multimedia document series production. We present mechanisms which build specifications of genre-compliant document series and which produce documents conforming to those specifications from existing finely indexed multimedia data sources.
Keywords: genre, meta-structure, multimedia, series, transformation
Touch scan-n-search: a touchscreen interface to retrieve online versions of scanned documents (pp. 97-98)
  Fabrice Matulic
The system described in this paper attempts to tackle the problem of finding online content based on paper documents through an intuitive touchscreen interface designed for modern scanners and multifunction printers. Touch Scan-n-Search allows the user to select elements of a scanned document (e.g. a newspaper article) and to seamlessly connect to common web search services in order to retrieve the online version of the document along with related content. This is achieved by automatically extracting keyphrases from text elements in the document (obtained by OCR) and creating "tappable" GUI widgets to allow the user to control and fine-tune the search requests. The retrieved content can then be printed, sent, or used to compose new documents.
Keywords: GUI, keyword extraction, online news retrieval, scanned document
The SALT triple: framework editor publisher (pp. 99-100)
  Tudor Groza; Alexander Schutz; Siegfried Handschuh
In this paper we present the SALT (Semantically Annotated LaTeX) Triple, a set of tools built to demonstrate a complete annotation workflow from creation to usage. The Triple contains the authoring and annotation framework, an editor and a web publisher, which help generate the metadata or use the generated metadata for a specific purpose. The demos show three phases of the workflow: (i) authoring -- first we introduce the way in which concurrent annotations can be created during the authoring process by using the iSALT editor as a front-end for the SALT framework; (ii) generation -- then we show how the metadata is generated and embedded into the final result of the authoring and annotation process, i.e. a semantically enriched PDF document; (iii) usage -- and finally we demonstrate how the metadata can be used to generate a set of rich online workshop proceedings.
Keywords: LaTeX, semantic authoring, semantic document

Multimedia

An efficient, streamable text format for multimedia captions and subtitles (pp. 101-110)
  Dick C. A. Bulterman; A. J. Jansen; Pablo Cesar; Samuel Cruz-Lara
In spite of the high profile of media types such as video, audio and images, many multimedia presentations rely extensively on text content. Text can be used for incidental labels, or as subtitles or captions that accompany other media objects. In a multimedia document, text content is not only constrained by the need to support presentation styles and layout, it is also constrained by the temporal context of the presentation. This involves intra-text and extra-text timing synchronization with other media objects. This paper describes a new timed-text representation language that is intended to be embedded in a non-text host language. Our format, which we call aText (for the Ambulant Text Format), balances the need for text styling with the requirement for an efficient representation that can be easily parsed and scheduled at runtime. aText, which can also be streamed, is defined as an embeddable text format for use within declarative XML languages. The paper presents a discussion of the requirements for the format, a description of the format and a comparison with other existing and emerging text formats. We also provide examples for aText when embedded within the SMIL and MLIF languages and discuss our implementation experiences of aText with the Ambulant Player.
Keywords: DFXP, SMIL, ambulant, realtext, streaming text, timed text
Genre driven multimedia document production by means of incremental transformation (pp. 111-120)
  Marc Nanard; Jocelyne Nanard; Peter R. King; Ludovic Gaillard
Genre, like layout, is an important factor in effective communication, and automated tools which assist in genre compliance are thus of considerable value. Genres are reusable meta-structures, which exist independently of specific documents. This paper focuses on that part of the document production process which involves genre, and discusses a specific example in order to present the design rationale of mechanisms which assist in producing documents compliant with specific genre rules.
   The mechanisms we have developed are based on automated incremental, iterative transformations, which convert a draft document elaborated by the author into a genre compliant final document. The approach mimics the manner in which a human expert would transform the document. Transformation rules constitute a reusable and constructive expression of certain aspects of genre. The rules identify situations which appear inappropriate for the genre in question, and propose corrective action, so that the document becomes increasingly more compliant with the genre in question. This process of genre conformance iterates until no further corrective action is possible.
   This mechanism has been fully implemented. The implementation comprises both a work environment and a rule based language. The implementation relies internally on a general purpose tree transformation engine designed originally for use in natural language processing applications, which we have adapted to handle XML documents.
Keywords: genre, meta-structure, multimedia, series, transformation
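The iterate-until-no-correction-applies loop described in this abstract can be sketched as a fixpoint over rewrite rules. This is an illustration only, not the authors' engine or rule language: the two rules, the element names (`section`, `title`, `p`) and the `max_steps` guard are hypothetical.

```python
import xml.etree.ElementTree as ET


def apply_until_stable(root, rules, max_steps=1000):
    """Apply one corrective action per step until no rule fires.

    Each rule takes the root element, performs at most one correction in
    place, and returns True if it changed the document. Returns the number
    of corrective actions that were applied.
    """
    for step in range(max_steps):
        if not any(rule(root) for rule in rules):
            return step
    raise RuntimeError("rules did not converge")


def drop_empty_paragraph(root):
    """Hypothetical genre rule: paragraphs must not be empty."""
    for parent in root.iter():
        for child in list(parent):
            if child.tag == "p" and len(child) == 0 and not (child.text or "").strip():
                parent.remove(child)
                return True
    return False


def add_missing_title(root):
    """Hypothetical genre rule: every section starts with a title."""
    for section in root.iter("section"):
        if section.find("title") is None:
            title = ET.Element("title")
            title.text = "Untitled"
            section.insert(0, title)
            return True
    return False
```

Calling `apply_until_stable(doc, [drop_empty_paragraph, add_missing_title])` makes the draft progressively more compliant and stops once neither rule finds anything left to correct.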
Timed-fragmentation of SVG documents to control the playback memory usage (pp. 121-124)
  Cyril Concolato; Jean Le Feuvre; Jean-Claude Moissinac
The Scalable Vector Graphics (SVG) language allows in its version 1.2 the description of multimedia scenes including audio, video, vector graphics, interactivity and animations. This standard has been selected by the mobile industry as the format for vector graphics and rich media content. For this purpose, additional tools were introduced in the language to solve the problem of the playback of long-running SVG sequences on memory-constrained devices like mobile phones. However, the proposed tools are not entirely sufficient and solutions outside the scope of SVG are needed.
   This paper proposes a method, complementary to the SVG tools, to control the memory consumption while playing back long running SVG sequences. This method relies on the use of an auxiliary XML document to describe the timed-fragmentation of the SVG document and the storage and streaming properties of each SVG fragment. Using this method, this paper shows that some SVG documents can be stored, delivered and played as streams, and that their playback as streams brings an important memory consumption reduction while using a standard SVG 1.2 Tiny player.
Keywords: fragmentation, memory usage, scalable vector graphics, streaming, timing

Layout and aesthetics

Automatic float placement in multi-column documents (pp. 125-134)
  Kim Marriott; Peter Moulder; Nathan Hurst
Multi-column layout with horizontal scrolling has a number of advantages over the standard model (single column with vertical scrolling) for on-line document layout. However, one difficulty with the multi-column model is the need for good automatic placement of floating figures. We identify reasonable aesthetic criteria for their placement, and then give a dynamic-programming-like algorithm for finding an optimal layout with respect to these criteria. We also investigate an A* based approach and give two variants differing in the choice of heuristic. We find that one of the A* based approaches is faster than the dynamic programming approach and, if a "window" of optimization is used, fast enough for moderately sized documents.
Keywords: floating figure, multi-column layout, optimization techniques
Logical document conversion: combining functional and formal knowledge (pp. 135-143)
  Hervé Déjean; Jean-Luc Meunier
We present in this paper a method for document layout analysis based on identifying the function of document elements (what they do). This approach is orthogonal and complementary to the traditional view based on the form of document elements (how they are constructed). One key advantage of such functional knowledge is that the functions of some document elements are very stable from document to document and over time. Relying on the stability of such functions, the method is not impacted by layout variability, a key issue in logical document analysis and is thus very robust and versatile. The method starts the recognition process by using functional knowledge and uses in a second step formal knowledge as a source of feedback in order to correct some errors. This allows the method to adapt to specific documents by using formal specificities.
Keywords: combination of knowledge, feedback, functional analysis, logical document analysis, methodology
Preserving the aesthetics during non-fixed aspect ratio scaling of the digital border (pp. 144-146)
  Hui Chao; Prasad Gabbur; Anthony Wiley
To enhance the visual effect of a photo, various digital borders or frames are provided for photo decoration at photo sharing websites. Even though multiple versions of the same border design may be prepared manually for several "standard" page or photo sizes, difficulty arises when the user's page or photo size is not one of the standards. Forcing a photo into an ill-fitting border will result in a cropped photo. This limits the use of digital borders and therefore the art designs. In this paper, we propose a method that automatically resizes the digital border for different paper sizes while preserving the look and feel of the original design. It analyzes the geometric layout and semantic structure of the digital border and then, based on the nature of the structures, scales and moves them to the right place to reconstruct the digital border at the new page size.
Keywords: document layout, document scaling, image segmentation and reconstruction
Approximating text by its area (pp. 147-150)
  Nathan Hurst; Kim Marriott
Given possibly non-rectangular shapes, S1, ..., Sn, and some English text, T, we give methods based on approximating T by its area that determine for each Si whether T definitely fits in Si, definitely does not fit in Si, or probably fits in Si. These methods have complexity linear in the size of Si, assuming it is represented as a trapezoid list, but do not depend on the size of T. They require a linear-time, shape-independent pre-processing of the text.
Keywords: continuous approximation
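The area-based test described in this abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the trapezoid representation `(top_width, bottom_width, height)`, the function names, and the `slack` threshold separating "definitely fits" from "probably fits" are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (top_width, bottom_width, height)
Trapezoid = Tuple[float, float, float]


def text_area(word_widths: List[float], space_width: float, line_height: float) -> float:
    """One-off, shape-independent pre-processing: area the text would occupy."""
    if not word_widths:
        return 0.0
    total_width = sum(word_widths) + space_width * (len(word_widths) - 1)
    return total_width * line_height


def shape_area(trapezoids: List[Trapezoid]) -> float:
    """Area of a shape given as a trapezoid list; linear in the list length."""
    return sum((top + bottom) / 2.0 * height for top, bottom, height in trapezoids)


def classify_fit(area_of_text: float, trapezoids: List[Trapezoid], slack: float = 0.8) -> str:
    """Three-way answer from areas alone.

    `slack` is an assumed safety margin, not a value from the paper; a real
    guarantee would need a bound derived from how the text breaks into lines.
    """
    available = shape_area(trapezoids)
    if area_of_text > available:
        return "definitely does not fit"
    if area_of_text <= slack * available:
        return "definitely fits"
    return "probably fits"
```

Once `text_area` has been computed, each call to `classify_fit` costs time proportional to the number of trapezoids and is independent of the length of the text, which mirrors the complexity claim in the abstract.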

Extending document engineering formats

Editing with style BIBAKFull-Text 151-160
  Vincent Quint; Irène Vatton
HTML has popularized the use of style sheets, and the advent of XML has stressed the importance of style as a key area complementing document structure and content. A number of tools are now available for producing HTML and XML documents, but very few address style issues. In this paper we analyze the requirements for style manipulation tools, based on the main features of the CSS language. We discuss methods and techniques that meet these requirements and that can be used to efficiently support web authors in style sheet manipulation. The discussion is illustrated by the recent developments made in the Amaya web authoring environment.
Keywords: CSS, document authoring, style languages, web editing
The Mars project: PDF in XML BIBAKFull-Text 161-170
  Matthew R. B. Hardy
The Portable Document Format (PDF) is a page-oriented, graphically rich document format based on PostScript semantics. It is the file format underlying the Adobe® Acrobat® viewers and is used throughout the publishing industry for final form documents and document interchange. Beyond document layout, PDF provides enhanced capabilities, which include logical structure, forms, 3D, movies and a number of other rich features.
   Developers and system integrators face challenges manipulating PDF and its data. They are looking for solutions that allow them to more easily create and operate on documents, as well as to integrate with modern XML-based document processing workflows.
   The Mars document format is based on the fundamental structures of PDF, but uses an XML syntax to represent the document. Mars uses XML to represent the underlying data structures of PDF, as well as incorporating additional industry standards such as SVG, PNG, JPG, JPG2000 and OpenType. Mars combines all of these components into a ZIP-based document container.
   The use of open standards in Mars means that Mars documents can be used with a large range of off-the-shelf tools and that a larger population of developers will be very familiar with its underlying technology. Using these standards, publishers gain access to all of the richness of PDF, but can now tightly integrate Mars into their document workflows.
Keywords: Mars, PDF, SVG, XML, package, zip
SALT: a semantic approach for generating document representations BIBAKFull-Text 171-173
  Tudor Groza; Alexander Schutz; Siegfried Handschuh
The structure of a document has an important influence on the perception of its content. Considering scientific publications, we can affirm that, with the ordinary linear layout, a well-organized publication following a "red thread" will always be better understood and analyzed than one having a poor or chaotic structure, but not necessarily poor content. Reading a publication in a linear way, from the first page to the last, means a lot of unnecessary information processing for the reader. Looking at a publication from another perspective, by accessing the key points or argumentative structure directly, can give better insights into the author's thoughts, and for certain tasks (e.g. getting a first impression of an article) a representation of the document reduced to its core could be more important than its linear structure. In this paper, we show how one can build different representations of the same document by exploiting the semantics captured in the text. The focus is on scientific publications, and as a foundation we use the SALT (Semantically Annotated LaTeX) annotation framework for creating Semantic PDF Documents.
Keywords: LaTeX, PDF, semantic annotation, semantic document
Endless documents: a publication as a continual function BIBAKFull-Text 174-176
  John Lumley; Roger Gimson; Owen Rees
Variable data can be considered as functions of their bindings to values. The Document Description Framework (DDF) treats documents in this manner, using XSLT semantics to describe document functionality and a variety of related mechanisms to support layout, reference and so forth. But the result of evaluation of a function could itself be a function: can variable data documents behave likewise? We show that documents can be treated as simple continuations within that framework with minor modifications. We demonstrate this on a perpetual diary.
Keywords: SVG, XSLT, document construction, functional programming
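The idea of a document whose evaluation yields another document can be sketched with a closure. This is only an illustration in Python, not DDF or its XSLT-based mechanics; the names `diary` and `bind` and the page format are hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Tuple

# A document is a function of its binding; evaluating it yields a finished
# page plus the next document (a continuation).
Continuation = Callable[[str], Tuple[str, "Continuation"]]


def diary(day: date) -> Continuation:
    """Return the diary document for `day`, still waiting for its entry."""

    def bind(entry: str) -> Tuple[str, Continuation]:
        page = f"{day.isoformat()}: {entry}"
        # The result carries a new document for the following day, so the
        # diary never runs out of pages.
        return page, diary(day + timedelta(days=1))

    return bind


# Usage: bind today's entry, then keep going with the returned document.
first_page, tomorrow = diary(date(2007, 8, 28))("Symposium opens")
second_page, _ = tomorrow("Poster session")
```

Each binding produces one page and a fresh function for the next day, which is the sense in which a perpetual diary can be treated as a simple continuation.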

Classification and machine learning

Authors vs. readers: a comparative study of document metadata and content in the www BIBAKFull-Text 177-186
  Michael G. Noll; Christoph Meinel
Collaborative tagging describes the process by which many users add metadata in the form of unstructured keywords to shared content. The recent practical success of web services with such a tagging component like Flickr or del.icio.us has provided a plethora of user-supplied metadata about web content for everyone to leverage.
   In this paper, we conduct a quantitative and qualitative analysis of metadata and information provided by the authors and publishers of web documents compared with metadata supplied by end users for the same content. Our study is based on a random sample of 100,000 web documents from the Open Directory, for which we examined the original documents from the World Wide Web in addition to data retrieved from the social bookmarking service del.icio.us, the content rating system ICRA, and the search engine Google. To the best of our knowledge, this is the first study to compare user tags with the metadata and actual content of documents in the WWW on a larger scale and to integrate document popularity information in the observations. The data set of our experiments is freely available for research.
Keywords: authoring, del.icio.us, dmoz, dmoz100k06, document engineering, Google, ICRA, metadata, PageRank, social bookmarking, tagging, www
Elimination of junk document surrogate candidates through pattern recognition BIBAKFull-Text 187-195
  Eunyee Koh; Daniel Caruso; Andruid Kerne; Ricardo Gutierrez-Osuna
A surrogate is an object that stands for a document and enables navigation to that document. Hypermedia is often represented with textual surrogates, even though studies have shown that image and text surrogates facilitate the formation of mental models and overall understanding. Surrogates may be formed by breaking a document down into a set of smaller elements, each of which is a surrogate candidate. While processing these surrogate candidates from an HTML document, relevant information may appear together with less useful junk material, such as navigation bars and advertisements.
   This paper develops a pattern recognition based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements, and uses classification algorithms to make selection decisions based on these features. For the purpose of defining features in surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined Document Object Model (DOM)-like representation of semantic structure. Using a quadratic classifier, we were able to eliminate junk surrogate candidates with an average classification rate of 80%. By using this technique, semi-autonomous agents can be developed to more effectively generate surrogate collections for users. We end by describing a new approach for hypermedia and the semantic web, which uses the DSM to define value-added surrogates for a document.
Keywords: document surrogate model, mixed-initiatives, navigation, pattern recognition, principal components analysis, quadratic classifier, semi-autonomous agents, surrogate
Filtering product reviews from web search results BIBAKFull-Text 196-198
  Tun Thura Thet; Jin-Cheon Na; Christopher S. G. Khoo
This study seeks to develop an automatic method to identify product reviews on the Web using the snippets (summary information) returned by search engines. Determining whether a snippet is a review or non-review is a challenging task, since the snippet usually does not contain many useful features for identifying review documents. First, we applied a common machine learning technique, SVM (Support Vector Machine), to investigate which features of snippets are useful for the classification. Then we employed a heuristic approach utilizing domain knowledge and found that the heuristic approach performs as well as the machine learning approach. A hybrid approach that combines the machine learning technique and domain knowledge performs slightly better than the machine learning approach alone.
Keywords: genre classification, product review documents, snippets
Structure and content analysis for html medical articles: a hidden Markov model approach BIBAKFull-Text 199-201
  Jie Zou; Daniel Le; George R. Thoma
We describe ongoing research on segmenting and labeling HTML medical journal articles. In contrast to existing approaches in which HTML tags usually serve as strong indicators, we seek to minimize dependence on HTML tags. Designing logical component models for general Web pages is a challenging task. However, in the narrow domain of online journal articles, we show that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones.
Keywords: HTML document labeling, HTML document segmentation, document layout analysis, document object model (DOM), text mining, web information retrieval
Exclusion-inclusion based text categorization of biomedical articles BIBAKFull-Text 202-204
  Nadia Zerida; Nadine Lucas; Bruno Crémilleux
In this paper, we propose a new approach based on two original principles to categorize biomedical articles. On the one hand, we combine linguistic, structural and metric descriptors to build patterns stemming from data mining techniques. On the other hand, we take into account the importance of the absence of patterns to the categorization task by using an exclusion-inclusion method. To avoid a crisp effect between the absence and the presence of a pattern, the exclusion-inclusion method uses two regret measures to quantify the interest of a weak pattern according to the other classes and among patterns from the same class. The global decision is based on the generalization of the local patterns, first by using patterns excluding classes, then according to the regret ratios. Experiments show the effectiveness of the approach.
Keywords: categorization, characterisation, text mining
Adapting associative classification to text categorization BIBAKFull-Text 205-208
  Baoli Li; Neha Sugandh; Ernest V. Garcia; Ashwin Ram
Associative classification, which originates from numerical data mining, has recently been applied to text data. Text data is first digitized into a database of transactions, and training and prediction are then conducted on the derived numerical dataset. This intuitive strategy has demonstrated quite good performance. However, it does not take the inherent characteristics of text data into consideration as much as possible, although it has to deal with some specific problems of text data such as lemmatizing and stemming during digitization. In this paper, we propose a bottom-up strategy to adapt associative classification to text categorization, in which we take into account the structure information of text. Experiments on the Reuters-21578 dataset show that the proposed strategy can make use of text structure information and achieve better performance.
Keywords: associative classification, text categorization

Document transformation

Towards automatic document migration: semantic preservation of embedded queries BIBAKFull-Text 209-218
  Thomas Triebsees; Uwe M. Borghoff
Archivists and librarians face an ever-increasing amount of digital material. Their task is to preserve its authentic content. In the long run, this requires periodic migrations (from one format to another or from one hardware/software platform to another). Document migrations are challenging tasks where tool support and a high degree of automation are important. A central aspect is that documents are often mutually related and, hence, a document's semantics has to be considered in its whole context. References between documents are usually formulated in graph- or tree-based query languages like URL or XPath. A typical scenario is web-archiving where websites are stored inside a server infrastructure that can be queried from HTML-files using URLs. Migrating websites will often require link adaptation in order to preserve link consistency. Although automated and "trustworthy" preservation of link consistency is easy to postulate, it is hard to carry out, in particular if "trustworthy" means "provably working correctly". In this paper, we propose a general approach to semantically evaluating and constructing graph queries, which at the same time conform to a regular grammar, appear as part of a document's content, and access a graph structure that is specified using First-Order Predicate Logic (FOPL). In order to do so, we adapt model checking techniques by constructing suitable query automata. We integrate these techniques into our preservation framework [12] and show the feasibility of this approach using an example. We migrate a website to a specific archiving format and demonstrate the automated preservation of link consistency. The approach shown in this paper mainly contributes to a higher degree of automation in document migration while still maintaining a high degree of "trustworthiness", namely "provable correctness".
Keywords: automated document migration, digital preservation, link consistency, query processing
Mapping paradigm for document transformation BIBAKFull-Text 219-221
  Arnaud Blouin; Olivier Beaudoux
Since the advent of XML, the ability to transform documents using transformation languages such as XSLT has become an important challenge. However, writing a transformation script (e.g. an XSLT stylesheet) is still an expert task. This paper proposes a simpler way to transform documents by defining a relation between two schemas expressed through our mapping language, and then using a transformation process that applies the mapping instances of the schemas. Thus, a user only needs to focus on the mapping without having any knowledge about how a transformation language and its processor work. This paper outlines our mapping approach and language, and illustrates them with an example.
Keywords: XML, XSLT, document transformation, mapping
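The general idea of declaring a mapping and leaving the transformation to a generic process can be sketched as follows. This is not the authors' mapping language: the dictionary format, the function `apply_mapping`, and the element names are hypothetical, and the sketch only renames direct child elements.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def apply_mapping(source: ET.Element, target_root: str, mapping: Dict[str, str]) -> ET.Element:
    """Build a target document by applying a declarative tag-to-tag mapping.

    The user supplies only `mapping`; the traversal below plays the role of
    the generic transformation process.
    """
    target = ET.Element(target_root)
    for child in source:
        new_tag = mapping.get(child.tag)
        if new_tag is None:
            continue  # unmapped elements are dropped in this sketch
        ET.SubElement(target, new_tag).text = child.text
    return target


# Usage: the mapping is the only thing the user writes.
article = ET.fromstring(
    "<article><title>Editing with style</title><author>V. Quint</author></article>"
)
entry = apply_mapping(article, "entry", {"title": "headline", "author": "byline"})
print(ET.tostring(entry, encoding="unicode"))
```

Keeping the mapping as plain data means it can be changed without touching the traversal code, which is the separation of concerns the abstract argues for.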
Combination of transformation and schema languages described by a complete formal semantics BIBAKFull-Text 222-224
  Catherine Pugin; Rolf Ingold
XML and its associated languages, namely DTD, XML Schema and XSLT, have tremendous importance for many applications even if their semantics is often hard to understand and incomplete. In this paper, we concentrate on transformation languages and propose a new one, written in XML syntax and focused on strong specifications. Since our language is completely defined by formal semantics, conceptual drawbacks have been avoided and complexity has been reduced. Thus, static type checking could easily be provided. Finally, we combine our transformation language with our own schema language in order to perform static typing.
Keywords: XML, integration, schema, static type checking, transformation