
Proceedings of the 2012 ACM Symposium on Document Engineering

Fullname:Proceedings of the 2012 ACM Symposium on Document Engineering
Editors:Cyril Concolato; Patrick Schmitz
Location:Paris, France
Dates:2012-Sep-04 to 2012-Sep-07
Standard No:ISBN: 978-1-4503-1116-8; hcibib: DocEng12
Summary:It is our great pleasure to welcome you to the 2012 ACM Symposium on Document Engineering -- DocEng 2012, which is being held September 4-7, 2012, in Paris, France. This year's symposium continues its tradition of being the premier forum for presentation of research results and experience reports on leading edge issues of document presentation and adaptation, analysis, modeling, transformation, systems, theory, and applications. The mission of the symposium is to share significant results, to evaluate novel approaches and models, and to identify promising directions for future research and development. DocEng gives researchers and practitioners a unique opportunity to share their perspectives with others interested in the various aspects of document engineering.
    The call for papers attracted 89 submissions from Asia, Australia, Canada, Europe, the Russian Federation, and the United States. The program committee accepted 14 of 42 full paper submissions (33%), plus another 20 short papers, and 5 demos and posters, for a combined acceptance rate of 44%. The papers cover a variety of topics, including Layout and Presentation Control, Document Analysis, OCR and Visual Analysis, Multimedia and Hypermedia, XML and Related Tools, Architecture and Document Management, Search and Sense-making, and Digital Humanities. In addition, the program includes workshops on authoring issues, and on education models and curricula for Document Engineering. DocEng 2012 features keynote speeches by Bruno Bachimont of the Institut National de l'Audiovisuel and Université de Technologie de Compiègne, and by Thierry Delprat of Nuxeo. We hope that these proceedings will serve as a valuable reference for document engineering researchers and developers.
  1. Keynote address
  2. Layout and presentation generation
  3. Document analysis
  4. Multimedia and hypermedia
  5. Keynote address
  6. XML and related tools
  7. OCR and visual analysis
  8. Demonstrations and posters
  9. Search and sensemaking
  10. Digital humanities
  11. Architecture and document management

Keynote address

Document and archive: editing the past 1-2
  Bruno Bachimont
Document engineering has a difficult task: to propose tools and methods to manipulate contents and make sense of them. This task is still harder when dealing with archive, insofar as document engineering has not only to provide tools for expressing sense but above all tools and methods to keep contents accessible in their integrity and intelligible according to their meaning. However, these objectives may be contradictory: access implies to transform contents to make them accessible through networks, tools and devices. Intelligibility may imply to adapt contents to the current state of knowledge and capacity of understanding. But, by doing that, can we still speak of authenticity, integrity, or even the identity of documents? Document engineering has provided powerful means to express meaning and to turn an intention into a semiotic expression. Document repurposing has become a usual way for exploiting libraries, archives, etc. By enabling to reuse a specific part of a given content, repurposing techniques allow to entirely renegotiate the meaning of this part by changing its context, its interactivity, in short the way people can consider this piece of content and interpret it. Put in this way, there could be an antinomy between archiving and document engineering. However, transforming document, editing content is an efficient way to keep them alive and compelling for people. Preserving contents does not consist in simply storing them but in actively transforming them to adapt them technically and keep them intelligible. Editing the past is then a new challenge, merging a content deontology with a document technology. This challenge implies to redefine some classical notions as authenticity and highlight the needs for new concepts and methods. Especially in a digital world, documents are permanently reconfigured by technical tools that produce variants, similar contents calling into question the usual definition of the identity of documents. Editing the past calls for a new critique of variants.

Layout and presentation generation

Ad insertion in automatically composed documents 3-12
  Niranjan Damera-Venkata; José Bento
We consider the problem of automatically inserting advertisements (ads) into machine composed documents. We explicitly analyze the fundamental tradeoff between expected revenue due to ad insertion and the quality of the corresponding composed documents. We show that the optimal tradeoff a publisher can expect may be expressed as an efficient-frontier in the revenue-quality space. We develop algorithms to compose documents that lie on this optimal tradeoff frontier. These algorithms can automatically choose distributions of ad sizes and ad placement locations to optimize revenue for a given quality or optimize quality for given revenue. Such automation allows a market maker to accept highly personalized content from publishers who have no design or ad inventory management capability and distribute formatted documents to end users with aesthetic ad placement. The ad density/coverage may be controlled by the publisher or the end user on a per document basis by simply sliding along the tradeoff frontier. Business models where ad sales precede (ad-pull) or follow (ad-push) document composition are analyzed from a document engineering perspective.
Optimal guillotine layout 13-22
  Graeme Gange; Kim Marriott; Peter Stuckey
Guillotine-based page layout is a method for document layout commonly used by newspapers and magazines, where each region of the page either contains a single article, or is recursively split either vertically or horizontally. Surprisingly there appears to be little research into algorithms for automatic guillotine-based document layout. In this paper we give efficient algorithms to find optimal solutions to guillotine layout problems of two forms. Fixed-cut layout is where the structure of the guillotining is given and we only have to determine the best configuration for each individual article to give the optimal total configuration. Free layout is where we also have to search for the optimal structure. We give bottom-up and top-down dynamic programming algorithms to solve these problems, and propose a novel interaction model for documents on electronic media. Experiments show that our algorithms are effective for realistic layout problems.
ALMcss: a javascript implementation of the CSS template layout module 23-32
  César Acebal; Bert Bos; María Rodríguez; Juan Manuel Cueva
Traditionally, web standards in general and Cascading Style Sheets (CSS) in particular take a long time from when they are defined by the W3C until they are implemented by browser vendors. This has been a limitation not only for authors, who had to wait even years before they were able to use certain CSS properties in their web pages, but also for the creators of the specification itself, who were not able to test their proposals in practice.
   In this paper we present ALMcss, a JavaScript prototype that implements the CSS Template Layout Module, a proposal for an addition to CSS to make it a more capable layout language. It has been developed inside the W3C CSS Working Group by two of the authors of this paper. We present the rationale of the module and an introduction to its syntax, before discussing the design of our prototype.
   ALMcss has served us as a proof of concept that the Template Layout Module is not only feasible, but it can be in fact implemented in current web browsers using just JavaScript and the Document Object Model (DOM). In addition, ALMcss allows web designers to start to use today the new layout capabilities of CSS that the module provides, even before it becomes an official W3C specification.
Learning how to trade off aesthetic criteria in layout 33-36
  Peter Moulder; Kim Marriott
Typesetting software is often faced with conflicting aesthetic goals. For example, choosing where to break lines in text might involve aiming to minimize hyphenation, variation in word spacing, and consecutive lines starting with the same word. Typically, automatic layout is modelled as an optimization problem in which the goal is to minimize a complex objective function that combines various penalty functions each of which corresponds to a particular bad feature. Determining how to combine these penalty functions is difficult and very time consuming, becoming harder each time we add another penalty. Here we present a machine-learning approach to do this, and test it in the context of line-breaking. Our approach repeatedly queries the expert typographer as to which one of a pair of layouts is better, and accordingly refines the estimate of how best to weight the penalties in a linear combination. It chooses layout pair queries by a heuristic to maximize the amount that can be learnt from them so as to reduce the number of combinations that must be considered by the typographer.

Document analysis

Challenges in generating bookmarks from TOC entries in e-books 37-40
  Yogalakshmi Jayabal; Chandrashekar Ramanathan; Mehul Jayprakash Sheth
The task of extracting document structures from a digital e-book is difficult and is an active area of research. On the other hand, many e-books already have a table of contents (TOC) at the beginning of the document. This may lead us to believe that adding bookmarks into a digital document (e-book) based on the existing TOC would be trivial. In this paper, we highlight the challenges involved in this task of automatically adding bookmarks to an existing e-book based on the TOC that exists within the document. If we are able to reliably identify the specific locations of each TOC entry within the document, the algorithms can be easily extended to identify document structures within e-books that have TOC. We describe a tool we have built called Booky that tries to add automatic PDF bookmarks to existing PDF based e-books as they have TOC as part of the document content. The tool addresses most of the challenges that have been identified while leaving a few tricky scenarios still open.
A section title authoring tool for clinical guidelines 41-44
  Mark Truran; Gersende Georg; Marc Cavazza; Dong Zhou
Professional users of medical information often report difficulties when attempting to locate specific information in lengthy documents. Sometimes these difficulties can be attributed to poorly specified section titles which fail to advertise relevant content. In this paper we describe preliminary work on a software plug-in for a document engineering environment that will assist authors when they formulate section-level headings. We describe two different algorithms which can be used to generate section titles. We compare the performance of these algorithms and correlate our experimental results with an evaluation of title quality performed by domain experts.
A methodology for evaluating algorithms for table understanding in PDF documents 45-48
  Max Göbel; Tamir Hassan; Ermelinda Oro; Giorgio Orsi
This paper presents a methodology for the evaluation of table understanding algorithms for PDF documents. The evaluation takes into account three major tasks: table detection, table structure recognition and functional analysis. We provide a general and flexible output model for each task along with corresponding evaluation metrics and methods. We also present a methodology for collecting and ground-truthing PDF documents based on consensus-reaching principles and provide a publicly available ground-truthed dataset.

Multimedia and hypermedia

Interactive non-linear video: definition and XML structure 49-58
  Britta Meixner; Harald Kosch
A literature review on the term "interactive video" and "interactive non-linear video" revealed different levels of interaction in varying definitions. We give a formal definition of the term "interactive non-linear video" to clarify the elements and possible relations between elements contained in such videos. Furthermore, we introduce a new event-based XML format consisting of four required and two optional elements to describe this form of video. A scene graph consisting of scenes with triggers for annotations builds the core of the format. Formal definition and XML format are both illustrated by a real world example.
Just-in-time personalized video presentations 59-68
  Jack Jansen; Pablo Cesar; Rodrigo Laiola Guimaraes; Dick C. A. Bulterman
Using high-quality video cameras on mobile devices, it is relatively easy to capture a significant volume of video content for community events such as local concerts or sporting events. A more difficult problem is selecting and sequencing individual media fragments that meet the personal interests of a viewer of such content. In this paper, we consider an infrastructure that supports the just-in-time delivery of personalized content. Based on user profiles and interests, tailored video mash-ups can be created at view-time and then further tailored to user interests via simple end-user interaction. Unlike other mash-up research, our system focuses on client-side compilation based on personal (rather than aggregate) interests. This paper concentrates on a discussion of language and infrastructure issues required to support just-in-time video composition and delivery. Using a high school concert as an example, we provide a set of requirements for dynamic content delivery. We then provide an architecture and infrastructure that meets these requirements. We conclude with a technical and user analysis of the just-in-time personalized video approach.
TAL processor for hypermedia applications 69-78
  Carlos S. Soares Neto; Hedvan F. Pinto; Luiz Fernando G. Soares
TAL (Template Authoring Language) is a specification language for hypermedia document templates. Templates describe application families with structural and semantic similarities. In TAL, templates not only define design patterns that applications must follow, but also constraints on the use of these patterns. A template must be processed together with a padding document giving rise to a new document in some specification language, called target language. TAL supports the description of templates independently of the languages used to specify target and padding documents. Usually a specific processor is required for each target language and for each padding document used. This paper concerns TAL processors. However, we should note that the proposal can be easily extended to any other solution used to define templates. Any pattern language and any language used to define constraints could be used instead of TAL. The TAL processor architecture is general and it is discussed when presenting the processor framework. As an instantiation example, an implementation of a TAL Processor targeting NCL (the declarative language of Ginga DTV middleware) is examined, and also another one targeting HTML-based middleware. The use of wizards for defining padding documents is also discussed in the examples of the proposed architecture instantiation.
Advene as a tailorable hypervideo authoring tool: a case study 79-82
  Olivier Aubert; Yannick Prié; Daniel Schmitt
Audiovisual documents provide a great primary material for analysis in multiple domains, such as sociology or interaction studies. Video annotation tools offer new ways of analysing these documents, beyond the conventional transcription. However, these tools are often dedicated to specific domains, putting constraints on the data model or interfaces that may not be convenient for alternative uses. Moreover, most tools serve as exploratory and analysis instruments only, not proposing export formats suitable for publication. We describe in this paper a usage of the Advene software, a versatile video annotation tool that can be tailored for various kinds of analyses: users can define their own analysis structure and visualizations, and share their analyses either as structured annotations with visualization templates, or published on the Web as hypervideo documents. We explain how users can customize the software through the definition of their own data structures and visualizations. We illustrate this adaptability through an actual usage for interview analysis.

Keynote address

Content and document based approach for digital productivity applications 83-84
  Thierry Delprat
In today's world most of the data produced and consumed by employees is content. In this talk we will present our approach to create and deploy content and document based applications to improve business processes and user experience.

XML and related tools

A first approach to the automatic recognition of structural patterns in XML documents 85-94
  Angelo Di Iorio; Silvio Peroni; Francesco Poggi; Fabio Vitali
XML is among the preferred formats for storing the structure of documents such as scientific articles, manuals, documentation, literary works, etc. Sometimes publishers adopt established and well-known vocabularies such as DocBook and TEI, other times they create partially or entirely new ones that better deal with the particular requirements of their documents. The (explicit and implicit) requirements of use in these vocabularies often follow well-established patterns, creating meta-structures (the block, the container, the inline element, etc.) that persist across vocabularies and authors and that describe a truer and more general conceptualization of the documents' building blocks. Addressing such meta-structures not only gives a better insight of what documents really are composed of, but provides abstract and more general mechanisms to work on documents regardless of the availability of specific schemas, tools and presentation stylesheets. In this paper we introduce a schema-independent theory based on eleven structural patterns. We provide a definition of such patterns and how they synthesize characteristics emerging from real markup documents. Additionally, we propose an algorithm that allows us to identify the pattern of each element in a set of homogeneous markup documents.
XML query-update independence analysis revisited 95-98
  Muhammad Junedi; Pierre Genevès; Nabil Layaïda
XML transformations can be resource-costly in particular when applied to very large XML documents and document sets. Those transformations usually involve lots of XPath queries and may not need to be entirely re-executed following an update of the input document. In this context, a given query is said to be independent of a given update if, for any XML document, the results of the query are not affected by the update. We revisit Benedikt and Cheney's framework for query-update independence analysis and show that performance can be drastically enhanced, contradicting their initial claims. The essence of our approach and results resides in the use of an appropriate logic, to which queries and updates are both succinctly translated. Compared to previous approaches, ours is more expressive from a theoretical point of view, equally accurate, and more efficient in practice. We illustrate this through practical experiments and comparative figures.
Structure-conforming XML document transformation based on graph homomorphism 99-102
  Tyng-Ruey Chuang; Hui-Yin Wu
We propose a principled method to specify XML document transformation so that the outcome of a transformation can be ensured to conform to certain structural constraints as required by the target XML document type. We view XML document types as graphs, and model transformations as relations between the two graphs. Starting from this abstraction, we use and extend graph homomorphism as a formalism for the specifications of transformations between XML document types. A specification can then be checked to ensure whether results from the transformation will always be structure-conforming.
Toward automated schema-directed code revision 103-106
  Raquel Oliveira; Pierre Genevès; Nabil Layaïda
Updating XQuery programs in accordance with a change of the input XML schema is known to be a time-consuming and error-prone task. We propose an automatic method aimed at helping developers realign the XQuery program with the new schema. First, we introduce a taxonomy of possible problems induced by a schema change. This allows to differentiate problems according to their severity levels, e.g. errors that require code revision, and semantic changes that should be brought to the developer's attention. Second, we provide the necessary algorithms to detect such problems using a solver that checks satisfiability of XPath expressions.

OCR and visual analysis

Effective radical segmentation of offline handwritten Chinese characters towards constructing personal handwritten fonts 107-116
  Zhanghui Chen; Baoyao Zhou
Effective radical segmentation of handwritten Chinese characters can greatly facilitate the subsequent character processing tasks, such as Chinese handwriting recognition/identification and the generation of Chinese handwritten fonts. In this paper, a popular snake model is enhanced by considering the guided image force and optimized by Genetic Algorithm, such that it achieves a significant improvement in terms of both accuracy and efficiency when applied to segment the radicals in handwritten Chinese characters. The proposed radical segmentation approach consists of three stages: constructing guide information, Genetic Algorithm optimization and post-embellishment. Testing results show that the proposed approach can effectively decompose radicals with overlaps and connections from handwritten Chinese characters with various layout structures. The segmentation accuracy reaches 94.91% for complicated samples with overlapped and connected radicals and the segmentation speed is 0.05 second per character. For demonstrating the advantages of the approach, radicals extracted from the user input samples are reused to construct personal Chinese handwritten font library. Experiments show that the constructed characters well maintain the handwriting style of the user and have good enough performance. In this way, the user only needs to write a small number of samples for obtaining his/her own handwritten font library. This method greatly reduces the cost of existing solutions and makes it much easier for people to use computers to write letters/e-mails, diaries/blogs, even magazines/books in their own handwriting.
Structural and visual comparisons for web page archiving 117-120
  Marc Teva Law; Nicolas Thome; Stéphane Gançarski; Matthieu Cord
In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source codes of Web pages, with computer vision techniques. To detect whether successive versions of a Web page are similar or not, our system is based on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real archives validate our approach.
Receipts2Go: the big world of small documents 121-124
  Bill Janssen; Eric Saund; Eric Bier; Patricia Wall; Mary Ann Sprague
The Receipts2Go system is about the world of one-page documents: cash register receipts, book covers, cereal boxes, price tags, train tickets, fire extinguisher tags. In that world, we're exploring techniques for extracting accurate information from documents for which we have no layout descriptions -- indeed no initial idea of what the document's genre is -- using photos taken with cell phone cameras by users who aren't skilled document capture technicians. This paper outlines the system and reports on some initial results, including the algorithms we've found useful for cleaning up those document images, and the techniques used to extract and organize relevant information from thousands of similar-but-different page layouts.
Displaying chemical structural formulae in ePub format 125-128
  Simone Marinai; Stefano Quiriconi
We describe one tool designed to enhance the visualization of chemical structural formulae in E-book readers. When dealing with small formulae, to avoid the pixelation effect with zoomed images, the formula is converted to a vector representation and then enlarged. Conversely, large formulae are split into sub-images by cutting the image in suitable locations, attempting to reduce the parts of the formula that are broken. In both cases the formulae are embedded in one ePub document that allows users to browse the chemical structure on most reading devices.
Logical segmentation for article extraction in digitized old newspapers 129-132
  Thomas Palfray; David Hebert; Stéphane Nicolas; Pierrick Tranouez; Thierry Paquet
Newspapers are documents made of news items and informative articles. They are not meant to be read iteratively: the reader can pick his items in any order he fancies. Ignoring this structural property, most digitized newspaper archives only offer access by issue or at best by page to their content. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows indexing and retrieval of information at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level, through a machine learning based method, then the page logical structure is constructed up from there by the detection of structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated to the ALTO file produced by the system including the OCRed text. Our front-end system provides a web high definition visualisation of images, textual indexing and retrieval facilities, searching and reading at the article level. Article transcriptions can be collaboratively corrected, which as a consequence allows for better indexing. We are currently testing our system on the archives of the Journal de Rouen, one of France's oldest local newspapers. These 250 years of publication amount to 300,000 pages of very variable image quality and layout complexity. Test year 1808 can be consulted at plair.univ-rouen.fr.
Scientific table type classification in digital library 133-136
  Seongchan Kim; Keejun Han; Soon Young Kim; Ying Liu
Tables are ubiquitous in digital libraries and on the Web, utilized to satisfy various types of data delivery and document formatting goals. For example, tables are widely used to present experimental results or statistical data in a condensed fashion in scientific documents. Identifying and organizing tables of different types is an absolutely necessary task for better table understanding, and data sharing and reusing. This paper has a three-fold contribution: 1) We propose Introduction, Methods, Results, and Discussion (IMRAD)-based table functional classification for scientific documents; 2) A fine-grained table taxonomy is introduced based on an extensive observation and investigation of tables in digital libraries; and 3) We investigate table characteristics and classify tables automatically based on the defined taxonomy. The preliminary experimental results show that our table taxonomy with salient features can significantly improve scientific table classification performance.
Document understanding of graphical content in natively digital PDF documents 137-140
  Aysylu Gabdulkhakova; Tamir Hassan
This paper presents an object-based method for analysing the content drawn by graphical operators in natively digital PDF documents. We propose that graphical content in a document can be classified either as structural or non-structural and present an output model for our analysis result. Heuristic techniques are used to group the instructions into regions and determine their logical role in the document's structure. Experimental results demonstrate the effectiveness of the algorithm.

Demonstrations and posters

HP relate: a customer communication system for the SMB market 141-144
  Steve Pruitt; Anthony Wiley
Enterprise businesses rely on variable data publishing solutions to produce customer communications, such as letters, statements, and financial reports, which are tailored to individual recipients. Until now, however, such customer communications systems were out of the reach of the small and medium business (SMB) market for several reasons. In order to produce enterprise-quality documents, businesses needed employees with advanced skills in document design and automated document composition. In addition, customized documents typically require scripted business logic and complicated data integration. To achieve this level of document composition and delivery would require the SMB user to have access to IT systems and staffing that would be prohibitively expensive. HP Relate is an innovative document design system that delivers enterprise-quality documents for a next-generation customer communication system for the SMB market. HP Relate features easy-to-use document design tools that require no more than self-assisted training. Document business logic and data integration is accessible to SMB users through common office tools, such as dragging and dropping and spreadsheets. Instead of requiring software installed on the user's system, HP Relate is provisioned on a cloud-based platform using a software as a service (SaaS) subscription-based model. In addition, the HP Relate platform enables SMBs to deliver documents in the format of a customer's choosing, including traditional print forms, web-based deployment, and mobile devices.
Structured and fragmented content in collaborative XML publishing chains 145-148
  Stéphane Crozat
In this paper, we present the main results of the C2M project through one of its operational deliverables: the Scenari4 collaborative editing and publishing system for XML content. The purpose of the C2M project was to design a system able to manage structured and fragmented contents -- as XML publishing chains do -- while providing collaborative possibilities -- as Enterprise Content Management systems (ECM) do. The main issue is related to transclusion relationships which are massively used in XML publishing chains, in order to support repurposing without copying. This approach is not compatible with the classical way ECMs manage content, especially in terms of propagation of modifications, rights or transactions management. We propose two complementary solutions to manage two different levels of collaboration. The workspace is designed as a highly dynamic place able to deal with live fragments, linked together in a network, that can be easily updated at any time by any user. The library is a more static and more classical way to manage content, dedicated to folder-documents, which are XML frozen versions of sub-networks extracted from workspaces. While workspaces are dedicated to content elaboration and maintenance, libraries are places to store, to read, or to exchange stable documents. Scenari4 is released under a FLOSS license and has been used in several experimental and commercial contexts since the beginning of 2012.
Typesetting multiple interacting streams 149-152
  Blanca Mancilla; Jarryd P. Beck; John Plaice
We present a new means for specifying multiple interacting streams, as is needed for documents with multiple systems of notes, side-by-side translations, and critical editions. Each stream is treated as a sequence of components, and anchors are used in the concrete syntax to define reference points used by other streams. When these streams are loaded into memory, the anchors simply become iterators in a container. We present a set of algorithms for the typesetting of multiple streams of text, each with multiple streams of floats and footnotes.
An inheritance model for documents in web applications with Sydonie 153-156
  Jean-Marc Lecarpentier; Pierre-Yves Buard; Hervé Le Crosnier; Romain Brixtel
Each web site has to manage documents tailored for its specific needs. When building applications with a specific document model, web developers must make a choice: build from scratch, or use existing tools and adapt them to the model. We propose an inheritance model for documents, implemented in the Sydonie open source web development framework. It offers a flexible environment to create classes of documents. Sydonie's document model uses entity nodes inspired by the Functional Requirements for Bibliographic Records (FRBR). Document content and metadata are modeled using a set of relations between entity nodes and attribute objects. Classes of documents or attribute types can be defined through a declarative XML file. Our inheritance model makes it possible to define them at the framework level, application profile level, or application level. This demonstration explains the document definition process and inheritance model implemented in the framework and gives several examples of its advantages.
500 year documentation 157-160
  Francis T. Marchese; Maninder Pal Kaur Shergill
Museum visitors today can regularly view 500-year-old art by Renaissance masters. Will visitors to museums 500 years in the future be able to see the work of digital artists from the early 21st century? This paper considers the real problem of conserving interactive digital artwork for museum installation in the distant future by exploring the requirements for creating documentation that will support an artwork's adaptation to future technology. This documentation must survive as long as the artwork itself -- effectively, in perpetuity. A proposal is made for the use of software engineering methodologies as solutions for designing this documentation.

Search and sensemaking

Personalized document clustering with dual supervision 161-170
  Yeming Hu; Evangelos E. Milios; James Blustein; Shali Liu
The potential for semi-supervised techniques to produce personalized clusters has not been explored. This is because semi-supervised clustering algorithms have so far been evaluated using oracles based on underlying class labels. Although using oracles allows clustering algorithms to be evaluated quickly and without labor-intensive labeling, it has the key disadvantage that oracles always give the same answer for an assignment of a document or a feature. However, different human users might give different assignments of the same document and/or feature because of different but equally valid points of view. In this paper, we conduct a user study in which we ask participants (users) to group the same document collection into clusters according to their own understanding; these groupings are then used to evaluate semi-supervised clustering algorithms for user personalization. Through our user study, we observe that different users have their own personalized organizations of the same collection and that a user's organization changes over time. Therefore, we propose that document clustering algorithms should be able to incorporate user input and produce personalized clusters based on it. We also confirm that semi-supervised algorithms with noisy user input can still produce organizations that match the user's expectations (personalization) better than traditional unsupervised ones. Finally, we demonstrate that, with respect to user personalization, labeling keywords for clusters at the same time as labeling documents can further improve clustering performance compared to labeling only documents.
The Glozz platform: a corpus annotation and mining tool 171-180
  Antoine Widlöcher; Yann Mathet
Corpus linguistics and Natural Language Processing make it necessary to produce and share reference annotations to which linguistic and computational models can be compared. Creating such resources requires a formal framework supporting description of heterogeneous linguistic objects and structures, appropriate representation formats, and adequate manual annotation tools, making it possible to locate, identify and describe linguistic phenomena in textual documents. The Glozz platform addresses all these needs, and provides a highly versatile corpus annotation tool with advanced visualization, querying and evaluation possibilities.
Sift: an end-user tool for gathering web content on the go 181-190
  Matthias Geel; Timothy Church; Moira C. Norrie
Although web sites have started to embed semantic metadata within their documents, it remains a challenge for non-technical end-users to exploit that markup to extract and store information of interest. To address this challenge, we show how tools can be developed that allow users to identify extractable information while browsing and then control how that information should be extracted and stored in a personal library. The proposed approach is based on an extensible framework capable of using different kinds of markup to aid the extraction process and on a unique fusion of several well-established techniques from areas such as the semantic web, data warehousing, web scraping, and web feeds. We present the Sift tool, which is a proof-of-concept implementation of the approach.
Faceted documents: describing document characteristics using semantic lenses 191-194
  Silvio Peroni; David Shotton; Fabio Vitali
The semantic enhancement of a traditional scientific paper is not a straightforward operation, since it involves many different aspects or facets. In this paper we propose eight different semantic lenses through which these facets may be viewed, and describe and exemplify the ontologies by which these lenses may be implemented.

Digital humanities

A framework for retrieval and annotation in digital humanities using XQuery full text and update in BaseX 195-204
  Cerstin Mahlow; Christian Grün; Alexander Holupirek; Marc H. Scholl
A key difference between traditional humanities research and the emerging field of digital humanities is that the latter aims to complement qualitative methods with quantitative data. In linguistics, this means the use of large corpora of text, which are usually annotated automatically using natural language processing tools. However, these tools do not exist for historical texts, so scholars have to work with unannotated data. We have developed a system for systematic, iterative exploration and annotation of historical text corpora, which relies on an XML database (BaseX) and in particular on the Full Text and Update facilities of XQuery.
DocExplore: overcoming cultural and physical barriers to access ancient documents 205-208
  Pierrick Tranouez; Stéphane Nicolas; Vladislavs Dovgalecs; Alexandre Burnett; Laurent Heutte; Yiqing Liang; Richard Guest; Michael Fairhurst
In this paper, we describe DocExplore, an integrated software suite centered on the handling of digitized documents, with an emphasis on ancient manuscripts. This software suite allows the augmentation and exploration of ancient documents of cultural interest. Specialists can add textual and multimedia data and metadata to digitized documents through a graphical interface that does not require technical knowledge. They are helped in this endeavor by sophisticated document analysis tools that make it possible, for instance, to spot words or patterns in images of documents. The suite is intended to considerably ease the process of bringing locked-away historical materials to the attention of the general public by covering all the steps from managing a digital collection to creating interactive presentations suited for cultural exhibitions. Its genesis and sustained development stem from a collaboration of archivists, historians, and computer scientists, the latter being in charge not only of developing the software, but also of creating and incorporating novel pattern recognition techniques for document analysis.
Evaluation of BILBO reference parsing in digital humanities via a comparison of different tools 209-212
  Young-Min Kim; Patrice Bellot; Jade Tavernier; Elodie Faath; Marin Dacos
Automatic bibliographic reference annotation involves the tokenization and identification of reference fields. Recent methods use machine learning techniques such as Conditional Random Fields to tackle this problem. However, state-of-the-art methods always train and evaluate their systems on well-structured data with a simple format, such as the bibliography at the end of scientific articles. This is one reason why parsing references that deviate from a regular format does not work well. In our previous work, we established a standard for tokenization and feature selection on less formulaic data such as notes. In this paper, we compare our system BILBO with other popular online reference parsing tools on new data from a totally different source. BILBO is built on our own corpora, extracted and annotated from real-world data: digital humanities articles from the Revues.org site (90% in French) of OpenEdition. The robustness of the BILBO system allows language-independent tagging. We expect that this first evaluation attempt will motivate the development of other efficient techniques for scattered and less formulaic bibliographic references.
Glyph spotting for mediaeval handwritings by template matching 213-216
  Jan-Hendrik Worch; Mathias Lawo; Björn Gottfried
This paper reports on the analysis of different approaches to searching for glyphs within handwritten mediaeval documents. As layout analysis methods are difficult to apply to the documents at hand, template matching methods are employed. A number of different shape descriptions are used to filter out false positives, since the application of correlation coefficients alone results in too many matches. The overall goal is to interactively support an editor who is transcribing a given handwritten document. For this purpose, the automatic spotting of glyphs enables the editor to compare glyphs within different contexts.
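As a rough illustration of the correlation-based matching mentioned in this abstract, the following minimal sketch slides a small template over an image and keeps the positions whose Pearson correlation coefficient exceeds a threshold. It is not the authors' implementation: the function names, the list-of-lists image representation, and the 0.9 threshold are assumptions made for the example, and the shape-description filtering stage described in the abstract is not shown.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length pixel lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant patch carries no shape information
    return cov / sqrt(var_a * var_b)

def spot(image, template, threshold=0.9):
    """Return (x, y) positions where the template correlates with the image.

    `image` and `template` are lists of rows of pixel intensities
    (hypothetical representation chosen for this sketch).
    """
    th, tw = len(template), len(template[0])
    flat_template = [p for row in template for p in row]
    hits = []
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [image[y + dy][x + dx] for dy in range(th) for dx in range(tw)]
            if correlation(patch, flat_template) >= threshold:
                hits.append((x, y))
    return hits

# Toy data: a 2x2 diagonal "glyph" placed at column 1, row 1 of a 4x4 image.
glyph = [[1, 0],
         [0, 1]]
page = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
matches = spot(page, glyph)
```

Lowering the threshold in this sketch quickly admits partial overlaps, which gives some intuition for why a correlation score on its own may yield too many candidates.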

Architecture and document management

Architecture for hypermedia dynamic applications with content and behavior constraints 217-226
  Luiz Fernando G. Soares; Carlos S. Soares Neto; José Geraldo Sousa
This paper deals with the generation of dynamic hypermedia applications whose content and behavior their authors may not be able to predict a priori, but which must conform to a strict set of explicitly defined constraints. In the paper, we show that it is possible to establish an architecture configuration to be followed by this special kind of dynamic application. In the proposed architecture, templates are responsible for specifying the design patterns and the constraints to be followed. Some alternatives for distributing (from the client side to the server side) the components that comprise the architecture are discussed, and one of them is used to exemplify an instantiation of the architecture. In the instantiation, TAL (Template Authoring Language) is used to define templates. In TAL, templates are open compositions, that is, special sets of patterns for compositions, whose content must obey some explicitly defined constraints. The paper also shows how the architecture instantiation could be used to build dynamic digital TV applications.
Full-text search on multi-byte encoded documents 227-236
  Raymond K. Wong; Fengming Shi; Nicole Lam
The Burrows-Wheeler transform (BWT) has become popular in text compression, full-text search, XML representation, and DNA sequence matching. Full-text search on BWT-encoded text can be performed very efficiently using backward search. This paper studies different approaches for applying BWT to multi-byte encoded (e.g., UTF-16) text documents. While previous work has studied BWT on word-based models, and BWT can be applied directly to multi-byte encodings (by treating the document as single-byte coded), there has been no extensive study on how to utilize BWT on multi-byte encoded documents for efficient full-text search. Therefore, in this paper, we propose several ways to efficiently backward search multi-byte text documents. We demonstrate our findings using Chinese text documents. Our experimental results show that our extensions to the standard BWT method offer faster search performance and use less runtime memory.
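For readers unfamiliar with backward search, the sketch below shows the textbook procedure on a BWT built from an arbitrary sequence of integer symbols, so the same code can be run over code points or over raw UTF-16 bytes (the "single-byte coded" treatment the abstract mentions). It illustrates only the standard method, not the extensions proposed in the paper; all function names are assumptions made for the example, and the rotation-sorting construction is a naive one chosen for brevity.

```python
from collections import Counter

SENTINEL = -1  # sorts before every real symbol (code points and bytes are >= 0)

def bwt(symbols):
    """Burrows-Wheeler transform of a list of ints via sorted rotations (demo only)."""
    s = list(symbols) + [SENTINEL]
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return [s[i - 1] for i in order]  # last symbol of each sorted rotation

def build_index(last_column):
    """Return (first, occ): first[c] counts symbols smaller than c,
    occ[c][i] counts occurrences of c in last_column[:i]."""
    counts = Counter(last_column)
    first, total = {}, 0
    for sym in sorted(counts):
        first[sym] = total
        total += counts[sym]
    occ = {sym: [0] for sym in counts}
    for sym in last_column:
        for c, prefix in occ.items():
            prefix.append(prefix[-1] + (c == sym))
    return first, occ

def count_matches(text_symbols, pattern_symbols):
    """Count pattern occurrences by backward search over the BWT of the text."""
    last = bwt(text_symbols)
    first, occ = build_index(last)
    lo, hi = 0, len(last)  # half-open range of matching rows
    for sym in reversed(pattern_symbols):
        if sym not in first:
            return 0
        lo = first[sym] + occ[sym][lo]
        hi = first[sym] + occ[sym][hi]
        if lo >= hi:
            return 0
    return hi - lo

def code_points(text):
    return [ord(ch) for ch in text]

def utf16_bytes(text):
    return list(text.encode("utf-16-le"))

text, pattern = "中文文本中文", "中文"
by_char = count_matches(code_points(text), code_points(pattern))
by_byte = count_matches(utf16_bytes(text), utf16_bytes(pattern))
```

In the byte-level variant each pattern character costs two backward-search steps, and in general a byte pattern can also match at positions that are not aligned to character boundaries, so a real system would need to account for that.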
Deriving document workflows from feature models 237-240
  Mª Carmen Penadés; Abel Gómez; José H. Canós
Despite the increasing interest in the Document Engineering community, a formal definition of document workflow is still to come. Often, the term refers to an abstract process consisting of a set of tasks that contribute to some document contents, and the techniques being developed support parts of these tasks rather than the generation of the process itself. In most proposals, these tasks are implicit in the business processes running in an organization, which lack an explicit document workflow model that could be analysed and enacted as a coherent unit. In this paper, we propose a document-centric approach to document workflow generation. We have extended the feature-based document meta-model of the Document Product Lines approach with an organizational metamodel. For a given configuration of the feature model, we assign tasks to different members of the organization to contribute to the document contents. Moreover, the relationships between features define an ordering of the tasks, which may be refined to produce a specification of the document workflow model automatically. The generation of customized software manuals is used to illustrate the proposal.
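One simple way to picture how relationships between features can induce a task ordering is a topological sort over a dependency graph. The sketch below uses Python's standard `graphlib` module for this; it is only an illustration under assumed data, not the derivation used in the paper. The feature names, the dependency relation, and the assignees are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical configuration of a feature model for a software manual:
# each feature maps to the features whose content must exist before it.
depends_on = {
    "introduction": set(),
    "installation": {"introduction"},
    "configuration": {"installation"},
    "troubleshooting": {"installation", "configuration"},
}

# Hypothetical assignment of features to members of the organization.
assignee = {
    "introduction": "product manager",
    "installation": "developer",
    "configuration": "developer",
    "troubleshooting": "support engineer",
}

def derive_workflow(depends_on, assignee):
    """Return an ordered list of (feature, assignee) tasks that respects dependencies."""
    order = TopologicalSorter(depends_on).static_order()
    return [(feature, assignee[feature]) for feature in order]

workflow = derive_workflow(depends_on, assignee)
for step, (feature, who) in enumerate(workflow, start=1):
    print(f"{step}. write '{feature}' ({who})")
```

A plain topological order is sequential; a refinement step like the one the abstract mentions could, for instance, group independent tasks so that they run in parallel.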
Charactles: more than characters 241-244
  Blanca Mancilla; John Plaice
In this paper, we propose a general notion of character which encompasses two concepts: points within a character set, such as Unicode, as well as arbitrary tuples defining structured objects. We call these general characters "charactles". Using this model, text can be defined to be a linear sequence of charactles, not requiring the use of hierarchical structures to encode the text. As a result, all sorts of processing, such as searching and typesetting, are potentially simplified.
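To make the idea of a flat sequence of generalized characters concrete, here is a minimal sketch in which each element is either a code point (an integer) or an arbitrary tuple standing for a structured object. The type alias, the helper names, and the `("fraction", 1, 2)` tuple are assumptions made for the example and are not taken from the paper.

```python
from typing import List, Sequence, Tuple, Union

# A generalized character: a code point or a tuple describing a structured object.
Charactle = Union[int, Tuple]

def to_charactles(s: str) -> List[Charactle]:
    """Convert an ordinary string into a sequence of code points."""
    return [ord(ch) for ch in s]

def find(text: Sequence[Charactle], pattern: Sequence[Charactle]) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    for i in range(len(text) - m + 1):
        if list(text[i:i + m]) == list(pattern):
            return i
    return -1

# "x = <fraction 1/2> + y" as one linear sequence, with no nested structure.
text = to_charactles("x = ") + [("fraction", 1, 2)] + to_charactles(" + y")
position = find(text, [("fraction", 1, 2)] + to_charactles(" +"))
```

Because the structured object occupies a single position in the sequence, the same linear search works across plain characters and tuples alike.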