
Proceedings of the 2011 ACM Symposium on Document Engineering

Fullname:DocEng'11 Proceedings of the 11th ACM Symposium on Document Engineering
Editors:Matthew Hardy; Frank Wm. Tompa
Location:Mountain View, California
Dates:2011-Sep-19 to 2011-Sep-22
Standard No:ISBN: 1-4503-0863-5, 978-1-4503-0863-2; hcibib: DocEng11
  1. Keynote address
  2. Optimizing layouts
  3. Multimedia presentations
  4. Demos and posters
  5. Keynote address 2
  6. Editing
  7. Visual analysis
  8. Flowing content into layout
  9. Metadata
  10. Summarization
  11. Tailored and adaptive layout
  12. Deviance control
  13. Workshops

Keynote address

The evolving form of documents 1-2
  John E. Warnock
The sophistication and visual richness of printed documents have made great strides over the last 25 years. Entry, layout and production have gone from a mostly manual process to a totally computerized one.
   As documents start to contain multimedia components (sound, animation, video) and become totally "electronic", and as devices to view these new documents become more varied in size, shape and capabilities, the authoring of these documents has become a difficult challenge.
   This talk will address those challenges, and discuss whether the evolving web tools are moving in the right direction.

Optimizing layouts

Probabilistic document model for automated document composition 3-12
  Niranjan Damera-Venkata; José Bento; Eamonn O'Brien-Strain
We present a new paradigm for automated document composition based on a generative, unified probabilistic document model (PDM) that models document composition. The model formally incorporates key design variables such as content pagination, relative arrangement possibilities for page elements and possible page edits. These design choices are modeled jointly as coupled random variables (a Bayesian Network) with uncertainty modeled by their probability distributions. The overall joint probability distribution for the network assigns higher probability to good design choices. Given this model, we show that the general document layout problem can be reduced to probabilistic inference over the Bayesian network. We show that the inference task may be accomplished efficiently, scaling linearly with the content in the best case. We provide a useful specialization of the general model and use it to illustrate the advantages of soft probabilistic encodings over hard one-way constraints in specifying design aesthetics.
Building table formatting tools 13-22
  Mihai Bilauca; Patrick Healy
In this paper we present an overview of the challenges to overcome when developing table authoring tools, including a review of logical table models, typographical issues and automated table layout optimization. We present a Table Drawing Tool prototype which implements an automated solution for the table layout optimization problem for tables with spanning cells using a mathematical modelling method. We report on the performance improvements of this new optimization method compared to previous solutions.
Optimal automatic table layout 23-32
  Graeme Gange; Kim Marriott; Peter Moulder; Peter Stuckey
Automatic layout of tables is useful in word processing applications and is required in on-line applications because of the need to tailor the layout to the viewport width, choice of font and dynamic content. However, if the table contains text, minimizing the height of the table for a fixed maximum width is a difficult combinatorial optimization problem. We present three different approaches to finding the minimum height layout based on standard approaches for combinatorial optimization. All are guaranteed to find the optimal solution. The first is an A*-based approach that uses an admissible heuristic based on the area of the cell content. The second and third are constraint programming (CP) approaches using the same CP model. The second approach uses traditional CP search, while the third approach uses a hybrid CP/SAT approach, lazy clause generation, that uses learning to reduce the search required. We provide a detailed empirical evaluation of the three approaches and also compare them with two mixed integer programming (MIP) encodings due to Bilauca and Healy.
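   The abstract mentions an admissible heuristic based on the area of the cell content. As a minimal sketch of how such an area-based bound can work (not the authors' implementation), assume each cell offers a few hypothetical (width, height) boxes into which its text can be wrapped. Cells cannot overlap, so the table's area must be at least the sum of the smallest areas, which gives a height bound that never overestimates.

```python
def min_cell_area(configs):
    """Smallest area among a cell's candidate (width, height) boxes."""
    return min(w * h for w, h in configs)


def height_lower_bound(cells, max_width):
    """Lower bound on table height for a table no wider than max_width.

    The table covers at most max_width * height, and cells do not overlap,
    so that area must be at least the sum of the minimum cell areas.
    """
    total_area = sum(min_cell_area(configs) for configs in cells)
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_area // max_width)


# Hypothetical data: each cell lists the boxes its text can be wrapped into.
cells = [
    [(100, 20), (50, 40), (25, 80)],    # smallest area 2000
    [(200, 20), (100, 40), (50, 100)],  # smallest area 4000
    [(60, 20), (30, 40)],               # smallest area 1200
]
bound = height_lower_bound(cells, max_width=150)
```

   Because the bound ignores how cells share rows and columns, it is optimistic, which is the property an A*-style search needs from its heuristic.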

Multimedia presentations

A framework with tools for designing web-based geographic applications 33-42
  The Nhan Luong; Sébastien Laborie; Thierry Nodenot
Many Web-based geographic applications have been developed in various domains, such as tourism, education, surveillance and military. However, developing such applications is a cumbersome task because it requires several types of components (e.g., maps, contents, indexing services, databases) that have to be assembled together. Hence, developers have to deal with different technologies and application behavior models. In order to create Web-based geographic applications and overcome these design problems, we propose a framework composed of three complementary tasks: identifying some desired data, building the graphical layout organization and defining potential user interactions. According to this framework, we have specified a unified model and we have encoded it using Semantic Web technologies, such as RDF. Through a prototype named WINDMash, we have implemented some tools that instantiate our model and automatically generate concrete Internet geographic applications that can be executed on Web browsers.
Timesheets.js: when SMIL meets HTML5 and CSS3 43-52
  Fabien Cazenave; Vincent Quint; Cécile Roisin
In this paper, we explore different ways to publish multimedia documents on the web. We propose a solution that takes advantage of the new multimedia features of web standards, namely HTML5 and CSS3. While JavaScript is fine for handling timing, synchronization and user interaction in specific multimedia pages, we advocate a more generic, document-oriented alternative relying primarily on declarative standards: HTML5 and CSS3 complemented by SMIL Timesheets. This approach is made possible by a Timesheets scheduler that runs in the browser. Various applications based on this solution illustrate the paper, ranging from media annotations to web documentaries.
Component-based hypervideo model: high-level operational specification of hypervideos 53-56
  Madjid Sadallah; Olivier Aubert; Yannick Prié
Hypervideo offers enhanced video-centric experiences. Because hypervideo is usually defined from a hypermedia perspective, the lack of a dedicated specification keeps the hypervideo domain and its concepts from being broadly investigated. This article proposes a specialized hypervideo model that addresses hypervideo specificities.
   Following the principles of component-based modeling and annotation-driven content abstracting, the Component-based Hypervideo Model (CHM) that we propose is a high level representation of hypervideos that intends to provide a general and dedicated hypervideo data model.
   Considered as a video-centric interactive document, the CHM hypervideo presentation and interaction features are expressed through a high level operational specification. Our annotation-driven approach promotes a clear separation of data from video content and document visualizations. The model serves as a basis for a Web-oriented implementation that provides a declarative syntax and accompanying tools for hypervideo document design in a Web standards-compliant manner.

Demos and posters

The art of mathematics retrieval 57-60
  Petr Sojka; Martin Líška
The design and architecture of MIaS (Math Indexer and Searcher), a system for mathematics retrieval, are presented, and design decisions are discussed. We argue for an approach based on Presentation MathML using a similarity of math subformulae. The system was implemented as a math-aware search engine based on the state-of-the-art system Apache Lucene.
   Scalability issues were checked against more than 400,000 arXiv documents with 158 million mathematical formulae. Almost three billion MathML subformulae were indexed using a Solr-compatible Lucene.
Automated conversion of web-based marriage register data into a printed format with predefined layout 61-64
  David F. Brailsford
The Phillimore Marriage Registers for England were published in the period 1896 to 1922 and have defined a standard layout format for the typesetting of marriage data. However, not all English parish churches had their marriage registers analysed and printed by the Phillimore organisation within this time period.
   This paper tells the story of Wirksworth, a town in Derbyshire with a large church, licensed for marriages, yet whose marriage data was not released to the Phillimore organisation. Hence there is no printed Phillimore Marriages volume for Wirksworth. However, in recent years, a Wirksworth web site, created by John Palmer, has become famous as being probably the most comprehensive record of a parish's activities anywhere on the Web.
   Within a total of 120 MB of data on the web site, covering events in Wirksworth from medieval times to the present, is a set of data recording births, marriages and deaths transcribed from the original hand-written church register volumes.
   The work described here covers the software tools and techniques that were used in creating a set of awk scripts to extract all the marriage records from the Wirksworth web site data. The extracted material was then automatically re-processed, typeset and indexed to form an entirely new Phillimore-style volume for Wirksworth marriages.
A cloud-based and social authoring tool for video 65-68
  Naimdjon Takhirov; Fabien Duchateau
In this paper, we present a cloud-based collaborative authoring tool called Creaza VideoCloud. This authoring tool offers an extensive set of features for document-based video authoring in the cloud.
Collaborative editing of multimodal annotation data 69-72
  Stephan Wieschebrink
The annotation of multimodal speech corpora is a particularly tedious task, since annotatable events can be composed of smaller events that span several modalities (e.g. speech and gesture). This imposes the need to operate on the same data using a wide range of different tools in order to cover all the different modalities and layers of abstraction within multimodal data. MonadicDom4J has been developed as a highly generic, general-purpose, Java-based Rich Client framework that opens the possibility to simultaneously operate on any kind of XML data through several different views and from several remote locations. It allows for the dynamic allocation of plugins, needed to render a given type of XML markup, and takes care of the concurrency between different sites viewing the same data, by means of differential synchronization. The demonstration will involve several different applications ranging from general textual hyperdocument editing to multimodal annotation tools, whose contents can be freely intermixed, interlinked and transcluded into different contexts, using drag and drop interaction. The audience will have the opportunity to try collaborative editing on the presented examples from their own devices.
Developer-friendly annotation-based HTML-to-XML transformation technology 73-76
  Lendle Chun-Hsiung Tseng
Nowadays, the amount of information accessible on the web is huge. Although web users today expect a more integrated way to access information on the web, it is still rather difficult to "integrate" information from different web sites since most web pages are authored in HTML format, which is actually a presentation-oriented language and is usually considered unstructured. Today, many research efforts aim at extracting information from web pages. Existing work typically transforms the extraction results into structured or semi-structured data formats so that other applications can further process the results to discover more useful information. Nevertheless, the unstructured nature of HTML makes the transformation process complex, and such approaches can hardly be widely adopted. In this paper, an annotation-based HTML-to-XML transformation technology is proposed. The mechanism is developed with both usability and simplicity in mind. With the proposed mechanism, ordinary web site developers simply add annotations to their web pages. Annotated web pages can then be processed by our software libraries and transformed into XML documents, which are machine-understandable. Software agents thus can be developed based on our technology.
EDITEC: hypermedia composite template graphical editor for interactive tv authoring 77-80
  Jean Damasceno; Joel dos Santos; Débora Muchaluat-Saade
This paper presents EDITEC, a graphical editor for hypermedia composite templates that can be used for authoring interactive TV programs. EDITEC templates are based on the XTemplate 3.0 language. EDITEC was designed to offer a user-friendly visual approach. It provides several options for representing iteration structures and a graphical interface for creating basic XPath expressions. The editor provides a multiple-view environment, giving the user complete control of the composite template during the authoring process. Composite templates can be used in NCL programs for embedding spatio-temporal semantics into NCL contexts. NCL is the standard declarative language used for the production of interactive applications in the Brazilian digital TV system and ITU H.761 IPTV services.
An exploratory analysis of mind maps 81-84
  Joeran Beel; Stefan Langer
The results presented in this paper come from an exploratory study of 19,379 mind maps created by 11,179 users of the mind mapping applications 'Docear' and 'MindMeister'. The objective was to find out how mind maps are structured and which information they contain. The results include: A typical mind map is rather small, with 31 nodes on average (median), and each node usually contains between one and three words. In 66.12% of cases there are few notes, if any, and the number of hyperlinks tends to be rather low, too, but depends upon the mind mapping application. Most mind maps are edited only on one (60.76%) or two days (18.41%). It is to be expected that a typical user creates around 2.7 mind maps (mean) a year. However, there are exceptions which create a long tail. One user created 243 mind maps, the largest mind map contained 52,182 nodes, one node contained 7,497 words and one mind map was edited on 142 days.
Models for video enrichment 85-88
  Benoît Encelle; Pierre-Antoine Champin; Yannick Prié; Olivier Aubert
Videos are commonly being augmented with additional content such as captions, images, audio, hyperlinks, etc., which are rendered while the video is being played. We call the result of this rendering "enriched videos". This article details an annotation-based approach for producing enriched videos: enrichment is mainly composed of textual annotations associated with temporal parts of the video that are rendered while playing it. We first introduce the key notion of enriched video and associated concepts, and then present the models we have developed for annotating videos and for presenting annotations during the playing of the videos. Finally, an overview of a general workflow for producing/viewing enriched videos is presented. This workflow particularly illustrates the usage of the proposed models in order to improve the accessibility of videos for people with sensory disabilities.
Print-friendly page extraction for web printing service 89-92
  Sam Liu; Cong-Lei Yao
Printing Web pages from browsers usually results in unsatisfactory printouts because the pages are typically ill formatted and contain non-informative content such as navigation menus and ads. Thus, print-worthy Web pages such as articles generally contain hyperlinks (or links) that lead to print-friendly pages containing the salient content. For a more desirable Web printing experience, the main Web content should be extracted to produce well formatted pages. This paper describes a cloud service based on automatic content extraction and repurposing from print-friendly pages for Web printing. Content extraction from print-friendly pages is simpler and more reliable than from the original pages, but there are many variations of the print-link representations in HTML that make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", "print-friendly version", etc. In addition, some links use printer-resembling image icons with or without a print phrase present. To complicate matters further, not all of the links contain a valid URL; instead, the pages are dynamically generated either by client-side JavaScript or by the server, so that no URL is present. Experimental results suggest that our solution is capable of achieving over 99% precision and 97% recall performance measures for print-friendly link extraction.
Skeleton comparisons: the junction neighbourhood histogram 93-96
  Jannis Stoppe; Björn Gottfried
For analysing and comparing characters, using skeletons is a promising approach due to their topology-preserving nature and the resemblance of the skeleton to the original writing movement. We suggest a novel qualitative approach to skeleton comparison that is based on the adjacency of junctions and end points and the steps of a preceding skeleton simplification. By using a multi-dimensional histogram that contains information about the adjacency and the degree of joints, we gain high comparison speeds which, when combined with the multi-step approach, can be used for a generic topology distance metric.
Version-aware XML documents 97-100
  Cheng Thao; Ethan V. Munson
A document often goes through many revisions before it is finalized. In the normal document creation process, newer revisions overwrite older ones and only the final revision is kept. At any stage of document creation, it might be desirable to see how the document came to its current form or to revert back to a previous revision. Conventional version control tools such as CVS could help authors do exactly this. However, these tools are unlikely to be adopted by non-technical document authors due to the overhead of managing a repository and the tools' learning curves.
   This paper presents an approach called version-aware documents that embeds versioning data within the document thus making version control for single documents a seamless part of the authoring process.

Keynote address 2

Google's international bloopers... and how we fixed one 101-102
  Luke Swartz; Mark Davis
Google has millions of users around the world...but occasionally we mess up. We'll explore some of Google's international "bloopers" and show how to avoid similar mistakes in other applications.
   We'll also highlight how we solved a persistent blooper, namely having UI strings like "Alice has added 3 contact to his address book." Our Plural/Gender API allows complicated UI strings to change appropriately, based on numbers and personal gender. We'll look at the document formats used to express plural and gendered messages, and explore the impact on the translation process.
Note: Presentation titled: Google+ Internationalization
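   The "3 contact ... his address book" blooper above can be illustrated with a small sketch of message selection by plural category and gender. This is not Google's Plural/Gender API: the `MESSAGES` table, the `render` function and the English-only plural rule are assumptions made for illustration.

```python
def plural_category(n):
    """English-only rule (an assumption): 'one' for exactly 1, else 'other'."""
    return "one" if n == 1 else "other"


# Hypothetical message table keyed by (gender, plural category).
MESSAGES = {
    ("female", "one"): "{name} has added {n} contact to her address book.",
    ("female", "other"): "{name} has added {n} contacts to her address book.",
    ("male", "one"): "{name} has added {n} contact to his address book.",
    ("male", "other"): "{name} has added {n} contacts to his address book.",
    ("other", "one"): "{name} has added {n} contact to their address book.",
    ("other", "other"): "{name} has added {n} contacts to their address book.",
}


def render(name, gender, n):
    """Pick the message variant for this gender and count, then fill it in."""
    gender_key = gender if gender in ("female", "male") else "other"
    template = MESSAGES[(gender_key, plural_category(n))]
    return template.format(name=name, n=n)


message = render("Alice", "female", 3)
```

   Keeping whole sentences as variants, rather than gluing fragments together, lets translators supply as many variants as their language needs.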

Editing

Evaluating CRDTs for real-time document editing 103-112
  Mehdi Ahmed-Nacer; Claudia-Lavinia Ignat; Gérald Oster; Hyun-Gul Roh; Pascal Urso
Nowadays, real-time editing systems are catching on. Tools such as Etherpad or Google Docs enable multiple authors at dispersed locations to collaboratively write shared documents. In such systems, a replication mechanism is required to ensure consistency when merging concurrent changes performed on the same document. Current editing systems make use of operational transformation (OT), a traditional replication mechanism for concurrent document editing.
   Recently, Commutative Replicated Data Types (CRDTs) were introduced as a new class of replication mechanisms whose concurrent operations are designed to be natively commutative. CRDTs, such as WOOT, Logoot, Treedoc, and RGAs, are expected to be substitutes for replication mechanisms in collaborative editing systems.
   This paper demonstrates the suitability of CRDTs for real-time collaborative editing. To reflect the tendency of decentralised collaboration, which can resist censorship, tolerate failures, and let users have control over documents, we collected editing logs from real-time peer-to-peer collaborations. We present our experiment results obtained by replaying those editing logs on various CRDTs and an OT algorithm implemented in the same environment.
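   To make "natively commutative" concrete, here is a toy sequence CRDT. It is a deliberately simplified sketch and not WOOT, Logoot, Treedoc or RGA: the class `ToySequence`, the identifier scheme (a fractional position plus a site name) and the tombstone set are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations


class ToySequence:
    """Characters keyed by globally unique, totally ordered identifiers."""

    def __init__(self):
        self.elements = {}       # identifier -> character
        self.tombstones = set()  # identifiers that have been deleted

    def apply(self, op):
        kind, ident, char = op
        if kind == "insert":
            # Ignore an insert whose delete has already been seen.
            if ident not in self.tombstones:
                self.elements[ident] = char
        elif kind == "delete":
            self.tombstones.add(ident)
            self.elements.pop(ident, None)
        else:
            raise ValueError(f"unknown operation: {kind}")

    def text(self):
        return "".join(self.elements[i] for i in sorted(self.elements))


# Identifiers are (position, site) pairs; site "B" inserts between two
# characters written by site "A", and "A" concurrently deletes one of them.
ops = [
    ("insert", (Fraction(1), "A"), "H"),
    ("insert", (Fraction(2), "A"), "i"),
    ("insert", (Fraction(3, 2), "B"), "!"),
    ("delete", (Fraction(2), "A"), None),
]


def replay(order):
    replica = ToySequence()
    for op in order:
        replica.apply(op)
    return replica.text()


# Every delivery order of the same operations yields the same text.
results = {replay(order) for order in permutations(ops)}
```

   Because applying the operations in any order gives the same state, replicas need no transformation step to converge, at the cost of keeping identifiers and tombstones.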
A generic calculus of XML editing deltas 113-120
  Jean-Yves Vion-Dury
In previous work we outlined a mathematical model of the so-called XML editing deltas and proposed a first study of their formal properties. We expected at least three outputs from this theoretical work: a common basis to compare performances of the various algorithms through a structural normalization of deltas, a universal and flexible patch application model and a clearer separation of patch and merge engine performance from delta generation performance. This paper presents the full calculus and reports significant progress with respect to formalizing a normalization procedure. Such a method is key to defining an equivalence relation between editing scripts and eventually designing optimizers, compiler back-ends, new patch specification languages and execution models.

Visual analysis

An efficient language-independent method to extract content from news webpages 121-128
  Eduardo Cardoso; Iam Jabour; Eduardo Laber; Rogério Rodrigues; Pedro Cardoso
We tackle the task of news webpage segmentation, specifically identifying the news title, publication date and story body. While there are very good results in the literature, most of them rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where performance is a must. The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a quicker method than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to some aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results for unseen domains. Our approach has been shown to be about an order of magnitude faster than an equivalent full rendering alternative while retaining a good quality of extraction.
A versatile model for web page representation, information extraction and content re-packaging 129-138
  Bernhard Krüpl-Sypien; Ruslan R. Fayzrakhmanov; Wolfgang Holzinger; Mathias Panzenböck; Robert Baumgartner
On today's Web, designers make huge efforts to create visually rich websites that boast a multitude of interactive elements. In contrast, most web information extraction (WIE) algorithms are still based on attributed tree methods which struggle to deal with this complexity. In this paper, we introduce a versatile model to represent web documents. The model is based on gestalt theory principles -- trying to capture the most important aspects in a formally exact way. It (i) represents and unifies access to visual layout, content and functional aspects; (ii) is implemented with semantic web techniques that can be leveraged for, e.g., automatic reasoning. Considering the visual appearance of a web page, we view it as a collection of gestalt figures -- based on gestalt primitives -- each representing a specific design pattern, be it navigation menus or news articles. Based on this model, we introduce our WIE methodology, a re-engineering process involving design patterns, statistical distributions and text content properties. The complete framework consists of the UOM model, which formalizes the mentioned components, and the MANM layer that hints at structure and serialization, providing document re-packaging foundations. Finally, we discuss how we have applied and evaluated our model in the area of web accessibility.
Document visual similarity measure for document search 139-142
  Ildus Ahmadullin; Jan Allebach; Niranjan Damera-Venkata; Jian Fan; Seungyon Lee; Qian Lin; Jerry Liu; Eamonn O'Brien-Strain
Managing large document databases has become an important task. Being able to automatically compare document layouts and classify and search documents with respect to their visual appearance proves to be desirable in many applications. We propose a new algorithm that approximates a metric function between documents based on their visual similarity. The comparison is based only on the visual appearance of the document without taking into consideration its text content. We measure the similarity of single page documents with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution; and distances between the components of different documents are calculated as an approximation of the Hellinger distance between corresponding distributions. Since the Hellinger distance obeys the triangle inequality, it proves to be favorable in the task of nearest neighbor search in a document database. Thus, the computation required to find similar documents in a document database can be significantly reduced.
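   The role of the triangle inequality in nearest neighbor search can be sketched with the Hellinger distance between discrete distributions. This is a simplification: the abstract describes Gaussian mixtures and an approximation of the distance, whereas the sketch below uses plain histograms, and the `can_skip` helper with its pivot document is a hypothetical illustration of the pruning idea.

```python
from math import sqrt


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def can_skip(d_query_pivot, d_pivot_doc, best_so_far):
    """Decide whether a candidate document can be skipped without comparison.

    The triangle inequality gives
    d(query, doc) >= |d(query, pivot) - d(pivot, doc)|,
    so if that lower bound is already no better than the best match found,
    the exact distance to doc need not be computed.
    """
    return abs(d_query_pivot - d_pivot_doc) >= best_so_far


# Hypothetical three-bin histograms standing in for document components.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
r = [0.2, 0.3, 0.5]
```

   Distances from each stored document to a pivot can be computed once in advance, so a query only needs one distance to the pivot before many candidates can be ruled out.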

Flowing content into layout

Paginate dynamic and web content 143-152
  Fabio Giannetti
Highly customized and content driven documents present substantial challenges in producing sophisticated layout. In fact, these are apps that usually look like well-designed documents. A concrete example is e-books. E-books have re-flowing requirements to allow the user to read them on a plethora of devices as well as change the font size and font style. While this increases the flexibility of the medium, it loses common features found in books like footnotes, marginalia (a.k.a. side notes), pull-quotes, and floats. This paper introduces an approach to extending the concept of galley to a generalized document design instrument. The proposed solution has the aim of providing an easy and flexible, yet powerful, way to express complex layout for highly dynamic and re-flowing content. To achieve this goal, it is important not only to express all the areas available within the page or page region, but also to identify a means to efficiently map content to them. To serve this purpose, a role based mapper has been introduced linking both flow and out-of-flow content.
A novel physics-based interaction model for free document layout 153-162
  Ricardo Farias Bidart Piccoli; Rodrigo Chamun; Nicole Carrion Cogo; João Batista Souza de Oliveira; Isabel Harb Manssour
Marketing flyers, greeting cards, brochures and similar materials are expensive to produce, since these documents need to be personalized and typically require a graphic design professional to create. Either authoring tools are too complex to use or a predefined set of fixed templates is available, which can be restrictive and make it difficult to produce the desired results. Thus, simpler design tools are a compelling need for small businesses and consumers. This paper describes an interactive authoring method for creating free-form documents based on a force-directed approach, traditionally applied for graph layout problems. This is used for automatically distributing and manipulating images, text and decorative elements on a page, according to forces modeled after physical laws. Such an approach can be used for enabling easy authoring of personalized brochures, photo albums, calendars, greeting cards and other free-form documents. A prototype has been developed for evaluation purposes, and is briefly described in this paper. Evaluation results are presented as well, showing that users enjoy the experience of designing a page by interacting with it, and that end results can be satisfactory.
Reflowable documents composed from pre-rendered atomic components 163-166
  Alexander J. Pinkney; Steven R. Bagley; David F. Brailsford
Mobile eBook readers are now commonplace in today's society, but their document layout algorithms remain basic, largely due to constraints imposed by short battery life. At present, with any eBook file format not based on PDF, the layout of the document, as it appears to the end user, is at the mercy of hidden reformatting and reflow algorithms interacting with the screen parameters of the device on which the document is rendered. Very little control is provided to the publisher or author, beyond some basic formatting options.
   This paper describes a method of producing well-typeset, scalable, document layouts by embedding several pre-rendered versions of a document within one file, thus enabling many computationally expensive steps (e.g. hyphenation and line-breaking) to be carried out at document compilation time, rather than at 'view time'. This system has the advantage that end users are not constrained to a single, arbitrarily chosen view of the document, nor are they subjected to reading a poorly typeset version rendered on the fly. Instead, the device can choose a layout appropriate to its screen size and the end user's choice of zoom level, and the author and publisher can have fine-grained control over all layouts.

Metadata

Introduction of a dynamic assistance to the creative process of adding dimensions to multistructured documents 167-170
  Pierre-Edouard Portier; Sylvie Calabretto
We consider documents as the results of dynamic processes of documentary fragments' associations. We have experienced that once a substantial number of associations exist, users need some synoptic views. One possible way of providing such views lies in the organization of associations into relevant subsets that we call "dimensions". Thus, dimensions offer orders along which a documentary archive can be traversed. Many works have proposed efficient ways of presenting combinations of dimensions through graphical user interfaces. Moreover, there are studies on the structural properties of dimensional hypertexts. However, the problem of the origins and evolution of dimensions has not yet received similar attention. Thus, we propose a mechanism based on a simple structural constraint for helping users in the construction of dimensions: if a cycle appears within a dimension while a user is creating a new dimension by the aggregation of existing ones, he will be encouraged (and assisted in his task) to restructure the dimensions in order to cut the cycle. This is a first step towards a rational control of the emergence and evolution of dimensions.
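   The structural constraint described above can be sketched as a cycle check on the union of existing dimensions. The representation is an assumption made for illustration, not the authors' data model: each dimension is taken to be a set of directed associations between fragment names, and `aggregate` and `has_cycle` are hypothetical helpers.

```python
def has_cycle(edges):
    """Depth-first search for a directed cycle among (source, target) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return True  # back edge: the path returns to an open node
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in graph)


def aggregate(*dimensions):
    """Merge dimensions and report whether the result contains a cycle."""
    merged = set().union(*dimensions)
    return merged, has_cycle(merged)


# Hypothetical fragments: each dimension alone is an order, but together
# they loop back from "conclusion" to "draft".
reading_order = {("draft", "revision"), ("revision", "conclusion")}
citation_order = {("conclusion", "draft")}
merged, needs_restructuring = aggregate(reading_order, citation_order)
```

   When the check reports a cycle, an editor built on this idea would prompt the user to restructure the dimensions before accepting the new one.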
Interoperable metadata semantics with meta-metadata: a use case integrating search engines 171-174
  Yin Qu; Andruid Kerne; Andrew M. Webb; Aaron Herstein
A use case involving integrating results from search engines illustrates how the meta-metadata language facilitates interoperable metadata semantics. Formal semantics can be hard to obtain directly. For example, search engines may only present results through web pages; even if they do provide web services, they don't provide them according to a mutually interoperable standard.
   We show how to use the open source meta-metadata language to define a common base class for search results, and how to extend the base class to create polymorphic variants that include engine-specific fields. We develop wrappers to extract data from HTML search results from engines including Google, Bing, Delicious, and Slashdot. We write a short meta-search program for integrating the search results, reranking them, and providing formatted HTML output. This provides an extensible formal and functional semantics for search. Meta-metadata also directly enables representing the same integrated search results as XML or JSON. This research can profoundly transform the derivation and representation of interoperable metadata semantics from a multitude of heterogeneous wild web sources.
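The idea of a common base class with engine-specific variants can be pictured with a small Python sketch. It is an analogy only: the meta-metadata language has its own syntax, which this abstract does not show. Every name here (`SearchResult`, `BookmarkResult`, `save_count`, `merge_results`), the example URLs, and the reciprocal-rank scoring are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SearchResult:
    """Hypothetical common base class shared by all engines."""
    title: str
    url: str
    rank: int  # 1-based position in the engine's own result list


@dataclass
class BookmarkResult(SearchResult):
    """Hypothetical variant with one engine-specific field."""
    save_count: int = 0


def merge_results(result_lists: List[List[SearchResult]]) -> List[str]:
    """Merge by URL and rerank by summed reciprocal rank (an assumed scheme)."""
    scores: Dict[str, float] = {}
    for results in result_lists:
        for result in results:
            scores[result.url] = scores.get(result.url, 0.0) + 1.0 / result.rank
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


engine_a = [
    SearchResult("A", "http://example.org/a", 1),
    SearchResult("B", "http://example.org/b", 2),
]
engine_b = [
    BookmarkResult("B", "http://example.org/b", 1, save_count=10),
    BookmarkResult("C", "http://example.org/c", 2, save_count=3),
]
merged = merge_results([engine_a, engine_b])
```

Because `merge_results` only touches fields of the base class, it works unchanged on any variant, which is the point of sharing a base class across engines.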


Summarization

Automatic text summarization and small-world networks (pp. 175-184)
  Helen Balinsky; Alexander Balinsky; Steven J. Simske
Automatic text summarization is an important and challenging problem. Over the years, the amount of text available electronically has grown exponentially. This growth has created a huge demand for automatic methods and tools for text summarization. We can think of automatic summarization as a type of information compression. To achieve such compression, better modelling and understanding of document structures and internal relations is required. In this article, we develop a novel approach to extractive text summarization by modelling texts and documents as small-world networks. Based on our recent work on the detection of unusual behavior in text, we model a document as a one-parameter family of graphs with its sentences or paragraphs defining the vertex set and with edges defined by Helmholtz's principle. We demonstrate that for some range of the parameters, the resulting graph becomes a small-world network. Such a remarkable structure opens the possibility of applying many measures and tools from social network theory to the problem of extracting the most important sentences and structures from text documents. We hope that documents will also be a new and rich source of examples of complex networks.
Efficient keyword extraction for meaningful document perception (pp. 185-194)
  Thomas Bohne; Sebastian Rönnau; Uwe M. Borghoff
Keyword extraction is a common technique in the domain of information retrieval. Keywords serve as a minimalistic summary for single documents or document collections, enabling the reader to quickly perceive the main contents of a text. However, they are often not readily available for the documents of interest.
   Common keyword extraction techniques demand either a large data collection, a learning process, or access to extensive amounts of reference data. By relying on additional linguistic features (e.g. stop word removal), most approaches are language-restricted. Moreover, the extracted keywords usually pertain to the entire document, rather than only to the portion that is of interest to the reader.
   In this paper, we present an efficient and flexible approach to summarize selections of text within a document. Our solution is based on a keyword extraction algorithm that is applicable to a variety of documents, regardless of language or context. This algorithm relies on the Helmholtz principle and extends a recently presented approach. Our extension covers the features of a weighting algorithm while providing a self-regulation capability to allow for more meaningful results. Furthermore, our approach takes into account the document structure in order to enhance purely statistical summarization. We evaluate the efficiency of our approach and present results with meaningful examples. In addition, we outline further applications of our approach that allow for enhanced document perception as well as for meaningful document indexing and retrieval.
Building a topic hierarchy using the bag-of-related-words representation (pp. 195-204)
  Rafael Geraldeli Rossi; Solange Oliveira Rezende
A simple and intuitive way to organize a huge document collection is by a topic hierarchy. Generally two steps are carried out to build a topic hierarchy automatically: 1) hierarchical document clustering and 2) cluster labeling. For both steps, a good textual document representation is essential. The bag-of-words is the common way to represent text collections. In this representation, each document is represented by a vector where each word in the document collection represents a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Besides, most concepts are composed of more than one word, such as "document engineering" or "text mining". In this paper, an approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words. The features are extracted from each textual document of a collection using association rules. We analyze different ways of mapping a document into transactions, so that association rules can be extracted, and different interest measures for pruning the number of features. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with the bag-of-words. The results demonstrated that the proposed representation is better than the bag-of-words for building topic hierarchies.
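The baseline bag-of-words representation mentioned in this abstract can be sketched in a few lines. This is a minimal illustration of the baseline only, not of the proposed bag-of-related-words method. The whitespace tokenisation, the function name `bag_of_words`, and the two sample documents are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def bag_of_words(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Map each document to a count vector over the collection's vocabulary."""
    # Assumed tokenisation: lower-case, split on whitespace.
    tokenised = [doc.lower().split() for doc in docs]
    # One dimension per distinct word in the whole collection.
    vocabulary = sorted({word for words in tokenised for word in words})
    vectors = []
    for words in tokenised:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = [
    "text mining of documents",
    "document engineering and text mining",
]
vocabulary, vectors = bag_of_words(docs)
```

Even in this tiny example the vectors contain zeros, and the concept "text mining" is spread over two unrelated dimensions, which are the two weaknesses the abstract points out.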
Local metric learning for tag recommendation in social networks (pp. 205-208)
  Boris Chidlovskii; Aymen Benzarti
We address the problem of tag recommendation for media objects, such as images and videos, in social media sharing systems. We propose a framework that 1) extracts both object features and the social context and 2) uses them to learn recommendation rules. The social context is described by different types of information, such as a user's personal objects, the objects of a user's social contacts, the importance of the user in the social network, etc. Both object features and the social context are first used to guide the k-nearest neighbour method for tag recommendation. We then enhance the method by a local topology adjustment of how the nearest neighbours are selected. We learn a local transformation of the feature space surrounding a given object which pulls together objects with the same tags and pushes apart objects with different tags. We show how to learn the Mahalanobis distance metric on multi-tag objects and adapt it to the tag recommendation problem.
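A minimal sketch of k-nearest-neighbour tag recommendation under a Mahalanobis distance follows. It does not learn the metric: the matrix `m` is supplied by hand and is assumed to be symmetric positive semi-definite. The names `mahalanobis` and `recommend_tags`, the voting scheme, and the sample data are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def mahalanobis(x: Vector, y: Vector, m: Matrix) -> float:
    """Compute sqrt((x - y)^T M (x - y)) for a positive semi-definite M."""
    diff = [a - b for a, b in zip(x, y)]
    m_diff = [sum(m[i][j] * diff[j] for j in range(len(diff)))
              for i in range(len(diff))]
    return math.sqrt(sum(d * v for d, v in zip(diff, m_diff)))


def recommend_tags(query: Vector,
                   labelled: List[Tuple[Vector, Set[str]]],
                   m: Matrix, k: int = 2, top: int = 2) -> List[str]:
    """Let the k nearest labelled objects vote for their tags."""
    nearest = sorted(labelled,
                     key=lambda item: mahalanobis(query, item[0], m))[:k]
    votes = Counter(tag for _, tags in nearest for tag in tags)
    # Most votes first; ties are broken alphabetically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top]]


identity = [[1.0, 0.0], [0.0, 1.0]]  # reduces to Euclidean distance
labelled = [
    ((0.0, 0.0), {"beach", "sea"}),
    ((0.1, 0.2), {"beach"}),
    ((5.0, 5.0), {"city"}),
]
suggested = recommend_tags((0.0, 0.1), labelled, identity)
```

With the identity matrix the distance is plain Euclidean. A learned matrix would stretch or shrink directions of the feature space so that objects sharing tags end up closer together.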

Tailored and adaptive layout

Expressing conditions in tailored brochures for public administration (pp. 209-218)
  Nathalie Colineau; Cécile Paris; Keith Vander Linden
Citizen-focused documents in Public Administration devote considerable effort to the expression of conditions. These conditions are commonly expressed as statements of eligibility requirements for the programs being described, but they manifest themselves in other places as well, such as in feedback to readers in tailored informational brochures and as input fields on program application forms. This paper discusses how administrative conditions can be represented in a manner that supports both the eligibility reasoning required for the generation of citizen-tailored documents and also the automated generation of condition expressions in a variety of forms. The paper pays particular attention to the question of how a generation mechanism can allow authors to override the default forms of automated expression when necessary. The discussion is based on a prototype tailored delivery application whose knowledge base is implemented in OWL DL and whose output is constructed using Myriad, a platform for tailored document planning and formatting.
Adaptive layout template for effective web content presentation in large-screen contexts (pp. 219-228)
  Michael Nebeling; Fabrice Matulic; Lucas Streit; Moira C. Norrie
Despite the fact that average screen size and resolution have dramatically increased, many of today's web sites still do not scale well in larger viewing contexts. The upcoming HTML5 and CSS3 standards propose features that can be used to build more flexible web page layouts, but their potential to accommodate a wider range of display environments is currently relatively unexplored. We examine the proposed standards to identify the most promising features and report on experiments with a number of adaptive layout mechanisms that support the required forms of adaptation to take advantage of greater screen real estate, such as automated scaling of text and media. Special attention is given to the effective use of multi-column layout, a brand new feature for web design that contributes to optimising the space occupied by text, but at the same time still poses problems in predominantly continuous vertical-scrolling browsing behaviours. The proposed solutions were integrated in a flexible layout template that was then applied to an existing news web site and tested on users to identify the adaptive features that best support reading comfort and efficiency.
Detecting and resolving conflicts between adaptation aspects in multi-staged XML transformations (pp. 229-238)
  Sven Karol; Matthias Niederhausen; Daniel Kadner; Uwe Aßmann; Klaus Meißner
Separation of Concerns (SoC) is a common principle to reduce the complexity of large software and hypermedia systems. Amongst a variety of approaches, adaptation aspects are a well-known solution to significantly improve SoC in adaptive hypermedia applications. To model adaptation aspects in XML-based hypermedia applications, we developed PX-Weave, a tool for specifying and weaving such aspects in multi-staged XML transformation environments. However, while aspects increase modularity and thus decrease the complexity of software, they also introduce some complex problems. The most prominent one, aspect interaction, has received a lot of attention from researchers during the last decade. In this paper we investigate the problem of aspect interaction for adaptation aspects. We present a combined approach for static and dynamic detection of aspect interactions in multi-staged XML-based hypermedia applications, which we implemented as an add-on to PX-Weave.

Deviance control

Publicly posted composite documents with identity based encryption (pp. 239-248)
  Helen Balinsky; Liqun Chen; Steven J. Simske
Recently introduced Publicly Posted Composite Documents (PPCDs) enable composite documents with different formats and differential access control to participate in cross-organizational workflows distributed over potentially non-secure channels. The original PPCD design was based on a Public Key Infrastructure, requiring each workflow participant to own a pair of public and private keys. This solution also required the document master to know the corresponding valid public keys (certificates) of all participants prior to commencement of the workflow. Using Identity Based Encryption (IBE), a recently described cryptographic technique, we eliminate the requirement for the prior knowledge and distribution of the workflow participants' keys. The required public keys for each workflow participant are calculated based on user identities and other relevant factors at workflow onset. The generation of the corresponding private keys can be delayed until the workflow step at which the corresponding workflow participants require access to the document. The solution presented provides automatic workflow order enforcement and the ability to impose multiple document release dates in real time.
Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence (pp. 249-258)
  Bela Gipp; Norman Meuschke
Plagiarism Detection Systems have been developed to locate instances of plagiarism, e.g. within scientific papers. Studies have shown that the existing approaches deliver reasonable results in identifying copy&paste plagiarism, but fail to detect more sophisticated forms such as paraphrased plagiarism, translation plagiarism or idea plagiarism. The authors of this paper demonstrated in recent studies that the detection rate can be significantly improved by not only relying on text analysis, but by additionally analyzing the citations of a document. Citations are valuable language-independent markers that are similar to a fingerprint. In fact, our examinations of real-world cases have shown that the order of citations in a document often remains similar even if the text has been strongly paraphrased or translated in order to disguise plagiarism.
   This paper introduces three algorithms and discusses their suitability for the purpose of citation-based plagiarism detection. Due to the numerous ways in which plagiarism can occur, these algorithms need to be versatile. They must be capable of detecting transpositions, scaling and combinations in a local and global form. The algorithms are coined Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. The evaluation showed that if these algorithms are combined, common forms of plagiarism can be detected reliably.
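To make the idea of a longest common citation sequence concrete, here is a minimal sketch using the classic longest-common-subsequence dynamic programme over two lists of citation identifiers. The paper's own algorithm may differ in its details. The function name and the sample citation keys are hypothetical.

```python
from typing import List, Sequence


def longest_common_citation_sequence(a: Sequence[str],
                                     b: Sequence[str]) -> List[str]:
    """Return one longest sequence of citations that appears in both
    documents in the same order, not necessarily contiguously."""
    rows, cols = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover one optimal sequence.
    result: List[str] = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result


# Hypothetical citation orders in a source and in a suspicious document.
source_doc = ["A", "B", "C", "D", "E"]
suspect_doc = ["A", "X", "C", "D", "Y", "E"]
shared = longest_common_citation_sequence(source_doc, suspect_doc)
```

A long shared sequence relative to the number of citations in each document would suggest a similar citation order, even where the surrounding text has been paraphrased or translated.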
Contributions to the study of SMS spam filtering: new collection and results (pp. 259-262)
  Tiago A. Almeida; José María G. Hidalgo; Akebo Yamakami
The growth of mobile phone users has led to a dramatic increase in SMS spam messages. In practice, fighting mobile phone spam is made difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, which are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that the Support Vector Machine outperforms the other evaluated classifiers and, hence, can be used as a good baseline for further comparison.
A study of the interaction of paper substrates on printed forensic imaging (pp. 263-266)
  Guy Adams; Stephen Pollard; Steven Simske
At the microscopic level, printing on a substrate exhibits imperfections that can be used as a unique identifier for labels, documents and other printed items. In previous work, we have demonstrated using these minute imperfections around a simple forensic mark such as a single printed character for robust authentication of the character with a low cost (and mobile) system. This approach allows for product authentication even when there is only minimal printing (e.g. on a small label or medallion), supporting a variety of secure document workflows. In this paper, we present an investigation of the influence that the substrate type has on the imperfections of the printing process that are used to derive the character 'signature'. We also make a comparison between two printing processes, dry electrophotographic (laser) and thermal inkjet. Understanding the sensitivity of our methods to these factors is important so that we know the limitations of the approach for document forensics.


Workshops

Version control workshop (pp. 267-268)
  Neil Fraser
This three-hour workshop takes participants on a tour of popular Version Control systems, particularly Subversion and Git. By the end of the workshop each participant will be proficient in using both of these systems. The focus is on solving real-world problems, such as resolving conflicting changes or rolling back a change. This workshop is not about the theory or academic underpinnings of such systems. Participants are required to bring a Macintosh, Linux or Windows laptop.
Secure document engineering (pp. 269-272)
  Helen Balinsky; Steven J. Simske
With the boom in interactive and composite documents and the increased coupling between the on-line and physical worlds, the need for secure document engineering is greater than ever. Four important factors contribute to the need to re-engineer document lifecycles and the associated workflows. The first is the rapid increase in mobile access to documents. The second is the movement of documents from private directories, shared directories and intranets to the cloud. The third is the increased generation of -- and expectation for -- document content, context and use analytics. Finally, the proliferation of social website applications and services over the past half-decade has created for many a constant state of log-in. Each of these trends creates a significantly increased "attack surface" for individuals, organizations and governments interested in breaching the privacy and security of web users. Combined, these transformations create as big a change to content security as that of the browser in the 1990s. In this workshop, we consider the impact of these ongoing transformations in document creation and interaction, and consider what the best approaches will be to provide privacy and security in light of these transformations.
Multimedia document processing in an HTML5 world (pp. 273-274)
  Dick C. A. Bulterman; Rodrigo Laiola Guimarães; Pablo Cesar; Ethan Munson; Maria da Graça Campos Pimentel
The evolution in media support within W3C standards has led to the development of HTML5. HTML5 provides extensive support for audio/video/timed-text within an interoperable browser context.
   This workshop examines the impact of HTML5 on research and systems support for multimedia documents. We will consider issues such as extensibility, adaptivity and maintenance, and will discuss the future needs for Multimedia in a Web context.
Making accessible PDF documents (pp. 275-276)
  Heather Devine; Andres Gonzalez; Matthew Hardy
Accessibility features in the Adobe Portable Document Format (PDF) help facilitate access to electronic information for people with disabilities. This workshop explores how to create accessible PDF documents, from within Adobe Acrobat and other applications; how to use the Adobe Acrobat PDF accessibility checker and repair workflow; best practices for accessibility; and how accessibility has been built into forthcoming ISO standards (PDF/UA, PDF 32000-2).
Documenting social networks (pp. 277-280)
  Maria da Graça Campos Pimentel
The many social networks available on the Web offer users several facilities involving the sharing of media as a way of allowing communication. This workshop will discuss the role of document engineering in social networks, targeting issues such as: what documents can we create by analyzing the available information; what documents users manipulate/add/refer to in social networks; what are the roles of authors and readers.
Google mystery workshop (pp. 281-282)
  John Day-Richter
This workshop will explore a new Google technology involving documents and programming (not yet announced by the paper submission deadline).