
Proceedings of the 2011 International Conference on Multimodal Interfaces

Fullname:Proceedings of the 13th International Conference on Multimodal Interfaces
Editors:Hervé Bourlard; Thomas S. Huang; Enrique Vidal; Daniel Gatica-Perez; Louis-Philippe Morency; Nicu Sebe
Location:Alicante, Spain
Dates:2011-Nov-14 to 2011-Nov-18
Standard No:ISBN: 1-4503-0641-1, 978-1-4503-0641-6; hcibib: ICMI11
Summary:Welcome to Alicante and to the International Conference on Multimodal Interaction, ICMI 2011. ICMI is the premier international forum for multidisciplinary research on multimodal human-human and human-computer interaction, interfaces, and system development. It is the fusion of the International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction which, for the last two years, held a combined event under the name ICMI-MLMI. Starting with this thirteenth edition, the combined conference uses the new, shorter name.
    This year we had the largest number of submissions ever received by ICMI/MLMI: 127 papers, 4 Special Session proposals, 10 Demonstration papers and 6 Workshop proposals. Of the 4 Special Session proposals, 2 were selected, comprising 7 papers. Out of the 120 regular papers submitted, 47 were accepted for oral or poster presentation, bringing the conference acceptance rate to 39%. The rate was higher for the Demonstration papers, of which 7 were accepted. In addition, the program includes three invited Keynote talks. Finally, of the 6 post-conference workshop proposals, 4 were selected, centered on specific current topics in multimodal interaction.
    The review process was organized using the PCS submission and review system, which ICMI has used in the past. To improve the quality of the accepted papers, this year the review process included a rebuttal step for the first time. The process was assisted by 15 Area Chairs (ACs) who helped the Program Chairs in defining the Program Committee. The papers were allocated to ACs in areas of their expertise according to the indications of the submitters, and then checked for conflicts. The Program Chairs distributed the papers to members of the program committee and volunteer reviewers for comments. Once reviews were submitted, the ACs provided meta-reviews for all papers, which were sent to the authors for rebuttal consideration. After hearing the authors' arguments, the scores of the papers were then collected and tabulated. All reviews and papers were then again checked by the Program Chairs, and papers with highly varying scores received an additional round of reviews. All papers and their reviews were finally discussed by the Program Chairs in a two-day remote meeting in order to decide on the list of accepted submissions.
    The program was formed by grouping papers into main topics of interest for this year's conference. Following the trend in previous ICMI-MLMI events and many other academic meetings, to minimize paper consumption we decided to distribute the conference proceedings on USB Flash Drives. This year we have selected 5 top scoring papers as candidates for two awards: Outstanding Student Paper, and Outstanding Paper. An anonymous committee has been appointed by the Program Chairs to select the two awarded papers. You will find the nominated papers in the conference program marked with a special symbol. The final award decisions will be announced at the conference banquet.
    As in previous events, ICMI-2011 has been organized with the support of ACM and SIGCHI. In addition, despite the financial crisis, many sponsors have given support to the event. A significant amount of funds has been provided by the Spanish "Ministerio de Ciencia e Innovación" (MICINN) and by several academic organizations of the Valencia Community: "Universitat Politécnica de Valéncia" (UPV), "Universidad de Alicante" (UA), the "Departamento de Sistemas Informáticos y Computación" (DSIC-UPV), the "Escola Técnica Superior d'Enginyeria Informática" (ETSINF-UPV), the "Departamento de Lenguajes y Sistemas Informáticos" (LSI-UA) and the "Institut Universitari de Investigació Informatica" (IUII-UA). In addition, the US National Science Foundation (NSF) has generously provided us with travel and housing support for several students to help offset pressure on academic travel budgets. Two academic projects have also contributed to the conference organization: the Spanish "Multimodal Interaction in Pattern Recognition and Computer Vision" (MIPRCV) and the European "Social games for conflIct REsolution based on natural iNteraction" (SIREN). We also thank the European network of excellence on "Pattern Analysis, Statistical Modeling, and Computational Learning" (PASCAL 2) for partially supporting travel expenses of keynote speakers and students and the "Asociación Española de Reconocimiento de Formas y Análisis de Imágenes" (AERFAI) for supporting ICMI-2011 registration expenses for its members. Even in these difficult times, important companies affirmed their support for the multimodal interaction and interface research community by providing ICMI with a reasonable level of financial support. These organizations deserve our warmest gratitude: Telefonica I+D, Microsoft Research and AT&T. Without the generous support of all these sponsors, this meeting would not have been possible.
  1. Keynote address 1
  2. Oral session 1: affect
  3. Special session 1: multimodal interaction: brain-computer interfacing
  4. Poster session
  5. Keynote address 2
  6. Oral session 2: social interaction
  7. Oral session 3: gesture and touch
  8. Demo session and DSS poster session
  9. Special session 2: long-term socially perceptive and interactive robot companions: challenges and future perspectives
  10. Keynote address 3
  11. Oral session 4: ubiquitous interaction
  12. Oral session 5: virtual and real worlds

Keynote address 1

Still looking at people (pp. 1-2)
  David A. Forsyth
There is a great need for programs that can describe what people are doing from video. Among other applications, such programs could be used to search for scenes in consumer video; in surveillance applications; to support the design of buildings and of public places; to screen humans for diseases; and to build enhanced human computer interfaces.
   Building such programs is difficult, because it is hard to identify and track people in video sequences, because we have no canonical vocabulary for describing what people are doing, and because phenomena such as aspect and individual variation greatly affect the appearance of what people are doing. Recent work in kinematic tracking has produced methods that can report the kinematic configuration of the body automatically, and with moderate accuracy. While it is possible to build methods that use kinematic tracks to reason about the 3D configuration of the body, and from this the activities, such methods remain relatively inaccurate. However, they have the attraction that one can build models that are generative, and that allow activities to be assembled from a set of distinct spatial and temporal components. The models themselves are learned from labelled motion capture data and are assembled in a way that makes it possible to learn very complex finite automata without estimating large numbers of parameters. The advantage of such a model is that one can search videos for examples of activities specified with a simple query language, without possessing any example of the activity sought. In this case, aspect is dealt with by explicit 3D reasoning.
   An alternative approach is to model the whole problem as k-way classification into a set of known classes. This approach is much more accurate at present, but has the difficulty that we don't really know what the classes should be in general. This is because we do not know how to describe activities. Recent work in object recognition on describing unfamiliar objects suggests that activities might be described in terms of attributes -- properties that many activities share, that are easy to spot, and that are individually somewhat discriminative. Such a description would allow a useful response to an unfamiliar activity. I will sketch current progress on this agenda.

Oral session 1: affect

Mining multimodal sequential patterns: a case study on affect detection (pp. 3-10)
  Héctor P. Martínez; Georgios N. Yannakakis
Temporal data from multimodal interaction such as speech and bio-signals cannot be easily analysed without a preprocessing phase through which some key characteristics of the signals are extracted. Typically, standard statistical signal features such as average values are calculated prior to the analysis and, subsequently, are presented either to a multimodal fusion mechanism or a computational model of the interaction. This paper proposes a feature extraction methodology which is based on frequent sequence mining within and across multiple modalities of user input. The proposed method is applied for the fusion of physiological signals and gameplay information in a game survey dataset. The obtained sequences are analysed and used as predictors of user affect resulting in computational models of equal or higher accuracy compared to the models built on standard statistical features.
Crowdsourced data collection of facial responses (pp. 11-18)
  Daniel McDuff; Rana el Kaliouby; Rosalind Picard
In the past, collecting data to train facial expression and affect recognition systems has been time consuming and often led to data that do not include spontaneous expressions. We present the first crowdsourced data collection of dynamic, natural and spontaneous facial responses as viewers watch media online. This system allowed a corpus of 3,268 videos to be collected in under two months.
   We characterize the data in terms of viewer demographics, position, scale, pose and movement of the viewer within the frame, and illumination of the facial region. We compare statistics from this corpus to those from the CK+ and MMI databases and show that distributions of position, scale, pose, movement and luminance of the facial region are significantly different from those represented in these datasets.
   We demonstrate that it is possible to efficiently collect massive amounts of ecologically valid responses, to known stimuli, from a diverse population using such a system. In addition facial feature points within the videos can be tracked for over 90% of the frames. These responses were collected without need for scheduling, payment or recruitment. Finally, we describe a subset of data (over 290 videos) that will be available for the research community.
A systematic discussion of fusion techniques for multi-modal affect recognition tasks (pp. 19-26)
  Florian Lingenfelser; Johannes Wagner; Elisabeth André
Recently, automatic emotion recognition has been established as a major research topic in the area of human computer interaction (HCI). Since humans express emotions through various channels, a user's emotional state can naturally be perceived by combining emotional cues derived from all available modalities. Yet most effort has been put into single-channel emotion recognition, while only a few studies with focus on the fusion of multiple channels have been published. Even though most of these studies apply rather simple fusion strategies -- such as the sum or product rule -- some of the reported results show promising improvements compared to the single channels. Such results encourage investigating whether more sophisticated methods offer further potential for enhancement. Therefore we apply a wide variety of possible fusion techniques such as feature fusion, decision level combination rules, meta-classification or hybrid-fusion. We carry out a systematic comparison of a total of 16 fusion methods on different corpora and compare results using a novel visualization technique. We find that multi-modal fusion is in almost every case at least on par with single channel classification, though homogeneous results within corpora point to interchangeability between concrete fusion schemes.
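As a minimal sketch of the "sum or product rule" mentioned in the abstract above, the following shows decision-level fusion of per-channel class posteriors. It is an illustration under stated assumptions, not the authors' implementation: the function name `fuse_decisions`, the two channels (`face`, `voice`) and their probability values are all hypothetical.

```python
from math import prod


def fuse_decisions(channel_posteriors, rule="sum"):
    """Combine per-channel class posteriors with the sum or product rule.

    channel_posteriors: list of lists, one probability vector per channel,
    all over the same ordered set of classes (hypothetical input format).
    Returns a normalised list of fused class scores.
    """
    n_classes = len(channel_posteriors[0])
    if any(len(p) != n_classes for p in channel_posteriors):
        raise ValueError("all channels must score the same classes")

    if rule == "sum":
        combine = sum      # sum rule: add the channel posteriors per class
    elif rule == "product":
        combine = prod     # product rule: multiply them per class
    else:
        raise ValueError("rule must be 'sum' or 'product'")

    scores = [combine(p[c] for p in channel_posteriors) for c in range(n_classes)]
    total = sum(scores)
    if total == 0:
        raise ValueError("fused scores are all zero")
    return [s / total for s in scores]


# Hypothetical posteriors over three emotion classes from two channels.
face = [0.6, 0.3, 0.1]
voice = [0.2, 0.5, 0.3]

fused_sum = fuse_decisions([face, voice], rule="sum")
fused_product = fuse_decisions([face, voice], rule="product")
```

With these made-up inputs the sum rule leaves the first two classes tied, while the product rule favours the second class, which illustrates how even simple combination rules can disagree.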
Adaptive facial expression recognition using inter-modal top-down context (pp. 27-34)
  Ravi Kiran Sarvadevabhatla; Mitchel Benovoy; Sam Musallam; Victor Ng-Thow-Hing
The role of context in recognizing a person's affect is being increasingly studied. In particular, context arising from the presence of multi-modal information such as faces, speech and head pose has been used in recent studies to recognize facial expressions. In most approaches, the modalities are independently considered and the effect of one modality on the other, which we call inter-modal influence (e.g. speech or head pose modifying the facial appearance) is not modeled. In this paper, we describe a system that utilizes context from the presence of such inter-modal influences to recognize facial expressions. To do so, we use 2-D contextual masks which are activated within the facial expression recognition pipeline depending on the prevailing context. We also describe a framework called the Context Engine. The Context Engine offers a scalable mechanism for extending the current system to address additional modes of context that may arise during human-machine interactions. Results on standard data sets demonstrate the utility of modeling inter-modal contextual effects in recognizing facial expressions.

Special session 1: multimodal interaction: brain-computer interfacing

Brain-computer interaction: can multimodality help? (pp. 35-40)
  Anton Nijholt; Brendan Z. Allison; Rob J. K. Jacob
This paper is a short introduction to a special ICMI session on brain-computer interaction. We first discuss problems, solutions, and a five-year view for brain-computer interaction. We then talk further about unique issues with multimodal and hybrid brain-computer interfaces, which could help address many current challenges. This paper presents some potentially controversial views, which will hopefully inspire discussion about the different views on brain-computer interfacing, how to embed brain-computer interfacing in a multimodal and multi-party context, and, more generally, how to look at brain-computer interfacing from an ambient intelligence point of view.
Modality switching and performance in a thought and speech controlled computer game (pp. 41-48)
  Hayrettin Gürkök; Gido Hakvoort; Mannes Poel
Providing multiple modalities to users is known to improve the overall performance of an interface. Weakness of one modality can be overcome by the strength of another one. Moreover, with respect to their abilities, users can choose between the modalities to use the one that is the best for them. In this paper we explored whether this holds for direct control of a computer game which can be played using a brain-computer interface (BCI) and an automatic speech recogniser (ASR). Participants played the game in unimodal mode (i.e. ASR-only and BCI-only) and multimodal mode where they could switch between the two modalities. The majority of the participants switched modality during the multimodal game but for most of the time they stayed in ASR control. Therefore multimodality did not provide a significant performance improvement over unimodal control in our particular setup. We also investigated the factors which influence modality switching. We found that performance and performance-related factors were the most prominent influences on modality switching.
An approach towards human-robot-human interaction using a hybrid brain-computer interface (pp. 49-52)
  Nils Hachmeister; Hannes Riechmann; Helge Ritter; Andrea Finke
We propose the concept of a brain-computer interface interaction system that allows patients to virtually use non-verbal interaction affordances, in particular gestures and facial expressions, by means of a humanoid robot. Here, we present a pilot study on controlling such a robot via a hybrid BCI. The results indicate that users can intuitively address interaction partners by looking in their direction and employ gestures and facial expressions in every-day interaction situations.
Towards multimodal error responses: a passive BCI for the detection of auditory errors (pp. 53-56)
  Thorsten O. Zander; Marius David Klippel; Reinhold Scherer
The study presented here introduces a passive BCI that detects responses of the subject's brain to the perception of correct and erroneous auditory signals. Ten experts in music theory who actively play an instrument listened to cadences (sequences of chords) that could have an unexpected, erroneous ending. Consistent with previous studies from the neurosciences, we evoked an event-related potential, mainly consisting of an early right anterior negativity reflecting syntactic error processing followed by a stronger negativity in erroneous trials at 500 ms, induced by semantic processing. We could identify single trials of these processes with a standardized, cross-validated offline classification scheme, resulting in an accuracy of 75.7%. The system presented here is a further step towards multimodal, BCI-based human-computer interaction that also includes auditory feedback channels.
Pseudo-haptics: from the theoretical foundations to practical system design guidelines (pp. 57-64)
  Andreas Pusch; Anatole Lécuyer
Pseudo-haptics, a form of haptic illusion exploiting the brain's capabilities and limitations, has been studied for about a decade. Various interaction techniques making use of it emerged in different fields. However, important questions remain unanswered concerning the nature and the fundamentals of pseudo-haptics, the problems frequently encountered, and sophisticated means supporting the development of new systems and applications. We provide the theoretical background needed to understand the key mechanisms involved in the perception of / interaction with pseudo-haptic phenomena. We synthesise a framework resting on two theories of human perception, cognition and action: The Interacting Cognitive Subsystems model by Barnard et al. and the Bayesian multimodal cue integration framework by Ernst et al. Based on this synthesis and in order to test its utility, we discuss a recent pseudo-haptics example. Finally, we derive system design recommendations meant to facilitate the advancement in the field of pseudo-haptics for user interface researchers and practitioners.
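As a hedged illustration of the Bayesian multimodal cue integration idea named in the abstract above: in its commonly cited textbook form, which assumes independent cues corrupted by Gaussian noise (an assumption not stated in the abstract), each single-cue estimate is weighted by its reliability. The symbols $\hat{s}_i$, $\sigma_i^2$ and $w_i$ are notation introduced here for illustration and are not taken from the paper.

```latex
% Reliability-weighted combination of N single-cue estimates \hat{s}_i
% (e.g. a visual and a haptic estimate of the same property),
% each with noise variance \sigma_i^2.
\hat{s} = \sum_{i=1}^{N} w_i \, \hat{s}_i ,
\qquad
w_i = \frac{1/\sigma_i^{2}}{\sum_{j=1}^{N} 1/\sigma_j^{2}} ,
\qquad
\sigma_{\hat{s}}^{2} = \frac{1}{\sum_{i=1}^{N} 1/\sigma_i^{2}} .
```

Under these assumptions the more reliable cue (smaller variance) dominates the combined percept, which is one way to reason about why a visual manipulation can bias a haptic impression.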

Poster session

6th senses for everyone!: the value of multimodal feedback in handheld navigation aids (pp. 65-72)
  Martin Pielot; Benjamin Poppinga; Wilko Heuten; Susanne Boll
One of the bottlenecks in today's pedestrian navigation systems is communicating the navigation instructions in an efficient but non-distracting way. Previous work has suggested tactile feedback as a solution, but it is not yet clear how it should be integrated into handheld navigation systems to improve efficiency and reduce distraction. In this paper we investigate augmenting and replacing a state of the art pedestrian navigation system with tactile navigation instructions. In a field study in a lively city centre 21 participants had to reach given destinations by means of tactile, visual or multimodal navigation instructions. In the tactile and multimodal conditions, the handheld device created vibration patterns indicating the direction of the next waypoint. Like a sixth sense it constantly gave the user an idea of how the route continues. The results provide evidence that combining both modalities leads to more efficient navigation performance, while using only tactile feedback reduces the user's distraction.
Adding haptic feedback to touch screens at the right time (pp. 73-80)
  Yi Yang; Yuru Zhang; Zhu Hou; Betty Lemaire-Semail
The lack of haptics on touch screens often causes errors and user frustration. However, adding haptic feedback to touch screens in order to address this problem needs to be effected at an appropriate stage. In this paper we present two experiments to explore when best to add haptic feedback during the user's interaction. We separate the interaction process into three stages: Locating, Navigation and Interaction. We compare two points in the Navigation stage in order to establish the optimal time for adding haptic feedback at that stage. We also compare applying haptic feedback at the Navigation stage and the Interaction stage to establish the latest point at which haptic feedback can be added. Combining previous research with our own, we find that the optimal time for applying haptic feedback to the target GUI at the Navigation stage is when the user reaches the destination and that haptic feedback improves the user's performance only if it is added before the Interaction stage. These results should alert designers to the need to take timing into consideration when adding haptic feedback to touch screens.
Robust user context analysis for multimodal interfaces (pp. 81-88)
  Prasenjit Dey; Muthuselvam Selvaraj; Bowon Lee
Multimodal Interfaces that enable natural means of interaction using multiple modalities such as touch, hand gestures, speech, and facial expressions represent a paradigm shift in human-computer interfaces. Their aim is to allow rich and intuitive multimodal interaction similar to human-to-human communication and interaction. From the multimodal system's perspective, apart from the various input modalities themselves, user context information such as states of attention and activity, and identities of interacting users can help greatly in improving the interaction experience. For example, when sensors such as cameras (webcams, depth sensors etc.) and microphones are always on and continuously capturing signals in their environment, user context information is very useful to distinguish genuine system-directed activity from ambient speech and gesture activity in the surroundings, and distinguish the "active user" from among a set of users. Information about user identity may be used to personalize the system's interface and behavior -- e.g. the look of the GUI, modality recognition profiles, and information layout -- to suit the specific user. In this paper, we present a set of algorithms and an architecture that performs audiovisual analysis of user context using sensors such as cameras and microphone arrays, and integrates components for lip activity and audio direction detection (speech activity), face detection and tracking (attention), and face recognition (identity). The proposed architecture allows the component data flows to be managed and fused with low latency, low memory footprint, and low CPU load, since such a system is typically required to run continuously in the background and report events of attention, activity, and identity, in real-time, to consuming applications.
The picture says it all!: multimodal interactions and interaction metadata (pp. 89-96)
  Ramadevi Vennelakanti; Prasenjit Dey; Ankit Shekhawat; Phanindra Pisupati
People share photographs with family and friends! This inclination to share photographs lends itself to many occasions of co-present sharing resulting in interesting interactions, discussions, and experiences among those present. These interactions are rich in information about the context and the content of the photograph and, if extracted, can be used to associate metadata with the photograph. However, these are rarely captured and so are lost at the end of the co-present photo sharing session.
   Most current work on extracting implicit metadata focuses on Content metadata -- analyzing the content in a photograph and Object metadata that is automatically generated and consists of data like GPS location, date and time etc. We address the capture of another interesting type of implicit metadata, called the "Interaction metadata", from the user's multimodal interactions with the media (here photographs) during co-present sharing.
   These interactions in the context of photographs contain rich information: who saw it, who said what, what was pointed at when they said it, who they saw it with and for how long, how many times and so on; which, if captured and analyzed, can create interesting memories about the photograph. These will, over time, help build stories around photographs, aid storytelling, serendipitous discovery and efficient retrieval among other experiences. Interaction metadata can also help organize photographs better by providing mechanisms for filtering based on who viewed, most viewed, etc. Interaction metadata provides a hitherto underexplored implicit metadata type created from interactions with media.
   We designed and built a system prototype to capture and create interaction metadata. In this paper we describe the prototype and present the findings of a study we carried out to evaluate this prototype. The contributions of our work to the domain of multimodal interaction are: a method of identifying relevant speech portions in a free flowing conversation and the use of natural human interactions in the context of media to create Interaction Metadata, a novel type of implicit metadata.
Mudra: a unified multimodal interaction framework (pp. 97-104)
  Lode Hoste; Bruno Dumas; Beat Signer
In recent years, multimodal interfaces have gained momentum as an alternative to traditional WIMP interaction styles. Existing multimodal fusion engines and frameworks range from low-level data stream-oriented approaches to high-level semantic inference-based solutions. However, there is a lack of multimodal interaction engines offering native fusion support across different levels of abstraction to fully exploit the power of multimodal interactions. We present Mudra, a unified multimodal interaction framework supporting the integrated processing of low-level data streams as well as high-level semantic inferences. Our solution is based on a central fact base in combination with a declarative rule-based language to derive new facts at different abstraction levels. Our innovative architecture for multimodal interaction encourages the use of software engineering principles such as modularisation and composition to support a growing set of input modalities as well as to enable the integration of existing or novel multimodal fusion engines.
Humans and smart environments: a novel multimodal interaction approach (pp. 105-112)
  Stefano Carrino; Alexandre Péclat; Elena Mugellini; Omar Abou Khaled; Rolf Ingold
In this paper, we describe a multimodal approach for human-smart environment interaction. The input interaction is based on three modalities: deictic gestures, symbolic gestures and isolated words. The deictic gesture is interpreted using the PTAMM (Parallel Tracking and Multiple Mapping) method, exploiting a camera that is handheld or worn on the user's arm. The PTAMM algorithm tracks in real-time the position and orientation of the hand in the environment. This information is used to point at real or virtual objects, previously added to the environment, using the optical camera axis. Symbolic hand-gestures and isolated voice commands are recognized and used to interact with the pointed target. Haptic and acoustic feedback is provided to the user in order to improve the quality of the interaction. A complete prototype has been realized, and a first usability evaluation, assessed with the help of 10 users, has shown positive results.
Exploiting petri-net structure for activity classification and user instruction within an industrial setting (pp. 113-120)
  Simon F. Worgan; Ardhendu Behera; Anthony G. Cohn; David C. Hogg
Live workflow monitoring and the resulting user interaction in industrial settings face a number of challenges. A formal workflow may be unknown or implicit, data may be sparse and certain isolated actions may be undetectable given current visual feature extraction technology. This paper attempts to address these problems by inducing a structural workflow model from multiple expert demonstrations. When interacting with a naive user, this workflow is combined with spatial and temporal information, under a Bayesian framework, to give appropriate feedback and instruction. Structural information is captured by translating a Markov chain of actions into a simple place/transition petri-net. This novel petri-net structure maintains a continuous record of the current workbench configuration and allows multiple sub-sequences to be monitored without resorting to second order processes. This allows the user to switch between multiple sub-tasks, while still receiving informative feedback from the system. As this model captures the complete workflow, human inspection of safety critical processes and expert annotation of user instructions can be made. Activity classification and user instruction results show a significant on-line performance improvement when compared to the existing Hidden Markov Model or pLSA based state of the art. Further analysis reveals that the majority of our model's classification errors are caused by small de-synchronisation events rather than significant workflow deviations. We conclude with a discussion of the generalisability of the induced place/transition petri-net to other activity recognition tasks and summarise the developments of this model.
JerkTilts: using accelerometers for eight-choice selection on mobile devices (pp. 121-128)
  Mathias Baglioni; Eric Lecolinet; Yves Guiard
This paper introduces JerkTilts, quick back-and-forth gestures that combine device pitch and roll. JerkTilts may serve as gestural self-delimited shortcuts for activating commands. Because they only depend on device acceleration and rely on a parallel and independent input channel, these gestures do not interfere with finger activity on the touch screen. Our experimental data suggest that recognition rates in an eight-choice selection task are as high with JerkTilts as with thumb slides on the touch screen. We also report data confirming that JerkTilts can be combined successfully with simple touch-screen operation. Data from a field study suggest that inadvertent JerkTilts are unlikely to occur in real-life contexts. We describe three illustrative implementations of JerkTilts, which show how the technique helps to simplify and shorten the sequence of actions to reach frequently used commands.
On multimodal interactive machine translation using speech recognition (pp. 129-136)
  Vicent Alabau; Luis Rodríguez-Ruiz; Alberto Sanchis; Pascual Martínez-Gómez; Francisco Casacuberta
Interactive machine translation (IMT) is an increasingly popular paradigm for semi-automated machine translation, where a human expert is integrated into the core of an automatic machine translation system. The human expert interacts with the IMT system by partially correcting the errors of the system's output. Then, the system proposes a new solution. This process is repeated until the output meets the desired quality. In this scenario, the interaction is typically performed using the keyboard and the mouse. However, speech is also a very interesting input modality since the user does not need to abandon the keyboard to interact with it.
   In this work, we present a new approach to perform speech interaction in a way that translation and speech inputs are tightly fused. This integration is performed early in the speech recognition step. Thus, the information from the translation models allows the speech recognition system to recover from errors that otherwise would be impossible to amend. In addition, this technique allows the use of currently available speech recognition technology. The proposed system achieves an important boost in performance with respect to previous approaches.
Multimodal segmentation of object manipulation sequences with product models BIBAFull-Text 137-144
  Alexandra Barchunova; Robert Haschke; Mathias Franzius; Helge Ritter
In this paper we propose an approach for unsupervised segmentation of continuous object manipulation sequences into semantically differing subsequences. The proposed method estimates segment borders based on an integrated consideration of three modalities (tactile feedback, hand posture, audio) yielding robust and accurate results in a single pass. To this end, a Bayesian approach, originally applied by Fearnhead to segment one-dimensional time series data, is extended to allow an integrated segmentation of multi-modal sequences. We propose a joint product model which combines modality-specific likelihoods to model segments. Weight parameters control the influence of each modality within the joint model. We discuss the relevance of all modalities based on an evaluation of the temporal and structural correctness of segmentation results obtained from various weight combinations.
Could a dialog save your life?: analyzing the effects of speech interaction strategies while driving BIBAFull-Text 145-152
  Akos Vetek; Saija Lemmelä
We describe a controlled Wizard-of-Oz study using a medium-fidelity driving simulator investigating how a guided dialog strategy performs when compared to open dialog while driving, with respect to the cognitive loading these strategies impose on the driver. Through our analysis of driving performance logs, speech data, NASA-TLX questionnaires, and bio-signals (heart rate and EEG) we found the secondary speech task to have a measurable adverse effect on driving performance, and that guided dialog is less cognitively demanding in dual-task (driving plus speech interaction) conditions. The driving performance logs and heart rate variability information proved useful for identifying cognitively challenging situations while driving. These could provide important information to an in-car dialog management system that could take into account the driver's cognitive resources to provide safer speech-based interaction by adapting the dialog.
Decisions about turns in multiparty conversation: from perception to action BIBAFull-Text 153-160
  Dan Bohus; Eric Horvitz
We present a decision-theoretic approach for guiding turn taking in a spoken dialog system operating in multiparty settings. The proposed methodology couples inferences about multiparty conversational dynamics with assessed costs of different outcomes, to guide turn-taking decisions. Beyond considering uncertainties about outcomes arising from evidential reasoning about the state of a conversation, we endow the system with awareness and methods for handling uncertainties stemming from computational delays in its own perception and production. We illustrate via sample cases how the proposed approach makes decisions, and we investigate the behaviors of the proposed methods via a retrospective analysis on logs collected in a multiparty interaction study.
Evaluation of user gestures in multi-touch interaction: a case study in pair-programming BIBAFull-Text 161-168
  Alessandro Soro; Samuel Aldo Iacolina; Riccardo Scateni; Selene Uras
Natural User Interfaces are often described as familiar, evocative and intuitive, predictable, based on common skills. Though unquestionable in principle, such definitions don't provide the designer with effective means to design a natural interface or evaluate a design choice vs another. Two main issues in particular are open: (i) how do we evaluate a natural interface, is there a way to measure 'naturalness'; (ii) do natural user interfaces provide a concrete advantage in terms of efficiency, with respect to more traditional interface paradigms? In this paper we discuss and compare observations of user behavior in the task of pair programming, performed at a traditional desktop versus a multi-touch table. We show how the adoption of a multi-touch user interface fosters a significant, observable and measurable, increase of nonverbal communication in general and of gestures in particular, that in turn appears related to the overall performance of the users in the task of algorithm understanding and debugging.
Towards multimodal sentiment analysis: harvesting opinions from the web BIBAFull-Text 169-176
  Louis-Philippe Morency; Rada Mihalcea; Payal Doshi
With more than 10,000 new videos posted online every day on social websites such as YouTube and Facebook, the internet is becoming an almost infinite source of information. One crucial challenge for the coming decade is to be able to harvest relevant information from this constant flow of multimodal data. This paper addresses the task of multimodal sentiment analysis, and conducts proof-of-concept experiments that demonstrate that a joint model that integrates visual, audio, and textual features can be effectively used to identify sentiment in Web videos. This paper makes three important contributions. First, it addresses for the first time the task of tri-modal sentiment analysis, and shows that it is a feasible task that can benefit from the joint exploitation of visual, audio and textual modalities. Second, it identifies a subset of audio-visual features relevant to sentiment analysis and presents guidelines on how to integrate these features. Finally, it introduces a new dataset consisting of real online data, which will be useful for future research in this area.
The impact of unwanted multimodal notifications BIBAFull-Text 177-184
  David Warnock; Marilyn R. McGee-Lennon; Stephen Brewster
Multimodal interaction can be used to make home care technology more effective and appropriate, particularly for people with sensory impairments. Previous work has revealed how disruptive notifications in different modalities are to a home-based task, but has not investigated how disruptive unwanted notifications might be. An experiment was conducted which evaluated the disruptive effects of unwanted notifications when delivered in textual, pictographic, abstract visual, speech, earcon, auditory icon, tactile and olfactory modalities. It was found that for all the modalities tested, both wanted and unwanted notifications produced similar reductions in error rate and task success, independent of modality. The results demonstrate the need to control and limit the number of unwanted notifications delivered in the home and contribute to a large body of work advocating the inclusion of multiple interaction modalities.
Freeform pen-input as evidence of cognitive load and expertise BIBAFull-Text 185-188
  Natalie Ruiz; Ronnie Taib; Fang Chen
This paper presents a longitudinal study that explores the combined effect of cognitive load and expertise on the use of a scratchpad. Our results confirm that such cognitive support benefits users under high cognitive load through visual aid, perceptual motor use and helps to improve meaningful learning and successful problem solving. Indeed, we found significant changes in stroke frequency affected by cognitive load, which we believe are caused by the scratchpad essentially augmenting or extending working memory capacity. However, the discrepancy between stroke frequencies under low and high load is reduced with expertise. These results indicate that pen stroke frequency, which can be automated with electronic devices, could be used as an indicator of cognitive load, or conversely, of expertise level.
Acquisition of dynamically revealed multimodal targets BIBAFull-Text 189-192
  Teemu Tuomas Ahmaniemi
This study investigates the movement time needed for exploring and selecting a target that is not seen in advance. An experiment where targets were presented with haptic, audio or visual feedback was conducted. The task of the participant was to search for the targets with a hand-held sensor-actuator device by horizontal scanning movements. The feedback appeared when the pointing was within the target boundaries. The range of distances to the target was varied between the experiment blocks. The results show that the modality did not have a significant effect on the total movement time but visual feedback yielded the shortest and haptic feedback the longest dwell time on the target area. This was most probably caused by a visual priming effect and the slow haptic actuator rise time. The wider range of distances yielded longer movement times and within the widest range of distances the closest targets were explored longer than the targets in the middle. This was shown to be caused by the increased number of secondary submovements. The finding suggests that an alternative model to Fitts' law or linear prediction of target acquisition time should be developed taking into account the user's prior knowledge about target location.
Emotional responses to thermal stimuli BIBAFull-Text 193-196
  Katri Salminen; Veikko Surakka; Jukka Raisamo; Jani Lylykangas; Johannes Pystynen; Roope Raisamo; Kalle Mäkelä; Teemu Ahmaniemi
The present aim was to study if thermal stimuli presented to the palm can affect emotional responses when measured with emotion related subjective rating scales and changes in skin conductance response (SCR). Two target temperatures, cold and warm, were created by either decreasing or increasing the temperature of the stimulus 4 °C with respect to the participant's current hand temperature. Both cold and warm stimuli were presented by using two presentation methods, i.e., dynamic and pre-adjusted. The results showed that both the dynamic and pre-adjusted warm stimuli elevated the ratings of arousal and dominance. In addition, the pre-adjusted warm and cold stimuli elevated the SCR. The results suggest that especially pre-adjusted warm stimuli can be seen as effective in activating the autonomic nervous system and arousal and dominance dimensions of the affective rating space.
An active learning scenario for interactive machine translation BIBAFull-Text 197-200
  Jesús González-Rubio; Daniel Ortiz-Martínez; Francisco Casacuberta
This paper provides the first experimental study of an active learning (AL) scenario for interactive machine translation (IMT). Unlike other IMT implementations where user feedback is used only to improve the predictions of the system, our IMT implementation takes advantage of user feedback to update the statistical models involved in the translation process. We introduce a sentence sampling strategy to select the sentences that are worth translating interactively, and a retraining method to update the statistical models with the user-validated translations. Both the sampling strategy and the retraining process are designed to work in real-time to meet the severe time constraints inherent to the IMT framework. Experiments in a simulated setting showed that the use of AL dramatically reduces user effort required to obtain translations of a given quality.
Move, and i will tell you who you are: detecting deceptive roles in low-quality data BIBAFull-Text 201-204
  Nimrod Raiman; Hayley Hung; Gwenn Englebienne
Motion, like speech, provides information about one's emotional state. This work introduces an automated non-verbal audio-visual approach for detecting deceptive roles in multi-party conversations using low resolution video. We show how using simple features extracted from motion and speech improves over speech-only for the detection of deceptive roles. Our results show that deceptive players were recognised with significantly higher precision when video features were used. We improve the classification performance by 22.6% compared to our baseline.
Multimodal person independent recognition of workload related biosignal patterns BIBAFull-Text 205-208
  Jan Jarvis; Felix Putze; Dominic Heger; Tanja Schultz
This paper presents an online multimodal person independent workload classification system using blood volume pressure, respiration measures, electrodermal activity and electroencephalography. For each modality a classifier based on linear discriminant analysis is trained. The classification results obtained on short data frames are fused using weighted majority voting. The system was trained and evaluated on a large training corpus of 152 participants, exposed to controlled and uncontrolled scenarios for inducing workload, including a driving task conducted in a realistic driving simulator. Using person dependent feature space normalization, we achieve a classification accuracy of up to 94% for discrimination of relaxed state vs. high workload.
Study of different interactive editing operations in an assisted transcription system BIBAFull-Text 209-212
  Verónica Romero; Alejandro Hector Toselli; Enrique Vidal
To date, automatic handwriting recognition systems are far from being perfect. Therefore, once the full recognition process of a handwritten text image has finished, heavy human intervention is required in order to correct the results of such systems. As an alternative, an interactive system has been proposed in previous works. This alternative follows an Interactive Predictive paradigm and the results show that significant amounts of human effort can be saved. So far only word substitutions and pointer actions have been considered in this interactive system. In this work, we study different interactive editing operations that can allow for more effective, ergonomic and friendly interfaces.
Dynamic perception-production oscillation model in human-machine communication BIBAFull-Text 213-216
  Igor Jauk; Ipke Wachsmuth; Petra Wagner
The goal of the present article is to introduce a new concept of a perception-production timing model in human-machine communication. The model implements a low-level cognitive timing and coordination mechanism. The basic element of the model is a dynamic oscillator capable of tracking reoccurring events in time. The organization of the oscillators in a network is being referred to as the Dynamic Perception-Production Oscillation Model (DPPOM). The DPPOM is largely based on findings in psychological and phonetic experiments on timing in speech perception and production. It consists of two sub-systems, a perception sub-system and a production sub-system. The perception sub-system accounts for information clustering in an input sequence of events. The production sub-system accounts for speech production rhythmically entrained to the input sequence. We propose a system architecture integrating both sub-systems, providing a flexible mechanism for perception-production timing in dialogues. The model's functionality was evaluated in two experiments.
The effect of clothing on thermal feedback perception BIBAFull-Text 217-220
  Martin Halvey; Graham Wilson; Yolanda Vazquez-Alvarez; Stephen A. Brewster; Stephen A. Hughes
Thermal feedback is a new area of research in HCI. To date, studies investigating thermal feedback for interaction have focused on virtual reality, abstract uses of thermal output or on use in highly controlled lab settings. This paper is one of the first to look at how environmental factors, in our case clothing, might affect user perception of thermal feedback and therefore usability of thermal feedback. We present a study into how well users perceive hot and cold stimuli on the hand, thigh and waist. Evaluations were carried out with cotton and nylon between the thermal stimulators and the skin. Results showed that the presence of clothing requires higher intensity thermal changes for detection but that these changes are more comfortable than direct stimulation on skin.
Comparing multi-touch interaction techniques for manipulation of an abstract parameter space BIBAFull-Text 221-224
  Sashikanth Damaraju; Andruid Kerne
The adjustment of multidimensional abstract parameter spaces, used in human-in-the-loop systems such as simulations and visualizations, plays an important role for multi-touch interaction. We investigate new natural forms of interaction to manipulate such parameter spaces. We develop separable multi-touch interaction techniques for abstract parameter space manipulation. We investigate using the index and thumb to perform the often-repeated sub-task of switching between parameters. A user study compares these multi-touch techniques with mouse-based interaction, for the task of color selection, measuring performance and efficiency. Our findings indicate that multi-touch interaction techniques are faster than mouse based interaction.
A general framework for incremental processing of multimodal inputs BIBAFull-Text 225-228
  Afshin Ameri Ekhtiarabadi; Batu Akan; Baran Çürüklu; Lars Asplund
Humans employ different information channels (modalities) such as speech, pictures and gestures in their communication. It is believed that some of these modalities are more error-prone to some specific type of data and therefore multimodality can help to reduce ambiguities in the interaction. There have been numerous efforts in implementing multimodal interfaces for computers and robots. Yet, there is no general standard framework for developing them. In this paper we propose a general framework for implementing multimodal interfaces. It is designed to perform natural language understanding, multimodal integration and semantic analysis with an incremental pipeline and includes a multimodal grammar language, which is used for multimodal presentation and semantic meaning generation.

Keynote address 2

Learning in and from humans: recalibration makes (the) perfect sense BIBAFull-Text 229-230
  Marc O. Ernst
The brain receives information about the environment from all the sensory modalities, including vision, touch and audition. To efficiently interact with the environment, this information must eventually converge in the brain in order to form a reliable and accurate multimodal percept. This process is often complicated by the existence of noise at every level of signal processing, which makes the sensory information derived from the world imprecise and potentially inaccurate. There are several ways in which the nervous system may minimize the negative consequences of noise in terms of precision and accuracy. Two key strategies are to combine redundant sensory estimates and to utilize acquired knowledge about the statistical regularities of different sensory signals. In this talk, I elaborate on how these strategies may be used by the nervous system in order to obtain the best possible estimates from noisy sensory signals, such that we are able to interact efficiently with the environment. Particularly, I will focus on the learning aspects and how our perceptions are tuned to the statistical regularities of an ever-changing environment.

Oral session 2: social interaction

Detecting F-formations as dominant sets BIBAFull-Text 231-238
  Hayley Hung; Ben Kröse
The first step towards analysing social interactive behaviour in crowded environments is to identify who is interacting with whom. This paper presents a new method for detecting focused encounters or F-formations in a crowded, real-life social environment. An F-formation is a specific instance of a group of people who are congregated together with the intent of conversing and exchanging information with each other. We propose a new method of estimating F-formations using a graph clustering algorithm by formulating the problem in terms of identifying dominant sets. A dominant set is a form of maximal clique which occurs in edge weighted graphs. As well as using the proximity between people, body orientation information is used; we propose a socially motivated estimate of focus orientation (SMEFO), which is calculated with location information only. Our experiments show significant improvements in performance over the existing modularity cut algorithm and indicate the effectiveness of using a local social context for detecting F-formations.
Toward multimodal situated analysis BIBAFull-Text 239-246
  Chreston Miller; Francis Quek
Multimodal analysis of human behavior is ultimately situated. The situated context of an instance of a behavior phenomenon informs its analysis. Starting with some initial (user-supplied) descriptive model of a phenomenon, accessing and studying instances in the data that are matches or near matches to the model is essential to refine the model to account for variations in the phenomenon. This inquiry requires viewing the instances within-context to judge their relevance. In this paper, we propose an automatic processing approach that supports this need for situated analysis in multimodal data. We process events on a semi-interval level to provide detailed temporal ordering of events with respect to instances of a phenomenon. We demonstrate the results of our approach and how it facilitates and allows for situated multimodal analysis.
Finding audio-visual events in informal social gatherings BIBAFull-Text 247-254
  Xavier Alameda-Pineda; Vasil Khalidov; Radu Horaud; Florence Forbes
In this paper we address the problem of detecting and localizing objects that can be both seen and heard, e.g., people. This may be solved within the framework of data clustering. We propose a new multimodal clustering algorithm based on a Gaussian mixture model, where one of the modalities (visual data) is used to supervise the clustering process. This is made possible by mapping both modalities into the same metric space. To this end, we fully exploit the geometric and physical properties of an audio-visual sensor based on binocular vision and binaural hearing. We propose an EM algorithm that is theoretically well justified, intuitive, and extremely efficient from a computational point of view. This efficiency makes the method implementable on advanced platforms such as humanoid robots. We describe in detail tests and experiments performed with publicly available data sets that yield very interesting results.
Please, tell me about yourself: automatic personality assessment using short self-presentations BIBAFull-Text 255-262
  Ligia Maria Batrinca; Nadia Mana; Bruno Lepri; Fabio Pianesi; Nicu Sebe
Personality plays an important role in the way people manage the images they convey in self-presentations and employment interviews, trying to affect the other's first impressions and increase effectiveness. This paper addresses the automatic detection of the Big Five personality traits from short (30-120 seconds) self-presentations, by investigating the effectiveness of 29 simple acoustic and visual non-verbal features. Our results show that Conscientiousness and Emotional Stability/Neuroticism are the best recognizable traits. The lower accuracy levels for Extraversion and Agreeableness are explained through the interaction between situational characteristics and the differential activation of the behavioral dispositions underlying those traits.

Oral session 3: gesture and touch

Gesture-aware remote controls: guidelines and interaction technique BIBAFull-Text 263-270
  Gilles Bailly; Dong-Bach Vo; Eric Lecolinet; Yves Guiard
Interaction with TV sets, set-top boxes or media centers strongly differs from interaction with personal computers: not only does a typical remote control suffer strong form factor limitations but the user may well be slouching in a sofa. In the face of more and more data, features, and services made available on interactive televisions, we propose to exploit the new capabilities provided by gesture-aware remote controls. We report the data of three user studies that suggest some guidelines for the design of a gestural vocabulary and we propose five novel interaction techniques. Study 1 reports that users spontaneously perform pitch and yaw gestures as the first modality when interacting with a remote control. Study 2 indicates that users can accurately select up to 5 items with eyes-free roll gestures. Capitalizing on our findings, we designed five interaction techniques that use either device motion, or button-based interaction, or both. They all favor the transition from novice to expert usage for selecting favorites. Study 3 experimentally compares these techniques. It reveals that motion of the device in 3D space, associated with finger presses at the surface of the device, is achievable, fast and accurate. Finally, we discuss the integration of these techniques into a coherent multimedia menu system.
The effect of sampling rate on the performance of template-based gesture recognizers BIBAFull-Text 271-278
  Radu-Daniel Vatavu
In this work, we investigate the effect of motion sampling rate on recognition accuracy and execution time for current template-based gesture recognizers in order to provide performance guidelines to practitioners and designers of gesture-based interfaces. We show that as few as 6 sampling points are sufficient for Euclidean and angular recognizers to attain high recognition rates and that a linear relationship exists between sampling rate and number of gestures for the dynamic time warping technique. We report execution times obtained with our controlled downsampling which are 10-20 times faster than shown by existing work at the same high recognition rates. The results of this work will benefit practitioners by providing important performance aspects to consider when using template-based gesture recognizers.
American sign language recognition with the Kinect BIBAFull-Text 279-286
  Zahoor Zafrulla; Helene Brashear; Thad Starner; Harley Hamilton; Peter Presti
We investigate the potential of the Kinect depth-mapping camera for sign language recognition and verification for educational games for deaf children. We compare a prototype Kinect-based system to our current CopyCat system which uses colored gloves and embedded accelerometers to track children's hand movements. If successful, a Kinect-based approach could improve interactivity, user comfort, system robustness, system sustainability, cost, and ease of deployment. We collected a total of 1000 American Sign Language (ASL) phrases across both systems. On adult data, the Kinect system resulted in 51.5% and 76.12% sentence verification rates when the users were seated and standing respectively. These rates are comparable to the 74.82% verification rate when using the current (seated) CopyCat system. While the Kinect computer vision system requires more tuning for seated use, the results suggest that the Kinect may be a viable option for sign verification.
Perceived physicality in audio-enhanced force input BIBAFull-Text 287-294
  Chi-Hsia Lai; Matti Niinimäki; Koray Tahiroglu; Johan Kildal; Teemu Ahmaniemi
This paper investigates how the perceived physicality of the action of applying force with a finger on a rigid surface (such as on a force-sensing touch screen) can be enhanced using real-time synthesized audio feedback. A selection of rich and evocative audio designs was used. Additionally, audio-tactile cross-modal integration was encouraged, by observing that the main rules of multisensory integration were supported. The study conducted showed that richness of perceived physicality increased considerably, mostly in its auditory expression (what pressing sounded like). In addition, in many instances it was observed that the haptic expression of physicality also increased (what pressing felt like), including some perception of compliance. This last result was particularly interesting as it showed that audio-tactile cross-modal integration might be present.

Demo session and DSS poster session

BeeParking: an ambient display to induce cooperative parking behavior BIBAFull-Text 295-298
  Silvia Gabrielli; Rosa Maimone; Michele Marchesoni; Jesús Muñoz
Interactive ambient systems offer a great potential for attracting user attention, raising awareness and supporting the acquisition of more desirable behaviors in the shared use of limited resources, like physical or digital spaces, energy, water and so on. In this paper we describe the iterative design of BeeParking, an ambient display and automatic notification system aimed to induce more cooperative use of a parking facility within a work environment. We also report main findings from a longitudinal in-situ evaluation showing how the system was adopted and how it affected users' parking behavior over time.
Speech interaction in a multimodal tool for handwritten text transcription BIBAFull-Text 299-302
  Maria José Castro-Bleda; Salvador España-Boquera; David Llorens; Andrés Marzal; Federico Prat; Juan Miguel Vilar; Francisco Zamora-Martinez
STATE is a multimodal tool for document processing and text transcription. Its graphical front-end can be easily connected to different text recognition back-ends. New features and improvements are presented in this work. First, the interactive correction of one word in the transcribed line has been improved to reestimate the entire transcription line using the user feedback. Second, speech input has been integrated in the multimodal interface, enabling the user to also utter the word to be corrected and giving the user the possibility to use the interface according to her preferences or the task at hand. Thus, in the current version of STATE, the user can type, write on the screen with a stylus, or utter the incorrectly recognized word; the system then uses the user feedback in any of the proposed modalities to reestimate the transcribed line so as to hopefully correct other errors which could be caused by the mistaken word the user has corrected.
Digital pen in mammography patient forms BIBAFull-Text 303-306
  Daniel Sonntag; Marcus Liwicki; Markus Weber
We present a digital pen based interface for clinical radiology reports in the field of mammography. It is of utmost importance in future radiology practices that the radiology reports be uniform, comprehensive, and easily managed. This means that reports must be "readable" to humans and machines alike. In order to improve reporting practices in mammography, we allow the radiologist to write structured reports with a special pen on paper with an invisible dot pattern. Handwriting software takes care of the interpretation of the written report, which is transferred into an ontological representation. In addition, a gesture recogniser allows radiologists to encircle predefined annotation suggestions, which turns out to be the most beneficial feature. The radiologist can (1) provide the image and image region annotations mapped to an FMA, RadLex, or ICD10 code, (2) provide free text entries, and (3) correct/select annotations while using multiple gestures on the forms and sketch regions. The resulting, automatically generated PDF report is then stored in a semantic backend system for further use and contains all transcribed annotations as well as all free form sketches.
MozArt: a multimodal interface for conceptual 3D modeling BIBAFull-Text 307-310
  Anirudh Sharma; Sriganesh Madhvanath; Ankit Shekhawat; Mark Billinghurst
There is a need for computer aided design tools that support rapid conceptual level design. In this paper we explore and evaluate how intuitive speech and multitouch input can be combined in a multimodal interface for conceptual 3D modeling. Our system, MozArt, is based on a user's innate abilities, speaking and touching, and has a toolbar/button-less interface for creating and interacting with computer graphics models. We briefly cover the hardware and software technology behind MozArt, and present a pilot study comparing our multimodal system with a conventional multitouch modeling interface with first time CAD users. While a larger study is required to obtain a statistically significant comparison regarding efficiency and accuracy of the two interfaces, a majority of the participants preferred the multimodal interface over the multitouch. We summarize lessons learned and discuss directions for future research.
Query refinement suggestion in multimodal image retrieval with relevance feedback BIBAFull-Text 311-314
  Luis A. Leiva; Mauricio Villegas; Roberto Paredes
In the literature, it has been shown that relevance feedback is a good strategy for the system to interact with the user and provide better results in a content-based image retrieval (CBIR) system. On the other hand, there are many retrieval systems which suggest a refinement of the query as the user types, which effectively helps the user to obtain better results with less effort. Based on these observations, in this work we propose to add a suggested query refinement as a complement in an image retrieval system with relevance feedback. Taking advantage of the nature of the relevance feedback, in which the user selects relevant images, the query suggestions are derived using this relevance information. From the results of an evaluation performed, it can be said that this type of query suggestion is a very good enhancement to the relevance feedback scheme, and can potentially lead to better retrieval performance and less effort from the user.
A multimodal music transcription prototype: first steps in an interactive prototype development 315-318
  Tomás Pérez-García; José M. Iñesta; Pedro J. Ponce de León; Antonio Pertusa
Music transcription consists of transforming an audio signal encoding a music performance into a symbolic representation such as a music score. In this paper, a multimodal and interactive prototype to perform music transcription is presented. The system is oriented to monotimbral transcription; that is, its working domain is music played by a single instrument. This prototype uses three different sources of information to detect notes in a musical audio excerpt. It has been developed to allow a human expert to interact with the system to improve its results. In its current implementation, it offers a limited range of interaction and multimodality. Further development aimed at full interactivity and multimodal interactions is discussed.
Socially assisted multi-view video viewer 319-322
  Kenji Mase; Kosuke Niwa; Takafumi Marutani
We have developed a novel viewer for multi-point video with a socially accumulated viewing log for viewing assistance. The viewer uses annotations of objects in the scene and stabilizes the viewpoint to the user-selected object(s) along with the viewing-point selections.
   Starting from a discussion of two viewing interfaces, i.e. camera-centered and target-centered, we propose a novel socially assisted viewing interface as a director-agent assisted target-centered system. A histogram of the viewing log in terms of time, camera and target of many people's viewing experiences, which we call a viewgram, is used as the source of the director agent, which exploits the viewgram as the visualized popular viewing behavior for particular content. For example, the system can compose the most preferred viewing sequence by referring to the viewgram. The viewgram can also be used as a map of preferences useful in choosing the viewing point.

Special session 2: long-term socially perceptive and interactive robot companions: challenges and future perspectives

Long-term socially perceptive and interactive robot companions: challenges and future perspectives 323-326
  Ruth S. Aylett; Ginevra Castellano; Bogdan Raducanu; Ana Paiva; Mark Hanheide
This paper gives a brief overview of the challenges for multimodal perception and generation applied to robot companions located in human social environments. It reviews the current position in both perception and generation and the immediate technical challenges, and goes on to consider the extra issues raised by embodiment and social context. Finally, it briefly discusses the impact of systems that must function continually over months rather than just for a few hours.
Living with a robot companion: empirical study on the interaction with an artificial health advisor 327-334
  Astrid Marieke von der Pütten; Nicole C. Krämer; Sabrina C. Eimler
The EU project SERA (Social Engagement with Robots and Agents) provided the unique opportunity to collect real field data of people interacting with a robot companion in their homes. In the course of three iterations, altogether six elderly participants took part. Following a multi-methodological approach, the continuous quantitative and qualitative description of user behavior on a very fine-grained level gave us insights into when and how people interacted with the robot companion. Post-trial semi-structured interviews explored how the users perceived the companion and revealed their attitudes. Based on this large data set, conclusions can be drawn on whether people show signs of bonding and how their relation to the robot develops over time. Results indicate large inter-individual differences with regard to interaction behavior and attitudes. Implications for research on companions are discussed.
Child-robot interaction in the wild: advice to the aspiring experimenter 335-342
  Raquel Ros; Marco Nalin; Rachel Wood; Paul Baxter; Rosemarijn Looije; Yannis Demiris; Tony Belpaeme; Alessio Giusti; Clara Pozzi
We present insights gleaned from a series of child-robot interaction experiments carried out in a hospital paediatric department. Our aim here is to share good practice in experimental design and lessons learned about the implementation of systems for social HRI with child users towards application in "the wild", rather than in tightly controlled and constrained laboratory environments. The trade-off between the structure imposed by experimental design and the desire to remove constraints that inhibit interaction depth, and hence engagement, requires a careful balance.
Characterization of coordination in an imitation task: human evaluation and automatically computable cues 343-350
  Emilie Delaherche; Mohamed Chetouani
Understanding the ability to coordinate with a partner constitutes a great challenge in social signal processing and social robotics. In this paper, we designed a child-adult imitation task to investigate how automatically computable cues on turn-taking and movements can give insight into high-level perception of coordination. First, we collected questionnaire responses from human judges to evaluate the perceived coordination of the dyads. Then, we extracted automatically computable cues and information on dialog acts from the video clips. The automatic cues characterized speech and gestural turn-taking and coordinated movements of the dyad. Finally, we compared the human scores with the automatic cues to identify which cues could be informative about the perception of coordination during the task. We found that the adult adjusted his behavior according to the child's needs and that a disruption of the gestural turn-taking rhythm was perceived negatively by the judges. We also found that judges gave lower ratings to the dyads that talked more, since speech intervened when the child had difficulty imitating. Finally, coherence measures between the partners' movement features seemed more adequate than correlation to characterize their coordination.

Keynote address 3

The sounds of social life: observing humans in their natural habitat 351-352
  Matthias R. Mehl
This talk presents a novel methodology called the Electronically Activated Recorder or EAR. The EAR is a portable audio recorder that periodically records snippets of ambient sounds from participants' momentary environments. In tracking moment-to-moment ambient sounds, it yields acoustic logs of people's days as they naturally unfold. In sampling only a fraction of the time, it protects participants' privacy. As a naturalistic observation method, it provides an observer's account of daily life and is optimized for the assessment of audible aspects of social environments, behaviors, and interactions. The talk discusses the EAR method conceptually and methodologically and identifies three ways in which it can enrich research in the social and behavioral sciences. Specifically, it can (1) provide ecological, behavioral criteria that are independent of self-report, (2) calibrate psychological effects against frequencies of real-world behavior, and (3) help with the assessment of subtle and habitual behaviors that evade self-report.

Oral session 4: ubiquitous interaction

Smartphone usage in the wild: a large-scale analysis of applications and context 353-360
  Trinh Minh Tri Do; Jan Blom; Daniel Gatica-Perez
This paper presents a large-scale analysis of contextualized smartphone usage in real life. We introduce two contextual variables that condition the use of smartphone applications, namely places and social context. Our study shows strong dependencies between phone usage and the two contextual cues, which are automatically extracted based on multiple built-in sensors available on the phone. By analyzing continuous data collected from a set of 77 participants from a European country over 9 months of actual usage, our framework automatically reveals key patterns of phone application usage that would traditionally be obtained through manual logging or questionnaires. Our findings contribute to the large-scale understanding of applications and context, bringing out design implications for interfaces on smartphones.
Multimodal mobile interactions: usability studies in real world settings 361-368
  Julie R. Williamson; Andrew Crossan; Stephen Brewster
This paper presents a study that explores the issues of mobile multimodal interactions while on the move in the real world. Because multimodal interfaces allow new kinds of eyes-free and hands-free interactions, usability while moving through different public spaces becomes an important factor in user experience and acceptance of multimodal interaction. This study focuses on these issues by deploying an RSS reader that participants used during their daily commute every day for one week. The system allows users on the move to access news feeds eyes-free through headphones playing audio and speech, and hands-free through wearable sensors attached to the wrists. The results showed that participants were able to interact with the system on the move and became more comfortable performing these interactions as the study progressed. Users were also far more comfortable gesturing on the street than on public transport, which was reflected in the number of interactions and the perceived social acceptability of the gestures in different contexts.
Service-oriented autonomic multimodal interaction in a pervasive environment 369-376
  Pierre-Alain Avouac; Philippe Lalanda; Laurence Nigay
Heterogeneity and dynamicity of pervasive environments require the construction of flexible multimodal interfaces at run time. In this paper, we present how we use an autonomic approach to build and maintain adaptable input multimodal interfaces in smart building environments. We have developed an autonomic solution relying on partial interaction models specified by interaction designers and developers. The role of the autonomic manager is to build complete interaction techniques based on runtime conditions and in conformity with the predicted models. The sole purpose here is to combine and complete partial models in order to obtain an appropriate multimodal interface. We illustrate our autonomic solution by considering a running example based on an existing application and several input devices.
Evaluation of graphical user-interfaces for order picking using head-mounted displays 377-384
  Hannes Baumann; Thad Starner; Hendrik Iben; Anna Lewandowski; Patrick Zschaler
Order picking is the process of collecting items from an assortment in inventory. It represents one of the main activities performed in warehouses and accounts for about 60% of the total operational costs of a warehouse. In previous work, we demonstrated the advantages of a head-mounted display (HMD) based picking chart over a traditional text-based pick list, a paper-based graphical pick chart, and a mobile pick-by-voice system. Here we perform two user studies that suggest that adding color cues and context sensing via a laser rangefinder improves picking accuracy with the HMD system. We also examine other variants of the pick chart, such as adding symbols, textual identifiers, images, and descriptions and their effect on accuracy, speed, and subjective usability.

Oral session 5: virtual and real worlds

Modeling parallel state charts for multithreaded multimodal dialogues 385-392
  Gregor Mehlmann; Birgit Endraß; Elisabeth André
In this paper, we present a modeling approach for the management of highly interactive, multithreaded and multimodal dialogues. Our approach enforces the separation of dialogue content and dialogue structure and is based on a statechart language enfolding concepts for hierarchy, concurrency, variable scoping and a detailed runtime history. These concepts facilitate the modeling of interactive dialogues with multiple virtual characters, autonomous and parallel behaviors, flexible interruption policies, context-sensitive interpretation of the user's discourse acts and coherent resumptions of dialogues. An interpreter supports real-time visualization and modification of the model, enabling rapid prototyping and easy debugging. Our approach has successfully been used in applications and research projects as well as evaluated in field tests with non-expert authors. We present a demonstrator illustrating our concepts in a social game scenario.
Virtual worlds and active learning for human detection 393-400
  David Vázquez; Antonio M. López; Daniel Ponsa; Javier Marín
Image based human detection is of paramount interest due to its potential applications in fields such as advanced driving assistance, surveillance and media analysis. However, even detecting non-occluded standing humans remains a challenge of intensive research. The most promising human detectors rely on classifiers developed in the discriminative paradigm, i.e. trained with labelled samples. However, labelling is a labor-intensive manual step, especially in cases like human detection where it is necessary to provide at least bounding boxes framing the humans for training. To overcome this problem, some authors have proposed the use of a virtual world where the labels of the different objects are obtained automatically. This means that the human models (classifiers) are learnt using the appearance of rendered images, i.e. using realistic computer graphics. Later, these models are used for human detection in images of the real world. The results of this technique are surprisingly good. However, they are not always as good as the classical approach of training and testing with data coming from the same camera, or similar ones. Accordingly, in this paper we address the challenge of using a virtual world for gathering (while playing a videogame) a large amount of automatically labelled samples (virtual humans and background) and then training a classifier that performs as well on real-world images as one obtained by training in the same way on manually labelled real-world samples. To do so, we cast the problem as one of domain adaptation, and we assume that a small amount of manually labelled samples from real-world images is required. To collect these labelled samples we propose a non-standard active learning technique. Ultimately, therefore, our human model is learnt by the combination of virtual and real world labelled samples, which has not been done before.
Making virtual conversational agent aware of the addressee of users' utterances in multi-user conversation using nonverbal information 401-408
  Hung-Hsuan Huang; Naoya Baba; Yukiko Nakano
In multi-user human-agent interaction, the agent should respond to the user when an utterance is addressed to it. To do this, the agent needs to be able to judge whether the utterance is addressed to the agent or to another user. This study proposes a method for estimating the addressee based on the prosodic features of the user's speech and head direction (approximate gaze direction). First, a WOZ experiment is conducted to collect a corpus of human-human-agent triadic conversations. Then, analysis is performed to find out whether the prosodic features as well as head direction information are correlated with the addressee-hood. Based on this analysis, an SVM classifier is trained to estimate the addressee by integrating both the prosodic features and head movement information. Finally, a prototype agent equipped with this real-time addressee estimation mechanism is developed and evaluated.
Temporal binding of multimodal controls for dynamic map displays: a systems approach 409-416
  Ellen C. Haas; Krishna S. Pillalamarri; Chris C. Stachowiak; Gardner McCullough
Dynamic map displays are visual interfaces that show the spatial positions of objects of interest (e.g., people, robots, vehicles), and can be updated with user commands as well as world changes, often in real time. Multimodal (speech and touch) controls were designed for a U.S. Army Research Laboratory dynamic map display to allow users to provide supervisory control of a simulated robotic swarm. This study characterized the effects of user performance (input difficulty, modality preference, and response to different levels of workload) on multimodal intercommand time (i.e., temporal binding), and explored how this might relate to the system's ability to bind or fuse user multimodal inputs into a unitary response. User performance was tested in a laboratory study using 6 male and 6 female volunteers with a mean age of 26 years. Results showed that 64% of all participants used speech commands first 100% of the time, while the remaining used touch commands first 100% of the time. Temporal binding between touch and voice commands was significantly shorter for touch-first than for speech-first commands, no matter what the level of workload. For both speech and touch commands, temporal binding was significantly shorter for both roads and swarm edges than for intersections. Results indicated that all of these factors can be significant in relating to a system's ability to bind multimodal inputs into a unitary response. Suggestions for future research are described.