
Proceedings of the 2010 International Conference on Multimodal Interfaces

Fullname: ICMI-MLMI'10 International Conference on Multimodal Interfaces/Workshop on Machine Learning for Multimodal Interfaces
Editors: Wen Gao; Chin-Hui Lee; Jie Yang; Xilin Chen; Maxine Eskenazi; Zhengyou Zhang
Location: Beijing, China
Dates: 2010-Nov-08 to 2010-Nov-12
Publisher: ACM
Standard No: ISBN: 1-4503-0414-1, 978-1-4503-0414-6; ACM DL: Table of Contents; hcibib: ICMI10
Papers: 54
Pages: 311
Links: Conference Home Page
  1. Invited talk
  2. Multimodal systems
  3. Gaze and interaction
  4. Demo session
  5. Invited talk
  6. Gesture and accessibility
  7. Multimodal interfaces
  8. Human-centered HCI
  9. Invited talk
  10. Speech and language
  11. Poster session
  12. Human-human interactions

Invited talk

Language and thought: talking, gesturing (and signing) about space BIBAFull-Text 1
  John Haviland
Recent research has reopened debates about (neo)Whorfian claims that the language one speaks has an impact on how one thinks -- long discounted by mainstream linguistics and anthropology alike. Some of the most striking evidence for such possible impact derives, not surprisingly, from understudied "exotic" languages and, somewhat more surprisingly, from multimodal and notably gestural practices in communities which speak them. In particular, some of my own work on Guugu Yimithirr, a Paman language spoken by Aboriginal people in northeastern Australia, and on Tzotzil, a language spoken by Mayan peasants in southeastern Mexico, suggests strong connections between linguistic expressions of spatial relations, gestural practices in talking about location and motion, and cognitive representations of space -- what have come to be called spatial "Frames of Reference." In this talk, I will present some of the evidence for such connections, and add to the mix evidence from an emerging, first-generation sign language developed spontaneously in a single family by deaf siblings who have had contact with neither other deaf people nor any other sign language.

Multimodal systems

Feedback is... late: measuring multimodal delays in mobile device touchscreen interaction BIBAFull-Text 2
  Topi Kaaresoja; Stephen Brewster
Multimodal interaction is becoming common in many kinds of devices, particularly mobile phones. If care is not taken in design and implementation, latencies in the timing of feedback in the different modalities may have unintended effects on users. This paper introduces an easy-to-implement multimodal latency measurement tool for touchscreen interaction. It uses off-the-shelf components and free software and is capable of accurately measuring latencies between different interaction events in different modalities. The tool uses a high-speed camera, a mirror, a microphone and an accelerometer to measure the touch, visual, audio and tactile feedback events that occur in touchscreen interaction. The microphone and the accelerometer are both interfaced with a standard PC soundcard, which makes the measurement and analysis simple. The latencies are obtained by hand and eye using a slow-motion video player and an audio editor. To validate the tool, we measured four commercial mobile phones. Our results show that there are significant differences in latencies, not only between the devices, but also between different applications and modalities within one device. In this paper the focus is on mobile touchscreen devices, but with minor modifications our tool could also be used in other domains.
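
As an illustration of the kind of measurement the tool supports by hand and eye, the sketch below (not the authors' software; the channel assignment and the crude onset detector are assumptions) estimates an audio-to-tactile latency from a two-channel soundcard capture of the microphone and accelerometer:

    import numpy as np

    def first_onset(signal, sample_rate, threshold_ratio=0.2):
        """Return the time (s) of the first sample whose magnitude exceeds
        threshold_ratio * max(|signal|). A crude onset detector."""
        env = np.abs(signal)
        threshold = threshold_ratio * env.max()
        idx = np.argmax(env >= threshold)   # first index where the condition holds
        return idx / sample_rate

    def modality_latency_ms(mic_channel, accel_channel, sample_rate=44100):
        """Latency between audio feedback (microphone) and tactile feedback
        (accelerometer), both captured on one stereo soundcard input."""
        t_audio = first_onset(mic_channel, sample_rate)
        t_tactile = first_onset(accel_channel, sample_rate)
        return (t_tactile - t_audio) * 1000.0

    if __name__ == "__main__":
        sr = 44100
        t = np.arange(sr) / sr
        # Synthetic example: audio click at 100 ms, tactile pulse at 135 ms.
        mic = np.where((t > 0.100) & (t < 0.105), 1.0, 0.0)
        accel = np.where((t > 0.135) & (t < 0.145), 1.0, 0.0)
        print(f"audio-to-tactile latency: {modality_latency_ms(mic, accel, sr):.1f} ms")
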
Learning and evaluating response prediction models using parallel listener consensus BIBAFull-Text 3
  Iwan de Kok; Derya Ozkan; Dirk Heylen; Louis-Philippe Morency
Traditionally, listener response prediction models are learned from pre-recorded dyadic interactions. Because of individual differences in behavior, these recordings do not capture the complete ground truth. Where the recorded listener did not respond to an opportunity provided by the speaker, another listener would have responded, or vice versa. In this paper, we introduce the concept of parallel listener consensus, where the listener responses from multiple parallel interactions are combined to better capture differences and similarities between individuals. We show how parallel listener consensus can be used for both learning and evaluating probabilistic prediction models of listener responses. To improve the learning performance, the parallel consensus helps identify better negative samples and reduces outliers in the positive samples. We propose a new error measurement called fConsensus which exploits the parallel consensus to better define the concepts of exactness (mislabels) and completeness (missed labels) for prediction models. We present a series of experiments using the MultiLis Corpus where three listeners were tricked into believing that they had a one-on-one conversation with a speaker, while in fact they were recorded in parallel in interaction with the same speaker. In this paper we show that using parallel listener consensus can improve learning performance and provide better evaluation criteria for predictive models.
Real-time adaptive behaviors in multimodal human-avatar interactions BIBAFull-Text 4
  Hui Zhang; Damian Fricker; Thomas G. Smith; Chen Yu
Multimodal interaction in everyday life seems so effortless. However, a closer look reveals that such interaction is indeed complex and comprises multiple levels of coordination, from high-level linguistic exchanges to low-level couplings of momentary bodily movements both within an agent and across multiple interacting agents. A better understanding of how these multimodal behaviors are coordinated can provide insightful principles to guide the development of intelligent multimodal interfaces. In light of this, we propose and implement a research framework in which human participants interact with a virtual agent in a virtual environment. Our platform allows the virtual agent to keep track of the user's gaze and hand movements in real time, and adjust his own behaviors accordingly. An experiment is designed and conducted to investigate adaptive user behaviors in a human-agent joint attention task. Multimodal data streams are collected in the study including speech, eye gaze, hand and head movements from both the human user and the virtual agent, which are then analyzed to discover various behavioral patterns. Those patterns show that human participants are highly sensitive to momentary multimodal behaviors generated by the virtual agent and they rapidly adapt their behaviors accordingly. Our results suggest the importance of studying and understanding real-time adaptive behaviors in human-computer multimodal interactions.
Facilitating multiparty dialog with gaze, gesture, and speech BIBAFull-Text 5
  Dan Bohus; Eric Horvitz
We study how synchronized gaze, gesture and speech rendered by an embodied conversational agent can influence the flow of conversations in multiparty settings. We begin by reviewing a computational framework for turn-taking that provides the foundation for tracking and communicating intentions to hold, release, or take control of the conversational floor. We then present implementation aspects of this model in an embodied conversational agent. Empirical results with this model in a shared task setting indicate that the various verbal and non-verbal cues used by the avatar can effectively shape the multiparty conversational dynamics. In addition, we identify and discuss several context variables which impact the turn allocation process.

Gaze and interaction

Focusing computational visual attention in multi-modal human-robot interaction BIBAFull-Text 6
  Boris Schauerte; Gernot A. Fink
Identifying verbally and non-verbally referred-to objects is an important aspect of human-robot interaction. Most importantly, it is essential to achieve a joint focus of attention and, thus, a natural interaction behavior. In this contribution, we introduce a saliency-based model that reflects how multi-modal referring acts influence the visual search, i.e., the task of finding a specific object in a scene. To this end, we combine positional information obtained from pointing gestures with contextual knowledge about the visual appearance of the referred-to object obtained from language. The available information is then integrated into a biologically-motivated saliency model that forms the basis for visual search. We demonstrate the feasibility of the proposed approach by presenting the results of an experimental evaluation.
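
The abstract does not give the combination formula; the sketch below is one plausible reading (my own construction, not the authors' model) of modulating a bottom-up saliency map with a pointing-derived spatial prior and a colour prior obtained from language:

    import numpy as np

    def gaussian_prior(shape, centre, sigma):
        """Spatial prior around the location indicated by a pointing gesture."""
        h, w = shape
        yy, xx = np.mgrid[0:h, 0:w]
        d2 = (yy - centre[0]) ** 2 + (xx - centre[1]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def colour_prior(image_rgb, target_rgb, sigma=40.0):
        """Top-down appearance prior: similarity of each pixel to the colour
        named in the verbal referring expression (e.g. 'the red cup')."""
        diff = image_rgb.astype(float) - np.asarray(target_rgb, dtype=float)
        d2 = (diff ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def modulated_saliency(bottom_up, pointing_centre, image_rgb, target_rgb,
                           sigma_point=30.0):
        """Multiplicative combination of bottom-up saliency with the two
        top-down priors; the argmax is the candidate referred-to object."""
        s = bottom_up * gaussian_prior(bottom_up.shape, pointing_centre, sigma_point)
        s = s * colour_prior(image_rgb, target_rgb)
        return s / (s.max() + 1e-9)

    if __name__ == "__main__":
        h, w = 120, 160
        rng = np.random.default_rng(0)
        bottom_up = rng.random((h, w))                  # stand-in bottom-up saliency map
        image = np.zeros((h, w, 3))
        image[40:60, 90:110] = (200, 30, 30)            # a red patch in the scene
        sal = modulated_saliency(bottom_up, (50, 100), image, (200, 30, 30))
        print("most salient location:", np.unravel_index(sal.argmax(), sal.shape))
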
Employing social gaze and speaking activity for automatic determination of the Extraversion trait BIBAFull-Text 7
  Bruno Lepri; Ramanathan Subramanian; Kyriaki Kalimeri; Jacopo Staiano; Fabio Pianesi; Nicu Sebe
In order to predict the Extraversion personality trait, we exploit medium-grained behaviors enacted in group meetings, namely, speaking time and social attention (social gaze). The latter is further distinguished into attention given to the group members and attention received from them. The results of our work confirm many of our hypotheses: a) speaking time and (some forms of) social gaze are effective in automatically predicting Extraversion; b) classification accuracy is affected by the size of the time slices used for analysis; and c) to a large extent, the consideration of the social context does not add much to prediction accuracy, with an important exception concerning social gaze.
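
A minimal sketch of how per-slice speaking-time and social-gaze proportions might feed a standard classifier; the slice length, toy labels and SVM choice are illustrative assumptions, not the authors' pipeline:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def slice_features(speaking, gaze_given, gaze_received, slice_len):
        """Aggregate per-frame binary annotations into per-slice proportions:
        fraction of the slice spent speaking, giving gaze, receiving gaze."""
        n = len(speaking) // slice_len
        feats = []
        for i in range(n):
            s = slice(i * slice_len, (i + 1) * slice_len)
            feats.append([np.mean(speaking[s]),
                          np.mean(gaze_given[s]),
                          np.mean(gaze_received[s])])
        return np.array(feats)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        frames = 6000
        # Synthetic stand-in for one participant's frame-level annotations (True = active).
        speaking = rng.random(frames) < 0.3
        gaze_given = rng.random(frames) < 0.5
        gaze_received = rng.random(frames) < 0.4
        X = slice_features(speaking, gaze_given, gaze_received, slice_len=300)
        # Toy high/low Extraversion labels, just to make the example runnable.
        y = (X[:, 0] > np.median(X[:, 0])).astype(int)
        print("CV accuracy:", cross_val_score(SVC(), X, y, cv=5).mean())
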
Gaze quality assisted automatic recognition of social contexts in collaborative Tetris BIBAFull-Text 8
  Weifeng Li; Marc-Antoine Nüssli; Patrick Jermann
The use of dual eye-tracking is investigated in a collaborative game setting. Social context influences individual gaze and action during a collaborative Tetris game: results show that experts as well as novices adapt their playing style when interacting in mixed-ability pairs. The long-term goal of our work is to design adaptive gaze awareness tools that take the pair composition into account. We therefore investigate the automatic detection (or recognition) of pair composition using dual gaze-based as well as action-based multimodal features. We describe several methods for improving this detection (or recognition) and experimentally demonstrate their effectiveness, especially in situations where the collected gaze data are noisy.
Discovering eye gaze behavior during human-agent conversation in an interactive storytelling application BIBAFull-Text 9
  Nikolaus Bee; Johannes Wagner; Elisabeth André; Thurid Vogt; Fred Charles; David Pizzi; Marc Cavazza
In this paper, we investigate the user's eye gaze behavior during conversation with an interactive storytelling application. We present an interactive eye gaze model for embodied conversational agents in order to improve the experience of users participating in Interactive Storytelling. The underlying narrative in which the approach was tested is based on a classic 19th-century psychological novel: Madame Bovary, by Flaubert. At various stages of the narrative, the user can address the main character or respond to her using free-style spoken natural language input, impersonating her lover. An eye tracker was connected to enable the interactive gaze model to respond to the user's current gaze (i.e., looking into the virtual character's eyes or not). We conducted a study with 19 students in which we compared our interactive eye gaze model with a non-interactive eye gaze model that was informed by studies of human gaze behaviors but had no information on where the user was looking. The interactive model achieved a higher score for user ratings than the non-interactive model. In addition, we analyzed the users' gaze behavior during the conversation with the virtual character.

Demo session

Speak4it: multimodal interaction for local search BIBAFull-Text 10
  Patrick Ehlen; Michael Johnston
Speak4itSM is a consumer-oriented mobile search application that leverages multimodal input and output to allow users to search for and act on local business information. It supports true multimodal integration where user inputs can be distributed over multiple input modes. In addition to specifying queries by voice (e.g., "bike repair shops near the golden gate bridge") users can combine speech and gesture. For example, "gas stations" + <route drawn on display> will return the gas stations along the specified route traced on the display. We provide interactive demonstrations of Speak4it on both the iPhone and iPad platforms and explain the underlying multimodal architecture and challenges of supporting multimodal interaction as a deployed mobile service.
A multimodal interactive text generation system BIBAFull-Text 11
  Luis Rodríguez; Ismael García-Varea; A. Revuelta-Martínez; Enrique Vidal
We present an interactive text generation system aimed at providing assistance for text typing in different environments. This system works by predicting what the user is going to type based on the text he or she typed previously. A multimodal interface is included, intended to facilitate the text generation in constrained environments. The prototype is designed following a modular client-server architecture to provide a high flexibility.
The Ambient Spotlight: personal multimodal search without query BIBAFull-Text 12
  Jonathan Kilgour; Jean Carletta; Steve Renals
The Ambient Spotlight is a prototype system based on personal meeting capture using a laptop and a portable microphone array. The system automatically recognises and structures the meeting content using automatic speech recognition, topic segmentation and extractive summarisation. The recognised speech in the meeting is used to construct queries to automatically link meeting segments to other relevant material, both multimodal and textual. The interface to the system is constructed around a standard calendar interface, and it is integrated with the laptop's standard indexing, search and retrieval.
Cloud mouse: a new way to interact with the cloud BIBAFull-Text 13
  Chunhui Zhang; Min Wang; Richard Harper
In this paper we present a novel input device and associated UI metaphors for cloud computing. Cloud computing will give users access to huge amounts of data in new forms, anywhere and anytime, with applications ranging from Web data mining to social networks. The motivation of this work is to provide users access to cloud computing through a new personal device and to turn nearby displays into personal displays. The key points of this device are direct-point operation, a grasping UI and tangible feedback. A UI metaphor for cloud computing is also introduced.

Invited talk

Musical performance as multimodal communication: drummers, musical collaborators, and listeners BIBAFull-Text 14
  Richard Ashley
Musical performance provides an interesting domain for understanding and investigating multimodal communication. Although the primary modality of music is auditory, musicians make considerable use of the visual channel as well. This talk examines musical performance as multimodal, focusing on drumming in one style of popular music (funk or soul music). The way drummers interact with, and communicate with, their musical collaborators and with listeners are examined, in terms of the structure of different musical parts; processes of mutual coordination, entrainment, and turn-taking (complementarity) are highlighted. Both pre-determined (composed) and spontaneous (improvised) behaviors are considered. The way in which digital drumsets function as complexly structured human interfaces to sound synthesis systems is examined as well.

Gesture and accessibility

Toward natural interaction in the real world: real-time gesture recognition BIBAFull-Text 15
  Ying Yin; Randall Davis
Using a new hand tracking technology capable of tracking 3D hand postures in real-time, we developed a recognition system for continuous natural gestures. By natural gestures, we mean those encountered in spontaneous interaction, rather than a set of artificial gestures chosen to simplify recognition. To date we have achieved 95.6% accuracy on isolated gesture recognition, and a 73% recognition rate on continuous gesture recognition, with data from three users and twelve gesture classes. We connected our gesture recognition system to Google Earth, enabling real-time gestural control of a 3D map. We describe the challenges of signal accuracy and signal interpretation presented by working in a real-world environment, and detail how we overcame them.
Gesture and voice prototyping for early evaluations of social acceptability in multimodal interfaces BIBAFull-Text 16
  Julie Rico; Stephen Brewster
Interaction techniques that require users to adopt new behaviors mean that designers must take social acceptability and user experience into account; otherwise the techniques may be rejected by users as too embarrassing to perform in public. This research uses a set of low-cost prototypes to study the social acceptability and user perceptions of multimodal mobile interaction techniques early in the design process. We describe 4 prototypes that were used with 8 focus groups to evaluate user perceptions of novel multimodal interactions using gesture, speech and non-speech sounds, and to gain feedback about the usefulness of the prototypes for studying social acceptability. The results of this research describe user perceptions of social acceptability and the realities of using multimodal interaction techniques in daily life. The results also describe key differences between young users (18-29) and older users (70-95) with respect to evaluation and approach to understanding these interaction techniques.
Automatic recognition of sign language subwords based on portable accelerometer and EMG sensors BIBAFull-Text 17
  Yun Li; Xiang Chen; Jianxun Tian; Xu Zhang; Kongqiao Wang; Jihai Yang
Sign language recognition (SLR) not only facilitates communication between the deaf and the hearing society, but also serves as a good basis for the development of gesture-based human-computer interaction (HCI). In this paper, portable input devices based on accelerometers and surface electromyography (EMG) sensors worn on the forearm are presented, and an effective fusion strategy for the combination of multi-sensor and multi-channel information is proposed to automatically recognize sign language at the subword classification level. Experimental results on the recognition of 121 frequently used Chinese sign language subwords demonstrate the feasibility of developing an SLR system based on the presented portable input devices and show that our proposed information fusion method is effective for automatic SLR. Our study will promote the realization of practical sign language recognizers and multimodal human-computer interfaces.
Enabling multimodal discourse for the blind BIBAFull-Text 18
  Francisco Oliveira; Heidi Cowan; Bing Fang; Francis Quek
This paper presents research showing that a high degree of skilled performance is required for multimodal discourse support. We discuss how students who are blind or visually impaired (SBVI) were able to understand the instructor's pointing gestures during planar geometry and trigonometry classes. For that, the SBVI must attend to the instructor's speech and have simultaneous access to the instructional graphic material and to where the instructor is pointing. We developed the Haptic Deictic System -- HDS, capable of tracking the instructor's pointing and informing the SBVI, through a haptic glove, where she needs to move her hand to understand the instructor's illustration-augmented discourse. Several challenges had to be overcome before the SBVI were able to engage in fluid multimodal discourse with the help of the HDS. We discuss how such challenges were addressed with respect to perception and discourse (especially for mathematics instruction).

Multimodal interfaces

Recommendation from robots in a real-world retail shop BIBAFull-Text 19
  Koji Kamei; Kazuhiko Shinozawa; Tetsushi Ikeda; Akira Utsumi; Takahiro Miyashita; Norihiro Hagita
By applying network robot technologies, recommendation methods from E-Commerce are incorporated in a retail shop in the real world. We constructed an experimental shop environment where communication robots recommend specific items to the customers according to their purchasing behavior as observed by networked sensors. A recommendation scenario is implemented with three robots and investigated through an experiment. The results indicate that the participants stayed longer in front of the shelves when the communication robots tried to interact with them and were influenced to carry out similar purchasing behaviors as those observed earlier. Other results suggest that the probability of customers' zone transition can be used to anticipate their purchasing behavior.
Dynamic user interface distribution for flexible multimodal interaction BIBAFull-Text 20
  Marco Blumendorf; Dirk Roscher; Sahin Albayrak
The availability of numerous networked interaction devices within smart environments makes it possible to exploit these devices for innovative and more natural interaction. In our work we make use of TVs with remote controls, picture frames, mobile phones, touch screens, stereos and PCs to create multimodal user interfaces. Combining the interaction capabilities of the different devices allows a more suitable interaction for a given situation. Changing situations can then require the dynamic redistribution of the created interfaces and the alteration of the modalities and devices used in order to sustain the interaction. In this paper we describe our approach for dynamically (re-)distributing user interfaces at run-time. A distribution component is responsible for determining the devices for the interaction based on the (changing) environment situation and the user interface requirements. The component provides possibilities for the application developer and the user to influence the distribution according to their needs. A user interface model describes the interaction and the modality relations according to the CARE properties (Complementarity, Assignment, Redundancy and Equivalence), and a context model gathers and provides information about the environment.
3D-press: haptic illusion of compliance when pressing on a rigid surface BIBAFull-Text 21
  Johan Kildal
This paper reports a new intramodal haptic illusion. This illusion involves a person pressing on a rigid surface and perceiving that the surface is compliant, i.e. perceiving that the contact point displaces into the surface. The design process, method and conditions used to create this illusion are described in detail. A user study is also reported in which all participants using variants of the basic method experienced the illusion, demonstrating the effectiveness of the method. This study also offers an initial indication of the mechanical dimensions of illusory compliance that could be manipulated by varying the stimuli presented to the users. This method could be used to augment touch interaction with mobile devices, transcending the rigid two-dimensional tangible surface (touch display) currently found on them.

Human-centered HCI

Understanding contextual factors in location-aware multimedia messaging BIBAFull-Text 22
  Abdallah El Ali; Frank Nack; Lynda Hardman
Location-aware messages left by people can make visible some aspects of their everyday experiences at a location. To understand the contextual factors surrounding how users produce and consume location-aware multimedia messaging (LMM), we use an experience-centered framework that makes explicit the different aspects of an experience. Using this framework, we conducted an exploratory, diary study aimed at eliciting implications for the study and design of LMM systems. In an earlier pilot study, we found that subjects did not have enough time to fully capture their everyday experiences using an LMM prototype, which led us to conduct a longer study using a multimodal diary method. The diary study data (verified for reliability using a categorization task) provided a closer look at the different aspects (spatiotemporal, social, affective, and cognitive) of people's experience. From the data, we derive three main findings (predominant LMM domains and tasks, capturing experience vs. experience of capture, context-dependent personalization) to inform the study and design of future LMM systems.
Embedded media barcode links: optimally blended barcode overlay on paper for linking to associated media BIBAFull-Text 23
  Qiong Liu; Chunyuan Liao; Lynn Wilcox; Anthony Dunnigan
Embedded Media Barcode Links, or simply EMBLs, are optimally blended iconic barcode marks, printed on paper documents, that signify the existence of multimedia associated with that part of the document content (Figure 1). EMBLs are used for multimedia retrieval with a camera phone. Users take a picture of an EMBL-signified document patch using a cell phone, and the multimedia associated with the EMBL-signified document location is displayed on the phone. Unlike a traditional barcode, which requires exclusive space, the EMBL construction algorithm acts as an agent to negotiate with a barcode reader for maximum user and document benefits. Because of this negotiation, EMBLs are optimally blended with content and thus interfere less with the original document layout and can be moved closer to a media-associated location. Retrieval of media associated with an EMBL is based on the barcode identification of a captured EMBL. Therefore, EMBL retains nearly all barcode identification advantages, such as accuracy, speed, and scalability. Moreover, EMBL takes advantage of users' knowledge of traditional barcodes. Unlike Embedded Media Markers (EMM), which require underlying document features for marker identification, EMBL has no requirement for the underlying features. This paper discusses the procedures for EMBL construction and optimization. It also gives experimental results that strongly support the EMBL construction and optimization ideas.
Enhancing browsing experience of table and image elements in web pages BIBAFull-Text 24
  Wenchang Xu; Xin Yang; Yuanchun Shi
With the increasing popularity and diversification of both the Internet and its access devices, users' browsing experience of web pages is in great need of improvement. The traditional browsing mode for web elements such as tables and images is passive, which limits users' browsing efficiency. In this paper, we propose to enhance the browsing experience of table and image elements in web pages by enabling real-time interactive access to web tables and images. We design new browsing modes that help users improve their browsing efficiency, including an operation mode and a record mode for web tables, and normal, starred and advanced modes for web images. We design and implement a plug-in for Microsoft Internet Explorer, called iWebWidget, which provides a customized user interface supporting real-time interactive access to web tables and images. In addition, we carry out a user study to verify the usefulness of iWebWidget. Experimental results show that users are satisfied with and really enjoy the new browsing modes for both web tables and images.
PhotoMagnets: supporting flexible browsing and searching in photo collections BIBAFull-Text 25
  Ya-Xi Chen; Michael Reiter; Andreas Butz
People's activities around their photo collections are often highly dynamic and unstructured, such as casual browsing and searching or loosely structured storytelling. Designing user interfaces to support such exploratory behavior is a challenging research question. We explored ways to enhance flexibility in dealing with photo collections and designed a system named PhotoMagnets. It uses a magnet metaphor in addition to more traditional interface elements in order to support a flexible combination of structured and unstructured photo browsing and searching. In an evaluation we received positive feedback, especially on the flexibility provided by this approach.
A language-based approach to indexing heterogeneous multimedia lifelog BIBAFull-Text 26
  Peng-Wen Cheng; Snehal Chennuru; Senaka Buthpitiya; Ying Zhang
Lifelog systems, inspired by Vannevar Bush's concept of "MEMory EXtenders" (MEMEX), are capable of storing a person's lifetime experience as a multimedia database. Despite such systems' huge potential for improving people's everyday life, there are major challenges that need to be addressed to make such systems practical. One of them is how to index the inherently large and heterogeneous lifelog data so that a person can efficiently retrieve the log segments that are of interest. In this paper, we present a novel approach to indexing lifelogs using activity language. By quantizing the heterogeneous high dimensional sensory data into text representation, we are able to apply statistical natural language processing techniques to index, recognize, segment, cluster, retrieve, and infer high-level semantic meanings of the collected lifelogs. Based on this indexing approach, our lifelog system supports easy retrieval of log segments representing past similar activities and generation of salient summaries serving as overviews of segments.
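
A toy sketch of the general idea of an "activity language" (quantize a continuous sensor stream into symbols, then index the symbol strings with ordinary text techniques); the quantization bins and character n-gram indexing are assumptions for illustration, not the paper's implementation:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def to_activity_string(accel_magnitude, bins=(0.5, 1.5, 3.0)):
        """Quantize an accelerometer-magnitude stream into a symbol string
        ('a' = still ... 'd' = vigorous), one symbol per sample window."""
        symbols = np.digitize(accel_magnitude, bins)          # 0..len(bins)
        return "".join(chr(ord("a") + int(s)) for s in symbols)

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        # Three synthetic lifelog segments: sitting, walking, and a mixed segment.
        sitting = rng.normal(0.2, 0.1, 200).clip(min=0)
        walking = rng.normal(2.0, 0.5, 200).clip(min=0)
        mixed = np.concatenate([sitting[:100], walking[:100]])
        docs = [to_activity_string(x) for x in (sitting, walking, mixed)]

        # Index the symbol strings with character n-grams, as if they were text.
        vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
        index = vec.fit_transform(docs)

        # Retrieve the log segment most similar to a new "walking-like" query.
        query = to_activity_string(rng.normal(2.0, 0.5, 200).clip(min=0))
        sims = cosine_similarity(vec.transform([query]), index)[0]
        print("most similar segment:", int(sims.argmax()))    # expected: 1 (walking)
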
Human-centered attention models for video summarization BIBAFull-Text 27
  Kaiming Li; Lei Guo; Carlos Faraco; Dajiang Zhu; Fan Deng; Tuo Zhang; Xi Jiang; Degang Zhang; Hanbo Chen; Xintao Hu; Stephen Miller; Tianming Liu
A variety of user attention models for video/audio streams have been developed for video summarization and abstraction, in order to facilitate efficient video browsing and indexing. Essentially, the human brain is the end user and evaluator of multimedia content and representation, and its responses can provide meaningful guidelines for multimedia stream summarization. For example, video/audio segments that significantly activate the visual, auditory, language and working memory systems of the human brain should be considered more important than others. It should be noted that user experience studies could be useful for such evaluations, but are suboptimal in their capability to accurately capture the full-length dynamics and interactions of the brain's response. This paper presents our preliminary efforts in applying the brain imaging technique of functional magnetic resonance imaging (fMRI) to quantify and model the dynamics and interactions between multimedia streams and brain responses when human subjects are presented with multimedia clips, in order to develop human-centered attention models that can be used to guide and facilitate more effective and efficient multimedia summarization. Our initial results are encouraging.

Invited talk

Activity-based Ubicomp: a new research basis for the future of human-computer interaction BIBAFull-Text 28
  James Landay
Ubiquitous computing (Ubicomp) is bringing computing off the desktop and into our everyday lives. For example, an interactive display can be used by the family of an elder to stay in constant touch with the elder's everyday wellbeing, or by a group to visualize and share information about exercise and fitness. Mobile sensors, networks, and displays are proliferating worldwide in mobile phones, enabling this new wave of applications that are intimate with the user's physical world. In addition to being ubiquitous, these applications share a focus on high-level activities, which are long-term social processes that take place in multiple environments and are supported by complex computation and inference of sensor data. However, the promise of this Activity-based Ubicomp is unfulfilled, primarily due to methodological, design, and tool limitations in how we understand the dynamics of activities. The traditional cognitive psychology basis for human-computer interaction, which focuses on our short term interactions with technological artifacts, is insufficient for achieving the promise of Activity-based Ubicomp. We are developing design methodologies and tools, as well as activity recognition technologies, to both demonstrate the potential of Activity-based Ubicomp as well as to support designers in fruitfully creating these types of applications.

Speech and language

Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model BIBAFull-Text 29
  Salil Deena; Shaobo Hou; Aphrodite Galata
We present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Audio and visual data from a talking head corpus are jointly modelled using the proposed method. The switching states are found using variable length Markov models trained on labelled phonetic data. We also propose a synthesis technique that takes into account both previous and future phonetic context, thus accounting for coarticulatory effects in speech.
Multi-modal computer assisted speech transcription BIBAFull-Text 30
  Luis Rodríguez; Ismael García-Varea; Enrique Vidal
Speech recognition systems are typically unable to produce error-free results in real scenarios. On account of this, human intervention is usually needed. This intervention can be included in the system by following the Computer Assisted Speech Transcription (CAST) approach, where the user constantly interacts with the system during the transcription process. In order to improve this user interaction, a speech multi-modal interface is proposed here. In addition, the use of word graphs within CAST, aimed at facilitating the design of such an interface as well as improving the system response time, is also discussed.
Grounding spatial language for video search BIBAFull-Text 31
  Stefanie Tellex; Thomas Kollar; George Shaw; Nicholas Roy; Deb Roy
The ability to find a video clip that matches a natural language description of an event would enable intuitive search of large databases of surveillance video. We present a mechanism for connecting a spatial language query to a video clip corresponding to the query. The system can retrieve video clips matching millions of potential queries that describe complex events in video such as "people walking from the hallway door, around the island, to the kitchen sink." By breaking down the query into a sequence of independent structured clauses and modeling the meaning of each component of the structure separately, we are able to improve on previous approaches to video retrieval by finding clips that match much longer and more complex queries using a rich set of spatial relations such as "down" and "past." We present a rigorous analysis of the system's performance, based on a large corpus of task-constrained language collected from fourteen subjects. Using this corpus, we show that the system effectively retrieves clips that match natural language descriptions: 58.3% were ranked in the top two of ten in a retrieval task. Furthermore, we show that spatial relations play an important role in the system's performance.
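
The scoring functions below are crude stand-ins (not the paper's learned models), but they illustrate the compositional idea of breaking a query into independent clauses such as "past" and "toward" and combining per-clause scores over a person's trajectory:

    import numpy as np

    def score_toward(track, landmark):
        """Higher when the distance to the landmark decreases along the track."""
        d = np.linalg.norm(track - landmark, axis=1)
        return float(np.clip((d[0] - d[-1]) / (d[0] + 1e-9), 0.0, 1.0))

    def score_past(track, landmark, margin=1.0):
        """Higher when the track approaches and then leaves the landmark."""
        d = np.linalg.norm(track - landmark, axis=1)
        k = int(d.argmin())
        went_by = d[0] > d[k] + margin and d[-1] > d[k] + margin
        return 1.0 if went_by and 0 < k < len(d) - 1 else 0.0

    def score_query(track, clauses):
        """A query is a sequence of (relation, landmark) clauses; the clip score
        is the product of independent per-clause scores."""
        relations = {"toward": score_toward, "past": score_past}
        return float(np.prod([relations[r](track, np.asarray(lm)) for r, lm in clauses]))

    if __name__ == "__main__":
        t = np.linspace(0, 1, 50)
        # Trajectory that walks past an "island" at (5, 0) toward a "sink" at (10, 0).
        track = np.stack([10 * t, 2 * np.sin(np.pi * t)], axis=1)
        query = [("past", (5.0, 0.0)), ("toward", (10.0, 0.0))]
        print("clip score:", score_query(track, query))
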
Location grounding in multimodal local search BIBAFull-Text 32
  Patrick Ehlen; Michael Johnston
Computational models of dialog context have often focused on unimodal spoken dialog or text, using the language itself as the primary locus of contextual information. But as we move from spoken interaction to situated multimodal interaction on mobile platforms supporting a combination of spoken dialog with graphical interaction, touch-screen input, geolocation, and other non-linguistic contextual factors, we will need more sophisticated models of context that capture the influence of these factors on semantic interpretation and dialog flow. Here we focus on how users establish the location they deem salient from the multimodal context by grounding it through interactions with a map-based query system. While many existing systems rely on geolocation to establish the location context of a query, we hypothesize that this approach often ignores the grounding actions users make, and provide an analysis of log data from one such system that reveals errors that arise from that faulty treatment of grounding. We then explore and evaluate, using live field data from a deployed multimodal search system, several different context classification techniques that attempt to learn the location contexts users make salient by grounding them through their multimodal actions.

Poster session

Linearity and synchrony: quantitative metrics for slide-based presentation methodology BIBAFull-Text 33
  Kazutaka Kurihara; Toshio Mochizuki; Hiroki Oura; Mio Tsubakimoto; Toshihisa Nishimori; Jun Nakahara; Yuhei Yamauchi; Katashi Nagao
In this paper we propose new quantitative metrics that express the characteristics of current general practices in slide-based presentation methodology. The proposed metrics are numerical expressions of: 'To what extent are the materials being presented in the prepared order?' and 'What is the degree of separation between the displays of the presenter and the audience?'. Through the use of these metrics, it becomes possible to quantitatively evaluate various extended methods designed to improve presentations. We illustrate examples of calculation and visualization for the proposed metrics.
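
The paper defines its metrics precisely; the formalisations below are assumptions made for illustration only (linearity as the fraction of slide transitions that advance by exactly one, synchrony as the fraction of instants at which the presenter's and audience's displays agree):

    def linearity(slide_sequence):
        """Fraction of transitions that follow the prepared order (advance by +1).
        An assumed formalisation of 'to what extent are the materials being
        presented in the prepared order?'."""
        steps = list(zip(slide_sequence, slide_sequence[1:]))
        if not steps:
            return 1.0
        return sum(1 for a, b in steps if b == a + 1) / len(steps)

    def synchrony(presenter_slides, audience_slides):
        """Fraction of sampled instants at which the presenter's display and the
        audience's display show the same slide (the inverse of 'degree of
        separation between the displays')."""
        pairs = list(zip(presenter_slides, audience_slides))
        return sum(1 for p, a in pairs if p == a) / len(pairs)

    if __name__ == "__main__":
        shown = [1, 2, 3, 5, 4, 5, 6]          # presenter jumped around slides 3-5
        print("linearity:", round(linearity(shown), 2))
        presenter = [1, 1, 2, 2, 3, 3, 4]
        audience  = [1, 1, 1, 2, 2, 3, 4]      # audience display lags slightly
        print("synchrony:", round(synchrony(presenter, audience), 2))
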
Empathetic video experience through timely multimodal interaction BIBAFull-Text 34
  Myunghee Lee; Gerard J. Kim
In this paper, we describe a video playing system, named "Empatheater," that is controlled by multimodal interaction. As the video is played, the user must interact and emulate predefined video "events" through multimodal guidance and whole-body interaction (e.g. following the main character's motion or gestures). Without the timely interaction, the video stops. The system shows guidance information as to how to properly react and continue the video playing. The purpose of such a system is to provide an indirect experience (of the given video content) by eliciting the user to mimic and empathize with the main character. The user is given the illusion (suspended disbelief) of playing an active role in the unraveling video content. We discuss various features of the newly proposed interactive medium. In addition, we report the results of a pilot study that was carried out to evaluate its user experience compared to passive video viewing and keyboard-based video control.
Haptic numbers: three haptic representation models for numbers on a touch screen phone BIBAFull-Text 35
  Toni Pakkanen; Roope Raisamo; Katri Salminen; Veikko Surakka
Systematic research on haptic stimuli is needed to create a viable haptic feel for user interface elements. There has been a lot of research with haptic user interface prototypes, but much less on haptic stimulus design. In this study we compared three haptic representation models with two representation rates for the numbers used in the phone number keypad layout. Haptic representations for the numbers were derived from Arabic and Roman numbers, and from the Location of the number button in the layout grid. Using a Nokia 5800 XpressMusic phone, participants entered phone numbers blindly into the phone. Speed, error rate, and subjective experiences were recorded. The results showed that the model had no effect on the measured performance, but subjective experiences were affected. The Arabic numbers with the slower speed were preferred most. Thus, subjectively the performance was rated as better, even though objective measures showed no differences.
Key-press gestures recognition and interaction based on SEMG signals BIBAFull-Text 36
  Juan Cheng; Xiang Chen; Zhiyuan Lu; Kongqiao Wang; Minfen Shen
This article investigates the pattern recognition of keypress finger gestures based on surface electromyographic (SEMG) signals and the feasibility of keypress gestures for interaction applications. Two sorts of recognition experiments were first designed to explore the feasibility and repeatability of the SEMG-based classification of 16 keypress finger gestures of the right hand and 4 control gestures, with the keypress gestures defined with reference to the standard PC keyboard. Based on the experimental results, 10 well-recognized keypress gestures were selected as numeric input keys of a simulated phone, and the 4 control gestures were mapped to 4 control keys. Then two types of use tests, namely volume setting and SMS sending, were conducted to survey the gesture-based interaction performance and users' attitudes toward this technique; the test results showed that users could accept this novel input strategy and found it a fresh experience.
Mood avatar: automatic text-driven head motion synthesis BIBAFull-Text 37
  Kaihui Mu; Jianhua Tao; Jianfeng Che; Minghao Yang
Natural head motion is an indispensable part of realistic facial animation. This paper presents a novel approach to synthesize natural head motion automatically based on grammatical and prosodic features, which are extracted by the text analysis part of a Chinese Text-to-Speech (TTS) system. A two-layer clustering method is proposed to determine elementary head motion patterns from a multimodal database which covers six emotional states. The mapping problem between textual information and elementary head motion patterns is modeled by Classification and Regression Trees (CART). With the emotional state specified by the user, results from text analysis are used to drive the corresponding CART model to create an emotional head motion sequence. The generated sequence is then interpolated by spline and used to drive a Chinese text-driven avatar. The comparison experiment indicates that this approach provides better head motion and a more engaging human-computer interaction compared to random or no head motion.
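
A sketch of the mapping step only, using scikit-learn's CART implementation on made-up grammatical/prosodic features and head-motion pattern ids; the feature layout and labels are placeholders, not the paper's data:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy training set: one row per text unit with assumed features
    # [part-of-speech id, pitch (normalised), energy (normalised), emotion id],
    # labelled with the elementary head-motion pattern (cluster id) it co-occurred with.
    rng = np.random.default_rng(3)
    X = rng.random((300, 4))
    X[:, 0] = rng.integers(0, 8, 300)     # part-of-speech category
    X[:, 3] = rng.integers(0, 6, 300)     # one of six emotional states
    y = ((X[:, 1] > 0.6).astype(int) + 2 * (X[:, 3] > 3)).astype(int)  # toy pattern ids

    # CART model mapping textual/prosodic context to a head-motion pattern.
    cart = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

    # At synthesis time, each unit of the input text yields a feature vector;
    # the predicted pattern sequence would then be interpolated (spline) to
    # drive the avatar. Here we only show the pattern lookup.
    new_units = rng.random((5, 4))
    new_units[:, 0] = rng.integers(0, 8, 5)
    new_units[:, 3] = rng.integers(0, 6, 5)
    print("predicted head-motion pattern ids:", cart.predict(new_units))
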
Does haptic feedback change the way we view touchscreens in cars? BIBAFull-Text 38
  Matthew J. Pitts; Gary E. Burnett; Mark A. Williams; Tom Wellings
Touchscreens are increasingly being used in mobile devices and in-vehicle systems. While the usability benefits of touchscreens are acknowledged, their use places significant visual demand on the user due to the lack of tactile and kinaesthetic feedback. Haptic feedback is shown to improve performance in mobile devices, but little objective data is available regarding touchscreen feedback in an automotive scenario. A study was conducted to investigate the effects of visual and haptic touchscreen feedback on driver visual behaviour and driving performance using a simulated driving environment. Results showed a significant interaction between visual and haptic feedback, with the presence of haptic feedback compensating for changes in visual feedback. Driving performance was unaffected by feedback condition but degraded from a baseline measure when touchscreen tasks were introduced. Subjective responses indicated an improved user experience and increased confidence when haptic feedback was enabled.
Identifying emergent leadership in small groups using nonverbal communicative cues BIBAFull-Text 39
  Dairazalia Sanchez-Cortes; Oya Aran; Marianne Schmid Mast; Daniel Gatica-Perez
This paper first presents an analysis of how an emergent leader is perceived in newly formed small groups, and second explores correlations between the perception of leadership and automatically extracted nonverbal communicative cues. We hypothesize that the difference in individual nonverbal features between emergent leaders and non-emergent leaders is significant and measurable using speech activity. Our results on a new interaction corpus show that such an approach is promising, identifying the emergent leader with an accuracy of up to 80%.
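
An illustrative sketch (not the authors' method) of extracting speech-activity features per participant and ranking participants by a combined nonverbal score:

    import numpy as np

    def speaking_features(speaking_matrix):
        """speaking_matrix: shape (participants, frames), 1 = speaking.
        Returns per-participant [total speaking time, number of turns,
        number of turn starts while someone else is already speaking]."""
        feats = []
        for p, row in enumerate(speaking_matrix):
            starts = np.flatnonzero(np.diff(np.concatenate(([0], row))) == 1)
            others = speaking_matrix[np.arange(len(speaking_matrix)) != p].any(axis=0)
            interruptions = int(others[starts].sum())
            feats.append([row.sum(), len(starts), interruptions])
        return np.array(feats, dtype=float)

    def emergent_leader(speaking_matrix):
        """Rank participants by the sum of z-scored speaking-activity features."""
        f = speaking_features(speaking_matrix)
        z = (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-9)
        return int(z.sum(axis=1).argmax())

    if __name__ == "__main__":
        rng = np.random.default_rng(4)
        # Synthetic 4-person meeting; participant 0 talks the most.
        group = (rng.random((4, 2000)) < [[0.4], [0.2], [0.15], [0.1]]).astype(int)
        print("predicted emergent leader: participant", emergent_leader(group))
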
Quantifying group problem solving with stochastic analysis BIBAFull-Text 40
  Wen Dong; Alex "Sandy" Pentland
Quantifying the relationship between group dynamics and group performance is a key issue in increasing group performance. In this paper, we discuss how group performance is related to several heuristics about group dynamics in performing several typical tasks. We also present our novel stochastic modeling approach to learning the structure of group dynamics. Our performance estimators account for between 40 and 60% of the variance across a range of group problem solving tasks.
Cognitive skills learning: pen input patterns in computer-based athlete training BIBAFull-Text 41
  Natalie Ruiz; Qian Qian Feng; Ronnie Taib; Tara Handke; Fang Chen
In this paper, we describe a longitudinal user study with athletes using a cognitive training tool, equipped with an interactive pen interface, and think-aloud protocols. The aim is to verify whether cognitive load can be inferred directly from changes in geometric and temporal features of the pen trajectories. We compare trajectories across cognitive load levels and overall Pre and Post training tests. The results show trajectory durations and lengths decrease while speeds increase, all significantly, as cognitive load increases. These changes are attributed to mechanisms for dealing with high cognitive load in working memory, with minimal rehearsal. With more expertise, trajectory durations further decrease and speeds further increase, which is attributed in part to cognitive skill acquisition and to schema development, both in extraneous and intrinsic networks, between Pre and Post tests. As such, these pen trajectory features offer insight into implicit communicative changes related to load fluctuations.
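
A small sketch of computing the pen-trajectory features the analysis relies on (duration, path length, mean speed) from timestamped pen samples; the data here is synthetic and the feature set is only the subset named in the abstract:

    import numpy as np

    def trajectory_features(timestamps, points):
        """timestamps: (n,) seconds; points: (n, 2) pen positions in pixels.
        Returns duration (s), path length (px) and mean speed (px/s)."""
        timestamps = np.asarray(timestamps, dtype=float)
        points = np.asarray(points, dtype=float)
        duration = timestamps[-1] - timestamps[0]
        segment_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
        length = segment_lengths.sum()
        speed = length / duration if duration > 0 else 0.0
        return {"duration_s": duration, "length_px": length, "mean_speed_px_s": speed}

    if __name__ == "__main__":
        t = np.linspace(0.0, 1.2, 60)                     # one pen stroke, 1.2 s
        pts = np.stack([200 * t, 50 * np.sin(4 * t)], 1)  # synthetic trajectory
        print(trajectory_features(t, pts))
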
Vocal sketching: a prototype tool for designing multimodal interaction BIBAFull-Text 42
  Koray Tahiroglu; Teemu Ahmaniemi
Dynamic audio feedback enriches the interaction with a mobile device. Novel sensor technologies and audio synthesis tools provide an infinite number of possibilities for designing the interaction between sensory input and audio output. This paper presents a study in which vocal sketching was used as a prototyping method to grasp ideas and expectations in the early stages of designing multimodal interaction. We introduce an experiment in which a graspable mobile device was given to participants, who were urged to sketch vocally the sounds to be produced when using the device in communication and musical expression scenarios. The sensory input methods were limited to gestures such as touch, squeeze and movement. Vocal sketching let us examine more closely how gesture and sound could be coupled in the use of our prototype device, such as moving the device upwards with elevating pitch. The results reported in this paper have already informed our opinions and expectations towards the actual design phase of the audio modality.
Evidence-based automated traffic hazard zone mapping using wearable sensors BIBAFull-Text 43
  Masahiro Tada; Haruo Noma; Kazumi Renge
Underestimating the risk of traffic conditions is considered one of the biggest causes of traffic accidents. In this paper, we propose an evidence-based automatic hazard zone mapping method using wearable sensors. We measure the driver's behavior using three-axis gyro sensors. By analyzing the measured motion data, the proposed method can label characteristic motions that are observed at hazard zones. We gathered motion data sets from two types of driver, i.e., a driving school instructor and an ordinary driver, and then tried to generate a traffic hazard zone map focused on the differences between their motions. Through an experiment on public roads, we confirmed that our method can extract hazard zones.
Analysis environment of conversational structure with nonverbal multimodal data BIBAFull-Text 44
  Yasuyuki Sumi; Masaharu Yano; Toyoaki Nishida
This paper describes the IMADE (Interaction Measurement, Analysis, and Design Environment) project, which builds a recording and analysis environment for human conversational interactions. The IMADE room is designed to record audio/visual, human-motion and eye-gaze data for building an interaction corpus, mainly focusing on the understanding of human nonverbal behaviors. In this paper, we present the notion of an interaction corpus and iCorpusStudio, a software environment for browsing and analyzing the interaction corpus. We also present a preliminary experiment on multiparty conversations.
Design and evaluation of a wearable remote social touch device BIBAFull-Text 45
  Rongrong Wang; Francis Quek; James K. S. Teh; Adrian D. Cheok; Sep Riang Lai
Psychological and sociological studies have established the essential role that touch plays in interpersonal communication. However this channel is largely ignored in current telecommunication technologies. We design and implement a remote touch armband with an electric motor actuator. This is paired with a touch input device in the form of a force-sensor-embedded smart phone case. When the smart phone is squeezed, the paired armband will be activated to simulate a squeeze on the user's upper arm. A usability study is conducted with 22 participants to evaluate the device in terms of perceptibility. The results show that users can easily perceive touch at different force levels.
Multimodal interactive machine translation BIBAFull-Text 46
  Vicent Alabau; Daniel Ortiz-Martínez; Alberto Sanchis; Francisco Casacuberta
Interactive machine translation (IMT) [1] is an alternative approach to machine translation, integrating human expertise into the automatic translation process. In this framework, a human iteratively interacts with a system until the output desired by the human is completely generated. Traditionally, interaction has been performed using a keyboard and a mouse. However, the use of touchscreens has been popularised recently. Many touchscreen devices already exist in the market, namely mobile phones, laptops and tablet computers like the iPad. In this work, we propose a new interaction modality to take advantage of such devices, for which online handwritten text seems a very natural way of input. Multimodality is formulated as an extension to the traditional IMT protocol where the user can amend errors by writing text with an electronic pen or a stylus on a touchscreen. Different approaches to modality fusion have been studied. In addition, these approaches have been assessed on the Xerox task. Finally, a thorough study of the errors committed by the online handwritten system will show future work directions.
Component-based high fidelity interactive prototyping of post-WIMP interactions BIBAFull-Text 47
  Jean-Yves Lionel Lawson; Mathieu Coterot; Cyril Carincotte; Benoît Macq
In order to support interactive high-fidelity prototyping of post-WIMP user interactions, we propose a multi-fidelity design method based on a unifying component-based model and supported by an advanced tool suite, the OpenInterface Platform Workbench. Our approach strives for supporting a collaborative (programmer-designer) and user-centered design activity. The workbench architecture allows exploration of novel interaction techniques through seamless integration and adaptation of heterogeneous components, high-fidelity rapid prototyping, runtime evaluation and fine-tuning of designed systems. This paper illustrates through the iterative construction of a running example how OpenInterface allows the leverage of existing resources and fosters the creation of non-conventional interaction techniques.
Active learning strategies for handwritten text transcription BIBAFull-Text 48
  Nicolás Serrano; Adrià Giménez; Albert Sanchis; Alfons Juan
Active learning strategies are being increasingly used in a variety of real-world tasks, though their application to handwritten text transcription in old manuscripts remains nearly unexplored. The basic idea is to follow a sequential, line-by-line transcription of the whole manuscript in which a continuously retrained system interacts with the user to efficiently transcribe each new line. This approach has been recently explored using a conventional strategy by which the user is only asked to supervise words that are not recognized with high confidence. In this paper, the conventional strategy is improved by also letting the system recompute the most probable hypotheses under the constraints imposed by user supervisions. In particular, two strategies are studied which differ in the frequency of hypothesis recomputation on the current line: after each user correction (iterative) or after all of them (delayed). Empirical results are reported on two real tasks showing that these strategies outperform the conventional approach.
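
A control-flow sketch contrasting the iterative and delayed strategies around a stubbed line recognizer; the recognizer, confidences and threshold are placeholders, and the point is only when the hypothesis is recomputed:

    import random

    def recognize(line_image, constraints):
        """Placeholder recognizer: returns (words, confidences) for one text line,
        honouring any words already fixed by the user in `constraints`."""
        words = [constraints.get(i, f"w{i}") for i in range(8)]
        confs = [1.0 if i in constraints else random.random() for i in range(8)]
        return words, confs

    def transcribe_line(line_image, oracle, threshold=0.6, iterative=True):
        """Ask the user (oracle) to correct low-confidence words. 'iterative'
        recomputes the hypothesis after each correction; 'delayed' after all."""
        constraints, corrections = {}, 0
        words, confs = recognize(line_image, constraints)
        for i in range(len(words)):
            if confs[i] < threshold:
                constraints[i] = oracle(i)          # user supervises this word
                corrections += 1
                if iterative:                       # re-decode with the new constraint
                    words, confs = recognize(line_image, constraints)
        if not iterative:                           # delayed: one final re-decode
            words, confs = recognize(line_image, constraints)
        return words, corrections

    if __name__ == "__main__":
        random.seed(0)
        oracle = lambda i: f"gt{i}"                 # stand-in for the human transcriber
        print(transcribe_line(None, oracle, iterative=True))
        print(transcribe_line(None, oracle, iterative=False))
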
Behavior and preference in minimal personality: a study on embodied conversational agents BIBAFull-Text 49
  Yuting Chen; Adeel Naveed; Robert Porzel
Endowing embodied conversational agents with personality affords more natural modalities for their interaction with human interlocutors. To bridge the personality gap between users and agents, we designed two minimal personalities for corresponding agents, i.e., an introverted and an extroverted agent. Each features a combination of different verbal and non-verbal behaviors. In this paper, we present an examination of the effects of the speaking and behavior styles of the two agents and explore the resulting design factors pertinent to spoken dialogue systems. The results indicate that users prefer the extroverted agent to the introverted one. The personality traits of the agents influence the users' preferences, dialogues, and behavior. It is statistically highly significant that users are more talkative with the extroverted agent. We also investigate the spontaneous speech disfluency in the dialogues and demonstrate that the extroverted behavior model reduces the user's speech disfluency. Furthermore, users having different mental models behave differently with the agents. The results and findings show that the minimal personalities of the agents strongly influence the interlocutors' behaviors.
Vlogcast yourself: nonverbal behavior and attention in social media BIBAFull-Text 50
  Joan-Isaac Biel; Daniel Gatica-Perez
We introduce vlogs as a type of rich human interaction which is multimodal in nature and suitable for new large-scale behavioral data analysis. The automatic analysis of vlogs is useful not only to study social media, but also remote communication scenarios, and requires the integration of methods for multimodal processing and for social media understanding. Based on works from social psychology and computing, we first propose robust audio, visual, and multimodal cues to measure the nonverbal behavior of vloggers in their videos. Then, we investigate the relation between behavior and the attention videos receive in YouTube. Our study shows significant correlations between some nonverbal behavioral cues and the average number of views per video.

Human-human interactions

3D user-perspective, voxel-based estimation of visual focus of attention in dynamic meeting scenarios BIBAFull-Text 51
  Michael Voit; Rainer Stiefelhagen
In this paper we present a new framework for the online estimation of people's visual focus of attention from their head poses in dynamic meeting scenarios. We describe a voxel based approach to reconstruct the scene composition from an observer's perspective, in order to integrate occlusion handling and visibility verification. The observer's perspective is thereby simulated with live head pose tracking over four far-field views from the room's upper corners. We integrate motion and speech activity as further scene observations in a Bayesian Surprise framework to model prior attractors of attention within the situation's context. As evaluations on a dedicated dataset with 10 meeting videos show, this allows us to predict a meeting participant's focus of attention correctly in up to 72.2% of all frames.
Modelling and analyzing multimodal dyadic interactions using social networks BIBAFull-Text 52
  Sergio Escalera; Petia Radeva; Jordi Vitrià; Xavier Baró; Bogdan Raducanu
Social network analysis has become a common technique used to model and quantify the properties of social interactions. In this paper, we propose an integrated framework to explore the characteristics of a social network extracted from multimodal dyadic interactions. First, speech detection is performed through an audio/visual fusion scheme based on stacked sequential learning. In the audio domain, speech is detected through clusterization of audio features. Clusters are modelled by means of a one-state Hidden Markov Model containing a diagonal-covariance Gaussian Mixture Model. In the visual domain, speech detection is performed through differential-based feature extraction from the segmented mouth region, and a dynamic programming matching procedure. Second, in order to model the dyadic interactions, we employ the Influence Model, whose states encode the previously integrated audio/visual data. Third, the social network is extracted based on the estimated influences. For our study, we used a set of videos belonging to the New York Times' Blogging Heads opinion blog. The results are reported both in terms of the accuracy of the audio/visual data fusion and the centrality measures used to characterize the social network.
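
A sketch of the final step only: turning an estimated influence matrix into a directed, weighted graph and computing centrality measures with networkx; the influence values here are placeholders, not the Influence Model's output:

    import numpy as np
    import networkx as nx

    # Placeholder influence matrix: influence[i, j] = estimated influence of
    # speaker i on speaker j, as would be produced by the Influence Model.
    influence = np.array([[0.0, 0.6, 0.2],
                          [0.3, 0.0, 0.5],
                          [0.1, 0.2, 0.0]])

    # Build a directed, weighted social network from the pairwise influences.
    G = nx.DiGraph()
    names = ["A", "B", "C"]
    for i, src in enumerate(names):
        for j, dst in enumerate(names):
            if i != j and influence[i, j] > 0:
                G.add_edge(src, dst, weight=float(influence[i, j]))

    # Centrality measures used to characterize the extracted network.
    print("out-degree centrality:", nx.out_degree_centrality(G))
    print("weighted in-degree:", dict(G.in_degree(weight="weight")))
    print("PageRank:", nx.pagerank(G, weight="weight"))
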
Analyzing multimodal time series as dynamical systems BIBAFull-Text 53
  Shohei Hidaka; Chen Yu
We propose a novel approach to discovering latent structures from multimodal time series. We view a time series as observed data from an underlying dynamical system. In this way, analyzing multimodal time series can be viewed as finding latent structures in dynamical systems. In light of this, our approach is based on the concept of the generating partition, which is the theoretically best symbolization of a time series, maximizing the information of the underlying original continuous dynamical system. However, a generating partition is difficult to obtain for time series without explicit dynamical equations. Different from most previous approaches, which attempt to approximate the generating partition through various deterministic symbolization processes, our algorithm maintains and estimates a probabilistic distribution over a symbol set for each data point in a time series. To do so, we develop a Bayesian framework for probabilistic symbolization and demonstrate that the approach can be successfully applied to both simulated data and empirical data from multimodal agent-agent interactions. We suggest that this unsupervised learning algorithm has the potential to be used on various multimodal datasets as a first step to identify underlying structures among temporal variables.
Conversation scene analysis based on dynamic Bayesian network and image-based gaze detection BIBAFull-Text 54
  Sebastian Gorga; Kazuhiro Otsuka
This paper presents a probabilistic framework, which incorporates automatic image-based gaze detection, for inferring the structure of multiparty face-to-face conversations. This framework aims to infer conversation regimes and gaze patterns from the nonverbal behaviors of meeting participants, which are captured from image and audio streams with cameras and microphones. The conversation regime corresponds to a global conversational pattern such as monologue or dialogue, and the gaze pattern indicates "who is looking at whom". Input nonverbal behaviors include the presence/absence of utterances, head directions, and discrete head-centered eye-gaze directions. In contrast to conventional meeting analysis methods that focus only on the participant's head pose as a surrogate of visual focus of attention, this paper newly incorporates vision-based gaze detection combined with head pose tracking into a probabilistic conversation model based on a dynamic Bayesian network. Our gaze detector is able to differentiate 3 to 5 different eye gaze directions, e.g. left, straight and right. Experiments on four-person conversations confirm the power of the proposed framework in identifying conversation structure and in estimating gaze patterns with higher accuracy than previous models.