
Proceedings of the 2013 International Conference on Multimodal Interaction

Fullname: Proceedings of the 15th ACM International Conference on Multimodal Interaction
Editors: Julien Epps; Fang Chen; Sharon Oviatt; Kenji Mase; Andrew Sears; Kristiina Jokinen; Björn Schuller
Location: Sydney, Australia
Dates: 2013-Dec-09 to 2013-Dec-13
Publisher: ACM
Standard No: ISBN: 978-1-4503-2129-7; ACM DL: Table of Contents; hcibib: ICMI13
Papers: 100
Pages: 612
Links: Conference Website
  1. Keynote 1
  2. Oral session 1: personality
  3. Oral session 2: communication
  4. Demo session 1
  5. Poster session 1
  6. Oral session 3: intelligent & multimodal interfaces
  7. Keynote 2
  8. Oral session 4: embodied interfaces
  9. Oral session 5: hand and body
  10. Demo session 2
  11. Poster session 2: doctoral spotlight
  12. Grand challenge overviews
  13. Keynote 3
  14. Oral session 6: AR, VR & mobile
  15. Oral session 7: eyes & body
  16. ChaLearn challenge and workshop on multi-modal gesture recognition
  17. Emotion recognition in the wild challenge and workshop
  18. Multimodal learning analytics challenge
  19. Workshop overview

Keynote 1

Behavior imaging and the study of autism BIBAFull-Text 1-2
  James M. Rehg
Computational sensing and modeling can play a key role in the measurement, analysis, and understanding of human behavior. We refer to this research area as Behavior Imaging, by analogy to the medical imaging technologies that revolutionized internal medicine. We outline the development of behavior imaging technologies to study dyadic social interactions between children and their care-givers, and describe a new Multi-Modal Dyadic Behavior (MMDB) dataset.

Oral session 1: personality

On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions BIBAFull-Text 3-10
  Ramanathan Subramanian; Yan Yan; Jacopo Staiano; Oswald Lanz; Nicu Sebe
Correlates between social attention and personality traits have been widely acknowledged in social psychology studies. Head pose has commonly been employed as a proxy for determining the social attention direction in small group interactions. However, the impact of head pose estimation errors on personality estimates has not been studied to our knowledge.
   In this work, we consider the unstructured and dynamic cocktail-party scenario where the scene is captured by multiple, large field-of-view cameras. Head pose estimation is a challenging task under these conditions owing to the uninhibited motion of persons (whose facial appearance varies with perspective and scale changes) and the low resolution of captured faces. Based on proxemic and social attention features computed from position and head pose annotations, we first demonstrate that social attention features are excellent predictors of the Extraversion and Neuroticism personality traits. We then repeat the classification experiments with behavioral features computed from automated estimates -- the results show that while prediction performance for both traits is affected by head pose estimation errors, the impact is more adverse for Extraversion.
One of a kind: inferring personality impressions in meetings BIBAFull-Text 11-18
  Oya Aran; Daniel Gatica-Perez
We present an analysis of personality prediction in small groups based on trait attributes from external observers. We use a rich set of automatically extracted audio-visual nonverbal features, including speaking turn, prosodic, visual activity, and visual focus of attention features. We also investigate whether the thin-sliced impressions of external observers generalize to the whole meeting in the personality prediction task. Using ridge regression, we analyze both the regression and classification performance of personality prediction. Our experiments show that the extraversion trait can be predicted with high accuracy in a binary classification task and that visual activity features give higher accuracies than audio ones. The highest accuracy for the extraversion trait is 75%, obtained with a combination of audio-visual features. The openness to experience trait is also predicted with significant accuracy, but only when the whole meeting is used as the unit of processing.
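A minimal sketch of this kind of pipeline, assuming synthetic placeholder features rather than the authors' corpus: ridge regression scores a trait from nonverbal feature vectors, and thresholding at the training median yields the binary classification task described above.

```python
# Illustrative only: ridge regression over nonverbal features, evaluated with
# leave-one-out cross-validation and a median split for the binary task.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))          # placeholders for speaking turn, prosody, visual activity...
y = X[:, 0] * 0.8 + rng.normal(scale=0.5, size=40)   # synthetic extraversion scores

correct = 0
for train, test in LeaveOneOut().split(X):
    model = Ridge(alpha=1.0).fit(X[train], y[train])
    pred = model.predict(X[test])[0]
    median = np.median(y[train])
    # binary task: is the person above or below the median trait score?
    correct += (pred > median) == (y[test][0] > median)
print("binary accuracy:", correct / len(y))
```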
Who is persuasive?: the role of perceived personality and communication modality in social multimedia BIBAFull-Text 19-26
  Gelareh Mohammadi; Sunghyun Park; Kenji Sagae; Alessandro Vinciarelli; Louis-Philippe Morency
Persuasive communication is part of everyone's daily life. With the emergence of social websites like YouTube, Facebook and Twitter, persuasive communication is now seen online on a daily basis. This paper explores the effect of multi-modality and perceived personality on the persuasiveness of social multimedia content. The experiments are performed over a large corpus of movie review clips from YouTube, presented to online annotators in three different modalities: text only, audio only, and video. The annotators evaluated the persuasiveness of each review across the different modalities and judged the personality of the speaker. Our detailed analysis confirmed several research hypotheses designed to study the relationships between persuasion, perceived personality and communicative channel, namely modality. Three hypotheses are examined: the first studies the effect of communication modality on persuasion, the second examines the correlation between persuasion and personality perception, and the third, derived from the first two, explores how communication modality influences personality perception.
Going beyond traits: multimodal classification of personality states in the wild BIBAFull-Text 27-34
  Kyriaki Kalimeri; Bruno Lepri; Fabio Pianesi
Recent studies in social and personality psychology have introduced the notion of personality states, conceived as concrete behaviors that can be described as having the same contents as traits. Our paper is a first step towards automatically addressing this new perspective. In particular, we focus on the classification of excerpts of social behavior into personality states corresponding to the Big Five traits, rather than on the more traditional goal of using those behaviors to directly infer the personality traits of the person producing them. The multimodal behavioral cues we exploit were obtained by means of Sociometric Badges worn by people working at a research institution for a period of six weeks. We investigate the effectiveness of cues concerning acted social behaviors as well as of other situational characteristics for personality state classification. The encouraging results show that our classifiers always, and sometimes greatly, improve on the performance of a random baseline classifier (from 1.5 to 1.8 times better than chance). At a general level, we believe that these results support the proposed shift from the classification of personality traits to the classification of personality states.

Oral session 2: communication

Implementation and evaluation of a multimodal addressee identification mechanism for multiparty conversation systems BIBAFull-Text 35-42
  Yukiko I. Nakano; Naoya Baba; Hung-Hsuan Huang; Yuki Hayashi
In conversational agents with multiparty communication functionality, a system needs to be able to identify the addressee of the current floor and respond to the user when the utterance is addressed to the agent. This study proposes addressee identification models based on speech and gaze information, and tests whether the models can be applied to different proxemics. We build an addressee identification mechanism by implementing the models and incorporating it into a fully autonomous multiparty conversational agent. The system identifies the addressee from online multimodal data and uses this information in language understanding and dialogue management. Finally, an evaluation experiment shows that the proposed addressee identification mechanism works well in a real-time system, with an F-measure for addressee estimation of 0.8 for agent-addressed utterances. We also found that our system more successfully avoided disturbing the conversation by mistakenly taking a turn when it was not addressed.
Managing chaos: models of turn-taking in character-multichild interactions BIBAFull-Text 43-50
  Iolanda Leite; Hannaneh Hajishirzi; Sean Andrist; Jill Lehman
Turn-taking decisions in multiparty settings are complex, especially when the participants are children. Our goal is to endow an interactive character with appropriate turn-taking behavior using visual, audio and contextual features. To that end, we investigate three distinct turn-taking models: a baseline model grounded in established turn-taking rules for adults and two machine learning models, one trained with data collected in situ and the other trained with data collected in more controlled conditions. The three models are shown to have different profiles of behavior during silences, overlapping speech, and at the end of participants' turns. An exploratory user evaluation focusing on the decision points where the models differ showed clear preference for the machine learning models over the baseline model. The results indicate that the rules for language interactions with small groups of children are not simply an extension of the rules for interacting with small groups of adults.
Speaker-adaptive multimodal prediction model for listener responses BIBAFull-Text 51-58
  Iwan de Kok; Dirk Heylen; Louis-Philippe Morency
The goal of this paper is to analyze and model the variability in speaking styles in dyadic interactions and to build a predictive algorithm for listener responses that is able to adapt to these different styles. The end result of this research will be a virtual human able to automatically respond to a human speaker with proper listener responses (e.g., head nods). Our novel speaker-adaptive prediction model is created from a corpus of dyadic interactions in which speaker variability is analyzed to identify a subset of prototypical speaker styles. During a live interaction, our prediction model automatically identifies the closest prototypical speaker style and predicts listener responses based on this "communicative style". Central to our approach is the idea of a "speaker profile" which uniquely identifies each speaker and enables the matching between prototypical speakers and new speakers. The paper demonstrates the merits of our speaker-adaptive listener response prediction model by showing improvement over a state-of-the-art approach which does not adapt to the speaker. Beyond the merits of speaker adaptation, our experiments highlight the importance of using multimodal features when comparing speakers to select the closest prototypical speaker style.
User experiences of mobile audio conferencing with spatial audio, haptics and gestures BIBAFull-Text 59-66
  Jussi Rantala; Sebastian Müller; Roope Raisamo; Katja Suhonen; Kaisa Väänänen-Vainio-Mattila; Vuokko Lantz
Devices such as mobile phones have made it possible to take part in remote audio conferences regardless of one's physical location. Mobile phones also allow for new ways to interact with other conference participants. We present a study on evaluating the user experiences of a mobile audio conferencing system that was augmented with spatial audio, haptics, and gestures. In a user study groups of participants compared the augmented audio conference based on a mobile phone and headset to a traditional audio conference. The participants' task was to use the two alternative systems in given discussion tasks. The results of the subjective questionnaires showed that the augmented audio conference was perceived as more stimulating (e.g. creative), while the traditional audio conference was perceived as more practical (e.g. straightforward). The results of the group interviews indicated that spatial audio was the most desired feature, and that it had a positive effect on participants' perception of the conversation. Based on our findings, guidelines for the future development of similar systems are presented.

Demo session 1

A framework for multimodal data collection, visualization, annotation and learning BIBAFull-Text 67-68
  Anne Loomis Thompson; Dan Bohus
The development and iterative refinement of inference models for multimodal systems can be challenging and time intensive. We present a framework for multimodal data collection, visualization, annotation, and learning that enables system developers to build models using various machine learning techniques, and quickly iterate through cycles of development, deployment and refinement.
Demonstration of sketch-thru-plan: a multimodal interface for command and control BIBAFull-Text 69-70
  Philip R. Cohen; M. Cecelia Buchanan; Edward J. Kaiser; Michael Corrigan; Scott Lind; Matt Wesson
This paper demonstrates a multimodal system called Sketch-Thru-Plan (STP) that enables users to speak and draw doctrinal language and symbols in order to create courses of action. We argue that STP can meet many of the challenges inherent in building user interfaces for operations planning and command-and-control. The system is being transitioned to military organizations for use in planning courses of action.
Robotic learning companions for early language development BIBAFull-Text 71-72
  Jacqueline M. Kory; Sooyeon Jeong; Cynthia L. Breazeal
Research from the past two decades indicates that preschool is a critical time for children's oral language and vocabulary development, which in turn is a primary predictor of later academic success. However, given the inherently social nature of language learning, it is difficult to develop scalable interventions for young children. Here, we present one solution in the form of robotic learning companions, using the DragonBot platform. Designed as interactive, social characters, these robots combine the flexibility and personalization afforded by educational software with a crucial social context, as peers and conversation partners. They can supplement teachers and caregivers, allowing remote operation as well as the potential for autonomously participating with children in language learning activities. Our aim is to demonstrate the efficacy of the DragonBot platform as an engaging, social, learning companion.
WikiTalk human-robot interactions BIBAFull-Text 73-74
  Graham Wilcock; Kristiina Jokinen
The demo shows WikiTalk, a Wikipedia-based open-domain information access dialogue system implemented on a talking humanoid robot. The robot behaviour integrates speech, nodding, gesturing and face-tracking to support interaction management and the presentation of information to the partner.

Poster session 1

Saliency-guided 3D head pose estimation on 3D expression models BIBAFull-Text 75-78
  Peng Liu; Michael Reale; Xing Zhang; Lijun Yin
Head pose is an important indicator of a person's attention, gestures, and communicative behavior, with applications in human-computer interaction, multimedia, and vision systems. Robust head pose estimation is a prerequisite for spontaneous facial biometrics-related applications. However, most previous head pose estimation methods do not consider facial expression and hence are more likely to be influenced by it. In this paper, we develop a saliency-guided 3D head pose estimation method for 3D expression models. We address the problem of head pose estimation based on a generic model and saliency-guided segmentation on a Laplacian fairing model. We perform mesh Laplacian fairing to remove noise and outliers from the 3D facial model. The salient regions are detected and segmented from the model. Salient-region Iterative Closest Point (ICP) then registers the test face model with the generic head model. The pose estimation algorithms are evaluated on both static and dynamic 3D facial databases. Overall, the extensive results demonstrate the effectiveness and accuracy of our approach.
Predicting next speaker and timing from gaze transition patterns in multi-party meetings BIBAFull-Text 79-86
  Ryo Ishii; Kazuhiro Otsuka; Shiro Kumano; Masafumi Matsuda; Junji Yamato
In multi-party meetings, participants need to predict the end of the speaker's utterance and who will start speaking next, and to consider a strategy for the right timing to speak next. Gaze behavior plays an important role in smooth turn-taking. This paper proposes a mathematical prediction model with three processing steps that predict (I) whether turn-taking or turn-keeping will occur, (II) who will be the next speaker in turn-taking, and (III) the timing of the start of the next speaker's utterance. As the feature quantity of the model, we focus on gaze transition patterns near the end of the utterance. We collected corpus data of multi-party meetings and analyzed how the frequencies of appearance of gaze transition patterns differ across situations (I), (II), and (III). On the basis of the analysis, we construct a probabilistic mathematical model that uses the frequencies of appearance of all participants' gaze transition patterns. The results of an evaluation show that the proposed models achieve higher precision than ones that do not take gaze transition patterns into account.
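A toy illustration of the frequency-based idea, not the paper's actual model: smoothed counts of gaze transition patterns near the end of an utterance yield an estimate of P(turn-taking | pattern). The patterns and labels below are invented.

```python
# Illustrative only: Laplace-smoothed pattern counts as a turn-taking predictor.
from collections import Counter, defaultdict

# (gaze transition pattern near end of utterance, outcome) observations
data = [("speaker->listenerA", "take"), ("speaker->listenerA", "take"),
        ("speaker->speaker", "keep"), ("listenerA->speaker", "keep"),
        ("speaker->listenerA", "keep"), ("speaker->speaker", "keep")]

counts = defaultdict(Counter)
for pattern, outcome in data:
    counts[pattern][outcome] += 1

def p_take(pattern, alpha=1.0):
    """Smoothed probability that turn-taking occurs given the observed pattern."""
    c = counts[pattern]
    return (c["take"] + alpha) / (c["take"] + c["keep"] + 2 * alpha)

print(p_take("speaker->listenerA"))   # 0.6
print(p_take("speaker->speaker"))     # 0.25
```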
A semi-automated system for accurate gaze coding in natural dyadic interactions BIBAFull-Text 87-90
  Kenneth A. Funes Mora; Laurent Nguyen; Daniel Gatica-Perez; Jean-Marc Odobez
In this paper we propose a system capable of accurately coding gazing events in natural dyadic interactions. Contrary to previous works, our approach exploits the actual continuous gaze direction of a participant by leveraging remote RGB-D sensors and a head pose-independent gaze estimation method. Our contributions are: i) we propose a system setup built from low-cost sensors and a technique to easily calibrate these sensors in a room with minimal assumptions; ii) we propose a method which, given short manual annotations, can automatically detect gazing events in the rest of the sequence; iii) we demonstrate on substantially long, natural dyadic data that high accuracy can be obtained, showing the potential of our system. Our approach is non-invasive and does not require collaboration from the interactors. These characteristics are highly valuable in psychology and sociology research.
Evaluating the robustness of an appearance-based gaze estimation method for multimodal interfaces BIBAFull-Text 91-98
  Nanxiang Li; Carlos Busso
Given the crucial role of eye movements in visual attention, tracking gaze behaviors is an important research problem in various applications including biometric identification, attention modeling and human-computer interaction. Most existing gaze tracking methods require a repetitive system calibration process and are sensitive to the user's head movements. Therefore, they cannot be easily implemented in current multimodal interfaces. This paper investigates an appearance-based approach for gaze estimation that requires minimal calibration and is robust against head motion. The approach consists of building an orthonormal basis, or eigenspace, of the eye appearance with principal component analysis (PCA). Unlike previous studies, we build the eigenspace using image patches displaying both eyes. The projections onto the basis are used to train regression models which predict the gaze location. The approach is trained and tested with a new multimodal corpus introduced in this paper. We consider several variables such as the distance between the user and the computer monitor, and head movement. The evaluation includes the performance of the proposed gaze estimation system with and without head movement, as well as results in subject-dependent versus subject-independent conditions under different distances. We report promising results which suggest that the proposed gaze estimation approach is a feasible and flexible scheme to facilitate gaze-based multimodal interfaces.
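A hedged sketch of the appearance-based pipeline outlined above, with synthetic data standing in for the corpus: eye-image patches are projected onto a PCA eigenspace, and the projections feed a regression model that predicts the on-screen gaze location.

```python
# Illustrative only: PCA eigenspace of eye-patch appearance + ridge regression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
patches = rng.random((200, 30 * 90))        # flattened patches containing both eyes
gaze_xy = rng.random((200, 2))              # normalized on-screen gaze targets

pca = PCA(n_components=20).fit(patches)
coeffs = pca.transform(patches)             # eigenspace projections
reg = Ridge(alpha=1.0).fit(coeffs, gaze_xy) # multi-output regression to (x, y)

new_patch = rng.random((1, 30 * 90))
print("predicted gaze (x, y):", reg.predict(pca.transform(new_patch))[0])
```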
A gaze-based method for relating group involvement to individual engagement in multimodal multiparty dialogue BIBAFull-Text 99-106
  Catharine Oertel; Giampiero Salvi
This paper is concerned with modelling individual engagement and group involvement, as well as their relationship, in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour over time. We use these features to define similarities between the subjects and compare this information with the engagement rankings the subjects gave at the end of each interaction about themselves and the other participants. We analyse how these features relate to four classes of group involvement and build a classifier that is able to distinguish between those classes with 71% accuracy.
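As an illustration of one of the listed gaze summaries (an assumed formulation, not the paper's exact definition), the entropy of the gaze-target distribution within a window is low when everyone looks at the same person and high when gaze is spread evenly:

```python
# Illustrative only: Shannon entropy of gaze targets over a time window.
import math
from collections import Counter

def gaze_entropy(targets):
    """Entropy (bits) of the gaze-target distribution in a window of frames."""
    counts = Counter(targets)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

window = ["P3", "P3", "P3", "P5", "P3", "P7", "P3"]   # hypothetical frame labels
print(round(gaze_entropy(window), 3))
```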
Leveraging the robot dialog state for visual focus of attention recognition BIBAFull-Text 107-110
  Samira Sheikhi; Vasil Khalidov; David Klotz; Britta Wrede; Jean-Marc Odobez
The Visual Focus of Attention (VFOA, i.e., what or whom a person is looking at) is a fundamental cue in non-verbal communication and plays an important role when designing effective human-machine interaction systems. However, recognizing the VFOA of an interacting person is difficult for a robot, since eye gaze estimation is not possible with low-resolution imaging. Rather, the head pose is used as a substitute for gaze, which leads to ambiguities in its interpretation as a VFOA indicator. In this paper, we investigate the use of the robot's conversational state, of which the robot is aware, as contextual information to improve VFOA recognition from head pose. We propose a dynamic Bayesian model that accounts for the robot state (speaking status, the person it addresses, references to objects) along with a dynamic head-to-gaze mapping function. Experiments on a publicly available human-robot interaction dataset, where a humanoid robot plays the role of an art guide and quiz master, show that using such conversational context is effective in improving VFOA recognition.
CoWME: a general framework to evaluate cognitive workload during multimodal interaction BIBAFull-Text 111-118
  Davide Maria Calandra; Antonio Caso; Francesco Cutugno; Antonio Origlia; Silvia Rossi
Evaluating human-machine interaction in the case of multimodal systems is often a difficult task involving the monitoring of multiple sources, data fusion and result interpretation. While subtasks are highly dependent on the specific goal of the application and on the available interaction modalities, it is possible to formalize this workflow into a standard process and to consider a generic measure to estimate the ease of use of a specific application. In this work, we present CoWME, a modular software architecture that describes multimodal human-machine interaction evaluation, from data collection to final evaluation, in a formal way and in terms of cognitive workload. Communication protocols between modules are described in XML, while data fusion is delegated to a configurable rule engine. An interface module is introduced between the monitoring modules and the rule engine to collect and summarize data streams for cognitive workload evaluation. We present an example showing how this architecture is deployed by monitoring an interactive session with an Android application, taking into account stressed speech detection, mydriasis and touch analysis.
Hi YouTube!: personality impressions and verbal content in social video BIBAFull-Text 119-126
  Joan-Isaac Biel; Vagia Tsiminaki; John Dines; Daniel Gatica-Perez
Despite the evidence that social video conveys rich human personality information, research investigating the automatic prediction of personality impressions in vlogging has shown that, among the Big-Five traits, automatic nonverbal behavioral cues are useful mainly for predicting the Extraversion trait. This finding, also reported in other conversational settings, indicates that personality information may be coded in other behavioral dimensions such as the verbal channel, which has been less studied in multimodal interaction research. In this paper, we address the task of predicting personality impressions of vloggers based on what they say in their YouTube videos. First, we use manual transcripts of vlogs and verbal content analysis techniques to understand the ability of verbal content to predict crowdsourced Big-Five personality impressions. Second, we explore the feasibility of a fully-automatic framework in which transcripts are obtained using automatic speech recognition (ASR). Our results show that the analysis of error-free verbal content is useful to predict four of the Big-Five traits, three of them better than using nonverbal cues, and that the errors caused by the ASR system decrease the performance significantly.
Cross-domain personality prediction: from video blogs to small group meetings BIBAFull-Text 127-130
  Oya Aran; Daniel Gatica-Perez
In this study, we investigate the use of social media content as a domain from which to learn personality trait impressions, particularly extraversion. Our aim is to transfer the knowledge that can be extracted from conversational videos on video blogging sites to small group settings in order to predict the extraversion trait with nonverbal cues. We use YouTube data containing personality impression scores of 442 people as the source domain and small-group meeting data from a total of 102 people as the target domain. Our results show that, for the extraversion trait, by using user-created video blogs as part of the training data together with a small amount of adaptation data from the target domain, we are able to achieve higher prediction accuracies than using only the data recorded in small group settings.
Automatic detection of deceit in verbal communication BIBAFull-Text 131-134
  Rada Mihalcea; Verónica Pérez-Rosas; Mihai Burzo
This paper presents experiments in building a classifier for the automatic detection of deceit. Using a dataset of deceptive videos, we run several comparative evaluations focusing on the verbal component of these videos, with the goal of understanding the difference in deceit detection when using manual versus automatic transcriptions, as well as the difference between spoken and written lies. We show that using only the linguistic component of the deceptive videos, we can detect deception with accuracies in the range of 52-73%.
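A minimal sketch of this kind of verbal deceit classifier under assumed choices (bag-of-words features, a linear model, toy transcripts and labels):

```python
# Illustrative only: linguistic deceit detection with cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

transcripts = ["i swear i was at home all night", "honestly i never saw that money",
               "we went to the market and then home", "i watched the game with my brother",
               "to be honest i would never lie to you", "she gave me the keys this morning"]
labels = [1, 1, 0, 0, 1, 0]   # 1 = deceptive, 0 = truthful (toy annotations)

clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, transcripts, labels, cv=3)
print("cross-validated accuracy:", scores.mean())
```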
Audiovisual behavior descriptors for depression assessment BIBAFull-Text 135-140
  Stefan Scherer; Giota Stratou; Louis-Philippe Morency
We investigate audiovisual indicators, in particular measures of reduced emotional expressivity and psycho-motor retardation, for depression within semi-structured virtual human interviews. Based on a standard self-assessment depression scale we investigate the statistical discriminative strength of the audiovisual features on a depression/no-depression basis. Within subject-independent unimodal and multimodal classification experiments we find that early feature-level fusion yields promising results and confirms the statistical findings. We further correlate the behavior descriptors with the assessed depression severity and find considerable correlation. Lastly, a joint multimodal factor analysis reveals two prominent factors within the data that show both statistical discriminative power as well as strong linear correlation with the depression severity score. These preliminary results based on a standard factor analysis are promising and motivate us to investigate this approach further in the future, while incorporating additional modalities.
A Markov logic framework for recognizing complex events from multimodal data BIBAFull-Text 141-148
  Young Chol Song; Henry Kautz; James Allen; Mary Swift; Yuncheng Li; Jiebo Luo; Ce Zhang
We present a general framework for complex event recognition that is well-suited for integrating information that varies widely in detail and granularity. Consider the scenario of an agent in an instrumented space performing a complex task while describing what he is doing in a natural manner. The system takes in a variety of information, including objects and gestures recognized from RGB-D data and descriptions of events extracted from recognized and parsed speech. The system outputs a complete reconstruction of the agent's plan, explaining actions in terms of more complex activities and filling in unobserved but necessary events. We show how to use Markov Logic (a probabilistic extension of first-order logic) to create a model in which observations can be partial, noisy, and refer to future or temporally ambiguous events; complex events are composed from simpler events in a manner that exposes their structure for inference and learning; and uncertainty is handled in a sound probabilistic manner. We demonstrate the effectiveness of the approach for tracking kitchen activities in the presence of noisy and incomplete observations.
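The Markov Logic idea can be illustrated with a toy example (not the authors' model): each possible world is scored by the sum of the weights of the soft rules it satisfies, and its probability is proportional to the exponential of that score.

```python
# Illustrative only: two propositions, two weighted soft rules, exact enumeration.
import itertools, math

# Propositions: the agent grabbed the kettle (grab), and is making tea (tea).
def score(world):
    grab, tea = world
    s = 0.0
    s += 1.5 if (not grab or tea) else 0.0   # w = 1.5 : grab(kettle) => making(tea)
    s += -0.5 if tea else 0.0                # w = -0.5: prior against making(tea)
    return s

worlds = list(itertools.product([False, True], repeat=2))
Z = sum(math.exp(score(w)) for w in worlds)          # partition function
for w in worlds:
    print(w, round(math.exp(score(w)) / Z, 3))
# Conditioning on grab=True would then give the posterior over making(tea).
```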
Interactive relevance search and modeling: support for expert-driven analysis of multimodal data BIBAFull-Text 149-156
  Chreston Miller; Francis Quek; Louis-Philippe Morency
In this paper we present the findings of three longitudinal case studies in which a new method for conducting multimodal analysis of human behavior is tested. The focus of this new method is to engage a researcher integrally in the analysis process and allow them to guide the identification and discovery of relevant behavior instances within multimodal data. The case studies resulted in the creation of two analysis strategies: Single-Focus Hypothesis Testing and Multi-Focus Hypothesis Testing. Each was shown to be beneficial to multimodal analysis by supporting either a single focused deep analysis or analysis across multiple angles in unison. These strategies exemplified how challenging questions can be answered for multimodal datasets. The new method is described, and the case studies' findings are presented, detailing how the new method supports multimodal analysis and opens the door for a new breed of analysis methods. Two of the three case studies produced publishable results for the respective participants.
Predicting speech overlaps from speech tokens and co-occurring body behaviours in dyadic conversations BIBAFull-Text 157-164
  Costanza Navarretta
This paper deals with speech overlaps in dyadic video-recorded spontaneous conversations. Speech overlaps are quite common in everyday conversations, and it is therefore important to study their occurrence in different communicative situations and settings and to model them in applied communicative systems.
   In the present work, we investigate the frequency and use of speech overlaps in a multimodally annotated corpus of first encounters. Speech overlaps were automatically tagged, and a Bayesian Network learner was trained on the multimodal annotations in order to determine to what extent overlaps can be predicted, so that they can be dealt with in conversational devices, and to investigate the relation between overlaps, speech tokens and co-occurring body behaviours. The annotations comprise the shape and functions of head movements, facial expressions and body postures.
   23% of the speech tokens and 90% of the spoken contributions in the first encounters are overlapping. The best classification results were obtained by training the classifier on the multimodal behaviours (speech and co-occurring head movements, facial expressions and body postures) which surrounded the overlaps. Training the classifier on all speech tokens also gave good results, while adding the shape of co-occurring body behaviours did not affect the results. Thus, the behaviours of the conversation participants do not change when there is a speech overlap, which could indicate that most of the overlaps in the first encounters are non-competitive.
Interaction analysis and joint attention tracking in augmented reality BIBAFull-Text 165-172
  Alexander Neumann; Christian Schnier; Thomas Hermann; Karola Pitsch
Multimodal research on human interaction has to consider a variety of factors, ranging from local short-time phenomena to complex interaction patterns. As of today, no single discipline engaged in communication research offers the methods and tools to investigate the full complexity continuum in a time-efficient way. A synthesis of qualitative and quantitative analysis is required to merge insights about micro-sequential structures with big-data patterns. Using the example of a co-present dyadic negotiation analysis that combines methods from Conversation Analysis and Data Mining, we show how such a partnership can benefit each discipline and lead to insights as well as new opportunities for evaluating hypotheses.
Mo!Games: evaluating mobile gestures in the wild BIBAFull-Text 173-180
  Julie R. Williamson; Stephen Brewster; Rama Vennelakanti
The user experience of performing gesture-based interactions in public spaces is highly dependent on context, where users must decide which gestures they will use and how they will perform them. A realistic evaluation of how users make these decisions must therefore be completed "in the wild." Furthermore, studies need to be completed within different cultural contexts in order to understand how users might adopt gestures differently in different cultures. This paper presents such a study using a mobile gesture-based game, where users in the UK and India interacted with the game over the span of 6 days. The results of this study demonstrate similarities between gesture use in these divergent cultural settings, illustrate factors that influence gesture acceptance such as perceived size of movement and perceived accuracy, and provide insights into the interaction design of mobile gestures when gestures are distributed across the body.
Timing and entrainment of multimodal backchanneling behavior for an embodied conversational agent BIBAFull-Text 181-188
  Benjamin Inden; Zofia Malisz; Petra Wagner; Ipke Wachsmuth
We report on an analysis of feedback behavior in an Active Listening Corpus as produced verbally, visually (head movement) and bimodally. The behavior is modeled in an embodied conversational agent, and the agent's conversation with a real human is displayed to human participants for perceptual evaluation. Five strategies for the timing of backchannels are compared: copying the timing of the original human listener, producing backchannels at randomly selected times, producing backchannels according to high-level timing distributions relative to the interlocutor's utterances and pauses, according to local entrainment to the interlocutor's vowels, or according to both. Human observers judge that models with global timing distributions miss fewer opportunities for backchanneling than random timing.
Video analysis of approach-avoidance behaviors of teenagers speaking with virtual agents BIBAFull-Text 189-196
  David Antonio Gómez Jáuregui; Léonor Philip; Céline Clavel; Stéphane Padovani; Mahin Bailly; Jean-Claude Martin
Analysis of non-verbal behaviors in HCI makes it possible to understand how individuals apprehend and adapt to different situations of interaction. This seems particularly relevant when considering tasks such as speaking in a foreign language, which is known to elicit anxiety. This is even truer for young users, for whom negative pedagogical feedback might have a strong negative impact on their motivation to learn.
   In this paper, we consider the approach-avoidance behaviors of teenagers speaking with virtual agents when using an e-learning platform for learning English. We designed an algorithm for processing video of these teenagers outside laboratory conditions (e.g., an ordinary collective classroom in a secondary school) using a webcam. This algorithm processes the video of the user and computes the inter-ocular distance. The anxiety of the users is also collected with questionnaires.
   Results show that the inter-ocular distance makes it possible to discriminate between approach and avoidance behaviors of teenagers reacting to positive or negative stimuli. This simple metric, collected via video processing, detects an approach behavior related to a positive stimulus and an avoidance behavior related to a negative stimulus. Furthermore, we observed that these automatically detected approach-avoidance behaviors are correlated with anxiety.
A dialogue system for multimodal human-robot interaction BIBAFull-Text 197-204
  Lorenzo Lucignano; Francesco Cutugno; Silvia Rossi; Alberto Finzi
This paper presents a POMDP-based dialogue system for multimodal human-robot interaction (HRI). Our aim is to exploit a dialogical paradigm to allow natural and robust interaction between the human and the robot. The proposed dialogue system should improve the robustness and flexibility of the overall interactive system, including multimodal fusion, interpretation, and decision-making. The dialogue is represented as a Partially Observable Markov Decision Process (POMDP) to cast the inherent communication ambiguity and noise into the dialogue model. POMDPs have been used in spoken dialogue systems, mainly for tourist information services, but their application to multimodal human-robot interaction is novel. This paper presents the proposed model for dialogue representation and the methodology used to compute a dialogue strategy. The whole architecture has been integrated on a mobile robot platform and tested in a human-robot interaction scenario to assess the overall performance with respect to baseline controllers.
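At the heart of any POMDP-based dialogue manager is the standard belief update, b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) b(s). The tiny state, transition and observation numbers below are illustrative only, not the paper's model.

```python
# Illustrative only: one step of Bayesian belief tracking over hidden dialogue states.
import numpy as np

states = ["user_wants_help", "user_wants_nothing"]
T = np.array([[0.9, 0.1],          # T[s, s'] for the chosen action (assumed fixed here)
              [0.2, 0.8]])
O = np.array([0.7, 0.2])           # O[s'] = P(observe "user waves" | s', action)

def belief_update(b, T, O):
    """Predict with the transition model, then correct with the observation likelihood."""
    b_pred = b @ T                  # marginalize over the previous state
    b_new = O * b_pred              # weight by how well each state explains the observation
    return b_new / b_new.sum()

b = np.array([0.5, 0.5])
print(belief_update(b, T, O))       # belief shifts toward "user_wants_help"
```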
The zigzag paradigm: a new P300-based brain computer interface BIBAFull-Text 205-212
  Qasem Obeidat; Tom Campbell; Jun Kong
Brain Computer Interfaces (BCIs) translate electroencephalogram (EEG) input, digitally recorded via electrodes on the user's scalp, into output commands that control external devices. A P300-based BCI speller system is based upon visual Event-Related Potentials (ERPs) in response to stimulation, as derived from EEG, and is used to type on a computer screen. The Row-Column speller Paradigm (RCP), utilizing a 6-by-6 character matrix, has been a widely used and successful P300 speller, despite inherent problems of adjacency, crowding, and fatigue. RCP is compared here with a new P300 speller interface, the Zigzag Paradigm (ZP). In the ZP interface, every second row of the 6-by-6 character matrix is offset to the right by d/2 cm, where d cm is the horizontal distance between two adjacent characters. This shifting addresses the adjacency problem by removing all vertically adjacent characters and increasing the distance between most adjacent characters. It also addresses the crowding problem, for most characters, by reducing the number of other characters surrounding a character; critically the target character. A user study with neurologically normal individuals revealed significant improvements in online classification performance with the ZP, supporting the view that the ZP effectively addresses the adjacency and crowding problems. Subjective ratings also revealed that the ZP was more comfortable and caused less fatigue. Theoretical and practical implications of the applicability of the ZP for patients with neuromuscular diseases are discussed.
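The layout rule itself is simple enough to sketch (a hypothetical reconstruction from the description above, not the authors' implementation): every second row of the 6-by-6 matrix is shifted right by d/2, which eliminates vertical adjacency.

```python
# Illustrative only: generating zigzag speller positions from a flat character set.
import string

def zigzag_positions(chars, cols=6, d=1.0, row_h=1.0):
    """Return (char, x, y) positions for a zigzag speller matrix."""
    layout = []
    for i, ch in enumerate(chars):
        row, col = divmod(i, cols)
        offset = d / 2 if row % 2 == 1 else 0.0   # shift every second row by d/2
        layout.append((ch, col * d + offset, row * row_h))
    return layout

chars = string.ascii_uppercase + "0123456789"     # 36 symbols -> 6-by-6 matrix
for ch, x, y in zigzag_positions(chars)[:8]:
    print(ch, x, y)
```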
SpeeG2: a speech- and gesture-based interface for efficient controller-free text input BIBAFull-Text 213-220
  Lode Hoste; Beat Signer
With the emergence of smart TVs, set-top boxes and public information screens over the last few years, there is an increasing demand to no longer use these appliances only for passive output. These devices can also be used for text-based web search as well as other tasks which require some form of text input. However, the design of text entry interfaces for efficient input on such appliances represents a major challenge. With current virtual keyboard solutions we only achieve an average text input rate of 5.79 words per minute (WPM), while the average typing speed on a traditional keyboard is 38 WPM. Furthermore, so-called controller-free appliances such as Samsung's Smart TV or Microsoft's Xbox Kinect result in even lower average text input rates. We present SpeeG2, a multimodal text entry solution combining speech recognition with gesture-based error correction. Four innovative prototypes for efficient controller-free text entry have been developed and evaluated. A quantitative evaluation of our SpeeG2 text entry solution revealed that the best of our four prototypes achieves an average input rate of 21.04 WPM (without errors), outperforming current state-of-the-art solutions for controller-free text input.

Oral session 3: intelligent & multimodal interfaces

Interfaces for thinkers: computer input capabilities that support inferential reasoning BIBAFull-Text 221-228
  Sharon Oviatt
Recent research has revealed that basic computer input capabilities can substantially facilitate or impede people's ability to produce ideas and solve problems correctly. This research asks: what type of interface provides the best support for inferential reasoning in both low- and high-performing students? Students' ability to make accurate inferences about science and everyday reasoning tasks was compared while they used: (1) non-digital pen and paper, (2) a digital pen and paper interface, (3) a pen tablet interface, and (4) a graphical tablet interface. Correct inferences averaged 10.5% higher when using a digital pen interface compared with the tablet interfaces. Further analyses revealed that overgeneralization and redundancy errors were more common when using the tablet interfaces and among low performers. Implications are discussed for designing more effective computational thinking tools.
Adaptive timeline interface to personal history data BIBAFull-Text 229-236
  Antti Ajanki; Markus Koskela; Jorma Laaksonen; Samuel Kaski
As the amount of stored personal digital information, such as photographs and emails, continues to grow, new tools for browsing and searching are needed. We introduce an intelligent mobile information access tool for personal data. The data are presented in an adaptive timeline where the displayed items function as search cues. The novelty is that the visualization is dynamically changed to emphasize relevant items, which makes them easier to recognize and select. The relevance is inferred during usage of the system from user feedback. In a user study, the dynamic timeline-based interface on a mobile device was shown to require less effort than conventional textual search.
Learning a sparse codebook of facial and body microexpressions for emotion recognition BIBAFull-Text 237-244
  Yale Song; Louis-Philippe Morency; Randall Davis
Obtaining a compact and discriminative representation of facial and body expressions is a difficult problem in emotion recognition. Part of the difficulty is capturing microexpressions, i.e., short, involuntary expressions that last for only a fraction of a second: at a micro-temporal scale, there are many other subtle face and body movements that do not convey semantically meaningful information. We present a novel approach to this problem that exploits the sparsity of the frequent micro-temporal motion patterns. Local space-time features are extracted over the face and body region for a very short time period, e.g., a few milliseconds. A codebook of microexpressions is learned from the data and used to encode the features in a sparse manner. This allows us to obtain a representation that captures the most salient motion patterns of the face and body at a micro-temporal scale. Experiments performed on the AVEC 2012 dataset show our approach achieving the best published performance on the arousal dimension based solely on visual features. We also report experimental results on audio-visual emotion recognition, comparing early and late data fusion techniques.
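A hedged sketch of the codebook step, with random vectors standing in for the real space-time descriptors and arbitrary sizes: a sparse dictionary is learned over local features, and each sample is then encoded by its sparse codes.

```python
# Illustrative only: dictionary learning + sparse coding of motion descriptors.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(2)
descriptors = rng.normal(size=(300, 64))      # stand-ins for local space-time features

dico = DictionaryLearning(n_components=32, transform_algorithm="lasso_lars",
                          transform_alpha=0.5, random_state=0, max_iter=50)
codes = dico.fit_transform(descriptors)       # sparse codes w.r.t. the learned codebook

print("codebook shape:", dico.components_.shape)          # (32, 64)
print("avg non-zeros per code:", (codes != 0).sum(1).mean())
```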

Keynote 2

Giving interaction a hand: deep models of co-speech gesture in multimodal systems BIBAFull-Text 245-246
  Stefan Kopp
Humans frequently join words and gestures for multimodal communication. Such natural co-speech gesturing goes far beyond what can currently be processed by gesture-based interfaces, and especially its coordination with speech still poses open challenges for basic research and multimodal interfaces alike. How can we develop computational models for processing and generating natural speech-gesture behavior in a flexible, fast and adaptive manner similar to humans? In this talk I will review approaches and methods applied to this problem and argue that such models need to (and can) be based on a deeper understanding of what shapes co-speech gesturing in a particular situation. I will present work that connects empirical analyses with computational modeling and evaluation to unravel the cognitive, embodied and socio-interactional mechanisms underlying the use of speech-accompanying gestural behavior, and to develop deeper models of these mechanisms for interactive systems such as virtual characters, humanoid robots, or multimodal interfaces.

Oral session 4: embodied interfaces

Five key challenges in end-user development for tangible and embodied interaction BIBAFull-Text 247-254
  Daniel Tetteroo; Iris Soute; Panos Markopoulos
As tangible and embodied systems make the transition from the labs to everyday life, there is a growth in application-related research and design work in this field. We argue that the potential of these technologies can be leveraged even further by enabling domain experts such as teachers, therapists and home owners to act as end-user developers in order to modify and create content for their tangible interactive systems. However, there are important issues that need to be addressed if we want to enable these end users to act as developers. In this paper we identify five key challenges for meta-designers in enabling end-users to develop for tangible and embodied interaction.
How can I help you?: comparing engagement classification strategies for a robot bartender BIBAFull-Text 255-262
  Mary Ellen Foster; Andre Gaschler; Manuel Giuliani
A robot agent existing in the physical world must be able to understand the social states of the human users it interacts with in order to respond appropriately. We compared two implemented methods for estimating the engagement state of customers for a robot bartender based on low-level sensor data: a rule-based version derived from the analysis of human behaviour in real bars, and a trained version using supervised learning on a labelled multimodal corpus. We first compared the two implementations using cross-validation on real sensor data and found that nearly all classifier types significantly outperformed the rule-based classifier. We also carried out feature selection to see which sensor features were the most informative for the classification task, and found that the position of the head and hands were relevant, but that the torso orientation was not. Finally, we performed a user study comparing the ability of the two classifiers to detect the intended user engagement of actual customers of the robot bartender; this study found that the trained classifier was faster at detecting initial intended user engagement, but that the rule-based classifier was more stable.
Comparing task-based and socially intelligent behaviour in a robot bartender BIBAFull-Text 263-270
  Manuel Giuliani; Ronald P. A. Petrick; Mary Ellen Foster; Andre Gaschler; Amy Isard; Maria Pateraki; Markos Sigalas
We address the question of whether service robots that interact with humans in public spaces must express socially appropriate behaviour. To do so, we implemented a robot bartender which is able to take drink orders from humans and serve drinks to them. By using a high-level automated planner, we explore two different robot interaction styles: in the task only setting, the robot simply fulfils its goal of asking customers for drink orders and serving them drinks; in the socially intelligent setting, the robot additionally acts in a manner socially appropriate to the bartender scenario, based on the behaviour of humans observed in natural bar interactions. The results of a user study show that the interactions with the socially intelligent robot were somewhat more efficient, but the two implemented behaviour settings had only a small influence on the subjective ratings. However, there were objective factors that influenced participant ratings: the overall duration of the interaction had a positive influence on the ratings, while the number of system order requests had a negative influence. We also found a cultural difference: German participants gave the system higher pre-test ratings than participants who interacted in English, although the post-test scores were similar.
A dynamic multimodal approach for assessing learners' interaction experience BIBAFull-Text 271-278
  Imène Jraidi; Maher Chaouachi; Claude Frasson
In this paper we seek to model the users' experience within an interactive learning environment. More precisely, we are interested in assessing three extreme trends in the interaction experience, namely flow (a perfect immersion within the task), stuck (a difficulty to maintain focused attention) and off-task (a drop out from the task). We propose a hierarchical probabilistic framework using a dynamic Bayesian network to simultaneously assess the probability of experiencing each trend, as well as the emotional responses occurring subsequently. The framework combines three-modality diagnostic variables that sense the learner's experience including physiology, behavior and performance, predictive variables that represent the current context and the learner's profile, and a dynamic structure that tracks the temporal evolution of the learner's experience. We describe the experimental study conducted to validate our approach. A protocol was established to elicit the three target trends as 44 participants interacted with three learning environments involving different cognitive tasks. Physiological activities (electroencephalography, skin conductance and blood volume pulse), patterns of the interaction, and performance during the task were recorded. We demonstrate that the proposed framework outperforms conventional non-dynamic modeling approaches such as static Bayesian networks, as well as three non-hierarchical formalisms including naive Bayes classifiers, decision trees and support vector machines.
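The temporal tracking can be illustrated with a deliberately simplified filter (invented numbers, and a naive-Bayes fusion of the three modalities rather than the paper's full hierarchical model): a hidden trend in {flow, stuck, off-task} is predicted with a transition matrix and corrected with the fused modality likelihoods.

```python
# Illustrative only: one filtering step over a hidden learner-experience trend.
import numpy as np

states = ["flow", "stuck", "off-task"]
A = np.array([[0.8, 0.15, 0.05],    # transition probabilities between trends
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

def step(belief, likelihoods_per_modality):
    """Predict with A, then fuse the per-modality observation likelihoods."""
    pred = belief @ A
    obs = np.prod(likelihoods_per_modality, axis=0)   # fuse physiology, behaviour, performance
    post = pred * obs
    return post / post.sum()

b = np.array([1 / 3, 1 / 3, 1 / 3])
obs_t = np.array([[0.5, 0.4, 0.1],    # P(physiology  | trend)
                  [0.2, 0.5, 0.3],    # P(behaviour   | trend)
                  [0.3, 0.5, 0.2]])   # P(performance | trend)
b = step(b, obs_t)
print(dict(zip(states, b.round(3))))
```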

Oral session 5: hand and body

Relative accuracy measures for stroke gestures BIBAFull-Text 279-286
  Radu-Daniel Vatavu; Lisa Anthony; Jacob O. Wobbrock
Current measures of stroke gesture articulation lack descriptive power because they only capture absolute characteristics about the gesture as a whole, not fine-grained features that reveal subtleties about the gesture articulation path. We present a set of twelve new relative accuracy measures for stroke gesture articulation that characterize the geometric, kinematic, and articulation accuracy of single and multi-stroke gestures. To compute the accuracy measures, we introduce the concept of a gesture task axis. We evaluate our measures on five public datasets comprising 38,245 samples from 107 participants, about which we make new discoveries; e.g., gestures articulated at fast speed are shorter in path length than slow or medium-speed gestures, but their path lengths vary the most, a finding that helps understand recognition performance. This work will enable a better understanding of users' stroke gesture articulation behavior, ultimately leading to better gesture set designs and more accurate recognizers.
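In the spirit of such relative measures (an assumed formulation, not the paper's exact definitions), one can resample a stroke, treat the template gesture as the task axis, and report the path-length ratio together with the mean deviation from that axis:

```python
# Illustrative only: two simple relative measures for a stroke gesture.
import numpy as np

def resample(points, n=64):
    """Resample a 2D stroke to n equidistant points along its arc length."""
    pts = np.asarray(points, float)
    d = np.r_[0, np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))]
    t = np.linspace(0, d[-1], n)
    return np.c_[np.interp(t, d, pts[:, 0]), np.interp(t, d, pts[:, 1])]

def relative_measures(gesture, template, n=64):
    g, a = resample(gesture, n), resample(template, n)
    len_g = np.linalg.norm(np.diff(g, axis=0), axis=1).sum()
    len_a = np.linalg.norm(np.diff(a, axis=0), axis=1).sum()
    shape_error = np.linalg.norm(g - a, axis=1).mean()   # mean deviation from the task axis
    return {"path_length_ratio": len_g / len_a, "shape_error": shape_error}

template = [(0, 0), (100, 0)]                       # an ideal horizontal stroke
articulated = [(0, 3), (30, -4), (70, 6), (98, 1)]  # a wobbly user articulation
print(relative_measures(articulated, template))
```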
LensGesture: augmenting mobile interactions with back-of-device finger gestures BIBAFull-Text 287-294
  Xiang Xiao; Teng Han; Jingtao Wang
We present LensGesture, a pure software approach for augmenting mobile interactions with back-of-device finger gestures. LensGesture detects full and partial occlusion as well as the dynamic swiping of fingers on the camera lens by analyzing image sequences captured by the built-in camera in real time. We report the feasibility and implementation of LensGesture as well as newly supported interactions. Through offline benchmarking and a 16-subject user study, we found that 1) LensGesture is easy to learn, intuitive to use, and can serve as an effective supplemental input channel for today's smartphones; 2) LensGesture can be detected reliably in real time; 3) LensGesture based target acquisition conforms to Fitts' Law and the information transmission rate is 0.53 bits/sec; and 4) LensGesture applications can improve the usability and the performance of existing mobile interfaces.
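As a back-of-the-envelope check of the reported throughput, assuming the Shannon formulation of Fitts' law, ID = log2(D/W + 1) and throughput = ID / movement time, so 0.53 bits/sec corresponds to roughly 4 s for a target of about 2.1 bits:

```python
# Illustrative only: Fitts' law index of difficulty and throughput.
import math

def index_of_difficulty(distance, width):
    return math.log2(distance / width + 1)

def throughput(distance, width, movement_time_s):
    return index_of_difficulty(distance, width) / movement_time_s

ID = index_of_difficulty(distance=320, width=100)    # hypothetical target geometry in pixels
print(round(ID, 2), "bits")                          # ~2.07 bits
print(round(ID / 0.53, 2), "s expected at 0.53 bits/sec")
```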
Aiding human discovery of handwriting recognition errors BIBAFull-Text 295-302
  Ryan Stedman; Michael Terry; Edward Lank
Handwriting recognizers occasionally misinterpret digital ink input, requiring users to compare their ink input and the recognizer output to identify and correct errors. Technologies like Anoto pens can make this error discovery and correction task more difficult, because verification of recognizer output may occur many hours after data input and may involve the verification of many documents. In this paper, we explore the design space for error discovery aids geared toward "out-of-the-moment" verification. We present three discovery techniques, a visual proximity technique, a multimodal technique, and a character manipulation technique, and analyze the performance of our techniques. Experimental results show that the visual proximity technique outperforms all others in number of errors caught, and is also significantly better than a control technique. This paper is the first experimental study of techniques that aid in the discovery of handwriting recognition errors.
Context-based conversational hand gesture classification in narrative interaction BIBAFull-Text 303-310
  Shogo Okada; Mayumi Bono; Katsuya Takanashi; Yasuyuki Sumi; Katsumi Nitta
Communicative hand gestures play important roles in face-to-face conversations. These gestures are used in ways that vary between individuals; even when two speakers narrate the same story, they do not always use the same hand gesture (movement, position, and motion trajectory) to describe the same scene. In this paper, we propose a framework for the classification of communicative gestures in small group interactions. We focus on how many times the hands are held in a gesture and how long a speaker continues a hand stroke, instead of observing hand positions and hand motion trajectories. In addition, to model communicative gesture patterns, we use nonverbal features of participants that co-occur with the gestures. We extract features of the gesture phases defined by Kendon (2004) and nonverbal patterns co-occurring with gestures, i.e., the utterances, head gestures, and head direction of each participant, using pattern recognition techniques. In the experiments, we collect eight group narrative interaction datasets to evaluate the classification performance. The experimental results show that gesture phase features and nonverbal features of other participants improve the performance of discriminating communicative gestures used in narrative speech from other gestures by 4% to 16%.

Demo session 2

A haptic touchscreen interface for mobile devices BIBAFull-Text 311-312
  Jong-Uk Lee; Jeong-Mook Lim; Heesook Shin; Ki-Uk Kyung
In this paper, we present a haptic touchscreen interface for mobile devices. A surface actuator composed of two parallel plates is mounted between a touch panel and a display module. It generates haptic feedback when the user provides input on the touch screen: an electrostatic force is generated when the two parallel plates are charged, and this phenomenon produces the haptic feedback. When an input is detected on the touch screen, multimodal feedback that includes not only basic visual and auditory feedback but also haptic feedback is produced. The user feels a realistic physical sensation in the fingertip, such as the feeling of pressing a real keyboard. We have designed and implemented a thin, transparent actuator to provide haptic feedback, and an interactive architecture to produce multimodal output.
A social interaction system for studying humor with the Robot NAO BIBAFull-Text 313-314
  Laurence Y. Devillers; Mariette Soury
The video of our demonstrator presents a social interaction system for studying humor with the Aldebaran robot NAO. Our application records and analyzes the audio and video streams to provide real-time feedback. Using this dialog system during show & tell sessions at Interspeech 2013, we collected different kinds of laughter (positive and negative) from 45 subjects. The participants were involved in a verbal exchange with NAO, including tongue-twister games and jokes, as well as witty remarks and laughs from the robot. The captured conversation data are used here to study the behaviors of subjects with various personalities and cultural backgrounds.
TaSST: affective mediated touch BIBAFull-Text 315-316
  Aduen Darriba Freriks; Dirk Heylen; Gijs Huisman
Communication with others occurs through a multitude of signals, such as speech, facial expressions, and body postures. Understudied in this regard is the way we use our sense of touch in social communication. In this paper we present the TaSST (Tactile Sleeve for Social Touch), a haptic communication device that enables two people to communicate through touch at a distance.
Talk ROILA to your Robot BIBAFull-Text 317-318
  Omar Mubin; Joshua Henderson; Christoph Bartneck
In our research we present a speech-recognition-friendly artificial language that is specially designed and implemented for humans to talk to robots. We call this language Robot Interaction Language (ROILA). In this paper, we describe our current work with ROILA that utilizes the Nao humanoid robot. Our current demo implementation allows users to interact with the Nao robot without the use of any external laptops or microphones. The purpose of our demo is therefore two-fold: 1) to demonstrate "live" that ROILA has improved recognition accuracy over English and 2) to demonstrate that users can interact with the Nao robot in ROILA without the use of any external devices.
NEMOHIFI: an affective HiFi agent BIBAFull-Text 319-320
  Syaheerah Lebai Lutfi; Fernando Fernandez-Martinez; Jaime Lorenzo-Trueba; Roberto Barra-Chicote; Juan Manuel Montero
This demo concerns a recently developed prototype of an emotionally-sensitive autonomous HiFi Spoken Conversational Agent, called NEMOHIFI. The baseline agent was developed by the Speech Technology Group (GTH) and has recently been integrated with an emotional engine called NEMO (Need-inspired Emotional Model) to enable it to adapt to users' emotions and respond using appropriate expressive speech. NEMOHIFI controls and manages the HiFi audio system; for end users, its functions equate to those of a remote control, except that instead of clicking, the user interacts with the agent by voice. A pairwise comparison between the baseline (non-adaptive) agent and NEMOHIFI showed that users not only showed a statistically significant preference for the latter, but were also significantly more satisfied with it.

Poster session 2: doctoral spotlight

Persuasiveness in social multimedia: the role of communication modality and the challenge of crowdsourcing annotations BIBAFull-Text 321-324
  Sunghyun Park
With the exponential growth of social multimedia content online, it is increasingly important to understand why and how some content is perceived as persuasive while other content is ignored. This paper outlines my research goals in understanding human perception of persuasiveness in social multimedia content, which involve studying how different communication modalities influence our perception and identifying the key verbal and nonverbal behaviors that eventually lead us to believe someone is convincing and influential. For any research involving in-depth human behavior analysis, it is imperative to obtain accurate annotations of human behaviors at the micro-level. In addition to investigating persuasiveness, this work will also provide the research community with convenient web-based annotation tools, effective procedures for obtaining high-quality annotations with crowdsourcing, and evaluation metrics to fairly and accurately measure the quality and agreement of micro-level behavior annotations.
Towards a dynamic view of personality: multimodal classification of personality states in everyday situations BIBAFull-Text 325-328
  Kyriaki Kalimeri
A new perspective on the automatic recognition of personality is proposed, shifting our focus from the traditional goal of using behaviors to infer personality traits to the classification of excerpts of social behavior into personality states. Personality states are specific behavioral episodes that can be described as having the same content as traits, wherein a person behaves more or less introvertedly/extravertedly, more or less neurotically, etc., depending on the social situation. Exploiting the Sociometric Badge Corpus, a first step towards addressing this new perspective is presented, starting from the automatic classification of personality states from multimodal behavioral cues. The effectiveness of these cues, as well as of other situational characteristics, is investigated for the sake of personality state classification. Moreover, a first approach towards the automatic discovery of situational characteristics is proposed.
Designing effective multimodal behaviors for robots: a data-driven perspective BIBAFull-Text 329-332
  Chien-Ming Huang
Robots need to effectively use multimodal behaviors, including speech, gaze, and gestures, in support of their users to achieve intended interaction goals, such as improved task performance. This proposed research concerns designing effective multimodal behaviors for robots to interact with humans using a data-driven approach. In particular, probabilistic graphical models (PGMs) are used to model the interdependencies among multiple behavioral channels and generate complexly contingent multimodal behaviors for robots to facilitate human-robot interaction. This data-driven approach not only allows the investigation of hidden and temporal relationships among behavioral channels but also provides a holistic perspective on how multimodal behaviors as a whole might shape interaction outcomes. Three studies are proposed to evaluate the proposed data-driven approach and to investigate the dynamics of multimodal behaviors and interpersonal interaction. This research will contribute to the multimodal interaction community in theoretical, methodological, and practical aspects.
Controllable models of gaze behavior for virtual agents and humanlike robots BIBAFull-Text 333-336
  Sean Andrist
Embodied social agents, through their ability to afford embodied interaction using nonverbal human communicative cues, hold great promise in application areas such as education, training, rehabilitation, and collaborative work. Gaze cues are particularly important for achieving significant social and communicative goals. In this research, I explore how agents -- both virtual agents and humanlike robots -- might achieve such goals through the use of various gaze mechanisms. To this end, I am developing computational control models of gaze behavior that treat gaze as the output of a system with a number of multimodal inputs. These inputs can be characterized at different levels of interaction, from non-interactive (e.g., physical characteristics of the agent itself) to fully interactive (e.g., speech and gaze behavior of a human interlocutor). This research will result in a number of control models that each focus on a different gaze mechanism, combined into an open-source library of gaze behaviors that will be usable by both human-robot and human-virtual agent interaction designers. System-level evaluations in naturalistic settings will validate this gaze library for its ability to evoke positive social and cognitive responses in human users.
The nature of the bots: how people respond to robots, virtual agents and humans as multimodal stimuli BIBAFull-Text 337-340
  Jamy Li
This research agenda aims to understand how people treat robots along two dialectics. In the mechanical-living dialectic, fabricated entities are assessed against their organic counterparts to see if people respond differently to robots versus other people. Multiple experiments are conducted that compare human-robot relationships to human-human relationships by manipulating roles in videos of dyadic conversations shown to participants. In the physical-digital dialectic, a physically embodied robotic agent is compared to either a digitally-presented virtual agent (such as an animated character on a computer screen) or a digitally-presented robotic agent (such as a live video feed of the robot). The role of physical and digital embodiment and display medium are explored through a comprehensive survey and analysis of existing experimental works comparing physical and digital agents. Key research questions, related work, scope, research approach, current findings and remaining work are outlined.
Adaptive virtual rapport for embodied conversational agents BIBAFull-Text 341-344
  Ivan Gris Sepulveda
In this paper I describe my research goals and hypotheses regarding human-computer relationships with embodied conversational agents (ECAs). I include important studies of related research that inform and direct my own efforts. I explain the current state and some technical aspects of the ECAs I have helped to create, and past experiments regarding human-ECA familiarity, ECA design and analysis, and multiparty ECA interaction, including our semi-automated corpora collection techniques, analysis methodology, and their respective results to date. Finally, I conclude with an overall presentation of all the current studies I have worked on, and future possibilities for my final dissertation and post-dissertation research related to virtual human-ECA rapport.
3D head pose and gaze tracking and their application to diverse multimodal tasks BIBAFull-Text 345-348
  Kenneth Alberto Funes Mora
In this PhD thesis the problem of 3D head pose and gaze tracking with minimal user cooperation is addressed. By exploiting characteristics of RGB-D sensors, contributions have been made to the problems that arise from this lack of cooperation, in particular head pose and inter-person appearance variability, as well as the handling of low resolution. The resulting system has enabled diverse multimodal applications. In particular, recent work combined multiple RGB-D sensors to detect gazing events in dyadic interactions.
   The research plan consists of: i) improving the robustness, accuracy and usability of the head pose and gaze tracking system; ii) using additional multimodal cues, such as speech and dynamic context, to train and adapt gaze models in an unsupervised manner; iii) extending the application of 3D gaze estimation to diverse multimodal applications, including visual focus of attention tasks involving multiple visual targets, e.g. people in a meeting-like setup.
Towards developing a model for group involvement and individual engagement BIBAFull-Text 349-352
  Catharine Oertel
This PhD project is concerned with the multi-modal modeling of conversational dynamics. In particular I focus on investigating how people organise themselves within a multiparty conversation. I am interested in identifying bonds between people, their individual engagement level in the conversation and how the engagement level of the individual person influences the perceived involvement of the whole group of people. To this end machine learning experiments are carried out and I am planning to build a conversational involvement module to be implemented in a dialogue system.
Gesture recognition using depth images BIBAFull-Text 353-356
  Bin Liang
This work presents an approach for recognizing 3D human gestures using depth images. The proposed motion trail model (MTM) consists of both motion information and static posture information over the gesture sequence along the xoy-plane. By projecting depth images onto the other two planes in 3D space, gestures can be represented with complementary information from the additional planes; accordingly, the 2D-MTM is extended into 3D space, beyond the lateral scene parallel to the image plane, to generate the 3D-MTM. The Histogram of Oriented Gradients (HOG) is then extracted from the proposed 3D-MTM as the feature descriptor. The final recognition of gestures is performed via the maximum correlation coefficient. Preliminary results demonstrate that the average error rate decreases from 62.80% for the baseline method to 21.74% with the proposed approach on the ChaLearn gesture dataset.
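A minimal sketch of the matching pipeline described above, assuming the motion trail images have already been built: a HOG descriptor is computed per MTM image and a gesture is assigned to the class whose template descriptor has the highest correlation coefficient with the query. Function names and HOG parameters are illustrative, not taken from the paper.

```python
import numpy as np
from skimage.feature import hog

def mtm_descriptor(mtm_image):
    """HOG descriptor of a (grayscale) motion trail image."""
    return hog(mtm_image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def classify_by_correlation(query_desc, class_templates):
    """Pick the class whose template descriptor correlates most with the query."""
    scores = {label: np.corrcoef(query_desc, tmpl)[0, 1]
              for label, tmpl in class_templates.items()}
    return max(scores, key=scores.get)
```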
Modeling semantic aspects of gaze behavior while catalog browsing BIBAFull-Text 357-360
  Erina Ishikawa
Gaze behavior is one of the crucial clues for understanding human mental states. The goal of this study is to build a probabilistic model that represents the relationships between users' gaze behavior and user states during catalog browsing. In the proposed model, users' gaze behavior is interpreted based on the semantic and spatial relationships among the objects constituting the displayed contents, which are referred to as designed structures. Moreover, a method for estimating users' mental states based on the model is also proposed, and the model is evaluated by measuring the performance of user state estimation. Results from preliminary experiments show that the proposed model improved the estimation accuracy of user states compared to baseline methods.
Computational behaviour modelling for autism diagnosis BIBAFull-Text 361-364
  Shyam Sundar Rajagopalan
Autism Spectrum Disorders (ASD), often referred to as autism, are neurological disorders characterised by deficits in cognitive skills, social and communicative behaviours. ASD develop in early childhood and include a spectrum of related problems, such as Asperger Syndrome, Autistic Disorder, and Pervasive Development Disorder. A common way of diagnosing ASD is by studying behavioural cues expressed by the children. The focus of my PhD project is to model the common atypical behaviour cues of children suffering from ASD. These models could assist clinicians in diagnosing autism and alert parents/caregivers for early intervention. The behaviours will be studied in a discrete manner, in the context of a social dyadic conversational setting, using visual and speech signals as well as a fusion of multiple modalities. As part of my initial work, an algorithm based on dense trajectories and the short-time Fourier transform is proposed for modelling stimming (repetitive) behaviour. To validate the approach, preliminary experiments are performed on human action recognition datasets that contain repetitive behaviours. In addition, publicly available videos of children exhibiting repetitive behaviours were also used.
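The abstract above mentions detecting repetitive (stimming) behaviour with a short-time Fourier transform over motion trajectories. The following is an illustrative sketch, not the author's implementation: a 1D trajectory (e.g. a tracked point's x-coordinate over time) is flagged as repetitive when most STFT windows contain a dominant peak in an assumed rocking/flapping frequency band. All thresholds and band limits are assumptions.

```python
import numpy as np
from scipy.signal import stft

def is_repetitive(trajectory, fps=30.0, min_hz=0.5, max_hz=4.0, ratio=3.0):
    f, t, Z = stft(trajectory, fs=fps, nperseg=64)
    power = np.abs(Z) ** 2
    band = (f >= min_hz) & (f <= max_hz)          # assumed band for rhythmic motion
    band_peak = power[band].max(axis=0)           # strongest periodic component per window
    baseline = power.mean(axis=0) + 1e-12         # average spectral power per window
    # Call the clip repetitive if most windows show a dominant periodic peak.
    return np.mean(band_peak / baseline > ratio) > 0.5
```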

Grand challenge overviews

ChaLearn multi-modal gesture recognition 2013: grand challenge and workshop summary BIBAFull-Text 365-368
  Sergio Escalera; Jordi Gonzàlez; Xavier Baró; Miguel Reyes; Isabelle Guyon; Vassilis Athitsos; Hugo Escalante; Leonid Sigal; Antonis Argyros; Cristian Sminchisescu; Richard Bowden; Stan Sclaroff
We organized a Grand Challenge and Workshop on Multi-Modal Gesture Recognition.
   The MMGR Grand Challenge focused on the recognition of continuous natural gestures from multi-modal data (including RGB, Depth, user mask, Skeletal model, and audio). We made available a large labeled video database of 13,858 gestures from a lexicon of 20 Italian gesture categories recorded with a Kinect™ camera. More than 54 teams participated in the challenge and a final error rate of 12% was achieved by the winner of the competition. Winners of the competition published their work in the workshop of the Challenge.
   The MMGR Workshop was held at the ICMI 2013 conference in Sydney. A total of 9 relevant papers on multi-modal gesture recognition were accepted for presentation. These cover multi-modal descriptors, multi-class learning strategies for segmentation and classification in temporal data, as well as relevant applications in the field, including multi-modal Social Signal Processing and multi-modal Human Computer Interfaces. Five invited speakers participated in the workshop: Profs. Leonid Sigal from Disney Research, Antonis Argyros from FORTH, Institute of Computer Science, Cristian Sminchisescu from Lund University, Richard Bowden from the University of Surrey, and Stan Sclaroff from Boston University. They summarized their research in the field and discussed past, current, and future challenges in Multi-Modal Gesture Recognition.
Emotion recognition in the wild challenge (EmotiW) challenge and workshop summary BIBAFull-Text 371-372
  Abhinav Dhall; Roland Goecke; Jyoti Joshi; Michael Wagner; Tom Gedeon
The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2013 Grand Challenge consists of an audio-video based emotion classification challenge, which mimics real-world conditions. In total, 27 teams participated in the challenge. The database in the 2013 challenge is the Acted Facial Expression in the Wild (AFEW), which has been collected from movies showing close-to-real-world conditions.
ICMI 2013 grand challenge workshop on multimodal learning analytics BIBAFull-Text 373-378
  Louis-Philippe Morency; Sharon Oviatt; Stefan Scherer; Nadir Weibel; Marcelo Worsley
Advances in learning analytics are contributing new empirical findings, theories, methods, and metrics for understanding how students learn. They also contribute to improving pedagogical support for students' learning through the assessment of new digital tools, teaching strategies, and curricula. Multimodal learning analytics (MMLA) [1] is an extension of learning analytics that emphasizes the analysis of natural, rich modalities of communication across a variety of learning contexts. This MMLA Grand Challenge combines expertise from the learning sciences and machine learning in order to highlight the rich opportunities that exist at the intersection of these disciplines. As part of the Grand Challenge, researchers were asked to predict: (1) which student in a group was the dominant domain expert, and (2) which problems that the group worked on would be solved correctly or not. Analyses were based on a combination of speech, digital pen and video data. This paper describes the motivation for the grand challenge, the publicly available data resources, and the results reported by the challenge participants. The results demonstrate that multimodal prediction of the challenge goals: (1) is surprisingly reliable using rich multimodal data sources, (2) can be accomplished using any of the three modalities explored, and (3) need not be based on content analysis.

Keynote 3

Hands and speech in space: multimodal interaction with augmented reality interfaces BIBAFull-Text 379-380
  Mark Billinghurst
Augmented Reality (AR) is technology that allows virtual imagery to be seamlessly integrated into the real world. Although first developed in the 1960s, AR has only recently become widely available, through platforms such as the web and mobile phones. However, most AR interfaces offer very simple interaction, such as touch on phone screens or camera tracking from real images. New depth sensing and gesture tracking technologies such as the Microsoft Kinect or Leap Motion have made it easier than ever before to track hands in space. Combined with speech recognition and AR tracking and viewing software, it is possible to create interfaces that allow users to manipulate 3D graphics in space through a natural combination of speech and gesture. In this paper I will review previous research in multimodal AR interfaces and give an overview of the significant research questions that need to be addressed before speech and gesture interaction can become commonplace.

Oral session 6: AR, VR & mobile

Evaluating dual-view perceptual issues in handheld augmented reality: device vs. user perspective rendering BIBAFull-Text 381-388
  Klen Copic Pucihar; Paul Coulton; Jason Alexander
In handheld Augmented Reality (AR) the magic-lens paradigm is typically implemented by rendering the video stream captured by the back-facing camera onto the device's screen. Unfortunately, such implementations show the real world from the device's perspective rather than the user's perspective. This dual perspective results in misaligned and incorrectly scaled imagery, a predominant cause of the dual-view problem, with the potential to distort users' spatial perception. This paper presents a user study that analyzes users' expectations, spatial perception, and their ability to deal with the dual-view problem, by comparing device-perspective and fixed Point-of-View (POV) user-perspective rendering. The results confirm the existence of the dual-view perceptual issue and show that the majority of participants expect user-perspective rendering irrespective of their previous AR experience. Participants also demonstrated significantly better spatial perception with, and preference for, the user-perspective view.
MM+Space: n x 4 degree-of-freedom kinetic display for recreating multiparty conversation spaces BIBAFull-Text 389-396
  Kazuhiro Otsuka; Shiro Kumano; Ryo Ishii; Maja Zbogar; Junji Yamato
A novel system, called MM+Space, is presented for recreating multiparty face-to-face conversation scenes in the real world. It aims to display and play back pre-recorded conversations as if the people were talking in front of the viewer(s). The system consists of multiple projectors and transparent screens, which display the life-size faces of people. The key idea is the physical augmentation of human head motions, i.e. the screen pose is dynamically controlled to emulate the head motions, boosting the viewers' perception of nonverbal behaviors and interactions. In particular, MM+Space newly introduces 2-Degree-of-Freedom (DoF) translations, in the forward-backward and right-left directions, in addition to the 2-DoF head rotations (nodding and shaking) proposed in our former MM-Space system. The full 4-DoF kinetic display is expected to enhance the expressibility of head and body motions and to create a more realistic representation of interacting people. Experiments showed that the proposed system with 4-DoF motions outperformed the rotation-only system in increasing the perception of people's presence and in expressing their postures. In addition, viewers reported that the proposed system allowed them to experience rich emotional expressibility, immersion in the conversations, and potential behavioral/emotional contagion.
Investigating appropriate spatial relationship between user and AR character agent for communication using AR WoZ system BIBAFull-Text 397-404
  Reina Aramaki; Makoto Murakami
We aim to construct a system that can communicate with humans in everyday life environments, such as in their homes and on the street. To achieve this, we propose using a human-agent communication system that utilizes augmented reality (AR) technology, in which the AR character agent has no physical body, thus facilitating its safe performance in real environments. As a first step towards implementing the motion control component of the AR agent, we focus on investigating the appropriate spatial relationships between the user and the AR agent in experimental settings. Consequently, in this paper, we report on the construction of an AR agent Wizard of Oz (WoZ) system, in which the agent is operated by a hidden operator via remote control, to acquire human-agent interaction data through experimental trials. The interaction data are collected from a simple experimental setting in which a user sits at a desk and communicates with the AR agent standing on the desk. We also investigate the spatial relationship appropriate for communication between the user and the AR agent, and propose a position control strategy for the AR agent.
Inferring social activities with mobile sensor networks BIBAFull-Text 405-412
  Trinh Minh Tri Do; Kyriaki Kalimeri; Bruno Lepri; Fabio Pianesi; Daniel Gatica-Perez
While our daily activities usually involve interactions with others, current methods for activity recognition do not often exploit the relationship between social interactions and human activity. This paper addresses the problem of interpreting social activity from human interactions captured by mobile sensing networks. Our first goal is to discover different social activities, such as chatting with friends, from interaction logs and then characterize them by the set of people involved and the time and location of the occurring event. Our second goal is to perform automatic labeling of the discovered activities using predefined semantic labels such as coffee breaks, weekly meetings, or random discussions. Our analysis was conducted on a real-life interaction network of about fifty subjects who carried sociometric badges with Bluetooth and infrared sensors over 6 weeks. We show that the proposed system reliably recognized coffee breaks with 99% accuracy, while weekly meetings were recognized with 88% accuracy.

Oral session 7: eyes & body

Effects of language proficiency on eye-gaze in second language conversations: toward supporting second language collaboration BIBAFull-Text 413-420
  Ichiro Umata; Seiichi Yamamoto; Koki Ijuin; Masafumi Nishida
The importance of conversation in second languages during international collaboration continues to increase, as does the risk of miscommunication caused by differences in the linguistic proficiency of the participants. This study provides a basis for monitoring each participant's status by studying the effects of linguistic proficiency on communicative activities such as utterances and gazes in second language conversations. In such conversations, we found that gaze serves different functions from those in native language conversations and that partners' linguistic proficiency may also affect gaze functions.
Predicting where we look from spatiotemporal gaps BIBAFull-Text 421-428
  Ryo Yonetani; Hiroaki Kawashima; Takashi Matsuyama
When we watch videos, there exist spatiotemporal gaps between where we look and what we focus on, which result from temporally delayed responses and anticipation in eye movements. We focus on the underlying structures of those gaps and propose a novel method to predict points of gaze from video data. In the proposed method, we model the spatiotemporal patterns of salient regions that tend to be focused on, and statistically learn which types of patterns strongly appear around the points of gaze with respect to each type of eye movement. This allows us to exploit the structures of gaps affected by eye movements and salient motions for gaze-point prediction. The effectiveness of the proposed method is confirmed on several public datasets.
Automatic multimodal descriptors of rhythmic body movement BIBAFull-Text 429-436
  Marwa Mahmoud; Louis-Philippe Morency; Peter Robinson
Prolonged durations of rhythmic body gestures have been shown to correlate with different types of psychological disorders. To date, there is no automatic descriptor that can robustly detect those behaviours. In this paper, we propose a cyclic gesture descriptor that can detect and localise rhythmic body movements by taking advantage of both the colour and depth modalities. We show experimentally how our rhythmic descriptor successfully localises rhythmic gestures such as hand fidgeting, leg fidgeting and rocking, performing significantly better than the majority-vote classification baseline. Our experiments also demonstrate the importance of fusing both modalities, with a significant increase in performance compared to the individual modalities.
Multimodal analysis of body communication cues in employment interviews BIBAFull-Text 437-444
  Laurent Son Nguyen; Alvaro Marcos-Ramiro; Martha Marrón Romera; Daniel Gatica-Perez
Hand gestures and body posture are intimately linked to speech as they are used to enrich the vocal content, and are therefore inherently multimodal. As an important part of nonverbal behavior, body communication carries relevant information that can reveal social constructs as diverse as personality, internal states, or job interview outcomes. In this work, we analyze body communication cues in real dyadic employment interviews, where the protagonists of the interaction are seated. We use a mixture of body communicative features based on manual annotations and automated extraction methods to successfully predict two key organizational constructs, namely personality and job interview ratings. Our work also confirms the multimodal nature of body communication and shows that the speaking status can be used to improve the prediction performance of personality and hirability.

ChaLearn challenge and workshop on multi-modal gesture recognition

Multi-modal gesture recognition challenge 2013: dataset and results BIBAFull-Text 445-452
  Sergio Escalera; Jordi Gonzàlez; Xavier Baró; Miguel Reyes; Oscar Lopes; Isabelle Guyon; Vassilis Athitsos; Hugo Escalante
The recognition of continuous natural gestures is a complex and challenging problem due to the multi-modal nature of the involved visual cues (e.g. finger and lip movements, subtle facial expressions, body pose, etc.), as well as technical limitations such as spatial and temporal resolution and unreliable depth cues. In order to promote research advances in this field, we organized a challenge on multi-modal gesture recognition. We made available a large video database of 13,858 gestures from a lexicon of 20 Italian gesture categories recorded with a Kinect™ camera, providing the audio, skeletal model, user mask, RGB and depth images. The focus of the challenge was on user-independent multiple gesture learning. There are no resting positions, and the gestures are performed in continuous sequences lasting 1-2 minutes, containing between 8 and 20 gesture instances each. As a result, the dataset contains around 1,720,800 frames. In addition to the 20 main gesture categories, "distracter" gestures are included, meaning that additional audio and gestures outside the vocabulary appear in the sequences. The final evaluation of the challenge was defined in terms of the Levenshtein edit distance, where the goal was to recover the true order of gestures within each sequence. 54 international teams participated in the challenge, and outstanding results were obtained by the first-ranked participants.
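For readers unfamiliar with the evaluation metric, the following is a minimal sketch of Levenshtein edit distance over gesture label sequences, assuming predictions and ground truth are given as lists of class IDs; the challenge's exact normalisation and aggregation over samples are not reproduced here.

```python
def levenshtein(pred, truth):
    """Edit distance (insertions, deletions, substitutions) between label sequences."""
    m, n = len(pred), len(truth)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# Example: one spurious repetition costs one edit.
# levenshtein([3, 7, 7, 12], [3, 7, 12]) == 1
```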
Fusing multi-modal features for gesture recognition BIBAFull-Text 453-460
  Jiaxiang Wu; Jian Cheng; Chaoyang Zhao; Hanqing Lu
This paper proposes a novel multi-modal gesture recognition framework and introduces its application to continuous sign language recognition. A Hidden Markov Model is used to construct the audio feature classifier, and a skeleton feature classifier based on Dynamic Time Warping is trained to provide complementary information. The confidence scores generated by the two classifiers are first normalized and then combined into a weighted sum for the final recognition. Experimental results show that the precision and recall scores over 20 classes of our multi-modal recognition framework reach 0.8829 and 0.8890 respectively, which demonstrates that our method is able to correctly reject false detections caused by a single classifier. Our approach scored 0.12756 in mean Levenshtein distance and was ranked 1st in the 2013 Multi-modal Gesture Recognition Challenge.
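A hedged sketch of the late-fusion step described above: per-class confidence scores from the audio (HMM) and skeleton (DTW) classifiers are min-max normalised and combined with a weighted sum. The weight value and function names are illustrative assumptions.

```python
import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse(audio_scores, skeleton_scores, w_audio=0.6):
    """Weighted sum of normalised per-class scores; returns the winning class index."""
    combined = w_audio * minmax(audio_scores) + (1 - w_audio) * minmax(skeleton_scores)
    return int(np.argmax(combined))
```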
A multi modal approach to gesture recognition from audio and video data BIBAFull-Text 461-466
  Immanuel Bayer; Thierry Silbermann
We describe in this paper our approach for the Multi-modal Gesture Recognition Challenge organized by ChaLearn in conjunction with the ICMI 2013 conference. The competition's task was to learn a vocabulary of 20 types of Italian gestures performed by different persons and to detect them in sequences. We develop an algorithm to find the gesture intervals in the audio data, extract audio features from those intervals, and train two different models. We engineer features from the skeleton data and use the gesture intervals in the training data to train a model that we then apply to the test sequences using a sliding window. We combine the models through weighted averaging and find that this way of combining information from two different sources boosts the models' performance significantly.
Online RGB-D gesture recognition with extreme learning machines BIBAFull-Text 467-474
  Xi Chen; Markus Koskela
Gesture recognition is needed in many applications such as human-computer interaction and sign language recognition. The challenges of building an actual recognition system lie not only in reaching an acceptable recognition accuracy but also in meeting requirements for fast online processing. In this paper, we propose a method for online gesture recognition using RGB-D data from a Kinect sensor. Frame-level features are extracted from the RGB frames and the skeletal model obtained from the depth data, and then classified by multiple extreme learning machines. The outputs from the classifiers are aggregated to provide the final classification results for the gestures. We test our method on the ChaLearn multi-modal gesture challenge data. The results of the experiments demonstrate that the method can perform effective multi-class gesture recognition in real time.
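A minimal extreme learning machine sketch (random hidden layer, closed-form least-squares output weights) with a simple aggregation of per-frame outputs into a clip-level decision. Hidden layer size, activation, and the summation-based aggregation are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=256, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y, n_classes):
        d = X.shape[1]
        self.W = self.rng.normal(size=(d, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                  # random hidden-layer responses
        T = np.eye(n_classes)[y]                          # one-hot targets
        self.beta = np.linalg.pinv(H) @ T                 # closed-form output weights
        return self

    def predict_frames(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta   # per-frame class scores

def classify_clip(elm, frame_features):
    # Aggregate frame-level scores by summing, then take the best class.
    return int(np.argmax(elm.predict_frames(frame_features).sum(axis=0)))
```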
A multi-modal gesture recognition system using audio, video, and skeletal joint data BIBAFull-Text 475-482
  Karthik Nandakumar; Kong Wah Wan; Siu Man Alice Chan; Wen Zheng Terence Ng; Jian Gang Wang; Wei Yun Yau
This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach for detecting as well as recognizing the gestures. Automated gesture detection is performed using both audio signals and information about hand joints obtained from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three different modalities, namely, audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames into one of the 20 known gestures or an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and a Hidden Markov Model (HMM) is used for classification. While Space-Time Interest Points (STIP) are used to represent the RGB modality, a covariance descriptor is extracted from the skeletal joint data. For both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme is applied to accumulate evidence from all three modalities and predict the sequence of gestures in each test sample. The proposed gesture recognition system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While the proposed system recognizes the known gestures with high accuracy, most of the errors are caused by insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
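One ingredient named above, sketched under assumptions: a covariance descriptor of skeletal joint data (covariance of per-frame joint coordinate vectors, upper triangle vectorised) classified with an SVM. The kernel, C value, and helper names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

def covariance_descriptor(joints):          # joints: (n_frames, n_features)
    C = np.cov(joints, rowvar=False)        # feature-by-feature covariance over the clip
    iu = np.triu_indices(C.shape[0])
    return C[iu]                            # compact vector form of the symmetric matrix

def train_and_predict(train_clips, train_labels, test_clips):
    # train_clips / test_clips: lists of per-gesture joint arrays
    X_train = np.array([covariance_descriptor(c) for c in train_clips])
    X_test = np.array([covariance_descriptor(c) for c in test_clips])
    clf = SVC(kernel='rbf', C=10.0).fit(X_train, train_labels)
    return clf.predict(X_test)
```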
ChAirGest: a challenge for multimodal mid-air gesture recognition for close HCI BIBAFull-Text 483-488
  Simon Ruffieux; Denis Lalanne; Elena Mugellini
In this paper, we present a research-oriented open challenge focusing on multimodal gesture spotting and recognition from continuous sequences in the context of close human-computer interaction. We contextually outline the added value of the proposed challenge by presenting the most recent and popular challenges and corpora available in the field. We then present the procedures for data collection and corpus creation, and the tools that have been developed for participants. Finally, we introduce a novel single performance metric developed to quantitatively evaluate the spotting and recognition task with multiple sensors.
Gesture spotting and recognition using salience detection and concatenated hidden Markov models BIBAFull-Text 489-494
  Ying Yin; Randall Davis
We developed a gesture-salience-based hand tracking method, and a gesture spotting and recognition method based on concatenated hidden Markov models. A 3-fold cross-validation using the ChAirGest development data set with 10 users gives an F1 score of 0.907 and an accurate temporal segmentation rate (ATSR) of 0.923. The average final score is 0.9116. Compared with using the hand joint position from the Kinect SDK, using our hand tracking method gives a 3.7% absolute increase in the recognition F1 score.
Multi-modal social signal analysis for predicting agreement in conversation settings BIBAFull-Text 495-502
  Víctor Ponce-López; Sergio Escalera; Xavier Baró
In this paper we present a non-invasive ambient intelligence framework for the analysis of non-verbal communication applied to conversational settings. In particular, we apply feature extraction techniques to multi-modal audio-RGB-depth data. We compute a set of behavioral indicators that define communicative cues drawn from the fields of psychology and observational methodology. We test our methodology on data captured in victim-offender mediation scenarios. Using different state-of-the-art classification approaches, our system achieves over 75% accuracy in predicting agreement among the parties involved in the conversations, using the experts' opinions as ground truth.
Multi-modal descriptors for multi-class hand pose recognition in human computer interaction systems BIBAFull-Text 503-508
  Jordi Abella; Raúl Alcaide; Anna Sabaté; Joan Mas; Sergio Escalera; Jordi Gonzàlez; Coen Antens
Hand pose recognition in advanced Human Computer Interaction systems (HCI) is becoming more feasible thanks to the use of affordable multi-modal RGB-Depth cameras. The depth data generated by these sensors is very valuable input information, although the representation of 3D descriptors is still a critical step in obtaining robust object representations. This paper presents an overview of different multi-modal descriptors and provides a comparative study of two feature descriptors, Multi-modal Hand Shape (MHS) and Fourier-based Hand Shape (FHS), which compute local and global 2D-3D hand shape statistics to robustly describe hand poses. A new dataset of 38K hand poses has been created for real-time hand pose and gesture recognition, corresponding to five hand shape categories recorded from eight users. Experimental results show good performance of the fused MHS and FHS descriptors, improving recognition accuracy while assuring real-time computation in HCI scenarios.

Emotion recognition in the wild challenge and workshop

Emotion recognition in the wild challenge 2013 BIBAFull-Text 509-516
  Abhinav Dhall; Roland Goecke; Jyoti Joshi; Michael Wagner; Tom Gedeon
Emotion recognition is a very active field of research. The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2013 Grand Challenge consists of an audio-video based emotion classification challenge, which mimics real-world conditions. Traditionally, emotion recognition has been performed on laboratory controlled data. While undoubtedly worthwhile at the time, such laboratory controlled data poorly represents the environment and conditions faced in real-world situations. The goal of this Grand Challenge is to define a common platform for the evaluation of emotion recognition methods in real-world conditions. The database in the 2013 challenge is the Acted Facial Expression in the Wild (AFEW), which has been collected from movies showing close-to-real-world conditions.
Multiple kernel learning for emotion recognition in the wild BIBAFull-Text 517-524
  Karan Sikka; Karmen Dykstra; Suchitra Sathyanarayana; Gwen Littlewort; Marian Bartlett
We propose a method to automatically detect emotions in unconstrained settings as part of the 2013 Emotion Recognition in the Wild Challenge [16], organized in conjunction with the ACM International Conference on Multimodal Interaction (ICMI 2013). Our method combines multiple visual descriptors with paralinguistic audio features for multimodal classification of video clips. Extracted features are combined using Multiple Kernel Learning and the clips are classified using an SVM into one of the seven emotion categories: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise. The proposed method achieves competitive results, with an accuracy gain of approximately 10% above the challenge baseline.
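To make the fusion idea concrete, here is a rough illustration under assumptions: per-modality kernels are combined with fixed weights into a single Gram matrix for an SVM with a precomputed kernel. True multiple kernel learning learns the weights jointly with the classifier; the fixed weight below and the RBF choice are illustrative, not the authors' settings.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def combined_kernel(video_feats, audio_feats, w_video=0.7):
    K_video = rbf_kernel(video_feats)
    K_audio = rbf_kernel(audio_feats)
    return w_video * K_video + (1 - w_video) * K_audio

def train(video_feats, audio_feats, labels):
    K = combined_kernel(video_feats, audio_feats)
    return SVC(kernel='precomputed').fit(K, labels)

# Prediction would use the same weighted combination of cross-kernels,
# e.g. rbf_kernel(test_video, train_video), before calling clf.predict.
```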
Partial least squares regression on Grassmannian manifold for emotion recognition BIBAFull-Text 525-530
  Mengyi Liu; Ruiping Wang; Zhiwu Huang; Shiguang Shan; Xilin Chen
In this paper, we propose a method for video-based human emotion recognition. For each video clip, all frames are represented as an image set, which can be modeled as a linear subspace embedded in a Grassmannian manifold. After feature extraction, class-specific one-to-rest Partial Least Squares (PLS) is learned on the video and audio data respectively to distinguish each class from the other, easily confused, ones. Finally, an optimal fusion of the classifiers learned from both modalities (video and audio) is conducted at the decision level. Our method is evaluated on the Emotion Recognition In The Wild Challenge (EmotiW 2013). Experimental results on both the validation set and the blind test set are presented for comparison. The final accuracy achieved on the test set outperforms the baseline by 26%.
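An illustrative sketch of the subspace representation mentioned above, under assumptions: a clip's frame features are modelled as a linear subspace spanned by the top singular vectors, and similarity between clips is measured through the cosines of principal angles on the Grassmannian. The PLS learning step itself is omitted, and the subspace dimension is a placeholder.

```python
import numpy as np

def clip_subspace(frame_features, dim=10):
    # frame_features: (n_frames, d); columns of U span the clip's subspace
    U, _, _ = np.linalg.svd(frame_features.T, full_matrices=False)
    return U[:, :dim]

def grassmann_similarity(U1, U2):
    # Singular values of U1^T U2 are cosines of the principal angles.
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.sum(s ** 2))   # projection-kernel style similarity
```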
Emotion recognition with boosted tree classifiers BIBAFull-Text 531-534
  Matthew Day
In this paper, we describe a simple system to recognize emotions from short video sequences, developed for the Emotion Recognition in the Wild Challenge (EmotiW 2013). Performance matches the challenge baseline whilst being significantly faster and lower in complexity. Our experiments and subsequent discussion provide a number of insights into the problem.
Distribution-based iterative pairwise classification of emotions in the wild using LGBP-TOP BIBAFull-Text 535-542
  Timur R. Almaev; Anil Yüce; Alexandru Ghitulescu; Michel F. Valstar
Automatic facial expression analysis promises to be a game-changer in many application areas. But before this promise can be fulfilled, it has to move from the laboratory into the wild. The Emotion Recognition in the Wild challenge provides an opportunity to develop approaches in this direction. We propose a novel Distribution-based Pairwise Iterative Classification scheme, which outperforms standard multi-class classification on this challenge data. We also verify that the recently proposed dynamic appearance descriptor, Local Gabor Patterns on Three Orthogonal Planes, performs well on this real-world data, indicating that it is robust to the type of facial misalignments that can be expected in such scenarios. Finally, we provide details of ACTC, our affective computing tools on the cloud, which is a new resource for researchers in the field of affective computing.
Combining modality specific deep neural networks for emotion recognition in video BIBAFull-Text 543-550
  Samira Ebrahimi Kanou; Christopher Pal; Xavier Bouthillier; Pierre Froumenty; Çaglar Gülçehre; Roland Memisevic; Pascal Vincent; Aaron Courville; Yoshua Bengio; Raul Chandias Ferrari; Mehdi Mirza; Sébastien Jean; Pierre-Luc Carrier; Yann Dauphin; Nicolas Boulanger-Lewandowski; Abhishek Aggarwal; Jeremie Zumer; Pascal Lamblin; Jean-Philippe Raymond; Guillaume Desjardins; Razvan Pascanu; David Warde-Farley; Atousa Torabi; Arjun Sharma; Emmanuel Bengio; Kishore Reddy Konda; Zhenzhou Wu
In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature length movies. This involves the analysis of video clips of acted scenes lasting approximately one to two seconds, including the audio track which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of face images harvested from Google image search, followed by a per-frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top performing models into a single predictor we were able to produce an accuracy of 41.03% on the challenge test set. These compare favorably to the challenge baseline test set accuracy of 27.56%.
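A toy version of the aggregation strategies mentioned above, not the team's actual pipeline: per-frame class probabilities from a frame-level model are averaged into a clip prediction, and several modality-specific models are fused by weighted averaging. Weights and array shapes are assumptions for illustration.

```python
import numpy as np

def aggregate_frames(frame_probs):            # (n_frames, n_classes)
    return frame_probs.mean(axis=0)           # clip-level class distribution

def fuse_models(clip_probs_per_model, weights):
    P = np.stack(clip_probs_per_model)        # (n_models, n_classes)
    w = np.asarray(weights, dtype=float)[:, None]
    fused = (w * P).sum(axis=0) / w.sum()
    return int(np.argmax(fused))              # predicted emotion class
```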
Multi classifier systems and forward backward feature selection algorithms to classify emotional coloured speech BIBAFull-Text 551-556
  Sascha Meudt; Dimitri Zharkov; Markus Kächele; Friedhelm Schwenker
Systems for the recognition of psychological characteristics such as the emotional state in real world scenarios have to deal with several difficulties. Amongst those are unconstrained environments and uncertainties in one or several input channels. A more crucial aspect, however, is the content of the data itself. Psychological states are highly person-dependent, and often even humans are not able to determine the correct state a person is in. A successful recognition system thus has to deal with data that is not very discriminative and often simply misleading. In order to succeed, a critical view on features and decisions is essential to select only the most valuable ones. This work presents a comparison of a common multi-classifier system approach based on state-of-the-art features and a modified forward-backward feature selection algorithm with a long-term stopping criterion. The second approach also takes features of the voice quality family into account. Both approaches are based on the audio modality only. The dataset used in the challenge lies between the real-world datasets that are still very hard to handle and the over-acted datasets that were popular in the past and are well understood today.
Emotion recognition using facial and audio features BIBAFull-Text 557-564
  Tarun Krishna; Ayush Rai; Shubham Bansal; Shubham Khandelwal; Shubham Gupta; Dushyant Goel
Human Computer Interaction is an emerging scientific field which aims at communication between humans and computers. A major element of this field is human emotion recognition. The most expressive way humans display emotions is through facial expressions. Traditionally, emotion recognition has been performed on laboratory controlled data. While undoubtedly worthwhile at the time, such lab controlled data poorly represents the environment and conditions faced in real-world situations. With the increase in the number of video clips online, it is worthwhile to explore the performance of emotion recognition methods that work 'in the wild'. This work mainly focuses on automatic emotion recognition in wild video samples. We have worked on the problem of human emotion recognition using a combination of video features and audio features. Our technique for emotion detection involves a blend of optical flow, Gabor filtering, a few other facial features, and audio features. Training and classification are performed using a Support Vector Machine-Hidden Markov Model (HMM) combination. Our methodology produces better results than the baseline for some particular classes of emotions on this in-the-wild emotion dataset, with an overall accuracy of 20.51% on the test set.

Multimodal learning analytics challenge

Multimodal learning analytics: description of math data corpus for ICMI grand challenge workshop BIBAFull-Text 563-568
  Sharon Oviatt; Adrienne Cohen; Nadir Weibel
This paper provides documentation on dataset resources for establishing a new research area called multimodal learning analytics (MMLA). Research on this topic has the potential to transform the future of educational practice and technology, as well as computational techniques for advancing data analytics. The Math Data Corpus includes high-fidelity time-synchronized multimodal data recordings (speech, digital pen, images) on collaborating groups of students as they work together to solve mathematics problems that vary in difficulty level. The Math Data Corpus resources include initial coding of problem segmentation, problem-solving correctness, and representational content of students' writing. These resources are made available to participants in the data-driven grand challenge for the Second International Workshop on Multimodal Learning Analytics. The primary goal of this event is to analyze coherent signal, activity, and lexical patterns that can identify domain expertise and change in domain expertise early, reliably, and objectively, as well as learning-oriented precursors. An additional aim is to build an international research community in the emerging area of multimodal learning analytics by organizing a series of workshops that bring together multidisciplinary scientists to work on MMLA topics.
Problem solving, domain expertise and learning: ground-truth performance results for math data corpus BIBAFull-Text 569-574
  Sharon Oviatt
Problem solving, domain expertise, and learning are analyzed for the Math Data Corpus, which involves multimodal data on collaborating student groups as they solve math problems together across sessions. Compared with non-expert students, domain experts contributed more group solutions, solved more problems correctly, and took less time. These differences between experts and non-experts were accentuated on harder problems. A cumulative expertise metric validated that expert and non-expert students represented distinct, non-overlapping populations, a finding that replicated across sessions. Group performance also improved 9.4% across sessions, due mainly to learning by expert students. These findings satisfy ground-truth conditions for developing prediction techniques that aim to identify expertise based on multimodal communication and behavior patterns. Together with the Math Data Corpus, these results contribute valuable resources for supporting data-driven grand challenges on multimodal learning analytics, which aim to develop new techniques for predicting expertise early, reliably, and objectively, as well as learning-oriented precursors.
Automatic identification of experts and performance prediction in the multimodal math data corpus through analysis of speech interaction BIBAFull-Text 575-582
  Saturnino Luz
An analysis of multiparty interaction in the problem solving sessions of the Multimodal Math Data Corpus is presented. The analysis focuses on non-verbal cues extracted from the audio tracks. Algorithms for expert identification and performance prediction (correctness of solution) are implemented based on patterns of speech activity among session participants. Both of these categorisation algorithms employ an underlying graph-based representation of the dialogues for each individual problem-solving activity. The proposed Bayesian approach to expert prediction proved quite effective, reaching accuracy levels of over 92% with as few as 6 dialogues of training data. Performance prediction was not quite as effective. Although the simple graph-matching strategy employed for predicting incorrect solutions improved considerably over a Monte Carlo simulated baseline (the F1 score increased by a factor of 2.3), there is still much room for improvement on this task.
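A hedged sketch in the spirit of the speech-activity-based expert identification described above, not the paper's graph-based method: each participant is represented by simple vocalisation statistics per dialogue and a Gaussian naive Bayes model predicts which participant is the expert. The features, frame rate, and model choice are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def speech_activity_features(speech_mask, fps=100):
    # speech_mask: binary array, 1 where the participant is vocalising
    m = np.asarray(speech_mask, dtype=float)
    changes = np.diff(m)
    n_turns = int((changes == 1).sum())
    return [m.mean(),                          # fraction of time speaking
            n_turns / (len(m) / fps),          # speaking turns per second
            m.sum() / max(n_turns, 1) / fps]   # mean turn duration in seconds

def train_expert_model(per_dialogue_features, expert_labels):
    # per_dialogue_features: one feature vector per participant per dialogue
    return GaussianNB().fit(np.array(per_dialogue_features), expert_labels)
```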
Expertise estimation based on simple multimodal features BIBAFull-Text 583-590
  Xavier Ochoa; Katherine Chiluiza; Gonzalo Méndez; Gonzalo Luzardo; Bruno Guamán; James Castells
Multimodal Learning Analytics is a field that studies how to process learning data from dissimilar sources in order to automatically find useful information to give feedback to the learning process. This work processes the video, audio and pen stroke information included in the Math Data Corpus, a set of multimodal resources provided to the participants of the Second International Workshop on Multimodal Learning Analytics. The result of this processing is a set of simple features that can discriminate between experts and non-experts in groups of students solving mathematical problems. The main finding is that several of those simple features, namely the percentage of time that a student uses the calculator, the speed at which the student writes or draws, and the percentage of time that the student mentions numbers or mathematical terms, are good discriminators between expert and non-expert students. Precision levels of 63% are obtained for individual problems and up to 80% when full sessions (aggregations of 16 problems) are analyzed. While the results are specific to the recorded settings, the methodology used to obtain and analyze the features could be used to create discrimination models for other contexts.
Using micro-patterns of speech to predict the correctness of answers to mathematics problems: an exercise in multimodal learning analytics BIBAFull-Text 591-598
  Kate Thompson
Learning analytics techniques are traditionally used on the "big data" collected at the course or university level. The application of such techniques to the data sets generated in complex learning environments can provide insights into the relationships between the design of learning environments, the processes of learning, and learning outcomes. In this paper, two of the codes described as part of the Collaborative Process Analysis Coding Scheme (CPACS) were extracted from the Math Data Corpus. The codes selected were tense and pronouns, both of which have been found to indicate phases of group work and the action associated with collaboration. Rather than examine these measures of social interaction in isolation, a framework for the analysis of complex learning environments was applied. This facilitated an analysis of the relationships between social interactions, task design, learning outcomes, and tool use. The generation of a successful problem solution by one expert and one non-expert group was accurately predicted (75%-94%). The examination of interactions between the social, epistemic and tool elements of the learning environment for one group showed that successful role differentiation and participation were related to successful problem solutions in the first meeting; in the second meeting, these were less important. The relationship between discourse related to problem resolution through action and the correctness of a problem's solution was found to be a less reliable measure; further analysis at a finer grain is needed to investigate this finding. A rich description of the processes of learning (with regard to social interaction, generation of knowledge, and discourse related to action) was generated for one group.
Written and multimodal representations as predictors of expertise and problem-solving success in mathematics BIBAFull-Text 599-606
  Sharon Oviatt; Adrienne Cohen
One aim of multimodal learning analytics is to analyze rich natural communication modalities to identify domain expertise and learning rapidly and reliably. In this research, written and multimodal representations are analyzed from the Math Data Corpus, which involves multimodal data (digital pen, speech, images) on collaborating students as they solve math problems. Findings reveal that in 96-97% of cases the correctness of a group's solution was predictable in advance based on students' written work content. In addition, a linear regression revealed that 65% of the variance in individual students' domain expertise rankings could be accounted for based on their written work content. A multimodal content analysis based on both written and spoken input correctly predicted the dominant domain expert in a group 100% of the time, exceeding unimodal prediction rates. Further analysis revealed a reversal between experts and non-experts in the percentage of time that a match versus mismatch was present between their oral and written answer contributions, with non-experts demonstrating higher mismatches. Implications are discussed for developing reliable multimodal learning analytics systems that incorporate digital pen input to automatically track consolidation of domain expertise.
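A sketch of the kind of linear-regression analysis reported above: written-work content features predicting a student's expertise ranking, with R-squared as the variance-accounted-for measure. The feature names in the comment are placeholders, not the study's coding scheme.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_explained(written_features, expertise_rank):
    X = np.asarray(written_features)     # e.g. counts of correct terms, symbols, diagrams
    y = np.asarray(expertise_rank)
    model = LinearRegression().fit(X, y)
    return model.score(X, y)             # R^2: proportion of variance accounted for
```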

Workshop overview

ERM4HCI 2013: the 1st workshop on emotion representation and modelling in human-computer-interaction-systems BIBAFull-Text 607-608
  Kim Hartmann; Ronald Böck; Christian Becker-Asano; Jonathan Gratch; Björn Schuller; Klaus R. Scherer
This paper presents a brief summary of the first workshop on Emotion Representation and Modelling in Human-Computer-Interaction-Systems. The ERM4HCI 2013 workshop is held in conjunction with the ICMI 2013 conference. The focus is on theory driven representation and modelling of emotions in the context of Human-Computer-Interaction.
Gazein'13: the 6th workshop on eye gaze in intelligent human machine interaction: gaze in multimodal interaction BIBAFull-Text 609-610
  Roman Bednarik; Hung-Hsuan Huang; Yukiko Nakano; Kristiina Jokinen
This paper presents a summary of the sixth workshop in Eye Gaze in Intelligent Human Machine Interaction. The GazeIn'13 workshop is a part of a series of workshops held around the topics related to gaze and multimodal interaction.
   The workshop web-site can be found at http://cs.uef.fi/gazein2013/
Smart material interfaces: "another step to a material future" BIBAFull-Text 611-612
  Manuel Kretzer; Andrea Minuto; Anton Nijholt
Smart Materials have physical properties that can be changed or controlled by external stimuli such as electric or magnetic fields, temperature or stress. Shape, size and color are among the properties that can be changed. Smart Material Interfaces are physical interfaces that utilize these Smart Materials to sense the environment and display responses by changing their physical properties. This workshop aims at stimulating research and development in interfaces that make novel use of Smart Materials. It provides a platform for state-of-the-art design of Smart Material interfaces.