| Interfacing life: a year in the life of a research lab | | BIBAK | Full-Text | 1 | |
| Yuri Ivanov | |||
| Humans perceive life around them through a variety of sensory inputs. Some,
such as vision or audition, have high information content, while others, such
as touch and smell, do not. Humans and other animals use this gradation of
senses to know how to attend to what's important.
In contrast, it is widely accepted that in tasks of monitoring living spaces, the modalities with high information content hold the key to decoding the behavior and intentions of the space occupants. In surveillance, video cameras are used to record everything that they can possibly see in the hopes that if something happens, it can later be found in the recorded data. Unfortunately, the latter proved to be harder than it sounds. In our work we challenge this idea and introduce a monitoring system that is built as a combination of channels with varying information content. The system has been deployed for over a year in our lab space and consists of a large motion sensor network combined with several video cameras. While the sensors give a general context of the events in the entire 3000 square meters of the space, the cameras attend only to selected occurrences of office activities. The system demonstrates several monitoring tasks which are all but impossible to perform in a traditional camera-only setting. In the talk we share our experiences, challenges and solutions in building and maintaining the system. We show some results from the data that we have collected over a period of more than a year and introduce some other successful and novel applications of the system. Keywords: heterogeneous sensor networks, human behavior analysis | |||
| The great challenge of multimodal interfaces towards symbiosis of human and robots | | BIBAK | Full-Text | 2 | |
| Norihiro Hagita | |||
| This paper introduces the possibilities of symbiosis between human and
communication robots from the viewpoint of multi-modal interfaces. Current
communication abilities of robots, such as speech recognition, are insufficient
for practical use and need to be improved. A network robot system integrating
ubiquitous networking and robot technologies has been introduced in Japan,
Korea and EU countries in order to improve these abilities. Recent field
experiments on communication robots based on the system were conducted in a science
museum, a train station and a shopping mall in Japan. The results suggest that
network robot systems may serve as the next-generation communication
media. Improved communication ability raises privacy-policy problems,
since the history of human-robot interaction often includes personal
information. For example, when a robot says to me, "Hi, Nori. I know you," and I
have never met it before, how should I respond to it? Therefore, an access control
method based on multi-modal interfaces is required, and one will be discussed. Android
science will be introduced as an ultimate human interface. The research aims to
clarify the difference between "existence" for robot-like robots and "presence"
for human-like robots. Once the appearance of robots becomes more similar to
that of humans, how should I respond to them? The development of communication
robots at our lab, including privacy policy and android science, is outlined. Keywords: communication robot, humanoid robot | |||
| Just in time learning: implementing principles of multimodal processing and learning for education | | BIBAK | Full-Text | 3-8 | |
| Dominic W. Massaro | |||
Baldi, a 3-D computer-animated tutor, has been developed to teach speech and
language. I review this technology and pedagogy and describe evaluation
experiments that have substantiated the effectiveness of our language-training
program, Timo Vocabulary, to teach vocabulary and grammar. With a new Lesson
Creator, teachers, parents, and even students can build original lessons that
allow concepts, vocabulary, animations, and pictures to be easily integrated.
The Lesson Creator application facilitates the specialization and
individualization of lessons by allowing teachers to create customized
vocabulary lists Just in Time as they are needed. The Lesson Creator allows the
coach to give descriptions of the concepts as well as corrective feedback,
which allows errorless learning and encourages the child to think as they are
learning. I describe the Lesson Creator, illustrate it, and speculate on how
its evaluation can be accomplished. Keywords: education, language learning, multisensory integration, speech, vocabulary | |||
| The painful face: pain expression recognition using active appearance models | | BIBAK | Full-Text | 9-14 | |
| Ahmed Bilal Ashraf; Simon Lucey; Jeffrey F. Cohn; Tsuhan Chen; Zara Ambadar; Ken Prkachin; Patty Solomon; Barry J. Theobald | |||
| Pain is typically assessed by patient self-report. Self-reported pain,
however, is difficult to interpret and may be impaired or not even possible, as
in young children or the severely ill. Behavioral scientists have identified
reliable and valid facial indicators of pain. Until now they required manual
measurement by highly skilled observers. We developed an approach that
automatically recognizes acute pain. Adult patients with rotator cuff injury
were video-recorded while a physiotherapist manipulated their affected and
unaffected shoulder. Skilled observers rated pain expression from the video on
a 5-point Likert-type scale. From these ratings, sequences were categorized as
no-pain (rating of 0), pain (rating of 3, 4, or 5), and indeterminate (rating
of 1 or 2). We explored machine learning approaches for pain versus no-pain
classification. Active Appearance Models (AAM) were used to decouple shape and
appearance parameters from the digitized face images. Support vector machines
(SVM) were used with several representations from the AAM. Using a
leave-one-out procedure, we achieved an equal error rate of 19% (hit rate =
81%) using canonical appearance and shape features. These findings suggest the
feasibility of automatic pain detection from video. Keywords: active appearance models, automatic facial image analysis, facial
expression, pain, support vector machines | |||
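As an illustration of the classification setup described in the abstract above, the following sketch shows subject-independent SVM classification over precomputed AAM features. It is not the authors' code: the arrays `features`, `labels`, and `subjects` are hypothetical placeholders, scikit-learn stands in for whatever toolkit was actually used, and leave-one-subject-out is assumed as the cross-validation scheme.

```python
# Illustrative sketch only (not the authors' code): leave-one-subject-out
# SVM classification of pain vs. no-pain from precomputed AAM features.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 30))      # AAM shape + appearance parameters per sequence (placeholder)
labels = rng.integers(0, 2, size=200)      # 1 = pain, 0 = no-pain (placeholder)
subjects = rng.integers(0, 20, size=200)   # subject id, so test subjects are unseen during training

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups=subjects):
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(features[train_idx], labels[train_idx])
    scores.append(clf.score(features[test_idx], labels[test_idx]))

print("mean leave-one-subject-out accuracy:", np.mean(scores))
```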
| Faces of pain: automated measurement of spontaneous facial expressions of genuine and posed pain | | BIBAK | Full-Text | 15-21 | |
| Gwen C. Littlewort; Marian Stewart Bartlett; Kang Lee | |||
| We present initial results from the application of an automated facial
expression recognition system to spontaneous facial expressions of pain. In
this study, 26 participants were videotaped under three experimental
conditions: baseline, posed pain, and real pain. In the real pain condition,
subjects experienced cold pressor pain by submerging their arm in ice water.
Our goal was to automatically determine which experimental condition was shown
in a 60 second clip from a previously unseen subject. We chose a machine
learning approach, previously used successfully to categorize basic emotional
facial expressions in posed datasets as well as to detect individual facial
actions of the Facial Action Coding System (FACS) (Littlewort et al, 2006;
Bartlett et al., 2006). For this study, we trained 20 Action Unit (AU)
classifiers on over 5000 images selected from a combination of posed and
spontaneous facial expressions. The output of the system was a real valued
number indicating the distance to the separating hyperplane for each
classifier. Applying this system to the pain video data produced a 20 channel
output stream, consisting of one real value for each learned AU, for each frame
of the video. This data was passed to a second layer of classifiers to predict
the difference between baseline and pained faces, and the difference between
expressions of real pain and fake pain. Naïve human subjects tested on
the same videos were at chance for differentiating faked from real pain,
obtaining only 52% accuracy. The automated system was successfully able to
differentiate faked from real pain. In an analysis of 26 subjects, the system
obtained 72% correct for subject independent discrimination of real versus fake
pain on a 2-alternative forced choice. Moreover, the most discriminative facial
action in the automated system output was AU 4 (brow lower), which was
consistent with findings using human expert FACS codes. Keywords: FACS, computer vision, deception, facial action coding system, facial
expression recognition, machine learning, pain, spontaneous behavior | |||
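A minimal sketch of the two-layer idea in the abstract above: frame-wise outputs of 20 AU detectors are summarized per clip and fed to a second-layer classifier. This is not the authors' system; the AU streams are random placeholders and logistic regression merely stands in for the unspecified second-layer classifier.

```python
# Illustrative sketch (not the authors' system): a clip-level classifier over
# frame-wise Action Unit (AU) detector outputs. `au_streams` is a hypothetical
# list of (n_frames, 20) arrays, one per 60-second clip.
import numpy as np
from sklearn.linear_model import LogisticRegression

def clip_features(au_stream):
    # Summarize each 20-channel AU output stream with simple per-channel statistics.
    return np.concatenate([au_stream.mean(axis=0),
                           au_stream.max(axis=0),
                           au_stream.std(axis=0)])

rng = np.random.default_rng(1)
au_streams = [rng.normal(size=(1800, 20)) for _ in range(52)]  # ~30 fps * 60 s, placeholder data
labels = rng.integers(0, 2, size=52)                           # 1 = real pain, 0 = faked pain (placeholder)

X = np.stack([clip_features(s) for s in au_streams])
second_layer = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy of the clip-level classifier:", second_layer.score(X, labels))
```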
| Visual inference of human emotion and behaviour | | BIBAK | Full-Text | 22-29 | |
| Shaogang Gong; Caifeng Shan; Tao Xiang | |||
| We address the problem of automatic interpretation of non-exaggerated human
facial and body behaviours captured in video. We illustrate our approach by
three examples. (1) We introduce Canonical Correlation Analysis (CCA) and
Matrix Canonical Correlation Analysis (MCCA) for capturing and analyzing
spatial correlations among non-adjacent facial parts for facial behaviour
analysis. (2) We extend Canonical Correlation Analysis to multimodality
correlation for behaviour inference using both facial and body gestures. (3) We
model temporal correlation among human movement patterns in a wider space using
a mixture of Multi-Observation Hidden Markov Models for human behaviour
profiling and behavioural anomaly detection. Keywords: anomaly detection, behaviour profiling, body language recognition, human
emotion recognition, intention inference | |||
| Audiovisual recognition of spontaneous interest within conversations | | BIBAK | Full-Text | 30-37 | |
| Björn Schuller; Ronald Müller; Benedikt Hörnler; Anja Höthker; Hitoshi Konosu; Gerhard Rigoll | |||
| In this work we present an audiovisual approach to the recognition of
spontaneous interest in human conversations. For a most robust estimate,
information from four sources is combined by a synergistic and individual
failure tolerant fusion. Firstly, speech is analyzed with respect to acoustic
properties based on a high-dimensional prosodic, articulatory, and voice
quality feature space plus the linguistic analysis of spoken content by LVCSR
and bag-of-words vector space modeling including non-verbals. Secondly, visual
analysis provides patterns of the facial expression by AAMs, and of the
movement activity by eye tracking. Experiments are based on a database of 10.5 hours of
spontaneous human-to-human conversation containing 20 subjects balanced by gender and
age class. Recordings were made with a room microphone, camera, and
close-talk headsets to cover diverse comfort and noise conditions. Three
levels of interest were annotated within a rich transcription. We describe each
information stream and a fusion on an early level in detail. Our experiments
aim at a person-independent system for real-life usage and show the high
potential of such a multimodal approach. Benchmark results based on
transcription versus automatic processing are also provided. Keywords: affective computing, audiovisual, emotion, interest | |||
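The early (feature-level) fusion mentioned above can be pictured as concatenating per-segment feature vectors from each information stream before a single classifier is trained. The sketch below is only illustrative: all feature arrays and dimensionalities are invented placeholders, and an SVM stands in for the actual classifier.

```python
# Illustrative sketch of feature-level ("early") fusion, not the authors' pipeline:
# per-segment acoustic, linguistic, facial, and eye-activity feature vectors
# (all hypothetical placeholders here) are concatenated before classification.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 300
acoustic   = rng.normal(size=(n, 40))   # prosodic / voice-quality features
linguistic = rng.normal(size=(n, 50))   # bag-of-words vector space features
facial     = rng.normal(size=(n, 20))   # AAM-based facial expression features
eyes       = rng.normal(size=(n, 5))    # eye-activity features
interest   = rng.integers(0, 3, size=n) # three annotated levels of interest (placeholder labels)

fused = np.concatenate([acoustic, linguistic, facial, eyes], axis=1)
clf = SVC().fit(fused, interest)
print("fused feature dimension:", fused.shape[1])
```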
| How to distinguish posed from spontaneous smiles using geometric features | | BIBAK | Full-Text | 38-45 | |
| Michel F. Valstar; Hatice Gunes; Maja Pantic | |||
| Automatic distinction between posed and spontaneous expressions is an
unsolved problem. Previous cognitive science studies indicated that the
automatic separation of posed from spontaneous expressions is possible using
the face modality alone. However, little is known about the information
contained in head and shoulder motion. In this work, we propose to (i)
distinguish between posed and spontaneous smiles by fusing the head, face, and
shoulder modalities, (ii) investigate which modalities carry important
information and how the information from the modalities relates to each other, and
(iii) to what extent the temporal dynamics of these signals contribute to
solving the problem. We use a cylindrical head tracker to track the head
movements and two particle filtering techniques to track the facial and
shoulder movements. Classification is performed by kernel methods combined with
ensemble learning techniques. We investigated two aspects of multimodal fusion:
the level of abstraction (i.e., early, mid-level, and late fusion) and the
fusion rule used (i.e., sum, product and weight criteria). Experimental results
from 100 videos displaying posed smiles and 102 videos displaying spontaneous
smiles are presented. Best results were obtained with late fusion of all
modalities when 94.0% of the videos were classified correctly. Keywords: deception detection, human information processing, multimodal video
processing | |||
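For concreteness, the three fusion rules named above (sum, product, and weighted criteria) can be applied to per-modality class probabilities as in the following sketch. The probability values and weights are made up for illustration and do not come from the paper.

```python
# Illustrative sketch of late fusion with sum, product, and weighted-sum rules
# applied to per-modality class probabilities; the numbers are invented.
import numpy as np

# Hypothetical P(posed), P(spontaneous) from head, face, and shoulder classifiers.
head     = np.array([0.60, 0.40])
face     = np.array([0.30, 0.70])
shoulder = np.array([0.45, 0.55])
modalities = np.stack([head, face, shoulder])

sum_rule      = modalities.sum(axis=0)
product_rule  = modalities.prod(axis=0)
weights       = np.array([0.2, 0.5, 0.3])           # e.g. tuned per modality (assumed values)
weighted_rule = (weights[:, None] * modalities).sum(axis=0)

for name, score in [("sum", sum_rule), ("product", product_rule), ("weighted", weighted_rule)]:
    label = ["posed", "spontaneous"][int(np.argmax(score))]
    print(f"{name} rule -> {label} ({score})")
```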
| Eliciting, capturing and tagging spontaneous facial affect in autism spectrum disorder | | BIBAK | Full-Text | 46-53 | |
| Rana el Kaliouby; Alea Teeters | |||
| The emergence of novel affective technologies such as wearable interventions
for individuals who have difficulties with social-emotional communication
requires reliable, real-time processing of spontaneous expressions. This paper
describes a novel wearable camera and a systematic methodology to elicit,
capture and tag natural, yet experimentally controlled face videos in dyadic
conversations. The MIT-Groden-Autism corpus is the first corpus of
naturally-evoked facial expressions of individuals with and without Autism
Spectrum Disorders (ASD), a growing population who have difficulties with
social-emotional communication. It is also the largest in number and duration of
the videos, and represents affective-cognitive states that extend beyond the
basic emotions. We highlight the machine vision challenges inherent in
processing such a corpus, including pose changes and pathological affective
displays. Keywords: affective computing, autism spectrum disorder, facial expressions,
spontaneous video corpus | |||
| Statistical segmentation and recognition of fingertip trajectories for a gesture interface | | BIBAK | Full-Text | 54-57 | |
| Kazuhiro Morimoto; Chiyomi Miyajima; Norihide Kitaoka; Katunobu Itou; Kazuya Takeda | |||
| This paper presents a virtual push button interface created by drawing a
shape or line in the air with a fingertip. As an example of such a
gesture-based interface, we developed a four-button interface for entering
multi-digit numbers by pushing gestures within an invisible 2x2 button matrix
inside a square drawn by the user. Trajectories of fingertip movements entering
randomly chosen multi-digit numbers are captured with a 3D position sensor
mounted on the forefinger's tip. We propose a statistical segmentation method
for the trajectory of movements and a normalization method that is associated
with the direction and size of gestures. The performance of the proposed method
is evaluated in HMM-based gesture recognition. The recognition rate of 60.0%
was improved to 91.3% after applying the normalization method. Keywords: 3D position sensor, affine transformation, gesture interface, hidden Markov
model, principal component analysis | |||
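A rough sketch of the recognition pipeline described above: each fingertip trajectory is normalized for position and scale and then scored against per-button HMMs. This is not the authors' implementation; hmmlearn, the number of HMM states, and the random trajectories are all assumptions made for illustration.

```python
# Illustrative sketch, not the authors' system: normalize a 3D fingertip
# trajectory, then score it against one Gaussian HMM per virtual button.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def normalize(trajectory):
    # Translate to the centroid and scale to unit size, so recognition is
    # insensitive to where and how large the gesture was drawn in the air.
    centered = trajectory - trajectory.mean(axis=0)
    scale = np.abs(centered).max()
    return centered / (scale if scale > 0 else 1.0)

rng = np.random.default_rng(3)
models = {}
for button in range(4):  # one HMM per virtual button (assumed model structure)
    samples = [normalize(rng.normal(size=(50, 3))) for _ in range(10)]  # fake 3D trajectories
    X = np.concatenate(samples)
    lengths = [len(s) for s in samples]
    models[button] = GaussianHMM(n_components=5).fit(X, lengths)

test = normalize(rng.normal(size=(50, 3)))
best = max(models, key=lambda b: models[b].score(test))
print("recognized button:", best)
```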
| A tactile language for intuitive human-robot communication | | BIBAK | Full-Text | 58-65 | |
| Andreas J. Schmid; Martin Hoffmann; Heinz Woern | |||
| This paper presents a tactile language for controlling a robot through its
artificial skin. This language greatly improves the multimodal human-robot
communication by adding both redundant and inherently new ways of robot control
through the tactile mode. We defined an interface for arbitrary tactile
sensors, implemented a symbol recognition for multi-finger contacts, and
integrated that together with a freely available character recognition software
into an easy-to-extend system for tactile language processing that can also
incorporate and process data from non-tactile interfaces. The recognized
tactile symbols allow both direct control of the robot's tool center
point and abstract commands like "stop" or "grasp object x with grasp
type y". In addition to this versatility, the symbols are also extremely
expressive since multiple parameters like direction, distance, and speed can be
decoded from a single human finger stroke. Furthermore, our efficient symbol
recognition implementation achieves real-time performance while being
platform-independent. We have successfully used both a multi-touch finger pad
and our artificial robot skin as tactile interfaces. We evaluated our tactile
language system by measuring its symbol and angle recognition performance, and
the results are promising. Keywords: human-robot cooperation, robot control, tactile interface, tactile language | |||
| Simultaneous prediction of dialog acts and address types in three-party conversations | | BIBAK | Full-Text | 66-73 | |
| Yosuke Matsusaka; Mika Enomoto; Yasuharu Den | |||
| This paper reports on automatic prediction of dialog acts and address types
in three-party conversations. In multi-party interaction, dialog structure
becomes more complex than in the one-to-one case, because there is more than
one hearer for an utterance. To cope with this problem, we predict dialog acts
and address types simultaneously in our framework. Prediction accuracy for dialog act
labels rose to 68.5% when considering both context and address types. CART
decision tree analysis has also been applied to examine useful features to
predict those labels. Keywords: dialog act, gaze, multi-party interaction, prosody, recognition | |||
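The joint prediction of dialog acts and address types can be illustrated with a CART-style decision tree over combined labels, as in the sketch below. All features, label sets, and the use of scikit-learn are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch (not the authors' CART setup): a decision tree trained to
# predict the joint label "dialog act / address type" from context, prosodic,
# and gaze features; all feature arrays here are hypothetical placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
n = 500
context_feats = rng.normal(size=(n, 6))   # e.g. previous dialog act, previous addressee
prosody_feats = rng.normal(size=(n, 8))   # e.g. pitch and energy statistics
gaze_feats    = rng.normal(size=(n, 3))   # e.g. who the speaker is looking at
X = np.concatenate([context_feats, prosody_feats, gaze_feats], axis=1)

dialog_acts   = rng.choice(["statement", "question", "backchannel"], size=n)
address_types = rng.choice(["to_A", "to_B", "to_both"], size=n)
joint_labels  = np.char.add(np.char.add(dialog_acts, "/"), address_types)

tree = DecisionTreeClassifier(max_depth=6).fit(X, joint_labels)
print("feature importances (first five):", np.round(tree.feature_importances_, 2)[:5])
```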
| Developing and analyzing intuitive modes for interactive object modeling | | BIBAK | Full-Text | 74-81 | |
| Alexander Kasper; Regine Becher; Peter Steinhaus; Rüdiger Dillmann | |||
In this paper we present two approaches for intuitive interactive modeling
of special object attributes using specific sensor hardware. After a
brief overview of the state of the art in interactive, intuitive object
modeling, we motivate the modeling task by deriving the different object
attributes that shall be modeled from an analysis of important interactions
with objects. As an example domain, we chose the setting of a service robot in
a kitchen. Tasks from this domain were used to derive important basic actions
from which in turn the necessary object attributes were inferred.
In the main section of the paper, two of the derived attributes are presented, each with an intuitive interactive modeling method. The object attributes to be modeled are stable object positions and movement restrictions for objects. Both of the intuitive interaction methods were evaluated with a group of test persons and the results are discussed. The paper ends with conclusions on the discussed results and a preview of future work in this area, in particular of potential applications. Keywords: interactive object modeling, user interface | |||
| Extraction of important interactions in medical interviews using nonverbal information | | BIBAK | Full-Text | 82-85 | |
| Yuichi Sawamoto; Yuichi Koyama; Yasushi Hirano; Shoji Kajita; Kenji Mase; Kimiko Katsuyama; Kazunobu Yamauchi | |||
| We propose a method of extracting important interaction patterns in medical
interviews. Because the interview is a major step where doctor-patient
communication takes place, improving the skill and the quality of the medical
interview will lead to better medical care. A pattern mining method for
multimodal interaction logs, such as gestures and speech, is applied to medical
interviews in order to extract certain doctor-patient interactions. As a
result, we demonstrated that several interesting patterns can be extracted, and we
examined their interpretations. The extracted patterns are considered to be
ones that doctors should acquire in training and practice for the medical
interview. Keywords: medical interview, multimodal interaction patterns | |||
| Towards smart meeting: enabling technologies and a real-world application | | BIBAK | Full-Text | 86-93 | |
| Zhiwen Yu; Motoyuki Ozeki; Yohsuke Fujii; Yuichi Nakamura | |||
| In this paper, we describe the enabling technologies to develop a smart
meeting system based on a three layered generic model. From physical level to
semantic level, it consists of meeting capturing, meeting recognition, and
semantic processing. Based on the overview of underlying technologies and
existing work, we propose a novel real-world smart meeting application, called
MeetingAssistant. It is distinct from previous systems in two aspects. First,
it provides real-time browsing that allows a participant to instantly view
the status of the current meeting. This feature is helpful in activating
discussion and facilitating human communication during a meeting. Second, the
context-aware browsing adaptively selects and displays meeting information
according to the user's situational context, e.g., user purpose, which makes
meeting viewing more efficient. Keywords: context-aware, meeting browser, real-time, smart meeting | |||
| Multimodal cues for addressee-hood in triadic communication with a human information retrieval agent | | BIBAK | Full-Text | 94-101 | |
| Jacques Terken; Irene Joris; Linda De Valk | |||
| Over the last few years, a number of studies have dealt with the question of
how the addressee of an utterance can be determined from observable behavioural
features in the context of mixed human-human and human-computer interaction
(e.g. in the case of someone talking alternatingly to a robot and another
person). Often in these cases, the behaviour is strongly influenced by the
difference in communicative ability of the robot and the other person, and the
"salience" of the robot or system, turning it into a situational distractor. In
the current paper, we study triadic human-human communication, where one of the
participants plays the role of an information retrieval agent (such as in a
travel agency, where two customers who want to book a vacation engage in a
dialogue with the travel agent to specify constraints on preferable options).
Through a perception experiment we investigate the role of audio and visual
cues as markers of addressee-hood of utterances by customers. The outcomes show
that both audio and visual cues provide specific types of information, and that
combined audio-visual cues give the best performance. In addition, we conduct a
detailed analysis of the eye gaze behaviour of the information retrieval agent
both when listening and speaking, providing input for modelling the behaviour
of an embodied conversational agent. Keywords: addresseehood, conversational agents, eye gaze, multimodal interaction,
perceptual user interfaces | |||
| The effect of input mode on inactivity and interaction times of multimodal systems | | BIBAK | Full-Text | 102-109 | |
| Manolis Perakakis; Alexandros Potamianos | |||
| In this paper, the efficiency and usage patterns of input modes in
multimodal dialogue systems are investigated for both desktop and personal
digital assistant (PDA) working environments. For this purpose a form-filling
travel reservation application is evaluated that combines the speech and visual
modalities; three multimodal modes of interaction are implemented, namely:
"Click-To-Talk", "Open-Mike" and "Modality-Selection". The three multimodal
systems are evaluated and compared with the "GUI-Only" and "Speech-Only"
unimodal systems. Mode and duration statistics are computed for each system,
for each turn and for each attribute in the form. Turn time is decomposed in
interaction and inactivity time and the statistics for each input mode are
computed. Results show that multimodal and adaptive interfaces are superior in
terms of interaction time, but not always in terms of inactivity time. Also,
users tend to use the most efficient input mode, although our experiments show
a bias towards the speech modality. Keywords: input modality selection, mobile multimodal interfaces | |||
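The turn-time decomposition described above splits each turn's duration into interaction time (while input events are in progress) and inactivity time (the remainder). A minimal sketch with invented timestamps:

```python
# Illustrative sketch of decomposing turn time into interaction and inactivity
# time; the turn boundaries and input-event timestamps below are made up.
turns = [
    # (turn_start, turn_end, [(event_start, event_end), ...])  in seconds
    (0.0, 8.0, [(1.0, 3.5), (5.0, 6.0)]),
    (8.0, 15.0, [(9.0, 13.0)]),
]

for start, end, events in turns:
    turn_time = end - start
    interaction_time = sum(e_end - e_start for e_start, e_end in events)
    inactivity_time = turn_time - interaction_time
    print(f"turn {start:4.1f}-{end:4.1f}s: interaction {interaction_time:.1f}s, "
          f"inactivity {inactivity_time:.1f}s")
```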
| Positional mapping: keyboard mapping based on characters writing positions for mobile devices | | BIBAK | Full-Text | 110-117 | |
| Ye Kyaw Thu; Yoshiyori Urano | |||
Keyboard or keypad layout is one of the important factors in increasing user
text input speed, especially on limited keypads such as those of mobile phones. This paper
introduces a novel key mapping method, "Positional Mapping" (PM), for phonetic
scripts such as the Myanmar language, based on its characters' writing positions. Our
approach makes key mapping for the Myanmar language very simple and easy to
memorize. We have developed Positional Mapping text input prototypes for a mobile
phone keypad, a PDA, the customizable DX1 keyboard input system and a dual-joystick
game pad, and conducted user studies for each prototype. Evaluation was based
on users' actual typing speed with our four PM prototypes, and it showed
that first-time users can finish a short Myanmar SMS message of 6 sentences
in reasonable average times (3 min 47 sec with DX1, 4 min 42 sec with the mobile phone
prototype, 4 min 26 sec with the PDA and 5 min 30 sec with the dual-joystick
game pad). Positional Mapping can be extended to other
phonetic scripts, which we present with a Bangla mobile phone prototype in this
paper. Keywords: Bangla language, Myanmar language, keypad layout, mobile phone, pen-based,
soft keyboard, stylus input, text entry | |||
| Five-key text input using rhythmic mappings | | BIBAK | Full-Text | 118-121 | |
| Christine Szentgyorgyi; Edward Lank | |||
| Novel key mappings, including chording, character prediction, and multi-tap,
allow the use of fewer keys than those on a conventional keyboard to enter
text. In this paper, we explore a text input method that makes use of rhythmic
mappings of five keys. The keying technique averages 1.5 keystrokes per
character for typical English text. In initial testing, the technique shows
performance similar to chording and other multi-tap techniques, and our
subjects had few problems with basic text entry. Five-key entry techniques may
have benefits for text entry in multi-point touch devices, as they eliminate
targeting by providing a unique mapping for each finger. Keywords: multi-tap, one-handed text entry, rhythmic tapping, touch | |||
| Toward content-aware multimodal tagging of personal photo collections | | BIBAK | Full-Text | 122-125 | |
| Paulo Barthelmess; Edward Kaiser; David R. McGee | |||
A growing number of tools are becoming available that make use of existing
tags to help organize and retrieve photos, facilitating the management and use
of photo sets. The tagging on which these techniques rely remains a
time-consuming, labor-intensive task that discourages many users. To address this
problem, we aim to leverage the multimodal content of naturally occurring photo
discussions among friends and families to automatically extract tags from a
combination of conversational speech, handwriting, and photo content analysis.
While naturally occurring discussions are rich sources of information about
photos, methods need to be developed to reliably extract a set of
discriminative tags from this noisy, unconstrained group discourse. To this
end, this paper contributes an analysis of pilot data identifying robust
multimodal features examining the interplay between photo content and other
modalities such as speech and handwriting. Our analysis is motivated by a
search for design implications leading to the effective incorporation of
automated location and person identification (e.g. based on GPS and facial
recognition technologies) into a system able to extract tags from natural
multimodal conversations. Keywords: automatic label extraction, collaborative interaction, intelligent
interfaces, multimodal processing, photo annotation, tagging | |||
| A survey of affect recognition methods: audio, visual and spontaneous expressions | | BIBAK | Full-Text | 126-133 | |
| Zhihong Zeng; Maja Pantic; Glenn I. Roisman; Thomas S. Huang | |||
| Automated analysis of human affective behavior has attracted increasing
attention from researchers in psychology, computer science, linguistics,
neuroscience, and related disciplines. Promising approaches have been reported,
including automatic methods for facial and vocal affect recognition. However,
the existing methods typically handle only deliberately displayed and
exaggerated expressions of prototypical emotions, despite the fact that
deliberate behavior differs in visual and audio expressions from spontaneously
occurring behavior. Recently efforts to develop algorithms that can process
naturally occurring human affective behavior have emerged. This paper surveys
these efforts. We first discuss human emotion perception from a psychological
perspective. Next, we examine the available approaches to solving the problem
of machine understanding of human affective behavior occurring in real-world
settings. We finally outline some scientific and engineering challenges for
advancing human affect sensing technology. Keywords: affect recognition, affective computing, emotion recognition, human
computing, multimodal human computer interaction, multimodal user interfaces | |||
| Real-time expression cloning using appearance models | | BIBAK | Full-Text | 134-139 | |
| Barry-John Theobald; Iain A. Matthews; Jeffrey F. Cohn; Steven M. Boker | |||
| Active Appearance Models (AAMs) are generative parametric models commonly
used to track, recognise and synthesise faces in images and video sequences. In
this paper we describe a method for transferring dynamic facial gestures
between subjects in real-time. The main advantages of our approach are that: 1)
the mapping is computed automatically and does not require high-level semantic
information describing facial expressions or visual speech gestures. 2) The
mapping is simple and intuitive, allowing expressions to be transferred and
rendered in real-time. 3) The mapped expression can be constrained to have the
appearance of the target producing the expression, rather than the source
expression imposed onto the target face. 4) Near-videorealistic talking faces
for new subjects can be created without the cost of recording and processing a
complete training corpus for each. Our system enables face-to-face interaction
with an avatar driven by an AAM of an actual person in real-time and we show
examples of arbitrary expressive speech frames cloned across different
subjects. Keywords: active appearance models, expression cloning, facial animation | |||
| Gaze-communicative behavior of stuffed-toy robot with joint attention and eye contact based on ambient gaze-tracking | | BIBAK | Full-Text | 140-145 | |
| Tomoko Yonezawa; Hirotake Yamazoe; Akira Utsumi; Shinji Abe | |||
| This paper proposes a gaze-communicative stuffed-toy robot system with joint
attention and eye-contact reactions based on ambient gaze-tracking. For free
and natural interaction, we adopted our remote gaze-tracking method.
Corresponding to the user's gaze, the gaze-reactive stuffed-toy robot is
designed to gradually establish 1) joint attention using the direction of the
robot's head and 2) eye-contact reactions from several sets of motion. From
both subjective evaluations and observations of the user's gaze in the
demonstration experiments, we found that i) joint attention draws the user's
interest along with the user-guessed interest of the robot, ii) "eye contact"
brings the user a favorable feeling for the robot, and iii) this feeling is
enhanced when "eye contact" is used in combination with "joint attention."
These results support the approach of our embodied gaze-communication model. Keywords: eye contact, gaze communication, joint attention, stuffed-toy robot | |||
| Map navigation with mobile devices: virtual versus physical movement with and without visual context | | BIBAK | Full-Text | 146-153 | |
| Michael Rohs; Johannes Schöning; Martin Raubal; Georg Essl; Antonio Krüger | |||
| A user study was conducted to compare the performance of three methods for
map navigation with mobile devices. These methods are joystick navigation, the
dynamic peephole method without visual context, and the magic lens paradigm
using external visual context. The joystick method is the familiar scrolling
and panning of a virtual map keeping the device itself static. In the dynamic
peephole method the device is moved and the map is fixed with respect to an
external frame of reference, but no visual information is present outside the
device's display. The magic lens method augments external content with
graphical overlays, hence providing visual context outside the device display.
Here, too, motion of the device serves to steer navigation. We compare these
methods in a study measuring user performance, motion patterns, and subjective
preference via questionnaires. The study demonstrates the advantage of dynamic
peephole and magic lens interaction over joystick interaction in terms of
search time and degree of exploration of the search space. Keywords: augmented reality, camera phones, camera-based interaction, handheld
displays, interaction techniques, maps, mobile devices, navigation, spatially
aware displays | |||
| Can you talk or only touch-talk: A VoIP-based phone feature for quick, quiet, and private communication | | BIBAK | Full-Text | 154-161 | |
| Maria Danninger; Leila Takayama; Qianying Wang; Courtney Schultz; Jörg Beringer; Paul Hofmann; Frankie James; Clifford Nass | |||
| Advances in mobile communication technologies have allowed people in more
places to reach each other more conveniently than ever before. However, many
mobile phone communications occur in inappropriate contexts, disturbing others
in close proximity, invading personal and corporate privacy, and more broadly
breaking social norms. This paper presents a telephony system that allows users
to answer calls quietly and privately without speaking. The paper discusses the
iterative process of design, implementation and system evaluation. The
resulting system is a VoIP-based telephony system that can be immediately
deployed from any phone capable of sending DTMF signals. Observations and
results from inserting and evaluating this technology in real-world business
contexts through two design cycles of the Touch-Talk feature are reported. Keywords: VoIP, business context, computer-mediated communication, mobile phones,
telephony, touch-talk | |||
| Designing audio and tactile crossmodal icons for mobile devices | | BIBAK | Full-Text | 162-169 | |
| Eve Hoggan; Stephen Brewster | |||
| This paper reports an experiment into the design of crossmodal icons which
can provide an alternative form of output for mobile devices using audio and
tactile modalities to communicate information. A complete set of crossmodal
icons was created by encoding three dimensions of information in three
crossmodal auditory/tactile parameters. Earcons were used for the audio and
Tactons for the tactile crossmodal icons. The experiment investigated absolute
identification of audio and tactile crossmodal icons when a user is trained in
one modality and tested in the other (and given no training in the other
modality) to see if knowledge could be transferred between modalities. We also
compared performance when users were static and mobile to see any effects that
mobility might have on recognition of the cues. The results showed that if
participants were trained in sound with Earcons and then tested with the same
messages presented via Tactons they could recognize 85% of messages when
stationary and 76% when mobile. When trained with Tactons and tested with
Earcons participants could accurately recognize 76.5% of messages when
stationary and 71% of messages when mobile. These results suggest that
participants can recognize and understand a message in a different modality
very effectively. These results will aid designers of mobile displays in
creating effective crossmodal cues which require minimal training for users and
can provide alternative presentation modalities through which information may
be presented if the context requires. Keywords: crossmodal interaction, earcons, mobile interaction, multimodal interaction,
tactons (tactile icons) | |||
| A study on the scalability of non-preferred hand mode manipulation | | BIBAK | Full-Text | 170-177 | |
| Jaime Ruiz; Edward Lank | |||
In pen-tablet input devices, modes allow overloading of the electronic
stylus. In the case of two modes, switching modes with the non-preferred hand
is most effective [12]. Further, allowing temporal overlap of mode switch and
pen action boosts speed [11]. We examine the effect of increasing the number of
interface modes accessible via non-preferred hand mode switching on task
performance in pen-tablet interfaces. We demonstrate that the temporal benefit
of overlapping mode-selection and pen action for the two mode case is preserved
as the number of modes increases. This benefit is the result of both concurrent
action of the hands, and reduced planning time for the overall task. Finally,
while allowing bimanual overlap is still faster, it takes longer to switch modes
as the number of modes increases. Improved understanding of the temporal costs
presented assists in the design of pen-tablet interfaces with larger sets of
interface modes. Keywords: bimanual interaction, concurrent mode switching, interaction technique,
mode, pen interfaces, stylus | |||
| VoicePen: augmenting pen input with simultaneous non-linguistic vocalization | | BIBAK | Full-Text | 178-185 | |
| Susumu Harada; T. Scott Saponas; James A. Landay | |||
| This paper explores using non-linguistic vocalization as an additional
modality to augment digital pen input on a tablet computer. We investigated
this through a set of novel interaction techniques and a feasibility study.
Typically, digital pen users control one or two parameters using stylus
position and sometimes pen pressure. However, in many scenarios the user can
benefit from the ability to continuously vary additional parameters.
Non-linguistic vocalizations, such as vowel sounds, variation of pitch, or
control of loudness have the potential to provide fluid continuous input
concurrently with pen interaction. We present a set of interaction techniques
that leverage the combination of voice and pen input when performing both
creative drawing and object manipulation tasks. Our feasibility evaluation
suggests that with little training people can use non-linguistic vocalization
to productively augment digital pen interaction. Keywords: multimodal input, pen-based interface, voice-based interface | |||
| A large-scale behavior corpus including multi-angle video data for observing infants' long-term developmental processes | | BIBAK | Full-Text | 186-192 | |
| Shinya Kiriyama; Goh Yamamoto; Naofumi Otani; Shogo Ishikawa; Yoichi Takebayashi | |||
| We have developed a method for multimodal observation of infant development.
In order to analyze development of problem solving skills by observing scenes
of task achievement or communication with others, we have introduced a method
for extracting detailed behavioral features expressed by gestures or eyes. We
have realized an environment for recording behavior of the same infants
continuously as multi-angle video. The environment has evolved into a practical
infrastructure through the following four steps: (1) establish an infant school
and study the camera arrangement; (2) obtain participants in the school who
agree with the project purpose and start to hold regular classes; (3) begin to
construct a multimodal infant behavior corpus while considering observation
methods; (4) practice development process analyses using the corpus. We have
constructed a support tool for observing a huge amount of video data which
increases with age. The system has contributed to enrich the corpus with
annotations from multimodal viewpoints about infant development. With a focus
on the demonstrative expression as a fundamental human behavior, we have
extracted 240 scenes from the video during 10 months and observed them. The
analysis results have revealed interesting findings about the developmental
changes in infants' gestures and eyes, and indicated the effectiveness of the
proposed observation method. Keywords: behavior observation, infant development, multi-angle video, multimodal
behavior corpus | |||
| The micole architecture: multimodal support for inclusion of visually impaired children | | BIBAK | Full-Text | 193-200 | |
| Thomas Pietrzak; Benoît Martin; Isabelle Pecci; Rami Saarinen; Roope Raisamo; Janne Järvi | |||
| Modern information technology allows us to seek out new ways to support the
computer use and communication of disabled people. With the aid of new
interaction technologies and techniques visually impaired and sighted users can
collaborate, for example, in the classroom situations. The main goal of the
MICOLE project was to create a software architecture that makes it easier for
the developers to create multimodal multi-user applications. The framework is
based on interconnected software agents. The hardware used in this study
includes VTPlayer Mouse which has two built-in Braille displays, and several
haptic devices such as PHANToM Omni, PHANToM Desktop and PHANToM Premium. We
also used the SpaceMouse and various audio setups in the applications. In this
paper we present a software architecture, a set of software agents, and an
example of using the architecture. The example application shown is an electric
circuit application that follows the single-user with many devices scenario.
The application uses a PHANToM and a VTPlayer Mouse together with visual and
audio feedback to make the electric circuits understandable through touch. Keywords: distributed/collaborative multimodal interfaces, haptic interfaces,
multimodal input and output interfaces, universal access interfaces | |||
| Interfaces for musical activities and interfaces for musicians are not the same: the case for codes, a web-based environment for cooperative music prototyping | | BIBAK | Full-Text | 201-207 | |
| Evandro Manara Miletto; Luciano Vargas Flores; Marcelo Soares Pimenta; Jérôme Rutily; Leonardo Santagada | |||
| In this paper, some requirements of user interfaces for musical activities
are investigated and discussed, particularly focusing on the necessary
distinction between interfaces for musical activities and interfaces for
musicians. We also discuss the interactive and cooperative aspects of music
creation activities in CODES, a Web-based environment for cooperative music
prototyping, designed mainly for novices in music. Aspects related to
interaction flexibility and usability are presented, as well as features to
support manipulation of complex musical information, cooperative activities and
group awareness, which allow users to understand the actions and decisions of
all group members cooperating and sharing a music prototype. Keywords: computer music, cooperative music prototyping, human-computer interaction,
interfaces for novices, world wide web | |||
| TotalRecall: visualization and semi-automatic annotation of very large audio-visual corpora | | BIBAK | Full-Text | 208-215 | |
| Rony Kubat; Philip DeCamp; Brandon Roy | |||
| We introduce a system for visualizing, annotating, and analyzing very large
collections of longitudinal audio and video recordings. The system,
TotalRecall, is designed to address the requirements of projects like the Human
Speechome Project, for which more than 100,000 hours of multitrack audio and
video have been collected over a twenty-two month period. Our goal in this
project is to transcribe speech in over 10,000 hours of audio recordings, and
to annotate the position and head orientation of multiple people in the 10,000
hours of corresponding video. Higher level behavioral analysis of the corpus
will be based on these and other annotations. To efficiently cope with this
huge corpus, we are developing semi-automatic data coding methods that are
integrated into TotalRecall. Ultimately, this system and the underlying
methodology may enable new forms of multimodal behavioral analysis grounded in
ultradense longitudinal data. Keywords: multimedia corpora, semi-automation, speech transcription, video annotation,
visualization | |||
| Extensible middleware framework for multimodal interfaces in distributed environments | | BIBAK | Full-Text | 216-219 | |
| Vitor Fernandes; Tiago Guerreiro; Bruno Araújo; Joaquim Jorge; João Pereira | |||
| We present a framework to manage multimodal applications and interfaces in a
reusable and extensible manner. We achieve this by focusing the architecture
both on applications' needs and devices' capabilities. One particular domain we
want to approach is collaborative environments where several modalities and
applications make it necessary to provide for an extensible system combining
diverse components across heterogeneous platforms on-the-fly. This paper
describes the proposed framework and its main contributions in the context of
an architectural application scenario. We demonstrate how to connect different
non-conventional applications and input modalities around an immersive
environment (tiled display wall). Keywords: capability, collaborative, extensible, framework, multimodal interfaces,
reusable | |||
| Temporal filtering of visual speech for audio-visual speech recognition in acoustically and visually challenging environments | | BIBAK | Full-Text | 220-227 | |
| Jong-Seok Lee; Cheol Hoon Park | |||
| The use of visual information of speech has been shown to be effective for
compensating for performance degradation of acoustic speech recognition in
noisy environments. However, visual noise is usually ignored in most of
audio-visual speech recognition systems, while it can be included in visual
speech signals during acquisition or transmission of the signals. In this
paper, we present a new temporal filtering technique for extraction of
noise-robust visual features. In the proposed method, a carefully designed
band-pass filter is applied to the temporal pixel value sequences of lip region
images in order to remove unwanted temporal variations due to visual noise,
illumination conditions or speakers' appearances. We demonstrate that the
method can improve not only visual speech recognition performance for clean and
noisy images but also audio-visual speech recognition performance in both
acoustically and visually noisy conditions. Keywords: audio-visual speech recognition, feature extraction, hidden Markov model,
late integration, neural network, noise-robustness, temporal filtering | |||
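The temporal filtering idea above, i.e., band-pass filtering each pixel's value sequence over time in the lip-region image stack, can be sketched as follows. The filter design, cutoff frequencies, and frame rate are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch (not the authors' filter design): a Butterworth band-pass
# filter applied along the temporal axis of a lip-region image sequence, to
# suppress slow illumination drift and fast frame-to-frame noise.
import numpy as np
from scipy.signal import butter, filtfilt

fps = 30.0                                   # video frame rate (assumed)
low_hz, high_hz = 0.5, 8.0                   # pass band (assumed, not from the paper)
b, a = butter(N=4, Wn=[low_hz, high_hz], btype="bandpass", fs=fps)

rng = np.random.default_rng(5)
lip_video = rng.random(size=(120, 16, 16))   # (frames, height, width) placeholder data

# Filter the value sequence of every pixel position along the temporal axis.
filtered = filtfilt(b, a, lip_video, axis=0)
print(filtered.shape)
```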
| Reciprocal attentive communication in remote meeting with a humanoid robot | | BIBAK | Full-Text | 228-235 | |
| Tomoyuki Morita; Kenji Mase; Yasushi Hirano; Shoji Kajita | |||
| In this paper, we investigate the reciprocal attention modality in remote
communication. A remote meeting system with a humanoid robot avatar is proposed
to overcome the invisible wall of a video conferencing system. Our
experimental result shows that a tangible robot avatar provides more effective
reciprocal attention than video communication. The subjects in the
experiment are asked to determine whether or not a remote participant with the avatar
is actively listening to the local presenter's talk. In this system, the
head motion of a remote participant is transferred and expressed by the head
motion of a humanoid robot. While the presenter has difficulty in determining
the extent of a remote participant's attention with a video conferencing
system, she/he has better sensing of remote attentive states with the robot.
Based on the evaluation result, we propose a vision system for the remote user
that integrates omni-directional camera and robot-eye camera images to provide
a wide view with a delay compensation feature. Keywords: gaze, gesture, humanoid robot, remote communication, robot teleconferencing | |||
| Password management using doodles | | BIBAK | Full-Text | 236-239 | |
| Naveen Sundar Govindarajulu; Sriganesh Madhvanath | |||
| The average computer user needs to remember a large number of text username
and password combinations for different applications, which places a large
cognitive load on the user. Consequently, users tend to write down passwords,
use easy to remember (and guess) passwords, or use the same password for
multiple applications, leading to security risks. This paper describes the use
of personalized hand-drawn "doodles" for recall and management of password
information. Since doodles can be easier to remember than text passwords, the
cognitive load on the user is reduced. Our method involves recognizing doodles
by matching them against stored prototypes using handwritten shape matching
techniques. We have built a system which manages passwords for web applications
through a web browser. In this system, the user logs into a web application by
drawing a doodle using a touchpad or digitizing tablet attached to the
computer. The user is automatically logged into the web application if the
doodle matches the doodle drawn during enrollment. We also report accuracy
results for our doodle recognition system, and conclude with a summary of next
steps. Keywords: doodles, password management | |||
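A toy sketch of doodle-to-prototype matching in the spirit of the system above: strokes are resampled to a fixed number of points, normalized, and matched to the nearest stored prototype. The described system uses handwritten shape matching techniques; the simple Euclidean matcher and the enrollment doodles below are placeholders.

```python
# Illustrative sketch only: match a drawn doodle against stored prototypes by
# resampling to a fixed number of points, normalizing, and taking the nearest
# prototype under Euclidean distance.
import numpy as np

def resample(points, n=64):
    points = np.asarray(points, dtype=float)
    t = np.linspace(0, 1, len(points))
    t_new = np.linspace(0, 1, n)
    return np.column_stack([np.interp(t_new, t, points[:, d]) for d in range(2)])

def normalize(points):
    points = points - points.mean(axis=0)
    scale = np.abs(points).max()
    return points / (scale if scale > 0 else 1.0)

def match(doodle, prototypes):
    query = normalize(resample(doodle)).ravel()
    dists = {name: np.linalg.norm(query - normalize(resample(p)).ravel())
             for name, p in prototypes.items()}
    return min(dists, key=dists.get)

# Hypothetical enrollment doodles and a new drawing.
prototypes = {"bank": [(0, 0), (1, 1), (2, 0)], "mail": [(0, 0), (0, 2), (2, 2)]}
print(match([(0, 0), (1.1, 0.9), (2, 0.1)], prototypes))
```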
| A computational model for spatial expression resolution | | BIBAK | Full-Text | 240-246 | |
| Andrea Corradini | |||
| This paper presents a computational model for the interpretation of
linguistic spatial propositions in the restricted realm of a 2D puzzle game.
Based on an experiment aimed at analyzing human judgment of spatial
expressions, we establish a set of criteria that explain human preference for
certain interpretations over others. For each of these criteria, we define a
metric that combines the semantic and pragmatic contextual information
regarding the game as well as the utterance being resolved. Each metric gives
rise to a potential field that characterizes the degree of likelihood for
carrying out the instruction at a specific hypothesized location. We resort to
machine learning techniques to determine a model of spatial relationships from
the data collected during the experiment. Sentence interpretation occurs by
matching the potential field of each of its possible interpretations to the
model at hand. The system's explanation capabilities lead to the correct
assessment of ambiguous situated utterances for a large percentage of the
collected expressions. Keywords: machine learning, psycholinguistic study, spatial expressions | |||
| Disambiguating speech commands using physical context | | BIBAK | Full-Text | 247-254 | |
| Katherine M. Everitt; Susumu Harada; Jeff Bilmes; James A. Landay | |||
| Speech has great potential as an input mechanism for ubiquitous computing.
However, the current requirements necessary for accurate speech recognition,
such as a quiet environment and a well-positioned and high-quality microphone,
are unreasonable to expect in a realistic setting. In a physical environment,
there is often contextual information which can be sensed and used to augment
the speech signal. We investigated improving speech recognition rates for an
electronic personal trainer using knowledge about what equipment was in use as
context. We performed an experiment with participants speaking in an
instrumented apartment environment and compared the recognition rates of a
larger grammar with those of a smaller grammar that is determined by the
context. Keywords: context, exercise, fitness, speech recognition | |||
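The context-driven grammar restriction described above can be pictured as selecting a smaller set of candidate phrases based on the sensed equipment in use, so the recognizer chooses among fewer, less confusable commands. The grammar contents below are invented for illustration.

```python
# Illustrative sketch of the idea, not the study's recognizer: the active piece
# of exercise equipment (sensed from the environment) selects a smaller command
# grammar, shrinking the speech recognizer's search space.
FULL_GRAMMAR = {
    "treadmill": ["increase speed", "decrease speed", "set incline to five"],
    "bike": ["increase resistance", "decrease resistance"],
    "weights": ["log ten repetitions", "switch to next exercise"],
}

def active_grammar(equipment_in_use):
    # Restrict the grammar to commands for equipment that sensors report as in use.
    phrases = []
    for equipment in equipment_in_use:
        phrases.extend(FULL_GRAMMAR.get(equipment, []))
    return phrases

# If only the treadmill is sensed as active, the candidate set shrinks from 7 phrases to 3.
print(active_grammar(["treadmill"]))
```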
| Automatic inference of cross-modal nonverbal interactions in multiparty conversations: "who responds to whom, when, and how?" from gaze, head gestures, and utterances | | BIBAK | Full-Text | 255-262 | |
| Kazuhiro Otsuka; Hiroshi Sawada; Junji Yamato | |||
| A novel probabilistic framework is proposed for analyzing cross-modal
nonverbal interactions in multiparty face-to-face conversations. The goal is to
determine "who responds to whom, when, and how" from multimodal cues including
gaze, head gestures, and utterances. We formulate this problem as the
probabilistic inference of the causal relationship among participants'
behaviors involving head gestures and utterances. To solve this problem, this
paper proposes a hierarchical probabilistic model; the structures of
interactions are probabilistically determined from high-level conversation
regimes (such as monologue or dialogue) and gaze directions. Based on the
model, the interaction structures, gaze, and conversation regimes, are
simultaneously inferred from observed head motion and utterances, using a
Markov chain Monte Carlo method. The head gestures, including nodding, shaking
and tilt, are recognized with a novel Wavelet-based technique from magnetic
sensor signals. The utterances are detected using data captured by lapel
microphones. Experiments on four-person conversations confirm the effectiveness
of the framework in discovering interactions such as question-and-answer and
addressing behavior followed by back-channel responses. Keywords: Bayesian network, eye gaze, face-to-face multiparty conversation, Gibbs
sampler, head gestures, Markov chain Monte Carlo, nonverbal behaviors,
semi-Markov process | |||
| Influencing social dynamics in meetings through a peripheral display | | BIBAK | Full-Text | 263-270 | |
| Janienke Sturm; Olga Houben-van Herwijnen; Anke Eyck; Jacques Terken | |||
| We present a service providing real-time feedback to participants of small
group meetings on the social dynamics of the meeting. The service measures and
visualizes properties of participants' behaviour that are relevant to the
social dynamics of the meeting: speaking time and gaze behaviour. The dynamic
visualization is offered to meeting participants during the meeting through a
peripheral display. Whereas an initial version was evaluated using wizards to
obtain the required information about gazing behaviour and speaking activity
instead of perceptual systems, in the current paper we employ a system
including automated perceptual components. We describe the system properties
and the perceptual components. The service was evaluated in a within-subjects
experiment, where groups of participants discussed topics of general interest,
with a total of 82 participants. It was found that the presence of the feedback
about speaking time influenced the behaviour of the participants in such a way
that it made over-participators behave less dominantly and under-participators
become more active. Feedback on eye gaze behaviour did not affect
participants' gazing behaviour (both for listeners and for speakers) during the
meeting. Keywords: head orientation detection, meetings, peripheral display, social dynamics,
speech activity detection | |||
| Using the influence model to recognize functional roles in meetings | | BIBAK | Full-Text | 271-278 | |
| Wen Dong; Bruno Lepri; Alessandro Cappelletti; Alex Sandy Pentland; Fabio Pianesi; Massimo Zancanaro | |||
| In this paper, an influence model is used to recognize functional roles
played during meetings. Previous works on the same corpus demonstrated a high
recognition accuracy using SVMs with RBF kernels. In this paper, we discuss the
problems of that approach, mainly over-fitting, the curse of dimensionality and
the inability to generalize to different group configurations. We present
results obtained with an influence modeling method that avoids these problems
and ensures both greater robustness and generalization capability. Keywords: group interaction, intelligent environments, support vector machines | |||
| User impressions of a stuffed doll robot's facing direction in animation systems | | BIBAK | Full-Text | 279-284 | |
| Hiroko Tochigi; Kazuhiko Shinozawa; Norihiro Hagita | |||
| This paper investigates the effect on user impressions of the body direction
of a stuffed doll robot in an animation system. Many systems that combine a
computer display with a robot have been developed, and one of their
applications is entertainment, for example, an animation system. In these
systems, the robot, as a 3D agent, can be more effective than a 2D agent in
helping the user enjoy the animation experience by using spatial
characteristics, such as body direction, as a means of expression. The
direction in which the robot faces, i.e., towards the human or towards the
display, is investigated here.
User impressions from 25 subjects were examined. The experiment results show that the robot facing the display together with a user is effective for eliciting good feelings from the user, regardless of the user's personality characteristics. Results also suggest that extroverted subjects tend to have a better feeling towards a robot facing the user than introverted ones. Keywords: animation system, impression evaluation, stuffed doll robot | |||
| Speech-driven embodied entrainment character system with hand motion input in mobile environment | | BIBAK | Full-Text | 285-290 | |
| Kouzi Osaki; Tomio Watanabe; Michiya Yamamoto | |||
| InterActor is a speech-input-driven CG-embodied interaction character that
can generate communicative movements and actions for entrained interactions.
InterPuppet, on the other hand, is an embodied interaction character that is
driven by both speech input (similar to InterActor) and hand motion input, like a
puppet. Therefore, humans can use InterPuppet to effectively communicate by
using deliberate body movements and natural communicative movements and
actions. In this paper, an advanced InterPuppet system that uses a
cellular-phone-type device is developed, which can be used in a mobile
environment. The effectiveness of the system is demonstrated by performing a
sensory evaluation experiment in an actual remote communication scenario. Keywords: cellular phone, embodied communication, embodied interaction, human
communication, human interaction | |||
| Natural multimodal dialogue systems: a configurable dialogue and presentation strategies component | | BIBAK | Full-Text | 291-298 | |
| Meriam Horchani; Benjamin Caron; Laurence Nigay; Franck Panaget | |||
| In the context of natural multimodal dialogue systems, we address the
challenging issue of the definition of cooperative answers in an appropriate
multimodal form. Highlighting the intertwined relation of multimodal outputs
with the content, we focus on the Dialogic strategy component, a component that
selects, from the set of possible contents answering a user's request, the
content to be presented to the user and its multimodal presentation. The
content selection and the presentation allocation managed by the Dialogic
strategy component are based on various constraints such as the availability of
a modality and the user's preferences. Considering three generic types of
dialogue strategies and their corresponding handled types of information as
well as three generic types of presentation tasks, we present a first
implementation of the Dialogic strategy component based on rules. By providing
a graphical interface to configure the component by editing the rules, we show
how the component can be easily modified by ergonomists at design time for
exploring different solutions. In further work we envision letting the user
modify the component at runtime. Keywords: development tool, dialogue and presentation strategies, multimodal output | |||
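A minimal sketch of the kind of rule-based content selection and presentation allocation such a component could perform, assuming a simple first-match rule table; the rule set, context keys, and function names are hypothetical and only illustrate how constraints such as modality availability and user preference might map to a presentation decision.

```python
# Hypothetical sketch of a rule-based dialogue/presentation strategy component
# (rules, context keys and names are illustrative, not the authors' component).

# Each rule maps a condition on the interaction context and the number of
# candidate contents to a presentation decision: a presentation task plus an
# output modality.
RULES = [
    (lambda ctx, n: not ctx["speech_available"], ("enumerate", "graphics")),
    (lambda ctx, n: n > 5,                       ("summarise", "speech")),
    (lambda ctx, n: ctx["prefers"] == "speech",  ("enumerate", "speech")),
    (lambda ctx, n: True,                        ("enumerate", "speech+graphics")),
]

def select_presentation(candidate_contents, ctx):
    """Pick the content to present and its multimodal presentation."""
    n = len(candidate_contents)
    for condition, (task, modality) in RULES:    # first matching rule wins
        if condition(ctx, n):
            content = candidate_contents[:3] if task == "summarise" \
                else candidate_contents
            return {"content": content, "task": task, "modality": modality}

print(select_presentation(["route A", "route B"],
                          {"speech_available": True, "prefers": "speech"}))
```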
| Modeling human interaction resources to support the design of wearable multimodal systems | | BIBAK | Full-Text | 299-306 | |
| Tobias Klug; Max Mühlhäuser | |||
| Designing wearable application interfaces that integrate well into
real-world processes like aircraft maintenance or medical examinations is
challenging. One of the main success criteria is how well the multimodal
interaction with the computer system fits an already existing real-world task.
Therefore, the interface design needs to take the real-world task flow into
account from the beginning.
We propose a model of interaction devices and human interaction capabilities that helps evaluate how well different interaction devices/techniques integrate with specific real-world scenarios. The model was developed based on a survey of wearable interaction research literature. Combining this model with descriptions of observed real-world tasks, possible conflicts between task performance and device requirements can be visualized, helping the interface designer find a suitable solution. Keywords: interaction devices, interaction resource model, multimodal interaction,
wearable computing | |||
| Speech-filtered bubble ray: improving target acquisition on display walls | | BIBAK | Full-Text | 307-314 | |
| Edward Tse; Mark Hancock; Saul Greenberg | |||
| The rapid development of large interactive wall displays has been
accompanied by research on methods that allow people to interact with the
display at a distance. The basic method for target acquisition is by ray
casting a cursor from one's pointing finger or hand position; the problem is
that selection is slow and error-prone with small targets. A better method is
the bubble cursor that resizes the cursor's activation area to effectively
enlarge the target size. The catch is that this technique's effectiveness
depends on the proximity of surrounding targets: while beneficial in sparse
spaces, it is less so when targets are densely packed together. Our method is
the speech-filtered bubble ray that uses speech to transform a dense target
space into a sparse one. Our strategy builds on what people already do: people
pointing to distant objects in a physical workspace typically disambiguate
their choice through speech. For example, a person could point to a stack of
books and say "the green one". Gesture indicates the approximate location for
the search, and speech 'filters' unrelated books from the search. Our technique
works the same way; a person specifies a property of the desired object, and
only the locations of objects matching that property affect the bubble's size. In
a controlled evaluation, people were faster and preferred using the
speech-filtered bubble ray over the standard bubble ray and ray casting
approach. Keywords: freehand interaction, gestures, large display walls, multimodal, pointing,
speech, speech filtering | |||
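A minimal sketch of the selection idea described above, assuming it reduces to picking, from the targets whose properties match the spoken keyword, the one nearest to the ray-cast point; the class, function, and example data are hypothetical rather than the authors' implementation.

```python
# Illustrative sketch: a spoken keyword first filters the target set, and a
# bubble-cursor-style rule then selects the matching target nearest to the
# ray's intersection with the display.
from dataclasses import dataclass
from math import hypot

@dataclass
class Target:
    x: float          # position on the wall display (display coordinates)
    y: float
    properties: set   # e.g. {"green", "book"}

def speech_filtered_bubble_pick(targets, ray_x, ray_y, spoken_word):
    """Return the target whose properties match the spoken word and whose
    centre lies closest to the ray-cast point (ray_x, ray_y)."""
    candidates = [t for t in targets if spoken_word in t.properties]
    if not candidates:          # nothing matches: fall back to all targets
        candidates = list(targets)
    return min(candidates, key=lambda t: hypot(t.x - ray_x, t.y - ray_y))

# Example: pointing roughly at a cluster and saying "green" picks the green
# item even though a red item lies slightly closer to the ray.
items = [Target(100, 100, {"red"}), Target(110, 105, {"green"})]
print(speech_filtered_bubble_pick(items, 102, 101, "green"))
```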
| Using pen input features as indices of cognitive load | | BIBAK | Full-Text | 315-318 | |
| Natalie Ruiz; Ronnie Taib; Yu (David) Shi; Eric Choi; Fang Chen | |||
| Multimodal interfaces are known to be useful in map-based applications, and
in complex, time-pressure-based tasks. Cognitive load variations in such tasks
have been found to impact multimodal behaviour. For example, users become more
multimodal and tend towards semantic complementarity as cognitive load
increases. The richness of multimodal data means that systems could monitor
particular input features to detect experienced load variations. In this paper,
we present our attempt to induce controlled levels of load and solicit natural
speech and pen-gesture inputs. In particular, we analyse such features in
the pen-gesture modality. Our experimental design relies on a map-based Wizard
of Oz setup, using a tablet PC. This paper details analysis of pen-gesture
interaction across subjects, and presents suggestive trends of increasing
pen-gesture degeneration in some subjects, and possible trends in
gesture kinematics, when cognitive load increases. Keywords: cognitive load, multimodal, pen gesture, speech | |||
| Automated generation of non-verbal behavior for virtual embodied characters | | BIBAK | Full-Text | 319-322 | |
| Werner Breitfuss; Helmut Prendinger; Mitsuru Ishizuka | |||
| In this paper we introduce a system that automatically adds different types
of non-verbal behavior to a given dialogue script between two virtual embodied
agents. It allows us to transform a dialogue in text format into an agent
behavior script enriched by eye gaze and conversational gesture behavior. The
agents' gaze behavior is informed by theories of human face-to-face gaze
behavior. Gestures are generated based on the analysis of linguistic and
contextual information of the input text. The resulting annotated dialogue
script is then transformed into the Multimodal Presentation Markup Language for
3D agents (MPML3D), which controls the multi-modal behavior of animated
life-like agents, including facial and body animation and synthetic speech.
Using our system makes it very easy to add appropriate non-verbal behavior to a
given dialogue text, a task that would otherwise be very cumbersome and
time-consuming. Keywords: animation agent systems, multi-modal presentation, multimodal input and
output interfaces, processing of language and action patterns | |||
| Detecting communication errors from visual cues during the system's conversational turn | | BIBAK | Full-Text | 323-326 | |
| Sy Bor Wang; David Demirdjian; Trevor Darrell | |||
| Automatic detection of communication errors in conversational systems has
been explored extensively in the speech community. However, most previous
studies have used only acoustic cues. Visual information has also been used by
the speech community to improve speech recognition in dialogue systems, but
this visual information is only used when the speaker is communicating vocally.
A recent perceptual study indicated that human observers can detect
communication problems when they see the visual footage of the speaker during
the system's reply. In this paper, we present work in progress towards the
development of a communication error detector that exploits this visual cue. In
datasets we collected or acquired, facial motion features and head poses were
estimated while users were listening to the system response, and these
features were passed to a classifier to detect communication errors. Preliminary experiments have
demonstrated that the speaker's visual information during the system's reply is
potentially useful and accuracy of automatic detection is close to human
performance. Keywords: conversational systems, system error detection, visual feedback | |||
| Multimodal interaction analysis in a smart house | | BIBAK | Full-Text | 327-334 | |
| Pilar Manchón; Carmen del Solar; Gabriel Amores; Guillermo Pérez | |||
| This is a large extension of a previous paper presented at LREC 2006 [6]. It
describes the motivation, collection and format of the MIMUS corpus, as well as
an in-depth and issue-focused analysis of the data. MIMUS [8] is the result of
multimodal WoZ experiments conducted at the University of Seville as part of
the TALK project. The main objective of the MIMUS corpus was to gather
information about different users and their performance, preferences and usage
of a multimodal multilingual natural dialogue system in the Smart Home scenario
in Spanish. The focus group is composed of wheelchair-bound users, because of
their special motivation to use this kind of technology, along with their
specific needs. Throughout this article, the WoZ platform, experiments,
methodology, annotation schemes and tools, and all relevant data will be
discussed, as well as the results of the in-depth analysis of these data. The
corpus comprises a set of three related experiments. Due to the limited scope
of this article, only some results related to the first two experiments (1A and
1B) will be discussed. This article will focus on subjects' preferences,
multimodal behavioural patterns and willingness to use this kind of technology. Keywords: HCI, mixed-modality events, multimodal corpus, multimodal entries,
multimodal experiments, multimodal interaction | |||
| A multi-modal mobile device for learning Japanese kanji characters through mnemonic stories | | BIBAK | Full-Text | 335-338 | |
| Norman Lin; Shoji Kajita; Kenji Mase | |||
| We describe the design of a novel multi-modal, mobile computer system to
support foreign students in learning Japanese kanji characters through creation
of mnemonic stories. Our system treats complicated kanji shapes as hierarchical
compositions of smaller shapes (following Heisig, 1986) and allows hyperlink
navigation to quickly follow whole-part relationships. Visual display of kanji
shape and meaning is augmented with user-supplied mnemonic stories in audio
form, thereby dividing the learning information multi-modally into visual and
audio modalities. A device-naming scheme and color-coding allow for
asynchronous sharing of audio mnemonic stories among different users' devices.
We describe the design decisions for our mobile multi-modal interface and
present initial usability results based on feedback from beginning kanji
learners. Our combination of mnemonic stories, audio and video modalities, and
mobile device provides a new and effective system for computer-assisted kanji
learning. Keywords: Chinese characters, JSL, Japanese as a second language, kanji, language
education, mobile computing | |||
| 3D augmented mirror: a multimodal interface for string instrument learning and teaching with gesture support | | BIBAK | Full-Text | 339-345 | |
| Kia C. Ng; Tillman Weyde; Oliver Larkin; Kerstin Neubarth; Thijs Koerselman; Bee Ong | |||
| Multimodal interfaces can open up new possibilities for music education,
where the traditional model of teaching is based predominantly on verbal
feedback. This paper explores the development and use of multimodal interfaces
in novel tools to support music practice training. The design of multimodal
interfaces for music education presents a challenge in several respects. One is
the integration of multimodal technology into the music learning process. The
other is the technological development, where we present a solution that aims
to support string practice training with visual and auditory feedback. Building
on the traditional function of a physical mirror as a teaching aid, we describe
the concept and development of an "augmented mirror" using 3D motion capture
technology. Keywords: 3d, education, feedback, gesture, interface, motion capture, multimodal,
music, sonification, visualisation, visualization | |||
| Interest estimation based on dynamic bayesian networks for visual attentive presentation agents | | BIBAK | Full-Text | 346-349 | |
| Boris Brandherm; Helmut Prendinger; Mitsuru Ishizuka | |||
| In this paper, we describe an interface consisting of a virtual showroom
where a team of two highly realistic 3D agents presents product items in an
entertaining and attractive way. The presentation flow adapts to users'
attentiveness, or lack thereof, and may thus provide a more personalized and
user-attractive experience of the presentation. In order to infer users'
attention and visual interest regarding interface objects, our system analyzes
eye movements in real-time. Interest detection algorithms used in previous
research determine an object of interest based on the time that eye gaze dwells
on that object. However, this kind of algorithm is not well suited for dynamic
presentations where the goal is to assess the user's focus of attention
regarding a dynamically changing presentation. Here, the current context of the
object of attention has to be considered, i.e., whether the visual object is
part of (or contributes to) the current presentation content or not. Therefore,
we propose a new approach that estimates the interest (or non-interest) of a
user by means of dynamic Bayesian networks. Each of a predefined set of visual
objects has a dynamic Bayesian network assigned to it, which calculates the
current interest of the user in this object. The estimation takes into account
(1) each new gaze point, (2) the current context of the object, and (3)
preceding estimations of the object itself. Based on these estimations the
presentation agents can provide timely and appropriate responses. Keywords: dynamic Bayesian network, eye tracking, interest recognition, multi-modal
presentation | |||
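A minimal sketch of the per-object interest update described above, assuming a single binary interest variable filtered recursively rather than the paper's full dynamic Bayesian network; the probabilities and names are illustrative assumptions.

```python
# Minimal sketch: each visual object keeps a recursive Bayesian estimate of
# user interest, updated from (1) whether the new gaze point hits the object,
# (2) whether the object is part of the current presentation context, and
# (3) the preceding estimate. Probabilities are hand-picked for illustration.

def update_interest(prev_p, gaze_hit, in_context,
                    p_stay=0.9, p_hit_if_interested=0.7, p_hit_if_not=0.1):
    # (3) temporal prior: interest tends to persist between time slices
    prior = prev_p * p_stay + (1.0 - prev_p) * (1.0 - p_stay)
    # (2) context: gaze on an object outside the current content is weaker
    # evidence of interest in the presentation
    hit_if_interested = p_hit_if_interested if in_context else 0.3
    # (1) likelihood of the observed gaze point under both hypotheses
    like_int = hit_if_interested if gaze_hit else 1.0 - hit_if_interested
    like_not = p_hit_if_not if gaze_hit else 1.0 - p_hit_if_not
    return prior * like_int / (prior * like_int + (1.0 - prior) * like_not)

p = 0.5
for hit in [True, True, False, True]:      # stream of gaze observations
    p = update_interest(p, gaze_hit=hit, in_context=True)
    print(round(p, 3))
```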
| On-line multi-modal speaker diarization | | BIBAK | Full-Text | 350-357 | |
| Athanasios Noulas; Ben J. A. Krose | |||
| This paper presents a novel framework that utilizes multi-modal information
to achieve speaker diarization. We use dynamic Bayesian networks to achieve
on-line results. We progress from a simple observation model to a complex
multi-modal one as more data becomes available. We present an efficient way to
guide the learning procedure of the complex model using the early results
achieved with the simple model. We present the results achieved in various
real-world situations, including video from webcams, human-computer
interaction and video conferences. Keywords: audio-visual, multi-modal, speaker detection, speaker diarization | |||
| Presentation sensei: a presentation training system using speech and image processing | | BIBAK | Full-Text | 358-365 | |
| Kazutaka Kurihara; Masataka Goto; Jun Ogata; Yosuke Matsusaka; Takeo Igarashi | |||
| In this paper we present a presentation training system that observes a
presentation rehearsal and provides the speaker with recommendations for
improving the delivery of the presentation, such as speaking more slowly and
looking at the audience. Our system "Presentation Sensei" is equipped with a
microphone and camera to analyze a presentation by combining speech and image
processing techniques. Based on the results of the analysis, the system gives
the speaker instant feedback with respect to the speaking rate, eye contact
with the audience, and timing. It also alerts the speaker when some of these
indices exceed predefined warning thresholds. After the presentation, the
system generates visual summaries of the analysis results for the speaker's
self-examination. Our goal is not to improve the content on a semantic level,
but to improve the delivery of it by reducing inappropriate basic behavior
patterns. We asked a few test users to try the system and they found it very
useful for improving their presentations. We also compared the system's output
with the observations of a human evaluator. The results show that the system
successfully detected some inappropriate behavior. The contribution of this
work is to introduce a practical recognition-based human training system and to
show its feasibility despite the limitations of state-of-the-art speech and
video recognition technologies. Keywords: image processing, presentation, sensei, speech processing, training | |||
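A minimal sketch of the threshold-based instant feedback described above; the specific indices, threshold values, and names are illustrative assumptions rather than the system's actual parameters.

```python
# Hedged sketch of the alerting logic: the trainer warns the speaker whenever
# a monitored delivery index crosses a predefined warning threshold.
# Thresholds and metric names below are illustrative, not from the paper.

WARNING_THRESHOLDS = {
    "speaking_rate_wpm": 180.0,            # words per minute considered too fast
    "seconds_without_eye_contact": 10.0,   # time spent not facing the audience
}

def check_indices(indices):
    """Return a list of instant-feedback alerts for the current time window."""
    alerts = []
    if indices["speaking_rate_wpm"] > WARNING_THRESHOLDS["speaking_rate_wpm"]:
        alerts.append("Speak more slowly.")
    if (indices["seconds_without_eye_contact"]
            > WARNING_THRESHOLDS["seconds_without_eye_contact"]):
        alerts.append("Look at the audience.")
    return alerts

print(check_indices({"speaking_rate_wpm": 200.0,
                     "seconds_without_eye_contact": 4.0}))
```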
| The world of mushrooms: human-computer interaction prototype systems for ambient intelligence | | BIBAK | Full-Text | 366-373 | |
| Yasuhiro Minami; Minako Sawaki; Kohji Dohsaka; Ryuichiro Higashinaka; Kentaro Ishizuka; Hideki Isozaki; Tatsushi Matsubayashi; Masato Miyoshi; Atsushi Nakamura; Takanobu Oba; Hiroshi Sawada; Takeshi Yamada; Eisaku Maeda | |||
| Our new research project called "ambient intelligence" concentrates on the
creation of new lifestyles through research on communication science and
intelligence integration. It is premised on the creation of such virtual
communication partners as fairies and goblins that can be constantly at our
side. We call these virtual communication partners mushrooms.
To show the essence of ambient intelligence, we developed two multimodal prototype systems: mushrooms that watch, listen, and answer questions, and a Quizmaster Mushroom. These two systems work in real time using speech, sound, dialogue, and vision technologies. We performed preliminary experiments with the Quizmaster Mushroom. The results showed that the system can transmit knowledge to users while they play quizzes. Furthermore, through the two mushrooms, we identified policies for designing effects in multimodal interfaces and their integration. Keywords: dialog, multimodal interfaces, visual-auditory feedback | |||
| Evaluation of haptically augmented touchscreen GUI elements under cognitive load | | BIBAK | Full-Text | 374-381 | |
| Rock Leung; Karon MacLean; Martin Bue Bertelsen; Mayukh Saubhasik | |||
| Adding expressive haptic feedback to mobile devices has great potential to
improve their usability, particularly in multitasking situations where one's
visual attention is required. Piezoelectric actuators are emerging as one
suitable technology for rendering expressive haptic feedback on mobile devices.
We describe the design of redundant piezoelectric haptic augmentations of
touchscreen GUI buttons, progress bars, and scroll bars, and their evaluation
under varying cognitive load. Our haptically augmented progress bars and scroll
bars led to significantly faster task completion, and favourable subjective
reactions. We further discuss resulting insights into designing useful haptic
feedback for touchscreens and highlight challenges, including means of
enhancing usability, types of interactions where value is maximized, difficulty
in disambiguating background from foreground signals, tradeoffs in haptic
strength vs. resolution, and subtleties in evaluating these types of
interactions. Keywords: GUI elements, haptic feedback, mobile device, multimodal, multitasking,
piezoelectric actuators, touchscreen, usability | |||
| Multimodal interfaces in semantic interaction | | BIBAK | Full-Text | 382 | |
| Naoto Iwahashi; Mikio Nakano | |||
| This workshop addresses the approaches, methods, standardization, and
theories for multimodal interfaces in which machines need to interact with
humans adaptively according to context, such as the situation in the real world
and each human's individual characteristics. To realize such interaction,
which we term semantic interaction, it is necessary to extract and use the valuable context
information needed for understanding interaction from the obtained real-world
information. In addition, it is important for the user and the machine to share
knowledge and an understanding of a given situation naturally through speech,
images, graphics, manipulators, and so on. Submitted papers address these
topics from diverse fields, such as human-robot interaction, machine learning,
and game design. Keywords: context, human-robot interaction, multimodal interface, semantic
interaction, situatedness | |||
| Workshop on tagging, mining and retrieval of human related activity information | | BIBAK | Full-Text | 383-384 | |
| Paulo Barthelmess; Edward Kaiser | |||
| Inexpensive and user-friendly cameras, microphones, and other devices such
as digital pens are making it increasingly easy to capture, store and process
large amounts of data over a variety of media. Even though the barriers for
data acquisition have been lowered, making use of these data remains
challenging. The focus of the present workshop is on issues related to theory,
methods and techniques for facilitating the organization, retrieval and reuse
of multimodal information. The emphasis is on organization and retrieval of
information related to human activity, i.e., information that is generated and consumed by
individuals and groups as they go about their work, learning and leisure. Keywords: browsing, mining, multimedia, multimodal, retrieval, tagging | |||
| Workshop on massive datasets | | BIBK | Full-Text | 385 | |
| Christopher R. Wren; Yuri A. Ivanov | |||
Keywords: architecture, data mining, evaluation, motion, sensor networks, tracking,
visualization | |||