| Weight, weight, don't tell me | | BIBA | Full-Text | 1 | |
| Ted Warburton | |||
| Remember the "Internet's firstborn," Ron Lussier's dancing baby from 1996? Other than a vague sense of repeated gyrations, no one can recall any of the movements in particular. Why is that? While that animation was ground-breaking in many respects, to paraphrase a great writer, there was no there there. The dancing baby lacked personality because the movements themselves lacked "weight." Each human being has a unique perceivable movement style composed of repeated recognizable elements that in combination and phrasing capture the liveliness of movement. The use of weight, or "effort quality," is a key element in movement style, defining a dynamic expressive range. In computer representation of human movement, however, weight is often an aspect of life-ness that gets diminished or lost in the process, contributing to a lack of groundedness, personality, and verisimilitude. In this talk, I unpack the idea of effort quality and describe current work with motion capture and telematics that puts the weight back on interface design. | |||
| Movement and music: designing gestural interfaces for computer-based musical instruments | | BIBA | Full-Text | 2 | |
| Sile O'Modhrain | |||
| The concept of body-mediated or embodied interaction, of the coupling of interface and actor, has become increasingly relevant within the domain of HCI. With the reduced size and cost of a wide variety of sensor technologies and the ease with which they can be wirelessly deployed, on the body, in devices we carry with us and in the environment, comes the opportunity to use a wide range of human motion as an integral part of our interaction with many applications. While movement is potentially a rich, multidimensional source of information upon which interface designers can draw, its very richness poses many challenges in developing robust motion capture and gesture recognition systems. In this talk, I will suggest that lessons learned by designers of computer-based musical instruments whose task is to translate expressive movement into nuanced control of sound may now help to inform the design of movement-based interfaces for a much wider range of applications. | |||
| Mixing virtual and actual | | BIBA | Full-Text | 3 | |
| Herbert H. Clark | |||
| People often communicate with a mixture of virtual and actual elements. On the telephone, my sister and I and what we say are actual, even though our voices are virtual. In the London Underground, the warning expressed in the recording "Stand clear of the doors" is actual, even though the person making it is virtual. In the theater, Shakespeare, the actors, and I are actual, even though Romeo and Juliet and what they say are virtual. Mixtures like these cannot be accounted for in standard models of communication, for a variety of reasons. In this talk I introduce the notion of displaced actions (as on the telephone, in the London Underground, and in the theater) and characterize how they are used and interpreted in communication with a range of modern-day technologies. | |||
| Collaborative multimodal photo annotation over digital paper | | BIBAK | Full-Text | 4-11 | |
| Paulo Barthelmess; Edward Kaiser; Xiao Huang; David McGee; Philip Cohen | |||
| The availability of metadata annotations over media content such as photos
is known to enhance retrieval and organization, particularly for large data
sets. The greatest challenge for obtaining annotations remains getting users to
perform the large amount of tedious manual work that is required.
In this paper we introduce an approach for semi-automated labeling based on extraction of metadata from naturally occurring conversations of groups of people discussing pictures among themselves. As the burden for structuring and extracting metadata is shifted from users to the system, new recognition challenges arise. We explore how multimodal language can help in 1) detecting a concise set of meaningful labels to be associated with each photo, 2) achieving robust recognition of these key semantic terms, and 3) facilitating label propagation via multimodal shortcuts. Analysis of the data of a preliminary pilot collection suggests that handwritten labels may be highly indicative of the semantics of each photo, as indicated by the correlation of handwritten terms with high frequency spoken ones. We point to initial directions exploring a multimodal fusion technique to recover robust spelling and pronunciation of these high-value terms from redundant speech and handwriting. Keywords: automatic label extraction, collaborative interaction, intelligent
interfaces, multimodal processing, photo annotation | |||
| MyConnector: analysis of context cues to predict human availability for communication | | BIBAK | Full-Text | 12-19 | |
| Maria Danninger; Tobias Kluge; Rainer Stiefelhagen | |||
| In this thriving world of mobile communications, the difficulty of
communication is no longer contacting someone, but rather contacting people in
a socially appropriate manner. Ideally, senders should have some understanding
of a receiver's availability in order to make contact at the right time, in the
right contexts, and with the optimal communication medium.
We describe the design and implementation of MyConnector, an adaptive and context-aware service designed to facilitate efficient and appropriate communication, based on each party's availability. One of the chief design questions for such a service is how to produce technologies with sufficient contextual awareness to decide upon a person's availability for communication. We present results from a pilot study comparing a number of context cues and their predictive power for gauging one's availability. Keywords: availability, computer-mediated communication, context-aware communication,
interruptibility, user models | |||
| Human perception of intended addressee during computer-assisted meetings | | BIBAK | Full-Text | 20-27 | |
| Rebecca Lunsford; Sharon Oviatt | |||
| Recent research aims to develop new open-microphone engagement techniques
capable of identifying when a speaker is addressing a computer versus human
partner, including during computer-assisted group interactions. The present
research explores: (1) how accurately people can judge whether an intended
interlocutor is a human versus computer, (2) which linguistic,
acoustic-prosodic, and visual information sources they use to make these
judgments, and (3) what type of systematic errors are present in their
judgments. Sixteen participants were asked to determine a speaker's intended
addressee based on actual videotaped utterances matched on illocutionary force,
which were played back as: (1) lexical transcriptions only, (2) audio-only, (3)
visual-only, and (4) audio-visual information. Perhaps surprisingly, people's
accuracy in judging human versus computer addressees did not exceed chance
levels with lexical-only content (46%). As predicted, accuracy improved
significantly with audio (58%), visual (57%), and especially audio-visual
information (63%). Overall, accuracy in detecting human interlocutors was
significantly worse than judging computer ones, and specifically worse when
only visual information was present because speakers often looked at the
computer when addressing peers. In contrast, accuracy in judging computer
interlocutors was significantly better whenever visual information was present
than with audio alone, and it yielded the highest accuracy levels observed
(86%). Questionnaire data also revealed that speakers' gaze, peers' gaze, and
tone of voice were considered the most valuable information sources. These
results reveal that people rely on cues appropriate for interpersonal
interactions in determining computer- versus human-directed speech during mixed
human-computer interactions, even though this degrades their accuracy. Future
systems that process actual rather than expected communication patterns
could potentially be designed to perform better than humans. Keywords: acoustic-prosodic cues, dialogue style, gaze, human-computer teamwork,
intended addressee, multiparty interaction, open-microphone engagement | |||
| Automatic detection of group functional roles in face to face interactions | | BIBAK | Full-Text | 28-34 | |
| Massimo Zancanaro; Bruno Lepri; Fabio Pianesi | |||
| In this paper, we discuss a machine learning approach to automatically
detect functional roles played by participants in a face to face interaction.
We shortly introduce the coding scheme we used to classify the roles of the
group members and the corpus we collected to assess the coding scheme
reliability as well as to train statistical systems for automatic recognition
of roles. We then discuss a machine learning approach based on multi-class SVM
to automatically detect such roles by employing simple features of the visual
and acoustical scene. The effectiveness of the classification is better than
the chosen baselines and although the results are not yet good enough for a
real application, they demonstrate the feasibility of the task of detecting
group functional roles in face to face interactions. Keywords: group interaction, intelligent environments, support vector machines | |||
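To make the classification setup in the entry above concrete, here is a minimal sketch of a multi-class SVM trained on simple per-participant audio-visual features with scikit-learn. The feature set, labels, and data are invented for illustration; this is not the authors' system or corpus.

```python
# Illustrative sketch only: multi-class SVM over simple audio-visual features
# for functional-role classification (synthetic data, invented feature set).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# One row per participant per analysis window: e.g. speaking-time ratio,
# number of turns, mean turn length, body-motion energy.
X = rng.random((120, 4))
# Functional-role labels from a small coding scheme (here: 5 classes).
y = rng.integers(0, 5, size=120)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```

scikit-learn's SVC handles the multi-class case internally via one-vs-one decomposition, which is one common way to realize a "multi-class SVM" such as the one the abstract refers to.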
| Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech | | BIBAK | Full-Text | 35-38 | |
| Hari Krishna Maganti; Daniel Gatica-Perez | |||
| Accurate speaker location is essential for optimal performance of distant
speech acquisition systems using microphone array techniques. However, to the
best of our knowledge, no comprehensive studies on the degradation of automatic
speech recognition (ASR) as a function of speaker location accuracy in a
multi-party scenario exist. In this paper, we describe a framework for
evaluation of the effects of speaker location errors on a microphone
array-based ASR system, in the context of meetings in multi-sensor rooms
comprising multiple cameras and microphones. Speakers are manually annotated in
videos in different camera views, and triangulation is used to determine an
accurate speaker location. Errors in the speaker location are then induced in a
systematic manner to observe their influence on speech recognition performance.
The system is evaluated on real overlapping speech data collected with
simultaneous speakers in a meeting room. The results are compared with those
obtained from close-talking headset microphones, lapel microphones, and speaker
location based on audio-only and audio-visual information approaches. Keywords: audio-visual speaker tracking, microphone array ASR | |||
| Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't | | BIBAK | Full-Text | 39-42 | |
| Cosmin Munteanu; Gerald Penn; Ron Baecker; Yuecheng Zhang | |||
| The increased availability of broadband connections has recently led to an
increase in the use of Internet broadcasting (webcasting). Most webcasts are
archived and accessed numerous times retrospectively. One challenge to skimming
and browsing through such archives is the lack of text transcripts of the
webcast's audio channel. This paper describes a procedure for prototyping an
Automatic Speech Recognition (ASR) system that generates realistic transcripts
of any desired Word Error Rate (WER), thus overcoming the drawbacks of both
prototype-based and Wizard of Oz simulations. We used such a system in a user
study showing that transcripts with WERs less than 25% are acceptable for use
in webcast archives. As current ASR systems can only deliver, in realistic
conditions, Word Error Rates (WERs) of around 45%, we also describe a solution
for reducing the WER of such transcripts by engaging users to collaborate in a
"wiki" fashion on editing the imperfect transcripts obtained through ASR. Keywords: automatic speech recognition, collaboration, webcasts | |||
| Cross-modal coordination of expressive strength between voice and gesture for personified media | | BIBAK | Full-Text | 43-50 | |
| Tomoko Yonezawa; Noriko Suzuki; Shinji Abe; Kenji Mase; Kiyoshi Kogure | |||
| The aim of this paper is to clarify the relationship between the expressive
strengths of gestures and voice for embodied and personified interfaces. We
conduct perceptual tests using a puppet interface, while controlling
singing-voice expressions, to empirically determine the naturalness and
strength of various combinations of gesture and voice. The results show that
(1) the strength of cross-modal perception is affected more by gestural
expression than by the expressions of a singing voice, and (2) the
appropriateness of cross-modal perception is affected by expressive
combinations between singing voice and gestures in personified expressions. As
a promising solution, we propose balancing a singing voice and gestural
expressions by expanding and correcting the width and shape of the curve of
expressive strength in the singing voice. Keywords: cross-modality, perceptual experiment, personified puppet-interface,
vocal-gestural expression | |||
| VirtualHuman: dialogic and affective interaction with virtual characters | | BIBAK | Full-Text | 51-58 | |
| Norbert Reithinger; Patrick Gebhard; Markus Löckelt; Alassane Ndiaye; Norbert Pfleger; Martin Klesen | |||
| Natural multimodal interaction with realistic virtual characters provides
rich opportunities for entertainment and education. In this paper we present
the current VirtualHuman demonstrator system. It provides a knowledge-based
framework to create interactive applications in a multi-user, multi-agent
setting. The behavior of the virtual humans and objects in the 3D environment
is controlled by interacting affective conversational dialogue engines. An
elaborate model of affective behavior adds natural emotional reactions and
presence to the virtual humans. Actions are defined in an XML-based markup
language that supports the incremental specification of synchronized multimodal
output. The system was successfully demonstrated during CeBIT 2006. Keywords: AI techniques & adaptive multimodal interfaces, mobile, tangible &
virtual/augmented multimodal interfaces, multimodal input and output
interfaces, speech and conversational interfaces | |||
| From vocal to multimodal dialogue management | | BIBAK | Full-Text | 59-67 | |
| Miroslav Melichar; Pavel Cenek | |||
| Multimodal, speech-enabled systems pose different research problems when
compared to unimodal, voice-only dialogue systems. One of the important issues
is the question of what a multimodal interface should look like in order to make
the multimodal interaction natural and smooth, while keeping it manageable from
the system perspective. Another central issue concerns algorithms for
multimodal dialogue management. This paper presents a solution that relies on
adapting an existing unimodal, vocal dialogue management framework to make it
able to cope with multimodality. An experimental multimodal system, Archivus,
is described together with discussion of the required changes to the unimodal
dialogue management algorithms. Results of pilot Wizard of Oz experiments with
Archivus focusing on system efficiency and user behaviour are presented. Keywords: Wizard of Oz, dialogue management, dialogue systems, graphical user
interface (GUI), human computer interaction (HCI), multimodal systems, rapid
dialogue prototyping | |||
| Human-Robot dialogue for joint construction tasks | | BIBAK | Full-Text | 68-71 | |
| Mary Ellen Foster; Tomas By; Markus Rickert; Alois Knoll | |||
| We describe a human-robot dialogue system that allows a human to collaborate
with a robot agent on assembling construction toys. The human and the robot are
fully equal peers in the interaction, rather than simply partners. Joint action
is supported at all stages of the interaction: the participants agree on a
construction task, jointly decide how to proceed with the task, and
also implement the selected plans jointly. The symmetry provides novel
challenges for a dialogue system, and also makes it possible for findings from
human-human joint-action dialogues to be easily implemented and tested. Keywords: human-robot interaction, multimodal dialogue | |||
| roBlocks: a robotic construction kit for mathematics and science education | | BIBAK | Full-Text | 72-75 | |
| Eric Schweikardt; Mark D. Gross | |||
| We describe work in progress on roBlocks, a computational construction kit
that encourages users to experiment and play with a collection of sensor, logic
and actuator blocks, exposing them to a variety of advanced concepts including
kinematics, feedback and distributed control. Its interface presents novice
users with a simple, tangible set of robotic blocks, whereas advanced users
work with software tools to analyze and rewrite the programs embedded in each
block. Early results suggest that roBlocks may be an effective vehicle to
expose young people to complex ideas in science, technology, engineering and
mathematics. Keywords: construction kit, robotics education, tangible interface | |||
| GSI demo: multiuser gesture/speech interaction over digital tables by wrapping single user applications | | BIBAK | Full-Text | 76-83 | |
| Edward Tse; Saul Greenberg; Chia Shen | |||
| Most commercial software applications are designed for a single user using a
keyboard/mouse over an upright monitor. Our interest is exploiting these
systems so they work over a digital table. Mirroring what people do when
working over traditional tables, we want to allow multiple people to interact
naturally with the tabletop application and with each other via rich speech and
hand gestures. In previous papers, we illustrated multi-user gesture and speech
interaction on a digital table for geospatial applications -- Google Earth,
Warcraft III and The Sims. In this paper, we describe our underlying
architecture: GSI Demo. First, GSI Demo creates a run-time wrapper around
existing single user applications: it accepts and translates speech and
gestures from multiple people into a single stream of keyboard and mouse inputs
recognized by the application. Second, it lets people use multimodal
demonstration -- instead of programming -- to quickly map their own speech and
gestures to these keyboard/mouse inputs. For example, continuous gestures are
trained by saying "Computer, when I do [one finger gesture], you do [mouse
drag]". Similarly, discrete speech commands can be trained by saying "Computer,
when I say [layer bars], you do [keyboard and mouse macro]". The end result is
that end users can rapidly transform single user commercial applications into a
multi-user, multimodal digital tabletop system. Keywords: digital tables, multimodal input, programming by demonstration | |||
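The wrapper idea described in the entry above can be pictured as a lookup from recognized multimodal events to keyboard/mouse macros, with the table filled in by demonstration rather than by programming. The toy sketch below is only an illustration of that mapping; all event and macro names are invented.

```python
# Toy illustration of binding speech/gesture events to keyboard/mouse macros
# by demonstration (invented event and macro names).
mappings = {}

def demonstrate(trigger, macro):
    """Bind a recognized (modality, phrase-or-gesture) event to a macro."""
    mappings[trigger] = macro

def translate(event):
    """At run time, turn a recognized event into keyboard/mouse input."""
    return mappings.get(event, [])

demonstrate(("gesture", "one finger"), ["mouse_down", "mouse_drag", "mouse_up"])
demonstrate(("speech", "layer bars"), ["key_press ctrl+l", "click 120 240"])
print(translate(("speech", "layer bars")))
```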
| Co-Adaptation of audio-visual speech and gesture classifiers | | BIBAK | Full-Text | 84-91 | |
| C. Mario Christoudias; Kate Saenko; Louis-Philippe Morency; Trevor Darrell | |||
| The construction of robust multimodal interfaces often requires large
amounts of labeled training data to account for cross-user differences and
variation in the environment. In this work, we investigate whether unlabeled
training data can be leveraged to build more reliable audio-visual classifiers
through co-training, a multi-view learning algorithm. Multimodal tasks are good
candidates for multi-view learning, since each modality provides a potentially
redundant view to the learning algorithm. We apply co-training to two problems:
audio-visual speech unit classification, and user agreement recognition using
spoken utterances and head gestures. We demonstrate that multimodal co-training
can be used to learn from only a few labeled examples in one or both of the
audio-visual modalities. We also propose a co-adaptation algorithm, which
adapts existing audio-visual classifiers to a particular user or noise
condition by leveraging the redundancy in the unlabeled data. Keywords: adaptation, audio-visual speech and gesture, co-training, human-computer
interfaces, semi-supervised learning | |||
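As a concrete illustration of the co-training idea the entry above refers to, the sketch below alternates between an audio-view and a visual-view classifier, letting each one pseudo-label the unlabeled examples it is most confident about and feeding those back into the shared labeled pool. It is a generic, textbook-style sketch with synthetic data and logistic-regression view classifiers, not the authors' implementation.

```python
# Generic co-training sketch over two feature views (audio Xa, visual Xv);
# synthetic data shapes, not the authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cotrain(Xa_l, Xv_l, y_l, Xa_u, Xv_u, rounds=5, per_round=10):
    clf_a = clf_v = None
    for _ in range(rounds):
        clf_a = LogisticRegression(max_iter=1000).fit(Xa_l, y_l)
        clf_v = LogisticRegression(max_iter=1000).fit(Xv_l, y_l)
        if len(Xa_u) == 0:
            break
        picks, labels = [], []
        for clf, X_view in ((clf_a, Xa_u), (clf_v, Xv_u)):
            proba = clf.predict_proba(X_view)[:, 1]
            idx = np.argsort(-np.abs(proba - 0.5))[:per_round]   # most confident
            picks.append(idx)
            labels.append((proba[idx] > 0.5).astype(int))
        pick = np.concatenate(picks)
        pseudo = np.concatenate(labels)
        pick, first = np.unique(pick, return_index=True)          # de-duplicate
        pseudo = pseudo[first]
        # Move the pseudo-labeled examples into the labeled pool for both views.
        Xa_l = np.vstack([Xa_l, Xa_u[pick]])
        Xv_l = np.vstack([Xv_l, Xv_u[pick]])
        y_l = np.concatenate([y_l, pseudo])
        keep = np.setdiff1d(np.arange(len(Xa_u)), pick)
        Xa_u, Xv_u = Xa_u[keep], Xv_u[keep]
    return clf_a, clf_v

rng = np.random.default_rng(0)
y_l = np.array([0, 1] * 5)                                        # few labels
clf_a, clf_v = cotrain(rng.random((10, 5)), rng.random((10, 3)), y_l,
                       rng.random((80, 5)), rng.random((80, 3)))
```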
| Towards the integration of shape-related information in 3-D gestures and speech | | BIBAK | Full-Text | 92-99 | |
| Timo Sowa | |||
| This paper presents a model for the unified semantic representation of shape
conveyed by speech and coverbal 3-D gestures. The representation is tailored to
capture the semantic contributions of both modalities during free descriptions
of objects. It is shown how the semantic content of shape-related adjectives,
nouns, and iconic gestures can be modeled and combined when they occur together
in multimodal utterances like "a longish bar" + iconic gesture. The model has
been applied for the development of a prototype system for gesture recognition
and integration with speech. Keywords: gesture, multimodal integration, shape, speech | |||
| Which one is better?: information navigation techniques for spatially aware handheld displays | | BIBAK | Full-Text | 100-107 | |
| Michael Rohs; Georg Essl | |||
| Information navigation techniques for handheld devices support interacting
with large virtual spaces on small displays, for example finding targets on a
large-scale map. Since only a small part of the virtual space can be shown on
the screen at once, typical interfaces allow for scrolling and panning to reach
off-screen content. Spatially aware handheld displays sense their position and
orientation in physical space in order to provide a corresponding view in
virtual space. We implemented various one-handed navigation techniques for
camera-tracked spatially aware displays. The techniques are compared in a
series of abstract selection tasks that require the investigation of different
levels of detail. The tasks are relevant for interfaces that enable navigating
large scale maps and finding contextual information on them. The results show
that halo is significantly faster than other techniques. In complex situations
zoom and halo show comparable performance. Surprisingly, the combination of
halo and zooming is detrimental to user performance. Keywords: camera phones, handheld devices, information navigation, navigation aids,
small displays, spatial cognition, spatial interaction, spatially aware
displays | |||
| Comparing the effects of visual-auditory and visual-tactile feedback on user performance: a meta-analysis | | BIBAK | Full-Text | 108-117 | |
| Jennifer L. Burke; Matthew S. Prewett; Ashley A. Gray; Liuquin Yang; Frederick R. B. Stilson; Michael D. Coovert; Linda R. Elliot; Elizabeth Redden | |||
| In a meta-analysis of 43 studies, we examined the effects of multimodal
feedback on user performance, comparing visual-auditory and visual-tactile
feedback to visual feedback alone. Results indicate that adding an additional
modality to visual feedback improves performance overall. Both visual-auditory
feedback and visual-tactile feedback provided advantages in reducing reaction
times and improving performance scores, but were not effective in reducing
error rates. Effects are moderated by task type, workload, and number of tasks.
Visual-auditory feedback is most effective when a single task is being
performed (g = .87), and under normal workload conditions (g = .71).
Visual-tactile feedback is more effective when multiple tasks are being
performed (g = .77) and workload conditions are high (g = .84). Both types of
multimodal feedback are effective for target acquisition tasks; but vary in
effectiveness for other task types. Implications for practice and research are
discussed. Keywords: meta-analysis, multimodal interface, visual-auditory feedback,
visual-tactile feedback | |||
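The g values reported in the entry above are, by meta-analytic convention, Hedges' g effect sizes (standardized mean differences with a small-sample correction). For reference, the sketch below shows the standard two-group computation with made-up numbers; the meta-analysis itself pools such values across the 43 studies, which this sketch does not attempt.

```python
# Standard Hedges' g for two independent groups (small-sample corrected);
# the data below are made up for illustration.
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    # Pooled standard deviation.
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled                  # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)       # small-sample correction factor
    return j * d

# Hypothetical example: reaction times with vs. without added tactile feedback.
print(round(hedges_g(m1=612, sd1=85, n1=20, m2=548, sd2=80, n2=20), 2))
```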
| Multimodal estimation of user interruptibility for smart mobile telephones | | BIBAK | Full-Text | 118-125 | |
| Robert Malkin; Datong Chen; Jie Yang; Alex Waibel | |||
| Context-aware computer systems are characterized by the ability to consider
user state information in their decision logic. One example application of
context-aware computing is the smart mobile telephone. Ideally, a smart mobile
telephone should be able to consider both social factors (i.e., known
relationships between contactor and contactee) and environmental factors (i.e.,
the contactee's current locale and activity) when deciding how to handle an
incoming request for communication.
Toward providing this kind of user state information and improving the ability of the mobile phone to handle calls intelligently, we present work on inferring environmental factors from sensory data and using this information to predict user interruptibility. Specifically, we learn the structure and parameters of a user state model from continuous ambient audio and visual information from periodic still images, and attempt to associate the learned states with user-reported interruptibility levels. We report experimental results using this technique on real data, and show how such an approach can allow for adaptation to specific user preferences. Keywords: HMMs, context awareness, hierarchical HMMs, scene learning, smart mobile
telephones, user interruptibility | |||
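The "learn scene states from ambient data, then relate them to reported interruptibility" approach in the entry above can be pictured with the simplified sketch below, which uses the third-party hmmlearn package, a flat Gaussian HMM, and synthetic features. The paper's model (hierarchical HMMs over continuous audio plus periodic images) is more elaborate; this is only a stand-in.

```python
# Simplified illustration (requires the hmmlearn package): fit an unsupervised
# HMM over ambient audio/visual features, then associate each learned state
# with user-reported interruptibility by majority vote. Data are synthetic.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
features = rng.random((500, 6))               # e.g. audio energy, motion, ...
reported = rng.integers(0, 3, size=500)       # interruptibility ratings

model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
model.fit(features)
states = model.predict(features)

# Map each learned scene state to the most frequent rating observed in it.
state_to_rating = {int(s): int(np.bincount(reported[states == s]).argmax())
                   for s in np.unique(states)}
print(state_to_rating)
```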
| Short message dictation on Symbian series 60 mobile phones | | BIBAK | Full-Text | 126-127 | |
| E. Karpov; I. Kiss; J. Leppänen; J. Olsen; D. Oria; S. Sivadas; J. Tian | |||
| Dictation of natural language text on embedded mobile devices is a
challenging task. First, it involves memory and CPU-efficient implementation of
robust speech recognition algorithms that are generally resource demanding.
Secondly, the acoustic and language models employed in the recognizer require
the availability of suitable text and speech language resources, typically for
a wide set of languages. Thirdly, a proper design of the UI is also essential.
The UI has to provide intuitive and easy means for dictation and error
correction, and must be suitable for a mobile usage scenario. In this
demonstrator, an embedded speech recognition system for short message (SMS)
dictation in US English is presented. The system is running on Nokia Series 60
mobile phones (e.g., N70, E60). The system's vocabulary is 23 thousand words.
Its Flash and RAM memory footprints are small, 2 and 2.5 megabytes,
respectively. After a short enrollment session, most native speakers can
achieve a word accuracy of over 90% when dictating short messages in quiet or
moderately noisy environments. Keywords: embedded dictation, low complexity, low footprint, mobile dictation UI,
speech recognition | |||
| The NIST smart data flow system II multimodal data transport infrastructure | | BIBAK | Full-Text | 128 | |
| Antoine Fillinger; Stéphane Degré; Imad Hamchi; Vincent Stanford | |||
| Multimodal interfaces require numerous computing devices, sensors, and
dynamic networking, to acquire, transport, and process the sensor streams
necessary to sense human activities and respond to them. The NIST Smart Data
Flow System Version II embodies many improvements requested by the research
community including multiple operating systems, simplified data transport
protocols, additional language bindings, an extensible object oriented
architecture, and improved fault tolerance. Keywords: data streams, distributed computing, multimodal data transport
infrastructure, smart data flow, smart spaces | |||
| A contextual multimodal integrator | | BIBAK | Full-Text | 129-130 | |
| Péter Pál Boda | |||
| Multimodal Integration addresses the problem of combining various user
inputs into a single semantic representation that can be used in deciding the
next step of system action(s). The method presented in this paper uses a
statistical framework to implement the integration mechanism and includes
contextual information in addition to the actual user input. The underlying
assumption is that the more information sources are taken into account, the
better the picture that can be drawn about the actual intention of the user in the given
context of the interaction. The paper presents the latest results with a
Maximum Entropy classifier, with special emphasis on the use of contextual
information (type of gesture movements and type of objects selected). Instead
of explaining the design and implementation process in detail (a longer paper,
to be published later, will do that), only a short description is provided here
of the demonstration implementation, which achieves accuracy above 91% for the
1-best result and above 96% for the accumulated five N-best results. Keywords: context, data fusion, machine learning, maximum entropy, multimodal
database, multimodal integration, virtual modality | |||
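A maximum-entropy classifier of the kind described in the entry above is equivalent to multinomial logistic regression over indicator features. The sketch below only conveys the flavor of combining spoken input with contextual cues (gesture type, selected object type); the feature names, labels, and tiny training set are invented.

```python
# Flavor-only sketch of a maximum-entropy (multinomial logistic regression)
# integrator over speech plus contextual cues; all features/labels invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = [
    ({"speech=zoom in here": 1, "gesture=point": 1, "object=map": 1}, "zoom_to_location"),
    ({"speech=what is this": 1, "gesture=circle": 1, "object=icon": 1}, "identify_object"),
    ({"speech=move it there": 1, "gesture=drag": 1, "object=icon": 1}, "relocate_object"),
    ({"speech=zoom in here": 1, "gesture=circle": 1, "object=map": 1}, "zoom_to_location"),
]
X = [features for features, _ in train]
y = [label for _, label in train]

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([{"speech=zoom in here": 1, "gesture=point": 1, "object=map": 1}]))
```

Accumulating over the recognizer's N-best hypotheses, as the abstract mentions, could be approximated by summing class posteriors over the N candidate feature sets before taking the argmax.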
| Collaborative multimodal photo annotation over digital paper | | BIBAK | Full-Text | 131-132 | |
| Paulo Barthelmess; Edward Kaiser; Xiao Huang; David McGee; Philip Cohen | |||
| The availability of metadata annotations over media content such as photos
is known to enhance retrieval and organization, particularly for large data
sets. The greatest challenge for obtaining annotations remains getting users to
perform the large amount of tedious manual work that is required. In this demo
we show a system for semi-automated labeling based on extraction of metadata
from naturally occurring conversations of groups of people discussing pictures
among themselves. The system supports a variety of collaborative label
elicitation scenarios mixing co-located and distributed participants, operating
primarily via speech, handwriting and sketching over tangible digital paper
photo printouts. We demonstrate the real-time capabilities of the system by
providing hands-on annotation experience for conference participants. Demo
annotations are performed over public domain pictures portraying mainstream
themes (e.g. from famous movies). Keywords: collaborative interaction, demo, intelligent interfaces, multimodal
processing, photo annotation | |||
| CarDialer: multi-modal in-vehicle cellphone control application | | BIBAK | Full-Text | 133-134 | |
| Vladimír Bergl; Martin Čmejrek; Martin Fanta; Martin Labský; Ladislav Seredi; Jan Sedivý; Lubos Ures | |||
| This demo presents CarDialer -- an in-car cellphone control application. Its
multi-modal user interface blends state-of-the-art speech recognition
technology (including text-to-speech synthesis) with the existing well proven
elements of a vehicle information system GUI (buttons mounted on a steering
wheel and an LCD equipped with touch-screen). This conversational system
provides access to name dialing, unconstrained dictation of numbers, adding new
names, operations with lists of calls and messages, notification of presence,
etc. The application is fully functional from the first start; no prerequisite
steps (such as configuration or speech recognition enrollment) are required. The
presentation of the proposed multi-modal architecture goes beyond the specific
application and presents a modular platform to integrate application logic with
various incarnations of UI modalities. Keywords: automated speech recognition, multi-modal, name dialer, vehicle information
system | |||
| Gender and age estimation system robust to pose variations | | BIBAK | Full-Text | 135-136 | |
| Erina Takikawa; Koichi Kinoshita; Shihong Lao; Masato Kawade | |||
| For applications based on facial image processing, pose variation is a
difficult problem. In this paper, we propose a gender and age estimation system
that is robust against pose variations. The acceptable facial pose range is a
yaw (left-right) from -30 degrees to +30 degrees and a pitch (up-down) from -20
degrees to +20 degrees. According to our experiments on several large databases
collected under real environments, the gender estimation accuracy is 84.8% and
the age estimation accuracy is 80.9% (subjects are divided into 5 classes). The
average processing time is about 70 ms/frame for gender estimation and 95
ms/frame for age estimation (Pentium4 3.2 GHz). The system can be used to
automatically analyze shopping customers and pedestrians using surveillance
cameras. Keywords: age estimation, facial image, gender estimation | |||
| A fast and robust 3D head pose and gaze estimation system | | BIBAK | Full-Text | 137-138 | |
| Koichi Kinoshita; Yong Ma; Shihong Lao; Masato Kawade | |||
| We developed a fast and robust head pose and gaze estimation system. This
system can detect facial points and estimate 3D pose angles and gaze direction
under various conditions including facial expression changes and partial
occlusion. We need only one face image as input and do not need special devices
such as blinking LEDs or stereo cameras. Moreover, no calibration is needed.
The system shows a 95% head pose estimation accuracy and 81% gaze estimation
accuracy (when the error margin is 15 degrees). The processing time is about 15
ms/frame (Pentium4 3.2 GHz). Acceptable range of facial pose is within a yaw
(left-right) of ±60 degrees and within a pitch (up-down) of ±30
degrees. Keywords: facial image, gaze estimation, pose estimation | |||
| Audio-visual emotion recognition in adult attachment interview | | BIBAK | Full-Text | 139-145 | |
| Zhihong Zeng; Yuxiao Hu; Yun Fu; Thomas S. Huang; Glenn I. Roisman; Zhen Wen | |||
| Automatic multimodal recognition of spontaneous affective expressions is a
largely unexplored and challenging problem. In this paper, we explore
audio-visual emotion recognition in a realistic human conversation setting --
Adult Attachment Interview (AAI). Based on the assumption that facial
expression and vocal expression are at the same coarse affective states,
positive and negative emotion sequences are labeled according to Facial Action
Coding System Emotion Codes. Facial texture in visual channel and prosody in
audio channel are integrated in the framework of Adaboost multi-stream hidden
Markov model (AMHMM), in which the Adaboost learning scheme is used to build
component HMM fusion. Our approach is evaluated in the preliminary AAI
spontaneous emotion recognition experiments. Keywords: affect recognition, affective computing, emotion recognition, multimodal
human-computer interaction | |||
| Modeling naturalistic affective states via facial and vocal expressions recognition | | BIBAK | Full-Text | 146-154 | |
| George Caridakis; Lori Malatesta; Loic Kessous; Noam Amir; Amaryllis Raouzaiou; Kostas Karpouzis | |||
| Affective and human-centered computing are two areas related to HCI which
have attracted attention during the past years. One reason for this is the
plethora of devices able to record and process multimodal input from users and
adapt their functionality to users' preferences or individual habits, thus
enhancing usability and becoming attractive to users less accustomed to
conventional interfaces. In the quest
to receive feedback from the users in an unobtrusive manner, the visual and
auditory modalities allow us to infer the users' emotional state, combining
information both from facial expression recognition and speech prosody feature
extraction. In this paper, we describe a multi-cue, dynamic approach in
naturalistic video sequences. Contrary to strictly controlled recording
conditions of audiovisual material, the current research focuses on sequences
taken from nearly real world situations. Recognition is performed via a 'Simple
Recurrent Network' which lends itself well to modeling dynamic events in both
user's facial expressions and speech. Moreover this approach differs from
existing work in that it models user expressivity using a dimensional
representation of activation and valence, instead of detecting the usual
'universal emotions' which are scarce in everyday human-machine interaction.
The algorithm is deployed on an audiovisual database which was recorded
simulating human-human discourse and, therefore, contains less extreme
expressivity and subtle variations of a number of emotion labels. Keywords: affective interaction, facial expression recognition, image processing,
multimodal analysis, naturalistic data, prosodic feature extraction, user
modeling | |||
| A 'need to know' system for group classification | | BIBAK | Full-Text | 155-161 | |
| Wen Dong; Jonathan Gips; Alex (Sandy) Pentland | |||
| This paper outlines the design of a distributed sensor classification system
with abnormality detection intended for groups of people who are participating
in coordinated activities. The system comprises an implementation of a
distributed Dynamic Bayesian Network (DBN) model called the Influence Model
(IM) that relies heavily on an inter-process communication architecture called
Enchantment to establish the pathways of information that the model requires.
We use three examples to illustrate how the "need to know" system effectively
recognizes the group structure by simulating the work of cooperating
individuals. Keywords: complex dynamic systems, hidden Markov model, influence model, social
network, state-space model | |||
| Spontaneous vs. posed facial behavior: automatic analysis of brow actions | | BIBAK | Full-Text | 162-170 | |
| Michel F. Valstar; Maja Pantic; Zara Ambadar; Jeffrey F. Cohn | |||
| Past research on automatic facial expression analysis has focused mostly on
the recognition of prototypic expressions of discrete emotions rather than on
the analysis of dynamic changes over time, although the importance of temporal
dynamics of facial expressions for interpretation of the observed facial
behavior has been acknowledged for over 20 years. For instance, it has been
shown that the temporal dynamics of spontaneous and volitional smiles are
fundamentally different from each other. In this work, we argue that the same
holds for the temporal dynamics of brow actions and show that velocity,
duration, and order of occurrence of brow actions are highly relevant
parameters for distinguishing posed from spontaneous brow actions. The proposed
system for discrimination between volitional and spontaneous brow actions is
based on automatic detection of Action Units (AUs) and their temporal segments
(onset, apex, offset) produced by movements of the eyebrows. For each temporal
segment of an activated AU, we compute a number of mid-level feature parameters
including the maximal intensity, duration, and order of occurrence. We use
Gentle Boost to select the most important of these parameters. The selected
parameters are used further to train Relevance Vector Machines to determine per
temporal segment of an activated AU whether the action was displayed
spontaneously or volitionally. Finally, a probabilistic decision function
determines the class (spontaneous or posed) for the entire brow action. When
tested on 189 samples taken from three different sets of spontaneous and
volitional facial data, we attain a 90.7% correct recognition rate. Keywords: automatic facial expression analysis, temporal dynamics | |||
| Gaze-X: adaptive affective multimodal interface for single-user office scenarios | | BIBAK | Full-Text | 171-178 | |
| Ludo Maat; Maja Pantic | |||
| This paper describes an intelligent system that we developed to support
affective multimodal human-computer interaction (AMM-HCI) where the user's
actions and emotions are modeled and then used to adapt the HCI and support the
user in his or her activity. The proposed system, which we named Gaze-X, is
based on sensing and interpretation of the human part of the computer's
context, known as W5+ (who, where, what, when, why, how). It integrates a
number of natural human communicative modalities including speech, eye gaze
direction, face and facial expression, and a number of standard HCI modalities
like keystrokes, mouse movements, and active software identification, which, in
turn, are fed into processes that provide decision making and adapt the HCI to
support the user in his or her activity according to his or her preferences. To
attain a system that can be educated, that can improve its knowledge and
decision making through experience, we use case-based reasoning as the
inference engine of Gaze-X. The utilized case base is a dynamic, incrementally
self-organizing event-content-addressable memory that allows fact retrieval and
evaluation of encountered events based upon the user preferences and the
generalizations formed from prior input. To support concepts of concurrency,
modularity/scalability, persistency, and mobility, Gaze-X has been built as an
agent-based system where different agents are responsible for different parts
of the processing. A usability study conducted in an office scenario with a
number of users indicates that Gaze-X is perceived as effective, easy to use,
useful, and affectively qualitative. Keywords: affective computing, facial expressions, multimodal interfaces | |||
| Human computing, virtual humans and artificial imperfection | | BIBAK | Full-Text | 179-184 | |
| Z. M. Ruttkay; D. Reidsma; A. Nijholt | |||
| In this paper we raise the issue of whether imperfections, characteristic of
human-human communication, should be taken into account when developing virtual
humans. We argue that endowing virtual humans with the imperfections of humans
can help make them more 'comfortable' to interact with. That is, the natural
communication of a virtual human should not be restricted to multimodal
utterances that are always perfect, both in the sense of form and of content.
We illustrate our views with examples from two of our own applications: the
Virtual Dancer and the Virtual Trainer. In both applications
imperfectness helps in keeping the interaction engaging and entertaining. Keywords: embodied conversational agents, human computing, imperfections, virtual
humans | |||
| Using maximum entropy (ME) model to incorporate gesture cues for SU detection | | BIBAK | Full-Text | 185-192 | |
| Lei Chen; Mary Harper; Zhongqiang Huang | |||
| Accurate identification of sentence units (SUs) in spontaneous speech has
been found to improve the accuracy of speech recognition, as well as downstream
applications such as parsing. In recent multimodal investigations, gestural
features were utilized, in addition to lexical and prosodic cues from the
speech channel, for detecting SUs in conversational interactions using a hidden
Markov model (HMM) approach. Although this approach is computationally
efficient and provides a convenient way to modularize the knowledge sources, it
has two drawbacks for our SU task. First, standard HMM training methods
maximize the joint probability of observations and hidden events, as opposed to
the posterior probability of a hidden event given observations, a criterion
more closely related to SU classification error. A second challenge for
integrating gestural features is that their absence sanctions neither SU events
nor non-events; it is only the co-timing of gestures with the speech channel
that should impact our model. To address these problems, a Maximum Entropy (ME)
model is used to combine multimodal cues for SU estimation. Experiments carried
out on VACE multi-party meetings confirm that the ME modeling approach provides
a solid framework for multimodal integration. Keywords: gesture, language models, meetings, multimodal fusion, prosody, sentence
boundary detection | |||
| Salience modeling based on non-verbal modalities for spoken language understanding | | BIBAK | Full-Text | 193-200 | |
| Shaolin Qu; Joyce Y. Chai | |||
| Previous studies have shown that, in multimodal conversational systems,
fusing information from multiple modalities together can improve the overall
input interpretation through mutual disambiguation. Inspired by these findings,
this paper investigates non-verbal modalities, in particular deictic gesture,
in spoken language processing. Our assumption is that during multimodal
conversation, user's deictic gestures on the graphic display can signal the
underlying domain model that is salient at that particular point of
interaction. This salient domain model can be used to constrain hypotheses for
spoken language processing. Based on this assumption, this paper examines
different configurations of salience driven language models (e.g., n-gram and
probabilistic context free grammar) for spoken language processing across
different stages. Our empirical results have shown the potential of integrating
salience models based on non-verbal modalities in spoken language
understanding. Keywords: language modeling, multimodal interfaces, salience modeling, spoken language
understanding | |||
| EM detection of common origin of multi-modal cues | | BIBAK | Full-Text | 201-208 | |
| A. K. Noulas; B. J. A. Kröse | |||
| Content analysis of clips containing people speaking involves processing
informative cues coming from different modalities. These cues are usually the
words extracted from the audio modality, and the identity of the persons
appearing in the video modality of the clip. To achieve efficient assignment of
these cues to the person that created them, we propose a Bayesian network model
that utilizes the extracted feature characteristics, their relations and their
temporal patterns. We use the EM algorithm in which the E-step estimates the
expectation of the complete-data log-likelihood with respect to the hidden
variables -- that is the identity of the speakers and the visible persons. In
the M-step, the person models that maximize this expectation are computed. This
framework produces excellent results, exhibiting exceptional robustness when
dealing with low quality data. Keywords: audio-visual synchrony, content extraction, multi-modal, multi-modal cue
assignment, speaker detection | |||
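The E-step/M-step structure described in the entry above can be illustrated, in heavily simplified form, as EM on a small Gaussian mixture in which the hidden variable is which person produced each cue. The toy sketch below uses one-dimensional synthetic "cues" and stands in for the paper's richer Bayesian-network model over audio and visual observations.

```python
# Skeletal EM sketch: softly assign 1-D "cue" observations to K person models
# (Gaussian means); a toy stand-in for the paper's Bayesian-network model.
import numpy as np

def em_assign(cues, K=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(cues, size=K, replace=False)       # initial person models
    sigma = np.full(K, cues.std() + 1e-6)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibility of each person model for each cue.
        dens = np.exp(-0.5 * ((cues[:, None] - mu) / sigma) ** 2) / sigma
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate person models from the soft assignments.
        nk = resp.sum(axis=0)
        mu = (resp * cues[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (cues[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(cues)
    return resp.argmax(axis=1)        # hard assignment of cues to persons

rng = np.random.default_rng(1)
cues = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)])
print(em_assign(cues)[:10])
```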
| Prototyping novel collaborative multimodal systems: simulation, data collection and analysis tools for the next decade | | BIBAK | Full-Text | 209-216 | |
| Alexander M. Arthur; Rebecca Lunsford; Matt Wesson; Sharon Oviatt | |||
| To support research and development of next-generation multimodal interfaces
for complex collaborative tasks, a comprehensive new infrastructure has been
created for collecting and analyzing time-synchronized audio, video, and
pen-based data during multi-party meetings. This infrastructure needs to be
unobtrusive and to collect rich data involving multiple information sources of
high temporal fidelity to allow the collection and annotation of
simulation-driven studies of natural human-human-computer interactions.
Furthermore, it must be flexibly extensible to facilitate exploratory research.
This paper describes both the infrastructure put in place to record, encode,
playback and annotate the meeting-related media data, and also the simulation
environment used to prototype novel system concepts. Keywords: annotation tools, data collection infrastructure, meeting, multi-party,
multimodal interfaces, simulation studies, synchronized media | |||
| Combining audio and video to predict helpers' focus of attention in multiparty remote collaboration on physical tasks | | BIBAK | Full-Text | 217-224 | |
| Jiazhi Ou; Yanxin Shi; Jeffrey Wong; Susan R. Fussell; Jie Yang | |||
| The increasing interest in supporting multiparty remote collaboration has
created both opportunities and challenges for the research community. The
research reported here aims to develop tools to support multiparty remote
collaborations and to study human behaviors using these tools. In this paper we
first introduce an experimental multimedia (video and audio) system with which
an expert can collaborate with several novices. We then use this system to
study helpers' focus of attention (FOA) during a collaborative circuit assembly
task. We investigate the relationship between FOA and language as well as
activities using multimodal (audio and video) data, and use learning methods to
predict helpers' FOA. We process different modalities separately and fuse the
results to make a final decision. We employ a sliding window-based delayed
labeling method to automatically predict changes in FOA in real time using only
the dialogue among the helper and workers. We apply an adaptive background
subtraction method and support vector machine to recognize the worker's
activities from the video. To predict the helper's FOA, we make decisions using
the information of joint project boundaries and workers' recent activities. The
overall prediction accuracies are 79.52% using audio only and 81.79% using
audio and video combined. Keywords: computer-supported cooperative work, focus of attention, multimodal
integration, remote collaborative physical tasks | |||
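The "process modalities separately, then fuse" step in the entry above can be pictured as a weighted combination of per-modality posteriors over focus-of-attention targets. The sketch below is a toy late-fusion rule with made-up weights and posteriors, not the decision procedure evaluated in the paper.

```python
# Toy late-fusion sketch: combine per-modality class posteriors over the
# helper's focus-of-attention targets (weights and posteriors are made up).
import numpy as np

def fuse(p_audio, p_video, w_audio=0.6, w_video=0.4):
    """p_* are posteriors over FOA targets, e.g. [worker_A, worker_B]."""
    p = w_audio * np.asarray(p_audio) + w_video * np.asarray(p_video)
    return p / p.sum(), int(np.argmax(p))

posterior, target = fuse(p_audio=[0.55, 0.45], p_video=[0.2, 0.8])
print(posterior, "predicted target index:", target)
```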
| The role of psychological ownership and ownership markers in collaborative working environment | | BIBAK | Full-Text | 225-232 | |
| QianYing Wang; Alberto Battocchi; Ilenia Graziola; Fabio Pianesi; Daniel Tomasini; Massimo Zancanaro; Clifford Nass | |||
| In this paper, we present a study concerning psychological ownership for
digital entities in the context of collaborative working environments. In the
first part of the paper we present a conceptual framework of ownership: various
issues such as definition, effects, target factors and behavioral manifestation
are explicated. We then focus on ownership marking, a behavioral manifestation
that is closely tied to psychological ownership. We designed an experiment
using DiamondTouch Table to investigate the effect of two of the most widely
used ownership markers on users' attitudes and performance. Both performance
and attitudinal differences were found, suggesting the significant role of
ownership and ownership markers in the groupware and interactive workspaces
design. Keywords: collaborative multimodal environment, communicative marker, defensive
marker, digital ownership, marking behavior | |||
| Foundations of human computing: facial expression and emotion | | BIBAK | Full-Text | 233-238 | |
| Jeffrey F. Cohn | |||
| Many people believe that emotions and subjective feelings are one and the
same and that a goal of human-centered computing is emotion recognition. The
first belief is outdated; the second mistaken. For human-centered computing to
succeed, a different way of thinking is needed.
Emotions are species-typical patterns that evolved because of their value in addressing fundamental life tasks [19]. Emotions consist of multiple components that may include intentions, action tendencies, appraisals, other cognitions, central and peripheral changes in physiology, and subjective feelings. Emotions are not directly observable, but are inferred from expressive behavior, self-report, physiological indicators, and context. I focus on expressive behavior because of its coherence with other indicators and the depth of research on the facial expression of emotion in behavioral and computer science. Among the topics I address in this paper are approaches to measurement, timing or dynamics, individual differences, dyadic interaction, and inference. I propose that design and implementation of perceptual user interfaces may be better informed by considering the complexity of emotion, its various indicators, measurement, individual differences, dyadic interaction, and problems of inference. Keywords: automatic facial image analysis, emotion, facial expression, human-computer
interaction, temporal dynamics | |||
| Human computing and machine understanding of human behavior: a survey | | BIBAK | Full-Text | 239-248 | |
| Maja Pantic; Alex Pentland; Anton Nijholt; Thomas Huang | |||
| A widely accepted prediction is that computing will move to the background,
weaving itself into the fabric of our everyday living spaces and projecting the
human user into the foreground. If this prediction is to come true, then next
generation computing, which we will call human computing, should be about
anticipatory user interfaces that should be human-centered, built for humans
based on human models. They should transcend the traditional keyboard and mouse
to include natural, human-like interactive functions including understanding
and emulating certain human behaviors such as affective and social signaling.
This article discusses a number of components of human behavior, how they might
be integrated into computers, and how far we are from realizing the front end
of human computing, that is, how far are we from enabling computers to
understand human behavior. Keywords: affective computing, anticipatory user interfaces, multimodal user
interfaces, socially-aware computing | |||
| Computing human faces for human viewers: automated animation in photographs and paintings | | BIBA | Full-Text | 249-256 | |
| Volker Blanz | |||
| This paper describes a system for animating and modifying faces in images. It combines an algorithm for 3D face reconstruction from single images with a learning-based approach for 3D animation and face modification. Modifications include changes of facial attributes, such as body weight, masculine or feminine look, or overall head shape, as well as cut-and-paste exchange of faces. Unlike traditional photo retouching, this technique can be applied across changes in pose and lighting. Bridging the gap between photorealistic image processing and 3D graphics, the system provides tools for interacting with existing image material, such as photographs or paintings. The core of the approach is a statistical analysis of a dataset of 3D faces, and an analysis-by-synthesis loop that simulates the process of image formation for high-level image processing. | |||
| Detection and application of influence rankings in small group meetings | | BIBAK | Full-Text | 257-264 | |
| Rutger Rienks; Dong Zhang; Daniel Gatica-Perez; Wilfried Post | |||
| We address the problem of automatically detecting participants' influence
levels in meetings. The impact and social psychological background are
discussed. The more influential a participant is, the more he or she influences
the outcome of a meeting. Experiments on 40 meetings show that application of
statistical (both dynamic and static) models while using simply obtainable
features results in a best prediction performance of 70.59% when using a static
model, a balanced training set, and three discrete classes: high, normal and
low. Applications of the detected levels are shown in various ways, e.g., in a
virtual meeting environment as well as in a meeting browser system. Keywords: dominance detection, influence detection, machine learning, small group
research | |||
| Tracking the multi person wandering visual focus of attention | | BIBAK | Full-Text | 265-272 | |
| Kevin Smith; Sileye O. Ba; Daniel Gatica-Perez; Jean-Marc Odobez | |||
| Estimating the wandering visual focus of attention (WVFOA) for multiple
people is an important problem with many applications in human behavior
understanding. One such application, addressed in this paper, monitors the
attention of passers-by to outdoor advertisements. To solve the WVFOA problem, we propose a multi-person
tracking approach based on a hybrid Dynamic Bayesian Network that
simultaneously infers the number of people in the scene, their body and head
locations, and their head pose, in a joint state-space formulation that is
amenable for person interaction modeling. The model exploits both global
measurements and individual observations for the VFOA. For inference in the
resulting high-dimensional state-space, we propose a trans-dimensional Markov
Chain Monte Carlo (MCMC) sampling scheme, which not only handles a varying
number of people, but also efficiently searches the state-space by allowing
person-part state updates. Our model was rigorously evaluated for tracking and
its ability to recognize when people look at an outdoor advertisement using a
realistic data set. Keywords: MCMC, head pose tracking, multi-person tracking | |||
| Toward open-microphone engagement for multiparty interactions | | BIBAK | Full-Text | 273-280 | |
| Rebecca Lunsford; Sharon Oviatt; Alexander M. Arthur | |||
| There currently is considerable interest in developing new open-microphone
engagement techniques for speech and multimodal interfaces that perform
robustly in complex mobile and multiparty field environments. State-of-the-art
audio-visual open-microphone engagement systems aim to eliminate the need for
explicit user engagement by processing more implicit cues that a user is
addressing the system, which results in lower cognitive load for the user. This
is an especially important consideration for mobile and educational interfaces
due to the higher load required by explicit system engagement. In the present
research, longitudinal data were collected with six triads of high-school
students who engaged in peer tutoring on math problems with the aid of a
simulated computer assistant. Results revealed that amplitude was 3.25 dB higher
when users addressed a computer rather than a human peer and no lexical marker
of intended interlocutor was present, and 2.4 dB higher for all data. These
basic results were replicated for both matched and adjacent utterances to
computer versus human partners. With respect to dialogue style, speakers did
not direct a higher ratio of commands to the computer, although such dialogue
differences have been assumed in prior work. Results of this research reveal
that amplitude is a powerful cue marking a speaker's intended addressee, which
should be leveraged to design more effective microphone engagement during
computer-assisted multiparty interactions. Keywords: collaborative peer tutoring, computer-supported collaborative work, dialogue
style, intended addressee, multimodal interaction, open-microphone engagement,
spoken amplitude, user communication modeling | |||
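As a rough illustration of how the amplitude cue could be leveraged, the snippet below measures an utterance's RMS level in decibels and compares it with a speaker-specific baseline. The thresholding rule and the 2.4 dB default margin are assumptions for the sketch, not the authors' classifier; only the reported dB differences come from the abstract.

```python
# Sketch: compare utterance amplitude (RMS, in dB) against a speaker-specific
# baseline to guess whether the computer or a human peer is being addressed.
# The thresholding rule is an illustrative assumption.
import numpy as np

def rms_db(samples: np.ndarray) -> float:
    """Root-mean-square level of an utterance in decibels (relative)."""
    rms = np.sqrt(np.mean(np.square(samples.astype(float))) + 1e-12)
    return 20.0 * np.log10(rms)

def likely_addressee(utterance: np.ndarray, speaker_baseline_db: float,
                     margin_db: float = 2.4) -> str:
    """Label an utterance as computer- or human-directed by amplitude alone."""
    louder_by = rms_db(utterance) - speaker_baseline_db
    return "computer" if louder_by >= margin_db else "human peer"

# Toy usage: a louder synthetic utterance relative to a quieter baseline.
rng = np.random.default_rng(0)
baseline = rms_db(0.05 * rng.standard_normal(16000))
print(likely_addressee(0.10 * rng.standard_normal(16000), baseline))
```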
| Tracking head pose and focus of attention with multiple far-field cameras | | BIBAK | Full-Text | 281-286 | |
| Michael Voit; Rainer Stiefelhagen | |||
In this work we present our recent approach to estimating the head orientations
and foci of attention of multiple people in a smart room that is equipped
with several cameras to monitor the room. In our approach, we estimate each
person's head orientation with respect to the room coordinate system by using
all camera views. We implemented a neural network to estimate head pose on
each single camera view; a Bayes filter is then applied to integrate the
per-camera estimates into one final, joint hypothesis. Using this scheme, we can track
people's horizontal head orientations over a full 360° range at almost all
positions within the room. The tracked head orientations are then used to
determine who is looking at whom, i.e. people's focus of attention. We report
experimental results on one meeting video that was recorded in the smart room. Keywords: Bayesian filter, focus of attention, gaze, head orientation, head pose,
neural networks | |||
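A minimal sketch of the fusion step: each camera contributes a likelihood over discrete horizontal-orientation bins (standing in for the per-camera neural-network output), and a discrete Bayes filter combines them with a simple circular transition model. The bin size, the transition model, and the stand-in likelihoods are assumptions made only to show the mechanics.

```python
# Sketch: fuse per-camera head-pose estimates with a discrete Bayes filter
# over horizontal orientation bins covering the full 360 degrees.
import numpy as np

N_BINS = 36                      # 10-degree bins (assumption)
bins = np.arange(N_BINS)

def transition_matrix(spread: float = 2.0) -> np.ndarray:
    """Circular smoothing: the head is assumed to turn a little per frame."""
    d = np.minimum(np.abs(bins[:, None] - bins[None, :]),
                   N_BINS - np.abs(bins[:, None] - bins[None, :]))
    T = np.exp(-0.5 * (d / spread) ** 2)
    return T / T.sum(axis=1, keepdims=True)

def bayes_filter_step(belief, camera_likelihoods, T):
    """Predict with the transition model, then multiply in each camera's likelihood."""
    predicted = T.T @ belief
    for lik in camera_likelihoods:
        predicted = predicted * lik
    return predicted / predicted.sum()

# Toy usage: two cameras roughly agree that the head points near bin 9 (~90 deg).
belief = np.full(N_BINS, 1.0 / N_BINS)
cam1 = np.exp(-0.5 * ((bins - 9) / 2.0) ** 2)
cam2 = np.exp(-0.5 * ((bins - 10) / 2.0) ** 2)
belief = bayes_filter_step(belief, [cam1, cam2], transition_matrix())
print("MAP orientation (deg):", int(belief.argmax()) * 10)
```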
| Recognizing gaze aversion gestures in embodied conversational discourse | | BIBAK | Full-Text | 287-294 | |
| Louis-Philippe Morency; C. Mario Christoudias; Trevor Darrell | |||
| Eye gaze offers several key cues regarding conversational discourse during
face-to-face interaction between people. While a large body of research results
exists to document the use of gaze in human-to-human interaction, and in
animating realistic embodied avatars, recognition of conversational eye
gestures -- distinct eye movement patterns relevant to discourse -- has
received less attention. We analyze eye gestures during interaction with an
animated embodied agent and propose a non-intrusive vision-based approach to
estimate eye gaze and recognize eye gestures. In our user study, human
participants avert their gaze (i.e. with "look-away" or "thinking" gestures)
during periods of cognitive load. Using our approach, an agent can visually
differentiate whether a user is thinking about a response or is waiting for the
agent or robot to take its turn. Keywords: aversion gestures, embodied conversational agent, eye gaze tracking, eye
gestures, human-computer interaction, turn-taking | |||
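One simple way to operationalize "look-away"/"thinking" gestures, assuming a per-frame gaze-angle estimate relative to the agent, is to flag episodes where the gaze stays beyond a threshold for a minimum duration. The thresholds and frame rate below are illustrative assumptions, not the authors' recognizer.

```python
# Sketch: flag a gaze-aversion episode when the estimated gaze stays
# sufficiently far from the agent's direction for a minimum duration.
def aversion_episodes(gaze_angles_deg, fps=30, away_thresh_deg=15.0, min_dur_s=0.5):
    """Return (start_frame, end_frame) spans where gaze is averted from the agent."""
    episodes, start = [], None
    for i, angle in enumerate(gaze_angles_deg):
        if abs(angle) > away_thresh_deg:
            start = i if start is None else start
        else:
            if start is not None and (i - start) / fps >= min_dur_s:
                episodes.append((start, i))
            start = None
    if start is not None and (len(gaze_angles_deg) - start) / fps >= min_dur_s:
        episodes.append((start, len(gaze_angles_deg)))
    return episodes

# Toy usage: 1 s of gaze at the agent, 1 s averted, 1 s back at the agent.
print(aversion_episodes([0.0] * 30 + [25.0] * 30 + [0.0] * 30))  # -> [(30, 60)]
```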
| Explorations in sound for tilting-based interfaces | | BIBAK | Full-Text | 295-301 | |
| Matthias Rath; Michael Rohs | |||
Everyday experience as well as recent studies suggest that information
contained in ecological sonic feedback can improve human control of, and
interaction with, a system. This notion is particularly worth considering
in the context of mobile, tilting-based interfaces, which have been proposed,
developed and studied extensively. Two interfaces are used for this purpose: the
Ballancer, based on the metaphor of balancing a rolling ball on a track, and a
more concretely application-oriented setup of a mobile phone with tilting-based
input. First pilot studies have been conducted. Keywords: acoustic feedback, auditory feedback, auditory information, control,
tilting-based input | |||
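For readers unfamiliar with the Ballancer metaphor, the sketch below simulates a ball rolling on a tiltable track and maps its speed to parameters of an ecological rolling sound. The physics constants and the sound mapping are assumptions made only to illustrate the control loop, not the system described in the paper.

```python
# Sketch: tilt angle accelerates a virtual rolling ball; the ball's speed
# drives parameters of a rolling sound. Constants and mapping are illustrative.
import math

def step(position, velocity, tilt_rad, dt=0.01, g=9.81, friction=0.3):
    """One physics step of a ball rolling on a tiltable track."""
    accel = g * math.sin(tilt_rad) - friction * velocity
    velocity += accel * dt
    position += velocity * dt
    return position, velocity

def rolling_sound_params(velocity):
    """Map ball speed to gain and spectral brightness of a rolling sound."""
    speed = abs(velocity)
    return {"gain": min(1.0, speed / 2.0), "brightness_hz": 200.0 + 800.0 * speed}

pos, vel = 0.0, 0.0
for _ in range(100):                      # 1 s of simulated tilting to the right
    pos, vel = step(pos, vel, tilt_rad=math.radians(5))
print(round(pos, 3), rolling_sound_params(vel))
```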
| Haptic phonemes: basic building blocks of haptic communication | | BIBAK | Full-Text | 302-309 | |
| Mario Enriquez; Karon MacLean; Christian Chita | |||
| A haptic phoneme represents the smallest unit of a constructed haptic signal
to which a meaning can be assigned. These haptic phonemes can be combined
serially or in parallel to form haptic words, or haptic icons, which can hold
more elaborate meanings for their users. Here, we use phonemes which consist of
brief (<2 seconds) haptic stimuli composed of a simple waveform at a
constant frequency and amplitude. Building on previous results showing that a
set of 12 such haptic stimuli can be perceptually distinguished, here we test
learnability and recall of associations for arbitrarily chosen stimulus-meaning
pairs. We found that users could consistently recall an arbitrary association
between a haptic stimulus and its assigned arbitrary meaning in a 9-phoneme
set, during a 45 minute test period following a reinforced learning stage. Keywords: haptic icons, haptic interfaces, tactile language, touch | |||
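A haptic phoneme as described here can be rendered as a short constant-frequency, constant-amplitude drive signal, and phonemes can be concatenated serially into an icon. The sample rate, waveforms, and durations below are illustrative assumptions, not the stimulus set used in the study.

```python
# Sketch: a haptic phoneme as a brief (<2 s) constant-frequency,
# constant-amplitude waveform, and a haptic icon as a serial concatenation.
import numpy as np

SAMPLE_RATE = 8000  # actuator drive rate (assumption)

def phoneme(freq_hz, amplitude, duration_s=0.5, waveform="sine"):
    """Render one haptic phoneme as a 1-D drive signal."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    if waveform == "square":
        return amplitude * np.sign(np.sin(2 * np.pi * freq_hz * t))
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

def haptic_icon(*phonemes):
    """Combine phonemes serially into a haptic 'word' (icon)."""
    return np.concatenate(phonemes)

# Toy icon: a low buzz followed by a sharp high-frequency pulse.
icon = haptic_icon(phoneme(40, 0.6, 0.6, "sine"), phoneme(250, 1.0, 0.2, "square"))
print(icon.shape)
```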
| Toward haptic rendering for a virtual dissection | | BIBAK | Full-Text | 310-317 | |
| Nasim Melony Vafai; Shahram Payandeh; John Dill | |||
| In this paper we present a novel data structure combined with geometrically
efficient techniques to simulate a "tissue peeling" method for deformable
bodies. This is done to preserve the basic shape of a body in conjunction with
soft-tissue deformation of multiple deformable bodies in a geometry-based
model. We demonstrate our approach through haptic rendering of a virtual
anatomical model for a dissection simulator that consists of surface skin along
with multiple internal organs. The simulator uses multimodal cues in the form
of haptic feedback to provide guidance and performance feedback to the user.
The realism of the simulation is enhanced by computing interaction forces
with extrapolation techniques and sending these forces back to the user via a
haptic device. Keywords: collision detection, force feedback, haptics, soft tissue deformation,
virtual reality | |||
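The force-extrapolation idea can be sketched as follows: the (slow) deformation simulation pushes force samples, and the (fast) haptic loop queries a linearly extrapolated force between those samples. The update rates and the linear scheme are assumptions for this sketch; the paper's extrapolation details are not reproduced.

```python
# Sketch: bridging a slow deformation simulation and a fast (e.g. 1 kHz)
# haptic loop by linearly extrapolating the last two simulated forces.
import numpy as np

class ForceExtrapolator:
    """Linear extrapolation of interaction force between simulation updates."""

    def __init__(self):
        self.t_prev = self.t_last = 0.0
        self.f_prev = self.f_last = np.zeros(3)

    def push(self, t, force):
        """Called at the (slow) simulation rate with a new force sample."""
        self.t_prev, self.f_prev = self.t_last, self.f_last
        self.t_last, self.f_last = t, np.asarray(force, dtype=float)

    def force_at(self, t):
        """Called at the (fast) haptic rate; extrapolates past the last sample."""
        dt = self.t_last - self.t_prev
        if dt <= 0.0:
            return self.f_last
        slope = (self.f_last - self.f_prev) / dt
        return self.f_last + slope * (t - self.t_last)

ex = ForceExtrapolator()
ex.push(0.00, [0.0, 0.0, 0.0])
ex.push(0.02, [0.0, 0.0, 1.0])           # simulation running at 50 Hz
print(ex.force_at(0.021))                # haptic query shortly after the last sample
```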
| Embrace system for remote counseling | | BIBAK | Full-Text | 318-325 | |
| Osamu Morikawa; Sayuri Hashimoto; Tsunetsugu Munakata; Junzo Okunaka | |||
In counseling, non-verbal communication such as physical contact is
an effective role-playing skill. In remote counseling via videophones,
spacing and physical contact cannot be used, and communication must rely
only on expressions and words. This paper describes an embrace system for
remote counseling, which consists of HyperMirror and vibrators and can provide
effects similar to those of physical contact in face-to-face counseling. Keywords: HyperMirror, remote counseling, remote embrace, tactile and vibration | |||
| Enabling multimodal communications for enhancing the ability of learning for the visually impaired | | BIBAK | Full-Text | 326-332 | |
| Francis Quek; David McNeill; Francisco Oliveira | |||
| Students who are blind are typically one to three years behind their seeing
counterparts in mathematics and science. We posit that a key reason for this
resides in the inability of such students to access multimodal embodied
communicative behavior of mathematics instructors. This impedes the ability of
blind students and their teachers to maintain situated communication. In this
paper, we set forth the relevant phenomenological analyses to support this
claim. We show that mathematical communication and instruction are inherently
embodied; that the blind are able to conceptualize visuo-spatial information;
and argue that uptake of embodied behavior is critical to receiving relevant
mathematical information. Based on this analysis, we advance an approach to
provide students who are blind with awareness of their teachers' deictic
gestural activity via a set of haptic output devices. We lay out a set of
open research questions that researchers in multimodal interfaces may address. Keywords: awareness, catchment, embodied awareness, embodied deictic activity,
embodiment, gestures, growth point, mediating technology, multimodal,
multimodal interfaces, situated discourse, spatio-temporal cues | |||
| The benefits of multimodal information: a meta-analysis comparing visual and visual-tactile feedback | | BIBAK | Full-Text | 333-338 | |
| Matthew S. Prewett; Liuquin Yang; Frederick R. B. Stilson; Ashley A. Gray; Michael D. Coovert; Jennifer Burke; Elizabeth Redden; Linda R. Elliot | |||
| Information display systems have become increasingly complex and more
difficult for human cognition to process effectively. Based upon Wickens'
Multiple Resource Theory (MRT), information delivered using multiple modalities
(i.e., visual and tactile) could be more effective than communicating the same
information through a single modality. The purpose of this meta-analysis is to
compare user effectiveness when using visual-tactile task feedback (a
multimodality) to using only visual task feedback (a single modality). Results
indicate that using visual-tactile feedback enhances task effectiveness more so
than visual feedback (g = .38). When assessing different criteria,
visual-tactile feedback is particularly effective at reducing reaction time (g
= .631) and increasing performance (g = .618). Follow-up moderator analyses
indicate that visual-tactile feedback is more effective when workload is high
(g = .844) and multiple tasks are being performed (g = .767). Implications of
results are discussed in the paper. Keywords: meta-analysis, multimodal, visual feedback, visual-tactile feedback | |||
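For reference, the effect size g reported above is Hedges' bias-corrected standardized mean difference. The helper below computes it from group means, standard deviations, and sample sizes; the example numbers are invented for illustration, not data from the meta-analysis.

```python
# Sketch: Hedges' g, the bias-corrected standardized mean difference used to
# aggregate visual-tactile vs. visual-only comparisons.
import math

def hedges_g(mean_treat, mean_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Hedges' g: Cohen's d scaled by the small-sample correction factor J."""
    pooled_sd = math.sqrt(((n_treat - 1) * sd_treat ** 2 +
                           (n_ctrl - 1) * sd_ctrl ** 2) /
                          (n_treat + n_ctrl - 2))
    d = (mean_treat - mean_ctrl) / pooled_sd
    j = 1.0 - 3.0 / (4.0 * (n_treat + n_ctrl) - 9.0)   # small-sample correction
    return j * d

# Hypothetical study: the visual-tactile group scores higher on a performance measure.
print(round(hedges_g(82.0, 75.0, 10.0, 11.0, 20, 20), 3))
```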
| Word graph based speech recognition error correction by handwriting input | | BIBAK | Full-Text | 339-346 | |
| Peng Liu; Frank K. Soong | |||
| We propose a convenient handwriting user interface for correcting speech
recognition errors efficiently. Via the proposed hand-marked correction on the
displayed recognition result, substitution, deletion and insertion errors can
be corrected efficiently by rescoring the word graph generated in the
recognition pass. A new path in the graph that matches the user's feedback in
the maximum likelihood sense is found.
With the aid of the language model and the hand-corrected part of the best decoded path, rescoring the word graph can correct more errors than the user explicitly indicates. All recognition errors can be corrected after a finite number of corrections. Experimental results show that by indicating one word error in user feedback, 33.8% of the erroneous sentences can be corrected, while by indicating one character error, 12.9% of the erroneous sentences can be corrected. Keywords: handwriting recognition, interactive error correction, multimodal interface,
speech recognition, word graph | |||
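The rescoring idea can be illustrated on a toy word graph: find the best-scoring path that is constrained to pass through the word the user handwrote. The lattice, scores, and node layout below are invented for the sketch; real graphs come from the recognizer's first pass and carry acoustic and language-model scores.

```python
# Sketch: constrain the best lattice path to pass through the handwritten word.
from collections import defaultdict

# Edges: (from_node, to_node, word, log_score); node 0 is start, 3 is end.
EDGES = [
    (0, 1, "recognize", -1.0), (0, 1, "wreck a nice", -1.2),
    (1, 2, "speech", -0.5),    (1, 2, "beach", -0.4),
    (2, 3, "today", -0.3),
]

def best_scores(edges, start, end):
    """Forward/backward best log-scores to every node (simple DAG Viterbi);
    edges are assumed to be listed in topological order."""
    fwd = defaultdict(lambda: float("-inf")); fwd[start] = 0.0
    bwd = defaultdict(lambda: float("-inf")); bwd[end] = 0.0
    for a, b, _, s in edges:
        fwd[b] = max(fwd[b], fwd[a] + s)
    for a, b, _, s in reversed(edges):
        bwd[a] = max(bwd[a], bwd[b] + s)
    return fwd, bwd

def rescore_with_correction(edges, corrected_word, start=0, end=3):
    """Best path forced through an edge labelled with the handwritten word."""
    fwd, bwd = best_scores(edges, start, end)
    candidates = [(fwd[a] + s + bwd[b], (a, b, w)) for a, b, w, s in edges
                  if w == corrected_word]
    return max(candidates) if candidates else None

# The first pass preferred "beach"; the user handwrites "speech" over it.
print(rescore_with_correction(EDGES, "speech"))
```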
| Using redundant speech and handwriting for learning new vocabulary and understanding abbreviations | | BIBAK | Full-Text | 347-356 | |
| Edward C. Kaiser | |||
| New language constantly emerges from complex, collaborative human-human
interactions like meetings -- for instance, when a presenter
handwrites a new term on a whiteboard while saying it. Fixed vocabulary
recognizers fail on such new terms, which often are critical to dialogue
understanding. We present a proof-of-concept multimodal system that combines
information from handwriting and speech recognition to learn the spelling,
pronunciation and semantics of out-of-vocabulary terms from single instances of
redundant multimodal presentation (e.g. saying a term while handwriting it).
For the task of recognizing the spelling and semantics of abbreviated Gantt
chart labels across a held-out test series of five scheduling meetings we show
a significant relative error rate reduction of 37% when our learning methods
are used and allowed to persist across the meeting series, as opposed to when
they are not used. Keywords: handwriting, multimodal, speech | |||
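A very rough sketch of the enrollment step, under the assumption that redundancy can be approximated by string similarity between the handwritten spelling and a letter-ized speech hypothesis; the actual system aligns letter- and phone-level hypotheses far more carefully, so everything below is illustrative.

```python
# Sketch: enroll an out-of-vocabulary term when handwriting and speech are
# judged redundant. A crude string-similarity test stands in for the real
# letter/phone alignment (an assumption for this sketch).
from difflib import SequenceMatcher

lexicon = {}   # spelling -> (pronunciation, expansion/semantics)

def maybe_enroll(handwritten, speech_letter_hyp, speech_phones, expansion=None,
                 threshold=0.6):
    """Add a new term if the two modalities look like the same word."""
    similarity = SequenceMatcher(None, handwritten.lower(),
                                 speech_letter_hyp.lower()).ratio()
    if similarity >= threshold:
        lexicon[handwritten] = (speech_phones, expansion or handwritten)
        return True
    return False

# Presenter handwrites "Gantt" while saying it; the speech letterization is close.
print(maybe_enroll("Gantt", "gant", ["g", "ae", "n", "t"], expansion="Gantt chart"))
print(lexicon)
```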
| Multimodal fusion: a new hybrid strategy for dialogue systems | | BIBAK | Full-Text | 357-363 | |
| Pilar Manchón Portillo; Guillermo Pérez García; Gabriel Amores Carredano | |||
This paper presents a new hybrid fusion strategy based primarily on the
implementation of two earlier, differentiated approaches to multimodal fusion [11] in
multimodal dialogue systems. Both approaches, their predecessors and their
respective advantages and disadvantages are described in order to
illustrate how the new strategy merges them into a more solid and coherent
solution. The first strategy was largely based on Johnston's approach [5] and
involves the inclusion of multimodal grammar entries and temporal constraints.
The second approach involved the fusion of information coming from different
channels at the dialogue level. The new hybrid strategy described here requires
the inclusion of multimodal grammar entries and temporal constraints, plus the
additional dialogue-level information utilized in the second strategy.
In this new approach, therefore, the fusion process is initiated at the
grammar level and completed at the dialogue level. Keywords: NLP, dialogue systems, multimodal fusion | |||
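A schematic of the two-stage flow: grammar-level fusion fires when a speech event and a gesture event match a multimodal grammar entry within a temporal window; otherwise the pieces fall through to dialogue-level resolution. The command names, time window, and event layout are assumptions for this sketch, not the grammar described in the paper.

```python
# Sketch: hybrid fusion in two stages -- grammar level with a temporal
# constraint first, dialogue-level resolution as a fallback.
WINDOW_S = 2.0

GRAMMAR = {("move", "point"): "move_object_to_location"}  # multimodal entries

def grammar_level_fuse(speech_event, gesture_event):
    """Return a fused interpretation if the types match a grammar entry in time."""
    if abs(speech_event["t"] - gesture_event["t"]) > WINDOW_S:
        return None
    return GRAMMAR.get((speech_event["type"], gesture_event["type"]))

def dialogue_level_fuse(speech_event, gesture_event, dialogue_state):
    """Fallback: let dialogue context resolve what the grammar could not."""
    if speech_event["type"] == "confirm" and dialogue_state.get("pending"):
        return f"confirm:{dialogue_state['pending']}"
    return None

speech = {"type": "move", "t": 10.3, "text": "put that there"}
gesture = {"type": "point", "t": 10.9, "target": (120, 45)}
fused = grammar_level_fuse(speech, gesture) or \
        dialogue_level_fuse(speech, gesture, {"pending": None})
print(fused)   # -> move_object_to_location
```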
| Evaluating usability based on multimodal information: an empirical study | | BIBAK | Full-Text | 364-371 | |
| Tao Lin; Atsumi Imamiya | |||
| New technologies are making it possible to provide an enriched view of
interaction for researchers using multimodal information. This preliminary
study explores the use of multiple information streams in usability evaluation.
In the study, easy, medium and difficult versions of a game task were used to
vary the levels of mental effort. Multimodal data streams during the three
versions were analyzed, including eye tracking, pupil size, hand movement,
heart rate variability (HRV) and subjectively reported data. Four findings
indicate the potential value of usability evaluations based on multimodal
information: First, subjective and physiological measures showed significant
sensitivity to task difficulty. Second, different mental workload levels
appeared to correlate with eye movement patterns, especially with a combined
eye-hand movement measure. Third, HRV showed correlations with saccade speed.
Finally, we present a new method using the ratio of eye fixations over mouse
clicks to evaluate performance in more detail. These results warrant further
investigations and take an initial step toward establishing usability
evaluation methods based on multimodal information. Keywords: eye tracking, multimodal, physiological measures, usability | |||
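The proposed fixations-over-clicks ratio is straightforward to compute from logged event streams, as in this small sketch with made-up timestamps.

```python
# Sketch: the fixations-per-click ratio as a finer-grained performance measure.
def fixation_click_ratio(fixation_times, click_times):
    """More fixations per click suggests more visual search per action."""
    if not click_times:
        return float("inf")
    return len(fixation_times) / len(click_times)

easy_task = fixation_click_ratio(fixation_times=list(range(40)), click_times=list(range(20)))
hard_task = fixation_click_ratio(fixation_times=list(range(90)), click_times=list(range(15)))
print(easy_task, hard_task)   # 2.0 vs 6.0: the harder version needs more looking per click
```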
| A new approach to haptic augmentation of the GUI | | BIBAK | Full-Text | 372-379 | |
| Thomas N. Smyth; Arthur E. Kirkpatrick | |||
| Most users do not experience the same level of fluency in their interactions
with computers that they do with physical objects in their daily life. We
believe that much of this results from the limitations of unimodal interaction.
Previous efforts in the haptics literature to remedy those limitations have
been creative and numerous, but have failed to produce substantial improvements
in human performance. This paper presents a new approach, whereby haptic
interaction techniques are designed from scratch, in explicit consideration of
the strengths and weaknesses of the haptic and motor systems. We introduce a
haptic alternative to the tool palette, called Pokespace, which follows this
approach. Two studies (6 and 12 participants) conducted with Pokespace found no
performance improvement over a traditional interface, but showed that
participants learned to use the interface proficiently after about 10 minutes,
and could do so without visual attention. The studies also suggested several
improvements to our design. Keywords: 3D interaction, haptic feedback, haptic interface, multimodal interface,
rehearsal, tool palette, visual attention | |||
| HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads | | BIBAK | Full-Text | 380-387 | |
| Nadia Mana; Fabio Pianesi | |||
| One of the research goals in the human-computer interaction community is to
build believable Embodied Conversational Agents, that is, agents able to
communicate complex information with human-like expressiveness and naturalness.
Since emotions play a crucial role in human communication and most of them are
expressed through the face, having more believable ECAs implies giving them
the ability to display emotional facial expressions.
This paper presents a system based on Hidden Markov Models (HMMs) for the synthesis of emotional facial expressions during speech. The HMMs were trained on a set of emotion examples in which a professional actor uttered Italian nonsense words, acting various emotional facial expressions with different intensities. The evaluation of the experimental results, performed by comparing the "synthetic examples" (generated by the system) with a reference "natural example" (one of the actor's examples) in three different ways, shows that HMMs for emotional facial expression synthesis have some limitations but are suitable for making a synthetic Talking Head more expressive and realistic. Keywords: MPEG4 facial animation, emotional facial expression modeling, face
synthesis, hidden Markov models, talking heads | |||
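To illustrate the generation side, the sketch below samples a facial-animation-parameter trajectory from a small hand-specified Gaussian HMM standing in for one trained emotion model. The transition matrix, state means, and single parameter dimension are toy values, not the models trained in the paper.

```python
# Sketch: sampling a facial-animation-parameter trajectory from a small
# Gaussian HMM (one HMM per emotion); all values are illustrative.
import numpy as np

rng = np.random.default_rng(1)

# A 3-state left-to-right HMM for one emotion, emitting one FAP-like dimension.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])        # state transitions
means = np.array([0.0, 0.8, 0.3])      # neutral -> apex -> relaxed
stds = np.array([0.02, 0.05, 0.03])

def synthesize(n_frames=60, start_state=0):
    """Sample a state path and emit one Gaussian observation per frame."""
    traj, state = [], start_state
    for _ in range(n_frames):
        traj.append(rng.normal(means[state], stds[state]))
        state = rng.choice(3, p=A[state])
    return np.array(traj)

fap_track = synthesize()
print(fap_track[:5].round(3))          # would drive one MPEG-4 FAP over time
```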
| Embodiment and multimodality | | BIBAK | Full-Text | 388-390 | |
| Francis Quek | |||
| Students who are blind are typically one to three years behind their seeing
counterparts in mathematics and science. We posit that a key reason for this
resides in the inability of such students to access multimodal embodied
communicative behavior of mathematics instructors. This impedes the ability of
blind students and their teachers to maintain situated communication. In this
paper, we set forth the relevant phenomenological analyses to support this
claim. We show that mathematical communication and instruction are inherently
embodied; that the blind are able to conceptualize visuo-spatial information;
and argue that uptake of embodied behavior is critical to receiving relevant
mathematical information. Based on this analysis, we advance an approach to
provide students who are blind with awareness of their teachers' deictic
gestural activity via a set of haptic output devices. We lay out a set of
open research questions that researchers in multimodal interfaces may address. Keywords: awareness, embodiment, gestures, multimodal, theory | |||