| Natural interfaces in the field: the case of pen and paper | | BIBAK | Full-Text | 1-2 | |
| Phil Cohen | |||
| Over the past 7 years, Adapx (formerly, Natural Interaction Systems) has
been developing digital pen-based natural interfaces for field tasks. Examples
include products for field note-taking, mapping and
architecture/engineering/construction, which have been applied to such uses as
surveying, wildfire fighting, land use planning and dispute resolution, and
civil engineering. In this talk, I will describe the technology and some of
these field-based use cases, discussing why natural interfaces are the
preferred means for human-computer interaction for these applications. Keywords: digital pen and paper | |||
| Manipulating trigonometric expressions encoded through electro-tactile signals | | BIBAK | Full-Text | 3-8 | |
| Tatiana G. Evreinova | |||
| Visually challenged pupils and students need special developmental tools. To
facilitate their acquisition of math skills, different game-like techniques
have been implemented. Along with Braille, the electro-tactile patterns (eTPs)
can be used to deliver mathematical content to the visually challenged user.
The goal of this work was to continue the exploration of non-visual manipulation
of mathematics. The eTPs denoting four trigonometric functions and their seven
arguments (angles) were shaped with the designed electro-tactile unit. A matching
software application was used to facilitate the learning of the eTPs.
A permutation puzzle game was employed to improve the perceptual skills of
the players in manipulating the encoded trigonometric functions and their
arguments. The performance of 8 subjects was investigated and discussed. The
experimental findings confirmed the possibility of the use of the eTPs for
communicating different kinds of math content. Keywords: electro-tactile signals, trigonometry accessibility, visually challenged
people | |||
| Multimodal system evaluation using modality efficiency and synergy metrics | | BIBAK | Full-Text | 9-16 | |
| Manolis Perakakis; Alexandros Potamianos | |||
| In this paper, we propose two new objective metrics, relative modality
efficiency and multimodal synergy, that can provide valuable information and
identify usability problems during the evaluation of multimodal systems.
Relative modality efficiency (when compared with modality usage) can identify
suboptimal use of modalities due to poor interface design or information
asymmetries. Multimodal synergy measures the added value from efficiently
combining multiple input modalities, and can be used as a single measure of the
quality of modality fusion and fission in a multimodal system. The proposed
metrics are used to evaluate two multimodal systems that combine pen/speech and
mouse/keyboard modalities respectively. The results provide much insight into
multimodal interface usability issues, and demonstrate how multimodal systems
should adapt to maximize modality synergy, resulting in efficient, natural,
and intelligent multimodal interfaces. Keywords: input modality selection, mobile multimodal interfaces | |||
| Effectiveness and usability of an online help agent embodied as a talking head | | BIBAK | Full-Text | 17-20 | |
| Jérôme Simonin; Noëlle Carbonell; Danielle Pelé | |||
| An empirical study is presented which aims at assessing the possible effects
of embodiment on online help effectiveness and attraction. 22 undergraduate
students who were unfamiliar with animation creation software created two
simple animations with Flash, using two multimodal online help agents, EH and
UH, one per animation. Both help agents used the same database of speech and
graphics messages; EH was personified using a talking head while UH was not
embodied. EH and UH presentation order was counterbalanced between
participants.
Subjective judgments elicited through verbal and nonverbal questionnaires indicate that the presence of the ECA was well accepted by participants and that its influence on help effectiveness was perceived as positive. Analysis of eye tracking data indicates that the ECA actually attracted their visual attention and interest, since they glanced at it from the beginning to the end of the animation creation (75 fixations during 40 min.). By contrast, post-test marks and interaction traces suggest that the ECA's presence had no perceivable effect on concept or skill learning or on task execution; it only encouraged help consultation. Keywords: embodied conversational agents, empirical study, ergonomic evaluation, eye
tracking, online help, talking heads | |||
| Interaction techniques for the analysis of complex data on high-resolution displays | | BIBAK | Full-Text | 21-28 | |
| Chreston Miller; Ashley Robinson; Rongrong Wang; Pak Chung; Francis Quek | |||
| When combined with the organizational space provided by a simple table,
physical notecards are a powerful organizational tool for information analysis.
The physical presence of these cards affords many benefits but is also a source
of disadvantages. For example, complex relationships among them are hard to
represent. A number of notecard software systems have been developed to
address these problems. Unfortunately, the amount of visual detail in such
systems is lacking compared to real notecards on a large physical table; we
look to alleviate this problem by providing a digital solution. One challenge
with new display technology and systems is providing an efficient interface for
its users. In this paper we compare different interaction techniques for an
emerging class of organizational systems that use high-resolution tabletop
displays. The focus of these systems is to support interaction with information
more easily and efficiently. Using PDA, token, gesture, and voice
interaction techniques, we conducted a within-subjects experiment comparing
these techniques over a large high-resolution horizontal display. We found
strengths and weaknesses for each technique. In addition, we noticed that some
techniques build upon and complement others. Keywords: embodied interaction, gesture interaction, high-resolution displays,
horizontal display, human-computer interaction, multimodal interfaces, pda
interaction, tabletop interaction, tangible interaction, voice interaction | |||
| Role recognition in multiparty recordings using social affiliation networks and discrete distributions | | BIBAK | Full-Text | 29-36 | |
| Sarah Favre; Hugues Salamin; John Dines; Alessandro Vinciarelli | |||
| This paper presents an approach for the recognition of roles in multiparty
recordings. The approach includes two major stages: extraction of Social
Affiliation Networks (speaker diarization and representation of people in terms
of their social interactions), and role recognition (application of discrete
probability distributions to map people into roles). The experiments are
performed over several corpora, including broadcast data and meeting
recordings, for a total of roughly 90 hours of material. The results are
satisfactory for the broadcast data (around 80 percent of the data time
correctly labeled in terms of role), while they still must be improved in the
case of the meeting recordings (around 45 percent of the data time correctly
labeled). In both cases, the approach significantly outperforms chance. Keywords: broadcast data, meeting recordings, role recognition, social network
analysis, speaker diarization | |||
| Audiovisual laughter detection based on temporal features | | BIBAK | Full-Text | 37-44 | |
| Stavros Petridis; Maja Pantic | |||
| Previous research on automatic laughter detection has mainly been focused on
audio-based detection. In this study we present an audio-visual approach to
distinguishing laughter from speech based on temporal features and we show that
integrating the information from audio and video channels leads to improved
performance over single-modal approaches. Static features are extracted on an
audio/video frame basis and then combined with temporal features extracted over
a temporal window, describing the evolution of static features over time. The
use of several different temporal features has been investigated and it has
been shown that the addition of temporal information results in an improved
performance over utilizing static information only. It is common to use a fixed
set of temporal features which implies that all static features will exhibit
the same behaviour over a temporal window. However, this does not always hold
and we show that when AdaBoost is used as a feature selector, different
temporal features for each static feature are selected, i.e., the temporal
evolution of each static feature is described by different statistical
measures. When tested in a person-independent way on 96 audiovisual sequences
depicting spontaneously displayed (as opposed to posed) laughter and speech
episodes, the proposed audiovisual approach achieves an F1 rate of over
89%. Keywords: audiovisual data processing, laughter detection, non-linguistic information
processing | |||
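Background to the entry above: the reported F1 rate is, by the standard convention, the harmonic mean of precision and recall; the formula below gives this generic definition and is not taken from the paper itself.

```latex
% Standard F1 measure (harmonic mean of precision P and recall R);
% assumed here to be what the abstract above calls the "F1 rate".
F_1 = \frac{2\,P\,R}{P + R}
```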
| Predicting two facets of social verticality in meetings from five-minute time slices and nonverbal cues | | BIBAK | Full-Text | 45-52 | |
| Dinesh Babu Jayagopi; Sileye Ba; Jean-Marc Odobez; Daniel Gatica-Perez | |||
| This paper addresses the automatic estimation of two aspects of social
verticality (status and dominance) in small-group meetings using nonverbal
cues. The correlation of nonverbal behavior with these social constructs have
been extensively documented in social psychology, but their value for
computational models is, in many cases, still unknown. We present a systematic
study of automatically extracted cues -- including vocalic, visual activity,
and visual attention cues -- and investigate their relative effectiveness in
predicting both the most dominant person and the high-status project manager
from relatively short observations. We use five hours of task-oriented meeting data
with natural behavior for our experiments. Our work suggests that, although
dominance and role-based status are related concepts, they are not equivalent
and are thus not equally explained by the same nonverbal cues. Furthermore, the
best cues can correctly predict the person with highest dominance or role-based
status with an accuracy of approximately 70%. Keywords: audio-visual feature extraction, dominance, meetings, social verticality,
status | |||
| Multimodal recognition of personality traits in social interactions | | BIBAK | Full-Text | 53-60 | |
| Fabio Pianesi; Nadia Mana; Alessandro Cappelletti; Bruno Lepri; Massimo Zancanaro | |||
| This paper targets the automatic detection of personality traits in a
meeting environment by means of audio and visual features; information about
the relational context is captured by means of acoustic features designed for
that purpose. Two personality traits are considered: Extraversion (from the Big
Five) and the Locus of Control. The classification task is applied to thin
slices of behaviour, in the form of 1-minute sequences. SVMs were used to test
the performance of several training and testing instance setups, including a
restricted set of audio features obtained through feature selection. The
outcomes improve considerably over existing results, provide evidence about the
feasibility of the multimodal analysis of personality, the role of social
context, and pave the way to further studies addressing different feature
setups and/or targeting different personality traits. Keywords: group interaction, intelligent environments, personality modeling, support
vector machines | |||
| Social signals, their function, and automatic analysis: a survey | | BIBAK | Full-Text | 61-68 | |
| Alessandro Vinciarelli; Maja Pantic; Hervé Bourlard; Alex Pentland | |||
| Social Signal Processing (SSP) aims at the analysis of social behaviour in
both Human-Human and Human-Computer interactions. SSP revolves around automatic
sensing and interpretation of social signals, complex aggregates of nonverbal
behaviours through which individuals express their attitudes towards other
human (and virtual) participants in the current social context. As such, SSP
integrates both engineering (speech analysis, computer vision, etc.) and human
sciences (social psychology, anthropology, etc.) as it requires multimodal and
multidisciplinary approaches. As of today, SSP is still in its infancy,
but the domain is developing quickly, and a growing number of works are
appearing in the literature. This paper provides an introduction to the nonverbal
behaviour involved in social signals and a survey of the main results obtained
so far in SSP. It also outlines the possibilities and challenges that SSP is
expected to face in the coming years if it is to reach its full maturity. Keywords: computer vision, social behaviour analysis, social signal processing, speech
analysis | |||
| VoiceLabel: using speech to label mobile sensor data | | BIBAK | Full-Text | 69-76 | |
| Susumu Harada; Jonathan Lester; Kayur Patel; T. Scott Saponas; James Fogarty; James A. Landay; Jacob O. Wobbrock | |||
| Many mobile machine learning applications require collecting and labeling
data, and a traditional GUI on a mobile device may not be an appropriate or
viable method for this task. This paper presents an alternative approach to
mobile labeling of sensor data called VoiceLabel. VoiceLabel consists of two
components: (1) a speech-based data collection tool for mobile devices, and (2)
a desktop tool for offline segmentation of recorded data and recognition of
spoken labels. The desktop tool automatically analyzes the audio stream to find
and recognize spoken labels, and then presents a multimodal interface for
reviewing and correcting data labels using a combination of the audio stream,
the system's analysis of that audio, and the corresponding mobile sensor data.
A study with ten participants showed that VoiceLabel is a viable method for
labeling mobile sensor data. VoiceLabel also illustrates several key features
that inform the design of other data labeling tools. Keywords: data collection, machine learning, mobile devices, sensors, speech
recognition | |||
| The babbleTunes system: talk to your ipod! | | BIBAK | Full-Text | 77-80 | |
| Jan Schehl; Alexander Pfalzgraf; Norbert Pfleger; Jochen Steigner | |||
| This paper presents a full-fledged multimodal dialogue system for accessing
multimedia content in home environments from both portable media players and
online sources. We will mainly focus on two aspects of the system that provide
the basis for a natural interaction: (i) the automatic processing of named
entities which permits the incorporation of dynamic data into the dialogue
(e.g., song or album titles, artist names, etc.) and (ii) general multimodal
interaction patterns that are bound to ease the access to large sets of data. Keywords: multimodal dialogue systems | |||
| Evaluating talking heads for smart home systems | | BIBAK | Full-Text | 81-84 | |
| Christine Kühnel; Benjamin Weiss; Ina Wechsung; Sascha Fagel; Sebastian Möller | |||
| In this paper we report the results of a user study evaluating talking heads
in the smart home domain. Three noncommercial talking head components are
linked to two freely available speech synthesis systems, resulting in six
different combinations. The influence of head and voice components on overall
quality is analyzed as well as the correlation between them. Three different
ways to assess overall quality are presented. It is shown that these three are
consistent in their results. Another important result is that in this design
speech and visual quality are independent of each other. Furthermore, a linear
combination of both quality aspects models overall quality of talking heads to
a good degree. Keywords: multimodal ui, smart home environments, talking heads | |||
| Perception of dynamic audiotactile feedback to gesture input | | BIBAK | Full-Text | 85-92 | |
| Teemu Tuomas Ahmaniemi; Vuokko Lantz; Juha Marila | |||
| In this paper we present results of a study where perception of dynamic
audiotactile feedback to gesture input was examined. Our main motivation was to
investigate how users' active input and different modality conditions affect
the perception of the feedback. The experimental prototype in the study was a
handheld sensor-actuator device that responds dynamically to user's hand
movements creating an impression of a virtual texture. The feedback was
designed so that the amplitude and frequency of the texture were proportional to
the overall angular velocity of the device. We used four different textures
with different velocity responses. The feedback was presented to the user by
the tactile actuator in the device, by audio through headphones, or by both.
During the experiments, textures were switched in random intervals and the task
of the user was to detect the changes while moving the device freely. The
performances of the users with audio or audiotactile feedback were quite equal
while tactile feedback alone yielded poorer performance. The texture design
didn't influence the movement velocity or periodicity but tactile feedback
induced most and audio feedback the least energetic motion. In addition,
significantly better performance was achieved with slower motion. We also found
that significant learning happened over time; detection accuracy increased
significantly during and between the experiments. The masking noise used in
the tactile modality condition did not significantly influence the detection
accuracy when compared to acoustic blocking but it increased the average
detection time. Keywords: audio, gesture, haptics | |||
| An integrative recognition method for speech and gestures | | BIBAK | Full-Text | 93-96 | |
| Madoka Miki; Chiyomi Miyajima; Takanori Nishino; Norihide Kitaoka; Kazuya Takeda | |||
We propose an integrative recognition method for speech accompanied by
gestures such as pointing. Simultaneously generated speech and pointing
complementarily help the recognition of both, and thus the integration of these
multiple modalities may improve recognition performance. As an example of such
multimodal speech, we selected the explanation of a geometry problem. While the
problem was being solved, speech and fingertip movements were recorded with a
close-talking microphone and a 3D position sensor. To find the correspondence
between utterances and gestures, we propose a probability distribution of the time
gap between the starting times of an utterance and gestures. We also propose an
integrative recognition method using this distribution. We obtained
an approximately 3-point improvement in both speech and fingertip movement
recognition performance with this method. Keywords: gesture recognition, integrative recognition, multimodal interface, speech
recognition | |||
| As go the feet...: on the estimation of attentional focus from stance | | BIBAK | Full-Text | 97-104 | |
| Francis Quek; Roger Ehrich; Thurmon Lockhart | |||
| The estimation of the direction of visual attention is critical to a large
number of interactive systems. This paper investigates the cross-modal relation
of the position of one's feet (or standing stance) to the focus of gaze. The
intuition is that while one CAN have a range of attentional foci from a
particular stance, one may be MORE LIKELY to look in specific directions given
an approach vector and stance. We posit that the cross-modal relationship is
constrained by biomechanics and personal style. We define a stance vector that
models the approach direction before stopping and the pose of a subject's feet.
We present a study where the subjects' feet and approach vector are tracked.
The subjects read aloud the contents of note cards in 4 locations. The order of
visits to the cards was randomized. Ten subjects read 40 lines of text each,
yielding 400 stance vectors and gaze directions. We divided our data into 4
sets of 300 training and 100 test vectors and trained a neural net to estimate
the gaze direction given the stance vector. Our results show that 31% of our gaze
orientation estimates were within 5°, 51% of our estimates were within
10°, and 60% were within 15°. Given the ability to track foot position,
the procedure is minimally invasive. Keywords: attention estimation, foot-tracking, human-computer interaction, multimodal
interfaces, stance model | |||
| Knowledge and data flow architecture for reference processing in multimodal dialog systems | | BIBAK | Full-Text | 105-108 | |
| Ali Choumane; Jacques Siroux | |||
| This paper is concerned with the part of the system dedicated to the
processing of the user's designation activities for multimodal search of
information. We highlight the necessity of using specific knowledge for
multimodal input processing. We propose and describe knowledge modeling as well
as the associated processing architecture. Knowledge modeling is concerned with
the natural language and the visual context; it is adapted to the kind of
application and allows several types of filtering of the inputs. Part of this
knowledge is dynamically updated to take into account the interaction history.
In the proposed architecture, each input modality is processed first by using
the modeled knowledge, producing intermediate structures. Next, a fusion of
these structures, drawing on dynamic knowledge, determines the intended
referent. The steps of this last process take into account the
possible combinations of modalities as well as the clues carried by each
modality (linguistic clues, gesture type). The development of this part of our
system is largely complete and has been tested. Keywords: gesture, multimodal fusion, multimodal human-computer communication, natural
language, reference | |||
| The CAVA corpus: synchronised stereoscopic and binaural datasets with head movements | | BIBAK | Full-Text | 109-116 | |
| Elise Arnaud; Heidi Christensen; Yan-Chen Lu; Jon Barker; Vasil Khalidov; Miles Hansard; Bertrand Holveck; Hervé Mathieu; Ramya Narasimha; Elise Taillant; Florence Forbes; Radu Horaud | |||
| This paper describes the acquisition and content of a new multi-modal
database. Some tools for making use of the data streams are also presented. The
Computational Audio-Visual Analysis (CAVA) database is a unique collection of
three synchronised data streams obtained from a binaural microphone pair, a
stereoscopic camera pair and a head tracking device. All recordings are made
from the perspective of a person, i.e., what a human with natural head
movements would see and hear in a given environment. The database is intended to
facilitate research into humans' ability to optimise their multi-modal sensory
input and fills a gap by providing data that enables human centred audio-visual
scene analysis. It also enables 3D localisation using either audio, visual, or
audio-visual cues. A total of 50 sessions, with varying degrees of visual and
auditory complexity, were recorded. These range from seeing and hearing a
single speaker moving in and out of the field of view, to moving around a 'cocktail
party' style situation, mingling and joining different small groups of people
chatting. Keywords: binaural hearing, database, stereo vision | |||
| Towards a minimalist multimodal dialogue framework using recursive MVC pattern | | BIBAK | Full-Text | 117-120 | |
| Li Li; Wu Chou | |||
| This paper presents a formal framework for multimodal dialogue systems by
applying a set of complexity reduction patterns. The minimalist approach
described in this paper combines recursive application of Model-View-Controller
(MVC) design patterns with layering and interpretation. It leads to a modular,
concise, flexible and dynamic framework building upon a few core constructs.
This framework could expedite the development of complex multimodal dialogue
systems with sound software development practices and techniques. An XML-based
prototype multimodal dialogue system that embodies this framework is developed
and studied. Experimental results indicate that the proposed framework is
effective and well suited for multimodal interaction in complex business
transactions. Keywords: dialogue, multimodal, mvc, xml | |||
| Explorative studies on multimodal interaction in a PDA- and desktop-based scenario | | BIBAK | Full-Text | 121-128 | |
| Andreas Ratzka | |||
| This paper presents two explorative case studies on multimodal interaction.
The goal of this work is to find and underpin design recommendations that provide
well-proven decision support across all phases of the usability engineering
lifecycle [1]. During this work, user interface patterns for multimodal
interaction were identified [2, 3]. These patterns are closely related to other
user interface patterns [4, 5, 6]. Two empirical case studies, one using a
Wizard of Oz setting and another one using a stand-alone prototype linked to a
speech recognition engine [7], were conducted to assess the acceptance of the
resulting interaction styles. Although the prototypes also applied interface
patterns that increase usability through traditional interaction techniques,
and thus compete with multimodal interaction styles, most users preferred
multimodal interaction. Keywords: mobile computing, multimodality, user interface patterns | |||
| Designing context-aware multimodal virtual environments | | BIBAK | Full-Text | 129-136 | |
| Lode Vanacken; Joan De Boeck; Chris Raymaekers; Karin Coninx | |||
Despite decades of research, creating intuitive and easy-to-learn
interfaces for 3D virtual environments (VEs) is still not obvious, requiring VE
specialists to define, implement and evaluate solutions in an iterative way,
often using low-level programming code. Moreover, quite frequently the
interaction with the virtual environment may also vary depending on the context
in which it is applied, such as the available hardware setup, user experience,
or the pose of the user (e.g. sitting or standing). Lacking other tools, the
context-awareness of an application is usually also implemented in an ad-hoc
manner, using low-level programming. This may result in code that is difficult
and expensive to maintain. One possible approach to facilitate the process of
creating these highly interactive user interfaces is to adopt a model-based
user interface design. This lifts the creation of a user interface to a higher
level, allowing the designer to reason more in terms of high-level concepts,
rather than writing programming code. In this paper, we adopt a model-based
user interface design (MBUID) process for the creation of VEs, and explain how
a context system using an Event-Condition-Action paradigm is added. We
illustrate our approach by means of a case study. Keywords: context-awareness, model-based user interface design, multimodal interaction
techniques | |||
| A high-performance dual-wizard infrastructure for designing speech, pen, and multimodal interfaces | | BIBAK | Full-Text | 137-140 | |
| Phil Cohen; Colin Swindells; Sharon Oviatt; Alex Arthur | |||
| The present paper reports on the design and performance of a novel
dual-Wizard simulation infrastructure that has been used effectively to
prototype next-generation adaptive and implicit multimodal interfaces for
collaborative groupwork. This high-fidelity simulation infrastructure builds on
past development of single-wizard simulation tools for multiparty multimodal
interactions involving speech, pen, and visual input [1]. In the new
infrastructure, a dual-wizard simulation environment was developed that
supports (1) real-time tracking, analysis, and system adaptivity to a user's
speech and pen paralinguistic signal features (e.g., speech amplitude, pen
pressure), as well as the semantic content of their input. This simulation also
supports (2) transparent training of users to adapt their speech and pen signal
features in a manner that enhances the reliability of system functioning, i.e.,
the design of mutually-adaptive interfaces. To accomplish these objectives,
this new environment is also capable of handling (3) dynamic streaming digital
pen input. We illustrate the performance of the simulation infrastructure
during longitudinal empirical research in which a user-adaptive interface was
designed for implicit system engagement based exclusively on users' speech
amplitude and pen pressure [2]. While using this dual-wizard simulation method,
the wizards responded successfully to over 3,000 user inputs with 95-98%
accuracy and a joint wizard response time of less than 1.0 second during speech
interactions and 1.65 seconds during pen interactions. Furthermore, the
interactions they handled involved naturalistic multiparty meeting data in
which high school students were engaged in peer tutoring, and all participants
believed they were interacting with a fully functional system. This type of
simulation capability enables a new level of flexibility and sophistication in
multimodal interface design, including the development of implicit multimodal
interfaces that place minimal cognitive load on users during mobile,
educational, and other applications. Keywords: collaborative meetings, dual-wizard protocol, high-fidelity simulation,
implicit system engagement, multi-stream multimodal data, pen pressure, speech
amplitude, streaming digital pen and paper, wizard-of-oz | |||
| The WAMI toolkit for developing, deploying, and evaluating web-accessible multimodal interfaces | | BIBAK | Full-Text | 141-148 | |
| Alexander Gruenstein; Ian McGraw; Ibrahim Badr | |||
| Many compelling multimodal prototypes have been developed which pair spoken
input and output with a graphical user interface, yet it has often proved
difficult to make them available to a large audience. This unfortunate reality
limits the degree to which authentic user interactions with such systems can be
collected and subsequently analyzed. We present the WAMI toolkit, which
alleviates this difficulty by providing a framework for developing, deploying,
and evaluating Web-Accessible Multimodal Interfaces in which users interact
using speech, mouse, pen, and/or touch. The toolkit makes use of modern
web-programming techniques, enabling the development of browser-based
applications which rival the quality of traditional native interfaces, yet are
available on a wide array of Internet-connected devices. We will showcase
several sophisticated multimodal applications developed and deployed using the
toolkit, which are available via desktop, laptop, and tablet PCs, as well as
via several mobile devices. In addition, we will discuss resources provided by
the toolkit for collecting, transcribing, and annotating usage data from
multimodal user interactions. Keywords: dialogue system, multimodal interface, speech recognition, voice over ip,
world wide web | |||
| A three-dimensional characterization space of software components for rapidly developing multimodal interfaces | | BIBAK | Full-Text | 149-156 | |
| Marcos Serrano; David Juras; Laurence Nigay | |||
| In this paper we address the problem of the development of multimodal
interfaces. We describe a three-dimensional characterization space for software
components along with its implementation in a component-based platform for
rapidly developing multimodal interfaces. By graphically assembling components,
the designer/developer describes the transformation chain from physical devices
to tasks and vice-versa. In this context, the key point is to identify generic
components that can be reused for different multimodal applications.
Nevertheless, for flexibility purposes, a mixed approach that enables the
designer to use both generic components and tailored components is required. As
a consequence, our characterization space includes one axis dedicated to the
reusability aspect of a component. The two other axes of our characterization
space respectively depict the role of the component in the data-flow from
devices to tasks and the level of specification of the component. We illustrate
our three-dimensional characterization space, as well as the implemented tool
based on it, using a multimodal map navigator. Keywords: component-based approach, design and implementation tool, multimodal
interaction model | |||
| Crossmodal congruence: the look, feel and sound of touchscreen widgets | | BIBAK | Full-Text | 157-164 | |
| Eve Hoggan; Topi Kaaresoja; Pauli Laitinen; Stephen Brewster | |||
| Our research considers the following question: how can visual, audio and
tactile feedback be combined in a congruent manner for use with touchscreen
graphical widgets? For example, if a touchscreen display presents different
styles of visual buttons, what should each of those buttons feel and sound
like? This paper presents the results of an experiment conducted to investigate
methods of congruently combining visual and combined audio/tactile feedback by
manipulating the different parameters of each modality. The results indicate
trends with individual visual parameters such as shape, size and height being
combined congruently with audio/tactile parameters such as texture, duration
and different actuator technologies. We draw further on the experiment results,
using individual quality ratings to evaluate the perceived quality of our
touchscreen buttons, and then reveal a correlation between perceived quality and
crossmodal congruence. The results of this research will enable mobile
touchscreen UI designers to create realistic, congruent buttons by selecting
the most appropriate audio and tactile counterparts of visual button styles. Keywords: auditory/tactile/visual congruence, crossmodal interaction, mobile
touchscreen interaction, touchscreen widgets | |||
| MultiML: a general purpose representation language for multimodal human utterances | | BIBAK | Full-Text | 165-172 | |
| Manuel Giuliani; Alois Knoll | |||
| We present MultiML, a markup language for the annotation of multimodal human
utterances. MultiML is able to represent input from several modalities, as well
as the relationships between these modalities. Since MultiML separates general
parts of representation from more context-specific aspects, it can easily be
adapted for use in a wide range of contexts. This paper demonstrates how speech
and gestures are described with MultiML, showing the principles -- including
hierarchy and underspecification -- that ensure the quality and extensibility
of MultiML. As a proof of concept, we show how MultiML is used to annotate a
sample human-robot interaction in the domain of a multimodal joint-action
scenario. Keywords: human-robot interaction, multimodal, representation | |||
| Deducing the visual focus of attention from head pose estimation in dynamic multi-view meeting scenarios | | BIBAK | Full-Text | 173-180 | |
| Michael Voit; Rainer Stiefelhagen | |||
| This paper presents our work on recognizing the visual focus of attention
during dynamic meeting scenarios. We collected a new dataset of meetings, in
which acting participants were to follow a predefined script of events, to
enforce focus shifts of the remaining, unaware meeting members. Including the
whole room, a total of 35 potential focus targets were annotated,
some of which were moved or introduced spontaneously during the meeting. On
this dynamic dataset, we present a new approach that deduces the visual focus by
means of head orientation as a first clue and show that our system recognizes
the correct visual target in over 57% of all frames, compared to 47% when
mapping head pose to the first-best intersecting focus target directly. Keywords: data collection, dynamic meetings, eye gaze, head orientation, visual focus
of attention | |||
| Context-based recognition during human interactions: automatic feature selection and encoding dictionary | | BIBAK | Full-Text | 181-188 | |
| Louis-Philippe Morency; Iwan de Kok; Jonathan Gratch | |||
| During face-to-face conversation, people use visual feedback such as head
nods to communicate relevant information and to synchronize rhythm between
participants. In this paper we describe how contextual information from other
participants can be used to predict visual feedback and improve recognition of
head gestures in human-human interactions. For example, in a dyadic
interaction, the speaker's contextual cues, such as gaze shifts or changes in
prosody, will influence the listener's backchannel feedback (e.g., head nods). To
automatically learn how to integrate this contextual information into the
listener gesture recognition framework, this paper addresses two main
challenges: optimal feature representation using an encoding dictionary and
automatic selection of optimal feature-encoding pairs. Multimodal integration
between context and visual observations is performed using a discriminative
sequential model (Latent-Dynamic Conditional Random Fields) trained on previous
interactions. In our experiments involving 38 storytelling dyads, our
context-based recognizer significantly improved head gesture recognition
performance over a vision-only recognizer. Keywords: contextual information, head nod recognition, human-human interaction,
visual gesture recognition | |||
| AcceleSpell, a gestural interactive game to learn and practice finger spelling | | BIBAK | Full-Text | 189-190 | |
| José Luis Hernandez-Rebollar; Ethar Ibrahim Elsakay; José D. Alanís-Urquieta | |||
| In this paper, an interactive computer game for learning and practicing
continuous fingerspelling is described. The game is controlled by an
instrumented glove known as AcceleGlove and a recognition algorithm based on
decision trees. The Graphical User Interface is designed to allow beginners to
remember the correct hand shapes and start fingerspelling words sooner than
with traditional learning methods. Keywords: finger spelling, instrumented gloves, interactive games | |||
| A multi-modal spoken dialog system for interactive TV | | BIBAK | Full-Text | 191-192 | |
| Rajesh Balchandran; Mark E. Epstein; Gerasimos Potamianos; Ladislav Seredi | |||
| In this demonstration we present a novel prototype system that implements a
multi-modal interface for control of the television. This system combines the
standard TV remote control with a dialog management based natural language
speech interface to allow users to efficiently interact with the TV, and to
seamlessly alternate between the two modalities. One of the main objectives of
this system is to make the unwieldy Electronic Program Guide information more
navigable by the use of voice to filter and locate programs of interest. Keywords: natural language speech interface for tv | |||
| Multimodal slideshow: demonstration of the openinterface interaction development environment | | BIBAK | Full-Text | 193-194 | |
| David Juras; Laurence Nigay; Michael Ortega; Marcos Serrano | |||
| In this paper, we illustrate the OpenInterface Interaction Development
Environment (OIDE) that addresses the design and development of multimodal
interfaces. Multimodal interaction software development presents a particular
challenge because of the ever-increasing number of novel interaction devices
and modalities that can be used for a given interactive application. To
demonstrate our graphical OIDE and its underlying approach, we present a
multimodal slideshow implemented with our tool. Keywords: development environment, multimodal interfaces, prototyping | |||
| A browser-based multimodal interaction system | | BIBAK | Full-Text | 195-196 | |
| Kouichi Katsurada; Teruki Kirihata; Masashi Kudo; Junki Takada; Tsuneo Nitta | |||
| In this paper, we propose a system that enables users to have multimodal
interactions (MMI) with an anthropomorphic agent via a web browser. By using
the system, a user can interact simply by accessing a web site from his/her web
browser. A notable characteristic of the system is that the anthropomorphic
agent is synthesized from a photograph of a real human face. This makes it
possible to construct a web site whose owner's facial agent speaks with
visitors to the site. This paper describes the structure of the system and
provides a screen shot. Keywords: multimodal interaction system, web-based system | |||
| IGlasses: an automatic wearable speech supplement in face-to-face communication and classroom situations | | BIBAK | Full-Text | 197-198 | |
| Dominic W. Massaro; Miguel Á Carreira-Perpiñán; David J. Merrill; Cass Sterling; Stephanie Bigler; Elise Piazza; Marcus Perlman | |||
| The need for language aids is pervasive in today's world. There are millions
of individuals who have language and speech challenges, and these individuals
require additional support for communication and language learning. We
demonstrate technology to supplement common face-to-face language interaction
to enhance intelligibility, understanding, and communication, particularly for
those with hearing impairments. Our research is investigating how to
automatically supplement talking faces with information that is ordinarily
conveyed by auditory means. This research consists of two areas of inquiry: 1)
developing a neural network to perform real-time analysis of selected acoustic
features for visual display, and 2) determining how quickly participants can
learn to use these selected cues and how much they benefit from them when
combined with speechreading. Keywords: automatic speech supplement, multimodal speech perception | |||
| Innovative interfaces in MonAMI: the reminder | | BIBAK | Full-Text | 199-200 | |
| Jonas Beskow; Jens Edlund; Teodore Gjermani; Björn Granström; Joakim Gustafson; Oskar Jonsson; Gabriel Skanze; Helena Tobiasson | |||
| This demo paper presents an early version of the Reminder, a prototype ECA
developed in the European project MonAMI, which aims at "mainstreaming
accessibility in consumer goods and services, using advanced technologies to
ensure equal access, independent living and participation for all". The
Reminder helps users to plan activities and to remember what to do. The
prototype merges mobile ECA technology with other, existing technologies:
Google Calendar and a digital pen and paper. The solution allows users to
continue using a paper calendar in the manner they are used to, whilst the ECA
provides notifications on what has been written in the calendar. Users may ask
questions such as "When was I supposed to meet Sara?" or "What's my schedule
today?" Keywords: dialogue system, facial animation, pda, speech | |||
| PHANTOM prototype: exploring the potential for learning with multimodal features in dentistry | | BIBAK | Full-Text | 201-202 | |
| Jonathan Padilla San Diego; Alastair Barrow; Margaret Cox; William Harwin | |||
| In this paper, we will demonstrate how force feedback, motion-parallax, and
stereoscopic vision can enhance the opportunities for learning in the context
of dentistry. A dental training workstation prototype has been developed,
intended for use by dental students in their introductory course on preparing a
tooth cavity. The multimodal feedback from haptics, motion tracking cameras,
and computer-generated sound and graphics is being exploited to provide
'near-realistic' learning experiences. Whilst the empirical evidence provided
is preliminary, we describe the potential of multimodal interaction via these
technologies for enhancing dental-clinical skills. Keywords: haptics, multimodality, technology-enhanced learning, virtual reality | |||
| Audiovisual 3d rendering as a tool for multimodal interfaces | | BIBAK | Full-Text | 203-204 | |
| George Drettakis | |||
| In this talk, we will start with a short overview of 3D audiovisual
rendering and its applicability to multimodal interfaces. In recent years, we
have seen the generalization of 3D applications, ranging from computer games,
which involve a high level of realism, to applications such as SecondLife, in
which the visual and auditory quality of the 3D environment leaves much to be
desired. In our introduction we will attempt to examine the relationship between
the audiovisual rendering of the environment and the interface. We will then
review some of the audio-visual rendering algorithms we have developed in the
last few years. We will discuss four main challenges we have addressed. The
first is the development of realistic illumination and shadow algorithms which
contribute greatly to the realism of 3D scenes, but could also be important for
interfaces. The second involves the application of these illumination
algorithms to augmented reality settings. The third concerns the development of
perceptually-based techniques, and in particular using audio-visual cross-modal
perception. The fourth challenge has been the development of approximate but
"plausible", interactive solutions to more advanced rendering effects, both for
graphics and audio. On the audio side, our review will include the introduction
of clustering, masking and perceptual rendering for 3D spatialized audio and
our recently developed solution for the treatment of contact sounds. On the
graphics side, our discussion will include a quick overview of our illumination
and shadow work, its application to augmented reality, our work on interactive
rendering approximations and perceptually driven algorithms. For all these
techniques we will discuss their relevance to multimodal interfaces, including
our experience in an urban design case study, and attempt to relate them to
recent interface research. We will close with a broad reflection on the
potential for closer collaboration between 3D audiovisual rendering and
multimodal interfaces. Keywords: 3d audio, computer graphics | |||
| Multimodal presentation and browsing of music | | BIBAK | Full-Text | 205-208 | |
| David Damm; Christian Fremerey; Frank Kurth; Meinard Müller; Michael Clausen | |||
| Recent digitization efforts have led to large music collections, which
contain music documents of various modes comprising textual, visual and
acoustic data. In this paper, we present a multimodal music player for
presenting and browsing digitized music collections consisting of heterogeneous
document types. In particular, we concentrate on music documents of two widely
used types for representing a musical work, namely visual music representation
(scanned images of sheet music) and associated interpretations (audio
recordings). We introduce novel user interfaces for multimodal (audio-visual)
music presentation as well as intuitive navigation and browsing. Our system
offers high quality audio playback with time-synchronous display of the
digitized sheet music associated with a musical work. Furthermore, our system
enables a user to seamlessly crossfade between various interpretations
belonging to the currently selected musical work. Keywords: music alignment, music browsing, music information retrieval, music
navigation, music synchronization | |||
| An audio-haptic interface based on auditory depth cues | | BIBAK | Full-Text | 209-216 | |
| Delphine Devallez; Federico Fontana; Davide Rocchesso | |||
| Spatialization of sound sources in depth allows a hierarchical display of
multiple audio streams and therefore may be an efficient tool for developing
novel auditory interfaces. In this paper we present an audio-haptic interface
for audio browsing based on rendering distance cues for ordering sound sources
in depth. The haptic interface includes a linear position tactile sensor made
by conductive material. The touch position on the ribbon is mapped onto the
listening position on a rectangular virtual membrane, modeled by a
bidimensional Digital Waveguide Mesh and providing distance cues of four
equally spaced sound sources. Furthermore, a knob on a MIDI controller controls
the position of the mesh along the playlist, which allows the user to browse the
whole set of files. Subjects involved in a user study found the interface intuitive
and entertaining. In particular the interaction with the stripe was highly
appreciated. Keywords: audio-haptic interface, auditory navigation, digital waveguide mesh,
distance perception, spatialization | |||
| Detection and localization of 3d audio-visual objects using unsupervised clustering | | BIBAK | Full-Text | 217-224 | |
| Vasil Khalidov; Florence Forbes; Miles Hansard; Elise Arnaud; Radu Horaud | |||
| This paper addresses the issues of detecting and localizing objects in a
scene that are both seen and heard. We explain the benefits of a human-like
configuration of sensors (binaural and binocular) for gathering auditory and
visual observations. It is shown that the detection and localization problem
can be recast as the task of clustering the audio-visual observations into
coherent groups. We propose a probabilistic generative model that captures the
relations between audio and visual observations. This model maps the data into
a common audio-visual 3D representation via a pair of mixture models. Inference
is performed by a version of the expectation-maximization algorithm, which is
formally derived, and which provides cooperative estimates of both the auditory
activity and the 3D position of each object. We describe several experiments
with single- and multiple-speaker detection and localization, in the presence
of other audio sources. Keywords: audio-visual clustering, binaural hearing, mixture models, stereo vision | |||
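Background to the entry above: expectation-maximization for a latent-variable model alternates between computing the expected complete-data log-likelihood under the current posterior and maximizing it. The generic iteration is sketched below; the paper derives a specialized version for its audio-visual mixture model, which is not reproduced here.

```latex
% Generic EM iteration for a latent-variable model p(X, Z | theta); the paper
% above derives a specialized instance for its audio-visual mixture model.
\text{E-step: } Q\big(\theta \mid \theta^{(k)}\big)
  = \mathbb{E}_{Z \sim p(Z \mid X,\, \theta^{(k)})}\!\left[\log p(X, Z \mid \theta)\right]
\qquad
\text{M-step: } \theta^{(k+1)} = \arg\max_{\theta}\, Q\big(\theta \mid \theta^{(k)}\big)
```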
| Robust gesture processing for multimodal interaction | | BIBAK | Full-Text | 225-232 | |
| Srinivas Bangalore; Michael Johnston | |||
| With the explosive growth in mobile computing and communication over the
past few years, it is possible to access almost any information from virtually
anywhere. However, the efficiency and effectiveness of this interaction is
severely limited by the inherent characteristics of mobile devices, including
small screen size and the lack of a viable keyboard or mouse. This paper
concerns the use of multimodal language processing techniques to enable
interfaces combining speech and gesture input that overcome these limitations.
Specifically we focus on robust processing of pen gesture inputs in a local
search application and demonstrate that edit-based techniques that have proven
effective in spoken language processing can also be used to overcome unexpected
or errorful gesture input. We also examine the use of a bottom-up gesture
aggregation technique to improve the coverage of multimodal understanding. Keywords: finite-state methods, local search, mobile, multimodal interfaces,
robustness, speech-gesture integration | |||
| Investigating automatic dominance estimation in groups from visual attention and speaking activity | | BIBAK | Full-Text | 233-236 | |
| Hayley Hung; Dinesh Babu Jayagopi; Sileye Ba; Jean-Marc Odobez; Daniel Gatica-Perez | |||
We study the automation of the visual dominance ratio (VDR), a classic
measure of displayed dominance in the social psychology literature that combines
both gaze and speaking activity cues. The VDR is modified to estimate dominance
in multi-party group discussions where natural verbal exchanges are possible
and other visual targets such as a table and slide screen are present. Our
findings suggest that fully automated versions of these measures can effectively
estimate the most dominant person in a meeting and can match the dominance
estimation performance when manual labels of visual attention are used. Keywords: audio-visual feature extraction, dominance modeling, meetings, visual focus
of attention | |||
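Background to the entry above: in the social psychology literature the visual dominance ratio is classically defined as the ratio of look-while-speaking time to look-while-listening time. The sketch below states this standard form; the paper modifies the measure for multi-party meetings with additional visual targets, and those details are not reproduced here.

```latex
% Classic visual dominance ratio (standard social-psychology definition);
% the paper above modifies this measure for multi-party group discussions.
\mathrm{VDR} = \frac{\%\ \text{of speaking time spent looking at the interlocutor}}
                    {\%\ \text{of listening time spent looking at the interlocutor}}
```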
| Dynamic modality weighting for multi-stream hmms in audio-visual speech recognition | | BIBAK | Full-Text | 237-240 | |
| Mihai Gurban; Jean-Philippe Thiran; Thomas Drugman; Thierry Dutoit | |||
| Merging decisions from different modalities is a crucial problem in
Audio-Visual Speech Recognition. To solve this, state-synchronous multi-stream
HMMs have been proposed for their important advantage of incorporating stream
reliability in their fusion scheme. This paper focuses on stream weight
adaptation based on modality confidence estimators. We assume different and
time-varying environment noise, as can be encountered in realistic
applications, and, for this, adaptive methods are best suited. Stream
reliability is assessed directly through classifier outputs since they are not
specific to either noise type or level. The influence of constraining the
weights to sum to one is also discussed. Keywords: audio-visual speech recognition, multi-stream hmm, multimodal fusion, stream
reliability | |||
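Background to the entry above: a state-synchronous multi-stream HMM typically fuses the audio and visual streams by raising each stream's emission likelihood to a reliability exponent. A standard form is sketched below; the paper's specific confidence-based weight estimator is not reproduced here.

```latex
% Standard state-synchronous multi-stream HMM emission score with stream
% exponents; the paper above adapts the weights over time from modality
% confidence estimates (estimator not reproduced here).
b_j(\mathbf{o}_t) = \prod_{s \in \{A, V\}} \big[\, b_{js}(\mathbf{o}_{s,t}) \,\big]^{\lambda_{s,t}},
\qquad \lambda_{A,t} + \lambda_{V,t} = 1,\quad \lambda_{s,t} \ge 0
```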
| A Fitts Law comparison of eye tracking and manual input in the selection of visual targets | | BIBAK | Full-Text | 241-248 | |
| Roel Vertegaal | |||
| We present a Fitts' Law evaluation of a number of eye tracking and manual
input devices in the selection of large visual targets. We compared performance
of two eye tracking techniques, manual click and dwell time click, with that of
mouse and stylus. Results show eye tracking with manual click outperformed the
mouse by 16%, with dwell time click 46% faster. However, the eye tracking
conditions suffered high error rates of 11.7% for manual click and 43% for
dwell time click. After Welford correction, eye tracking still
appears to outperform manual input, with IPs of 13.8 bits/s for dwell time
click, and 10.9 bits/s for manual click. Eye tracking with manual click
provides the best tradeoff between speed and accuracy, and was preferred by 50%
of participants. Mouse and stylus had IPs of 4.7 and 4.2 bits/s, respectively.
However, their low error rate of 5% makes these techniques more suitable for
refined target selection. Keywords: Fitts Law, attentive user interfaces, eye tracking, focus selection, input
devices | |||
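Background to the entry above: Fitts' Law predicts movement time from target distance and width, and the index of performance (IP) quoted in bits/s is the index of difficulty divided by movement time. The Shannon formulation below is the standard one; the paper's exact variant (including the Welford correction it applies) is not reproduced here.

```latex
% Fitts' Law (Shannon formulation) and throughput / index of performance (IP).
% D = distance to the target, W = target width, a and b are empirical constants.
MT = a + b \,\log_2\!\left(\tfrac{D}{W} + 1\right),
\qquad
\mathrm{IP} = \frac{ID}{MT} = \frac{\log_2\!\left(\tfrac{D}{W} + 1\right)}{MT}\ \ \text{[bits/s]}
```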
| A Wizard of Oz study for an AR multimodal interface | | BIBAK | Full-Text | 249-256 | |
| Minkyung Lee; Mark Billinghurst | |||
| In this paper we describe a Wizard of Oz (WOz) user study of an Augmented
Reality (AR) interface that uses multimodal input (MMI) with natural hand
interaction and speech commands. Our goal is to use a WOz study to help guide
the creation of a multimodal AR interface which is most natural to the user. In
this study we used three virtual object arranging tasks with two different
display types (a head-mounted display and a desktop monitor) to see how users
used multimodal commands, and how different AR display conditions affect those
commands. The results provided valuable insights into how people naturally
interact in a multimodal AR scene assembly task. For example, we discovered the
optimal time frame for fusing speech and gesture commands into a single
command. We also found that display type did not produce a significant
difference in the type of commands used. Using these results, we present design
recommendations for multimodal interaction in AR environments. Keywords: AR, augmented reality, multimodal interaction, multimodal interface, user
study, wizard of oz | |||
| A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization | | BIBAK | Full-Text | 257-264 | |
| Kazuhiro Otsuka; Shoko Araki; Kentaro Ishizuka; Masakiyo Fujimoto; Martin Heinrich; Junji Yamato | |||
| This paper presents a realtime system for analyzing group meetings that uses
a novel omnidirectional camera-microphone system. The goal is to automatically
discover the visual focus of attention (VFOA), i.e. "who is looking at whom",
in addition to speaker diarization, i.e. "who is speaking and when". First, a
novel tabletop sensing device for round-table meetings is presented; it
consists of two cameras with two fisheye lenses and a triangular microphone
array. Second, from high-resolution omnidirectional images captured with the
cameras, the position and pose of people's faces are estimated by STCTracker
(Sparse Template Condensation Tracker); it realizes realtime robust tracking of
multiple faces by utilizing GPUs (Graphics Processing Units). The face
position/pose data output by the face tracker is used to estimate the focus of
attention in the group. Using the microphone array, robust speaker diarization
is carried out by a VAD (Voice Activity Detection) and a DOA (Direction of
Arrival) estimation followed by sound source clustering. This paper also
presents new 3-D visualization schemes for meeting scenes and the results of an
analysis. Using two PCs, one for vision and one for audio processing, the
system runs at about 20 frames per second for 5-person meetings. Keywords: face tracking, fisheye lens, focus of attention, meeting analysis,
microphone array, omnidirectional cameras, realtime system, speaker diarization | |||
| Designing and evaluating multimodal interaction for mobile contexts | | BIBAK | Full-Text | 265-272 | |
| Saija Lemmelä; Akos Vetek; Kaj Mäkelä; Dari Trendafilov | |||
In this paper we report on our experience with the design and evaluation of
multimodal user interfaces in various contexts. We introduce a novel
combination of existing design and evaluation methods in the form of a 5-step
iterative process and show the feasibility of this method and some of the
lessons learned through the design of a messaging application for two contexts
(in car, walking). The iterative design process we employed included the
following five basic steps: 1) identification of the limitations affecting the
usage of different modalities in various contexts (contextual observations and
context analysis), 2) identifying and selecting suitable interaction concepts
and creating a general design for the multimodal application (storyboarding,
use cases, interaction concepts, task breakdown, application UI and interaction
design), 3) creating modality-specific UI designs, 4) rapid prototyping and 5)
evaluating the prototype in naturalistic situations to find key issues to be
taken into account in the next iteration. We have not only found clear
indications that context affects users' preferences in the usage of modalities
and interaction strategies but also identified some of these. For instance,
while speech interaction was preferred in the car environment users did not
consider it useful when they were walking. 2D (finger strokes) and especially
3D (tilt) gestures were preferred by walking users. Keywords: evaluation, interaction design, mobile applications, multimodal interaction | |||
| Automated sip detection in naturally-evoked video | | BIBAK | Full-Text | 273-280 | |
| Rana el Kaliouby; Mina Mikhail | |||
| Quantifying consumer experiences is an emerging application area for event
detection in video. This paper presents a hierarchical model for robust sip
detection that combines bottom-up processing of face videos, namely real-time
head action unit analysis and head gesture recognition, with top-down knowledge
about sip events and task semantics. Our algorithm achieves an average accuracy
of 82% in videos that feature single sips, and an average accuracy of 78% and
false positive rate of 0.3%, in more challenging videos that feature multiple
sips and chewing actions. We discuss the generality of our methodology to
detecting other events in similar contexts. Keywords: affective computing, event detection, head gesture recognition, human
activity recognition, spontaneous video | |||
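The hierarchical idea of combining bottom-up gesture detections with top-down event knowledge can be illustrated with a deliberately simplified sketch: accept a low-level "head tilt back" detection as a sip only if its confidence is high enough and enough time has passed since the previous sip. The confidence threshold and the refractory-period rule are assumptions for illustration, not the authors' model.

```python
from typing import List, Tuple

def detect_sips(tilt_back_events: List[Tuple[float, float]],
                min_gap_s: float = 5.0,
                min_confidence: float = 0.6) -> List[float]:
    """tilt_back_events: (timestamp_s, confidence) pairs from a low-level head
    action unit / gesture recognizer. Returns the timestamps accepted as sips."""
    sips: List[float] = []
    for t, conf in sorted(tilt_back_events):
        if conf < min_confidence:
            continue                       # bottom-up evidence too weak
        if sips and t - sips[-1] < min_gap_s:
            continue                       # top-down: too soon after the last sip
        sips.append(t)
    return sips

print(detect_sips([(1.0, 0.9), (2.5, 0.8), (20.0, 0.7), (21.0, 0.95)]))
# -> [1.0, 20.0]
```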
| Perception of low-amplitude haptic stimuli when biking | | BIBAK | Full-Text | 281-284 | |
| Toni Pakkanen; Jani Lylykangas; Jukka Raisamo; Roope Raisamo; Katri Salminen; Jussi Rantala; Veikko Surakka | |||
| Haptic stimulation during motion has received little attention in earlier
research. To provide guidance for designing haptic interfaces for mobile use,
we carried out an initial experiment using C-2 actuators. 16 participants took
part in the experiment to find out whether there is a difference in perceiving
low-amplitude vibrotactile stimuli under minimal and moderate physical
exertion. A stationary bike was used to control the exertion. Four body
locations (wrist, leg, chest and back), two stimulus durations (1000 ms and
2000 ms) and two motion conditions on the stationary bicycle (still and
moderate pedaling) were applied. Cycling was found to have a significant effect
on both perception accuracy and reaction times for the selected stimuli. The
stimulus amplitudes used in this experiment can inform haptic design for mobile
users. Keywords: biking, mobile user, perception, tactile feedback | |||
| TactiMote: a tactile remote control for navigating in long lists | | BIBAK | Full-Text | 285-288 | |
| Muhammad Tahir; Gilles Bailly; Eric Lecolinet; Gérard Mouret | |||
| This paper presents TactiMote, a remote control with tactile feedback
designed for navigating in long lists and catalogues. TactiMote integrates a
joystick that allows 2D interaction with the thumb and a Braille cell that
provides tactile feedback. This feedback is intended to help the selection task
in novice mode and to allow for fast eyes-free navigation among favorite items
in expert mode. The paper describes the design of the TactiMote prototype for
TV channel selection and reports a preliminary experiment that shows the
feasibility of the approach. Keywords: joystick, list, navigation, selection, tactile feedback, target acquisition,
visual feedback | |||
| The DIRAC AWEAR audio-visual platform for detection of unexpected and incongruent events | | BIBAK | Full-Text | 289-292 | |
| Jörn Anemüller; Jörg-Hendrik Bach; Barbara Caputo; Michal Havlena; Luo Jie; Hendrik Kayser; Bastian Leibe; Petr Motlicek; Tomas Pajdla; Misha Pavel; Akihiko Torii; Luc Van Gool; Alon Zweig; Hynek Hermansky | |||
| It is of prime importance in everyday human life to cope with and respond
appropriately to events that are not foreseen by prior experience. Machines to
a large extent lack the ability to respond appropriately to such inputs. An
important class of unexpected events is defined by incongruent combinations of
inputs from different modalities and therefore multimodal information provides
a crucial cue for the identification of such events, e.g., the sound of a voice
is being heard while the person in the field-of-view does not move her lips. In
the project DIRAC ("Detection and Identification of Rare Audio-visual Cues") we
have been developing algorithmic approaches to the detection of such events, as
well as an experimental hardware platform on which to test them. An audio-visual
platform ("AWEAR" -- audio-visual wearable device) has been constructed with the
goal of helping users with disabilities or a high cognitive load to deal with unexpected
events. Key hardware components include stereo panoramic vision sensors and
6-channel worn-behind-the-ear (hearing aid) microphone arrays. Data have been
recorded to study audio-visual tracking, a/v scene/object classification and
a/v detection of incongruencies. Keywords: audio-visual, augmented cognition, event detection, multimodal interaction,
sensor platform | |||
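The voice-without-lip-motion example in the abstract suggests a frame-wise check of audio activity against visual activity. The sketch below flags frames where a voice activity detector fires but no tracked face shows lip motion; the threshold and the frame-level formulation are assumptions for illustration, not the DIRAC algorithms.

```python
from typing import List

def incongruent_frames(audio_vad: List[bool], lip_motion: List[float],
                       motion_threshold: float = 0.1) -> List[int]:
    """Return indices of frames where speech is detected (audio_vad[i] is True)
    but the strongest visible lip motion stays below the threshold."""
    flagged = []
    for i, (speech, motion) in enumerate(zip(audio_vad, lip_motion)):
        if speech and motion < motion_threshold:
            flagged.append(i)   # voice heard while no one appears to be speaking
    return flagged

print(incongruent_frames([True, True, False, True], [0.5, 0.02, 0.0, 0.3]))
# -> [1]
```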
| Smoothing human-robot speech interactions by using a blinking-light as subtle expression | | BIBAK | Full-Text | 293-296 | |
| Kotaro Funakoshi; Kazuki Kobayashi; Mikio Nakano; Seiji Yamada; Yasuhiko Kitamura; Hiroshi Tsujino | |||
| Speech overlaps, undesired collisions of utterances between systems and
users, harm smooth communication and degrade the usability of systems. We
propose a method to enable smooth speech interactions between a user and a
robot, which enables subtle expressions by the robot in the form of a blinking
LED attached to its chest. In concrete terms, we show that, by blinking an LED
from the end of the user's speech until the start of the robot's speech, the number of
undesirable repetitions, which are responsible for speech overlaps, decreases,
while that of desirable repetitions increases. In experiments, participants
played a last-and-first game with the robot. The experimental results suggest
that the blinking-light can prevent speech overlaps between a user and a robot,
speed up dialogues, and improve users' impressions. Keywords: human-robot interaction, speech overlap, subtle expression, turn-taking | |||
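The timing rule described in the abstract is simple to state in code: start the blinking LED as soon as the user's utterance ends and stop it just before the robot begins to speak. The Robot class and its methods below are invented for illustration only; they are not the authors' implementation.

```python
import time

class Robot:
    """Hypothetical robot interface used only to illustrate the timing rule."""
    def set_led_blinking(self, on: bool) -> None:
        print("LED blinking" if on else "LED off")

    def speak(self, text: str) -> None:
        print(f"Robot says: {text}")

def respond(robot: Robot, generate_reply) -> None:
    """Call this at the moment the user's speech has ended."""
    robot.set_led_blinking(True)       # subtle expression: 'a response is coming'
    reply = generate_reply()           # recognition + dialogue processing (may be slow)
    robot.set_led_blinking(False)      # stop blinking right before speaking
    robot.speak(reply)

# Simulate a reply that takes half a second to produce (last-and-first game word).
respond(Robot(), lambda: (time.sleep(0.5) or "banana"))
```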
| Feel-good touch: finding the most pleasant tactile feedback for a mobile touch screen button | | BIBAK | Full-Text | 297-304 | |
| Emilia Koskinen; Topi Kaaresoja; Pauli Laitinen | |||
| Earlier research has shown the benefits of tactile feedback for touch screen
widgets in all metrics: performance, usability and user experience. In our
current research the goal was to go deeper in understanding the characteristics
of a tactile click for virtual buttons. More specifically we wanted to find a
tactile click which is the most pleasant to use with a finger. We used two
actuator solutions in a small mobile touch screen: piezo actuators and a
standard vibration motor. We conducted three experiments: the first and second
experiments aimed to find the most pleasant tactile feedback done with the
piezo actuators or a vibration motor, respectively, and the third one combined
and compared the results from the first two experiments. The results from the
first two experiments showed significant differences for the perceived
pleasantness of the tactile clicks, and we used these most pleasant clicks in
the comparison experiment in addition to the condition with no tactile
feedback. Our findings confirmed results from earlier studies showing that
tactile feedback is superior to a nontactile condition when virtual buttons are
used with the finger regardless of the technology behind the tactile feedback.
Another finding suggests that the users perceived the feedback produced with
piezo actuators as slightly more pleasant than the vibration-motor-based feedback,
although the difference was not statistically significant. These results indicate that it is
possible to tune the characteristics of virtual button tactile clicks towards
the most pleasant ones, and that this knowledge can help designers create
better touch screen virtual buttons and keyboards. Keywords: mobile touch screen interaction, tactile feedback pleasantness, virtual
buttons | |||
| Embodied conversational agents for voice-biometric interfaces | | BIBAK | Full-Text | 305-312 | |
| Álvaro Hernández-Trapote; Beatriz López-Mencía; David Díaz; Rubén Fernández-Pozo; Javier Caminero | |||
| In this article we present a research scheme which aims to analyze the use
of Embodied Conversational Agent (ECA) technology to improve the robustness and
acceptability of speaker enrolment and verification dialogues designed to
provide secure access through natural and intuitive speaker recognition. In
order to find out the possible effects of the visual information channel
provided by the ECA, tests were carried out in which users were divided into
two groups, each interacting with a different interface (metaphor): an ECA
Metaphor group (with an ECA) and a VOICE Metaphor group (without an ECA).
Our evaluation methodology is based on the ITU-T P.851 recommendation for
spoken dialogue system evaluation, which we have complemented to cover
particular aspects with regard to the two major extra elements we have
incorporated: secure access and an ECA. Our results suggest that
likeability-type factors and system capabilities are perceived more positively
by the ECA metaphor users than by the VOICE metaphor users. However, the ECA's
presence seems to intensify users' privacy concerns. Keywords: biometrics interfaces, embodied conversational agent, multimodal evaluation,
voice authentication | |||