| Living better with robots | | BIBAK | Full-Text | 1-2 | |
| Cynthia Breazeal | |||
| The emerging field of Human-Robot Interaction is undergoing rapid growth,
motivated by important societal challenges and new applications for personal
robotic technologies for the general public. In this talk, I highlight several
projects from my research group to illustrate recent research trends to develop
socially interactive robots that work and learn with people as partners. An
important goal of this work is to use interactive robots as a scientific tool
to understand human behavior, to explore the role of physical embodiment in
interactive technology, and to use these insights to design robotic
technologies that can enhance human performance and quality of life. Throughout
the talk I will highlight synergies with HCI and connect HRI research goals to
specific applications in healthcare, education, and communication. Keywords: embodiment, human behaviors, human-robot interaction, quality of life,
socially interactive robots | |||
| Discovering group nonverbal conversational patterns with topics | | BIBAK | Full-Text | 3-6 | |
| Dinesh Babu Jayagopi; Daniel Gatica-Perez | |||
| This paper addresses the problem of discovering conversational group
dynamics from nonverbal cues extracted from thin-slices of interaction. We
first propose and analyze a novel thin-slice interaction descriptor -- a bag of
group nonverbal patterns -- which robustly captures the turn-taking behavior of
the members of a group while integrating its leader's position. We then rely on
probabilistic topic modeling of the interaction descriptors which, in a fully
unsupervised way, is able to discover group interaction patterns that resemble
prototypical leadership styles proposed in social psychology. Our method,
validated on the Augmented Multi-Party Interaction (AMI) meeting corpus,
facilitates the retrieval of group conversational segments where semantically
meaningful group behaviours emerge, without the need for any previous labeling. Keywords: characterizing groups, meetings, nonverbal cues | |||
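As an illustration of the unsupervised topic-modeling step named in the abstract above, here is a minimal sketch using scikit-learn's LDA over a bag-of-nonverbal-patterns matrix. The data, pattern vocabulary, and number of topics are all hypothetical, not the authors' setup.

```python
# Illustrative sketch (not the authors' implementation): topic modeling over
# "bag of group nonverbal patterns" descriptors with scikit-learn's LDA.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical input: one row per thin slice of a meeting, one column per
# discretized turn-taking pattern, with counts of how often each pattern
# occurred in that slice.
rng = np.random.default_rng(0)
slice_pattern_counts = rng.poisson(lam=1.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=4, random_state=0)
slice_topics = lda.fit_transform(slice_pattern_counts)  # slices x topics

# Each topic is a distribution over nonverbal patterns; the dominant topic per
# slice can then be inspected against prototypical leadership styles.
print(slice_topics.argmax(axis=1)[:10])
```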
| Agreement detection in multiparty conversation | | BIBAK | Full-Text | 7-14 | |
| Sebastian Germesin; Theresa Wilson | |||
| This paper presents a system for the automatic detection of agreements in
multi-party conversations. We investigate various types of features that are
useful for identifying agreements, including lexical, prosodic, and structural
features. This system is implemented using supervised machine learning
techniques and yields competitive results: an accuracy of 98.1% and a kappa value
of 0.4. We also begin to explore the novel task of detecting the addressee of
agreements (which speaker is being agreed with). Our system for this task
achieves an accuracy of 80.3%, a 56% improvement over the baseline. Keywords: agreement detection, multi-party conversation | |||
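The abstract above describes a supervised classifier over lexical, prosodic, and structural features, evaluated with accuracy and kappa. The following is a minimal sketch of that evaluation pattern with hypothetical feature vectors and a logistic-regression stand-in, not the authors' feature set or learner.

```python
# Illustrative sketch (not the authors' system): supervised agreement detection
# over pre-extracted features, reporting accuracy and Cohen's kappa.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # hypothetical per-utterance features
y = rng.integers(0, 2, size=1000)      # 1 = agreement, 0 = other

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("kappa:", cohen_kappa_score(y_te, pred))
```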
| Multimodal floor control shift detection | | BIBAK | Full-Text | 15-22 | |
| Lei Chen; Mary P. Harper | |||
| Floor control is a scheme used by people to organize speaking turns in
multi-party conversations. Identifying the floor control shifts is important
for understanding a conversation's structure and would be helpful for more
natural human-computer interaction systems. Although people tend to use verbal
and nonverbal cues for managing floor control shifts, only audio cues, e.g.,
lexical and prosodic cues, have been used in most previous investigations on
speaking turn prediction. In this paper, we present a statistical model to
automatically detect floor control shifts using both verbal and nonverbal cues.
Our experimental results show that using a combination of verbal and nonverbal
cues provides more accurate detection. Keywords: floor control, language model, multimodal fusion, nonverbal communication,
prosody | |||
| Static vs. dynamic modeling of human nonverbal behavior from multiple cues and modalities | | BIBAK | Full-Text | 23-30 | |
| Stavros Petridis; Hatice Gunes; Sebastian Kaltwang; Maja Pantic | |||
| Human nonverbal behavior recognition from multiple cues and modalities has
attracted a lot of interest in recent years. Despite this interest, many
research questions remain open, including the type of feature representation,
the choice of static vs. dynamic classification schemes, the number and type of
cues or modalities to use, and the optimal way of fusing these. This paper
compares frame-based vs. window-based feature
representation and employs static vs. dynamic classification schemes for two
distinct problems in the field of automatic human nonverbal behavior analysis:
multicue discrimination between posed and spontaneous smiles from facial
expressions, head and shoulder movements, and audio-visual discrimination
between laughter and speech. Single cue and single modality results are
compared to multicue and multimodal results by employing Neural Networks,
Hidden Markov Models (HMMs), and 2- and 3-chain coupled HMMs. Subject
independent experimental evaluation shows that: 1) both for static and dynamic
classification, fusing data coming from multiple cues and modalities proves
useful to the overall task of recognition, 2) the type of feature
representation appears to have a direct impact on the classification
performance, and 3) static classification is comparable to dynamic
classification both for multicue discrimination between posed and spontaneous
smiles, and audio-visual discrimination between laughter and speech. Keywords: dynamic classification, frame-based representation, multicue and multimodal
fusion, static classification, window-based representation | |||
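To make the frame-based vs. window-based distinction discussed above concrete, here is a minimal sketch of window-based feature aggregation over frame-level cues. The cue dimensions and window sizes are hypothetical, not the authors'.

```python
# Illustrative sketch: window-based vectors aggregate statistics of frame-level
# cues before a static classifier is applied (not the authors' features).
import numpy as np

def window_features(frames, win=25, hop=10):
    """frames: (n_frames, n_cues) frame-level cues (e.g., facial, audio).
    Returns one mean+std vector per sliding window."""
    feats = []
    for start in range(0, len(frames) - win + 1, hop):
        w = frames[start:start + win]
        feats.append(np.concatenate([w.mean(axis=0), w.std(axis=0)]))
    return np.array(feats)

# Hypothetical sequence of 300 frames with 10 cues each.
frames = np.random.default_rng(0).normal(size=(300, 10))
print(window_features(frames).shape)  # windows x (2 * n_cues)
```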
| Dialog in the open world: platform and applications | | BIBAK | Full-Text | 31-38 | |
| Dan Bohus; Eric Horvitz | |||
| We review key challenges of developing spoken dialog systems that can engage
in interactions with one or multiple participants in relatively unconstrained
environments. We outline a set of core competencies for open-world dialog, and
describe three prototype systems. The systems are built on a common underlying
conversational framework which integrates an array of predictive models and
component technologies, including speech recognition, head and pose tracking,
probabilistic models for scene analysis, multiparty engagement and turn taking,
and inferences about user goals and activities. We discuss the current models
and showcase their function by means of a sample recorded interaction, and we
review results from an observational study of open-world, multiparty dialog in
the wild. Keywords: engagement, floor management, multimodal, multiparty interaction, open-world
models, situated interaction, spoken dialog, turn-taking | |||
| Towards adapting fantasy, curiosity and challenge in multimodal dialogue systems for preschoolers | | BIBAK | Full-Text | 39-46 | |
| Theofanis Kannetis; Alexandros Potamianos | |||
| We investigate how fantasy, curiosity and challenge contribute to the user
experience in multimodal dialogue computer games for preschool children. For
this purpose, an on-line multimodal platform has been designed, implemented and
used as a starting point to develop web-based speech-enabled applications for
children. Five task oriented games suitable for preschoolers have been
implemented with varying levels of fantasy and curiosity elements, as well as
variable difficulty levels. Nine preschool children, ages 4-6, were asked to
play these games in three sessions; in each session only one of the fantasy,
curiosity, or challenge factors was evaluated. Both objective and subjective
criteria were used to evaluate the factors and applications. Results show that
fantasy and curiosity are correlated with children's entertainment, while the
level of difficulty seems to depend on each child's individual preferences and
capabilities. In addition, high speech usage and high curiosity levels in the
application correlate well with task completion, showing that preschoolers
become more engaged when multimodal interfaces are speech enabled and contain
curiosity elements. Keywords: adaptation, dialogue, evaluation, multimodal, preschoolers | |||
| Building multimodal applications with EMMA | | BIBAK | Full-Text | 47-54 | |
| Michael Johnston | |||
| Multimodal interfaces combining natural modalities such as speech and touch
with dynamic graphical user interfaces can make it easier and more effective
for users to interact with applications and services on mobile devices.
However, building these interfaces remains a complex and highly specialized task.
The W3C EMMA standard provides a representation language for inputs to
multimodal systems facilitating plug-and-play of system components and rapid
prototyping of interactive multimodal systems. We illustrate the capabilities
of the EMMA standard through examination of its use in a series of mobile
multimodal applications for the iPhone. Keywords: gesture, multimodal, prototyping, speech, standards | |||
| A speaker diarization method based on the probabilistic fusion of audio-visual location information | | BIBAK | Full-Text | 55-62 | |
| Kentaro Ishizuka; Shoko Araki; Kazuhiro Otsuka; Tomohiro Nakatani; Masakiyo Fujimoto | |||
| This paper proposes a speaker diarization method for determining "who spoke
when" in multi-party conversations, based on the probabilistic fusion of audio
and visual location information. The audio and visual information is obtained
from a compact system designed to analyze round table multi-party
conversations. The system consists of two cameras and a triangular microphone
array with three microphones, and can cover a spherical region. Speaker
locations are estimated from audio and visual observations in terms of azimuths
from this recording system. Unlike conventional speech diarization methods, our
proposed method estimates the probability of the presence of multiple
simultaneous speakers in a physical space with a small microphone setup instead
of using a cascade consisting of speech activity detection, direction of
arrival estimation, acoustic feature extraction, and information criteria based
speaker segmentation. To estimate the speaker presence more correctly, the
speech presence probabilities in a physical space are integrated with the
probabilities estimated from participants' face locations obtained with a
robust particle filtering based face tracker with two cameras equipped with
fisheye lenses. The locations in a physical space with highly integrated
probabilities are then classified into a certain number of speaker classes by
using on-line classification to realize speaker diarization. The probability
calculations and speaker classifications are conducted on-line, making it
unnecessary to observe all the conversation data. An experiment using real
casual conversations, which include more overlaps and short speech segments
than formal meetings, showed the advantages of the proposed method. Keywords: multi-modal systems, multi-party conversation analysis, speaker diarization | |||
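The abstract above fuses audio and visual location probabilities over azimuth before classifying speaker directions. The sketch below shows one simple way such a fusion can be written; the product rule, discretization, and threshold are assumptions for illustration, not the authors' exact model.

```python
# Illustrative sketch of probabilistic audio-visual fusion over azimuth:
# speech-presence probabilities from a microphone array are combined with
# face-location probabilities from cameras, and peaks in the fused map are
# taken as candidate active-speaker directions.
import numpy as np

azimuths = np.arange(0, 360, 5)                            # discretized directions
p_speech = np.random.default_rng(1).random(len(azimuths))  # hypothetical audio term
p_face = np.random.default_rng(2).random(len(azimuths))    # hypothetical visual term

# Simple product fusion (assumes conditional independence of the two cues).
p_speaker = p_speech * p_face
p_speaker /= p_speaker.sum()

active = azimuths[p_speaker > 2.0 / len(azimuths)]  # crude threshold on the fused map
print("candidate speaker azimuths:", active[:5])
```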
| Dynamic robot autonomy: investigating the effects of robot decision-making in a human-robot team task | | BIBAK | Full-Text | 63-70 | |
| Paul Schermerhorn; Matthias Scheutz | |||
| Robot autonomy is of high relevance for HRI, in particular for interactions
of humans and robots in mixed human-robot teams. In this paper, we investigate
empirically the extent to which autonomy based on independent decision making
and acting by the robot can affect the objective task performance of a mixed
human-robot team while being subjectively acceptable to humans. The results
demonstrate that humans not only accept robot autonomy in the interest of the
team, but also view the robot more as a team member and find it easier to
interact with, despite a very minimalist graphical/speech interface. Moreover,
we find evidence that dynamic autonomy reduces human cognitive load. Keywords: human-robot interaction, robot autonomy | |||
| A speech mashup framework for multimodal mobile services | | BIBAK | Full-Text | 71-78 | |
| Giuseppe Di Fabbrizio; Thomas Okken; Jay G. Wilpon | |||
| Amid today's proliferation of Web content and mobile phones with broadband
data access, interacting with small-form factor devices is still cumbersome.
Spoken interaction could overcome the input limitations of mobile devices, but
running an automatic speech recognizer with the limited computational
capabilities of a mobile device becomes an impossible challenge when large
vocabularies for speech recognition must often be updated with dynamic content.
One popular option is to move the speech processing resources into the network
by concentrating the heavy computation load onto server farms. Although
successful services have exploited this approach, it is unclear how such a
model can be generalized to a large range of mobile applications and how to
scale it for large deployments. To address these challenges we introduce the
AT&T speech mashup architecture, a novel approach to speech services that
leverages web services and cloud computing to make it easier to combine web
content and speech processing. We show that this new compositional method is
suitable for integrating automatic speech recognition and text-to-speech
synthesis resources into real multimodal mobile services. The generality of
this method allows researchers and speech practitioners to explore a countless
variety of mobile multimodal services with a finer grain of control and richer
multimedia interfaces. Moreover, we demonstrate that the speech mashup is
scalable and particularly optimized to minimize round trips in the mobile
network, reducing latency for better user experience. Keywords: mashups, multimodal, speech mashup, speech services, speech system
architecture, web services | |||
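The abstract above argues for moving speech processing into the network behind web services. Purely as an illustration of that client-side pattern, here is a sketch that posts audio to a network recognizer over HTTP; the endpoint, parameters, and response format are hypothetical, not the actual AT&T speech mashup API.

```python
# Illustrative sketch of the "speech in the network" idea: a thin mobile client
# sends audio to a hosted recognizer and consumes the result as JSON.
import requests

ASR_URL = "https://example.com/speech-mashup/recognize"  # hypothetical endpoint

def recognize(wav_path, grammar="web-search"):
    """Send audio to a network ASR service and return the top hypothesis."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            ASR_URL,
            params={"grammar": grammar},       # hypothetical query parameter
            data=f,
            headers={"Content-Type": "audio/wav"},
            timeout=10,
        )
    resp.raise_for_status()
    return resp.json().get("hypothesis", "")

# e.g. text = recognize("query.wav"); then pass `text` to a web content API.
```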
| Detecting, tracking and interacting with people in a public space | | BIBAK | Full-Text | 79-86 | |
| Sunsern Cheamanunkul; Evan Ettinger; Matt Jacobsen; Patrick Lai; Yoav Freund | |||
| We have built a system that engages naive users in an audio-visual
interaction with a computer in an unconstrained public space. We combine audio
source localization techniques with face detection algorithms to detect and
track the user throughout a large lobby. The sensors we use are an ad-hoc
microphone array and a PTZ camera. To engage the user, the PTZ camera turns and
points at sounds made by people passing by. From this simple pointing of a
camera, the user is made aware that the system has acknowledged their presence.
To further engage the user, we develop a face classification method that
identifies and then greets previously seen users. The user can interact with
the system through a simple hot-spot based gesture interface. To make the user
interactions with the system feel natural, we utilize reconfigurable hardware,
achieving a visual response time of less than 100ms. We rely heavily on machine
learning methods to make our system self-calibrating and adaptive. Keywords: boosting, machine learning, real-time hardware, tracking | |||
| Cache-based language model adaptation using visual attention for ASR in meeting scenarios | | BIBAK | Full-Text | 87-90 | |
| Neil J. Cooke; Martin J. Russell | |||
| In a typical group meeting involving discussion and collaboration, people
look at one another, at shared information resources such as presentation
material, and also at nothing in particular. In this work we investigate
whether the knowledge of what a person is looking at may improve the
performance of Automatic Speech Recognition (ASR). A framework for cache
Language Model (LM) adaptation is proposed with the cache based on a person's
Visual Attention (VA) sequence. The framework attempts to measure the
appropriateness of adaptation from VA sequence characteristics. Evaluation on
the AMI Meeting corpus data shows reduced LM perplexity. This work demonstrates
the potential for cache-based LM adaptation using VA information in large
vocabulary ASR deployed in meeting scenarios. Keywords: multimodal, visual attention | |||
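The cache-based LM adaptation described above can be pictured as interpolating a background language model with a cache built from the current visual-attention target. Here is a minimal sketch under that assumption; the interpolation weight and cache contents are hypothetical.

```python
# Illustrative sketch of cache-based LM adaptation (not the authors' system):
# a unigram cache built from words tied to the visually attended target is
# interpolated with a background language model probability.
from collections import Counter

def cache_lm_prob(word, background_prob, cache_words, lam=0.1):
    """P(word) = (1 - lam) * P_background(word) + lam * P_cache(word)."""
    counts = Counter(cache_words)
    p_cache = counts[word] / len(cache_words) if cache_words else 0.0
    return (1.0 - lam) * background_prob + lam * p_cache

# Hypothetical cache: words from the slide the speaker is currently looking at.
cache = ["budget", "quarterly", "forecast", "budget"]
print(cache_lm_prob("budget", background_prob=1e-4, cache_words=cache))
```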
| Multimodal end-of-turn prediction in multi-party meetings | | BIBAK | Full-Text | 91-98 | |
| Iwan de Kok; Dirk Heylen | |||
| One of the many skills required to engage properly in a conversation is knowing
the appropriate use of the rules of engagement. A virtual human or robot should,
for instance, be able to know when it is being addressed or when the speaker is
about to hand over the turn.
The paper presents a multimodal approach to end-of-speaker-turn prediction
using sequential probabilistic models (Conditional Random Fields) to learn a
model from observations of real-life multi-party meetings. Although the results
are not as good as expected, we provide insight into which modalities are
important when taking a multimodal approach to the problem based on literature
and our own results. Keywords: end-of-turn prediction, multimodal, probabilistic model | |||
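The abstract above frames end-of-turn prediction as sequence labelling with Conditional Random Fields. The sketch below shows that framing with a linear-chain CRF from the third-party sklearn-crfsuite package; the frames, features, and labels are toy assumptions, not the authors' corpus or feature set.

```python
# Illustrative sketch of per-frame "cont"/"end" labelling with a linear-chain CRF.
import sklearn_crfsuite

# Two toy turns; each frame carries hypothetical multimodal cues.
sequences = [
    [{"speaking": "yes", "gaze_at_listener": "no"},
     {"speaking": "yes", "gaze_at_listener": "yes"},
     {"speaking": "no",  "gaze_at_listener": "yes"}],
    [{"speaking": "yes", "gaze_at_listener": "no"},
     {"speaking": "no",  "gaze_at_listener": "no"}],
]
labels = [["cont", "cont", "end"], ["cont", "end"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(sequences, labels)
print(crf.predict(sequences))  # per-frame end-of-turn predictions
```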
| Recognizing communicative facial expressions for discovering interpersonal emotions in group meetings | | BIBAK | Full-Text | 99-106 | |
| Shiro Kumano; Kazuhiro Otsuka; Dan Mikami; Junji Yamato | |||
| This paper proposes a novel facial expression recognizer and describes its
application to group meeting analysis. Our goal is to automatically discover
the interpersonal emotions that evolve over time in meetings, e.g. how each
person feels about the others, or who affectively influences the others the
most. As the emotion cue, we focus on facial expression, more specifically
smile, and aim to recognize "who is smiling at whom, when, and how often",
since frequently smiling carries affective messages that are strongly directed
to the person being looked at; this point of view is our novelty. To detect
such communicative smiles, we propose a new algorithm that jointly estimates
facial pose and expression in the framework of the particle filter. The main
feature is its automatic selection of interest points that can robustly capture
small changes in expression even in the presence of large head rotations. Based
on the recognized facial expressions and their directions to others, which are
indicated by the estimated head poses, we visualize interpersonal smile events
as a graph structure, which we call the interpersonal emotional network; it is
intended to indicate the emotional relationships among meeting participants. A
four-person meeting captured by an omnidirectional video system is used to
confirm the effectiveness of the proposed method and the potential of our
approach for deep understanding of human relationships developed through
communications. Keywords: direction of facial expression, facial expression, interpersonal emotion,
meeting analysis | |||
| Classification of patient case discussions through analysis of vocalisation graphs | | BIBAK | Full-Text | 107-114 | |
| Saturnino Luz; Bridget Kane | |||
| This paper investigates the use of amount and structure of talk as a basis
for automatic classification of patient case discussions in multidisciplinary
medical team meetings recorded in a real-world setting. We model patient case
discussions as vocalisation graphs, building on research from the fields of
interaction analysis and social psychology. These graphs are "content free" in
that they only encode patterns of vocalisation and silence. The fact that it
does not rely on automatic transcription makes the technique presented in this
paper an attractive complement to more sophisticated speech processing methods
as a means of indexing medical team meetings. We show that despite the
simplicity of the underlying representation mechanism, accurate classification
performance (F-scores: F_1 = 0.98, for medical patient case discussions, and
F_1 = 0.97, for surgical case discussions) can be achieved with a simple
k-nearest neighbour classifier when vocalisations are represented at the level
of individual speakers. Possible applications of the method in health
informatics for storage and retrieval of multimedia medical meeting records are
discussed. Keywords: electronic medical records, language and action patterns, medical team
meetings, patient case discussions | |||
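To illustrate the "content free" representation plus k-nearest-neighbour classification described above, here is a minimal sketch that turns vocalisation/silence events into per-speaker proportions and classifies them. The event encoding and the two toy discussions are assumptions, not the authors' vocalisation-graph formalism.

```python
# Illustrative sketch: amount-and-structure-of-talk features with a 1-NN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def vocalisation_vector(events, n_speakers=4):
    """events: list of (speaker_id or None, duration); None marks silence.
    Returns per-speaker talk proportions followed by the silence proportion."""
    total = sum(d for _, d in events)
    talk = np.zeros(n_speakers + 1)
    for spk, dur in events:
        talk[spk if spk is not None else n_speakers] += dur
    return talk / total

# Hypothetical training data: two discussions with known types.
X = np.array([
    vocalisation_vector([(0, 30), (1, 10), (None, 5), (0, 20)]),
    vocalisation_vector([(2, 40), (3, 25), (None, 15)]),
])
y = ["medical", "surgical"]
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([vocalisation_vector([(0, 25), (1, 15), (None, 5)])]))
```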
| Learning from preferences and selected multimodal features of players | | BIBAK | Full-Text | 115-118 | |
| Georgios N. Yannakakis | |||
| The influence of multimodal sources of input data to the construction of
accurate computational models of user preferences is investigated in this
paper. The case study presented explores player entertainment preferences of
physical game variants incorporating two data modalities. The main findings of
the paper reveal the benefit of multiple modalities of input data for the
prediction of preferences and highlight the impact of feature selection on the
construction of such models. Keywords: augmented reality games, evolving artificial neural networks, player
satisfaction modeling, preference learning | |||
| Detecting user engagement with a robot companion using task and social interaction-based features | | BIBAK | Full-Text | 119-126 | |
| Ginevra Castellano; André Pereira; Iolanda Leite; Ana Paiva; Peter W. McOwan | |||
| Affect sensitivity is of the utmost importance for a robot companion to be
able to display socially intelligent behaviour, a key requirement for
sustaining long-term interactions with humans. This paper explores a
naturalistic scenario in which children play chess with the iCat, a robot
companion. A person-independent, Bayesian approach to detect the user's
engagement with the iCat robot is presented. Our framework models both causes
and effects of engagement: features related to the user's non-verbal behaviour,
the task and the companion's affective reactions are identified to predict the
children's level of engagement. An experiment was carried out to train and
validate our model. Results show that our approach based on multimodal
integration of task and social interaction-based features outperforms those
based solely on non-verbal behaviour or contextual information (94.79% vs.
93.75% and 78.13%). Keywords: affect recognition, contextual information, human-robot interaction,
non-verbal expressive behaviour | |||
| Multi-modal features for real-time detection of human-robot interaction categories | | BIBAK | Full-Text | 127-134 | |
| Ian R. Fasel; Masahiro Shiomi; Philippe-Emmanuel Chadutaud; Takayuki Kanda; Norihiro Hagita; Hiroshi Ishiguro | |||
| Social interactions unfold over time, at multiple time scales, and can be
observed through multiple sensory modalities. In this paper, we propose a
machine learning framework for selecting and combining low-level sensory
features from different modalities to produce high-level characterizations of
human-robot social interactions in real-time.
We introduce a novel set of fast, multi-modal, spatio-temporal features for
audio sensors, touch sensors, floor sensors, laser range sensors, and the
time-series history of the robot's own behaviors. A subset of these features is
automatically selected and combined using GentleBoost, an ensemble machine
learning technique, allowing the robot to make an estimate of the current
interaction category every 100 milliseconds. This information can then be used
either by the robot to make decisions autonomously, or by a remote human
operator who can modify the robot's behavior manually (i.e., semi-autonomous
operation). We demonstrate the technique on an information-kiosk robot deployed
in a busy train station, focusing on the problem of detecting interaction
breakdowns (i.e., failure of the robot to engage in a good interaction). We show
that despite the varied and unscripted nature of human-robot interactions in the
real-world train-station setting, the robot can achieve highly accurate
predictions of interaction breakdowns at the same instant human observers become
aware of them. Keywords: human-robot interaction, multi-modal features | |||
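GentleBoost, the ensemble method named in the abstract above, fits regression stumps by weighted least squares and reweights examples each round. Here is a minimal generic sketch of that algorithm; the feature matrix and "breakdown" labels are hypothetical, and this is not the authors' feature set or implementation.

```python
# Illustrative sketch of GentleBoost with regression stumps.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentleboost_fit(X, y, n_rounds=50):
    """X: (n, d) features; y: labels in {-1, +1}. Returns the fitted stumps."""
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(n_rounds):
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, y, sample_weight=w)   # weighted least-squares fit of y
        f = stump.predict(X)
        w *= np.exp(-y * f)                # GentleBoost weight update
        w /= w.sum()
        stumps.append(stump)
    return stumps

def gentleboost_predict(stumps, X):
    return np.sign(sum(s.predict(X) for s in stumps))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # hypothetical multimodal features
y = np.where(X[:, 0] + X[:, 3] > 0, 1, -1)  # hypothetical "breakdown" labels
stumps = gentleboost_fit(X, y)
print((gentleboost_predict(stumps, X) == y).mean())
```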
| Modeling culturally authentic style shifting with virtual peers | | BIBAK | Full-Text | 135-142 | |
| Justine Cassell; Kathleen Geraghty; Berto Gonzalez; John Borland | |||
| We report on a new kind of culturally-authentic embodied conversational
agent more in line with the ways that culture and ethnicity function in the
real world. On the basis of the careful analysis of a corpus of verbal and
nonverbal behavior, we found that children shift dialects and ways of using
their body depending on social context and task. Based on these results, we
implemented a culturally authentic African American virtual peer capable of
"code-switching" between African American English and Mainstream American
English, and of using nonverbal behavior differently, depending on context. An
evaluation of the agent revealed that the virtual peer elicited the same style
changes in real children as real children did in one another. Keywords: analysis and modeling of verbal and nonverbal interaction, culture, embodied
conversational agents | |||
| Between linguistic attention and gaze fixations in multimodal conversational interfaces | | BIBAK | Full-Text | 143-150 | |
| Rui Fang; Joyce Y. Chai; Fernanda Ferreira | |||
| In multimodal human machine conversation, successfully interpreting human
attention is critical. While attention has been studied extensively in
linguistic processing and visual processing, it is not clear how linguistic
attention is aligned with visual attention in multimodal conversational
interfaces. To address this issue, we conducted a preliminary investigation on
how attention reflected by linguistic discourse aligns with attention indicated
by gaze fixations during human machine conversation. Our empirical findings
have shown that more attended entities based on linguistic discourse correspond
to higher intensity of gaze fixations. The smoother a linguistic transition is,
the smaller the distance between the corresponding fixation distributions. These findings
provide insight into how language and gaze can be combined to predict
attention, which have important implications in many tasks such as word
acquisition and object recognition. Keywords: gaze fixations, linguistic attention, multimodal conversational interfaces | |||
| Head-up interaction: can we break our addiction to the screen and keyboard? | | BIBAK | Full-Text | 151-152 | |
| Stephen Brewster | |||
| Mobile user interfaces are commonly based on techniques developed for
desktop computers in the 1970s, often including buttons, sliders, windows and
progress bars. These can be hard to use on the move, which then limits the way
we use our devices and the applications on them. This talk will look at the
possibility of moving away from these kinds of interactions to ones more suited
to mobile devices and their dynamic contexts of use where users need to be able
to look where they are going, carry shopping bags and hold on to children.
Multimodal (gestural, audio and haptic) interactions provide us with new ways to use
our devices that can be eyes and hands free, and allow users to interact in a
'head up' way. These new interactions will facilitate new services,
applications and devices that fit better into our daily lives and allow us to
do a whole host of new things.
Brewster will discuss some of the work being done on input using gestures made
with the fingers, wrist and head, along with work on output using non-speech
audio, 3D sound and tactile displays in mobile applications such as text entry,
camera phone user interfaces and navigation. He will also discuss some of the
issues of social acceptability of these new interfaces. Keywords: mobile user interfaces, multimodal human-computer interaction, multiple
sensory input | |||
| Fusion engines for multimodal input: a survey | | BIBAK | Full-Text | 153-160 | |
| Denis Lalanne; Laurence Nigay; Philippe Palanque; Peter Robinson; Jean Vanderdonckt; Jean-François Ladry | |||
| Fusion engines are fundamental components of multimodal interactive
systems, interpreting input streams whose meaning can vary according to the
context, task, user and time. Other surveys have considered multimodal
interactive systems; we focus more closely on the design, specification,
construction and evaluation of fusion engines. We first introduce some
terminology and set out the major challenges that fusion engines aim to
solve. A history of past work in the field of fusion engines is then presented
using the BRETAM model. These approaches to fusion are then classified. The
classification considers the types of application, the fusion principles and
the temporal aspects. Finally, the challenges for future work in the field of
fusion engines are set out. These include software frameworks, quantitative
evaluation, machine learning and adaptation. Keywords: fusion engine, interaction techniques, multimodal interfaces | |||
| A fusion framework for multimodal interactive applications | | BIBAK | Full-Text | 161-168 | |
| Hildeberto Mendonça; Jean-Yves Lionel Lawson; Olga Vybornova; Benoit Macq; Jean Vanderdonckt | |||
| This research aims to propose a multi-modal fusion framework for high-level
data fusion between two or more modalities. It takes as input low level
features extracted from different system devices, analyses and identifies
intrinsic meanings in these data. Extracted meanings are mutually compared to
identify complementarities, ambiguities and inconsistencies to better
understand the user intention when interacting with the system. The whole
fusion life cycle will be described and evaluated in an office environment
scenario, where two co-workers interact by voice and movements, which might
show their intentions. The fusion in this case focuses on combining
modalities for capturing a context to enhance the user experience. Keywords: context-sensitive interaction, multi-modal fusion, speech recognition | |||
| Benchmarking fusion engines of multimodal interactive systems | | BIBAK | Full-Text | 169-176 | |
| Bruno Dumas; Rolf Ingold; Denis Lalanne | |||
| This article proposes an evaluation framework to benchmark the performance
of multimodal fusion engines. The paper first introduces different concepts and
techniques associated with multimodal fusion engines and further surveys recent
implementations. It then discusses the importance of evaluation as a means to
assess fusion engines, not only from the user perspective, but also at a
performance level. The article further proposes a benchmark and a formalism to
build testbeds for assessing multimodal fusion engines. In its last section,
our current fusion engine and the associated system HephaisTK are evaluated
using the evaluation framework proposed in this article. The article
concludes with a discussion on the proposed quantitative evaluation,
suggestions to build useful testbeds, and proposes some future improvements. Keywords: fusion engines evaluation, multimodal fusion, multimodal interfaces,
multimodal toolkit | |||
| Temporal aspects of CARE-based multimodal fusion: from a fusion mechanism to composition components and WoZ components | | BIBAK | Full-Text | 177-184 | |
| Marcos Serrano; Laurence Nigay | |||
| The CARE properties (Complementarity, Assignment, Redundancy and
Equivalence) define various forms that multimodal input interaction can take.
While Equivalence and Assignment express the availability and respective
absence of choice between multiple input modalities for performing a given
task, Complementarity and Redundancy describe relationships between modalities
and require fusion mechanisms. In this paper we present a summary of the work
we have carried out using the CARE properties for conceiving and implementing
multimodal interaction, as well as a new approach using WoZ components. We
present different technical solutions for implementing the Complementarity and
Redundancy of modalities with a focus on the temporal aspects of the fusion.
Starting from a monolithic fusion mechanism, we then explain our
component-based approach and the composition components (i.e., Redundancy and
Complementarity components). As a new contribution for exploring design
solutions before implementing an adequate fusion mechanism as well as for
tuning the temporal aspects of the performed fusion, we introduce Wizard of Oz
(WoZ) fusion components. We illustrate the composition components as well as
the implemented tools exploiting them using several multimodal systems
including a multimodal slide viewer and a multimodal map navigator. Keywords: Wizard of Oz, component-based approach, design and implementation tool,
fusion, multimodal interaction | |||
| Formal description techniques to support the design, construction and evaluation of fusion engines for sure (safe, usable, reliable and evolvable) multimodal interfaces | | BIBAK | Full-Text | 185-192 | |
| Jean-François Ladry; David Navarre; Philippe Palanque | |||
| Representing the behaviour of multimodal interactive systems in a complete,
concise and non-ambiguous way is still a challenge for formal description
techniques (FDT). Depending on the FDT, multimodal interactive systems feature
specific characteristics that are either cumbersome or impossible to capture
with classical FDT. This is due to the multiple (potentially synergistic) use
of modalities and the strong temporal constraints usually encountered in this
kind of systems that have to be dealt with exhaustively if FDT are used. This
paper focuses on the requirements for the modelling and construction of fusion
engines for multimodal interfaces. It proposes a formal description technique
dedicated to the engineering of interactive multimodal systems able to address
the challenges of fusion engines. Such benefits are presented on a set of
examples illustrating both the constructs and the process. Keywords: formal description techniques, fusion engines, interactive software
engineering, model-based approaches, multimodal interfaces, safety-critical
interactive systems | |||
| Multimodal inference for driver-vehicle interaction | | BIBAK | Full-Text | 193-198 | |
| Tevfik Metin Sezgin; Ian Davies; Peter Robinson | |||
| In this paper we present a novel system for driver-vehicle interaction which
combines speech recognition with facial-expression recognition to increase
intention recognition accuracy in the presence of engine- and road-noise. Our
system would allow drivers to interact with in-car devices such as satellite
navigation and other telematic or control systems. We describe a pilot study
and experiment in which we tested the system, and show that multimodal fusion
of speech and facial expression recognition provides higher accuracy than
either modality alone. Keywords: driver monitoring, facial-expression recognition, multimodal inference,
speech recognition | |||
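The abstract above reports that fusing speech and facial-expression recognition improves intention recognition over either modality alone. As an illustration only, here is a sketch of decision-level (late) fusion of two posterior vectors with a weighted log-linear rule; the intent set, posteriors, and weight are hypothetical, and the authors' fusion scheme may differ.

```python
# Illustrative sketch of late fusion of two recognizers' posteriors.
import numpy as np

def fuse(p_speech, p_face, w_speech=0.6):
    """Weighted log-linear combination of two posterior vectors."""
    logp = w_speech * np.log(p_speech) + (1.0 - w_speech) * np.log(p_face)
    p = np.exp(logp - logp.max())
    return p / p.sum()

intents = ["navigate", "adjust_climate", "play_music"]
p_speech = np.array([0.5, 0.3, 0.2])  # hypothetical ASR/NLU posterior (noisy)
p_face = np.array([0.2, 0.6, 0.2])    # hypothetical facial-expression posterior
print(intents[int(np.argmax(fuse(p_speech, p_face)))])
```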
| Multimodal integration of natural gaze behavior for intention recognition during object manipulation | | BIBAK | Full-Text | 199-206 | |
| Thomas Bader; Matthias Vogelgesang; Edmund Klaus | |||
| Gaze is naturally used for visual perception of our environment, and gaze
movements are mainly controlled subconsciously. Forcing the user to consciously
diverge from that natural gaze behavior for interaction purposes causes high
cognitive workload and destroys information contained in natural gaze
movements. Instead of proposing a new gaze-based interaction technique, we
analyze natural gaze behavior during an object manipulation task and show how
it can be used for intention recognition, which provides a universal basis
for integrating gaze into multimodal interfaces for different applications. We
propose a model for multimodal integration of natural gaze behavior and
evaluate it for two different use cases, namely for improvement of robustness
of other potentially noisy input cues and for the design of proactive
interaction techniques. Keywords: gaze-based interaction, intention recognition, model | |||
| Salience in the generation of multimodal referring acts | | BIBAK | Full-Text | 207-210 | |
| Paul Piwek | |||
| Pointing combined with verbal referring is one of the most paradigmatic
human multimodal behaviours. The aim of this paper is foundational: to uncover
the central notions that are required for a computational model of multimodal
referring acts that include a pointing gesture. The paper draws on existing
work on the generation of referring expressions and shows that in order to
extend that work with pointing, the notion of salience needs to play a pivotal
role. The paper starts by investigating the role of salience in the generation
of referring expressions and introduces a distinction between two opposing
approaches: salience-first and salience-last accounts. The paper then argues
that these differ not only in computational efficiency, as has been pointed out
previously, but also lead to incompatible empirical predictions. The second
half of the paper shows how a salience-first account nicely meshes with a range
of existing empirical findings on multimodal reference. A novel account of the
circumstances under which speakers choose to point is proposed that directly
links salience with pointing. Finally, this account is placed within a
multi-dimensional model of salience for multimodal reference. Keywords: incremental algorithm, pointing gestures, referring expressions, salience | |||
| Communicative gestures in coreference identification in multiparty meetings | | BIBAK | Full-Text | 211-218 | |
| Tyler Baldwin; Joyce Y. Chai; Katrin Kirchhoff | |||
| During multiparty meetings, participants can use non-verbal modalities such
as hand gestures to make reference to the shared environment. Therefore, one
hypothesis is that incorporating hand gestures can improve coreference
identification, a task that automatically identifies what participants refer to
with their linguistic expressions. To evaluate this hypothesis, this paper
examines the role of hand gestures in coreference identification, in
particular, focusing on two questions: (1) what signals can distinguish
communicative gestures that can potentially help coreference identification
from non-communicative gestures; and (2) in what ways can communicative
gestures help coreference identification. Based on the AMI data, our empirical
results have shown that the length of gesture production is highly indicative
of whether a gesture is communicative and potentially helpful in language
understanding. Our experiments on the automated identification of coreferring
expressions indicate that while the incorporation of simple gesture features
does not improve overall performance, it does show potential on expressions
referring to participants, an important and unique component of the meeting
domain. A further analysis suggests that communicative gestures provide both
redundant and complementary information, but further domain modeling and world
knowledge incorporation is required to take full advantage of information that
is complementary. Keywords: hand gesture, multiparty meetings, reference resolution | |||
| Realtime meeting analysis and 3D meeting viewer based on omnidirectional multimodal sensors | | BIBAK | Full-Text | 219-220 | |
| Kazuhiro Otsuka; Shoko Araki; Dan Mikami; Kentaro Ishizuka; Masakiyo Fujimoto; Junji Yamato | |||
| This demo presents a realtime system for analyzing group meetings. Targeting
round-table meetings, this system employs an omnidirectional camera-microphone
system. The goal of this system is to automatically discover "who is talking to
whom and when". To that purpose, the face pose/position of meeting participants
are tracked on panorama images acquired from fisheye-based omnidirectional
cameras. From audio signals obtained with a microphone array, speaker
diarization, i.e. the estimation of "who is speaking and when", is carried out.
The visual focus of attention, i.e. "who is looking at whom", is estimated from
the result of face tracking. The results are displayed based on a 3D
visualization scheme. The main advantage of our system is its realtime operation. We will
demonstrate the portable version of the system consisting of two laptop PCs. In
addition, we will showcase our meeting playback viewer with man-machine
interfaces that allow users to freely control space and time of meeting scenes.
With this viewer, users can also experience 3D positional sound effect linked
with 3D viewpoint, using enhanced audio tracks for each participant. Keywords: fisheye lens, focus of attention, meeting analysis, microphone array,
omnidirectional cameras, realtime system, speaker diarization | |||
| Guiding hand: a teaching tool for handwriting | | BIBAK | Full-Text | 221-222 | |
| Nalini Vishnoi; Cody Narber; Zoran Duric; Naomi Lynn Gerber | |||
| The goal of our demonstration is to illustrate how a haptic force-feedback
device can be used to assist people with disabilities in learning
fine motor tasks, such as writing. We will be demonstrating this idea by the
simulation of several letters and symbols. We use electromagnetic sensors
(MotionStar Wireless2) to capture unencumbered movements performed by a
'normal' individual. The captured movement is translated to the haptic
coordinate system with the use of a table-top centered frame as an intermediate
frame. The translated movement is then fed into our haptic system, which varies
the exerted force as a function of trainee performance. Our demonstration will
use the Phantom Omni for the simulation of these writing tasks, and it will
also provide visual feedback of the desired and user trajectories. Keywords: data translation, ems, handwriting, haptic, hci, motionstar wireless,
phantom omni/premium | |||
| A multimedia retrieval system using speech input | | BIBAK | Full-Text | 223-224 | |
| Andrei Popescu-Belis; Peter Poller; Jonathan Kilgour | |||
| The AMIDA Automatic Content Linking Device (ACLD) monitors a conversation
using automatic speech recognition (ASR), and uses the detected words to
retrieve documents that are of potential use to the participants in the
conversation. The document set that is available includes project related
documents such as reports, memos or emails, as well as snippets of past
meetings that were transcribed using offline ASR. In addition, results of Web
searches are also displayed. Several visualisation interfaces are available. Keywords: just-in-time retrieval, meeting assistants | |||
| Navigation with a passive brain based interface | | BIBAK | Full-Text | 225-226 | |
| Jan B. F. van Erp; Peter J. Werkhoven; Marieke E. Thurlings; Anne-Marie M. Brouwer | |||
| In this paper, we describe a Brain Computer Interface (BCI) for navigation.
The system is based on detecting brain signals that are elicited by tactile
stimulation on the torso indicating the desired direction. Keywords: bci, bmi, brain-computer interface, brain-machine interface, cognition,
neuroscience, perception | |||
| A multimodal predictive-interactive application for computer assisted transcription and translation | | BIBAK | Full-Text | 227-228 | |
| Vicent Alabau; Daniel Ortiz; Verónica Romero; Jorge Ocampo | |||
| Traditionally, Natural Language Processing (NLP) technologies have mainly
focused on full automation. However, full automation often proves unnatural in
many applications, where technology is expected to assist rather than replace
the human agents.
As a consequence, Multimodal Interactive (MI) technologies have emerged. On the
one hand, the user interactively co-operates with the system to improve system
accuracy. On the other hand, multimodality improves system ergonomics. In this
paper, we present an application that implements such MI technologies. First, we
have designed an Application Programming Interface (API), featuring a
client-server framework, to deal with the most common NLP MI tasks. Second, we
have developed a generic client application. The resulting client-server
architecture has been successfully tested with two different NLP problems:
transcription of text images and translation of texts. Keywords: handwritten recognition, interactive framework, machine translation,
multimodality | |||
| Multi-modal communication system | | BIBAK | Full-Text | 229-230 | |
| Victor S. Finomore; Dianne K. Popik; Douglas S. Brungart; Brian D. Simpson | |||
| The Multi-Modal Communication (MMC) tool was designed to alleviate the
workload and errors associated with intensive radio communication environments.
MMC captures, records, and displays the radio communication to the operator so
that they have instant access to all current and past information. This
eliminates the perishable nature of radio communication and allows the
operators to focus on the task instead of remembering and writing down
information. The MMC tool also employs virtual audio display technology to
spatialized the multiple audio signals to aid in the intelligibility of the
radio communication. The combination of these technologies has led to the
design of a communication interface that will improve the performance of
operators confronted with monitoring a high volume of radio communication. Keywords: communication system, distributed/collaborative multimodal interfaces,
network-centric | |||
| HephaisTK: a toolkit for rapid prototyping of multimodal interfaces | | BIBAK | Full-Text | 231-232 | |
| Bruno Dumas; Denis Lalanne; Rolf Ingold | |||
| This article introduces HephaisTK, a toolkit for rapid prototyping of
multimodal interfaces. After briefly discussing the state of the art, the
architecture traits of the toolkit are displayed, along with the major features
of HephaisTK: agent-based architecture, ability to plug in easily new input
recognizers, fusion engine and configuration by means of a SMUIML XML file.
Finally, applications created with the HephaisTK toolkit are discussed. Keywords: human-machine interaction, multimodal interfaces, multimodal toolkit | |||
| State: an assisted document transcription system | | BIBAK | Full-Text | 233-234 | |
| David Llorens; Andrés Marzal; Federico Prat; Juan Miguel Vilar | |||
| State is an interactive system for ancient and handwritten document
transcription with several input modalities for entering and correcting text.
It has a flexible architecture that allows easy connection to different OCR
systems. Keywords: ancient documents, handwriting, text transcription | |||
| Demonstration: first steps in emotional expression of the humanoid robot Nao | | BIBAK | Full-Text | 235-236 | |
| Jérôme Monceaux; Joffrey Becker; Céline Boudier; Alexandre Mazel | |||
| We created a library of emotional expressions, and not an emotional system,
for the humanoid robot Nao from Aldebaran Robotics. This set of expressions
could be used by robot behavior designers to create advanced behaviors, or by
an emotion simulator. It offers insight into joint work between an invited
anthropologist and robotics researchers, which resulted in about a hundred
animations. We do not provide a review of the literature. Keywords: Nao, choregraphe, demonstration, emotional expression, expressive robot
behaviors, humanoid robot | |||
| WiiNote: multimodal application facilitating multi-user photo annotation activity | | BIBAK | Full-Text | 237-238 | |
| Elena Mugellini; Maria Sokhn; Stefano Carrino; Omar Abou Khaled | |||
| In this paper, we describe a multimodal application, called WiiNote,
facilitating multi-user photo annotation activity. The application allows up to
4 users to simultaneously annotate their pictures, adding either textual or
vocal comments. Users use the Wii Remote device to select the whole picture or
a specific region of it to be annotated. Annotations can be either free or
structured, i.e. based on a domain-specific data model expressed using the
MPEG-7 standard or the RDF ontology language. Keywords: Wii remote, multimedia annotation, multimodal system, semantic | |||
| Are gesture-based interfaces the future of human computer interaction? | | BIBAK | Full-Text | 239-240 | |
| Frederic Kaplan | |||
| The historical evolution of human machine interfaces shows a continuous
tendency towards more physical interactions with computers. Nevertheless, the
mouse and keyboard paradigm is still the dominant one and it is not yet clear
whether any of the recent innovative interaction techniques is a real
challenger to this supremacy. To discuss the future of gesture-based
interfaces, I shall build on my own experience in conceiving and launching QB1,
probably the first computer delivered with no mouse or keyboard but equipped
with a depth-perceiving camera enabling interaction with gestures. The ambition
of this talk is to define more precisely how gestures change the way we can
interact with computers, discuss how to design robust interfaces adapted to
this new medium and review what kind of applications benefit the most from this
type of interaction. Through a series of examples, we will see that it is
important to consider gestures not as a way of emulating a mouse pointer at a
distance or as elements of a "vocabulary" of commands, but as a new interaction
paradigm where the interface components are organized in the user's physical
space. This is a shift of reference frame, from a metaphorical virtual space
(e.g. the desktop) where the user controls a representation of himself (e.g.
the mouse pointer) to a truly user-centered augmented reality interface where
the user directly touches and manipulates interface components positioned
around his body. To achieve this kind of interactivity, depth-perceiving
cameras can be relevantly associated with robotic techniques and machine vision
algorithms to create a "halo" of interactivity that can literally follow the
user while he moves in a room. In return, this new kind of intimacy with a
computer interface paves the way for innovative machine learning approaches to
context understanding. A computer like QB1 knows more about its user than any
other personal computer so far. Gesture-based interaction is not a means of
replacing the mouse with cooler or more intuitive ways of interacting but leads
to a fundamentally different approach to the design of human-computer interfaces. Keywords: 3D camera, gesture-based interface, robotic computer | |||
| Providing expressive eye movement to virtual agents | | BIBAK | Full-Text | 241-244 | |
| Zheng Li; Xia Mao; Lei Liu | |||
| Non-verbal behavior, particularly eye movement, plays a fundamental role in
nonverbal communication among people. In order to realize natural and intuitive
human-agent interaction, the virtual agents need to employ this communicative
channel effectively. Against this background, our research addresses the
problem of emotionally expressive eye movement by describing a
preliminary approach based on parameters extracted from real-time eye movement
data (pupil size, blink rate and saccade). Keywords: eye movement, nonverbal behavior, virtual agent | |||
| Mediated attention with multimodal augmented reality | | BIBAK | Full-Text | 245-252 | |
| Angelika Dierker; Christian Mertes; Thomas Hermann; Marc Hanheide; Gerhard Sagerer | |||
| We present an Augmented Reality (AR) system to support collaborative tasks
in a shared real-world interaction space by facilitating joint attention. The
users are assisted by information about their interaction partner's field of
view both visually and acoustically. In our study, the audiovisual improvements
are compared with an AR system without these support mechanisms in terms of the
participants' reaction times and error rates. The participants performed a
simple object-choice task we call the "gaze game" to ensure controlled
experimental conditions. Additionally, we asked the subjects to fill in a
questionnaire to gain subjective feedback from them. We were able to show an
improvement for both dependent variables as well as positive feedback for the
visual augmentation in the questionnaire. Keywords: artificial communication channels, augmented reality, collaboration, cscw,
field of view, joint attention, mediated attention, multimodal | |||
| Grounding spatial prepositions for video search | | BIBAK | Full-Text | 253-260 | |
| Stefanie Tellex; Deb Roy | |||
| Spatial language video retrieval is an important real-world problem that
forms a test bed for evaluating semantic structures for natural language
descriptions of motion on naturalistic data. Video search by natural language
query requires that linguistic input be converted into structures that operate
on video in order to find clips that match a query. This paper describes a
framework for grounding the meaning of spatial prepositions in video. We
present a library of features that can be used to automatically classify a
video clip based on whether it matches a natural language query. To evaluate
these features, we collected a corpus of natural language descriptions about
the motion of people in video clips. We characterize the language used in the
corpus, and use it to train and test models for the meanings of the spatial
prepositions "to," "across," "through," "out," "along," "towards," and
"around." The classifiers can be used to build a spatial language video
retrieval system that finds clips matching queries such as "across the
kitchen." Keywords: spatial language, video retrieval | |||
| Multi-modal and multi-camera attention in smart environments | | BIBAK | Full-Text | 261-268 | |
| Boris Schauerte; Jan Richarz; Thomas Plötz; Christian Thurau; Gernot A. Fink | |||
| This paper considers the problem of multi-modal saliency and attention.
Saliency is a cue that is often used for directing attention of a computer
vision system, e.g., in smart environments or for robots. Unlike the majority
of recent publications on visual/audio saliency, we aim at a well-grounded
integration of several modalities. The proposed framework is based on fuzzy
aggregations and offers a flexible, plausible, and efficient way for combining
multi-modal saliency information. Besides incorporating different modalities,
we extend classical 2D saliency maps to multi-camera and multi-modal 3D
saliency spaces. For experimental validation we realized the proposed system
within a smart environment. The evaluation took place for a demanding setup
under real-life conditions, including focus of attention selection for multiple
subjects and concurrently active modalities. Keywords: attention, multi-camera, multi-camera control, multi-modal, smart
environment, spatial saliency, view selection | |||
| RVDT: a design space for multiple input devices, multiple views and multiple display surfaces combination | | BIBAK | Full-Text | 269-276 | |
| Rami Ajaj; Christian Jacquemin; Frédéric Vernier | |||
| We study interaction combination performed using a tabletop device, a mouse,
and/or a six Degrees Of Freedom (DOF) input device in a system combining a
2D flat (map-kind) view presented horizontally and a 3D perspective vertical
view of the same virtual environment. The design of such a 2D/3D interface
relies on the RVDT model and its design space that allow easy high-level
combined interactions to achieve spatial tasks. RVDT integrates the relations
between physical and numerical DOFs and applies to any graphical user interface
in which multiple views, multiple display surfaces and multiple input devices
are combined. The user study shows that experienced users prefer
table-top/6DOF input device interaction combination with a maximal number of
elementary tasks performed with both devices. Keywords: multiple display surfaces, multiple input devices, multiple views | |||
| Learning and predicting multimodal daily life patterns from cell phones | | BIBAK | Full-Text | 277-280 | |
| Katayoun Farrahi; Daniel Gatica-Perez | |||
| In this paper, we investigate the multimodal nature of cell phone data in
terms of discovering recurrent and rich patterns in people's lives. We present
a method that can discover routines from multiple modalities (location and
proximity) jointly modeled, and that uses these informative routines to predict
unlabeled or missing data. Using a joint representation of location and
proximity data over approximately 10 months of 97 individuals' lives, Latent
Dirichlet Allocation is applied for the unsupervised learning of topics
describing people's most common locations jointly with the most common types of
interactions at these locations. We further successfully predict where and with
how many other individuals users will be, for people with both highly varying
and less varying lifestyles. Keywords: data prediction, mobile phone data, multi-modal data, reality mining, topic
models | |||
| Visual based picking supported by context awareness: comparing picking performance using paper-based lists versus lists presented on a head mounted display with contextual support | | BIBAK | Full-Text | 281-288 | |
| Hendrik Iben; Hannes Baumann; Carmen Ruthenbeck; Tobias Klug | |||
| Warehouse picking is a traditional part of assembly and inventory control,
and several commercial wearable computers address this market. However, head
mounted displays (HMDs) are not yet used in these companies' products. We
present a 16-person user study that compares the efficiency and perceived
workload of paper picking lists versus a HMD system aided by contextual cueing.
With practice, users of the HMD system made significantly faster picks and
fewer mistakes related to missing or additional picked items overall. Keywords: HMD, picking | |||
| Adaptation from partially supervised handwritten text transcriptions | | BIBAK | Full-Text | 289-292 | |
| Nicolás Serrano; Daniel Pérez; Albert Sanchis; Alfons Juan | |||
| An effective approach to transcribe handwritten text documents is to follow
an interactive-predictive paradigm in which the system is guided by the
user, and the user is assisted by the system to complete the transcription task
as efficiently as possible. This approach has been recently implemented in a
system prototype called GIDOC, in which standard speech technology is adapted
to handwritten text (line) images: HMM-based text image modelling, n-gram
language modelling, and also confidence measures on recognized words.
Confidence measures are used to assist the user in locating possible
transcription errors, and thus validate system output after only supervising
those (few) words for which the system is not highly confident. Here, we study
the effect of using these partially supervised transcriptions on the adaptation
of image and language models to the task. Keywords: computer-assisted text transcription, confidence measures, document
analysis, handwriting recognition | |||
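The key mechanism above is supervising only low-confidence words and reusing the partially supervised transcription for adaptation. A minimal sketch of that selection step follows, assuming per-word posterior confidences from the recognizer; the threshold, the data structures, and the `ask_user` callback are illustrative, not GIDOC's implementation.

```python
def partially_supervise(hypothesis, confidences, ask_user, threshold=0.8):
    """Keep high-confidence recognized words as-is and ask the user to
    verify only the low-confidence ones (threshold is illustrative)."""
    supervised = []
    for word, confidence in zip(hypothesis, confidences):
        if confidence < threshold:
            supervised.append(ask_user(word))    # user corrects or confirms
        else:
            supervised.append(word)              # accepted without supervision
    return supervised

# Hypothetical recognizer output for one text-line image.
hypothesis = ["the", "quick", "brovvn", "fox"]
confidences = [0.97, 0.91, 0.42, 0.95]
transcription = partially_supervise(hypothesis, confidences, ask_user=lambda w: "brown")
# The partially supervised transcription is then used to adapt the
# HMM image models and the n-gram language model.
```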
| Recognizing events with temporal random forests | | BIBAK | Full-Text | 293-296 | |
| David Demirdjian; Chenna Varri | |||
| In this paper, we present a novel technique for classifying multimodal
temporal events. Our main contribution is the introduction of temporal random
forests (TRFs), an extension of random forests (and decision trees in general)
to the time domain. The approach is relatively simple and able to
discriminatively learn event classes while performing feature selection in an
implicit fashion. We describe here our ongoing research and present experiments
performed on gesture and audio-visual speech recognition datasets comparing our
method against state-of-the-art algorithms. Keywords: decision trees, temporal event recognition | |||
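The abstract above does not spell out the TRF split criterion, so the sketch below only approximates the idea: an ordinary random forest trained on flattened fixed-length windows, so that tree splits effectively select (time offset, feature) pairs. The windowing scheme, toy data, and classifier settings are assumptions and not the authors' algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def windowed_features(sequence, win=10):
    """Flatten every length-`win` window of a (T, n_features) sequence so a
    tree split can select a (time offset, feature) pair."""
    T = sequence.shape[0]
    return np.array([sequence[t:t + win].ravel() for t in range(T - win + 1)])

rng = np.random.default_rng(0)
# Two toy multimodal event classes: drifting-up vs. drifting-down trajectories.
rising = [np.cumsum(rng.normal(0.1, 1.0, (40, 3)), axis=0) for _ in range(20)]
falling = [np.cumsum(rng.normal(-0.1, 1.0, (40, 3)), axis=0) for _ in range(20)]

X_parts, y_parts = [], []
for label, events in ((1, rising), (0, falling)):
    for sequence in events:
        windows = windowed_features(sequence)
        X_parts.append(windows)
        y_parts.extend([label] * len(windows))

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(np.vstack(X_parts), np.array(y_parts))
# A new event would be classified by a majority vote over its windows.
```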
| Activity-aware ECG-based patient authentication for remote health monitoring | | BIBAK | Full-Text | 297-304 | |
| Janani C. Sriram; Minho Shin; Tanzeem Choudhury; David Kotz | |||
| Mobile medical sensors promise to provide an efficient, accurate, and
economic way to monitor patients' health outside the hospital. Patient
authentication is a necessary security requirement in remote health monitoring
scenarios. The monitoring system needs to make sure that the data is coming
from the right person before any medical or financial decisions are made based
on the data. Credential-based authentication methods (e.g., passwords,
certificates) are not well-suited for remote healthcare as patients could hand
over credentials to someone else. Furthermore, one-time authentication using
credentials or trait-based biometrics (e.g., face, fingerprints, iris) does not
cover the entire monitoring period and may lead to unauthorized
post-authentication use. Recent studies have shown that the human
electrocardiogram (ECG) exhibits unique patterns that can be used to
discriminate individuals. However, perturbation of the ECG signal due to
physical activity is a major obstacle in applying the technology in real-world
situations. In this paper, we present a novel ECG and accelerometer-based
system that can authenticate individuals in an ongoing manner under various
activity conditions. We describe the probabilistic authentication system we
have developed and present experimental results from 17 individuals. Keywords: ECG, biometrics, mobile computing, security | |||
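As a hedged illustration of the activity-aware idea above, the sketch below estimates a coarse activity state from accelerometer variance and scores ECG-beat features against an activity-conditioned template of the enrolled patient. The features, templates, distance measure, and threshold are invented for illustration; the paper's probabilistic model is not reproduced here.

```python
import numpy as np

def activity_state(accel_window, variance_threshold=1.5):
    """Coarse activity estimate from the variance of accelerometer magnitude."""
    magnitude = np.linalg.norm(accel_window, axis=1)
    return "active" if np.var(magnitude) > variance_threshold else "rest"

def authenticate(beat_features, accel_window, templates, max_distance=2.0):
    """Score ECG-beat features against the enrolled template for the current
    activity state; the distance measure and threshold are illustrative."""
    template = templates[activity_state(accel_window)]
    z = (beat_features - template["mean"]) / template["std"]   # per-feature normalization
    return np.linalg.norm(z) < max_distance    # True = accept as the enrolled patient

# Hypothetical enrolled templates and one new observation window.
templates = {
    "rest":   {"mean": np.array([0.80, 62.0, 0.35]), "std": np.array([0.05, 4.0, 0.03])},
    "active": {"mean": np.array([0.65, 95.0, 0.30]), "std": np.array([0.08, 8.0, 0.04])},
}
accel_window = np.random.normal(0.0, 0.2, (256, 3))   # mostly at rest
beat_features = np.array([0.82, 60.0, 0.36])          # e.g. R amplitude, heart rate, interval ratio
accepted = authenticate(beat_features, accel_window, templates)
```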
| GaZIR: gaze-based zooming interface for image retrieval | | BIBAK | Full-Text | 305-312 | |
| László Kozma; Arto Klami; Samuel Kaski | |||
| We introduce GaZIR, a gaze-based interface for browsing and searching for
images. The system computes on-line predictions of relevance of images based on
implicit feedback, and when the user zooms in, the images predicted to be the
most relevant are brought out. The key novelty is that the relevance feedback
is inferred from implicit cues obtained in real-time from the gaze pattern,
using an estimator learned during a separate training phase. The natural
zooming interface can be connected to any content-based information retrieval
engine operating on user feedback. We show with experiments on one engine that
there is a sufficient amount of information in the gaze patterns to make the
estimated relevance feedback a viable choice to complement or even replace
explicit feedback by pointing-and-clicking. Keywords: gaze-based interface, image retrieval, implicit feedback, zooming interface | |||
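The core of the interface above is an estimator that maps gaze patterns to relevance. A minimal sketch under assumed features (total dwell time, fixation count, time to first fixation) and an assumed logistic-regression estimator is shown below; the actual feature set and learner used by GaZIR may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Offline training phase: gaze features per viewed image with known relevance.
# Assumed features: total dwell time (s), fixation count, time to first fixation (s).
X_train = np.array([
    [1.8, 6, 0.3], [0.2, 1, 1.1], [2.5, 8, 0.2],
    [0.4, 2, 0.9], [1.2, 5, 0.4], [0.1, 1, 1.4],
])
y_train = np.array([1, 0, 1, 0, 1, 0])        # 1 = image was relevant

gaze_estimator = LogisticRegression().fit(X_train, y_train)

# Browsing phase: turn live gaze patterns into implicit relevance scores,
# which a content-based retrieval engine can consume like explicit feedback.
live_gaze = np.array([[1.5, 7, 0.25], [0.3, 1, 1.0]])
implicit_relevance = gaze_estimator.predict_proba(live_gaze)[:, 1]
```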
| Voice key board: multimodal Indic text input | | BIBAK | Full-Text | 313-318 | |
| Prasenjit Dey; Ramchandrula Sitaram; Rahul Ajmera; Kalika Bali | |||
| Multimodal systems, incorporating more natural input modalities like speech,
hand gesture, facial expression etc., can make human-computer-interaction more
intuitive by drawing inspiration from spontaneous human-human-interaction. We
present here a multimodal input device for Indic scripts called the Voice Key
Board (VKB) which offers a simpler and more intuitive method for input of Indic
scripts. VKB exploits the syllabic nature of Indic language scripts and
builds on the user's mental model of Indic scripts, wherein a base consonant
character is modified by different vowel ligatures to represent the actual
syllabic character. We also present a user evaluation result for VKB comparing
it with the most common input method for the Devanagari script, the InScript
keyboard. The results indicate a strong user preference for VKB in terms of
input speed and learnability. Though VKB starts with a higher user error rate
compared to InScript, the error rate drops by 55% by the end of the experiment,
and the input speed of VKB is found to be 81% higher than that of InScript. Our user
study results point to interesting research directions for the use of multiple
natural modalities for Indic text input. Keywords: human-computer-interaction, Indic text, multimodal systems, syllabic
scripts, text input, voice keyboard | |||
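The "base consonant modified by a vowel ligature" model that VKB builds on can be illustrated directly with Unicode Devanagari, where a dependent vowel sign (matra) modifies a consonant to form the syllabic character. The snippet below shows only this standard Unicode composition, not VKB's input pipeline.

```python
# In Devanagari, a syllabic character is typically a base consonant plus a
# dependent vowel sign (matra) -- the mental model that VKB builds on.
KA = "\u0915"            # क  (consonant KA)
MATRAS = {
    "a":  "",            # inherent vowel, no matra
    "aa": "\u093E",      # ा
    "i":  "\u093F",      # ि
    "u":  "\u0941",      # ु
    "e":  "\u0947",      # े
}

def syllable(consonant, vowel):
    """Compose a syllabic character from a base consonant and a vowel key."""
    return consonant + MATRAS[vowel]

for vowel in MATRAS:
    print(vowel, syllable(KA, vowel))   # ka, kaa (का), ki (कि), ku (कु), ke (के)
```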
| Evaluating the effect of temporal parameters for vibrotactile saltatory patterns | | BIBAK | Full-Text | 319-326 | |
| Jukka Raisamo; Roope Raisamo; Veikko Surakka | |||
| Cutaneous saltation provides interesting possibilities for applications. An
illusion of vibrotactile mediolateral movement was elicited on the left dorsal
forearm to investigate emotional (i.e., pleasantness) and cognitive (i.e.,
continuity) responses to vibrotactile stimulation. Twelve participants were
presented with nine saltatory stimuli delivered to a linearly aligned row of
three vibrotactile actuators separated by 70 mm in distance. The stimuli were
composed of three temporal parameters of 12, 24 and 48 ms for both burst
duration and inter-burst interval to form all nine possible uniform pairs.
First, the stimuli were ranked by the participants using a special three-step
procedure. Second, the participants rated the stimuli using two nine-point
bipolar scales measuring the pleasantness and continuity of each stimulus,
separately. The results showed that especially the interval between two
successive bursts was a significant factor for saltation. Moreover, the
temporal parameters seemed to affect the experienced continuity of the stimuli
more than their pleasantness. These findings encourage us to further study
saltation and the effect of different parameters on subjective
experience. Keywords: cutaneous saltation, haptics, human-technology interaction, vibrotactile
patterns | |||
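To make the stimulus structure above concrete, the sketch below generates the on/off schedule for one saltatory pattern over a row of three actuators, parameterized by burst duration and inter-burst interval. The number of bursts per actuator and the event format are assumptions for illustration.

```python
def saltatory_schedule(burst_ms, interval_ms, actuators=(0, 1, 2), bursts_per_actuator=2):
    """Return (time_ms, actuator, state) events for a saltatory pattern that
    hops along a row of actuators with uniform burst/interval timing."""
    events, t = [], 0
    for actuator in actuators:
        for _ in range(bursts_per_actuator):
            events.append((t, actuator, "on"))
            t += burst_ms
            events.append((t, actuator, "off"))
            t += interval_ms
    return events

# One of the nine uniform pairs from the study: 24 ms bursts, 24 ms intervals.
for event in saltatory_schedule(burst_ms=24, interval_ms=24):
    print(event)
```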
| Mapping information to audio and tactile icons | | BIBAK | Full-Text | 327-334 | |
| Eve Hoggan; Roope Raisamo; Stephen A. Brewster | |||
| We report the results of a study focusing on the meanings that can be
conveyed by audio and tactile icons. Our research considers the following
question: how can audio and tactile icons be designed to optimise congruence
between crossmodal feedback and the type of information this feedback is
intended to convey? For example, if we have a set of system warnings,
confirmations, progress updates and errors: what audio and tactile
representations best match the information or type of message? Is one modality
more appropriate at presenting certain types of information than the other
modality? The results of this study indicate that certain parameters of the
audio and tactile modalities such as rhythm, texture and tempo play an
important role in the creation of congruent sets of feedback when given a
specific type of information to transmit. We argue that a combination of audio
or tactile parameters derived from our results allows the same type of
information to be conveyed through touch and sound with an intuitive match to
the content of the message. Keywords: auditory feedback, earcons, information mapping, mobile touchscreen
interaction, tactile feedback, tactons | |||
| Augmented reality target finding based on tactile cues | | BIBAK | Full-Text | 335-342 | |
| Teemu Tuomas Ahmaniemi; Vuokko Tuulikki Lantz | |||
| This study is based on a user scenario where augmented reality targets could
be found by scanning the environment with a mobile device and getting tactile
feedback exactly in the direction of the target. In order to understand how
accurately and quickly the targets can be found, we prepared an experiment
setup where a sensor-actuator device consisting of orientation tracking
hardware and a tactile actuator were used. The targets with widths 5°,
10°, 15°, 20°, and 25° and various distances between each other
were rendered in a 90°-wide space successively, and the task of the test
participants was to find them as quickly as possible. The experiment consisted
of two conditions: the first one provided tactile feedback only when pointing
was on the target, and the second one also included another cue indicating the
proximity of the target. The average target finding time was 1.8 seconds. The
closest targets appeared not to be the easiest to find, which was attributed to
the adapted scanning velocity causing participants to miss them. We also
found that our data did not correlate well with Fitts' model, which may have
been caused by the non-normal data distribution. After filtering out 30% of the
least representative data items, the correlation reached up to 0.71. Overall,
the performance between conditions did not differ from each other
significantly. The only significant improvement in the performance offered by
the close-to-target cue occurred in the tasks where the targets were the
furthest from each other. Keywords: Fitts' Law, augmented reality, haptics, pointing | |||
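Two quantities underlie the analysis above: whether the pointing direction falls inside an angular target, and the Fitts index of difficulty used to test the model fit. A small sketch of both follows, using the Shannon formulation ID = log2(D/W + 1) with illustrative values.

```python
import math

def on_target(pointing_deg, target_center_deg, target_width_deg):
    """True when the pointing direction lies inside the angular target."""
    return abs(pointing_deg - target_center_deg) <= target_width_deg / 2

def fitts_index_of_difficulty(distance_deg, width_deg):
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance_deg / width_deg + 1)

# Example: a 10-degree-wide target whose centre lies 45 degrees away.
print(on_target(pointing_deg=47.0, target_center_deg=45.0, target_width_deg=10))
print(round(fitts_index_of_difficulty(distance_deg=45, width_deg=10), 2), "bits")
```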
| Speaker change detection with privacy-preserving audio cues | | BIBAK | Full-Text | 343-346 | |
| Sree Hari Krishnan Parthasarathi; A Mathew Magimai.-Doss; Daniel Gatica-Perez; Hervé Bourlard | |||
| In this paper we investigate a set of privacy-sensitive audio features for
speaker change detection (SCD) in multiparty conversations. These features are
based on three different principles: characterizing the excitation source
information using linear prediction residual, characterizing subband spectral
information shown to contain speaker information, and characterizing the
general shape of the spectrum. Experiments show that the performance of the
privacy-sensitive features is comparable to or better than that of the
state-of-the-art full-band spectral-based features, namely, mel frequency
cepstral coefficients, which suggests that socially acceptable ways of
recording conversations in real life are feasible. Keywords: modeling social interactions, multiparty conversations, privacy-sensitive
features, speaker change detection | |||
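One of the privacy-sensitive cues above characterizes the excitation source through the linear prediction residual. The sketch below computes per-frame LP residual log-energy with librosa and flags change points where adjacent windows of the feature differ strongly; the frame sizes, the distance rule, and the threshold are simplifications, not the paper's detector.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual_log_energy(y, frame_len=400, hop=160, order=12):
    """Per-frame log-energy of the linear prediction residual: an
    excitation-source cue that carries little lexical content."""
    energies = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * np.hamming(frame_len)
        a = librosa.lpc(frame, order=order)       # prediction-error filter A(z)
        residual = lfilter(a, [1.0], frame)       # inverse-filtered excitation
        energies.append(np.log(np.sum(residual ** 2) + 1e-10))
    return np.array(energies)

def change_points(track, win=50, threshold=1.0):
    """Flag frames where adjacent windows of the feature differ strongly --
    a crude stand-in for a proper (e.g. BIC-based) change detector."""
    return [t for t in range(win, len(track) - win)
            if abs(track[t - win:t].mean() - track[t:t + win].mean()) > threshold]

# Usage on a hypothetical recording: y, sr = librosa.load("meeting.wav", sr=16000)
# candidates = change_points(lp_residual_log_energy(y))
```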
| MirrorTrack: tracking with reflection -- comparison with top-down approach | | BIBAK | Full-Text | 347-350 | |
| Yannick Verdie; Bing Fang; Francis Quek | |||
| Tabletop hand tracking techniques have evolved much during the last few
years from single to multiple cameras, offering users an improved interactive
experience. MirrorTrack is one such technique. This paper compares the accuracy
of MirrorTrack with that of the top-down approach, which is generally used for
tabletop tasks. We focus on the comparison
of distance errors in finger trajectory, and clicking errors by manual
monitoring. Keywords: MirrorTrack, augmented reality, comparative study, computer vision, image
processing, top-down camera, vision-based hand tracking | |||
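The comparison above reports distance errors in finger trajectory; presumably this reduces to per-sample Euclidean distances between the tracked and reference trajectories, as in the small sketch below (an assumption about the metric, with made-up coordinates).

```python
import numpy as np

def trajectory_errors(tracked, reference):
    """Per-sample Euclidean distance between a tracked fingertip trajectory
    and the reference trajectory (arrays of shape (T, 2) or (T, 3))."""
    return np.linalg.norm(np.asarray(tracked) - np.asarray(reference), axis=1)

tracked = np.array([[10.2, 5.1], [11.0, 5.6], [12.1, 6.0]])
reference = np.array([[10.0, 5.0], [11.0, 5.5], [12.0, 6.2]])
print(trajectory_errors(tracked, reference).mean())   # mean distance error
```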
| A framework for continuous multimodal sign language recognition | | BIBAK | Full-Text | 351-358 | |
| Daniel Kelly; Jane Reilly Delannoy; John Mc Donald; Charles Markham | |||
| We present a multimodal system for the recognition of manual signs and
non-manual signals within continuous sign language sentences. In sign language,
information is mainly conveyed through hand gestures (Manual Signs). Non-manual
signals, such as facial expressions, head movements, body postures and torso
movements, are used to express a large part of the grammar and some aspects of
the syntax of sign language. In this paper we propose a multichannel HMM-based
system to recognize manual signs and non-manual signals. We choose a single
non-manual signal, head movement, to evaluate our framework when recognizing
non-manual signals. Manual signs and non-manual signals are processed
independently using continuous multidimensional HMMs and an HMM threshold model.
Experiments conducted demonstrate that our system achieved a detection ratio of
0.95 and a reliability measure of 0.93. Keywords: HMM, non-manual signals, sign language | |||
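A hedged sketch of the recognition-with-rejection idea above is given below using hmmlearn: one continuous Gaussian HMM per sign plus a garbage model standing in for the HMM threshold model, accepting the best sign only when it outscores the garbage model. Training data, model sizes, and the threshold-model construction are simplified assumptions rather than the authors' framework.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def train_hmm(sequences, n_states=3):
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", random_state=0)
    model.fit(np.vstack(sequences), [len(s) for s in sequences])
    return model

# Toy training sets for two "signs" (2-D hand-feature trajectories).
sign_a = [rng.normal(0.0, 0.3, (30, 2)) + np.linspace(0, 1, 30)[:, None] for _ in range(10)]
sign_b = [rng.normal(0.0, 0.3, (30, 2)) - np.linspace(0, 1, 30)[:, None] for _ in range(10)]
models = {"A": train_hmm(sign_a), "B": train_hmm(sign_b)}

# Simplified stand-in for the HMM threshold model: a garbage model trained on everything.
threshold_model = train_hmm(sign_a + sign_b, n_states=5)

def recognize(observation):
    scores = {name: m.score(observation) for name, m in models.items()}
    best = max(scores, key=scores.get)
    # Accept the best sign only if it outscores the threshold (garbage) model.
    return best if scores[best] > threshold_model.score(observation) else None

print(recognize(sign_a[0]))   # expected: "A"
```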