| Two-way eye contact between humans and robots | | BIBAK | Full-Text | 1-8 | |
| Yoshinori Kuno; Arihiro Sakurai; Dai Miyauchi; Akio Nakamura | |||
| Eye contact is an effective means of controlling human communication, such
as in starting communication. It seems that we can make eye contact if we
simply look at each other. However, this alone does not establish eye contact.
Both parties also need to be aware of being watched by the other. We propose a
method of two-way eye contact for human-robot communication. When a human wants
to start communication with a robot, he/she watches the robot. If the robot finds a
human looking at it, it turns to him/her, changing its facial expressions to let
him/her know that it is aware of his/her gaze. When the robot
wants to initiate communication with a particular person, it moves its body and
face toward him/her and changes its facial expressions to make the person
notice its gaze. We show several experimental results to prove the
effectiveness of this method. Moreover, we present a robot that can recognize
hand gestures after making eye contact with the human to show the usefulness of
eye contact as a means of controlling communication. Keywords: embodied agent, eye contact, gaze, gesture recognition, human-robot
interface, nonverbal behavior | |||
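The handshake above is essentially a small perception-action loop: detect a gaze directed at the robot, then turn and change expression so the person knows the gaze was noticed. A minimal sketch of the robot-side half of that loop follows; the gaze_sensor, head, and face interfaces and their methods are hypothetical placeholders rather than the authors' API.

```python
# Minimal sketch of the robot-side eye-contact handshake described above.
# gaze_sensor, head and face are hypothetical interfaces, not the authors' API.

def try_make_eye_contact(gaze_sensor, head, face):
    """Run once per camera frame; returns the person once contact is acknowledged."""
    person = gaze_sensor.find_person_looking_at_robot()   # a detected person, or None
    if person is None:
        return None
    head.turn_toward(person.position)     # orient the robot's body and face to the gazer
    face.set_expression("awareness")      # signal that the robot noticed being watched
    return person                         # two-way eye contact established
```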
| Another person's eye gaze as a cue in solving programming problems | | BIBAK | Full-Text | 9-15 | |
| Randy Stein; Susan E. Brennan | |||
| Expertise in computer programming can often be difficult to transfer
verbally. Moreover, technical training and communication occur more and more
between people who are located at a distance. We tested the hypothesis that
seeing one person's visual focus of attention (represented as an eyegaze
cursor) while debugging software (displayed as text on a screen) can be helpful
to another person doing the same task. In an experiment, a group of
professional programmers searched for bugs in small Java programs while wearing
an unobtrusive head-mounted eye tracker. Later, a second set of programmers
searched for bugs in the same programs. For half of the bugs, the second set of
programmers first viewed a recording of an eyegaze cursor from one of the first
programmers displayed over the (indistinct) screen of code, and for the other
half they did not. The second set of programmers found the bugs more quickly
after viewing the eye gaze of the first programmers, suggesting that another
person's eye gaze, produced instrumentally (as opposed to intentionally, like
pointing with a mouse), can be a useful cue in problem solving. This finding
supports the potential of eye gaze as a valuable cue for collaborative
interaction in a visuo-spatial task conducted at a distance. Keywords: debugging, eye tracking, gaze-based & attentional interfaces, mediated
communication, programming, visual co-presence | |||
| EyePrint: support of document browsing with eye gaze trace | | BIBAK | Full-Text | 16-23 | |
| Takehiko Ohno | |||
| Current digital documents provide few traces to support user browsing. This
makes document browsing difficult, and it can be hard to keep
track of all of the information. To overcome this problem, this paper proposes
a method of creating traces on digital documents. The method, called EyePrint,
generates a trace from the user's eye gaze in order to support the browsing of
digital documents. Traces are presented as highlighted areas on a document,
which become visual cues for accessing previously visited documents. Traces
also become document attributes that can be used to access and search the
document. A prototype system that works with a gaze tracking system is
developed. The result of a user study confirms the usefulness of the traces in
digital document browsing. Keywords: document browsing, eyePrint, gaze-based interaction, information retrieval,
readwear, reusability problem | |||
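One plausible reading of such gaze traces, sketched below under assumptions, is a per-region weight that grows with dwell time and slowly fades; the region layout, decay constant, and fixation format are illustrative only and are not taken from the paper.

```python
# Hypothetical sketch: turn gaze fixations into per-region "trace" weights that
# could later be rendered as highlights. Not the EyePrint implementation.

from collections import defaultdict

def accumulate_traces(fixations, regions, decay=0.99):
    """fixations: iterable of (x, y, duration_s); regions: {region_id: (x0, y0, x1, y1)}."""
    trace = defaultdict(float)
    for x, y, duration in fixations:
        for region_id, (x0, y0, x1, y1) in regions.items():
            trace[region_id] *= decay                  # older attention slowly fades
            if x0 <= x <= x1 and y0 <= y <= y1:
                trace[region_id] += duration           # dwell time strengthens the trace
    return dict(trace)                                 # higher weight -> stronger highlight

regions = {"para1": (0, 0, 600, 200), "para2": (0, 200, 600, 400)}
fixations = [(100, 50, 0.4), (120, 60, 0.3), (200, 250, 0.2)]
print(accumulate_traces(fixations, regions))
```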
| A framework for evaluating multimodal integration by humans and a role for embodied conversational agents | | BIBAK | Full-Text | 24-31 | |
| Dominic W. Massaro | |||
| One of the implicit assumptions of multi-modal interfaces is that
human-computer interaction is significantly facilitated by providing multiple
input and output modalities. Surprisingly, however, there is very little
theoretical and empirical research testing this assumption in terms of the
presentation of multimodal displays to the user. The goal of this paper is to
provide both a theoretical and an empirical framework for addressing this
important issue. Two contrasting models of human information processing are
formulated and contrasted in experimental tests. According to integration
models, multiple sensory influences are continuously combined during
categorization, leading to perceptual experience and action. The Fuzzy Logical
Model of Perception (FLMP) assumes that processing occurs in three successive
but overlapping stages: evaluation, integration, and decision (Massaro, 1998).
According to nonintegration models, any perceptual experience and action
results from only a single sensory influence. These models are tested in
expanded factorial designs in which two input modalities are varied
independently of one another in a factorial design and each modality is also
presented alone. Results from a variety of experiments on speech, emotion, and
gesture support the predictions of the FLMP. Baldi, an embodied conversational
agent, is described and implications for applications of multimodal interfaces
are discussed. Keywords: emotion, gesture, multisensory integration, speech | |||
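The abstract does not reproduce the FLMP's integration rule; for orientation, its commonly cited two-alternative form (after Massaro, 1998) multiplies the degrees of support contributed by each source and normalizes across alternatives:

```latex
% FLMP integration for a two-alternative (A vs. B) audio-visual judgment,
% where a_i and v_j in [0,1] are the auditory and visual degrees of support for A:
P(A \mid a_i, v_j) = \frac{a_i \, v_j}{a_i \, v_j + (1 - a_i)(1 - v_j)}
```

A nonintegration model would instead predict the response from a_i alone or v_j alone, which is precisely the contrast the expanded factorial design is able to detect.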
| From conversational tooltips to grounded discourse: head pose tracking in interactive dialog systems | | BIBAK | Full-Text | 32-37 | |
| Louis-Philippe Morency; Trevor Darrell | |||
| Head pose and gesture offer several key conversational grounding cues and
are used extensively in face-to-face interaction among people. While the
machine interpretation of these cues has previously been limited to output
modalities, recent advances in face-pose tracking allow for systems which are
robust and accurate enough to sense natural grounding gestures. We present the
design of a module that detects these cues and show examples of its integration
in three different conversational agents with varying degrees of discourse
model complexity. Using a scripted discourse model and off-the-shelf animation
and speech-recognition components, we demonstrate the use of this module in a
novel "conversational tooltip" task, where additional information is
spontaneously provided by an animated character when users attend to various
physical objects or characters in the environment. We further describe the
integration of our module in two systems where animated and robotic characters
interact with users based on rich discourse and semantic models. Keywords: conversational tooltips, grounding, head gesture recognition, head pose
tracking, human-computer interaction, interactive dialog system | |||
| Evaluation of spoken multimodal conversation | | BIBAK | Full-Text | 38-45 | |
| Niels Ole Bernsen; Laila Dybkjær | |||
| Spoken multimodal dialogue systems in which users address face-only or
embodied interface agents have been gaining ground in research for some time.
Although most systems are still strictly task-oriented, the field is now moving
towards domain-oriented systems and real conversational systems which are no
longer defined in terms of the task(s) they support. This paper describes the
first running prototype of such a system which enables spoken and gesture
interaction with life-like fairytale author Hans Christian Andersen about his
fairytales, life, study, etc., focusing on multimodal conversation. We then
present recent user test evaluation results on multimodal conversation. Keywords: evaluation, natural interaction, spoken conversation | |||
| Multimodal transformed social interaction | | BIBAK | Full-Text | 46-52 | |
| Matthew Turk; Jeremy Bailenson; Andrew Beall; Jim Blascovich; Rosanna Guadagno | |||
| Understanding human-human interaction is fundamental to the long-term
pursuit of powerful and natural multimodal interfaces. Nonverbal communication,
including body posture, gesture, facial expression, and eye gaze, is an
important aspect of human-human interaction. We introduce a paradigm for
studying multimodal and nonverbal communication in collaborative virtual
environments (CVEs) called Transformed Social Interaction (TSI), in which a
user's visual representation is rendered in a way that strategically filters
selected communication behaviors in order to change the nature of a social
interaction. To achieve this, TSI must employ technology to detect, recognize,
and manipulate behaviors of interest, such as facial expressions, gestures, and
eye gaze. In [13] we presented a TSI experiment called non-zero-sum gaze (NZSG)
to determine the effect of manipulated eye gaze on persuasion in a small group
setting. Eye gaze was manipulated so that each participant in a three-person
CVE received eye gaze from a presenter that was normal, less than normal, or
greater than normal. We review this experiment and discuss the implications of
TSI for multimodal interfaces. Keywords: computer-mediated communication, multimodal processing, transformed social
interaction | |||
| Multimodal interaction in an augmented reality scenario | | BIBAK | Full-Text | 53-60 | |
| Gunther Heidemann; Ingo Bax; Holger Bekel | |||
| We describe an augmented reality system designed for online acquisition of
visual knowledge and retrieval of memorized objects. The system relies on a
head mounted camera and display, which allow the user to view the environment
together with overlaid augmentations by the system. In this setup,
communication by hand gestures and speech is mandatory as common input devices
like mouse and keyboard are not available. Using gesture and speech, basically
three types of tasks must be handled: (i) communication with the system about
the environment, in particular, directing attention towards objects and
commanding the memorization of sample views; (ii) control of system operation,
e.g. switching between display modes; and (iii) re-adaptation of the interface
itself in case communication becomes unreliable due to changes in external
factors, such as illumination conditions. We present an architecture to manage
these tasks and describe and evaluate several of its key elements, including
modules for pointing gesture recognition, menu control based on gesture and
speech, and control strategies to cope with situations when vision becomes
unreliable and has to be re-adapted by speech. Keywords: augmented reality, human-machine-interaction, image retrieval, interfaces,
memory, mobile systems, object recognition | |||
| The ThreadMill architecture for stream-oriented human communication analysis applications | | BIBAK | Full-Text | 61-68 | |
| Paulo Barthelmess; Clarence A. Ellis | |||
| This work introduces a new component software architecture -- ThreadMill --
whose main purpose is to facilitate the development of applications in domains
where high volumes of streamed data need to be efficiently analyzed. It focuses
particularly on applications that target the analysis of human communication
e.g. in speech and gesture recognition. Applications in this domain usually
employ costly signal processing techniques, but offer in many cases ample
opportunities for concurrent execution in many different phases. ThreadMill's
abstractions facilitate the development of applications that take advantage of
this potential concurrency by hiding the complexity of parallel and distributed
programming. As a result, ThreadMill applications can be made to run unchanged
on a wide variety of execution environments, ranging from a single-processor
machine to a cluster of multi-processor nodes. The architecture is illustrated
by an implementation of a tracker for the hands and face of American Sign Language
signers that uses a parallel and concurrent version of the Joint Likelihood
Filter method. Keywords: human-communication analysis applications, software evolution | |||
| TouchLight: an imaging touch screen and display for gesture-based interaction | | BIBAK | Full-Text | 69-76 | |
| Andrew D. Wilson | |||
| A novel touch screen technology is presented. TouchLight uses simple image
processing techniques to combine the output of two video cameras placed behind
a semi-transparent plane in front of the user. The resulting image shows
objects that are on the plane. This technique is well suited for application
with a commercially available projection screen material (DNP HoloScreen) which
permits projection onto a transparent sheet of acrylic plastic in normal indoor
lighting conditions. The resulting touch screen display system transforms an
otherwise normal sheet of acrylic plastic into a high bandwidth input/output
surface suitable for gesture-based interaction. Image processing techniques are
detailed, and several novel capabilities of the system are outlined. Keywords: computer human interaction, computer vision, displays, gesture recognition,
videoconferencing | |||
| Walking-pad: a step-in-place locomotion interface for virtual environments | | BIBAK | Full-Text | 77-81 | |
| Laroussi Bouguila; Florian Evequoz; Michele Courant; Beat Hirsbrunner | |||
| This paper presents a new locomotion interface that provides users with the
ability to engage in a life-like walking experience using stepping in place.
Stepping actions are performed on top of a flat platform that has an embedded
grid of switch sensors that detect footfall pressure. Based on the sensor data,
the system can compute variables that represent the user's walking behavior,
such as walking direction, walking speed, and the detection of standstill,
jumping, and walking. The platform status is scanned at a rate of 100 Hz, which
allows real-time visual feedback in reaction to user actions. The proposed
system is portable and easy to integrate into major virtual environments with
large projection features, such as CAVE and DOME systems. The Walking-Pad weighs
less than 5 kg and can be connected to any computer via a USB port, which makes
it controllable even from a portable computer. Keywords: locomotion, sensors, step-in-place, virtual environment, walking-pad | |||
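As a rough illustration of how a 100 Hz scan of a switch grid can be reduced to the walking variables mentioned above, the sketch below computes a center of pressure per scan, derives a heading from its drift, and counts footfalls from contact onsets; the grid format and the heading heuristic are assumptions, not the Walking-Pad's published algorithm.

```python
# Hypothetical sketch: reduce 100 Hz scans of a boolean switch grid to a
# heading estimate and a stepping rate. Not the Walking-Pad implementation.

import numpy as np

def center_of_pressure(grid):
    """grid: 2-D boolean array of closed switches for one scan; None if no contact."""
    ys, xs = np.nonzero(grid)
    if len(xs) == 0:
        return None                          # no contact: standstill air phase or a jump
    return np.array([xs.mean(), ys.mean()])

def walking_state(scans, scan_rate_hz=100):
    """scans: list of boolean grids; returns (heading_vector, steps_per_second)."""
    cops = [c for c in (center_of_pressure(g) for g in scans) if c is not None]
    if len(cops) < 2:
        return None, 0.0
    heading = cops[-1] - cops[0]             # net drift of pressure ~ walking direction
    contacts = [center_of_pressure(g) is not None for g in scans]
    footfalls = sum(1 for a, b in zip(contacts, contacts[1:]) if not a and b)
    return heading, footfalls * scan_rate_hz / len(scans)
```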
| Multimodal detection of human interaction events in a nursing home environment | | BIBAK | Full-Text | 82-89 | |
| Datong Chen; Robert Malkin; Jie Yang | |||
| In this paper, we propose a multimodal system for detecting human activity
and interaction patterns in a nursing home. Activities of groups of people are
firstly treated as interaction patterns between any pair of partners and are
then further broken into individual activities and behavior events using a
multi-level context hierarchy graph. The graph is implemented using a dynamic
Bayesian network to statistically model the multi-level concepts. We have
developed a coarse-to-fine prototype system to illustrate the proposed concept.
Experimental results have demonstrated the feasibility of the proposed
approaches. The objective of this research is to automatically create concise
and comprehensive reports of activities and behaviors of patients to support
physicians and caregivers in a nursing facility. Keywords: group activity, human interaction, medical care, multimodal, stochastic
modeling | |||
| Elvis: situated speech and gesture understanding for a robotic chandelier | | BIBAK | Full-Text | 90-96 | |
| Joshua Juster; Deb Roy | |||
| We describe a home lighting robot that uses directional spotlights to create
complex lighting scenes. The robot senses its visual environment using a
panoramic camera and attempts to maintain its target goal state by adjusting
the positions and intensities of its lights. Users can communicate desired
changes in the lighting environment through speech and gesture (e.g., "Make it
brighter over there"). Information obtained from these two modalities are
combined to form a goal, a desired change in the lighting of the scene. This
goal is then incorporated into the system's target goal state. When the target
goal state and the world are out of alignment, the system formulates a
sensorimotor plan that acts on the world to return the system to homeostasis. Keywords: gesture, grounded, input methods, lighting, multimodal, natural interaction,
situated, speech | |||
| Towards integrated microplanning of language and iconic gesture for multimodal output | | BIBAK | Full-Text | 97-104 | |
| Stefan Kopp; Paul Tepper; Justine Cassell | |||
| When talking about spatial domains, humans frequently accompany their
explanations with iconic gestures to depict what they are referring to. For
example, when giving directions, it is common to see people making gestures
that indicate the shape of buildings, or outline a route to be taken by the
listener, and these gestures are essential to the understanding of the
directions. Based on results from an ongoing study on language and gesture in
direction-giving, we propose a framework to analyze such gestural images into
semantic units (image description features), and to link these units to
morphological features (hand shape, trajectory, etc.). This feature-based
framework allows us to generate novel iconic gestures for embodied
conversational agents, without drawing on a lexicon of canned gestures. We
present an integrated microplanner that derives the form of both coordinated
natural language and iconic gesture directly from given communicative goals,
and serves as input to the speech and gesture realization engine in our NUMACK
project. Keywords: embodied conversational agents, generation, gesture, language, multimodal
output | |||
| Exploiting prosodic structuring of coverbal gesticulation | | BIBAK | Full-Text | 105-112 | |
| Sanshzar Kettebekov | |||
| Although gesture recognition has been studied extensively, communicative,
affective, and biometrical "utility" of natural gesticulation remains
relatively unexplored. One of the main reasons for that is the modeling
complexity of spontaneous gestures. While lexical information in speech
provides additional cues for disambiguating gestures, it does not cover the rich
paralinguistic domain. This paper offers initial findings, drawn from a large
corpus of natural monologues, on the prosodic structuring between frequent
beat-like strokes and concurrent speech. Using a set of audio-visual features in an
HMM-based formulation, we are able to improve the discrimination between
visually similar gestures. Those types of articulatory strokes represent
different communicative functions. The analysis is based on the temporal
alignment of detected vocal perturbations and the concurrent hand movement. As
a supplementary result, we show that recognized articulatory strokes may be
used for quantifying gesturing behavior. Keywords: gesture, multimodal, prosody, speech | |||
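In the same spirit as the HMM-based formulation mentioned above, one per-class HMM can be trained on joint prosodic and kinematic feature sequences and a stroke classified by maximum log-likelihood. The sketch below uses the hmmlearn library purely for illustration; it is not the authors' implementation, and the feature layout is a placeholder.

```python
# Hypothetical sketch of HMM-based stroke discrimination with hmmlearn.

import numpy as np
from hmmlearn import hmm

def train_stroke_models(training_data, n_states=3):
    """training_data: {stroke_type: list of (T_i, D) audio-visual feature sequences}."""
    models = {}
    for stroke_type, sequences in training_data.items():
        X = np.vstack(sequences)                       # stack sequences for hmmlearn
        lengths = [len(seq) for seq in sequences]      # remember per-sequence lengths
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        model.fit(X, lengths)
        models[stroke_type] = model
    return models

def classify_stroke(models, sequence):
    """Return the stroke type whose HMM gives the observed sequence the highest likelihood."""
    return max(models, key=lambda label: models[label].score(sequence))
```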
| Visual and linguistic information in gesture classification | | BIBAK | Full-Text | 113-120 | |
| Jacob Eisenstein; Randall Davis | |||
| Classification of natural hand gestures is usually approached by applying
pattern recognition to the movements of the hand. However, the gesture
categories most frequently cited in the psychology literature are fundamentally
multimodal; the definitions make reference to the surrounding linguistic
context. We address the question of whether gestures are naturally multimodal,
or whether they can be classified from hand-movement data alone. First, we
describe an empirical study showing that the removal of auditory information
significantly impairs the ability of human raters to classify gestures. Then we
present an automatic gesture classification system based solely on an n-gram
model of linguistic context; the system is intended to supplement a visual
classifier, but achieves 66% accuracy on a three-class classification problem
on its own. This represents higher accuracy than human raters achieve when
presented with the same information. Keywords: gesture recognition, gesture taxonomies, multimodal disambiguation, validity | |||
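A toy version of a linguistic-context-only classifier is sketched below: one add-one-smoothed unigram model per gesture class, scored on the words spoken around the gesture. The paper's actual n-gram model is richer; the class names and smoothing here are illustrative assumptions.

```python
# Hypothetical sketch: classify a gesture from its surrounding words alone.

import math
from collections import Counter

class ContextGestureClassifier:
    def __init__(self):
        self.counts = {}              # gesture class -> Counter of context words
        self.vocab = set()

    def fit(self, examples):
        """examples: list of (gesture_class, list_of_context_words)."""
        for label, words in examples:
            self.counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)

    def predict(self, words):
        def log_likelihood(label):
            c = self.counts[label]
            total, v = sum(c.values()), len(self.vocab)
            return sum(math.log((c[w] + 1) / (total + v)) for w in words)  # add-one smoothing
        return max(self.counts, key=log_likelihood)

clf = ContextGestureClassifier()
clf.fit([("deictic", ["this", "one", "here"]), ("iconic", ["shaped", "like", "a", "box"])])
print(clf.predict(["over", "here"]))   # likely "deictic"
```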
| Multimodal model integration for sentence unit detection | | BIBAK | Full-Text | 121-128 | |
| Mary P. Harper; Elizabeth Shriberg | |||
| In this paper, we adopt a direct modeling approach to utilize conversational
gesture cues in detecting the boundaries of sentence units (SUs) in videotaped
conversations. We treat the detection of SUs as a classification task such that
for each inter-word boundary, the classifier decides whether there is an SU
boundary or not. In addition to gesture cues, we also utilize prosody and
lexical knowledge sources. In this first investigation, we find that gesture
features complement the prosodic and lexical knowledge sources for this task.
By using all of the knowledge sources, the model is able to achieve the lowest
overall SU detection error rate. Keywords: dialog, gesture, language models, multimodal fusion, prosody, sentence
boundary detection | |||
| When do we interact multimodally?: cognitive load and multimodal communication patterns | | BIBAK | Full-Text | 129-136 | |
| Sharon Oviatt; Rachel Coulston; Rebecca Lunsford | |||
| Mobile usage patterns often entail high and fluctuating levels of difficulty
as well as dual tasking. One major theme explored in this research is whether a
flexible multimodal interface supports users in managing cognitive load.
Findings from this study reveal that multimodal interface users spontaneously
respond to dynamic changes in their own cognitive load by shifting to
multimodal communication as load increases with task difficulty and
communicative complexity. Given a flexible multimodal interface, users' ratio
of multimodal (versus unimodal) interaction increased substantially from 18.6%
when referring to established dialogue context to 77.1% when required to
establish a new context, a +315% relative increase. Likewise, the ratio of
users' multimodal interaction increased significantly as the tasks became more
difficult, from 59.2% during low difficulty tasks, to 65.5% at moderate
difficulty, 68.2% at high and 75.0% at very high difficulty, an overall
relative increase of +27%. Users' task-critical errors and response latencies
likewise increased systematically and significantly across task difficulty
levels, corroborating the manipulation of cognitive processing
load. The adaptations seen in this study reflect users' efforts to self-manage
limitations on working memory when task complexity increases. This is
accomplished by distributing communicative information across multiple
modalities, which is compatible with a cognitive load theory of multimodal
interaction. The long-term goal of this research is the development of an
empirical foundation for proactively guiding flexible and adaptive multimodal
system design. Keywords: cognitive load, dialogue context, human performance, individual differences,
multimodal integration, multimodal interaction, speech and pen input, system
adaptation, task difficulty, unimodal interaction | |||
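The relative increases quoted above follow from the plain ratio (new - old) / old, as the quick check below reproduces.

```python
# Reproduce the +315% and +27% relative increases reported above.

def relative_increase(old_pct, new_pct):
    return 100.0 * (new_pct - old_pct) / old_pct

print(round(relative_increase(18.6, 77.1)))   # ~315: established vs. new dialogue context
print(round(relative_increase(59.2, 75.0)))   # ~27: low vs. very high task difficulty
```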
| Bimodal HCI-related affect recognition | | BIBAK | Full-Text | 137-143 | |
| Zhihong Zeng; Jilin Tu; Ming Liu; Tong Zhang; Nicholas Rizzolo; Zhenqiu Zhang; Thomas S. Huang; Dan Roth; Stephen Levinson | |||
| Perhaps the most fundamental application of affective computing will be
Human-Computer Interaction (HCI) in which the computer should have the ability
to detect and track the user's affective states and provide corresponding
feedback. The human multi-sensor affect system sets the expectations for a
multimodal affect analyzer. In this paper, we present our efforts toward
audio-visual HCI-related affect recognition. With HCI applications in mind, we
take into account some special affective states which indicate users'
cognitive/motivational states. Because a facial expression is influenced by both
affective state and speech content, we apply a smoothing method to extract
affective-state information from facial features.
In our fusion stage, a voting method is applied to combine audio and visual
modalities so that the final affect recognition accuracy is greatly improved.
We test our bimodal affect recognition approach on 38 subjects with 11
HCI-related affect states. The extensive experimental results show that the
average person-dependent affect recognition accuracy is almost 90% for our
bimodal fusion. Keywords: affect recognition, affective computing, emotion recognition, multimodal
human-computer interaction | |||
| Identifying the addressee in human-human-robot interactions based on head pose and speech | | BIBAK | Full-Text | 144-151 | |
| Michael Katzenmaier; Rainer Stiefelhagen; Tanja Schultz | |||
| In this work we investigate the power of acoustic and visual cues, and their
combination, to identify the addressee in a human-human-robot interaction.
Based on eighteen audio-visual recordings of two human beings and a (simulated)
robot we discriminate the interaction of the two humans from the interaction of
one human with the robot. The paper compares the results of three approaches.
The first approach uses purely acoustic cues to find the addressee. Both
low-level, feature-based cues and higher-level cues are examined. In the second
approach we test whether the human's head pose is a suitable cue. Our results
show that visually estimated head pose is a more reliable cue for the
identification of the addressee in the human-human-robot interaction. In the
third approach we combine the acoustic and visual cues which results in
significant improvements. Keywords: attentive interfaces, focus of attention, head pose estimation, human-robot
interaction, multimodal interfaces, speech recognition | |||
| Articulatory features for robust visual speech recognition | | BIBAK | Full-Text | 152-158 | |
| Kate Saenko; Trevor Darrell; James R. Glass | |||
| Visual information has been shown to improve the performance of speech
recognition systems in noisy acoustic environments. However, most audio-visual
speech recognizers rely on a clean visual signal. In this paper, we explore a
novel approach to visual speech modeling, based on articulatory features, which
has potential benefits under visually challenging conditions. The idea is to
use a set of parallel classifiers to extract different articulatory attributes
from the input images, and then combine their decisions to obtain higher-level
units, such as visemes or words. We evaluate our approach in a preliminary
experiment on a small audio-visual database, using several image noise
conditions, and compare it to the standard viseme-based modeling approach. Keywords: articulatory features, audio-visual speech recognition, multimodal
interfaces, speechreading, visual feature extraction | |||
| M/ORIS: a medical/operating room interaction system | | BIBAK | Full-Text | 159-166 | |
| Sébastien Grange; Terrence Fong; Charles Baur | |||
| We propose an architecture for a real-time multimodal system, which provides
non-contact, adaptive user interfacing for Computer-Assisted Surgery (CAS). The
system, called M/ORIS (for Medical/Operating Room Interaction System) combines
gesture interpretation as an explicit interaction modality with continuous,
real-time monitoring of the surgical activity in order to automatically address
the surgeon's needs. Such a system will help reduce a surgeon's workload and
operation time. This paper focuses on the proposed activity monitoring aspect
of M/ORIS. We analyze the issues of Human-Computer Interaction in an OR based
on real-world case studies. We then describe how we intend to address these
issues by combining a surgical procedure description with parameters gathered
from vision-based surgeon tracking and other OR sensors (e.g. tool trackers).
We call this approach Scenario-based Activity Monitoring (SAM). We finally
present preliminary results, including a non-contact mouse interface for
surgical navigation systems. Keywords: CAS, HCI, medical user interfaces, multimodal interaction | |||
| Modality fusion for graphic design applications | | BIBAK | Full-Text | 167-174 | |
| André D. Milota | |||
| Users must enter a complex mix of spatial and abstract information when
operating a graphic design application. Speech / language provides a fluid and
natural method for specifying abstract information while a spatial input device
is often most intuitive for the entry of spatial information. Thus, the
combined speech / gesture interface is ideally suited to this application
domain. While some research has been conducted on multimodal graphic design
applications, advanced research on modality fusion has typically focused on map
related applications. This paper considers the particular demands of graphic
design applications and what impact these demands will have on the general
strategies employed when combining the speech and gesture channels. We also
describe initial work on our own multimodal graphic design application (DPD)
which uses these strategies. Keywords: graphic design, modality fusion, multimodal interface, pen interface, speech
input | |||
| Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures | | BIBAK | Full-Text | 175-182 | |
| Hartwig Holzapfel; Kai Nickel; Rainer Stiefelhagen | |||
| This paper presents an architecture for fusion of multimodal input streams
for natural interaction with a humanoid robot as well as results from a user
study with our system. The presented fusion architecture consists of an
application independent parser of input events, and application specific rules.
In the presented user study, people could interact with a robot in a kitchen
scenario, using speech and gesture input. In the study, we observed that our
fusion approach is very tolerant of falsely detected pointing gestures. This is
because we use speech as the main modality and pointing gestures mainly for
disambiguation of objects. In the paper we also report on the temporal
correlation of speech and gesture events as observed in the
user study. Keywords: gesture, multimodal architectures, multimodal fusion and multisensory
integration, natural language, speech, vision | |||
| AROMA: ambient awareness through olfaction in a messaging application | | BIBAK | Full-Text | 183-190 | |
| Adam Bodnar; Richard Corbett; Dmitry Nekrasovski | |||
| This work explores the properties of different output modalities as
notification mechanisms in the context of messaging. In particular, the
olfactory (smell) modality is introduced as a potential alternative to visual
and auditory modalities for providing messaging notifications. An experiment
was performed to compare these modalities as secondary display mechanisms used
to deliver notifications to users working on a cognitively engaging primary
task. It was verified that the disruptiveness and effectiveness of
notifications varied with the notification modality. The olfactory modality was
shown to be less effective in delivering notifications than the other
modalities, but produced a less disruptive effect on user engagement in the
primary task. Our results serve as a starting point for future research into
the use of olfactory notification in messaging systems. Keywords: HCI, ambient awareness, multi-modal interfaces, notification systems,
olfactory display, user study | |||
| The virtual haptic back for palpatory training | | BIBAK | Full-Text | 191-197 | |
| Robert L. Williams II; Mayank Srivastava; John N. Howell; Robert R. Conatser Jr.; David C. Eland; Janet M. Burns; Anthony G. Chila | |||
| This paper discusses the Ohio University Virtual Haptic Back (VHB) project,
including objectives, implementation, and initial evaluations. Haptics is the
science of human tactile sensation and a haptic interface provides force and
touch feedback to the user from virtual reality. Our multimodal VHB simulation
combines high-fidelity computer graphics with haptic feedback and aural
feedback to augment training in palpatory diagnosis in osteopathic medicine,
plus related training applications in physical therapy, massage therapy,
chiropractic therapy, and other tactile fields. We use the PHANToM haptic
interface to provide position interactions by the trainee, with accompanying
force feedback to simulate the back of a live human subject in real-time. Our
simulation is intended to add a measurable, repeatable component of science to
the art of palpatory diagnosis. Based on our experiences in the lab to date, we
believe that haptics-augmented computer models have great potential for
improving training in the future, for various tactile applications. Our main
project goals are to: 1. Provide a novel tool for palpatory diagnosis training;
and 2. Improve the state-of-the-art in haptics and graphics applied to virtual
anatomy. Keywords: PHANToM, haptics, palpatory diagnosis, training, virtual haptic back | |||
| A vision-based sign language recognition system using tied-mixture density HMM | | BIBAK | Full-Text | 198-204 | |
| Liang-Guo Zhang; Yiqiang Chen; Gaolin Fang; Xilin Chen; Wen Gao | |||
| In this paper, a vision-based medium vocabulary Chinese sign language
recognition (SLR) system is presented. The proposed recognition system consists
of two modules. In the first module, techniques for robust hand detection,
background subtraction, and pupil detection are efficiently combined to
precisely extract feature information, with the aid of simple colored gloves,
in an unconstrained environment. Meanwhile, an effective and efficient
hierarchical feature description scheme with different scale features to
characterize sign language is proposed, where principal component analysis
(PCA) is employed to characterize the finger features more elaborately. In the
second module, a Tied-Mixture Density Hidden Markov Model (TMDHMM) framework for
SLR is proposed, which speeds up recognition without significant loss of
recognition accuracy compared with continuous hidden Markov models (CHMMs).
Experimental results based on 439 frequently used Chinese sign language
(CSL) words show that the proposed methods can work well for the medium
vocabulary SLR in the environment without special constraints and the
recognition accuracy is up to 92.5%. Keywords: computer vision, hidden Markov models, human-computer interaction, sign
language recognition | |||
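The PCA step mentioned above can be read, under assumptions, as projecting a raw per-frame finger descriptor onto a few principal components before the recognition stage; the sketch below uses scikit-learn and synthetic data for illustration only and does not reflect the paper's actual feature dimensions.

```python
# Hypothetical sketch of PCA-based finger feature compression.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
finger_features = rng.normal(size=(500, 20))   # 500 frames x 20 raw finger measurements (made up)

pca = PCA(n_components=6)                      # keep a handful of components
compact = pca.fit_transform(finger_features)   # (500, 6) descriptors for the recognition stage
print(compact.shape, round(pca.explained_variance_ratio_.sum(), 3))
```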
| Analysis of emotion recognition using facial expressions, speech and multimodal information | | BIBAK | Full-Text | 205-211 | |
| Carlos Busso; Zhigang Deng; Serdar Yildirim; Murtaza Bulut; Chul Min Lee; Abe Kazemzadeh; Sungbok Lee; Ulrich Neumann; Shrikanth Narayanan | |||
| The interaction between human beings and computers will be more natural if
computers are able to perceive and respond to human non-verbal communication
such as emotions. Although several approaches have been proposed to recognize
human emotions based on facial expressions or speech, relatively limited work
has been done to fuse these two, and other, modalities to improve the accuracy
and robustness of the emotion recognition system. This paper analyzes the
strengths and the limitations of systems based only on facial expressions or
acoustic information. It also discusses two approaches used to fuse these two
modalities: decision level and feature level integration. Using a database
recorded from an actress, four emotions were classified: sadness, anger,
happiness, and neutral state. By the use of markers on her face, detailed
facial motions were captured with motion capture, in conjunction with
simultaneous speech recordings. The results reveal that the system based on
facial expression gave better performance than the system based on just
acoustic information for the emotions considered. Results also show the
complementarity of the two modalities and that, when these two modalities are
fused, the performance and the robustness of the emotion recognition system
improve measurably. Keywords: PCA, SVC, affective states, decision level fusion, emotion recognition,
feature level fusion, human-computer interaction (HCI), speech, vision | |||
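The two fusion strategies contrasted above can be sketched generically: feature-level fusion concatenates modality features before a single classifier, while decision-level fusion trains one classifier per modality and combines their posteriors. The classifiers and weights below are illustrative placeholders, not the paper's setup.

```python
# Hypothetical sketch of feature-level vs. decision-level audio-visual fusion.

import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_level_fusion(face_tr, audio_tr, y_tr, face_te, audio_te):
    """Concatenate modality features and train a single classifier."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.hstack([face_tr, audio_tr]), y_tr)
    return clf.predict(np.hstack([face_te, audio_te]))

def decision_level_fusion(face_tr, audio_tr, y_tr, face_te, audio_te, w_face=0.6):
    """Train one classifier per modality and combine their class posteriors."""
    face_clf = LogisticRegression(max_iter=1000).fit(face_tr, y_tr)
    audio_clf = LogisticRegression(max_iter=1000).fit(audio_tr, y_tr)
    posterior = (w_face * face_clf.predict_proba(face_te)
                 + (1 - w_face) * audio_clf.predict_proba(audio_te))
    return face_clf.classes_[np.argmax(posterior, axis=1)]
```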
| Support for input adaptability in the ICON toolkit | | BIBAK | Full-Text | 212-219 | |
| Pierre Dragicevic; Jean-Daniel Fekete | |||
| In this paper, we introduce input adaptability as the ability of an
application to exploit alternative sets of input devices effectively and offer
users a way of adapting input interaction to suit their needs. We explain why
input adaptability must be seriously considered today and show how it is poorly
supported by current systems, applications and tools. We then describe ICon
(Input Configurator), an input toolkit that allows interactive applications to
achieve a high level of input adaptability. We present the software
architecture behind ICon, then the toolkit itself, and give several examples of
non-standard interaction techniques that are easy to build and modify using
ICon's graphical editor while being hard or impossible to support using regular
GUI toolkits. Keywords: adaptability, input devices, interaction techniques, toolkits, visual
programming | |||
| User walkthrough of multimodal access to multidimensional databases | | BIBAK | Full-Text | 220-226 | |
| M. P. van Esch-Bussemakers; A. H. M. Cremers | |||
| This paper describes a user walkthrough that was conducted with an
experimental multimodal dialogue system to access a multidimensional music
database using a simulated mobile device (including a technically challenging
four-PHANToM-setup). The main objectives of the user walkthrough were to assess
user preferences for certain modalities (speech, graphical and haptic-tactile)
to access and present certain types of information, and for certain search
strategies when searching and browsing a multidimensional database. In
addition, the project aimed at providing concrete recommendations for the
experimental setup, multimodal user interface design and evaluation. The
results show that recommendations can be formulated both on the use of
modalities and search strategies, and on the experimental setup as a whole,
including the user interface. In short, it is found that haptically enhanced
buttons are preferred for navigating or selecting and speech is preferred for
searching the database for an album or artist. A 'direct' search strategy
indicating an album, artist or genre is favorable. It can be concluded that
participants were able to look beyond the experimental setup and see the
potential of the envisioned mobile device and its modalities. Therefore it was
possible to formulate recommendations for future multimodal dialogue systems
for multidimensional database access. Keywords: guidelines, haptic-tactile, multidimensional, multimodal, speech, usability,
user walkthrough, visualization | |||
| Multimodal interaction under exerted conditions in a natural field setting | | BIBAK | Full-Text | 227-234 | |
| Sanjeev Kumar; Philip R. Cohen; Rachel Coulston | |||
| This paper evaluates the performance of a multimodal interface under exerted
conditions in a natural field setting. The subjects in the present study
engaged in a strenuous activity while multimodally performing map-based tasks
using handheld computing devices. This activity made the users breathe heavily
and become fatigued during the course of the study. We found that the
performance of both speech and gesture recognizers degraded as a function of
exertion, while the overall multimodal success rate was stable. This
stabilization is accounted for by the mutual disambiguation of modalities,
which increases significantly with exertion. The system performed better for
subjects with a greater level of physical fitness, as measured by their running
speed, with more stable multimodal performance and a later degradation of
speech and gesture recognition as compared with subjects who were less fit. The
findings presented in this paper have a significant impact on design decisions
for multimodal interfaces targeted towards highly mobile and exerted users in
field environments. Keywords: evaluation, exertion, field, mobile, multimodal interaction | |||
| A segment-based audio-visual speech recognizer: data collection, development, and initial experiments | | BIBAK | Full-Text | 235-242 | |
| Timothy J. Hazen; Kate Saenko; Chia-Hao La; James R. Glass | |||
| This paper presents the development and evaluation of a speaker-independent
audio-visual speech recognition (AVSR) system that utilizes a segment-based
modeling strategy. To support this research, we have collected a new video
corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of a total of 4 hours
of read speech collected from 223 different speakers. This new corpus was used
to evaluate our new AVSR system which incorporates a novel audio-visual
integration scheme using segment-constrained Hidden Markov Models (HMMs).
Preliminary experiments have demonstrated improvements in phonetic recognition
performance when incorporating visual information into the speech recognition
process. Keywords: audio-visual corpora, audio-visual speech recognition | |||
| A model-based approach for real-time embedded multimodal systems in military aircrafts | | BIBAK | Full-Text | 243-250 | |
| Rémi Bastide; David Navarre; Philippe Palanque; Amélie Schyn; Pierre Dragicevic | |||
| This paper presents the use of a model-based approach for the formal
description of real-time embedded multimodal systems. This modeling technique
has been used in the field of military fighter aircraft. The paper presents
the formal description technique, its application to the case study of a
multimodal command and control interface for the Rafale aircraft, and its
relationship with an architectural model for interactive systems. Keywords: embedded systems, formal description techniques, model-based approaches | |||
| ICARE software components for rapidly developing multimodal interfaces | | BIBAK | Full-Text | 251-258 | |
| Jullien Bouchet; Laurence Nigay; Thierry Ganille | |||
| Although several real multimodal systems have been built, their development
still remains a difficult task. In this paper we address this problem of
development of multimodal interfaces by describing a component-based approach,
called ICARE, for rapidly developing multimodal interfaces. ICARE stands for
Interaction-CARE (Complementarity Assignment Redundancy Equivalence). Our
component-based approach relies on two types of software components. Firstly,
ICARE elementary components include Device components and Interaction Language
components that enable us to develop pure modalities. The second type of
component, called Composition components, defines combined usages of
modalities. Reusing and assembling ICARE components enable rapid development of
multimodal interfaces. We have developed several multimodal systems using ICARE
and we illustrate the discussion using one of them: the FACET simulator of the
Rafale French military plane cockpit. Keywords: multimodal interactive systems, software components | |||
| MacVisSTA: a system for multimodal analysis | | BIBAK | Full-Text | 259-264 | |
| R. Travis Rose; Francis Quek; Yang Shi | |||
| The study of embodied communication requires access to multiple data sources
such as multistream video and audio, and various derived data and metadata such
as gesture, head, posture, facial expression, and gaze information. The common
element that runs through these data is the co-temporality of the multiple
modes of behavior. In this paper, we present the multimedia Visualization for
Situated Temporal Analysis (MacVisSTA) system for the analysis of multimodal
human communication through video, audio, speech transcriptions, and gesture
and head orientation data. The system uses a multiple linked representation
strategy in which different representations are linked by the current time
focus. In this framework, the multiple display components associated with the
disparate data types are kept in synchrony, each component serving as both a
controller of the system and a display. Hence the user is able to
analyze and manipulate the data from different analytical viewpoints (e.g.
through the time-synchronized speech transcription or through motion segments
of interest). MacVisSTA supports analysis of the synchronized data at varying
timescales. It provides an annotation interface that permits users to code the
data into 'music-score' objects, and to make and organize multimedia
observations about the data. Hence MacVisSTA integrates flexible visualization
with annotation within a single framework. An XML database manager has been
created for storage and search of annotation data. We compare the system with
other existing annotation tools with respect to functionality and interface
design. The software runs on Macintosh OS X computer systems. Keywords: embodied communication, flexible visualization and annotation, gesture,
multimodal interaction, multiple linked representation | |||
| Context based multimodal fusion | | BIBAK | Full-Text | 265-272 | |
| Norbert Pfleger | |||
| We present a generic approach to multimodal fusion which we call context
based multimodal integration. Key to this approach is that every multimodal
input event is interpreted and enriched with respect to its local turn context.
This local turn context comprises all previously recognized input events and
the dialogue state that both belong to the same user turn. We show that a
production rule system is an elegant way to handle this context based
multimodal integration and we describe a first implementation of the so-called
PATE system. Finally, we present results from a first evaluation of this
approach as part of a human-factors experiment with the COMIC system. Keywords: fusion, multimodal dialogue systems, multimodal integration, speech and pen
input | |||
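The production-rule flavor of this integration can be illustrated with a toy rule that resolves a deictic reference in a speech event against an earlier pointing event from the same turn; the event and rule structures below are invented for illustration and do not mirror PATE's internals.

```python
# Hypothetical sketch of context-based multimodal integration with production rules.

def rule_resolve_deictic(event, context):
    """Bind an unresolved spoken referent to the most recent pointing target in this turn."""
    if event.get("type") == "speech" and event.get("referent") is None:
        pointing = next((e for e in reversed(context) if e.get("type") == "pointing"), None)
        if pointing is not None:
            return {**event, "referent": pointing["target"]}
    return event

def integrate_turn(events, rules=(rule_resolve_deictic,)):
    context, interpreted = [], []
    for event in events:
        for rule in rules:
            event = rule(event, context)   # enrich the event w.r.t. the local turn context
        context.append(event)
        interpreted.append(event)
    return interpreted

turn = [{"type": "pointing", "target": "teapot"},
        {"type": "speech", "utterance": "put that one away", "referent": None}]
print(integrate_turn(turn)[-1]["referent"])   # -> teapot
```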
| Emotional Chinese talking head system | | BIBAK | Full-Text | 273-280 | |
| Jianhua Tao; Tieniu Tan | |||
| A natural human-computer interface requires the integration of realistic audio
and visual information for perception and display. In this paper, a lifelike
talking head system is proposed. The system converts text to speech with
synchronized animation of mouth movements and emotion expression. The talking
head is based on a generic 3D human head model. The personalized model is
incorporated into the system. With texture mapping, the personalized model
offers a more natural and realistic look than the generic model. To express
emotion, both emotional speech synthesis and emotional facial animation are
integrated and Chinese viseme models are also created in the paper. Finally,
the emotional talking head system is created to generate natural and vivid
audio-visual output. Keywords: emotion, facial animation, speech synthesis, talking head | |||
| Experiences on haptic interfaces for visually impaired young children | | BIBAK | Full-Text | 281-288 | |
| Saija Patomäki; Roope Raisamo; Jouni Salo; Virpi Pasto; Arto Hippula | |||
| Visually impaired children do not have equal opportunities to learn and play
compared to sighted children. Computers have a great potential to correct this
problem. In this paper we present a series of studies where multimodal
applications were designed for a group of eleven visually impaired children
aged from 3.5 to 7.5 years. We also present our testing procedure specially
adapted for visually impaired young children. During the two-year project it
became clear that, with careful design of the tasks and proper use of haptic
and auditory features, usable computing environments can be created for visually
impaired children. Keywords: Phantom, blind children, haptic environment, haptic feedback, learning,
visually impaired children | |||
| Visual touchpad: a two-handed gestural input device | | BIBAK | Full-Text | 289-296 | |
| Shahzad Malik; Joe Laszlo | |||
| This paper presents the Visual Touchpad, a low-cost vision-based input
device that allows for fluid two-handed interactions with desktop PCs, laptops,
public kiosks, or large wall displays. Two downward-pointing cameras are
attached above a planar surface, and a stereo hand tracking system provides the
3D positions of a user's fingertips on and above the plane. Thus the planar
surface can be used as a multi-point touch-sensitive device, but with the added
ability to also detect hand gestures hovering above the surface. Additionally,
the hand tracker not only provides positional information for the fingertips
but also finger orientations. A variety of one- and two-handed multi-finger
gestural interaction techniques are then presented that exploit the affordances
of the hand tracker. Further, by segmenting the hand regions from the video
images and then augmenting them transparently into a graphical interface, our
system provides a compelling direct manipulation experience without the need
for more expensive tabletop displays or touch-screens, and with significantly
less self-occlusion. Keywords: augmented reality, computer vision, direct manipulation, fluid interaction,
gestures, hand tracking, perceptual user interface, two hand, virtual keyboard,
virtual mouse, visual touchpad | |||
| An evaluation of virtual human technology in informational kiosks | | BIBAK | Full-Text | 297-302 | |
| Curry Guinn; Rob Hubal | |||
| In this paper, we look at the results of using spoken language interactive
virtual characters in information kiosks. Users interact with synthetic
spokespeople using spoken natural language dialogue. The virtual characters
respond with spoken language, body and facial gesture, and graphical images on
the screen. We present findings from studies of three different information
kiosk applications. As we developed successive kiosks, we applied lessons
learned from previous kiosks to improve system performance. For each setting,
we briefly describe the application, the participants, and the results, with
specific focus on how we increased user participation and improved
informational throughput. We tie the results together in a lessons learned
section. Keywords: evaluation, gesture, natural language, spoken dialogue system, virtual
humans, virtual reality | |||
| Software infrastructure for multi-modal virtual environments | | BIBAK | Full-Text | 303-308 | |
| Brian Goldiez; Glenn Martin; Jason Daly; Donald Washburn; Todd Lazarus | |||
| Virtual environment systems, especially those supporting multi-modal
interactions require a robust and flexible software infrastructure that
supports a wide range of devices, interaction techniques, and target
applications. In addition to interactivity needs, a key factor of robustness of
the software is the minimization of latency and, more importantly, the reduction of
jitter (the variability of latency). This paper presents a flexible software
infrastructure that has demonstrated robustness in initial prototyping. The
infrastructure, based on the VESS Libraries from the University of Central
Florida, simplifies the task of creating multi-modal virtual environments. Our
extensions to VESS include numerous features to support new input and output
devices for new sensory modalities and interaction techniques, as well as some
control over latency and jitter. Keywords: augmented environments, haptics, latency, multi-modal interfaces, olfaction,
software infrastructure, virtual environments | |||
| GroupMedia: distributed multi-modal interfaces | | BIBAK | Full-Text | 309-316 | |
| Anmol Madan; Ron Caneel; Alex Sandy Pentland | |||
| In this paper, we describe the GroupMedia system, which uses wireless
wearable computers to measure audio features, head-movement, and galvanic skin
response (GSR) for dyads and groups of interacting people. These group sensor
measurements are then used to build a real-time group interest index. The group
interest index can be used to control group displays, annotate the group
discussion for later retrieval, and even to modulate and guide the group
discussion itself. We explore three different situations where this system has
been introduced, and report experimental results. Keywords: galvanic skin response, head nodding, human behavior, influence model,
interest, prosody, speech features | |||
| Agent and library augmented shared knowledge areas (ALASKA) | | BIBAK | Full-Text | 317-318 | |
| Eric R. Hamilton | |||
| This paper reports on an NSF-funded effort now underway to integrate three
learning technologies that have emerged and matured over the past decade; each
has presented compelling and oftentimes moving opportunities to alter
educational practice and to render learning more effective. The project seeks a
novel way to blend these technologies and to create and test a new model for
human-machine partnership in learning settings. The innovation we are
prototyping in this project creates an applet-rich shared space whereby a
pedagogical agent at each learner's station functions as an instructional
assistant to the teacher or professor and tutor to the student. The platform is
intended to open a series of new -- and instructionally potent -- interactive
pathways. Keywords: animated agents, applets, collaborative workspace, heterogeneous network,
multi-tier system, pedagogical agents | |||
| MULTIFACE: multimodal content adaptations for heterogeneous devices | | BIBAK | Full-Text | 319-320 | |
| Songsak Channarukul; Susan W. McRoy; Syed S. Ali | |||
| We are interested in applying and extending existing frameworks for
combining output modalities for adaptations of multimodal content on
heterogeneous devices based on user and device models. In this paper, we
present Multiface, a multimodal dialog system that allows users to interact
using different devices such as desktop computers, PDAs, and mobile phones. The
presented content and its modality will be customized to individual users and
the device they are using. Keywords: device-centered adaptation, dialog system, multimodal output, user-centered
adaptation | |||
| Command and control resource performance predictor (C2RP2) | | BIBK | Full-Text | 321-322 | |
| Joseph M. Dalton; Ali Ahmad; Kay Stanney | |||
Keywords: applet, command and control, predictor | |||
| A multi-modal architecture for cellular phones | | BIBK | Full-Text | 323-324 | |
| Luca Nardelli; Marco Orlandi; Daniele Falavigna | |||
Keywords: VoiceXML, automatic speech recognition, mobile devices, multimodality | |||
| 'SlidingMap': introducing and evaluating a new modality for map interaction | | BIBAK | Full-Text | 325-326 | |
| Matthias Merdes; Jochen Häußler; Matthias Jöst | |||
| In this paper, we describe the concept of a new modality for interaction
with digital maps. We propose using inclination as a means for panning maps on
a mobile computing device, namely a tablet PC. The result is a map which is
both physically transportable and manipulable with very simple and
natural hand movements. We describe a setup for comparing this new modality
with the better known modalities of pen-based and joystick-based interaction.
Apart from demonstrating the new modality we plan to perform a short
evaluation. Keywords: inclination modality, map interaction, mobile systems | |||
| Multimodal interaction for distributed collaboration | | BIBAK | Full-Text | 327-328 | |
| Levent Bolelli; Guoray Cai; Hongmei Wang; Bita Mortazavi; Ingmar Rauschert; Sven Fuhrmann; Rajeev Sharma; Alan MacEachren | |||
| We demonstrate a same-time different-place collaboration system for managing
crisis situations using geospatial information. Our system enables distributed
spatial decision-making by providing a multimodal interface to team members.
Decision makers in front of large screen displays and/or desktop computers, and
emergency responders in the field with tablet PCs can engage in collaborative
activities for situation assessment and emergency response. Keywords: GIS, geocollaboration, interactive maps, multimodal interfaces | |||
| A multimodal learning interface for sketch, speak and point creation of a schedule chart | | BIBAK | Full-Text | 329-330 | |
| Ed Kaiser; David Demirdjian; Alexander Gruenstein; Xiaoguang Li; John Niekrasz; Matt Wesson; Sanjeev Kumar | |||
| We present a video demonstration of an agent-based test bed application for
ongoing research into multi-user, multimodal, computer-assisted meetings. The
system tracks a two person scheduling meeting: one person standing at a touch
sensitive whiteboard creating a Gantt chart, while another person looks on in
view of a calibrated stereo camera. The stereo camera performs real-time,
untethered, vision-based tracking of the onlooker's head, torso and limb
movements, which in turn are routed to a 3D-gesture recognition agent. Using
speech, 3D deictic gesture and 2D object de-referencing the system is able to
track the onlooker's suggestion to move a specific milestone. The system also
has a speech recognition agent capable of recognizing out-of-vocabulary (OOV)
words as phonetic sequences. Thus when a user at the whiteboard speaks an OOV
label name for a chart constituent while also writing it, the OOV speech is
combined with letter sequences hypothesized by the handwriting recognizer to
yield an orthography, pronunciation and semantics for the new label. These are
then learned dynamically by the system and become immediately available for
future recognition. Keywords: multimodal interaction, vision-based body-tracking, vocabulary learning | |||
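One way to picture the fusion step for OOV labels (a hedged illustration only, not this system's actual algorithm) is to score each handwriting hypothesis against the recognized phone sequence via a naive letter-to-phone expansion and string similarity; the phone inventory, mapping table, and hypotheses below are invented for the example.

```python
# Illustrative sketch: pick the handwriting hypothesis whose naive phone
# expansion best matches the OOV phone string from the speech recognizer.
from difflib import SequenceMatcher

# Hypothetical, highly simplified letter-to-phone table.
LETTER_TO_PHONES = {"c": ["k"], "k": ["k"], "a": ["ae"], "i": ["ih"],
                    "s": ["s"], "e": ["eh"], "r": ["r"], "o": ["ow"], "n": ["n"]}

def naive_phones(word):
    """Expand a written word into a rough phone sequence, letter by letter."""
    return [p for ch in word for p in LETTER_TO_PHONES.get(ch, [ch])]

def pick_orthography(handwriting_hyps, oov_phones):
    """Return the handwriting hypothesis whose expansion best matches the phones."""
    def score(hyp):
        return SequenceMatcher(None, naive_phones(hyp), oov_phones).ratio()
    return max(handwriting_hyps, key=score)

# e.g. the user writes and says the new milestone label "kaiser"
print(pick_orthography(["kaiser", "koiser", "kaisen"],
                       ["k", "ae", "ih", "s", "eh", "r"]))   # -> 'kaiser'
```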
| Real-time audio-visual tracking for meeting analysis | | BIBAK | Full-Text | 331-332 | |
| David Demirdjian; Kevin Wilson; Michael Siracusa; Trevor Darrell | |||
| We demonstrate an audio-visual tracking system for meeting analysis. A
stereo camera and a microphone array are used to track multiple people and
their speech activity in real time. Our system can estimate the location of
multiple people, detect the current speaker and build a model of interaction
between people in a meeting. Keywords: microphone array, speaker localization, stereo, tracking | |||
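As an illustration of one ingredient such a system needs, the sketch below estimates the time difference of arrival between two microphone channels with GCC-PHAT, from which a speaker bearing can be derived given the microphone spacing; this is a generic technique, not the authors' implementation.

```python
# Generic GCC-PHAT time-delay estimation between two microphone signals.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Return the estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)          # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Synthetic check: a signal delayed by 20 samples at 16 kHz.
fs = 16000
rng = np.random.default_rng(1)
ref = rng.normal(size=4096)
sig = np.concatenate((np.zeros(20), ref[:-20]))
print(round(gcc_phat(sig, ref, fs) * fs))                    # ~ 20 samples
```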
| Collaboration in parallel worlds | | BIBAK | Full-Text | 333-334 | |
| Ashutosh Morde; Jun Hou; S. Kicha Ganapathy; Carlos Correa; Allan Krebs; Lawrence Rabiner | |||
| We present a novel paradigm for human-to-human asymmetric collaboration.
There is a need for people at geographically separate locations to seamlessly
collaborate in real time as if they were physically co-located. In our system
one user (novice) works in the real world and the other user (expert) works in
a parallel virtual world. They are assisted in this task by an Intelligent
Agent (IA) with considerable knowledge about the environment. Current
tele-collaboration systems deal primarily with collaboration purely in the real
or virtual worlds. The use of a combination of virtual and real worlds allows
us to leverage the advantages of both worlds. Keywords: augmented reality, collaboration, distributed systems, intelligent agents,
registration, virtual reality | |||
| Segmentation and classification of meetings using multiple information streams | | BIBAK | Full-Text | 335-336 | |
| Paul E. Rybski; Satanjeev Banerjee; Fernando de la Torre; Carlos Vallespi; Alexander I. Rudnicky; Manuela Veloso | |||
| We present a meeting recorder infrastructure used to record and annotate
events that occur in meetings. Multiple data streams are recorded and analyzed
in order to infer a higher-level state of the group's activities. We describe
the hardware and software systems used to capture people's activities as well
as the methods used to characterize them. Keywords: meeting understanding, multi-modal interfaces | |||
| A maximum entropy based approach for multimodal integration | | BIBAK | Full-Text | 337-338 | |
| Péter Pál Boda | |||
| Integration of various user input channels for a multimodal interface is not
just an engineering problem. To fully understand users in the context of an
application and the current session, solutions are sought that process
information from intentional (i.e., user-originated) sources as well as from passively available sources in a uniform manner. As a first step towards this
goal, the work demonstrated here investigates how intentional user input (e.g.
speech, gesture) can be seamlessly combined to provide a single semantic
interpretation of the user input. For this classical multimodal integration problem, the maximum entropy approach is demonstrated, achieving 76.52% integration accuracy for the first candidate and 86.77% for the top 3-best candidates. The
paper also exhibits the process that generates multimodal data for training the
statistical integrator, using transcribed speech from MIT's Voyager
application. The quality of the generated data is assessed by comparing it with real inputs to the multimodal version of Voyager. Keywords: machine learning, maximum entropy, multimodal database, multimodal
integration, virtual modality | |||
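As a rough illustration of the kind of statistical integrator described above (not the author's implementation), the following sketch realizes a maximum-entropy classifier as multinomial logistic regression over hand-built features from paired speech and gesture hypotheses; the feature names, toy corpus, and interpretation labels are all invented for the example.

```python
# Illustrative maximum-entropy (multinomial logistic regression) integrator.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def fuse(speech_concept, gesture_type, time_gap):
    """Turn one speech/gesture hypothesis pair into a sparse feature dict."""
    return {
        "speech=" + speech_concept: 1.0,
        "gesture=" + gesture_type: 1.0,
        "pair=" + speech_concept + "+" + gesture_type: 1.0,
        "time_gap": time_gap,
    }

# Toy training corpus: (speech concept, gesture type, lag in s) -> interpretation
train = [
    (("show_route", "point", 0.2), "ROUTE_TO_POINT"),
    (("show_route", "circle", 0.3), "ROUTE_IN_AREA"),
    (("what_is", "point", 0.1), "IDENTIFY_OBJECT"),
    (("zoom", "circle", 0.4), "ZOOM_TO_AREA"),
]
vec = DictVectorizer()
X = vec.fit_transform([fuse(*inp) for inp, _ in train])
y = [label for _, label in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rank interpretations for a new multimodal input and keep a 3-best list.
probs = clf.predict_proba(vec.transform([fuse("show_route", "point", 0.25)]))[0]
print(sorted(zip(clf.classes_, probs), key=lambda p: -p[1])[:3])
```

Ranking the classifier's posterior probabilities in this way yields the n-best interpretation list to which the reported 1-best and 3-best accuracies refer.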
| Multimodal interface platform for geographical information systems (GeoMIP) in crisis management | | BIBAK | Full-Text | 339-340 | |
| Pyush Agrawal; Ingmar Rauschert; Keerati Inochanon; Levent Bolelli; Sven Fuhrmann; Isaac Brewer; Guoray Cai; Alan MacEachren; Rajeev Sharma | |||
| A novel interface system for accessing geospatial data (GeoMIP) has been
developed that realizes a user-centered multimodal speech/gesture interface for
addressing some of the critical needs in crisis management. In this system we
primarily developed vision sensing algorithms, speech integration,
multimodality fusion, and rule-based mapping of multimodal user input to GIS
database queries. A demo system of this interface has been developed for the
Port Authority NJ/NY and is explained here. Keywords: GIS, collaboration, human-centered design, interactive maps, multimodal
human-computer-interface, speech/gesture recognition | |||
| Adaptations of multimodal content in dialog systems targeting heterogeneous devices | | BIBAK | Full-Text | 341 | |
| Songsak Channarukul | |||
| Dialog systems that adapt to different user needs and preferences
appropriately have been shown to achieve higher levels of user satisfaction
[4]. However, it is also important that dialog systems be able to adapt to the
user's computing environment, because people are able to access computer
systems using different kinds of devices such as desktop computers, personal
digital assistants, and cellular telephones. Each of these devices has a
distinct set of physical capabilities, as well as a distinct set of functions
for which it is typically used.
Existing research on adaptation in both hypermedia and dialog systems has focused on how to customize content based on user models [2, 4] and interaction history. Some researchers have also investigated device-centered adaptations that range from low-level adaptations such as conversion of multimedia objects [6] (e.g., video to images, audio to text, image size reduction) to higher-level adaptations based on multimedia document models [1] and frameworks for combining output modalities [3, 5]. However, to my knowledge, no work has been done on integrating and coordinating both types of adaptation interdependently. The primary problem I would like to address in this thesis is how multimodal dialog systems can adapt their content and style of interaction, taking the user, the device, and the dependency between them into account. Two main aspects of adaptability that my thesis considers are: (1) adaptability in content presentation and communication and (2) adaptability in computational strategies used to achieve the system's and the user's goals. Besides general user modeling questions such as how to acquire information about the user and construct a user model, this thesis also considers other issues that deal with device modeling, such as (1) how can the system employ user and device models to adapt the content and determine the right combination of modalities effectively? (2) how can the system determine the right combination of multimodal contents that best suits the device? (3) how can one model the characteristics and constraints of devices? and (4) is it possible to generalize device models based on modalities rather than on their typical categories or physical appearance? Keywords: device-centered adaptation, dialog system, multimodal output, user-centered
adaptation | |||
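Purely as an illustration of the interdependent adaptation the thesis argues for (not Multiface's actual design), the sketch below lets a device model and a user model jointly filter and rank candidate output modalities; all model fields and the selection rule are assumptions.

```python
# Hypothetical user/device models driving output-modality selection.
from dataclasses import dataclass, field

@dataclass
class DeviceModel:
    modalities: set            # e.g. {"text", "speech", "image"}
    screen_chars: int = 0      # rough text capacity per turn

@dataclass
class UserModel:
    preferred: list = field(default_factory=list)   # ordered modality preferences
    eyes_busy: bool = False

def choose_modalities(device: DeviceModel, user: UserModel, content_len: int):
    """Keep modalities the device supports and the user can attend to, then rank."""
    candidates = [m for m in device.modalities
                  if not (m == "text" and content_len > device.screen_chars)
                  and not (m in ("text", "image") and user.eyes_busy)]
    return sorted(candidates,
                  key=lambda m: user.preferred.index(m) if m in user.preferred else 99)

phone = DeviceModel(modalities={"text", "speech"}, screen_chars=160)
driver = UserModel(preferred=["speech", "text"], eyes_busy=True)
print(choose_modalities(phone, driver, content_len=400))     # -> ['speech']
```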
| Utilizing gestures to better understand dynamic structure of human communication | | BIBAK | Full-Text | 342 | |
| Lei Chen | |||
| Motivation: Many researchers have highlighted the importance of gesture in
natural human communication. McNeill [4] puts forward the hypothesis that
gesture and speech stem from the same mental process and so tend to be both
temporally and semantically related. However, in contrast to speech, which
surfaces as a linear progression of segments, sounds, and words, gestures
appear to be nonlinear, holistic, and imagistic. Gesture adds an important
dimension to language understanding due to this property of sharing a common
origin with speech while using a very different mechanism for transferring
information. Ignoring this information when constructing a model of human
communication would limit its potential effectiveness.
Goal and Method: This thesis concerns the development of methods to effectively incorporate gestural information from a human communication into a computer model to more accurately interpret the content and structure of that communication. Levelt [5] suggests that structure in human communication stems from the dynamic conscious process of language production, during which a conversant organizes the concepts to be expressed, plans the discourse, and selects appropriate words, prosody, and gestures while also correcting errors that occur in this process. Clues related to this conscious processing emerge in both the final speech stream and gestures. This thesis will attempt to utilize these clues to determine the structural elements of human-to-human dialogs, including sentence boundaries, topic boundaries, and disfluency structure. For this purpose, the data driven approach is used. This work requires three important components: corpus generation, feature extraction, and model construction. Previous Work: Some work related to each of these components has already been conducted. A data collection and processing protocol for constructing multimodal corpora has been created; details on the video and audio processing can be found in the Data and Annotation section of [3]. To improve the speed of producing a corpus while maintaining its quality, we have surveyed factors impacting the accuracy of forced alignments of transcriptions to audio files [2]. These alignments provide a crucial temporal synchronization between video events and spoken words (and their components) for this research effort. We have also conducted measurement studies in an attempt to understand how to model multimodal conversations. For example, we have investigated the types of gesture patterns that occur during speech repairs [1]. Recently, we constructed a preliminary model combining speech and gesture features for detecting sentence boundaries in videotaped dialogs. This model combines language and prosody models together with a simple gestural model to more effectively detect sentence boundaries [3]. Future Work: To date, our multimodal corpora involve human monologues and dialogues (see http://vislab.cs.wright.edu/kdi). We are participating in the collection and preparation of a corpus of multi-party meetings (see http://vislab.cs.wright.edu/Projects/Meeting-Analysis). To facilitate the multi-channel audio processing, we are constructing a tool to support accurate audio transcription and alignment. The data from this meeting corpus will enable the development of more sophisticated gesture models allowing us to expand the set of gesture features (e.g., spatial properties of the tracked gestures). Additionally, we will investigate more advanced machine learning methods in an attempt to improve the performance of our models. We also plan to expand our models to phenomena such as topic segmentation. Keywords: dialog, gesture, language models, multimodal fusion, prosody, sentence
boundary detection | |||
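To make the idea of combining language, prosody, and gesture evidence concrete, here is a minimal hypothetical sketch (not the thesis model): per-boundary scores from three separate models are fused log-linearly and thresholded; the weights, threshold, and probabilities are placeholders.

```python
# Toy log-linear fusion of language, prosody, and gesture boundary scores.
import math

def is_sentence_boundary(lm_prob, prosody_prob, gesture_prob,
                         weights=(1.0, 0.7, 0.4), threshold=-1.5):
    """Return True if the weighted log-score says 'sentence boundary here'."""
    score = (weights[0] * math.log(lm_prob + 1e-9)
             + weights[1] * math.log(prosody_prob + 1e-9)
             + weights[2] * math.log(gesture_prob + 1e-9))
    return score > threshold

# Example: a strong pause (prosody) and a gesture retraction at this word boundary.
print(is_sentence_boundary(lm_prob=0.4, prosody_prob=0.8, gesture_prob=0.7))
```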
| Multimodal programming for dyslexic students | | BIBA | Full-Text | 343 | |
| Dale-Marie Wilson | |||
| As the Web's role in society increases, so too does the need for its
universality. Access to the Web by all, including people with disabilities, has
become a requirement of Web sites, as can be seen in the passing of the
Americans with Disabilities Act in 1990. This universality has spilled over
into other disciplines, e.g. screen readers for Web browsing; however, Computer
Science has not yet made significant efforts to do the same. The main focus of
this research is to provide this universal access in the development of virtual
learning environments, more specifically in computer programming. To facilitate
this access, research into the features of dyslexia is required: what it is,
how it affects a person's thought process, and what changes are necessary to
accommodate these effects. Also, a complete understanding of the thought process
involved in creating a complete computer program is necessary.
Dyslexia has been diagnosed as affecting the left side of the brain. The left side of the brain processes information in a linear, sequential manner. It is also responsible for processing symbols, which include letters, words and mathematical notations. Thus dyslexics have problems with the code generation, analysis and implementation steps in the creation of a computer program. A potential solution to this problem is a multimodal programming environment. This multimodal environment will be interactive, providing multimodal assistance to users as they generate, analyze and implement code. This assistance will include adding functions and loops via voice and receiving a spoken description of a code segment selected with the cursor. | |||
| Gestural cues for speech understanding | | BIBK | Full-Text | 344 | |
| Jacob Eisenstein | |||
Keywords: multimodal natural language processing | |||
| Using language structure for adaptive multimodal language acquisition | | BIBAK | Full-Text | 345 | |
| Rajesh Chandrasekaran | |||
| In human spoken communication, language structure plays a vital role in
providing a framework for humans to understand each other. Using language
rules, words are combined into meaningful sentences to represent knowledge.
Speech-enabled systems based on a pre-programmed Rule Grammar suffer from
constraints on vocabulary and sentence structures. To address this problem, in
this paper, we discuss a language acquisition system that is capable of
learning new words and their corresponding semantic meaning by initiating an
adaptive dialog with the user. Thus, the vocabulary of the system can be
increased in real time by the user. The language acquisition system is provided
knowledge about language structure and is capable of accepting multimodal user
inputs that include speech, touch, pen-tablet, mouse, and keyboard. We discuss
the efficiency of learning new concepts and the ease with which users can teach
the system new concepts.
The multimodal language acquisition system is capable of acquiring, in real time, new words that pertain to objects, actions or attributes and their corresponding meanings. The first step in this process is to detect unknown words in the spoken utterance. Any new word that is detected is classified into one of the above mentioned categories. The second step is to learn from the user the meaning of the word and add it to the semantic database. An unknown word is flagged whenever an utterance is not consistent with the pre-programmed Rule Grammar. Because the system can acquire words pertaining to objects, actions or attributes, we are interested in words that are nouns, verbs or adjectives. We use a transformation based part-of-speech tagger that is capable of annotating English words with their part-of-speech to identify words in the utterance that are nouns, verbs and adjectives. These words are searched in the semantic database and unknown words are identified. The system then initiates an adaptive dialog with the user, requesting the user to provide the meaning of the unknown word. When the user has provided the relevant meaning using any of the input modalities, the system checks whether the meaning given corresponds to the category of the word, i.e. if the unknown word is a noun then the user can associate only an object with it or if the unknown word is a verb then only an action can be associated with the word. Thus, the system uses the knowledge of the occurrence of the word in the sentence to determine what kind of meaning can be associated with the word. The language structure thus gives the system a basic knowledge of the unknown word. Keywords: adaptive dialogue systems, computer language learning, language acquisition,
language structure | |||
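A minimal sketch of the detection-and-grounding loop described above, assuming NLTK's off-the-shelf tokenizer and part-of-speech tagger and a toy semantic database; it is illustrative only and not the actual system.

```python
# Flag out-of-vocabulary nouns/verbs/adjectives and constrain their grounding.
# Requires nltk with its tokenizer and tagger data downloaded.
import nltk

SEMANTICS = {"move": "ACTION", "box": "OBJECT", "red": "ATTRIBUTE"}   # toy database
CATEGORY_OF_TAG = {"NN": "OBJECT", "NNS": "OBJECT", "VB": "ACTION",
                   "VBD": "ACTION", "VBP": "ACTION", "JJ": "ATTRIBUTE"}

def unknown_words(utterance):
    """Return (word, expected category) pairs that need a grounding dialog."""
    tagged = nltk.pos_tag(nltk.word_tokenize(utterance))
    return [(w.lower(), CATEGORY_OF_TAG[t]) for w, t in tagged
            if t in CATEGORY_OF_TAG and w.lower() not in SEMANTICS]

def learn(word, category, meaning_category):
    """Accept a user-supplied meaning only if it matches the word's category."""
    if meaning_category != category:
        raise ValueError(word + " should be grounded as a " + category)
    SEMANTICS[word] = category        # the real system would store full semantics
    return SEMANTICS

print(unknown_words("move the blue crate"))   # e.g. [('blue', 'ATTRIBUTE'), ('crate', 'OBJECT')]
```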
| Private speech during multimodal human-computer interaction | | BIBK | Full-Text | 346 | |
| Rebecca Lunsford | |||
Keywords: cognitive load, human performance, individual differences, multimodal
interaction, self-regulatory language, senior users, speaker variability,
system adaptation, task difficulty | |||
| Projection augmented models: the effect of haptic feedback on subjective and objective human factors | | BIBK | Full-Text | 347 | |
| Emily Bennett | |||
Keywords: haptic feedback, projection augmented models | |||
| Multimodal interface design for multimodal meeting content retrieval | | BIBAK | Full-Text | 348 | |
| Agnes Lisowska | |||
| This thesis will investigate which modalities, and in which combinations,
are best suited for use in a multimodal interface that allows users to retrieve
the content of recorded and processed multimodal meetings. The dual role of
multimodality in the system (present in both the interface and the stored data)
poses additional challenges. We will extend and adapt established approaches to
HCI and multimodality [2, 3] to this new domain, maintaining a strongly
user-driven approach to design. Keywords: Wizard of Oz, multimodal interface, multimodal meetings | |||
| Determining efficient multimodal information-interaction spaces for C2 systems | | BIBAK | Full-Text | 349 | |
| Leah M. Reeves | |||
| Military operations and friendly fire mishaps over the last decade have
demonstrated that Command, Control, Communications, Computers, Intelligence,
Surveillance, and Reconnaissance (C4ISR) systems may often lack the ability to
efficiently and effectively support operations in complex, time-critical
environments. With the vast increase in the amount and type of information
available, the challenge for today's military system designers is to create
interfaces that allow warfighters to proficiently process the optimal amount of
mission-essential data [1]. To meet this challenge, multimodal system
technology shows great promise because, as the technology that supports
C4ISR systems advances, it becomes possible to leverage all of the human
sensory systems. The implication is that by facilitating the efficient
use of a C4ISR operator's multiple information processing resources,
substantial gains in the information management capacity of the
warfighter-computer integral may be realized [2]. Despite its great promise,
however, the potential of multimodal technology as a tool for streamlining
interaction within military C4ISR environments may not be fully realized until
the following guiding principles are identified:
* how to combine visualization and multisensory display techniques for given users, tasks, and problem domains;
* how task attributes should be represented (e.g., via which modality, or via multiple modalities);
* which multimodal interaction technique(s) is most appropriate.
Due to the current lack of empirical evidence and principle-driven guidelines, designers often encounter difficulties when choosing the most appropriate modal interaction techniques for given users, applications, or specific military command and control (C2) tasks within C4ISR systems. The implication is that inefficient multimodal C2 system design may hinder our military's ability to fully support operations in complex, time-critical environments and thus impede warfighters' ability to achieve accurate situational awareness (SA) in a timely manner [1]. Consequently, warfighters often become overwhelmed when provided with more information than they can accurately process. The development of multimodal design guidelines from both a user and a task domain perspective is thus critical to the achievement of successful Human Systems Integration (HSI) within military environments such as C2 systems. This study provides preliminary empirical support in identifying user attributes, such as spatial ability (p < 0.02) and learning style (p < 0.03), which may aid in developing principle-driven guidelines for how and when to effectively present task-specific modal information to improve C2 warfighters' performance. A preliminary framework for modeling user interaction in multimodal C2 environments is also in development; it is based on existing theories and models of working memory, as well as on new insights gained from the latest imaging of electromagnetic (e.g., EEG, ERP, MEG) and hemodynamic (e.g., fMRI, PET) changes in the brain while users perform predefined tasks. This research represents an innovative way to both predict and accommodate a user's information processing resources while interacting with multimodal systems. The current results and planned follow-on studies are facilitating the development of principle-driven multimodal design guidelines regarding how and when to adapt modes of interaction to meet the cognitive capabilities of users. Although the initial application of such results is focused on determining how and when modalities should be presented, either in isolation or in combination, to effectively present task-specific information to C4ISR warfighters, this research shows great potential for its applicability to the multimodal design community in general. Keywords: HCI, command and control, guidelines, multimodal design, multisensory | |||
| Using spatial warning signals to capture a driver's visual attention | | BIBAK | Full-Text | 350 | |
| Cristy Ho | |||
| This study was designed to assess the potential benefits of using spatial
auditory or vibrotactile warning signals in the domain of driving performance,
using a simulated driving task. Across six experiments, participants had to
monitor a rapidly presented stream of distractor letters for occasional target
digits (simulating an attention-demanding visual task, such as driving).
Whenever participants heard an auditory cue (E1-E4) or felt a vibration
(E5-E6), they had to check the front and the rearview mirror for the rapid
approach of a car from in front or behind and respond accordingly (either by
accelerating or braking). The efficacy of various auditory and vibrotactile
warning signals in directing a participant's visual attention to the correct
environmental position was compared (see Table 1). The results demonstrate the
potential utility of semantically meaningful or spatial auditory and/or
vibrotactile warning signals in interface design for directing a driver's, or
other interface-operator's, visual attention to time-critical events or
information. Keywords: auditory, crossmodal, driving, interface design, spatial attention, verbal,
vibrotactile, visual, warning signals | |||
| Multimodal interfaces and applications for visually impaired children | | BIBAK | Full-Text | 351 | |
| Saija Patomäki | |||
| Applications specially designed for visually handicapped children are
rare. Additionally, this group of users is often not able to obtain the needed
applications and machinery for their homes due to the expense. However, the
impairment these children have should not preclude them from the benefits and
possibilities computers have to offer. In a modern society, the services and
applications that computers open up can be considered a necessity for its
citizens. This is the core issue of our research interest: to test various
haptic devices and design usable applications that give this special user group
the possibility to become acquainted with computers, so that they are
encouraged to use and benefit from the technology later in their lives as well.
Research similar to ours, in which haptic sensation is present, has been carried out by Sjöström [3]. He has developed and tested haptic games that are used with the Phantom device [1]. Some of his applications are aimed at visually impaired children. During the project "Computer-based learning environment for visually impaired people" we designed, implemented and tested three different applications. Our target group was three- to seven-year-old visually impaired children. The applications were tested in three phases with the chosen subjects. During the experiments a special testing procedure was developed [2]. The applications were based on haptic and auditory feedback, but a simple graphical interface was available for those who were only partially blind. The chosen haptic device was the Phantom [1], a six-degrees-of-freedom input device. The Phantom is used with a stylus that resembles a pen; the stylus is attached to a robotic arm that generates force feedback to stimulate touch. The first application consisted of simple materials and path shapes. In the user tests the virtual materials were compared with real ones, and the various path shapes were meant to be traced with the stylus. The second application was more of a game-like environment, with four haptic rooms in which children had to do different tasks. The last tested application was a modification of the previous one. Its user interface consisted of six rooms, and the tasks in them were simplified based on the results gained in the previous user tests. Because the Phantom device is expensive and also difficult for some of the children to use, we decided to replace it with simpler machinery. In our current project, "Multimodal Interfaces for Visually Impaired Children", the applications will be used with haptic devices such as a tactile mouse or a force-feedback joystick. Some applications are designed and implemented from scratch and some are adapted from games originally meant for sighted children. The desired research outcome is practical: to produce workable user interfaces and applications whose functionality and cost are reasonable enough for them to be acquired for the homes of blind children. Keywords: Phantom, blind children, haptic environment, haptic feedback, learning,
visually impaired children | |||
| Multilayer architecture in sign language recognition system | | BIBAK | Full-Text | 352-353 | |
| Feng Jiang; Hongxun Yao; Guilin Yao | |||
| Up to now, analytical or statistical methods have been used in sign language
recognition with large vocabularies. Analytical methods such as Dynamic Time
Warping (DTW) or Euclidean distance have been used for isolated word
recognition, but their performance is not satisfactory because they are easily
disturbed by noise. Statistical methods, especially hidden Markov models, are
commonly used for both continuous sign language and isolated words, but as the
vocabulary expands the processing time becomes increasingly unacceptable.
Therefore, a multilayer architecture for large-vocabulary sign language
recognition is proposed in this paper for the purpose of speeding up the
recognition process. In this method the gesture sequence to be recognized is
first mapped to a set of easily confused words (the confusion set) through a
cursory global search, and the gesture is then recognized through a subsequent
local search; the confusion set is generated by a DTW/ISODATA algorithm.
Experimental results indicate that it is an effective
algorithm for Chinese sign language recognition. Keywords: DTW/ISODATA, sign language recognition | |||
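The two-stage search can be sketched as follows (an illustration of the idea, not the authors' code): a coarse DTW pass against one representative template per confusion set selects the set, and a fine DTW pass within that set picks the word; templates and feature vectors here are random placeholders.

```python
# Two-stage DTW search over randomly generated placeholder templates.
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
# confusion_sets: set id -> {word: template sequence of feature vectors}
confusion_sets = {k: {"word%d_%d" % (k, i): rng.normal(size=(20, 8)) for i in range(5)}
                  for k in range(10)}
centroids = {k: np.mean(list(ws.values()), axis=0) for k, ws in confusion_sets.items()}

def recognize(seq):
    best_set = min(centroids, key=lambda k: dtw(seq, centroids[k]))      # global cursory search
    return min(confusion_sets[best_set],
               key=lambda w: dtw(seq, confusion_sets[best_set][w]))      # local fine search

print(recognize(rng.normal(size=(20, 8))))
```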
| Computer vision techniques and applications in human-computer interaction | | BIBAK | Full-Text | 354 | |
| Erno Mäkinen | |||
| There has been much research on computer vision in the last three decades.
Computer vision methods have been developed for different situations. One
example is the detection of human faces. For computers, face detection is hard:
faces look different from different viewing directions, facial expressions
affect the look of the face, each individual person has a unique face, lighting
conditions can vary, and so on.
However, face detection is currently possible in limited conditions. In addition, there are some methods that can be used for gender recognition [3], face recognition [5] and facial expression recognition [2]. Nonetheless, there has been very little research on how to combine these methods. There has also been relatively little research on how to apply these methods in human-computer interaction (HCI). Finding sets of techniques that complement each other in a useful way is one research challenge. There are some applications that take advantage of one or two computer vision techniques. For example, Christian and Avery [1] have developed an information kiosk that uses computer vision to detect potential users from a distance. A similar kiosk has been developed by us at the University of Tampere [4]. There are also some games that use simple computer vision techniques for the interaction. However, there are very few applications that use several computer vision techniques together, such as face detection, facial expression recognition and gender recognition. Overall, there has been very little effort in combining different techniques. In my research, I develop computer vision methods and combine them, so that the combined method can detect faces and recognize gender and facial expressions. After successfully combining the methods, it is easier to develop HCI applications that take advantage of computer vision. Applications that can be used by a small group of people are my specific interest. These applications allow me to build adaptive user interfaces and analyze the use of computer vision techniques in improving human-computer interaction. Keywords: computer vision applications, face detection, facial expression recognition,
gender recognition | |||
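A minimal sketch of the kind of combined pipeline proposed above, assuming OpenCV's stock Haar cascade for face detection; the gender and expression classifiers are stand-ins, not trained models from this research.

```python
# Face detection followed by placeholder per-face recognizers.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def classify_gender(face_img):        # placeholder for a trained classifier
    return "unknown"

def classify_expression(face_img):    # placeholder for a trained classifier
    return "neutral"

def analyze(frame):
    """Detect faces, then run the downstream recognizers on each face crop."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    results = []
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                      minNeighbors=5):
        crop = gray[y:y + h, x:x + w]
        results.append({"box": (x, y, w, h),
                        "gender": classify_gender(crop),
                        "expression": classify_expression(crop)})
    return results

# usage (hypothetical frame source): analyze(cv2.imread("frame.jpg"))
```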
| Multimodal response generation in GIS | | BIBA | Full-Text | 355 | |
| Levent Bolelli | |||
| Advances in computer hardware and software technologies have enabled
sophisticated information visualization techniques as well as new interaction
opportunities to be introduced in the development of GIS (Geographical
Information Systems) applications. In particular, research efforts in computer
vision and natural language processing have enabled users to interact with
computer applications using natural speech and gestures, which has proven to be
effective for interacting with dynamic maps [1, 6]. Pen-based mobile devices
and gesture recognition systems enable system designers to define
application-specific gestures for carrying out particular tasks. Using a
force-feedback mouse to interact with GIS has been proposed for
visually impaired people [4]. These are exciting new opportunities that hold the
promise of advancing interaction with computers to a completely new level. The
ultimate aim, however, should be to facilitate human-computer
communication; that is, equal emphasis should be given to both understanding
and generation of multimodal behavior. My proposed research will provide a
conceptual framework and a computational model for generating multimodal
responses to communicate spatial information along with dynamically generated
maps. The model will eventually lead to the development of a computational agent
that has reasoning capabilities for distributing the semantic and pragmatic
content of the intended response message among speech, deictic gestures and
visual information. In other words, the system will be able to select the most
natural and effective mode(s) of communicating back to the user.
Any research in computer science that investigates direct interaction of computers with humans should place human factors at center stage. Therefore, this work will follow a multi-disciplinary approach and integrate ideas from previous research in Psychology, Cognitive Science, Linguistics, Cartography, Geographical Information Science (GIScience) and Computer Science that will enable us to identify and address human, cartographic and computational issues involved in response planning and assist users with their spatial decision making by facilitating their visual thinking process as well as reducing their cognitive load. The methodology will be integrated into the design of the DAVE_G [7] prototype: a natural, multimodal, mixed-initiative dialogue interface to GIS. The system is currently capable of recognizing, interpreting and fusing users' naturally occurring speech and gesture requests, and generating natural speech output. The communication between the system and user is modeled following the collaborative discourse theory [2] and maintains a Recipe Graph [5] structure -- based on SharedPlan theory [3] -- to represent the intentional structure of the discourse between the user and system. One major concern in generating speech responses for dynamic maps is that spatial information cannot be effectively communicated using speech. Altering perceptual attributes (e.g. color, size, pattern) of the visual data to direct the user's attention to a particular location on the map is not usually effective, since each attribute bears an inherent semantic meaning and those perceptual attributes should be modified only when the system's judgement states that those attributes are not crucial to the user's understanding of the situation at that stage of the task. Gesticulation, on the other hand, is powerful for conveying location and form of spatially oriented information [6] without manipulating the map, with the added benefit of facilitating speech production. My research aims at designing a feasible, extensible and effective multimodal response generation (content planning and modality allocation) model. A plan-based reasoning algorithm and methodology integrated with the Recipe Graph structure has the potential to achieve those goals. | |||
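A toy sketch of the content-planning and modality-allocation step discussed above (an assumption-laden illustration, not the proposed model): each part of the intended response is routed to speech, a deictic map highlight, or a map update depending on whether it is spatial and whether altering the display is judged safe.

```python
# Toy modality allocation for a multimodal GIS response.
def allocate(message_parts):
    """message_parts: list of dicts with 'content', 'spatial', 'map_safe_to_alter'."""
    plan = {"speech": [], "deictic": [], "map": []}
    for part in message_parts:
        if not part["spatial"]:
            plan["speech"].append(part["content"])        # non-spatial -> say it
        elif part["map_safe_to_alter"]:
            plan["map"].append(part["content"])           # spatial, display change OK
        else:
            plan["deictic"].append(part["content"])       # point at it instead
            plan["speech"].append("the area shown here")  # accompanying speech
    return plan

print(allocate([
    {"content": "Two shelters are available", "spatial": False, "map_safe_to_alter": True},
    {"content": "flood_zone_boundary", "spatial": True, "map_safe_to_alter": False},
]))
```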
| Adaptive multimodal recognition of voluntary and involuntary gestures of people with motor disabilities | | BIBK | Full-Text | 356 | |
| Ingmar Rauschert | |||
Keywords: adaptive systems, gesture recognition, motor-disability, multimodal
human-computer-interface | |||