| Multimodal user interfaces: who's the user? | | BIBA | Full-Text | 1 | |
| Anil K. Jain | |||
| A wide variety of systems require reliable personal recognition schemes to either confirm or determine the identity of an individual requesting their services. The purpose of such schemes is to ensure that only a legitimate user, and not anyone else, accesses the rendered services. Examples of such applications include secure access to buildings, computer systems, laptops, cellular phones and ATMs. Biometric recognition, or simply biometrics, refers to the automatic recognition of individuals based on their physiological and/or behavioral characteristics. By using biometrics it is possible to confirm or establish an individual's identity based on "who she is", rather than by "what she possesses" (e.g., an ID card) or "what she remembers" (e.g., a password). Current biometric systems make use of fingerprints, hand geometry, iris, face, voice, etc. to establish a person's identity. Biometric systems also introduce an aspect of user convenience. For example, they alleviate the need for a user to remember multiple passwords associated with different applications. A biometric system that uses a single biometric trait for recognition has to contend with problems related to non-universality of the trait, spoof attacks, limited degrees of freedom, large intra-class variability, and noisy data. Some of these problems can be addressed by integrating the evidence presented by multiple biometric traits of a user (e.g., face and iris). Such systems, known as multimodal biometric systems, demonstrate substantial improvement in recognition performance. In this talk, we will present various applications of biometrics, challenges associated with designing biometric systems, and various fusion strategies available to implement a multimodal biometric system. | |||
| New techniques for evaluating innovative interfaces with eye tracking | | BIBA | Full-Text | 2 | |
| Sandra Marshall | |||
| Computer interfaces are changing rapidly, as are the cognitive demands on
the operators using them. Innovative applications of new technologies such as
multimodal and multimedia displays, haptic and pen-based interfaces, and
natural language exchanges bring exciting changes to conventional interface
usage. At the same time, their complexity may place overwhelming cognitive
demands on the user. As novel interfaces and software applications are
introduced into operational settings, it is imperative to evaluate them from a
number of different perspectives. One important perspective examines the extent
to which a new interface changes the cognitive requirements for the operator.
This presentation describes a new approach to measuring cognitive effort using metrics based on eye movements and pupil dilation. It is well known that effortful cognitive processing is accompanied by increases in pupil dilation, but measurement techniques were not previously available that could supply results in real time or deal with data collected in long-lasting interactions. We now have a metric, the Index of Cognitive Activity, that is computed in real time as the operator interacts with the interface. The Index can be used to examine extended periods of usage or to assess critical events on an individual-by-individual basis. While dilation reveals when cognitive effort is highest, eye movements provide evidence of why. Especially during critical events, one wants to know whether the operator is confused by the presentation or location of specific information, whether he is attending to key information when necessary, or whether he is distracted by irrelevant features of the display. Important details of confusion, attention, and distraction are revealed by traces of his eye movements and statistical analyses of time spent looking at various features during critical events. Together, the Index of Cognitive Activity and the various analyses of eye movements provide essential information about how users interact with new interface technologies. Their use can aid designers of innovative hardware and software products by highlighting those features that increase rather than decrease users' cognitive effort. In the presentation, the underlying mathematical basis of the Index of Cognitive Activity will be described together with validating research results from a number of experiments. Eye movement analyses from the same studies give clues to the sources of increase in cognitive workload. To illustrate interface evaluation with the ICA and eye movement analysis, several extended examples will be presented using commercial and military displays. [NOTE: Dr. Marshall's eye tracking system will be available to view at Tuesday evening's joint UIST-ICMI demo reception.] | |||
| Crossmodal attention and multisensory integration: implications for multimodal interface design | | BIBA | Full-Text | 3 | |
| Charles Spence | |||
| One of the most important findings to emerge from the field of cognitive psychology in recent years has been the discovery that humans have a very limited ability to process incoming sensory information. In fact, contrary to many of the most influential human operator models, the latest research has shown that humans use the same limited pool of attentional resources to process the inputs arriving from each of their senses (e.g., hearing, vision, touch, smell, etc.). This research calls for a radical new way of examining and understanding the senses, which has major implications for the way we design everything from household products to multimodal user interfaces. Interface designers should realize that the decision to stimulate more senses actually reflects a trade-off between the benefits of utilizing additional senses and the costs associated with dividing attention between different sensory modalities. In this presentation, I will discuss some of the problems associated with dividing attention between eye and ear, as illustrated by talking on a mobile phone while driving. I hope to demonstrate that a better understanding of the senses and, especially, the links between the senses that have been highlighted by recent cognitive neuroscience research, will enable interface designers to develop multimodal interfaces that more effectively stimulate the user's senses. [Charles Spence has published more than 70 articles in scientific journals over the past decade.] | |||
| A system for fast, full-text entry for small electronic devices | | BIBAK | Full-Text | 4-11 | |
| Saied B. Nesbat | |||
| A novel text entry system designed based on the ubiquitous 12-button
telephone keypad and its adaptation for a soft keypad are presented. This
system can be used to enter full text (letters + numbers + special characters)
on devices where the number of keys or the keyboard area is limited.
Letter-frequency data is used for assigning letters to the positions of a 3x3
matrix on keys, so that the most frequent letters can be entered with a
double-click. Less frequent letters and characters are entered based on a 3x3
adjacency matrix using an unambiguous, two-keystroke scheme. The same technique
is applied to a virtual or soft keyboard layout so letters and characters are
entered with taps or slides on an 11-button keypad. Based on the application of
Fitts' law, this system is determined to be 67% faster than the QWERTY soft
keyboard and 31% faster than the multi-tap text entry system commonly used on
cell phones today. The system presented in this paper is implemented and runs
on Palm OS PDAs, replacing the built-in QWERTY keyboard and Graffiti
recognition systems of these PDAs. Keywords: Fitts' law, keypad input, mobile phones, mobile systems, pen-based, soft
keyboard, stylus input, text entry | |||
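To make the Fitts' law comparison above concrete, the following minimal Python sketch predicts an average tap time from the index of difficulty of a few key-to-key movements; the coefficients, distances, and key widths are illustrative assumptions, not the geometry or model parameters used in the paper.

```python
import math

# Illustrative Fitts' law coefficients (seconds); hypothetical values,
# not the ones fitted or assumed in the paper.
A, B = 0.1, 0.2

def fitts_time(distance_mm: float, width_mm: float) -> float:
    """Predicted movement time using the Shannon formulation of Fitts' law."""
    index_of_difficulty = math.log2(distance_mm / width_mm + 1.0)
    return A + B * index_of_difficulty

# Average predicted time over a set of hypothetical key-to-key movements.
movements = [(30.0, 10.0), (15.0, 10.0), (45.0, 10.0)]  # (distance, key width) in mm
avg_time = sum(fitts_time(d, w) for d, w in movements) / len(movements)
print(f"average predicted tap time: {avg_time:.3f} s")
```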
| Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality | | BIBAK | Full-Text | 12-19 | |
| Ed Kaiser; Alex Olwal; David McGee; Hrvoje Benko; Andrea Corradini; Xiaoguang Li; Phil Cohen; Steven Feiner | |||
| We describe an approach to 3D multimodal interaction in immersive augmented
and virtual reality environments that accounts for the uncertain nature of the
information sources. The resulting multimodal system fuses symbolic and
statistical information from a set of 3D gesture, spoken language, and
referential agents. The referential agents employ visible or invisible volumes
that can be attached to 3D trackers in the environment, and which use a
time-stamped history of the objects that intersect them to derive statistics
for ranking potential referents. We discuss the means by which the system
supports mutual disambiguation of these modalities and information sources, and
show through a user study how mutual disambiguation accounts for over 45% of
the successful 3D multimodal interpretations. An accompanying video
demonstrates the system in action. Keywords: augmented/virtual reality, evaluation, multimodal interaction | |||
| Learning and reasoning about interruption | | BIBAK | Full-Text | 20-27 | |
| Eric Horvitz; Johnson Apacible | |||
| We present methods for inferring the cost of interrupting users based on
multiple streams of events including information generated by interactions with
computing devices, visual and acoustical analyses, and data drawn from online
calendars. Following a review of prior work on techniques for deliberating
about the cost of interruption associated with notifications, we introduce
methods for learning models from data that can be used to compute the expected
cost of interruption for a user. We describe the Interruption Workbench, a set
of event-capture and modeling tools. Finally, we review experiments that
characterize the accuracy of the models for predicting interruption cost and
discuss research directions. Keywords: cognitive models, divided attention, interruption, notifications | |||
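As an illustration of the expected-cost computation described above, the sketch below combines an inferred distribution over user states with a cost model and a simple deferral rule; the states, probabilities, costs, and message value are illustrative assumptions, not values or policies from the paper.

```python
# Minimal sketch: expected interruption cost from an inferred state distribution.
state_probabilities = {"focused_solo_work": 0.6, "in_meeting": 0.3, "idle": 0.1}
interruption_costs = {"focused_solo_work": 4.0, "in_meeting": 9.0, "idle": 0.5}

expected_cost = sum(p * interruption_costs[s] for s, p in state_probabilities.items())

# An assumed notification policy: defer delivery whenever the expected cost
# of interrupting exceeds the value of delivering the message now.
message_value = 3.0
deliver_now = message_value >= expected_cost
print(f"expected cost = {expected_cost:.2f}, deliver now: {deliver_now}")
```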
| Providing the basis for human-robot-interaction: a multi-modal attention system for a mobile robot | | BIBAK | Full-Text | 28-35 | |
| Sebastian Lang; Marcus Kleinehagenbrock; Sascha Hohenner; Jannik Fritsch; Gernot A. Fink; Gerhard Sagerer | |||
| In order to enable the widespread use of robots in home and office
environments, systems with natural interaction capabilities have to be
developed. A prerequisite for natural interaction is the robot's ability to
automatically recognize when and how long a person's attention is directed
towards it for communication. As in open environments several persons can be
present simultaneously, the detection of the communication partner is of
particular importance. In this paper we present an attention system for a
mobile robot which enables the robot to shift its attention to the person of
interest and to maintain attention during interaction. Our approach is based on
a method for multi-modal person tracking which uses a pan-tilt camera for face
recognition, two microphones for sound source localization, and a laser range
finder for leg detection. Shifting of attention is realized by turning the
camera in the direction of the person who is currently speaking. From the
orientation of the head it is decided whether the speaker addresses the robot.
The performance of the proposed approach is demonstrated with an evaluation. In
addition, qualitative results from the performance of the robot at the
exhibition part of the ICVS'03 are provided. Keywords: attention, human-robot-interaction, multi-modal person tracking | |||
| Selective perception policies for guiding sensing and computation in multimodal systems: a comparative analysis | | BIBAK | Full-Text | 36-43 | |
| Nuria Oliver; Eric Horvitz | |||
| Intensive computations required for sensing and processing perceptual
information can impose significant burdens on personal computer systems. We
explore several policies for selective perception in SEER, a multimodal system
for recognizing office activity that relies on a layered Hidden Markov Model
representation. We review our efforts to employ expected-value-of-information
(EVI) computations to limit sensing and analysis in a context-sensitive manner.
We discuss an implementation of a one-step myopic EVI analysis and compare the
results of using the myopic EVI with a heuristic sensing policy that makes
observations at different frequencies. Both policies are then compared to a
random perception policy, where sensors are selected at random. Finally, we
discuss the sensitivity of ideal perceptual actions to preferences encoded in
utility models about information value and the cost of sensing. Keywords: Hidden Markov models, automatic feature selection, expected value of
information, human behavior recognition, multi-modal interaction, office
awareness, selective perception | |||
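The following toy Python example illustrates a one-step myopic expected-value-of-information calculation of the kind described above, here for deciding whether to activate one extra sensor before acting; the activities, likelihoods, utilities, and sensing cost are made-up values, and the paper's layered-HMM inference is not modeled.

```python
# Toy one-step myopic EVI: is observing the audio sensor worth its cost?
priors = {"phone_call": 0.4, "face_to_face": 0.6}          # belief over activities
utility = {("phone_call", "phone_call"): 1.0,               # utility(true, chosen)
           ("phone_call", "face_to_face"): 0.0,
           ("face_to_face", "face_to_face"): 1.0,
           ("face_to_face", "phone_call"): 0.0}
likelihood = {("sound", "phone_call"): 0.8,                 # P(observation | activity)
              ("sound", "face_to_face"): 0.3,
              ("silence", "phone_call"): 0.2,
              ("silence", "face_to_face"): 0.7}
sensing_cost = 0.05

def best_expected_utility(belief):
    # Best achievable expected utility when choosing one activity label.
    return max(sum(belief[s] * utility[(s, a)] for s in belief) for a in belief)

eu_without = best_expected_utility(priors)                  # act on the prior alone

eu_with = 0.0                                               # observe first, then act
for obs in ("sound", "silence"):
    evidence = sum(likelihood[(obs, s)] * priors[s] for s in priors)
    posterior = {s: likelihood[(obs, s)] * priors[s] / evidence for s in priors}
    eu_with += evidence * best_expected_utility(posterior)

evi = eu_with - eu_without
print(f"EVI = {evi:.3f}; sense audio: {evi > sensing_cost}")
```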
| Toward a theory of organized multimodal integration patterns during human-computer interaction | | BIBAK | Full-Text | 44-51 | |
| Sharon Oviatt; Rachel Coulston; Stefanie Tomko; Benfang Xiao; Rebecca Lunsford; Matt Wesson; Lesley Carmichael | |||
| As a new generation of multimodal systems begins to emerge, one dominant
theme will be the integration and synchronization requirements for combining
modalities into robust whole systems. In the present research, quantitative
modeling is presented on the organization of users' speech and pen multimodal
integration patterns. In particular, the potential malleability of users'
multimodal integration patterns is explored, as well as variation in these
patterns during system error handling and tasks varying in difficulty. Using a
new dual-wizard simulation method, data was collected from twelve adults as
they interacted with a map-based task using multimodal speech and pen input.
Analyses based on over 1600 multimodal constructions revealed that users'
dominant multimodal integration pattern was resistant to change, even when
strong selective reinforcement was delivered to encourage switching from a
sequential to simultaneous integration pattern, or vice versa. Instead, both
sequential and simultaneous integrators showed evidence of entrenching further
in their dominant integration patterns (i.e., increasing either their
inter-modal lag or signal overlap) over the course of an interactive session,
during system error handling, and when completing increasingly difficult tasks.
In fact, during error handling these changes in the co-timing of multimodal
signals became the main feature of hyper-clear multimodal language, with
elongation of individual signals either attenuated or absent. Whereas
Behavioral/Structuralist theory cannot account for these data, it is argued
that Gestalt theory provides a valuable framework and insights into multimodal
interaction. Implications of these findings are discussed for the development
of a coherent theory of multimodal integration during human-computer
interaction, and for the design of a new class of adaptive multimodal
interfaces. Keywords: Gestalt theory, co-timing, entrenchment, error handling, multimodal
integration, speech and pen input, task difficulty | |||
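A minimal sketch of the simultaneous-versus-sequential distinction underlying this analysis: classify each speech/pen construction by whether the two signals overlap in time, then take the dominant label. The coding rule and timestamps are illustrative assumptions, not the authors' scoring procedure.

```python
def integration_pattern(speech, pen):
    """speech, pen: (start, end) times in seconds for one multimodal construction."""
    overlap = min(speech[1], pen[1]) - max(speech[0], pen[0])
    return "simultaneous" if overlap > 0 else "sequential"

constructions = [((0.0, 1.2), (0.8, 1.5)),   # overlapping signals
                 ((2.0, 2.9), (3.4, 3.9)),   # pen follows speech after a lag
                 ((5.0, 6.0), (5.5, 6.3))]

labels = [integration_pattern(s, p) for s, p in constructions]
dominant = max(set(labels), key=labels.count)
print(labels, "-> dominant pattern:", dominant)
```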
| TorqueBAR: an ungrounded haptic feedback device | | BIBAK | Full-Text | 52-59 | |
| Colin Swindells; Alex Unden; Tao Sang | |||
| Kinesthetic feedback is a key mechanism by which people perceive object
properties during their daily tasks -- particularly inertial properties. For
example, transporting a glass of water without spilling, or dynamically
positioning a handheld tool such as a hammer, both require inertial kinesthetic
feedback. We describe a prototype for a novel ungrounded haptic feedback
device, the TorqueBAR, that exploits a kinesthetic awareness of dynamic inertia
to simulate complex coupled motion as both a display and input device. As a
user tilts the TorqueBAR to sense and control computer programmed stimuli, the
TorqueBAR's centre-of-mass changes in real-time according to the user's
actions. We evaluate the TorqueBAR using both quantitative and qualitative
techniques, and we describe possible applications for the device such as video
games and real-time robot navigation. Keywords: 1 DOF, haptic rod, input device, mobile computing, tilt controller, torque
feedback, two-handed, ungrounded force feedback | |||
| Towards tangibility in gameplay: building a tangible affective interface for a computer game | | BIBAK | Full-Text | 60-67 | |
| Ana Paiva; Rui Prada; Ricardo Chaves; Marco Vala; Adrian Bullock; Gerd Andersson; Kristina Höök | |||
| In this paper we describe a way of controlling the emotional states of a
synthetic character in a game (FantasyA) through a tangible interface named
SenToy. SenToy is a doll with sensors in the arms, legs and body, allowing the
user to influence the emotions of her character in the game. The user performs
gestures and movements with SenToy, which are picked up by the sensors and
interpreted according to a scheme found through an initial Wizard of Oz study.
Different gestures are used to express each of the following emotions: anger,
fear, happiness, surprise, sadness and gloating. Depending upon the expressed
emotion, the synthetic character in FantasyA will, in turn, perform different
actions. The evaluation of SenToy acting as the interface to the computer game
FantasyA has shown that users were able to express most of the desired emotions
to influence the synthetic characters, and that overall, players, especially
children, really liked the doll as an interface. Keywords: affective computing, characters, synthetic, tangible interfaces | |||
| Multimodal biometrics: issues in design and testing | | BIBAK | Full-Text | 68-72 | |
| Robert Snelick; Mike Indovina; James Yen; Alan Mink | |||
| Experimental studies show that multimodal biometric systems for small-scale
populations perform better than single-mode biometric systems. We examine if
such techniques scale to larger populations, introduce a methodology to test
the performance of such systems, and assess the feasibility of using commercial
off-the-shelf (COTS) products to construct deployable multimodal biometric
systems. A key aspect of our approach is to leverage confidence level scores
from preexisting single-mode data. An example presents a multimodal biometrics
system analysis that explores various normalization and fusion techniques for
face and fingerprint classifiers. This multimodal analysis uses a population of
about 1000 subjects, a number ten times larger than seen in any previously
reported study. Experimental results combining face and fingerprint biometric
classifiers reveal significant performance improvement over single-mode
biometric systems. Keywords: evaluation, fusion, multimodal biometrics, normalization, system design,
testing methodology | |||
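As a concrete illustration of one normalization-and-fusion combination of the kind compared in the paper, the sketch below min-max normalizes face and fingerprint matcher scores and fuses them with an equal-weight sum; the synthetic scores, weights, and acceptance threshold are assumptions.

```python
import numpy as np

# Synthetic matcher outputs with different native score ranges.
rng = np.random.default_rng(0)
face_scores = rng.uniform(0, 100, size=5)      # hypothetical face matcher scores
finger_scores = rng.uniform(0, 1, size=5)      # hypothetical fingerprint matcher scores

def min_max(scores):
    # Map raw scores onto [0, 1] so the two matchers become comparable.
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

fused = 0.5 * min_max(face_scores) + 0.5 * min_max(finger_scores)
decisions = fused >= 0.6                       # assumed acceptance threshold
print(np.round(fused, 3), decisions)
```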
| Sensitivity to haptic-audio asynchrony | | BIBAK | Full-Text | 73-76 | |
| Bernard D. Adelstein; Durand R. Begault; Mark R. Anderson; Elizabeth M. Wenzel | |||
| The natural role of sound in actions involving mechanical impact and
vibration suggests the use of auditory display as an augmentation to virtual
haptic interfaces. In order to budget available computational resources for
sound simulation, the perceptually tolerable asynchrony between paired
haptic-auditory sensations must be known. This paper describes a psychophysical
study of detectable time delay between a voluntary hammer tap and its auditory
consequence (a percussive sound of either 1, 50, or 200 ms duration). The
results show Just Noticeable Differences (JNDs) for temporal asynchrony of 24
ms with insignificant response bias. The invariance of JND and response bias as
a function of sound duration in this experiment indicates that observers cued
on the initial attack of the auditory stimuli. Keywords: audio, cross-modal asynchrony, haptic, latency, multi-modal interfaces, time
delay, virtual environments | |||
| A multi-modal approach for determining speaker location and focus | | BIBA | Full-Text | 77-80 | |
| Michael Siracusa; Louis-Philippe Morency; Kevin Wilson; John Fisher; Trevor Darrell | |||
| This paper presents a multi-modal approach to locate a speaker in a scene and determine to whom he or she is speaking. We present a simple probabilistic framework that combines multiple cues derived from both audio and video information. A purely visual cue is obtained using a head tracker to identify possible speakers in a scene and provide both their 3-D positions and orientation. In addition, estimates of the audio signal's direction of arrival are obtained with the help of a two-element microphone array. A third cue measures the association between the audio and the tracked regions in the video. Integrating these cues provides a more robust solution than using any single cue alone. The usefulness of our approach is shown in our results for video sequences with two or more people in a prototype interactive kiosk environment. | |||
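The sketch below illustrates the general idea of combining per-person audio and video cue likelihoods under a conditional-independence assumption to pick the active speaker; the cue values and the naive-Bayes-style combination are an illustrative simplification, not the paper's exact probabilistic framework.

```python
import math

# For each tracked person: (P(audio arrival direction | speaking),
#                           P(audio-video association | speaking)).
# All values are made up for illustration.
candidates = {
    "person_A": (0.7, 0.6),
    "person_B": (0.2, 0.3),
}
prior = 0.5  # uniform prior over the two tracked people

log_posteriors = {
    pid: math.log(prior) + sum(math.log(p) for p in cues)
    for pid, cues in candidates.items()
}
speaker = max(log_posteriors, key=log_posteriors.get)
print("estimated active speaker:", speaker)
```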
| Distributed and local sensing techniques for face-to-face collaboration | | BIBAK | Full-Text | 81-84 | |
| Ken Hinckley | |||
| This paper describes techniques that allow users to collaborate on tablet
computers that employ distributed sensing techniques to establish a privileged
connection between devices. Each tablet is augmented with a two-axis linear
accelerometer (tilt sensor), touch sensor, proximity sensor, and light sensor.
The system recognizes when users bump two tablets together by looking for
spikes in each tablet's accelerometer data that are synchronized in time;
bumping establishes a privileged connection between the devices. Users can face
one another and bump the tops of two tablets together to establish a
collaborative face-to-face workspace. The system then uses the sensors to
enhance transitions between personal work and shared work. For example, a user
can hold his or her hand near the top of the workspace to "shield" the display
from the other user. This gesture is sensed using the proximity sensor together
with the light sensor, allowing for quick "asides" into private information or
to sketch an idea in a personal workspace. Picking up, putting down, or walking
away with a tablet are also sensed, as is angling the tablet towards the other
user. Much research in single display groupware considers shared displays and
shared artifacts, but our system explores a unique form of dual display
groupware for face-to-face communication and collaboration using personal
display devices. Keywords: co-present collaboration, context awareness, sensing techniques | |||
| Georgia tech gesture toolkit: supporting experiments in gesture recognition | | BIBAK | Full-Text | 85-92 | |
| Tracy Westeyn; Helene Brashear; Amin Atrash; Thad Starner | |||
| Gesture recognition is becoming a more common interaction tool in the fields
of ubiquitous and wearable computing. Designing a system to perform gesture
recognition, however, can be a cumbersome task. Hidden Markov models (HMMs), a
pattern recognition technique commonly used in speech recognition, can be used
for recognizing certain classes of gestures. Existing HMM toolkits for speech
recognition can be adapted to perform gesture recognition, but doing so
requires significant knowledge of the speech recognition literature and its
relation to gesture recognition. This paper introduces the Georgia Tech Gesture
Toolkit (GT2k), which leverages Cambridge University's speech recognition
toolkit, HTK, to provide tools that support gesture recognition research.
GT2k provides capabilities for training models and allows for both
real-time and off-line recognition. This paper presents four ongoing projects
that utilize the toolkit in a variety of domains. Keywords: American sign language, context recognition, gesture recognition, hidden
Markov models, interfaces, toolkit, wearable computers | |||
| Architecture and implementation of multimodal plug and play | | BIBAK | Full-Text | 93-100 | |
| Christian Elting; Stefan Rapp; Gregor Möhler; Michael Strube | |||
| This paper describes the handling of multimodality in the Embassi system.
Here, multimodality is treated in two modules. Firstly, a modality fusion
component merges speech, video traced pointing gestures, and input from a
graphical user interface. Secondly, a presentation planning component decides
upon the modality to be used for the output, i.e., speech, an animated
life-like character (ALC) and/or the graphical user interface, and ensures that
the presentation is coherent and cohesive. We describe how these two components
work and emphasize one particular feature of our system architecture: All
modality analysis components generate output in a common semantic description
format and all render components process input in a common output language.
This makes it particularly easy to add or remove modality analyzers or renderer
components, even dynamically while the system is running. This plug and play of
modalities can be used to adjust the system's capabilities to different demands
of users and their situative context. In this paper we give details about the
implementations of the models, protocols and modules that are necessary to
realize those features. Keywords: dialog systems, multimodal, multimodal fission, multimodal fusion | |||
| SmartKom: adaptive and flexible multimodal access to multiple applications | | BIBAK | Full-Text | 101-108 | |
| Norbert Reithinger; Jan Alexandersson; Tilman Becker; Anselm Blocher; Ralf Engel; Markus Löckelt; Jochen Müller; Norbert Pfleger; Peter Poller; Michael Streit; Valentin Tschernomas | |||
| The development of an intelligent user interface that supports multimodal
access to multiple applications is a challenging task. In this paper we present
a generic multimodal interface system where the user interacts with an
anthropomorphic personalized interface agent using speech and natural gestures.
The knowledge-based and uniform approach of SmartKom enables us to realize a
comprehensive system that understands imprecise, ambiguous, or incomplete
multimodal input and generates coordinated, cohesive, and coherent multimodal
presentations for three scenarios, currently addressing more than 50 different
functionalities of 14 applications. We demonstrate the main ideas in a walk
through the main processing steps from modality fusion to modality fission. Keywords: intelligent multimodal interfaces, multiple applications, system description | |||
| A framework for rapid development of multimodal interfaces | | BIBAK | Full-Text | 109-116 | |
| Frans Flippo; Allen Krebs; Ivan Marsic | |||
| Despite the availability of multimodal devices, there are very few
commercial multimodal applications available. One reason for this may be the
lack of a framework to support development of multimodal applications in
reasonable time and with limited resources. This paper describes a multimodal
framework enabling rapid development of applications using a variety of
modalities and methods for ambiguity resolution, featuring a novel approach to
multimodal fusion. An example application is studied that was created using the
framework. Keywords: application frameworks, command and control, direct manipulation, multimodal
fusion, multimodal interfaces | |||
| Capturing user tests in a multimodal, multidevice informal prototyping tool | | BIBAK | Full-Text | 117-124 | |
| Anoop K. Sinha; James A. Landay | |||
| Interaction designers are increasingly faced with the challenge of creating
interfaces that incorporate multiple input modalities, such as pen and speech,
and span multiple devices. Few early stage prototyping tools allow
non-programmers to prototype these interfaces. Here we describe CrossWeaver, a
tool for informally prototyping multimodal, multidevice user interfaces. This
tool embodies the informal prototyping paradigm, leaving design representations
in an informal, sketched form, and creates a working prototype from these
sketches. CrossWeaver allows a user interface designer to sketch storyboard
scenes on the computer, specifying simple multimodal command transitions
between scenes. The tool also allows scenes to target different output devices.
Prototypes can run across multiple standalone devices simultaneously,
processing multimodal input from each one. Thus, a designer can visually create
a multimodal prototype for a collaborative meeting or classroom application.
CrossWeaver captures all of the user interaction when running a test of a
prototype. This input log can quickly be viewed visually for the details of the
users' multimodal interaction or it can be replayed across all participating
devices, giving the designer information to help him or her analyze and iterate
on the interface design. Keywords: informal prototyping, mobile interface design, multidevice, multimodal, pen
and speech input, sketching | |||
| Large vocabulary sign language recognition based on hierarchical decision trees | | BIBAK | Full-Text | 125-131 | |
| Gaolin Fang; Wen Gao; Debin Zhao | |||
| The major difficulty for large vocabulary sign language or gesture
recognition lies in the huge search space due to a variety of recognized
classes. How to reduce the recognition time without loss of accuracy is a
challenging issue. In this paper, a hierarchical decision tree is first presented
for large vocabulary sign language recognition based on the divide-and-conquer
principle. As each sign feature has a different importance for distinguishing
gestures, corresponding classifiers are proposed for hierarchical decisions over
gesture attributes. A one- or two-handed classifier with little computational cost is
first used to eliminate many impossible candidates. The subsequent hand shape
classifier is then applied to the remaining candidate space. A SOFM/HMM classifier
is employed to obtain the final results at the last non-leaf nodes, which contain
only a few candidates. Experimental results on a large vocabulary of 5113 signs
show that the proposed method reduces the recognition time by a factor of 11 and
improves the recognition rate by about 0.95% over a single SOFM/HMM. Keywords: Gaussian mixture model, finite state machine, gesture recognition,
hierarchical decision tree, sign language recognition | |||
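The sketch below shows the cascade structure described above, with cheap stages pruning the lexicon before an expensive final classifier runs; the classifier bodies are stubs and the tiny lexicon is hypothetical, standing in for the one/two-handed, hand-shape, and SOFM/HMM stages.

```python
def one_or_two_handed(sign_features):
    return sign_features["hands"]                 # stub: 1 or 2

def hand_shape(sign_features):
    return sign_features["shape"]                 # stub: coarse shape label

def final_score(sign_features, candidate):
    return -abs(hash(candidate) % 100)            # stub for the expensive SOFM/HMM score

lexicon = {                                       # tiny illustrative lexicon
    "HELLO":  {"hands": 1, "shape": "flat"},
    "THANKS": {"hands": 1, "shape": "flat"},
    "BOOK":   {"hands": 2, "shape": "flat"},
    "BALL":   {"hands": 2, "shape": "curved"},
}

def recognize(sign_features):
    # Stage 1: eliminate candidates with the wrong number of hands.
    candidates = [s for s, f in lexicon.items()
                  if f["hands"] == one_or_two_handed(sign_features)]
    # Stage 2: keep only candidates matching the coarse hand shape.
    candidates = [s for s in candidates if lexicon[s]["shape"] == hand_shape(sign_features)]
    # Stage 3: run the expensive classifier only on the survivors.
    return max(candidates, key=lambda c: final_score(sign_features, c))

print(recognize({"hands": 2, "shape": "curved"}))
```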
| Hand motion gestural oscillations and multimodal discourse | | BIBAK | Full-Text | 132-139 | |
| Yingen Xiong; Francis Quek; David McNeill | |||
| To develop multimodal interfaces, one needs to understand the constraints
underlying human communicative gesticulation and the kinds of features one may
compute based on these underlying human characteristics.
In this paper we address hand motion oscillatory gesture detection in natural speech and conversation. First, the hand motion trajectory signals are extracted from video. Second, a wavelet analysis based approach is presented to process the signals. In this approach, wavelet ridges are extracted from the responses of wavelet analysis for the hand motion trajectory signals, which can be used to characterize frequency properties of the hand motion signals. The hand motion oscillatory gestures can be extracted from these frequency properties. Finally, we relate the hand motion oscillatory gestures to the phases of speech and multimodal discourse analysis. We demonstrate the efficacy of the system on a real discourse dataset in which a subject described her action plan to an interlocutor. We extracted the oscillatory gestures from the x, y and z motion traces of both hands. We further demonstrate the power of gestural oscillation detection as a key to unlock the structure of the underlying discourse. Keywords: gesture symmetry, hand gesture, hand motion trajectory, interaction,
multimodal, multimodal discourse structure, speech analysis | |||
| Pointing gesture recognition based on 3D-tracking of face, hands and head orientation | | BIBAK | Full-Text | 140-146 | |
| Kai Nickel; Rainer Stiefelhagen | |||
| In this paper, we present a system capable of visually detecting pointing
gestures and estimating the 3D pointing direction in real-time. In order to
acquire input features for gesture recognition, we track the positions of a
person's face and hands on image sequences provided by a stereo-camera. Hidden
Markov Models (HMMs), trained on different phases of sample pointing gestures,
are used to classify the 3D-trajectories in order to detect the occurrence of a
gesture. When analyzing sample pointing gestures, we noticed that humans tend
to look at the pointing target while performing the gesture. In order to
utilize this behavior, we additionally measured head orientation by means of a
magnetic sensor in a similar scenario. By using head orientation as an
additional feature, we observed significant gains in both recall and precision
of pointing gestures. Moreover, the percentage of correctly identified pointing
targets improved significantly from 65% to 83%. For estimating the pointing
direction, we comparatively used three approaches: 1) The line of sight between
head and hand, 2) the forearm orientation, and 3) the head orientation. Keywords: computer vision, gesture recognition, person tracking, pointing gestures | |||
| Untethered gesture acquisition and recognition for a multimodal conversational system | | BIBAK | Full-Text | 147-150 | |
| T. Ko; D. Demirdjian; T. Darrell | |||
| Humans use a combination of gesture and speech to convey meaning, and
usually do so without holding a device or pointer. We present a system that
incorporates body tracking and gesture recognition for an untethered
human-computer interface. This research focuses on a module that provides
parameterized gesture recognition, using various machine learning techniques.
We train the support vector classifier to model the boundary of the space of
possible gestures, and train Hidden Markov Models on specific gestures. Given a
sequence, we can find the start and end of various gestures using a support
vector classifier, and find gesture likelihoods and parameters with a HMM.
Finally multimodal recognition is performed using rank-order fusion to merge
speech and vision hypotheses. Keywords: articulated tracking, hidden Markov models, speech, support vector machines,
vision | |||
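A minimal sketch of rank-order fusion as one way to merge ranked speech and vision hypotheses: sum each hypothesis's rank across modalities and pick the lowest total. The hypothesis lists and the unweighted sum are illustrative assumptions, not the system's actual hypothesis space or weighting.

```python
speech_ranking = ["move the chair", "move the char", "prove the chair"]
vision_ranking = ["move the chair", "rotate the chair", "move the char"]

def rank(hypothesis, ranking):
    # Unranked hypotheses get a rank one past the end of the list.
    return ranking.index(hypothesis) if hypothesis in ranking else len(ranking)

hypotheses = set(speech_ranking) | set(vision_ranking)
fused = sorted(hypotheses, key=lambda h: rank(h, speech_ranking) + rank(h, vision_ranking))
print("fused best hypothesis:", fused[0])
```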
| Where is "it"? Event Synchronization in Gaze-Speech Input Systems | | BIBAK | Full-Text | 151-158 | |
| Manpreet Kaur; Marilyn Tremaine; Ning Huang; Joseph Wilder; Zoran Gacovski; Frans Flippo; Chandra Sekhar Mantravadi | |||
| The relationship between gaze and speech is explored for the simple task of
moving an object from one location to another on a computer screen. The subject
moves a designated object from a group of objects to a new location on the
screen by stating, "Move it there". Gaze and speech data are captured to
determine if we can robustly predict the selected object and destination
position. We have found that the source fixation closest to the desired object
begins, with high probability, before the beginning of the word "Move". An
analysis of all fixations before and after speech onset time shows that the
fixation that best identifies the object to be moved occurs, on average, 630
milliseconds before speech onset with a range of 150 to 1200 milliseconds for
individual subjects. The variance in these times for individuals is relatively
small although the variance across subjects is large. Selecting a fixation
closest to the onset of the word "Move" as the designator of the object to be
moved gives a system accuracy close to 95% for all subjects. Thus, although
significant differences exist between subjects, we believe that the speech and
gaze integration patterns can be modeled reliably for individual users and
therefore be used to improve the performance of multimodal systems. Keywords: eye-tracking, gaze-speech co-occurrence, multimodal fusion, multimodal
interfaces | |||
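The selection rule suggested by these findings can be sketched in a few lines: pick the fixation whose onset lies closest to the onset of the spoken word "Move". The fixation data and speech onset below are illustrative.

```python
fixations = [  # (onset time in ms, fixated object id); hypothetical data
    (1200, "cube_3"),
    (1820, "cube_7"),
    (2650, "cube_7"),
    (3900, "target_area"),
]
speech_onset_ms = 2500  # assumed onset of the word "Move"

# Choose the fixation temporally closest to speech onset as the referent.
selected = min(fixations, key=lambda f: abs(f[0] - speech_onset_ms))
print("object referred to by 'it':", selected[1])
```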
| Eyetracking in cognitive state detection for HCI | | BIBAK | Full-Text | 159-163 | |
| Darrell S. Rudmann; George W. McConkie; Xianjun Sam Zheng | |||
| 1. Past research in a number of fields confirms the existence of a link
between cognition and eye movement control, beyond simply a pointing
relationship. This being the case, it should be possible to use eye movement
recording as a basis for detecting users' cognitive states in real time.
Several examples of such cognitive state detectors have been reported in the
literature.
2. A multi-disciplinary project is described in which the goal is to provide the computer with as much real-time information about the human state (cognitive, affective and motivational state) as possible, and to base computer actions on this information. The application area in which this is being implemented is science education, learning about gears through exploration. Two studies are reported in which participants solve simple problems of pictured gear trains while their eye movements are recorded. The first study indicates that most eye movement sequences are compatible with predictions of a simple sequential cognitive model, and it is suggested that those sequences that do not fit the model may be of particular interest in the HCI context as indicating problems or alternative mental strategies. The mental rotation of gears sometimes produces sequences of short eye movements in the direction of motion; thus, such sequences may be useful as cognitive state detectors. The second study tested the hypothesis that participants are thinking about the object to which their eyes are directed. In this study, the display was turned off partway through the process of solving a problem, and the participants reported what they were thinking about at that time. While in most cases the participants reported cognitive activities involving the fixated object, this was not the case on a sizeable number of trials. Keywords: cognitive state, eye tracking | |||
| A multimodal learning interface for grounding spoken language in sensory perceptions | | BIBAK | Full-Text | 164-171 | |
| Chen Yu; Dana H. Ballard | |||
| Most speech interfaces are based on natural language processing techniques
that use pre-defined symbolic representations of word meanings and process only
linguistic information. To understand and use language like their human
counterparts in multimodal human-computer interaction, computers need to
acquire spoken language and map it to other sensory perceptions. This paper
presents a multimodal interface that learns to associate spoken language with
perceptual features by being situated in users' everyday environments and
sharing user-centric multisensory information. The learning interface is
trained in unsupervised mode in which users perform everyday tasks while
providing natural language descriptions of their behaviors. We collect acoustic
signals in concert with multisensory information from non-speech modalities,
such as user's perspective video, gaze positions, head directions and hand
movements. The system first estimates users' focus of attention from eye and
head cues. Attention, as represented by gaze fixation, is used for spotting the
target object of user interest. Attention switches are calculated and used to
segment an action sequence into action units which are then categorized by
mixture hidden Markov models. A multimodal learning algorithm is developed to
spot words from continuous speech and then associate them with perceptually
grounded meanings extracted from visual perception and action. Successful
learning has been demonstrated in the experiments of three natural tasks:
"unscrewing a jar", "stapling a letter" and "pouring water". Keywords: language acquisition, machine learning, multimodal integration | |||
| A computer-animated tutor for spoken and written language learning | | BIBAK | Full-Text | 172-175 | |
| Dominic W. Massaro | |||
| Baldi, a computer-animated talking head, is introduced. The quality of his
visible speech has been repeatedly modified and evaluated to accurately
simulate naturally talking humans. Baldi's visible speech can be appropriately
aligned with either synthesized or natural auditory speech. Baldi has had great
success in teaching vocabulary and grammar to children with language challenges
and training speech distinctions to children with hearing loss and to adults
learning a new language. We demonstrate these learning programs and also
demonstrate several other potential application areas for Baldi. Keywords: facial and speech synthesis, language learning | |||
| Augmenting user interfaces with adaptive speech commands | | BIBAK | Full-Text | 176-179 | |
| Peter Gorniak; Deb Roy | |||
| We present a system that augments any unmodified Java application with an
adaptive speech interface. The augmented system learns to associate spoken
words and utterances with interface actions such as button clicks. Speech
learning is constantly active and searches for correlations between what the
user says and does. Training the interface is seamlessly integrated with using
the interface. As the user performs normal actions, she may optionally verbally
describe what she is doing. By using a phoneme recognizer, the interface is
able to quickly learn new speech commands. Speech commands are chosen by the
user and can be recognized robustly due to accurate phonetic modelling of the
user's utterances and the small size of the vocabulary learned for a single
application. After only a few examples, speech commands can replace mouse
clicks. In effect, selected interface functions migrate from keyboard and mouse
to speech. We demonstrate the usefulness of this approach by augmenting jfig, a
drawing application, where speech commands save the user from the distraction
of having to use a tool palette. Keywords: machine learning, phoneme recognition, robust speech interfaces, user
modelling | |||
| Combining speech and haptics for intuitive and efficient navigation through image databases | | BIBAK | Full-Text | 180-187 | |
| Thomas Käster; Michael Pfeiffer; Christian Bauckhage | |||
| Given the size of today's professional image databases, the standard approach
to object- or theme-related image retrieval is to interactively navigate
through the content. But as most users of such databases are designers or
artists who do not have a technical background, navigation interfaces must be
intuitive to use and easy to learn. This paper reports on efforts towards this
goal. We present a system for intuitive image retrieval that features different
modalities for interaction. Apart from conventional input devices like mouse or
keyboard it is also possible to use speech or haptic gesture to indicate what
kind of images one is looking for. Seeing a selection of images on the screen,
the user provides relevance feedback to narrow the choice of motifs presented
next. This is done either by scoring whole images or by choosing certain image
regions. In order to derive consistent reactions from multimodal user input,
asynchronous integration of modalities and probabilistic reasoning based on
Bayesian networks are applied. After addressing technical details, we will
discuss a series of usability experiments, which we conducted to examine the
impact of multimodal input facilities on interactive image retrieval. The
results indicate that users appreciate multimodality. While we observed little
decrease in task performance, measures of contentment exceeded those for
conventional input devices. Keywords: content-based image retrieval, fusion of haptics, multimodal interface
evaluation, speech, vision processing | |||
| Interactive skills using active gaze tracking | | BIBAK | Full-Text | 188-195 | |
| Rowel Atienza; Alexander Zelinsky | |||
| We have incorporated interactive skills into an active gaze tracking system.
Our active gaze tracking system can identify an object in a cluttered scene
that a person is looking at. By following the user's 3-D gaze direction
together with a zero-disparity filter, we can determine the object's position.
Our active vision system also directs attention to a user by tracking anything
with both motion and skin color. A Particle Filter fuses skin color and motion
from optical flow techniques together to locate a hand or a face in an image.
The active vision then uses stereo camera geometry, Kalman Filtering and
position and velocity controllers to track the feature in real-time. These
skills are integrated together such that they cooperate with each other in
order to track the user's face and gaze at all times. Results and video demos
provide interesting insights on how active gaze tracking can be utilized and
improved to make human-friendly user interfaces. Keywords: active face tracking, active gaze tracking, selecting an object in 3-D space
using gaze | |||
| Error recovery in a blended style eye gaze and speech interface | | BIBAK | Full-Text | 196-202 | |
| Yeow Kee Tan; Nasser Sherkat; Tony Allen | |||
| In the work carried out earlier [1][2], it was found that an eye gaze and
speech enabled interface was the most preferred form of data entry method when
compared to other methods such as mouse and keyboard, handwriting and speech
only. It was also found that several non-native United Kingdom (UK) English
speakers did not prefer the eye gaze and speech system due to the low
success rate caused by the inaccuracy of the speech recognition component.
Hence in order to increase the usability of the eye gaze and speech data entry
system for these users, error recovery methods are required. In this paper we
present three different multimodal interfaces that employ the use of speech
recognition and eye gaze tracking within a virtual keypad style interface to
allow for the use of error recovery (re-speak with keypad, spelling with keypad
and re-speak and spelling with keypad). Experiments show that through the use
of this virtual keypad interface, an accuracy gain of 10.92% during first
attempt and 6.20% during re-speak by non-native speakers in ambiguous fields
(initials, surnames, city and alphabets) can be achieved [3]. The aim of this
work is to investigate whether the usability of the eye gaze and speech system
can be improved through one of these three blended multimodal error
recovery methods. Keywords: blended multimodal interface, error recovery and usability, eye gaze
tracking, multimodal interface, speech recognition | |||
| Using an autonomous cube for basic navigation and input | | BIBAK | Full-Text | 203-210 | |
| Kristof Van Laerhoven; Nicolas Villar; Albrecht Schmidt; Gerd Kortuem; Hans Gellersen | |||
| This paper presents a low-cost and practical approach to achieve basic input
using a tactile cube-shaped object, augmented with a set of sensors, processor,
batteries and wireless communication. The algorithm we propose combines a
finite state machine model incorporating prior knowledge about the symmetrical
structure of the cube, with maximum likelihood estimation using multivariate
Gaussians. The claim that the presented solution is cheap, fast and requires
few resources is demonstrated by an implementation in a small-sized,
microcontroller-driven hardware configuration with inexpensive sensors. We
conclude with a few prototyped applications that aim at characterizing how the
familiar and elementary shape of the cube allows it to be used as an
interaction device. Keywords: Gaussian modeling, Markov chain, haptic interfaces, maximum likelihood
estimation, sensor-based tactile interfaces | |||
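The maximum-likelihood step can be sketched as follows: model each cube orientation with a multivariate Gaussian over the accelerometer axes and pick the most likely one for a new reading. The means, covariances, and sample reading are illustrative, and the finite state machine over orientation transitions is omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

# One Gaussian per orientation; parameters are illustrative assumptions.
face_models = {
    "face_up":   multivariate_normal(mean=[0.0, 0.0,  1.0], cov=0.05 * np.eye(3)),
    "face_down": multivariate_normal(mean=[0.0, 0.0, -1.0], cov=0.05 * np.eye(3)),
    "face_left": multivariate_normal(mean=[-1.0, 0.0, 0.0], cov=0.05 * np.eye(3)),
}

reading = np.array([0.05, -0.02, 0.97])   # hypothetical 3-axis accelerometer sample (g)
best_face = max(face_models, key=lambda f: face_models[f].logpdf(reading))
print("most likely orientation:", best_face)
```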
| GWindows: robust stereo vision for gesture-based control of windows | | BIBAK | Full-Text | 211-218 | |
| Andrew Wilson; Nuria Oliver | |||
| Perceptual user interfaces promise modes of fluid computer-human interaction
that complement the mouse and keyboard, and have been especially motivated in
non-desktop scenarios, such as kiosks or smart rooms. Such interfaces, however,
have been slow to see use for a variety of reasons, including the computational
burden they impose, a lack of robustness outside the laboratory, unreasonable
calibration demands, and a shortage of sufficiently compelling applications. We
address these difficulties by using a fast stereo vision algorithm for
recognizing hand positions and gestures. Our system uses two inexpensive video
cameras to extract depth information. This depth information enhances automatic
object detection and tracking robustness, and may also be used in applications.
We demonstrate the algorithm in combination with speech recognition to perform
several basic window management tasks, report on a user study probing the ease
of using the system, and discuss the implications of such a system for future
user interfaces. Keywords: computer human interaction, computer vision, gesture recognition, speech
recognition | |||
| A visually grounded natural language interface for reference to spatial scenes | | BIBAK | Full-Text | 219-226 | |
| Peter Gorniak; Deb Roy | |||
| Many user interfaces, from graphic design programs to navigation aids in
cars, share a virtual space with the user. Such applications are often ideal
candidates for speech interfaces that allow the user to refer to objects in the
shared space. We present an analysis of how people describe objects in spatial
scenes using natural language. Based on this study, we describe a system that
uses synthetic vision to "see" such scenes from the person's point of view, and
that understands complex natural language descriptions referring to objects in
the scenes. This system is based on a rich notion of semantic compositionality
embedded in a grounded language understanding framework. We describe its
semantic elements, their compositional behaviour, and their grounding through
the synthetic vision system. To conclude, we evaluate the performance of the
system on unconstrained input. Keywords: cognitive modelling, computational semantics, natural language
understanding, vision based semantics | |||
| Perceptual user interfaces using vision-based eye tracking | | BIBAK | Full-Text | 227-233 | |
| Ravikrishna Ruddarraju; Antonio Haro; Kris Nagel; Quan T. Tran; Irfan A. Essa; Gregory Abowd; Elizabeth D. Mynatt | |||
| We present a multi-camera vision-based eye tracking method to robustly
locate and track a user's eyes as they interact with an application. We propose
enhancements to various vision-based eye-tracking approaches, which include (a)
the use of multiple cameras to estimate head pose and increase coverage of the
sensors and (b) the use of probabilistic measures incorporating Fisher's linear
discriminant to robustly track the eyes under varying lighting conditions in
real-time. We present experiments and quantitative results to demonstrate the
robustness of our eye tracking in two application prototypes. Keywords: Fisher's Discriminant Analysis, computer vision, eye tracking, human
computer interaction, multiple cameras | |||
| Sketching informal presentations | | BIBAK | Full-Text | 234-241 | |
| Yang Li; James A. Landay; Zhiwei Guan; Xiangshi Ren; Guozhong Dai | |||
| Informal presentations are a lightweight means for fast and convenient
communication of ideas. People communicate their ideas to others on paper and
whiteboards, which afford fluid sketching of graphs, words and other expressive
symbols. Unlike existing authoring tools that are designed for formal
presentations, we created SketchPoint to help presenters design informal
presentations via freeform sketching. In SketchPoint, presenters can quickly
author presentations by sketching slide content, overall hierarchical
structures and hyperlinks. To facilitate the transition from idea capture to
communication, a note-taking workspace was built for accumulating ideas and
sketching presentation outlines. Informal feedback showed that SketchPoint is a
promising tool for idea communication. Keywords: gestures, informal presentation, pen-based computers, rapid prototyping,
sketching, storyboards, zooming user interface (ZUI) | |||
| Gestural communication over video stream: supporting multimodal interaction for remote collaborative physical tasks | | BIBAK | Full-Text | 242-249 | |
| Jiazhi Ou; Susan R. Fussell; Xilin Chen; Leslie D. Setlock; Jie Yang | |||
| We present a system integrating gesture and live video to support
collaboration on physical tasks. The architecture combines network IP cameras,
desktop PCs, and tablet PCs to allow a remote helper to draw on a video feed of
a workspace as he/she provides task instructions. A gesture recognition
component enables the system both to normalize freehand drawings to facilitate
communication with remote partners and to use pen-based input as a camera
control device. Results of a preliminary user study suggest that our gesture
over video communication system enhances task performance over traditional
video-only systems. Implications for the design of multimodal systems to
support collaborative physical tasks are also discussed. Keywords: computer-supported cooperative work, gestural communication, gesture
recognition, multimodal interaction, video conferencing, video mediated
communication, video stream | |||
| The role of spoken feedback in experiencing multimodal interfaces as human-like | | BIBAK | Full-Text | 250-257 | |
| Pernilla Qvarfordt; Arne Jönsson; Nils Dahlbäck | |||
| Whether user interfaces should be made human-like or tool-like has been debated
in the HCI field, and this debate affects the development of multimodal
interfaces. However, little empirical study has been done to support either
view so far. Even if there is evidence that humans interpret media as other
humans, this does not mean that humans experience the interfaces as human-like.
We studied how people experience a multimodal timetable system with varying
degree of human-like spoken feedback in a Wizard-of-Oz study. The results
showed that users' views and preferences lean significantly towards
anthropomorphism after actually experiencing the multimodal timetable system.
The more human-like the spoken feedback is, the more participants preferred the
system to be human-like. The results also showed that the users' experience
matched their preferences. This shows that in order to appreciate a human-like
interface, the users have to experience it. Keywords: Wizard of Oz, anthropomorphism, multimodal interaction, spoken feedback | |||
| Real time facial expression recognition in video using support vector machines | | BIBAK | Full-Text | 258-264 | |
| Philipp Michel; Rana El Kaliouby | |||
| Enabling computer systems to recognize facial expressions and infer emotions
from them in real time presents a challenging research topic. In this paper, we
present a real time approach to emotion recognition through facial expression
in live video. We employ an automatic facial feature tracker to perform face
localization and feature extraction. The facial feature displacements in the
video stream are used as input to a Support Vector Machine classifier. We
evaluate our method in terms of recognition accuracy for a variety of
interaction and classification scenarios. Our person-dependent and
person-independent experiments demonstrate the effectiveness of a support
vector machine and feature tracking approach to fully automatic, unobtrusive
expression recognition in live video. We conclude by discussing the relevance
of our work to affective and intelligent man-machine interfaces and exploring
further improvements. Keywords: affective user interfaces, emotion classification, facial expression
analysis, feature tracking, support vector machines | |||
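A minimal sketch of the classification stage: train a support vector machine on facial feature displacement vectors and predict an expression label. The synthetic displacement data, landmark count, and two-class setup are illustrative assumptions, not the paper's tracker output or emotion set.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_landmarks = 22                                  # assumed number of tracked features

def synthetic_samples(offset, n=20):
    # Stand-in for (x, y) displacements of each landmark between a neutral
    # frame and a peak-expression frame.
    return rng.normal(loc=offset, scale=0.1, size=(n, 2 * n_landmarks))

X = np.vstack([synthetic_samples(0.5), synthetic_samples(-0.5)])
y = np.array(["happy"] * 20 + ["sad"] * 20)

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
test_displacements = synthetic_samples(0.5, n=1)
print("predicted expression:", clf.predict(test_displacements)[0])
```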
| Modeling multimodal integration patterns and performance in seniors: toward adaptive processing of individual differences | | BIBAK | Full-Text | 265-272 | |
| Benfang Xiao; Rebecca Lunsford; Rachel Coulston; Matt Wesson; Sharon Oviatt | |||
| Multimodal interfaces are designed with a focus on flexibility, although
very few currently are capable of adapting to major sources of user, task, or
environmental variation. The development of adaptive multimodal processing
techniques will require empirical guidance from quantitative modeling on key
aspects of individual differences, especially as users engage in different
types of tasks in different usage contexts. In the present study, data were
collected from fifteen 66- to 86-year-old healthy seniors as they interacted
with a map-based flood management system using multimodal speech and pen input.
A comprehensive analysis of multimodal integration patterns revealed that
seniors were classifiable as either simultaneous or sequential integrators,
like children and adults. Seniors also demonstrated early predictability and a
high degree of consistency in their dominant integration pattern. However,
greater individual differences in multimodal integration generally were evident
in this population. Perhaps surprisingly, during sequential constructions
seniors' intermodal lags were no longer in average and maximum duration than
those of younger adults, although both of these groups had longer maximum lags
than children. However, an analysis of seniors' performance did reveal lengthy
latencies before initiating a task, and high rates of self talk and
task-critical errors while completing spatial tasks. All of these behaviors
were magnified as the task difficulty level increased. Results of this research
have implications for the design of adaptive processing strategies appropriate
for seniors' applications, especially for the development of temporal
thresholds used during multimodal fusion. The long-term goal of this research
is the design of high-performance multimodal systems that adapt to a full
spectrum of diverse users, supporting tailored and robust future systems. Keywords: human performance errors, multimodal integration, self-regulatory language,
senior users, speech and pen input, task difficulty | |||
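For illustration, here is a minimal sketch of how the integration-pattern classification and per-user temporal fusion thresholds discussed above might be computed; the overlap criterion, threshold values, and function names are assumptions, not the study's actual analysis:

```python
# Illustrative sketch only: classify a user's dominant multimodal integration
# pattern (simultaneous vs. sequential) from speech/pen timing, and derive an
# adaptive temporal threshold for fusion. All parameter values are assumptions.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float  # seconds
    end: float

def intermodal_lag(speech: Interval, pen: Interval) -> float:
    """Gap between the two input signals; 0.0 means they overlapped in time."""
    if speech.start <= pen.end and pen.start <= speech.end:
        return 0.0  # overlapping signals -> a simultaneous construction
    return min(abs(pen.start - speech.end), abs(speech.start - pen.end))

def dominant_pattern(lags, overlap_ratio=0.5) -> str:
    """Label the user by the pattern they produce most often."""
    simultaneous = sum(1 for lag in lags if lag == 0.0)
    return "simultaneous" if simultaneous / len(lags) >= overlap_ratio else "sequential"

def fusion_wait_time(lags, margin=0.5) -> float:
    """A per-user threshold: longest observed intermodal lag plus a margin."""
    return max(lags) + margin
```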
| Auditory, graphical and haptic contact cues for a reach, grasp, and place task in an augmented environment | | BIBAK | Full-Text | 273-276 | |
| Mihaela A. Zahariev; Christine L. MacKenzie | |||
| An experiment was conducted to investigate how performance of a reach, grasp
and place task was influenced by added auditory and graphical cues. The cues
were presented at points in the task, specifically when making contact for
grasping or placing the object, and were presented in single or in combined
modalities. Haptic feedback was always present during physical interaction with
the object. The auditory and graphical cues provided enhanced feedback about
making contact between hand and object and between object and table. Also, the
task was performed with or without vision of the hand. Movements were slower
without vision of the hand. Providing auditory cues clearly facilitated
performance, while graphical contact cues had no additional effect.
Implications are discussed for various uses of auditory displays in virtual
environments. Keywords: Fitts' law, auditory displays, human performance, multimodal displays,
object manipulation, prehension, proprioception, virtual reality, visual
information | |||
| Mouthbrush: drawing and painting by hand and mouth | | BIBAK | Full-Text | 277-280 | |
| Chi-ho Chan; Michael J. Lyons; Nobuji Tetsutani | |||
| We present a novel multimodal interface which permits users to draw or paint
using coordinated gestures of hand and mouth. A headworn camera captures an
image of the mouth and the mouth cavity region is extracted by Fisher
discriminant analysis of the pixel colour information. A normalized area
parameter is read by a drawing or painting program to allow real-time gestural
control of pen/brush parameters by mouth gesture while sketching with a digital
pen/tablet. A new performance task, the Radius Control Task, is proposed as a
means of systematic evaluation of performance of the interface. Data from
preliminary experiments show that with some practice users can achieve single
pixel radius control with ease. A trial of the system by a professional artist
shows that it is ready for use as a novel tool for creative artistic
expression. Keywords: alternative input devices, mouth controller, vision-based interface | |||
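As a rough sketch of the processing described above, the following segments the mouth cavity with a generic linear discriminant on pixel colour (standing in for the Fisher analysis) and maps the normalized cavity area to a brush radius; the training data, colour space, and radius range are placeholders, not the authors' implementation:

```python
# Rough sketch, not the authors' implementation: segment the mouth cavity with
# a linear discriminant on pixel colour and map the normalized cavity area to
# a brush radius. Training data and thresholds are placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Pixels labelled offline as cavity (1) or non-cavity (0); RGB values in [0, 1].
rng = np.random.default_rng(0)
train_pixels = rng.random((1000, 3))
train_labels = rng.integers(0, 2, 1000)
lda = LinearDiscriminantAnalysis().fit(train_pixels, train_labels)

def brush_radius(frame: np.ndarray, min_r: float = 1.0, max_r: float = 30.0) -> float:
    """frame: HxWx3 RGB mouth image in [0, 1]; returns a brush radius in pixels."""
    h, w, _ = frame.shape
    cavity_mask = lda.predict(frame.reshape(-1, 3)).reshape(h, w)
    area = cavity_mask.mean()                 # normalized cavity area in [0, 1]
    return min_r + area * (max_r - min_r)     # wider open mouth -> bigger brush
```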
| XISL: a language for describing multimodal interaction scenarios | | BIBAK | Full-Text | 281-284 | |
| Kouichi Katsurada; Yusaku Nakamura; Hirobumi Yamada; Tsuneo Nitta | |||
| This paper outlines the latest version of XISL (eXtensible Interaction
Scenario Language). XISL is an XML-based markup language for web-based
multimodal interaction systems. XISL makes it possible to describe the
synchronization of multimodal inputs and outputs, dialog flow and transitions,
and other elements required for multimodal interaction. XISL inherits these
features from VoiceXML and SMIL. The distinguishing feature of XISL is its
modality extensibility. We present the basic XISL tags, outline XISL execution
systems, and then compare XISL with other languages. Keywords: XISL, XML, modality extensibility, multimodal interaction | |||
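To illustrate the general idea of an XML-described interaction scenario, here is a hypothetical sketch; the tag names and attributes are invented for illustration and are not the actual XISL schema:

```python
# Hypothetical sketch: the tags below are NOT the real XISL schema, only an
# illustration of parsing an XML scenario that pairs multimodal inputs with
# outputs and dispatching it.
import xml.etree.ElementTree as ET

scenario_xml = """
<scenario>
  <exchange>
    <input modality="speech" event="ask_departure"/>
    <input modality="pen" event="point_station"/>
    <output modality="tts" content="Which station do you leave from?"/>
  </exchange>
</scenario>
"""

root = ET.fromstring(scenario_xml)
for exchange in root.findall("exchange"):
    accepted = [(i.get("modality"), i.get("event")) for i in exchange.findall("input")]
    responses = [(o.get("modality"), o.get("content")) for o in exchange.findall("output")]
    print("wait for any of:", accepted)   # synchronization of multimodal inputs
    print("then render:", responses)      # corresponding multimodal output
```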
| IRYS: a visualization tool for temporal analysis of multimodal interaction | | BIBAK | Full-Text | 285-288 | |
| Daniel Bauer; James D. Hollan | |||
| IRYS is a tool for the replay and analysis of gaze and touch behavior during
on-line activities. Essentially a "multimodal VCR", it can record and replay
computer screen activity and overlay this video with a synchronized "spotlight"
of the user's attention, as measured by an eye-tracking and/or touch-tracking
system. This cross-platform tool is particularly useful for detailed
ethnographic analysis of "natural" on-line behavior involving multiple
applications and windows in a continually changing workspace. Keywords: VNC, digital ethnography, eye tracking, gaze analysis, gaze representation,
haptic, multimodal, temporal analysis, touch tracking, virtual network computer | |||
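A minimal sketch of the "spotlight" overlay idea, assuming screen frames and gaze coordinates are already synchronized; the function name, radius, and dimming factor are illustrative, not the IRYS implementation:

```python
# Minimal sketch (not the IRYS code): overlay a translucent gaze "spotlight"
# on a recorded screen frame; the radius and dimming factor are assumptions.
import numpy as np
import cv2

def overlay_spotlight(frame: np.ndarray, gaze_xy: tuple, radius: int = 60,
                      dim: float = 0.6) -> np.ndarray:
    """Darken everything except a circle around the synchronized gaze point."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    cv2.circle(mask, gaze_xy, radius, 255, thickness=-1)
    dimmed = (frame * (1.0 - dim)).astype(frame.dtype)
    return np.where(mask[..., None] == 255, frame, dimmed)

# Usage: for each recorded video frame, look up the nearest eye-tracker sample
# by timestamp and call overlay_spotlight(frame, (x, y)) before writing it out.
```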
| Towards robust person recognition on handheld devices using face and speaker identification technologies | | BIBAK | Full-Text | 289-292 | |
| Timothy J. Hazen; Eugene Weinstein; Alex Park | |||
| Most face and speaker identification techniques are tested on data collected
in controlled environments using high quality cameras and microphones. However,
the use of these technologies in variable environments, with the inexpensive
sound and image capture hardware found in mobile devices, presents
an additional challenge. In this study, we investigate the application of
existing face and speaker identification techniques to a person identification
task on a handheld device. These techniques have proven to perform accurately
on tightly constrained experiments where the lighting conditions, visual
backgrounds, and audio environments are fixed and specifically adjusted for
optimal data quality. When these techniques are applied on mobile devices where
the visual and audio conditions are highly variable, degradations in
performance can be expected. Under these circumstances, the combination of
multiple biometric modalities can improve the robustness and accuracy of the
person identification task. In this paper, we present our approach for
combining face and speaker identification technologies and experimentally
demonstrate a fused multi-biometric system which achieves a 50% reduction in
equal error rate over the better of the two independent systems. Keywords: face identification, handheld devices, multi-biometric interfaces, speaker
identification | |||
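A simple sketch of score-level fusion and equal-error-rate measurement, under the assumption of normalized match scores and a weighted-sum rule; the paper's actual fusion method may differ:

```python
# Sketch under simple assumptions (not necessarily the paper's fusion rule):
# weighted-sum fusion of face and speaker scores, plus an equal error rate
# measurement over genuine and impostor trials.
import numpy as np

def fuse(face_scores, speaker_scores, w: float = 0.5):
    """Score-level fusion; assumes both scores are normalized to a common range."""
    return w * np.asarray(face_scores) + (1.0 - w) * np.asarray(speaker_scores)

def equal_error_rate(genuine, impostor) -> float:
    """EER: the operating point where false accepts and false rejects balance."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accept
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false reject
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)
```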
| Algorithms for controlling cooperation between output modalities in 2D embodied conversational agents | | BIBAK | Full-Text | 293-296 | |
| Sarkis Abrilian; Jean-Claude Martin; Stéphanie Buisine | |||
| Recent advances in the specification of the multimodal behavior of Embodied
Conversational Agents (ECA) have proposed a direct and deterministic one-step
mapping from high-level specifications of dialog state or agent emotion onto
low-level specifications of the multimodal behavior to be displayed by the
agent (e.g. facial expression, gestures, vocal utterance). The difference in
abstraction between these two levels of specification makes such a complex
mapping difficult to define. In this paper we propose an intermediate
level of specification based on combinations between modalities (e.g.
redundancy, complementarity). We explain how such intermediate level
specifications can be described using XML in the case of deictic expressions.
We define algorithms for parsing such descriptions and generating the
corresponding multimodal behavior of 2D cartoon-like conversational agents.
Some random selection has been introduced in these algorithms in order to
induce some "natural variations" in the agent's behavior. We conclude by
discussing the usefulness of this approach for the design of ECAs. Keywords: embodied conversational agent, multimodal output, redundancy, specification | |||
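As an illustration of an intermediate specification level based on modality cooperation, the sketch below allocates a deictic expression to speech and/or gesture with some random variation; the data structures and behaviour strings are hypothetical, not the paper's XML format:

```python
# Illustrative sketch with hypothetical structures (not the paper's XML format):
# expand an intermediate-level specification of a deictic expression into
# low-level behaviours, with random choice to vary the agent's behaviour.
import random

def realize_deictic(referent: str, cooperation: str = "redundancy") -> list:
    """Map a deictic reference onto speech and/or pointing behaviours."""
    speech = f'say "this {referent}"'
    gesture = f"point at {referent}"
    if cooperation == "redundancy":       # both modalities convey the referent
        return [speech, gesture]
    if cooperation == "complementarity":  # information split across modalities
        return ['say "this one"', gesture]
    # otherwise pick one modality at random to induce "natural variations"
    return [random.choice([speech, gesture])]

print(realize_deictic("red valve", cooperation="complementarity"))
```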
| Towards an attentive robotic dialog partner | | BIBAK | Full-Text | 297-300 | |
| Torsten Wilhelm; Hans-Joachim Böhme; Horst-Michael Gross | |||
| This paper describes a system developed for a mobile service robot which
detects and tracks the position of a user's face in 3D-space using a vision
(skin color) and a sonar based component. To make the skin color detection
robust under varying illumination conditions, it is supplied with an automatic
white balance algorithm. The hypothesis of the user's position is used to
orient the robot's head towards the current user, allowing it to capture
high-resolution images of the user's face suitable for verifying the hypothesis and for
extracting additional information. Keywords: user detection, user tracking | |||
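A simplified sketch of illumination-robust skin-colour detection, using gray-world white balance and a normalized-RGB chromaticity test as stand-ins for the authors' algorithm; the thresholds are illustrative assumptions:

```python
# Simplified sketch (assumed thresholds, not the authors' algorithm): gray-world
# white balance followed by a normalized-RGB skin test, standing in for the
# illumination-robust skin-colour component described above.
import numpy as np

def gray_world_balance(img: np.ndarray) -> np.ndarray:
    """Scale each RGB channel so its mean matches the overall mean intensity."""
    means = img.reshape(-1, 3).mean(axis=0)
    gain = means.mean() / np.maximum(means, 1e-6)
    return np.clip(img * gain, 0, 255).astype(np.uint8)

def skin_mask(img: np.ndarray) -> np.ndarray:
    """Rough chromaticity test in normalized RGB; returns a boolean mask."""
    rgb = img.astype(np.float32) + 1e-6
    norm = rgb / rgb.sum(axis=2, keepdims=True)
    r, g = norm[..., 0], norm[..., 1]
    return (r > 0.36) & (r < 0.47) & (g > 0.28) & (g < 0.36)

# Usage: mask = skin_mask(gray_world_balance(rgb_frame)); the largest connected
# skin region then provides the face-position hypothesis to fuse with sonar.
```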
| Demo: a multi-modal training environment for surgeons | | BIBAK | Full-Text | 301-302 | |
| Shahram Payandeh; John Dill; Graham Wilson; Hui Zhang; Lilong Shi; Alan Lomax; Christine MacKenzie | |||
| This demonstration presents the current state of an on-going team project at
Simon Fraser University in developing a virtual environment for helping to
train surgeons in performing laparoscopic surgery. In collaboration with
surgeons, an initial set of training procedures has been developed. Our goal
has been to develop procedures in each of several general categories, such as
basic hand-eye coordination, single-handed and bi-manual approaches and
dexterous manipulation. The environment is based on an effective data structure
that offers fast graphics and physically based modeling of both rigid and
deformable objects. In addition, the environment supports both 3D and 5D input
devices and devices generating haptic feedback. The demonstration allows users
to interact with a scene using a haptic device. Keywords: haptics, surgery training, surgical simulation, virtual laparoscopy, virtual
reality | |||
| Demo: playing FantasyA with SenToy | | BIBAK | Full-Text | 303-304 | |
| Ana Paiva; Rui Prada; Ricardo Chaves; Marco Vala; Adrian Bullock; Gerd Andersson; Kristina Höök | |||
| Game development is an emerging area for new types of
interaction between computers and humans. New forms of communication are now
being explored there, influenced not only by face to face communication but
also by recent developments in multi-modal communication and tangible
interfaces. This demo will feature a computer game, FantasyA, where users can
play the game by interacting with a tangible interface, SenToy (see Figure 1).
The main idea is to bring objects and artifacts from real life into the ways
users interact with systems, and in particular with games. SenToy is thus an interface
for users to project some of their emotional gestures through moving the doll
in certain ways. The device establishes a link between the user (holding
the physical device) and a controlled avatar (embodied by that physical device)
of the computer game, FantasyA. Keywords: affective computing, synthetic characters, tangible interfaces | |||