
Proceedings of the 2003 International Conference on Multimodal Interfaces

Fullname: ICMI'03 Proceedings of the 5th International Conference on Multimodal Interfaces
Editors: Sharon Oviatt; Trevor Darrell; Mark Maybury; Wolfgang Wahlster
Location: Vancouver, British Columbia, Canada
Dates: 2003-Nov-05 to 2003-Nov-07
Publisher: ACM
Standard No: ISBN 1-58113-621-8
Papers: 50
Pages: 306
  1. Keynote
  2. Joint session with UIST
  3. Attention and integration
  4. Haptics and biometrics
  5. Multimodal architectures and frameworks
  6. User tests and multimodal gesture
  7. Speech and gaze
  8. Posters
  9. Demos

Keynote

Multimodal user interfaces: who's the user? BIBAFull-Text 1
  Anil K. Jain
A wide variety of systems require reliable personal recognition schemes to either confirm or determine the identity of an individual requesting their services. The purpose of such schemes is to ensure that only a legitimate user, and not anyone else, accesses the rendered services. Examples of such applications include secure access to buildings, computer systems, laptops, cellular phones and ATMs. Biometric recognition, or simply biometrics, refers to the automatic recognition of individuals based on their physiological and/or behavioral characteristics. By using biometrics it is possible to confirm or establish an individual's identity based on "who she is", rather than by "what she possesses" (e.g., an ID card) or "what she remembers" (e.g., a password). Current biometric systems make use of fingerprints, hand geometry, iris, face, voice, etc. to establish a person's identity. Biometric systems also introduce an aspect of user convenience. For example, they alleviate the need for a user to remember multiple passwords associated with different applications. A biometric system that uses a single biometric trait for recognition has to contend with problems related to non-universality of the trait, spoof attacks, limited degrees of freedom, large intra-class variability, and noisy data. Some of these problems can be addressed by integrating the evidence presented by multiple biometric traits of a user (e.g., face and iris). Such systems, known as multimodal biometric systems, demonstrate substantial improvement in recognition performance. In this talk, we will present various applications of biometrics, challenges associated with designing biometric systems, and various fusion strategies available to implement a multimodal biometric system.
New techniques for evaluating innovative interfaces with eye tracking BIBAFull-Text 2
  Sandra Marshall
Computer interfaces are changing rapidly, as are the cognitive demands on the operators using them. Innovative applications of new technologies such as multimodal and multimedia displays, haptic and pen-based interfaces, and natural language exchanges bring exciting changes to conventional interface usage. At the same time, their complexity may place overwhelming cognitive demands on the user. As novel interfaces and software applications are introduced into operational settings, it is imperative to evaluate them from a number of different perspectives. One important perspective examines the extent to which a new interface changes the cognitive requirements for the operator.
   This presentation describes a new approach to measuring cognitive effort using metrics based on eye movements and pupil dilation. It is well known that effortful cognitive processing is accompanied by increases in pupil dilation, but measurement techniques were not previously available that could supply results in real time or deal with data collected in long-lasting interactions. We now have a metric-the Index of Cognitive Activity-that is computed in real time as the operator interacts with the interface. The Index can be used to examine extended periods of usage or to assess critical events on an individual-by-individual basis.
   While dilation reveals when cognitive effort is highest, eye movements provide evidence of why. Especially during critical events, one wants to know whether the operator is confused by the presentation or location of specific information, whether he is attending to key information when necessary, or whether he is distracted by irrelevant features of the display. Important details of confusion, attention, and distraction are revealed by traces of his eye movements and statistical analyses of time spent looking at various features during critical events.
   Together, the Index of Cognitive Activity and the various analyses of eye movements provide essential information about how users interact with new interface technologies. Their use can aid designers of innovative hardware and software products by highlighting those features that increase rather than decrease users' cognitive effort.
   In the presentation, the underlying mathematical basis of the Index of Cognitive Activity will be described together with validating research results from a number of experiments. Eye movement analyses from the same studies give clues to the sources of increase in cognitive workload. To illustrate interface evaluation with the ICA and eye movement analysis, several extended examples will be presented using commercial and military displays. [NOTE: Dr. Marshall's eye tracking system will be available to view at Tuesday evening's joint UIST-ICMI demo reception.]
Crossmodal attention and multisensory integration: implications for multimodal interface design BIBAFull-Text 3
  Charles Spence
One of the most important findings to emerge from the field of cognitive psychology in recent years has been the discovery that humans have a very limited ability to process incoming sensory information. In fact, contrary to many of the most influential human operator models, the latest research has shown that humans use the same limited pool of attentional resources to process the inputs arriving from each of their senses (e.g., hearing, vision, touch, smell, etc.). This research calls for a radically new way of examining and understanding the senses, one with major implications for the way we design everything from household products to multimodal user interfaces. Interface designers should realize that the decision to stimulate more senses actually reflects a trade-off between the benefits of utilizing additional senses and the costs associated with dividing attention between different sensory modalities. In this presentation, I will discuss some of the problems associated with dividing attention between eye and ear, as illustrated by talking on a mobile phone while driving. I hope to demonstrate that a better understanding of the senses and, especially, the links between the senses that have been highlighted by recent cognitive neuroscience research will enable interface designers to develop multimodal interfaces that more effectively stimulate the user's senses. (Charles Spence has published more than 70 articles in scientific journals over the past decade.)

Joint session with UIST

A system for fast, full-text entry for small electronic devices BIBAKFull-Text 4-11
  Saied B. Nesbat
A novel text entry system based on the ubiquitous 12-button telephone keypad, and its adaptation for a soft keypad, are presented. This system can be used to enter full text (letters + numbers + special characters) on devices where the number of keys or the keyboard area is limited. Letter-frequency data is used to assign letters to the positions of a 3x3 matrix on the keys, so that the most frequent letters are entered with a double-click. Less frequent letters and characters are entered based on a 3x3 adjacency matrix using an unambiguous, two-keystroke scheme. The same technique is applied to a virtual or soft keyboard layout so letters and characters are entered with taps or slides on an 11-button keypad. Based on the application of Fitts' law, this system is determined to be 67% faster than the QWERTY soft keyboard and 31% faster than the multi-tap text entry system commonly used on cell phones today. The system presented in this paper is implemented and runs on Palm OS PDAs, replacing the built-in QWERTY keyboard and Graffiti recognition systems of these PDAs.
Keywords: Fitts' law, keypad input, mobile phones, mobile systems, pen-based, soft keyboard, stylus input, text entry
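A minimal Python sketch of the kind of Fitts'-law keystroke model used to compare such layouts: predicted movement time between keys is MT = a + b*log2(D/W + 1), weighted by digraph frequencies. The coefficients, key layout, and frequency table below are illustrative assumptions, not the paper's values.
    import math

    # Assumed Fitts coefficients (seconds) and key width, for illustration only.
    A, B = 0.083, 0.127
    KEY_WIDTH = 1.0

    # Hypothetical 3x3 key centres (row, col) for a reduced keypad.
    keys = {ch: (i // 3, i % 3) for i, ch in enumerate("abcdefghi")}

    def movement_time(frm, to):
        """Fitts' law MT for one keystroke; a repeated tap still costs 'a'."""
        (r1, c1), (r2, c2) = keys[frm], keys[to]
        dist = math.hypot(r1 - r2, c1 - c2)
        return A + B * math.log2(dist / KEY_WIDTH + 1)

    def mean_entry_time(digraph_freq):
        """Frequency-weighted mean time per character.

        digraph_freq: {(prev_char, next_char): probability}, summing to 1.
        """
        return sum(p * movement_time(a, b) for (a, b), p in digraph_freq.items())

    # Toy digraph table (uniform); real layouts are scored with corpus statistics.
    pairs = [(a, b) for a in keys for b in keys]
    uniform = {pair: 1 / len(pairs) for pair in pairs}
    print(f"predicted mean time/char: {mean_entry_time(uniform):.3f} s")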
Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality BIBAKFull-Text 12-19
  Ed Kaiser; Alex Olwal; David McGee; Hrvoje Benko; Andrea Corradini; Xiaoguang Li; Phil Cohen; Steven Feiner
We describe an approach to 3D multimodal interaction in immersive augmented and virtual reality environments that accounts for the uncertain nature of the information sources. The resulting multimodal system fuses symbolic and statistical information from a set of 3D gesture, spoken language, and referential agents. The referential agents employ visible or invisible volumes that can be attached to 3D trackers in the environment, and which use a time-stamped history of the objects that intersect them to derive statistics for ranking potential referents. We discuss the means by which the system supports mutual disambiguation of these modalities and information sources, and show through a user study how mutual disambiguation accounts for over 45% of the successful 3D multimodal interpretations. An accompanying video demonstrates the system in action.
Keywords: augmented/virtual reality, evaluation, multimodal interaction

Attention and integration

Learning and reasoning about interruption BIBAKFull-Text 20-27
  Eric Horvitz; Johnson Apacible
We present methods for inferring the cost of interrupting users based on multiple streams of events including information generated by interactions with computing devices, visual and acoustical analyses, and data drawn from online calendars. Following a review of prior work on techniques for deliberating about the cost of interruption associated with notifications, we introduce methods for learning models from data that can be used to compute the expected cost of interruption for a user. We describe the Interruption Workbench, a set of event-capture and modeling tools. Finally, we review experiments that characterize the accuracy of the models for predicting interruption cost and discuss research directions.
Keywords: cognitive models, divided attention, interruption, notifications
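A minimal Python sketch of the expected-cost-of-interruption idea described above: a learned model yields a probability distribution over attentional states given the current evidence, and the expected cost is the probability-weighted sum of per-state interruption costs. The states, costs, and posterior below are made-up illustrations, not values from the Interruption Workbench.
    def expected_interruption_cost(state_probs, state_costs):
        """E[cost] = sum_s P(s | evidence) * cost(interrupt | s)."""
        return sum(p * state_costs[s] for s, p in state_probs.items())

    # Hypothetical posterior over attentional states given current evidence
    # (desktop events, vision/audio analyses, calendar data).
    posterior = {"focused_solo_work": 0.55, "conversation": 0.30, "idle": 0.15}
    # Hypothetical per-state cost of an interruption.
    costs = {"focused_solo_work": 2.0, "conversation": 3.5, "idle": 0.1}

    ecoi = expected_interruption_cost(posterior, costs)
    print(f"expected cost of interruption: {ecoi:.2f}")
    # A notification policy might defer delivery whenever this expected cost
    # exceeds the (assumed) value of delivering the message now.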
Providing the basis for human-robot-interaction: a multi-modal attention system for a mobile robot BIBAKFull-Text 28-35
  Sebastian Lang; Marcus Kleinehagenbrock; Sascha Hohenner; Jannik Fritsch; Gernot A. Fink; Gerhard Sagerer
In order to enable the widespread use of robots in home and office environments, systems with natural interaction capabilities have to be developed. A prerequisite for natural interaction is the robot's ability to automatically recognize when and how long a person's attention is directed towards it for communication. As in open environments several persons can be present simultaneously, the detection of the communication partner is of particular importance. In this paper we present an attention system for a mobile robot which enables the robot to shift its attention to the person of interest and to maintain attention during interaction. Our approach is based on a method for multi-modal person tracking which uses a pan-tilt camera for face recognition, two microphones for sound source localization, and a laser range finder for leg detection. Shifting of attention is realized by turning the camera toward the person who is currently speaking. Head orientation is then used to decide whether the speaker is addressing the robot. The performance of the proposed approach is demonstrated with an evaluation. In addition, qualitative results from the performance of the robot at the exhibition of ICVS'03 are provided.
Keywords: attention, human-robot-interaction, multi-modal person tracking
Selective perception policies for guiding sensing and computation in multimodal systems: a comparative analysis BIBAKFull-Text 36-43
  Nuria Oliver; Eric Horvitz
Intensive computations required for sensing and processing perceptual information can impose significant burdens on personal computer systems. We explore several policies for selective perception in SEER, a multimodal system for recognizing office activity that relies on a layered Hidden Markov Model representation. We review our efforts to employ expected-value-of-information (EVI) computations to limit sensing and analysis in a context-sensitive manner. We discuss an implementation of a one-step myopic EVI analysis and compare the results of using the myopic EVI with a heuristic sensing policy that makes observations at different frequencies. Both policies are then compared to a random perception policy, where sensors are selected at random. Finally, we discuss the sensitivity of ideal perceptual actions to preferences encoded in utility models about information value and the cost of sensing.
Keywords: Hidden Markov models, automatic feature selection, expected value of information, human behavior recognition, multi-modal interaction, office awareness, selective perception
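A toy Python sketch of a one-step (myopic) expected-value-of-information decision of the kind discussed above: for a candidate observation, compare the expected utility of acting after seeing it (marginalizing over its possible readings) with the utility of acting now, minus the sensing cost. All beliefs, utilities, and likelihoods here are illustrative; SEER derives them from its layered-HMM state.
    def expected_utility(belief, utility):
        """Utility of acting on the current belief: max_a sum_s P(s) U(s, a)."""
        actions = {a for s in utility for a in utility[s]}
        return max(sum(belief[s] * utility[s][a] for s in belief) for a in actions)

    def myopic_evi(belief, utility, likelihood, cost):
        """EVI = E_obs[ EU(posterior) ] - EU(prior) - sensing cost."""
        base = expected_utility(belief, utility)
        evi = -cost - base
        for obs, p_obs_given_s in likelihood.items():
            p_obs = sum(belief[s] * p_obs_given_s[s] for s in belief)
            posterior = {s: belief[s] * p_obs_given_s[s] / p_obs for s in belief}
            evi += p_obs * expected_utility(posterior, utility)
        return evi

    belief = {"meeting": 0.4, "phone": 0.6}                     # P(state)
    utility = {"meeting": {"log_meeting": 1, "log_phone": 0},    # U(state, action)
               "phone":   {"log_meeting": 0, "log_phone": 1}}
    likelihood = {"speech":  {"meeting": 0.7, "phone": 0.9},     # P(obs | state)
                  "silence": {"meeting": 0.3, "phone": 0.1}}

    print(f"EVI of turning the microphone on: {myopic_evi(belief, utility, likelihood, 0.05):.3f}")
    # Sense only when the EVI is positive; otherwise skip the computation.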
Toward a theory of organized multimodal integration patterns during human-computer interaction BIBAKFull-Text 44-51
  Sharon Oviatt; Rachel Coulston; Stefanie Tomko; Benfang Xiao; Rebecca Lunsford; Matt Wesson; Lesley Carmichael
As a new generation of multimodal systems begins to emerge, one dominant theme will be the integration and synchronization requirements for combining modalities into robust whole systems. In the present research, quantitative modeling is presented on the organization of users' speech and pen multimodal integration patterns. In particular, the potential malleability of users' multimodal integration patterns is explored, as well as variation in these patterns during system error handling and tasks varying in difficulty. Using a new dual-wizard simulation method, data was collected from twelve adults as they interacted with a map-based task using multimodal speech and pen input. Analyses based on over 1600 multimodal constructions revealed that users' dominant multimodal integration pattern was resistant to change, even when strong selective reinforcement was delivered to encourage switching from a sequential to simultaneous integration pattern, or vice versa. Instead, both sequential and simultaneous integrators showed evidence of entrenching further in their dominant integration patterns (i.e., increasing either their inter-modal lag or signal overlap) over the course of an interactive session, during system error handling, and when completing increasingly difficult tasks. In fact, during error handling these changes in the co-timing of multimodal signals became the main feature of hyper-clear multimodal language, with elongation of individual signals either attenuated or absent. Whereas Behavioral/Structuralist theory cannot account for these data, it is argued that Gestalt theory provides a valuable framework and insights into multimodal interaction. Implications of these findings are discussed for the development of a coherent theory of multimodal integration during human-computer interaction, and for the design of a new class of adaptive multimodal interfaces.
Keywords: Gestalt theory, co-timing, entrenchment, error handling, multimodal integration, speech and pen input, task difficulty

Haptics and biometrics

TorqueBAR: an ungrounded haptic feedback device BIBAKFull-Text 52-59
  Colin Swindells; Alex Unden; Tao Sang
Kinesthetic feedback is a key mechanism by which people perceive object properties during their daily tasks -- particularly inertial properties. For example, transporting a glass of water without spilling, or dynamically positioning a handheld tool such as a hammer, both require inertial kinesthetic feedback. We describe a prototype for a novel ungrounded haptic feedback device, the TorqueBAR, that exploits a kinesthetic awareness of dynamic inertia to simulate complex coupled motion as both a display and input device. As a user tilts the TorqueBAR to sense and control computer programmed stimuli, the TorqueBAR's centre-of-mass changes in real-time according to the user's actions. We evaluate the TorqueBAR using both quantitative and qualitative techniques, and we describe possible applications for the device such as video games and real-time robot navigation.
Keywords: 1 DOF, haptic rod, input device, mobile computing, tilt controller, torque feedback, two-handed, ungrounded force feedback
Towards tangibility in gameplay: building a tangible affective interface for a computer game BIBAKFull-Text 60-67
  Ana Paiva; Rui Prada; Ricardo Chaves; Marco Vala; Adrian Bullock; Gerd Andersson; Kristina Höök
In this paper we describe a way of controlling the emotional states of a synthetic character in a game (FantasyA) through a tangible interface named SenToy. SenToy is a doll with sensors in the arms, legs and body, allowing the user to influence the emotions of her character in the game. The user performs gestures and movements with SenToy, which are picked up by the sensors and interpreted according to a scheme found through an initial Wizard of Oz study. Different gestures are used to express each of the following emotions: anger, fear, happiness, surprise, sadness and gloating. Depending upon the expressed emotion, the synthetic character in FantasyA will, in turn, perform different actions. The evaluation of SenToy acting as the interface to the computer game FantasyA has shown that users were able to express most of the desired emotions to influence the synthetic characters, and that overall, players, especially children, really liked the doll as an interface.
Keywords: affective computing, characters, synthetic, tangible interfaces
Multimodal biometrics: issues in design and testing BIBAKFull-Text 68-72
  Robert Snelick; Mike Indovina; James Yen; Alan Mink
Experimental studies show that multimodal biometric systems for small-scale populations perform better than single-mode biometric systems. We examine if such techniques scale to larger populations, introduce a methodology to test the performance of such systems, and assess the feasibility of using commercial off-the-shelf (COTS) products to construct deployable multimodal biometric systems. A key aspect of our approach is to leverage confidence level scores from preexisting single-mode data. An example presents a multimodal biometrics system analysis that explores various normalization and fusion techniques for face and fingerprint classifiers. This multimodal analysis uses a population of about 1000 subjects, a number ten times larger than in any previously reported study. Experimental results combining face and fingerprint biometric classifiers reveal significant performance improvement over single-mode biometric systems.
Keywords: evaluation, fusion, multimodal biometrics, normalization, system design, testing methodology
Sensitivity to haptic-audio asynchrony BIBAKFull-Text 73-76
  Bernard D. Adelstein; Durand R. Begault; Mark R. Anderson; Elizabeth M. Wenzel
The natural role of sound in actions involving mechanical impact and vibration suggests the use of auditory display as an augmentation to virtual haptic interfaces. In order to budget available computational resources for sound simulation, the perceptually tolerable asynchrony between paired haptic-auditory sensations must be known. This paper describes a psychophysical study of detectable time delay between a voluntary hammer tap and its auditory consequence (a percussive sound of either 1, 50, or 200 ms duration). The results show Just Noticeable Differences (JNDs) for temporal asynchrony of 24 ms with insignificant response bias. The invariance of JND and response bias as a function of sound duration in this experiment indicates that observers cued on the initial attack of the auditory stimuli.
Keywords: audio, cross-modal asynchrony, haptic, latency, multi-modal interfaces, time delay, virtual environments
A multi-modal approach for determining speaker location and focus BIBAFull-Text 77-80
  Michael Siracusa; Louis-Philippe Morency; Kevin Wilson; John Fisher; Trevor Darrell
This paper presents a multi-modal approach to locate a speaker in a scene and determine to whom he or she is speaking. We present a simple probabilistic framework that combines multiple cues derived from both audio and video information. A purely visual cue is obtained using a head tracker to identify possible speakers in a scene and provide both their 3-D positions and orientation. In addition, estimates of the audio signal's direction of arrival are obtained with the help of a two-element microphone array. A third cue measures the association between the audio and the tracked regions in the video. Integrating these cues provides a more robust solution than using any single cue alone. The usefulness of our approach is shown in our results for video sequences with two or more people in a prototype interactive kiosk environment.
Distributed and local sensing techniques for face-to-face collaboration BIBAKFull-Text 81-84
  Ken Hinckley
This paper describes techniques that allow users to collaborate on tablet computers that employ distributed sensing techniques to establish a privileged connection between devices. Each tablet is augmented with a two-axis linear accelerometer (tilt sensor), touch sensor, proximity sensor, and light sensor. The system recognizes when users bump two tablets together by looking for spikes in each tablet's accelerometer data that are synchronized in time; bumping establishes a privileged connection between the devices. Users can face one another and bump the tops of two tablets together to establish a collaborative face-to-face workspace. The system then uses the sensors to enhance transitions between personal work and shared work. For example, a user can hold his or her hand near the top of the workspace to "shield" the display from the other user. This gesture is sensed using the proximity sensor together with the light sensor, allowing for quick "asides" into private information or to sketch an idea in a personal workspace. Picking up, putting down, or walking away with a tablet are also sensed, as is angling the tablet towards the other user. Much research in single display groupware considers shared displays and shared artifacts, but our system explores a unique form of dual display groupware for face-to-face communication and collaboration using personal display devices.
Keywords: co-present collaboration, context awareness, sensing techniques
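A minimal Python sketch of the synchronized-spike idea described above: each tablet reports acceleration samples, and two devices are treated as bumped together when both show a spike within a small time window. The spike threshold and synchronization window are assumptions; the actual system tunes these and also uses its other sensors.
    # Threshold (g) and synchronization window (s) are illustrative values.
    SPIKE_THRESHOLD = 2.0
    SYNC_WINDOW = 0.05

    def spike_times(samples):
        """samples: list of (timestamp_s, acceleration_g); return spike timestamps."""
        return [t for (t, a), (_, a_prev) in zip(samples[1:], samples[:-1])
                if abs(a - a_prev) > SPIKE_THRESHOLD]

    def bumped_together(samples_a, samples_b):
        """True if both devices show a spike within SYNC_WINDOW of each other."""
        return any(abs(ta - tb) < SYNC_WINDOW
                   for ta in spike_times(samples_a)
                   for tb in spike_times(samples_b))

    tablet_a = [(0.00, 0.0), (0.01, 0.1), (0.02, 3.5), (0.03, 0.2)]
    tablet_b = [(0.00, 0.0), (0.01, 0.0), (0.02, 3.1), (0.03, 0.1)]
    if bumped_together(tablet_a, tablet_b):
        print("establish privileged connection between the two tablets")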

Multimodal architectures and frameworks

Georgia Tech gesture toolkit: supporting experiments in gesture recognition BIBAKFull-Text 85-92
  Tracy Westeyn; Helene Brashear; Amin Atrash; Thad Starner
Gesture recognition is becoming a more common interaction tool in the fields of ubiquitous and wearable computing. Designing a system to perform gesture recognition, however, can be a cumbersome task. Hidden Markov models (HMMs), a pattern recognition technique commonly used in speech recognition, can be used for recognizing certain classes of gestures. Existing HMM toolkits for speech recognition can be adapted to perform gesture recognition, but doing so requires significant knowledge of the speech recognition literature and its relation to gesture recognition. This paper introduces the Georgia Tech Gesture Toolkit GT2k which leverages Cambridge University's speech recognition toolkit, HTK, to provide tools that support gesture recognition research. GT2k provides capabilities for training models and allows for both real-time and off-line recognition. This paper presents four ongoing projects that utilize the toolkit in a variety of domains.
Keywords: American sign language, context recognition, gesture recognition, hidden Markov models, interfaces, toolkit, wearable computers
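The following Python sketch is not GT2k or HTK; it is an illustrative stand-in (assuming the third-party hmmlearn package is available) for the pattern such toolkits automate: train one HMM per gesture class, then classify a new sequence by the highest log-likelihood.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM   # assumed dependency, not part of GT2k

    def train_gesture_models(examples_by_class, n_states=4):
        """examples_by_class: {label: [np.ndarray of shape (T_i, n_features)]}."""
        models = {}
        for label, sequences in examples_by_class.items():
            X = np.vstack(sequences)                     # concatenated frames
            lengths = [len(seq) for seq in sequences]    # per-sequence lengths
            hmm = GaussianHMM(n_components=n_states, covariance_type="diag")
            hmm.fit(X, lengths)
            models[label] = hmm
        return models

    def classify(models, sequence):
        """Return the gesture label whose HMM gives the highest log-likelihood."""
        return max(models, key=lambda label: models[label].score(sequence))

    # Usage sketch with random data standing in for tracked hand features.
    rng = np.random.default_rng(0)
    data = {"wave":  [rng.normal(0, 1, (30, 3)) for _ in range(5)],
            "point": [rng.normal(2, 1, (30, 3)) for _ in range(5)]}
    models = train_gesture_models(data)
    print(classify(models, rng.normal(2, 1, (30, 3))))   # likely "point"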
Architecture and implementation of multimodal plug and play BIBAKFull-Text 93-100
  Christian Elting; Stefan Rapp; Gregor Möhler; Michael Strube
This paper describes the handling of multimodality in the Embassi system. Here, multimodality is treated in two modules. Firstly, a modality fusion component merges speech, video traced pointing gestures, and input from a graphical user interface. Secondly, a presentation planning component decides upon the modality to be used for the output, i.e., speech, an animated life-like character (ALC) and/or the graphical user interface, and ensures that the presentation is coherent and cohesive. We describe how these two components work and emphasize one particular feature of our system architecture: All modality analysis components generate output in a common semantic description format and all render components process input in a common output language. This makes it particularly easy to add or remove modality analyzers or renderer components, even dynamically while the system is running. This plug and play of modalities can be used to adjust the system's capabilities to different demands of users and their situative context. In this paper we give details about the implementations of the models, protocols and modules that are necessary to realize those features.
Keywords: dialog systems, multimodal, multimodal fission, multimodal fusion
SmartKom: adaptive and flexible multimodal access to multiple applications BIBAKFull-Text 101-108
  Norbert Reithinger; Jan Alexandersson; Tilman Becker; Anselm Blocher; Ralf Engel; Markus Löckelt; Jochen Müller; Norbert Pfleger; Peter Poller; Michael Streit; Valentin Tschernomas
The development of an intelligent user interface that supports multimodal access to multiple applications is a challenging task. In this paper we present a generic multimodal interface system where the user interacts with an anthropomorphic personalized interface agent using speech and natural gestures. The knowledge-based and uniform approach of SmartKom enables us to realize a comprehensive system that understands imprecise, ambiguous, or incomplete multimodal input and generates coordinated, cohesive, and coherent multimodal presentations for three scenarios, currently addressing more than 50 different functionalities of 14 applications. We demonstrate the main ideas in a walk through the main processing steps from modality fusion to modality fission.
Keywords: intelligent multimodal interfaces, multiple applications, system description
A framework for rapid development of multimodal interfaces BIBAKFull-Text 109-116
  Frans Flippo; Allen Krebs; Ivan Marsic
Despite the availability of multimodal devices, there are very few commercial multimodal applications available. One reason for this may be the lack of a framework to support development of multimodal applications in reasonable time and with limited resources. This paper describes a multimodal framework enabling rapid development of applications using a variety of modalities and methods for ambiguity resolution, featuring a novel approach to multimodal fusion. An example application is studied that was created using the framework.
Keywords: application frameworks, command and control, direct manipulation, multimodal fusion, multimodal interfaces

User tests and multimodal gesture

Capturing user tests in a multimodal, multidevice informal prototyping tool BIBAKFull-Text 117-124
  Anoop K. Sinha; James A. Landay
Interaction designers are increasingly faced with the challenge of creating interfaces that incorporate multiple input modalities, such as pen and speech, and span multiple devices. Few early stage prototyping tools allow non-programmers to prototype these interfaces. Here we describe CrossWeaver, a tool for informally prototyping multimodal, multidevice user interfaces. This tool embodies the informal prototyping paradigm, leaving design representations in an informal, sketched form, and creates a working prototype from these sketches. CrossWeaver allows a user interface designer to sketch storyboard scenes on the computer, specifying simple multimodal command transitions between scenes. The tool also allows scenes to target different output devices. Prototypes can run across multiple standalone devices simultaneously, processing multimodal input from each one. Thus, a designer can visually create a multimodal prototype for a collaborative meeting or classroom application. CrossWeaver captures all of the user interaction when running a test of a prototype. This input log can quickly be viewed visually for the details of the users' multimodal interaction or it can be replayed across all participating devices, giving the designer information to help him or her analyze and iterate on the interface design.
Keywords: informal prototyping, mobile interface design, multidevice, multimodal, pen and speech input, sketching
Large vocabulary sign language recognition based on hierarchical decision trees BIBAKFull-Text 125-131
  Gaolin Fang; Wen Gao; Debin Zhao
The major difficulty for large vocabulary sign language or gesture recognition lies in the huge search space due to the variety of recognized classes. How to reduce the recognition time without loss of accuracy is a challenging issue. In this paper, a hierarchical decision tree is first presented for large vocabulary sign language recognition based on the divide-and-conquer principle. As sign features differ in their importance for distinguishing gestures, corresponding classifiers are proposed for hierarchical decisions over gesture attributes. A one-/two-handed classifier with little computational cost is first used to eliminate many impossible candidates. The subsequent hand shape classifier is performed on the remaining candidate space. A SOFM/HMM classifier is employed to get the final results at the last non-leaf nodes, which include only a few candidates. Experimental results on a large vocabulary of 5113 signs show that the proposed method reduces the recognition time by a factor of 11 and also improves the recognition rate by about 0.95% over a single SOFM/HMM.
Keywords: Gaussian mixture model, finite state machine, gesture recognition, hierarchical decision tree, sign language recognition
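A minimal Python sketch of the hierarchical pruning idea only: cheap tests first shrink the candidate vocabulary (one- vs. two-handed, then hand shape), and the expensive sequence model runs only on the survivors. The individual classifiers are stubbed out here; they stand in for the paper's real components.
    def classify_sign(features, vocabulary):
        """vocabulary: {sign: {'two_handed': bool, 'handshape': str}}."""
        # Level 1: cheap one-/two-handed test eliminates most candidates.
        two_handed = is_two_handed(features)
        candidates = {s for s, attrs in vocabulary.items()
                      if attrs["two_handed"] == two_handed}

        # Level 2: hand-shape classifier prunes the remaining candidates.
        shape = classify_handshape(features)
        candidates = {s for s in candidates if vocabulary[s]["handshape"] == shape}

        # Level 3: run the expensive sequence model only on the few survivors.
        return max(candidates, key=lambda s: sequence_model_score(s, features))

    # Stub implementations so the sketch runs; replace with real classifiers.
    def is_two_handed(features):            return features["hands"] == 2
    def classify_handshape(features):       return features["shape"]
    def sequence_model_score(sign, feats):  return -abs(hash(sign)) % 100

    vocab = {"HELLO":  {"two_handed": False, "handshape": "flat"},
             "THANKS": {"two_handed": False, "handshape": "flat"},
             "BOOK":   {"two_handed": True,  "handshape": "flat"}}
    print(classify_sign({"hands": 1, "shape": "flat"}, vocab))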
Hand motion gestural oscillations and multimodal discourse BIBAKFull-Text 132-139
  Yingen Xiong; Francis Quek; David McNeill
To develop multimodal interfaces, one needs to understand the constraints underlying human communicative gesticulation and the kinds of features one may compute based on these underlying human characteristics.
   In this paper we address hand motion oscillatory gesture detection in natural speech and conversation. First, the hand motion trajectory signals are extracted from video. Second, a wavelet analysis based approach is presented to process the signals. In this approach, wavelet ridges are extracted from the responses of wavelet analysis for the hand motion trajectory signals, which can be used to characterize frequency properties of the hand motion signals. The hand motion oscillatory gestures can be extracted from these frequency properties. Finally, we relate the hand motion oscillatory gestures to the phases of speech and multimodal discourse analysis.
   We demonstrate the efficacy of the system on a real discourse dataset in which a subject described her action plan to an interlocutor. We extracted the oscillatory gestures from the x, y and z motion traces of both hands. We further demonstrate the power of gestural oscillation detection as a key to unlock the structure of the underlying discourse.
Keywords: gesture symmetry, hand gesture, hand motion trajectory, interaction, multimodal, multimodal discourse structure, speech analysis
Pointing gesture recognition based on 3D-tracking of face, hands and head orientation BIBAKFull-Text 140-146
  Kai Nickel; Rainer Stiefelhagen
In this paper, we present a system capable of visually detecting pointing gestures and estimating the 3D pointing direction in real-time. In order to acquire input features for gesture recognition, we track the positions of a person's face and hands on image sequences provided by a stereo-camera. Hidden Markov Models (HMMs), trained on different phases of sample pointing gestures, are used to classify the 3D-trajectories in order to detect the occurrence of a gesture. When analyzing sample pointing gestures, we noticed that humans tend to look at the pointing target while performing the gesture. In order to utilize this behavior, we additionally measured head orientation by means of a magnetic sensor in a similar scenario. By using head orientation as an additional feature, we observed significant gains in both recall and precision of pointing gestures. Moreover, the percentage of correctly identified pointing targets improved significantly from 65% to 83%. For estimating the pointing direction, we comparatively used three approaches: 1) The line of sight between head and hand, 2) the forearm orientation, and 3) the head orientation.
Keywords: computer vision, gesture recognition, person tracking, pointing gestures
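A short Python sketch of one of the three pointing-direction estimates the paper compares, the head-to-hand line of sight: cast a ray from the head through the hand and pick the target with the smallest angular deviation from that ray. The coordinates below are made up; in the system they come from the stereo tracker.
    import numpy as np

    def pointed_target(head, hand, targets):
        """head, hand: (3,) arrays; targets: {name: (3,) array}."""
        direction = hand - head
        direction = direction / np.linalg.norm(direction)

        def angle_to(target):
            to_target = target - head
            to_target = to_target / np.linalg.norm(to_target)
            return np.arccos(np.clip(np.dot(direction, to_target), -1.0, 1.0))

        return min(targets, key=lambda name: angle_to(targets[name]))

    head = np.array([0.0, 1.7, 0.0])          # metres, tracker coordinates
    hand = np.array([0.3, 1.4, 0.5])
    targets = {"lamp": np.array([1.2, 0.5, 2.0]),
               "door": np.array([-2.0, 1.0, 1.0])}
    print(pointed_target(head, hand, targets))   # -> "lamp"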
Untethered gesture acquisition and recognition for a multimodal conversational system BIBAKFull-Text 147-150
  T. Ko; D. Demirdjian; T. Darrell
Humans use a combination of gesture and speech to convey meaning, and usually do so without holding a device or pointer. We present a system that incorporates body tracking and gesture recognition for an untethered human-computer interface. This research focuses on a module that provides parameterized gesture recognition, using various machine learning techniques. We train a support vector classifier to model the boundary of the space of possible gestures, and train Hidden Markov Models on specific gestures. Given a sequence, we can find the start and end of various gestures using the support vector classifier, and find gesture likelihoods and parameters with an HMM. Finally, multimodal recognition is performed using rank-order fusion to merge speech and vision hypotheses.
Keywords: articulated tracking, hidden Markov models, speech, support vector machines, vision
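A minimal Python sketch of generic rank-order fusion as a stand-in for the fusion step named above: each modality contributes only the ranks it assigns to its hypotheses, and the hypothesis with the lowest summed rank wins. The hypothesis lists are invented for illustration.
    def rank_order_fuse(*ranked_lists):
        """Each argument is a list of hypotheses, best first.  Hypotheses missing
        from a list get a rank one past its end."""
        candidates = {h for lst in ranked_lists for h in lst}
        def combined_rank(h):
            return sum(lst.index(h) if h in lst else len(lst) for lst in ranked_lists)
        return sorted(candidates, key=combined_rank)

    speech_hypotheses = ["move the chair", "move the char", "remove the chair"]
    vision_hypotheses = ["point-at chair", "move the chair", "circle gesture"]

    fused = rank_order_fuse(speech_hypotheses, vision_hypotheses)
    print(fused[0])   # "move the chair": rank 0 (speech) + rank 1 (vision)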

Speech and gaze

Where is "it"? Event Synchronization in Gaze-Speech Input Systems BIBAKFull-Text 151-158
  Manpreet Kaur; Marilyn Tremaine; Ning Huang; Joseph Wilder; Zoran Gacovski; Frans Flippo; Chandra Sekhar Mantravadi
The relationship between gaze and speech is explored for the simple task of moving an object from one location to another on a computer screen. The subject moves a designated object from a group of objects to a new location on the screen by stating, "Move it there". Gaze and speech data are captured to determine if we can robustly predict the selected object and destination position. We have found that the source fixation closest to the desired object begins, with high probability, before the beginning of the word "Move". An analysis of all fixations before and after speech onset time shows that the fixation that best identifies the object to be moved occurs, on average, 630 milliseconds before speech onset with a range of 150 to 1200 milliseconds for individual subjects. The variance in these times for individuals is relatively small although the variance across subjects is large. Selecting a fixation closest to the onset of the word "Move" as the designator of the object to be moved gives a system accuracy close to 95% for all subjects. Thus, although significant differences exist between subjects, we believe that the speech and gaze integration patterns can be modeled reliably for individual users and therefore be used to improve the performance of multimodal systems.
Keywords: eye-tracking, gaze-speech co-occurrence, multimodal fusion, multimodal interfaces
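A minimal Python sketch of the selection rule the study supports: take the fixation whose onset is closest to the onset of the spoken word "Move", and treat the object under that fixation as the referent of "it". The timestamps and object names below are illustrative.
    def referent_for_move(fixations, move_onset_ms):
        """fixations: list of (onset_ms, object_id); returns the chosen object."""
        onset, obj = min(fixations, key=lambda f: abs(f[0] - move_onset_ms))
        return obj

    fixations = [(1180, "triangle"), (1620, "square"), (2350, "circle")]
    move_onset_ms = 2250     # speech recognizer's onset time for "Move"
    print(referent_for_move(fixations, move_onset_ms))   # -> "circle"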
Eyetracking in cognitive state detection for HCI BIBAKFull-Text 159-163
  Darrell S. Rudmann; George W. McConkie; Xianjun Sam Zheng
1. Past research in a number of fields confirms the existence of a link between cognition and eye movement control, beyond simply a pointing relationship. This being the case, it should be possible to use eye movement recording as a basis for detecting users' cognitive states in real time. Several examples of such cognitive state detectors have been reported in the literature.
   2. A multi-disciplinary project is described in which the goal is to provide the computer with as much real-time information about the human state (cognitive, affective and motivational state) as possible, and to base computer actions on this information. The application area in which this is being implemented is science education, learning about gears through exploration. Two studies are reported in which participants solve simple problems of pictured gear trains while their eye movements are recorded. The first study indicates that most eye movement sequences are compatible with predictions of a simple sequential cognitive model, and it is suggested that those sequences that do not fit the model may be of particular interest in the HCI context as indicating problems or alternative mental strategies. The mental rotation of gears sometimes produces sequences of short eye movements in the direction of motion; thus, such sequences may be useful as cognitive state detectors. The second study tested the hypothesis that participants are thinking about the object to which their eyes are directed. In this study, the display was turned off partway through the process of solving a problem, and the participants reported what they were thinking about at that time. While in most cases the participants reported cognitive activities involving the fixated object, this was not the case on a sizeable number of trials.
Keywords: cognitive state, eye tracking
A multimodal learning interface for grounding spoken language in sensory perceptions BIBAKFull-Text 164-171
  Chen Yu; Dana H. Ballard
Most speech interfaces are based on natural language processing techniques that use pre-defined symbolic representations of word meanings and process only linguistic information. To understand and use language like their human counterparts in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The learning interface is trained in unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals in concert with multisensory information from non-speech modalities, such as user's perspective video, gaze positions, head directions and hand movements. The system firstly estimates users' focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used for spotting the target object of user interest. Attention switches are calculated and used to segment an action sequence into action units which are then categorized by mixture hidden Markov models. A multimodal learning algorithm is developed to spot words from continuous speech and then associate them with perceptually grounded meanings extracted from visual perception and action. Successful learning has been demonstrated in the experiments of three natural tasks: "unscrewing a jar", "stapling a letter" and "pouring water".
Keywords: language acquisition, machine learning, multimodal integration
A computer-animated tutor for spoken and written language learning BIBAKFull-Text 172-175
  Dominic W. Massaro
Baldi, a computer-animated talking head, is introduced. The quality of his visible speech has been repeatedly modified and evaluated to accurately simulate naturally talking humans. Baldi's visible speech can be appropriately aligned with either synthesized or natural auditory speech. Baldi has had great success in teaching vocabulary and grammar to children with language challenges and training speech distinctions to children with hearing loss and to adults learning a new language. We demonstrate these learning programs as well as several other potential application areas for Baldi.
Keywords: facial and speech synthesis, language learning
Augmenting user interfaces with adaptive speech commands BIBAKFull-Text 176-179
  Peter Gorniak; Deb Roy
We present a system that augments any unmodified Java application with an adaptive speech interface. The augmented system learns to associate spoken words and utterances with interface actions such as button clicks. Speech learning is constantly active and searches for correlations between what the user says and does. Training the interface is seamlessly integrated with using the interface. As the user performs normal actions, she may optionally verbally describe what she is doing. By using a phoneme recognizer, the interface is able to quickly learn new speech commands. Speech commands are chosen by the user and can be recognized robustly due to accurate phonetic modelling of the user's utterances and the small size of the vocabulary learned for a single application. After only a few examples, speech commands can replace mouse clicks. In effect, selected interface functions migrate from keyboard and mouse to speech. We demonstrate the usefulness of this approach by augmenting jfig, a drawing application, where speech commands save the user from the distraction of having to use a tool palette.
Keywords: machine learning, phoneme recognition, robust speech interfaces, user modelling
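A rough Python sketch of the association idea only: store phoneme strings observed alongside interface actions, then trigger the action whose stored utterance is closest to a new one. The phoneme strings, similarity threshold, and action names are illustrative, and string similarity here stands in for the paper's richer acoustic modelling.
    from difflib import SequenceMatcher

    learned = {}   # phoneme string (space-separated) -> interface action

    def observe(phonemes, action):
        """Called when the user speaks while performing an interface action."""
        learned[phonemes] = action

    def recognize(phonemes, threshold=0.7):
        """Return the action for the closest learned utterance, if close enough."""
        if not learned:
            return None
        best = max(learned, key=lambda p: SequenceMatcher(None, p, phonemes).ratio())
        score = SequenceMatcher(None, best, phonemes).ratio()
        return learned[best] if score >= threshold else None

    observe("d r ao s er k ax l", "tool:circle")    # spoken while picking the circle tool
    observe("s eh l eh k t", "tool:select")
    print(recognize("d r ao s er k el"))            # -> "tool:circle"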

Posters

Combining speech and haptics for intuitive and efficient navigation through image databases BIBAKFull-Text 180-187
  Thomas Käster; Michael Pfeiffer; Christian Bauckhage
Given the size of today's professional image databases, the standard approach to object- or theme-related image retrieval is to interactively navigate through the content. But as most users of such databases are designers or artists who do not have a technical background, navigation interfaces must be intuitive to use and easy to learn. This paper reports on efforts towards this goal. We present a system for intuitive image retrieval that features different modalities for interaction. Apart from conventional input devices like mouse or keyboard it is also possible to use speech or haptic gesture to indicate what kind of images one is looking for. Seeing a selection of images on the screen, the user provides relevance feedback to narrow the choice of motifs presented next. This is done either by scoring whole images or by choosing certain image regions. In order to derive consistent reactions from multimodal user input, asynchronous integration of modalities and probabilistic reasoning based on Bayesian networks are applied. After addressing technical details, we will discuss a series of usability experiments, which we conducted to examine the impact of multimodal input facilities on interactive image retrieval. The results indicate that users appreciate multimodality. While we observed little decrease in task performance, measures of contentment exceeded those for conventional input devices.
Keywords: content-based image retrieval, fusion of haptics, multimodal interface evaluation, speech, vision processing
Interactive skills using active gaze tracking BIBAKFull-Text 188-195
  Rowel Atienza; Alexander Zelinsky
We have incorporated interactive skills into an active gaze tracking system. Our active gaze tracking system can identify an object in a cluttered scene that a person is looking at. By following the user's 3-D gaze direction together with a zero-disparity filter, we can determine the object's position. Our active vision system also directs attention to a user by tracking anything with both motion and skin color. A Particle Filter fuses skin color and motion from optical flow techniques together to locate a hand or a face in an image. The active vision then uses stereo camera geometry, Kalman Filtering and position and velocity controllers to track the feature in real-time. These skills are integrated together such that they cooperate with each other in order to track the user's face and gaze at all times. Results and video demos provide interesting insights on how active gaze tracking can be utilized and improved to make human-friendly user interfaces.
Keywords: active face tracking, active gaze tracking, selecting an object in 3-D space using gaze
Error recovery in a blended style eye gaze and speech interface BIBAKFull-Text 196-202
  Yeow Kee Tan; Nasser Sherkat; Tony Allen
In the work carried out earlier [1][2], it was found that an eye gaze and speech enabled interface was the most preferred form of data entry when compared to other methods such as mouse and keyboard, handwriting and speech only. It was also found that several non-native speakers of United Kingdom (UK) English did not prefer the eye gaze and speech system due to the low success rate caused by the inaccuracy of the speech recognition component. Hence, in order to increase the usability of the eye gaze and speech data entry system for these users, error recovery methods are required. In this paper we present three different multimodal interfaces that employ speech recognition and eye gaze tracking within a virtual keypad style interface to allow for error recovery (re-speak with keypad, spelling with keypad, and re-speak and spelling with keypad). Experiments show that through the use of this virtual keypad interface, an accuracy gain of 10.92% during first attempt and 6.20% during re-speak by non-native speakers in ambiguous fields (initials, surnames, city and alphabets) can be achieved [3]. The aim of this work is to investigate whether the usability of the eye gaze and speech system can be improved through one of these three blended multimodal error recovery methods.
Keywords: blended multimodal interface, error recovery and usability, eye gaze tracking, multimodal interface, speech recognition
Using an autonomous cube for basic navigation and input BIBAKFull-Text 203-210
  Kristof Van Laerhoven; Nicolas Villar; Albrecht Schmidt; Gerd Kortuem; Hans Gellersen
This paper presents a low-cost and practical approach to achieve basic input using a tactile cube-shaped object, augmented with a set of sensors, processor, batteries and wireless communication. The algorithm we propose combines a finite state machine model incorporating prior knowledge about the symmetrical structure of the cube, with maximum likelihood estimation using multivariate Gaussians. The claim that the presented solution is cheap, fast and requires few resources, is demonstrated by implementation in a small-sized, microcontroller-driven hardware configuration with inexpensive sensors. We conclude with a few prototyped applications that aim at characterizing how the familiar and elementary shape of the cube allows it to be used as an interaction device.
Keywords: Gaussian modeling, Markov chain, haptic interfaces, maximum likelihood estimation, sensor-based tactile interfaces
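A minimal Python sketch of the maximum-likelihood part described above: one multivariate Gaussian per resting cube face over the 3-axis accelerometer reading, with classification by the highest log-density. The means and covariances are toy values; the paper learns them from data and adds a finite-state model that exploits the cube's symmetry.
    import numpy as np

    def log_gaussian(x, mean, cov):
        d = x - mean
        inv, det = np.linalg.inv(cov), np.linalg.det(cov)
        return -0.5 * (d @ inv @ d + np.log(det) + len(x) * np.log(2 * np.pi))

    faces = {   # face label -> (mean acceleration in g, covariance); toy values
        "top_up":   (np.array([0.0, 0.0,  1.0]), np.eye(3) * 0.01),
        "top_down": (np.array([0.0, 0.0, -1.0]), np.eye(3) * 0.01),
        "left_up":  (np.array([1.0, 0.0,  0.0]), np.eye(3) * 0.01),
    }

    def classify_face(reading):
        return max(faces, key=lambda f: log_gaussian(reading, *faces[f]))

    print(classify_face(np.array([0.05, -0.02, 0.97])))   # -> "top_up"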
GWindows: robust stereo vision for gesture-based control of windows BIBAKFull-Text 211-218
  Andrew Wilson; Nuria Oliver
Perceptual user interfaces promise modes of fluid computer-human interaction that complement the mouse and keyboard, and have been especially motivated in non-desktop scenarios, such as kiosks or smart rooms. Such interfaces, however, have been slow to see use for a variety of reasons, including the computational burden they impose, a lack of robustness outside the laboratory, unreasonable calibration demands, and a shortage of sufficiently compelling applications. We address these difficulties by using a fast stereo vision algorithm for recognizing hand positions and gestures. Our system uses two inexpensive video cameras to extract depth information. This depth information enhances automatic object detection and tracking robustness, and may also be used in applications. We demonstrate the algorithm in combination with speech recognition to perform several basic window management tasks, report on a user study probing the ease of using the system, and discuss the implications of such a system for future user interfaces.
Keywords: computer human interaction, computer vision, gesture recognition, speech recognition
A visually grounded natural language interface for reference to spatial scenes BIBAKFull-Text 219-226
  Peter Gorniak; Deb Roy
Many user interfaces, from graphic design programs to navigation aids in cars, share a virtual space with the user. Such applications are often ideal candidates for speech interfaces that allow the user to refer to objects in the shared space. We present an analysis of how people describe objects in spatial scenes using natural language. Based on this study, we describe a system that uses synthetic vision to "see" such scenes from the person's point of view, and that understands complex natural language descriptions referring to objects in the scenes. This system is based on a rich notion of semantic compositionality embedded in a grounded language understanding framework. We describe its semantic elements, their compositional behaviour, and their grounding through the synthetic vision system. To conclude, we evaluate the performance of the system on unconstrained input.
Keywords: cognitive modelling, computational semantics, natural language understanding, vision based semantics
Perceptual user interfaces using vision-based eye tracking BIBAKFull-Text 227-233
  Ravikrishna Ruddarraju; Antonio Haro; Kris Nagel; Quan T. Tran; Irfan A. Essa; Gregory Abowd; Elizabeth D. Mynatt
We present a multi-camera vision-based eye tracking method to robustly locate and track a user's eyes as they interact with an application. We propose enhancements to various vision-based eye-tracking approaches, which include (a) the use of multiple cameras to estimate head pose and increase coverage of the sensors and (b) the use of probabilistic measures incorporating Fisher's linear discriminant to robustly track the eyes under varying lighting conditions in real-time. We present experiments and quantitative results to demonstrate the robustness of our eye tracking in two application prototypes.
Keywords: Fisher's Discriminant Analysis, computer vision, eye tracking, human computer interaction, multiple cameras
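A short Python sketch of the standard two-class Fisher linear discriminant computation, w = S_W^{-1}(mu_1 - mu_0), of the kind used above to separate eye from non-eye appearance under varying lighting. The feature vectors here are random stand-ins, not the paper's features.
    import numpy as np

    def fisher_discriminant(class0, class1):
        """class0, class1: arrays of shape (n_samples, n_features)."""
        mu0, mu1 = class0.mean(axis=0), class1.mean(axis=0)
        s_w = np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False)
        w = np.linalg.solve(s_w, mu1 - mu0)          # within-class scatter^-1 * mean diff
        threshold = w @ (mu0 + mu1) / 2.0            # midpoint decision threshold
        return w, threshold

    rng = np.random.default_rng(1)
    eyes     = rng.normal(1.0, 0.5, (200, 8))        # hypothetical eye features
    non_eyes = rng.normal(0.0, 0.5, (200, 8))
    w, thresh = fisher_discriminant(non_eyes, eyes)

    sample = rng.normal(1.0, 0.5, 8)
    print("eye" if w @ sample > thresh else "non-eye")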
Sketching informal presentations BIBAKFull-Text 234-241
  Yang Li; James A. Landay; Zhiwei Guan; Xiangshi Ren; Guozhong Dai
Informal presentations are a lightweight means for fast and convenient communication of ideas. People communicate their ideas to others on paper and whiteboards, which afford fluid sketching of graphs, words and other expressive symbols. Unlike existing authoring tools that are designed for formal presentations, we created SketchPoint to help presenters design informal presentations via freeform sketching. In SketchPoint, presenters can quickly author presentations by sketching slide content, overall hierarchical structures and hyperlinks. To facilitate the transition from idea capture to communication, a note-taking workspace was built for accumulating ideas and sketching presentation outlines. Informal feedback showed that SketchPoint is a promising tool for idea communication.
Keywords: gestures, informal presentation, pen-based computers, rapid prototyping, sketching, storyboards, zooming user interface (ZUI)
Gestural communication over video stream: supporting multimodal interaction for remote collaborative physical tasks BIBAKFull-Text 242-249
  Jiazhi Ou; Susan R. Fussell; Xilin Chen; Leslie D. Setlock; Jie Yang
We present a system integrating gesture and live video to support collaboration on physical tasks. The architecture combines network IP cameras, desktop PCs, and tablet PCs to allow a remote helper to draw on a video feed of a workspace as he/she provides task instructions. A gesture recognition component enables the system both to normalize freehand drawings to facilitate communication with remote partners and to use pen-based input as a camera control device. Results of a preliminary user study suggest that our gesture over video communication system enhances task performance over traditional video-only systems. Implications for the design of multimodal systems to support collaborative physical tasks are also discussed.
Keywords: computer-supported cooperative work, gestural communication, gesture recognition, multimodal interaction, video conferencing, video mediated communication, video stream
The role of spoken feedback in experiencing multimodal interfaces as human-like BIBAKFull-Text 250-257
  Pernilla Qvarfordt; Arne Jönsson; Nils Dahlbäck
Whether user interfaces should be made human-like or tool-like has been debated in the HCI field, and this debate affects the development of multimodal interfaces. However, little empirical study has been done to support either view so far. Even if there is evidence that humans respond to media as they do to other humans, this does not mean that humans experience the interfaces as human-like. We studied how people experience a multimodal timetable system with varying degrees of human-like spoken feedback in a Wizard-of-Oz study. The results showed that users' views and preferences lean significantly towards anthropomorphism after actually experiencing the multimodal timetable system. The more human-like the spoken feedback was, the more participants preferred the system to be human-like. The results also showed that the users' experience matched their preferences. This shows that in order to appreciate a human-like interface, the users have to experience it.
Keywords: Wizard of Oz, anthropomorphism, multimodal interaction, spoken feedback
Real time facial expression recognition in video using support vector machines BIBAKFull-Text 258-264
  Philipp Michel; Rana El Kaliouby
Enabling computer systems to recognize facial expressions and infer emotions from them in real time presents a challenging research topic. In this paper, we present a real time approach to emotion recognition through facial expression in live video. We employ an automatic facial feature tracker to perform face localization and feature extraction. The facial feature displacements in the video stream are used as input to a Support Vector Machine classifier. We evaluate our method in terms of recognition accuracy for a variety of interaction and classification scenarios. Our person-dependent and person-independent experiments demonstrate the effectiveness of a support vector machine and feature tracking approach to fully automatic, unobtrusive expression recognition in live video. We conclude by discussing the relevance of our work to affective and intelligent man-machine interfaces and exploring further improvements.
Keywords: affective user interfaces, emotion classification, facial expression analysis, feature tracking, support vector machines
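A minimal Python sketch of the classification stage described above: facial feature-point displacements relative to a neutral frame form the input vector to an SVM. scikit-learn's SVC is an assumed stand-in for the paper's SVM implementation, and the training data here are random placeholders.
    import numpy as np
    from sklearn.svm import SVC

    N_FEATURE_POINTS = 22            # assumed number of tracked facial points

    def displacement_vector(neutral_points, peak_points):
        """Flatten (x, y) displacements of each tracked point into one vector."""
        return (np.asarray(peak_points) - np.asarray(neutral_points)).ravel()

    # Placeholder training data: displacement vectors with emotion labels.
    rng = np.random.default_rng(2)
    X_train = rng.normal(0, 1, (120, N_FEATURE_POINTS * 2))
    y_train = rng.choice(["happy", "sad", "surprised"], size=120)

    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X_train, y_train)

    neutral = rng.normal(0, 0.1, (N_FEATURE_POINTS, 2))
    peak = neutral + rng.normal(0, 1.0, (N_FEATURE_POINTS, 2))
    print(clf.predict([displacement_vector(neutral, peak)])[0])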
Modeling multimodal integration patterns and performance in seniors: toward adaptive processing of individual differences BIBAKFull-Text 265-272
  Benfang Xiao; Rebecca Lunsford; Rachel Coulston; Matt Wesson; Sharon Oviatt
Multimodal interfaces are designed with a focus on flexibility, although very few currently are capable of adapting to major sources of user, task, or environmental variation. The development of adaptive multimodal processing techniques will require empirical guidance from quantitative modeling on key aspects of individual differences, especially as users engage in different types of tasks in different usage contexts. In the present study, data were collected from fifteen 66- to 86-year-old healthy seniors as they interacted with a map-based flood management system using multimodal speech and pen input. A comprehensive analysis of multimodal integration patterns revealed that seniors were classifiable as either simultaneous or sequential integrators, like children and adults. Seniors also demonstrated early predictability and a high degree of consistency in their dominant integration pattern. However, greater individual differences in multimodal integration generally were evident in this population. Perhaps surprisingly, during sequential constructions seniors' intermodal lags were no longer in average and maximum duration than those of younger adults, although both of these groups had longer maximum lags than children. However, an analysis of seniors' performance did reveal lengthy latencies before initiating a task, and high rates of self talk and task-critical errors while completing spatial tasks. All of these behaviors were magnified as the task difficulty level increased. Results of this research have implications for the design of adaptive processing strategies appropriate for seniors' applications, especially for the development of temporal thresholds used during multimodal fusion. The long-term goal of this research is the design of high-performance multimodal systems that adapt to a full spectrum of diverse users, supporting tailored and robust future systems.
Keywords: human performance errors, multimodal integration, self-regulatory language, senior users, speech and pen input, task difficulty
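The abstract motivates adaptive temporal thresholds for multimodal fusion without giving concrete values, so the sketch below only illustrates the general idea: classify a user's dominant integration pattern from observed speech/pen intermodal lags and pick a fusion wait time accordingly. All cutoffs and lag values are invented, not taken from the study.

# Illustrative only: classify a user's dominant multimodal integration pattern
# from observed intermodal lags and choose a fusion wait threshold.
# The cutoff values and example data are invented.

def integration_pattern(lags_ms, overlap_cutoff_ms=0):
    """A lag <= 0 means the two modes overlapped (simultaneous)."""
    simultaneous = sum(1 for lag in lags_ms if lag <= overlap_cutoff_ms)
    return "simultaneous" if simultaneous >= len(lags_ms) / 2 else "sequential"

def fusion_threshold(lags_ms):
    """Wait longer for the second mode if the user is a sequential integrator."""
    if integration_pattern(lags_ms) == "simultaneous":
        return 1000            # ms; short wait, the modes arrive together
    return max(lags_ms) + 500  # ms; cover the user's longest observed lag

observed_lags = [-120, -80, 1400, 1900, 2300]  # fabricated example data
print(integration_pattern(observed_lags), fusion_threshold(observed_lags))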
Auditory, graphical and haptic contact cues for a reach, grasp, and place task in an augmented environment BIBAKFull-Text 273-276
  Mihaela A. Zahariev; Christine L. MacKenzie
An experiment was conducted to investigate how performance of a reach, grasp and place task was influenced by added auditory and graphical cues. The cues were presented at specific points in the task, namely when making contact to grasp or place the object, and were presented singly or in combined modalities. Haptic feedback was always present during physical interaction with the object. The auditory and graphical cues provided enhanced feedback about contact between hand and object and between object and table. The task was also performed with and without vision of the hand. Movements were slower without vision of the hand. Providing auditory cues clearly facilitated performance, while graphical contact cues had no additional effect. Implications are discussed for various uses of auditory displays in virtual environments.
Keywords: Fitts' law, auditory displays, human performance, multimodal displays, object manipulation, prehension, proprioception, virtual reality, visual information
Mouthbrush: drawing and painting by hand and mouth BIBAKFull-Text 277-280
  Chi-ho Chan; Michael J. Lyons; Nobuji Tetsutani
We present a novel multimodal interface that permits users to draw or paint using coordinated gestures of hand and mouth. A head-worn camera captures an image of the mouth, and the mouth cavity region is extracted by Fisher discriminant analysis of the pixel colour information. A normalized area parameter is read by a drawing or painting program to allow real-time gestural control of pen/brush parameters by mouth gesture while sketching with a digital pen/tablet. A new performance task, the Radius Control Task, is proposed as a means of systematically evaluating performance with the interface. Data from preliminary experiments show that, with some practice, users can achieve single-pixel radius control with ease. A trial of the system by a professional artist shows that it is ready for use as a novel tool for creative artistic expression.
Keywords: alternative input devices, mouth controller, vision-based interface
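As a rough illustration of the segmentation idea in the abstract above (a Fisher/linear discriminant over pixel colours, with the normalised mouth-cavity area driving a brush parameter), here is a minimal sketch. The training pixels, the camera frame, and the radius mapping are invented for illustration and are not the authors' implementation.

# Sketch: train a linear (Fisher) discriminant on labelled pixel colours,
# segment a frame, and map the normalised cavity area to a brush radius.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Fake labelled RGB samples: 1 = mouth cavity (dark reddish), 0 = everything else.
cavity = rng.normal([60, 15, 15], 10, size=(200, 3))
other = rng.normal([150, 120, 110], 30, size=(200, 3))
X = np.vstack([cavity, other])
y = np.array([1] * 200 + [0] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)

frame = rng.integers(0, 256, size=(120, 160, 3)).astype(float)  # stand-in camera frame
mask = lda.predict(frame.reshape(-1, 3)).reshape(frame.shape[:2])

normalized_area = mask.mean()            # fraction of pixels labelled "cavity"
brush_radius = 1 + normalized_area * 30  # invented mapping to pen/brush radius
print(f"open-mouth area {normalized_area:.2f} -> radius {brush_radius:.1f}px")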
XISL: a language for describing multimodal interaction scenarios BIBAKFull-Text 281-284
  Kouichi Katsurada; Yusaku Nakamura; Hirobumi Yamada; Tsuneo Nitta
This paper outlines the latest version of XISL (eXtensible Interaction Scenario Language). XISL is an XML-based markup language for web-based multimodal interaction systems. XISL can describe the synchronization of multimodal inputs/outputs, dialog flow and transitions, and other elements required for multimodal interaction. XISL inherits these features from VoiceXML and SMIL; its distinctive feature is its modality extensibility. We present the basic XISL tags, outline XISL execution systems, and compare XISL with other languages.
Keywords: XISL, XML, modality extensibility, multimodal interaction
IRYS: a visualization tool for temporal analysis of multimodal interaction BIBAKFull-Text 285-288
  Daniel Bauer; James D. Hollan
IRYS is a tool for the replay and analysis of gaze and touch behavior during on-line activities. Essentially a "multimodal VCR", it can record and replay computer screen activity and overlay this video with a synchronized "spotlight" of the user's attention, as measured by an eye-tracking and/or touch-tracking system. This cross-platform tool is particularly useful for detailed ethnographic analysis of "natural" on-line behavior involving multiple applications and windows in a continually changing workspace.
Keywords: VNC, digital ethnography, eye tracking, gaze analysis, gaze representation, haptic, multimodal, temporal analysis, touch tracking, virtual network computer
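IRYS itself is not reproduced here; the toy sketch below only illustrates the "spotlight of attention" overlay idea described above, dimming everything outside a disc centred on the tracked gaze position in a captured screen frame. The frame, radius, and dimming factor are assumptions for illustration.

# Not IRYS: a toy "spotlight" overlay that dims everything outside a disc
# centred on the gaze point of a screen-capture frame.
import numpy as np

def spotlight(frame, gaze_xy, radius=60, dim=0.6):
    """Darken pixels outside a disc centred on the gaze point (x, y)."""
    h, w = frame.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    outside = (xx - gaze_xy[0]) ** 2 + (yy - gaze_xy[1]) ** 2 > radius ** 2
    out = frame.astype(float)
    out[outside] *= (1.0 - dim)
    return out.astype(np.uint8)

frame = np.full((480, 640, 3), 200, dtype=np.uint8)   # stand-in screen frame
print(spotlight(frame, gaze_xy=(320, 240)).shape)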
Towards robust person recognition on handheld devices using face and speaker identification technologies BIBAKFull-Text 289-292
  Timothy J. Hazen; Eugene Weinstein; Alex Park
Most face and speaker identification techniques are tested on data collected in controlled environments using high quality cameras and microphones. However, using these technologies in variable environments, with the inexpensive sound and image capture hardware found in mobile devices, presents additional challenges. In this study, we investigate the application of existing face and speaker identification techniques to a person identification task on a handheld device. These techniques have proven to perform accurately in tightly constrained experiments where the lighting conditions, visual backgrounds, and audio environments are fixed and specifically adjusted for optimal data quality. When they are applied on mobile devices where the visual and audio conditions are highly variable, degradations in performance can be expected. Under these circumstances, combining multiple biometric modalities can improve the robustness and accuracy of person identification. In this paper, we present our approach for combining face and speaker identification technologies and experimentally demonstrate a fused multi-biometric system that achieves a 50% reduction in equal error rate over the better of the two independent systems.
Keywords: face identification, handheld devices, multi-biometric interfaces, speaker identification
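The abstract does not specify the fusion rule, so the sketch below shows only a generic score-level fusion (a weighted sum of face and speaker match scores) together with a crude equal-error-rate estimate, which is the metric quoted above. The scores and the weight are fabricated for illustration.

# Illustrative score-level fusion of two biometric matchers plus a crude
# equal-error-rate (EER) estimate; scores and weight are fabricated.
import numpy as np

rng = np.random.default_rng(2)

def eer(genuine, impostor):
    """Rough EER: sweep a threshold and take the best max(FAR, FRR)."""
    best = 1.0
    for t in np.linspace(0, 1, 1001):
        far = np.mean(impostor >= t)   # false accept rate
        frr = np.mean(genuine < t)     # false reject rate
        best = min(best, max(far, frr))
    return best

# Fabricated per-trial match scores in [0, 1] for each modality.
face_gen, face_imp = rng.beta(5, 2, 500), rng.beta(2, 5, 500)
voice_gen, voice_imp = rng.beta(6, 2, 500), rng.beta(2, 6, 500)

w = 0.5  # fusion weight; in practice tuned on held-out data
fused_gen = w * face_gen + (1 - w) * voice_gen
fused_imp = w * face_imp + (1 - w) * voice_imp

for name, g, i in [("face", face_gen, face_imp),
                   ("voice", voice_gen, voice_imp),
                   ("fused", fused_gen, fused_imp)]:
    print(name, round(eer(g, i), 3))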
Algorithms for controlling cooperation between output modalities in 2D embodied conversational agents BIBAKFull-Text 293-296
  Sarkis Abrilian; Jean-Claude Martin; Stéphanie Buisine
Recent advances in the specification of the multimodal behavior of Embodied Conversational Agents (ECAs) have proposed a direct and deterministic one-step mapping from high-level specifications of dialog state or agent emotion onto low-level specifications of the multimodal behavior to be displayed by the agent (e.g. facial expression, gestures, vocal utterance). The difference in abstraction between these two levels of specification makes such a complex mapping difficult to define. In this paper we propose an intermediate level of specification based on combinations between modalities (e.g. redundancy, complementarity). We explain how such intermediate-level specifications can be described in XML in the case of deictic expressions. We define algorithms for parsing these descriptions and generating the corresponding multimodal behavior of 2D cartoon-like conversational agents. Some random selection has been introduced into these algorithms in order to induce "natural variations" in the agent's behavior. We conclude by discussing the usefulness of this approach for the design of ECAs.
Keywords: embodied conversational agent, multimodal output, redundancy, specification
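The authors' XML schema and algorithms are not given in the abstract; the sketch below only illustrates the general idea of an intermediate specification level, choosing a redundant or complementary realisation of a deictic expression across modalities, with a little random variation. The rules and output strings are invented.

# Illustrative only: realise a deictic expression across speech, gesture and
# gaze as either a "redundant" or a "complementary" combination, chosen with
# some random variation. The rules here are invented, not the paper's.
import random

def realise_deictic(referent, combination):
    if combination == "redundant":
        # Every modality carries the referent.
        return {"speech": f"that {referent}", "gesture": f"point at {referent}",
                "gaze": f"look at {referent}"}
    if combination == "complementary":
        # Speech stays underspecified; gesture and gaze disambiguate.
        return {"speech": "that one", "gesture": f"point at {referent}",
                "gaze": f"look at {referent}"}
    raise ValueError(combination)

random.seed(3)
combo = random.choice(["redundant", "complementary"])  # "natural variation"
print(combo, realise_deictic("blue house", combo))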
Towards an attentive robotic dialog partner BIBAKFull-Text 297-300
  Torsten Wilhelm; Hans-Joachim Böhme; Horst-Michael Gross
This paper describes a system developed for a mobile service robot that detects and tracks the position of a user's face in 3D space using a vision-based (skin color) component and a sonar-based component. To make the skin color detection robust under varying illumination conditions, it is supplied with an automatic white balance algorithm. The hypothesis of the user's position is used to orient the robot's head towards the current user, allowing it to grab high-resolution images of the user's face suitable for verifying the hypothesis and for extracting additional information.
Keywords: user detection, user tracking
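As a minimal sketch of the visual front end described above, the code below applies a gray-world white balance and then a simple skin-colour test in normalised red-green space. The thresholds and the test frame are illustrative assumptions, not the paper's trained colour model or white balance algorithm.

# Sketch: gray-world white balance followed by a rough skin-colour test in
# chromaticity (normalised red-green) space. Thresholds are illustrative.
import numpy as np

def gray_world(frame):
    """Scale each channel so its mean matches the overall mean brightness."""
    f = frame.astype(float)
    means = f.reshape(-1, 3).mean(axis=0)
    return np.clip(f * (means.mean() / means), 0, 255)

def skin_mask(frame):
    f = frame.astype(float) + 1e-6
    s = f.sum(axis=2)
    r, g = f[..., 0] / s, f[..., 1] / s                         # chromaticities
    return (r > 0.36) & (r < 0.46) & (g > 0.28) & (g < 0.36)    # rough skin box

frame = np.random.default_rng(4).integers(0, 256, (240, 320, 3), dtype=np.uint8)
balanced = gray_world(frame)
print(skin_mask(balanced).mean())   # fraction of pixels flagged as skin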

Demos

Demo: a multi-modal training environment for surgeons BIBAKFull-Text 301-302
  Shahram Payandeh; John Dill; Graham Wilson; Hui Zhang; Lilong Shi; Alan Lomax; Christine MacKenzie
This demonstration presents the current state of an on-going team project at Simon Fraser University developing a virtual environment to help train surgeons in performing laparoscopic surgery. In collaboration with surgeons, an initial set of training procedures has been developed. Our goal has been to develop procedures in each of several general categories, such as basic hand-eye coordination, single-handed and bi-manual approaches, and dexterous manipulation. The environment is based on an efficient data structure that supports fast graphics and physically based modeling of both rigid and deformable objects. In addition, the environment supports both 3D and 5D input devices and devices generating haptic feedback. The demonstration allows users to interact with a scene using a haptic device.
Keywords: haptics, surgery training, surgical simulation, virtual laparoscopy, virtual reality
Demo: playing FantasyA with SenToy BIBAKFull-Text 303-304
  Ana Paiva; Rui Prada; Ricardo Chaves; Marco Vala; Adrian Bullock; Gerd Andersson; Kristina Höök
Games are an emerging area for new types of interaction between computers and humans. New forms of communication are being explored there, influenced not only by face-to-face communication but also by recent developments in multimodal communication and tangible interfaces. This demo features a computer game, FantasyA, which users play by interacting with a tangible interface, SenToy (see Figure 1). The main idea is to bring objects and artifacts from everyday life into the interaction with systems, and with games in particular. SenToy is an interface through which users project their emotions by moving the doll in particular ways. The device establishes a link between the user holding the physical doll and the avatar it embodies and controls in the computer game FantasyA.
Keywords: affective computing, synthetic characters, tangible interfaces