Speech-based Interaction: Myths, Challenges, and Opportunities
Course Overviews
/
Munteanu, Cosmin
/
Penn, Gerald
Extended Abstracts of the ACM CHI'16 Conference on Human Factors in
Computing Systems
2016-05-07
v.2
p.992-995
© Copyright 2016 ACM
Summary: HCI research has long been dedicated to facilitating better and more
natural information transfer between humans and machines. Unfortunately,
humans' most natural form of communication, speech, is also one of the most
difficult modalities for machines to understand -- despite, and perhaps
because, it is the highest-bandwidth communication channel we possess. While
significant research efforts, from engineering to linguistics and the cognitive
sciences, have been devoted to improving machines' ability to understand speech,
the CHI community (and the HCI field at large) has been relatively timid in
embracing this modality as a central focus of research. This can be attributed
in part to the relatively discouraging levels of accuracy in understanding
speech, in contrast with often-unfounded claims of success from industry, but
also to the intrinsic difficulty of designing and especially evaluating speech
and natural language interfaces. As such, the development of interactive
speech-based systems is mostly driven by engineering efforts to improve such
systems with respect to largely arbitrary performance metrics. Such
developments have often been devoid of user-centered design principles or
consideration for usability or usefulness. The goal of this course is to inform
the CHI community of the current state of speech and natural language research,
to dispel some of the myths surrounding speech-based interaction, as well as to
provide an opportunity for researchers and practitioners to learn more about
how speech recognition and speech synthesis work, what their limitations are,
and how they could be used to enhance current interaction paradigms. Through
this, we hope that HCI researchers and practitioners will learn how to combine
recent advances in speech processing with user-centred principles in designing
more usable and useful speech-based interactive systems.
Designing Speech and Multimodal Interactions for Mobile, Wearable, and
Pervasive Applications
Workshop Summaries
/
Munteanu, Cosmin
/
Irani, Pourang
/
Oviatt, Sharon
/
Aylett, Matthew
/
Penn, Gerald
/
Pan, Shimei
/
Sharma, Nikhil
/
Rudzicz, Frank
/
Gomez, Randy
/
Nakamura, Keisuke
/
Nakadai, Kazuhiro
Extended Abstracts of the ACM CHI'16 Conference on Human Factors in
Computing Systems
2016-05-07
v.2
p.3612-3619
© Copyright 2016 ACM
Summary: Traditional interfaces are continuously being replaced by mobile, wearable,
or pervasive interfaces. Yet when it comes to the input and output modalities
enabling our interactions, we have yet to fully embrace some of the most
natural forms of communication and information processing that humans possess:
speech, language, gestures, thoughts. Very little HCI attention has been
dedicated to designing and developing spoken language and multimodal
interaction techniques, especially for mobile and wearable devices. In addition
to the enormous recent engineering progress in processing such modalities,
there is now sufficient evidence that many real-life applications do not
require 100% accuracy in processing multimodal input to be useful, particularly
if such modalities complement each other. This multidisciplinary, two-day
workshop will bring together interaction designers, usability researchers, and
general HCI practitioners to analyze the opportunities and directions to take
in designing more natural interactions with mobile and wearable devices, and to
look at how we can leverage recent advances in speech and multimodal
processing.
Speech-based Interaction: Myths, Challenges, and Opportunities
Course Overviews
/
Munteanu, Cosmin
/
Penn, Gerald
Extended Abstracts of the ACM CHI'15 Conference on Human Factors in
Computing Systems
2015-04-18
v.2
p.2483-2484
© Copyright 2015 ACM
Summary: HCI research has long been dedicated to facilitating better and more
natural information transfer between humans and machines. Unfortunately,
humans' most natural form of communication, speech, is also one of the most
difficult modalities for machines to understand -- despite, and perhaps
because, it is the highest-bandwidth communication channel we possess. While
significant research efforts, from engineering to linguistics and the cognitive
sciences, have been devoted to improving machines' ability to understand speech,
the HCI community has been relatively timid in embracing this modality as a
central focus of research. This can be attributed in part to the relatively
discouraging levels of accuracy in understanding speech, in contrast with
often-unfounded claims of success from industry, but also to the intrinsic
difficulty of designing and especially evaluating speech and natural language
interfaces. The goal of this course is to inform the CHI community of the
current state of speech and natural language research, to dispel some of the
myths surrounding speech-based interaction, as well as to provide an
opportunity for researchers and practitioners to learn more about how speech
recognition and speech synthesis work, what their limitations are, and how they
could be used to enhance current interaction paradigms. Through this, we hope
that CHI researchers and general HCI, UI, and UX practitioners will learn how
to combine recent advances in speech processing with user-centred principles in
designing more usable and useful speech-based interactive systems.
Speech-based Interaction: Myths, Challenges, and Opportunities
Tutorials
/
Munteanu, Cosmin
/
Penn, Gerald
Proceedings of the 2015 International Conference on Intelligent User
Interfaces
2015-03-29
v.1
p.437-438
© Copyright 2015 ACM
Summary: HCI research has long been dedicated to facilitating better and more
natural information transfer between humans and machines. Unfortunately,
humans' most natural form of communication, speech, is also one of the most
difficult modalities for machines to understand -- despite, and perhaps
because, it is the highest-bandwidth communication channel we possess. While
significant research efforts, from engineering to linguistics and the cognitive
sciences, have been devoted to improving machines' ability to understand speech,
the HCI community has been relatively timid in embracing this modality as a
central focus of research. This can be attributed in part to the relatively
discouraging levels of accuracy in understanding speech, in contrast with
often-unfounded claims of success from industry, but also to the intrinsic
difficulty of designing and especially evaluating speech and natural language
interfaces.
The goal of this course is to inform the IUI community of the current state
of speech and natural language research, to dispel some of the myths
surrounding speech-based interaction, as well as to provide an opportunity for
researchers and practitioners to learn more about how speech recognition and
speech synthesis work, what their limitations are, and how they could be used
to enhance current interaction paradigms. Through this, we hope that IUI
researchers and general HCI, UI, and UX practitioners will learn how to combine
recent advances in speech processing with user-centred principles in designing
more usable and useful speech-based interactive systems.
Speech-based interaction: myths, challenges, and opportunities
Interactive tutorials
/
Munteanu, Cosmin
/
Penn, Gerald
Proceedings of 2014 Conference on Human-Computer Interaction with Mobile
Devices and Services
2014-09-23
p.567-568
© Copyright 2014 ACM
Summary: Human-Computer Interaction (HCI) research has long been dedicated to
facilitating better and more natural information transfer between humans and
machines. Unfortunately, humans' most natural form of communication, speech, is
also one of the most difficult modalities for machines to understand. This is
largely because speech is the highest-bandwidth communication channel we
possess. As such, significant research efforts, from engineering to linguistics
and the cognitive sciences, have been devoted during the past several
decades to improving machines' ability to understand speech. Yet, the MobileHCI
community (and HCI in general) has been relatively timid in embracing this
modality as a central focus of research. This can be attributed in part to the
relatively discouraging levels of accuracy in understanding speech, in contrast
with often-unfounded claims of success from industry, but also to the intrinsic
difficulty of designing and especially evaluating speech and natural language
interfaces.
The goal of this course is to inform the MobileHCI community of the current
state of speech and natural language research, to dispel some of the myths
surrounding speech-based interaction, as well as to provide an opportunity for
researchers and practitioners to learn more about how speech recognition and
speech synthesis work, what their limitations are, and how they could be used
to enhance current interaction paradigms. Through this, we hope that MobileHCI
researchers and practitioners will learn how to combine recent advances in
speech processing with user-centred principles in designing more usable and
useful speech-based interactive systems.
Designing speech and language interactions
Workshop summaries
/
Munteanu, Cosmin
/
Jones, Matt
/
Whittaker, Steve
/
Oviatt, Sharon
/
Aylett, Matthew
/
Penn, Gerald
/
Brewster, Stephen
/
d'Alessandro, Nicolas
Proceedings of ACM CHI 2014 Conference on Human Factors in Computing Systems
2014-04-26
v.2
p.75-78
© Copyright 2014 ACM
Summary: Speech and natural language remain our most natural forms of interaction;
yet the HCI community has been very timid about focusing its attention on
designing and developing spoken language interaction techniques. While
significant effort has been spent, and progress made, in speech recognition,
synthesis, and natural language processing, there is now sufficient evidence
that many real-life applications using speech technologies do not require 100%
accuracy to be useful. This is particularly true if such systems are designed
with complementary modalities that better support their users or enhance the
systems' usability. Engaging the CHI community now is timely -- many recent
commercial applications, especially in the mobile space, are already tapping
the increased interest in and need for natural user interfaces (NUIs) by
enabling speech interaction in their products. This multidisciplinary, one-day
workshop will bring together interaction designers, usability researchers, and
general HCI practitioners to analyze the opportunities and directions to take
in designing more natural interactions based on spoken language, and to look at
how we can leverage recent advances in speech processing in order to gain
widespread acceptance of speech and natural language interaction.
The CBC newsworld holodeck
Interactivity
/
Ladly, Martha
/
Penn, Gerald
/
Chen, Cathy Pin Chun
/
Chintraruck, Pavika
/
Ghaderi, Maziar
/
Ludlow, Bryn A.
/
Peter, Jessica
/
Tanyag, Ruzette
/
Zhou, Peggy
/
Kazemian, Siavash
Proceedings of ACM CHI 2014 Conference on Human Factors in Computing Systems
2014-04-26
v.2
p.363-366
© Copyright 2014 ACM
Summary: For the past 73 years, the CBC has disseminated a unique Canadian
perspective across the world, producing a phenomenally rich multimedia record
of the country and our social, political and cultural heritage and news. This
project utilizes visualization and sonification of portions of an enormous
historical CBC Newsworld data corpus to enable an "on this day" experience for
viewers. The digitized collection of 24-hour news videos spans a 24-year period
(1989-2013) and is presented within an immersive multiscreen environment that
enables gesture-driven, context-aware browsing, information seeking, and segment review.
Employing natural language processing technologies, the interface displays
keywords and key phrases identified in the transcripts, enabling serendipitous
video search and display and offering a unique browsing opportunity within this
rich "big data" corpus.
Speech-based interaction: myths, challenges, and opportunities
Courses
/
Munteanu, Cosmin
/
Penn, Gerald
Proceedings of ACM CHI 2014 Conference on Human Factors in Computing Systems
2014-04-26
v.2
p.1035-1036
© Copyright 2014 ACM
Summary: HCI research has long been dedicated to facilitating better and more
natural information transfer between humans and machines. Unfortunately,
humans' most natural form of communication, speech, is also one of the most
difficult modalities for machines to understand -- despite, and perhaps
because, it is the highest-bandwidth communication channel we possess. While
significant research efforts, from engineering to linguistics and the cognitive
sciences, have been devoted to improving machines' ability to understand speech,
the CHI community has been relatively timid in embracing this modality as a
central focus of research. This can be attributed in part to the relatively
discouraging levels of accuracy in understanding speech, in contrast with
often-unfounded claims of success from industry, but also to the intrinsic
difficulty of designing and especially evaluating speech and natural language
interfaces. As such, the development of interactive speech-based systems is
mostly driven by engineering efforts to improve such systems with respect to
largely arbitrary performance metrics, often devoid of user-centered design
principles or consideration for usability or usefulness.
The goal of this course is to inform the CHI community of the current state
of speech and natural language research, to dispel some of the myths
surrounding speech-based interaction, as well as to provide an opportunity for
researchers and practitioners to learn more about how speech recognition and
speech synthesis work, what their limitations are, and how they could be used
to enhance current interaction paradigms. Through this, we hope that HCI
researchers and practitioners will learn how to combine recent advances in
speech processing with user-centered principles in designing more usable and
useful speech-based interactive systems.
We need to talk: HCI and the delicate topic of spoken language interaction
Panels
/
Munteanu, Cosmin
/
Jones, Matt
/
Oviatt, Sharon
/
Brewster, Stephen
/
Penn, Gerald
/
Whittaker, Steve
/
Rajput, Nitendra
/
Nanavati, Amit
Extended Abstracts of ACM CHI'13 Conference on Human Factors in Computing
Systems
2013-04-27
v.2
p.2459-2464
© Copyright 2013 ACM
Summary: Speech and natural language remain our most natural forms of interaction; yet
the HCI community has been very timid about focusing its attention on
designing and developing spoken language interaction techniques. This may be
due to a widespread perception that perfect domain-independent speech
recognition is an unattainable goal. Progress is continuously being made in the
engineering and science of speech and natural language processing, however, and
there is also recent research suggesting that many speech applications
require far less than 100% accuracy to be useful. Engaging the
CHI community now is timely -- many recent commercial applications, especially
in the mobile space, are already tapping the increased interest in and need for
natural user interfaces (NUIs) by enabling speech interaction in their
products. As such, the goal of this panel is to bring together interaction
designers, usability researchers, and general HCI practitioners to discuss the
opportunities and directions to take in designing more natural interactions
based on spoken language, and to look at how we can leverage recent advances in
speech processing in order to gain widespread acceptance of speech and natural
language interaction.
SeeSay and HearSay CAPTCHA for mobile interaction
Papers: mobile interaction
/
Shirali-Shahreza, Sajad
/
Penn, Gerald
/
Balakrishnan, Ravin
/
Ganjali, Yashar
Proceedings of ACM CHI 2013 Conference on Human Factors in Computing Systems
2013-04-27
v.1
p.2147-2156
© Copyright 2013 ACM
Summary: Speech certainly has advantages as an input modality for smartphone
applications, especially in scenarios where using touch or keyboard entry is
difficult, on increasingly miniaturized devices where usable keyboards are
difficult to accommodate, or in scenarios where only small amounts of text need
to be input, such as when entering SMS texts or responding to a CAPTCHA
challenge. In this paper, we propose two new alternative ways to design
CAPTCHAs in which the user says the answer instead of typing it with (a) output
stimuli provided visually (SeeSay) or (b) auditorily (HearSay). Our user study
results show that the SeeSay CAPTCHA takes less time to solve and that users
prefer it over current text-based CAPTCHA methods.
An ecologically valid evaluation of speech summarization
Work-in-progress
/
McCallum, Anthony
/
Munteanu, Cosmin
/
Penn, Gerald
/
Zhu, Xiaodan
Extended Abstracts of ACM CHI'12 Conference on Human Factors in Computing
Systems
2012-05-05
v.2
p.2219-2224
© Copyright 2012 ACM
Summary: The past decade has witnessed an explosion in the size and availability of
online audio-visual repositories, such as entertainment, news, or lectures.
Summarization systems have the potential to provide significant assistance with
navigating such repositories. Unfortunately, automatically-generated summaries
often fall short of delivering the information needed by users. This is due, in
no small part, to the fact that the natural language heuristics used to
generate summaries are often optimized with respect to currently-used
evaluation metrics. Such metrics simply score automatically-generated summaries
against subjectively-classified gold standards without taking into account the
usefulness of a summary in helping a user achieve a certain goal, or even
overall summary coherence. We have previously shown that an immediate
consequence of this problem is that even the most linguistically-complex
summarization systems perform no better than basic heuristics, such as picking
the longest sentences from a general-topic, spontaneous dialog, or the first
few sentences from a news recording. Our hypothesis is that complex systems are
in fact better, if measured properly. What is needed instead are
evaluation metrics (and consequently, automatic summarizers) that incorporate
features such as user preferences and task-orientation. For this, we propose an
ecologically valid evaluation metric that determines the value of a summary
when embedded in a task, rather than how closely a summary matches a gold
standard.
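Intrinsic metrics of the kind criticized above typically score word overlap between an automatic summary and a gold standard. A minimal unigram-recall (ROUGE-1-style) sketch, purely illustrative and not the metric used by the authors, makes the limitation concrete: a high overlap score says nothing about a summary's coherence or its usefulness in a task.

```python
from collections import Counter

def rouge1_recall(gold: str, candidate: str) -> float:
    """Unigram recall of a candidate summary against one gold-standard summary.

    Counts how many gold-standard words (with multiplicity) also appear in
    the candidate, divided by the gold-standard length. Illustrative only.
    """
    gold_counts = Counter(gold.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each candidate word counts at most as often as in gold.
    overlap = sum(min(c, gold_counts[w]) for w, c in cand_counts.items())
    total = sum(gold_counts.values())
    return overlap / total if total else 0.0
```

For example, a candidate that merely copies the longest gold sentences scores well here, which is exactly the weakness that motivates task-embedded evaluation.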
Collaborative editing for improved usefulness and usability of
transcript-enhanced webcasts
Collaborative User Interfaces
/
Munteanu, Cosmin
/
Baecker, Ron
/
Penn, Gerald
Proceedings of ACM CHI 2008 Conference on Human Factors in Computing Systems
2008-04-05
v.1
p.373-382
© Copyright 2008 ACM
Summary: One challenge in facilitating skimming or browsing through archives of
on-line recordings of webcast lectures is the lack of text transcripts of the
recorded lecture. Ideally, transcripts would be obtainable through Automatic
Speech Recognition (ASR). However, in realistic lecture conditions, current ASR
systems can only deliver a Word Error Rate of around 45% -- well above the
accepted usability threshold of 25%. In this paper, we present the iterative design of a
webcast extension that engages users to collaborate in a wiki-like manner on
editing the ASR-produced imperfect transcripts, and show that this is a
feasible solution for improving the quality of lecture transcripts. We also
present the findings of a field study carried out in a real lecture environment
investigating how students use and edit the transcripts.
Automatic speech recognition for webcasts: how good is good enough and what
to do when it isn't
Poster Session 1
/
Munteanu, Cosmin
/
Penn, Gerald
/
Baecker, Ron
/
Zhang, Yuecheng
Proceedings of the 2006 International Conference on Multimodal Interfaces
2006-11-02
p.39-42
Keywords: automatic speech recognition, collaboration, webcasts
© Copyright 2006 ACM
Summary: The increased availability of broadband connections has recently led to an
increase in the use of Internet broadcasting (webcasting). Most webcasts are
archived and accessed numerous times retrospectively. One challenge to skimming
and browsing through such archives is the lack of text transcripts of the
webcast's audio channel. This paper describes a procedure for prototyping an
Automatic Speech Recognition (ASR) system that generates realistic transcripts
of any desired Word Error Rate (WER), thus overcoming the drawbacks of both
prototype-based and Wizard of Oz simulations. We used such a system in a user
study showing that transcripts with WERs less than 25% are acceptable for use
in webcast archives. As current ASR systems can only deliver, in realistic
conditions, Word Error Rates (WERs) of around 45%, we also describe a solution
for reducing the WER of such transcripts by engaging users to collaborate in a
"wiki" fashion on editing the imperfect transcripts obtained through ASR.
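The Word Error Rate figures quoted throughout these abstracts follow the standard definition: the word-level edit distance (substitutions, insertions, and deletions) between the ASR hypothesis and the reference transcript, divided by the number of reference words. A minimal dynamic-programming sketch (function and variable names are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # substitution or match
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why a 45% figure still leaves substantial headroom above the 25% usability threshold discussed here.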
The effect of speech recognition accuracy rates on the usefulness and
usability of webcast archives
Visualization and search
/
Munteanu, Cosmin
/
Baecker, Ronald
/
Penn, Gerald
/
Toms, Elaine
/
James, David
Proceedings of ACM CHI 2006 Conference on Human Factors in Computing Systems
2006-04-22
v.1
p.493-502
© Copyright 2006 ACM
Best paper nominee: The authors have conducted an important
experiment that establishes minimum levels of accuracy that will make
automatic speech recognition useful for navigating transcriptions of
webcasts. This result is particularly timely given the growing
availability and use of webcasts in research and education.
Summary: The widespread availability of broadband connections has led to an increase
in the use of Internet broadcasting (webcasting). Most webcasts are archived
and accessed numerous times retrospectively. In the absence of transcripts of
what was said, users have difficulty searching and scanning for specific
topics. This research investigates user needs for transcription accuracy in
webcast archives, and measures how the quality of transcripts affects user
performance in a question-answering task, and how quality affects overall user
experience. We tested 48 subjects in a within-subjects design under 4
conditions: perfect transcripts, transcripts with 25% Word Error Rate (WER),
transcripts with 45% WER, and no transcript. Our data reveals that speech
recognition accuracy linearly influences both user performance and experience,
shows that transcripts with 45% WER are unsatisfactory, and suggests that
transcripts having a WER of 25% or less would be useful and usable in webcast
archives.
INTERNET
Knowledge Media Design Institute
/
Alleyne, Joel
/
Baber, Zaheer
/
Baecker, Ronald
/
Balakrishnan, Ravin
/
Berry, Brent
/
Birnholtz, Jeremy
/
Boler, Megan
/
Brett, Clare
/
Buliung, Ron
/
Caidi, Nadia
/
Chan, Leslie
/
Chignell, Mark
/
Choo, Chun Wei
/
Clement, Andrew
/
Consens, Mariano
/
Danahy, John
/
Deibert, Ronald
/
de Kerckhove, Derrick
/
de Lara, Eyal
/
Dryer, Marc
/
Easterbrook, Steve
/
Eysenbach, Gunther
/
Fiume, Eugene
/
Fox, Mark
/
Garrett, Frances
/
Goldfarb, Avi
/
Gotlieb, Calvin
/
Hewitt, Jim
/
Hirst, Graeme
/
Hockema, Stephen
/
Hyman, Avi
/
Hoinkes, Rodney
/
Jacobsen, H.-Arno
/
Jadad, Alex
/
Jamieson, Gregory
/
Jenkinson, Jodie
/
Jones, Charles
/
Kaplan, Louis
/
Kolodny, Harvey
/
Koudas, Nick
/
Lancashire, Ian
/
Logan, Bob
/
Luke, Robert
/
Lyons, Kelly
/
Mann, Steve
/
Martimianakis, Tina
/
Marziali, Elsa
/
Milgram, Paul
/
Moller, Henry
/
Moore, Gale
/
Murty, Vijaya Kumar
/
Muter, Paul
/
Mylopoulos, John
/
Penn, Gerald
/
Pennefather, Peter
/
Phillips, David
/
Plataniotis, Kostas
/
Ratto, Matt
/
Ryan, David
/
Saroiu, Stefan
/
Scheffel-Dunand, Dominique
/
Shafrir, Uri
/
Singh, Karan
/
Slotta, Jim
/
Cantwell, Brian
/
Spence, Ian
/
Steele, Lisa
/
Timmerman, Peter
/
Treviranus, Jutta
/
Trifonas, Peter
/
Truong, Khai
/
Vicente, Kim
/
Wellman, Barry
/
Wensley, Anthony
/
Wilson-Pauwels, Linda
/
Wolfe, David
/
Woodruff, Earl
/
Woolridge, Nicholas
/
Wright, Robert
/
Yu, Eric
2001-01-01
Canada, Ontario, Toronto
University of Toronto
Summary:
Research Themes:
- Knowledge media for learning - the application of computer, communications, and cognitive sciences to knowledge building, problem solving, planning, education, and training, especially to facilitate collaborative, distance and multimedia-based learning
- Technologies for knowledge media - research and development of technologies and the technological infrastructure required to construct knowledge media, including interactive computer graphics, scientific visualization, hypertext, multimedia, databases, natural language processing, and artificial intelligence
- Human-centred design - the design science of human-computer interaction and of the creation of innovative computer systems and interfaces appropriate for human use, and more generally in the human factors of complex real-world systems and technologies, as rooted in research from applied cognitive science, psychology, and sociology
- Knowledge media, culture, and society - reflection and analysis of the social implications of the increasing reliance on new technologies. As information and new media technologies challenge fundamental beliefs, this area of research deals broadly with such issues as the nature of communities and institutions, work and employment, the balance of public and private good, privacy, copyright and intellectual property.