UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Lexical Entrainment in Spoken Dialog Systems
José David Águas Lopes
Supervisor: Doctor Isabel Maria Martins Trancoso
Co-Supervisor: Doctor Maxine Eskenazi
Thesis approved in public session to obtain the PhD Degree in
Electrical and Computer Engineering
Jury Final classification: Pass With Merit.
Jury
Chairperson: Chairman of the IST Scientific Board
Members of the Committee:
Doctor Joakim Gustafson, Professor, School of Computer Science and
Communication, KTH, Royal Institute of Technology, Sweden.
Doctor Isabel Maria Martins Trancoso, Professora Catedrática do Instituto
Superior Técnico, da Universidade de Lisboa.
Doctor Ana Maria Severino de Almeida e Paiva, Professora Associada (com
Agregação) do Instituto Superior Técnico, da Universidade de Lisboa.
Doctor Nuno João Neves Mamede, Professor Associado (com Agregação) do
Instituto Superior Técnico, da Universidade de Lisboa.
Doctor António Joaquim da Silva Teixeira, Professor Auxiliar da Universidade
de Aveiro.
Doctor Maxine Eskenazi, Principal Systems Scientist, Language Technologies
Institute, Carnegie Mellon University, USA.
This research work was funded by “Fundação para a Ciência e a Tecnologia” (FCT,
Portugal) through the Ph.D. grant with reference SFRH/BD/47039/2008.
2013
Resumo
A literatura [27] mostrou que os locutores participantes em diálogos falados usam os termos
uns dos outros para estabelecer uma base de conhecimento comum. Os termos utilizados
são correntemente conhecidos como primes, sendo termos capazes de influenciar o processo
de decisão lexical dos interlocutores. O objectivo desta tese é estudar um sistema de diálogo
capaz de imitar as relações estabelecidas em interacção humana no que diz respeito ao processo
de escolha de primes. O objectivo final é que o sistema seja capaz de escolher primes no
decorrer da interacção, baseando essa escolha num conjunto de caracterı́sticas que se julga
indicarem os primes adequados que os utilizadores pretendem usar. A estratégia seguida foi
preparar um sistema que privilegiasse a preferência do utilizador, desde que isso não afectasse
negativamente o desempenho desse sistema. Quando o desempenho do sistema saı́sse afectado,
o sistema deveria estar à altura de propor um prime alternativo, de modo a que utilizador o
pudesse usar e o seu desempenho melhorasse. Consideramos que esta estratégia traz benefı́cios
na qualidade do desempenho do sistema.
O cenário escolhido para o trabalho experimental levado a cabo foi um sistema de informação
de horários de autocarros, na cidade de Pittsburgh, conhecido por sistema Let’s Go, e em Lisboa com uma versão portuguesa desse sistema, denominada Noctı́vago, a fornecer informações
dos horários das carreiras nocturnas da CARRIS. No que se refere a Lisboa, trata-se de um
sistema experimental desenvolvido neste trabalho de doutoramento, privilegiando sobretudo
o estudo das caracterı́sticas que determinam um bom prime. A análise do resultado dos
primeiros testes com este último permitiu-nos identificar características que posteriormente
foram utilizadas para criar um primeiro método on-line para escolha de primes. O método
foi testado em ambos os sistemas. No sistema Let's Go verificaram-se melhorias no desempenho do sistema, bem como uma redução do número de turnos por diálogo. O conjunto
de características foi então estendido para a criação de um modelo estatístico para selecção
de primes, que foi testado no sistema Noctı́vago. Os resultados revelaram uma redução na
taxa de erro de reconhecimento de primes e aumento no sucesso na aquisição de conceitos
relacionados com os mesmos. Deste modo, pode-se concluir que as decisões lexicais podem
influenciar positivamente o desempenho dos sistemas de diálogo. No entanto, estes foram
apenas os primeiros testes. Melhorar os métodos desenvolvidos e fazer testes em maior escala
são passos necessários para reforçar a hipótese defendida.
Abstract
The literature [27] has shown that speakers engaged in a spoken dialog use one another’s
terms (entrain) when trying to create a common ground. Those terms are commonly called
primes, since they influence the interlocutors' linguistic decision-making. The goal of this
thesis is to develop a Spoken Dialog System (SDS) that is able to imitate human interaction
by proposing primes to the user. The final goal is to have these primes chosen on the fly
during the interaction, based on a set of features that are believed to indicate good candidate
terms that the speaker would want to use. The strategy was to develop a system that follows
the user’s choice of prime, if the system performance is not negatively affected. When the
system performance may be affected, the system proposes a new prime that is the next best
candidate to be used by the speaker. We believe that this strategy may improve the system
performance.
The scenario for the entrainment experiments is an information system providing schedules
for buses in Pittsburgh - Let’s Go, and its Portuguese counterpart - Noctı́vago, an experimental system operating for the night buses in Lisbon that was developed especially for studying
the features that make good primes. The analysis of the results from the first tests resulted
in a set of features that were used to create a first on-the-fly method for prime selection.
The method was tested in both systems. In Let’s Go there was an improvement in system
performance and a reduction of the total number of turns per dialog. Using an extended set of
features, a data-driven method was created and tested with Noctívago. We observed a reduction in the error rate of prime recognition and an increase in the successful acquisition of prime
concepts. Therefore, we can conclude that lexical entrainment can play a role in improving
the performance of SDSs. However, these are only the first results. Further improvements to the
methods developed and larger-scale tests are needed to strengthen our hypothesis.
Palavras Chave
Keywords
Palavras Chave
Adaptação Lexical
Selecção automática de primes
Sistemas de Diálogo
Primes
Captação
Medida de confiança
Gestão de Diálogo
Interacção Homem-Máquina
Noctı́vago
Let’s Go
Keywords
Lexical Entrainment
Automated Prime Selection
Spoken Dialog Systems
Primes
Uptake
Confidence score
Dialog Management
Human Machine Interaction
Noctı́vago
Let’s Go
Acknowledgements
I would like to thank my advisors, Isabel Trancoso and Maxine Eskenazi, for their constant
availability, interest, support and encouragement in my research. I was blessed with two great
advisors with an immense sense of justice. I also thank their support in the research direction
I like the most.
I would like to thank all my colleagues from the Spoken Language Laboratory for the motivating research atmosphere felt throughout these years. To all the people from Carnegie
Mellon that I have worked with throughout my visits, thanks for making me feel at home.
Special thanks to Alberto Abad, Sungjin Lee and Alan W. Black for their contributions to my
research. I would also like to mention Renato Cassaca and Pedro Fialho, for their patience
and technical support, and Cláudia Figueiredo for the help with the stats.
To the members of the committee, for accepting the invitation and for their contributions to making
this document better for the community.
I would also like to thank Professor Fernando Perdigão for introducing me to the world of
language technologies and for the warm welcome when I asked him to use the laboratory in
Coimbra.
A word to those with whom I've shared a home over all these years and who sometimes had to deal with
my research frustration at the end of the day. A special thanks to those that lived in Casa
do Babar, for being my family in Lisbon.
To my JSC friends for all memorable moments shared and for being so inspiring to me despite
all our differences. To the friends I’ve made at CUMN and other Jesuit initiatives over the
years. Thanks for the truth, for taking care of me and for making me feel the real freedom.
To my Christian Life Community, for being side by side during this thesis. To the Society
of Jesus for the spiritual support and for the tools that help me to become a better human
being day after day. Thanks to all the kids that I tried to serve from Rabo de Peixe (Azores),
Fonte da Prata (Moita) and Bela Vista (Setúbal) for their genuine life, and for reminding me
of humankind's greatest need: being loved.
To my parents, Fernanda and Orlando, and my brother, João, for their unconditional love
and support during the rough moments of these nearly five years.
To God, the One I’ve tried to serve with this work and I want to continue serving for the rest
of my life.
Lisboa, December 15, 2013
José David Águas Lopes
“Para ser grande, sê inteiro: nada
Teu exagera ou exclui.
Sê todo em cada coisa. Põe quanto és
No mı́nimo que fazes.
Assim em cada lago a lua toda
Brilha, porque alta vive”
Ricardo Reis
Contents
1 Introduction
1.1 Motivation
1.2 Goals
1.3 Structure of the document
2 Related Work
2.1 Improving robustness in SDSs
2.1.1 Speech recognition
2.1.2 Confidence annotation
2.1.3 Dialog Management
2.1.3.1 POMDP Review
2.1.3.1.1 SDS-POMDP
2.1.3.2 Dialog-State tracking
2.1.3.2.1 N-Best Approaches
2.1.3.2.2 Factored Bayesian Networks Approaches
2.1.3.3 Policy optimization
2.1.3.4 Summary
2.1.4 Language Generation
2.2 Entrainment
2.2.1 Entrainment in Human-Human Dialogs
2.2.2 Entrainment in Human-Computer Dialogs
2.3 Summary
3 Creating Noctívago, a Portuguese Let's Go
3.1 Choosing a Framework for Noctívago
3.1.1 DIGA
3.1.2 Olympus
3.2 Modules Used
3.2.1 Speech Recognition
3.2.1.1 Pocketsphinx
3.2.1.2 Audimus
3.2.2 Natural Language Understanding
3.2.3 Dialog Management
3.2.3.1 Ravenclaw
3.2.3.2 Cornerstone
3.2.3.2.1 Dialog-State Tracking
3.2.3.2.2 Policy Optimization
3.2.4 Natural Language Generation and Speech Synthesis
3.2.5 Embodied Conversational Agents
3.3 Summary
4 Towards Better Prime Choices
4.1 Introduction
4.2 Creating a list of prime candidates
4.3 Experimental Set Up
4.4 Results
4.5 Prime usage
4.6 Discussion
4.6.1 User's Feedback
4.7 Summary
5 Refining confidence measure to improve prime selection
5.1 Introduction
5.2 Training a confidence annotator using logistic regression
5.3 Training a confidence annotator with skewed data
5.4 Summary
6 Automated Entrainment
6.1 Two-Way Automated Rule-Based entrainment
6.1.1 Entrainment Events
6.1.2 Heuristic Entrainment Rules
6.1.2.1 Implementing the Entrainment Heuristics
6.1.2.1.1 Long-Term Entrainment
6.1.2.1.2 Short-Term Entrainment
6.1.2.2 Version 1: Testing the Entrainment Rules in Noctívago
6.1.2.2.1 Test Set
6.1.2.2.2 Results
6.1.2.3 Version 2: Testing Entrainment Rules in Let's Go
6.1.2.3.1 Results
6.1.3 Acoustic Distance and Prime Usage Evolution
6.1.3.1 Analysis
6.1.4 Discussion
6.2 A Data-Driven Method for Prime Selection
6.2.1 Prime selection model
6.2.2 Predicting WER with Off-line data
6.2.3 Testing the model in an experimental system
6.2.3.1 Test Set
6.2.4 Results
6.2.4.1 Prime Usage Evolution
6.2.5 Discussion
6.3 Summary
7 Conclusions
7.1 Future Work
A Tools for Oral Comprehension in L2 Learning
A.1 Introduction
A.1.1 Speech synthesis
A.1.2 Digital Talking Books
A.1.3 Broadcast News
A.1.3.1 Integrating Automatically Transcribed news shows in REAP.PT
A.1.3.1.1 Broadcast News Pipeline
A.1.3.1.2 Integration in REAP.PT
A.2 Summary
B Nativeness Detection
B.1 Introduction
B.1.1 Corpus
B.1.2 Nativeness Classification
B.1.2.1 Acoustic Classifier Development
B.1.2.1.1 Feature Extraction
B.1.2.1.2 Supervector Extraction
B.1.2.1.3 Nativeness modeling and scoring
B.1.2.2 Prosodic Classifier Development
B.1.2.2.1 Prosodic contour extraction
B.1.2.2.2 Nativeness modeling and scoring
B.1.2.3 Calibration
B.1.3 Results and discussion
B.1.4 Human Benchmark
B.2 Summary
C Resources used in the experimental sets
C.1 Scenarios used in Noctívago tests
C.2 Questionnaires
C.2.1 Questionnaires used in Section 4.3
C.2.2 Questionnaire used in Section 6.1.2.2
C.2.3 Questionnaire used in Section 6.1.2.2
List of Figures
2.1 Standard architecture of an SDS.
2.2 Spoken Dialog System architecture in [151].
2.3 Influence Diagram of a SDS-POMDP. From [148].
3.1 Architecture of DIGA framework for SDSs.
3.2 Olympus reference architecture. From [22].
3.3 Tree for Noctívago task implemented in Ravenclaw.
3.4 DBN graph for the Let's Go state tracking. From [75].
3.5 Flash-based ECA used in Noctívago.
3.6 Unity 3D-based ECA used in Noctívago.
3.7 Olympus architecture used in the first Noctívago tests (Section 4.3) with telephone interface.
3.8 Olympus architecture used in Let's Go tests (Section 6.1.2.3).
3.9 Olympus architectures used in Noctívago with ECA.
4.1 Example of scenario used.
5.1 Accuracy, Precision and Recall for the tested methods.
5.2 Performance of the stepwise logistic regression compared with the baseline model.
6.1 WER and CTC results for the different configurations tested.
6.2 Accumulated percentage of events.
6.3 Prime usage over time for the concepts confirmation, help and now.
6.4 Prime usage over time for the concepts next query, origin place and start over.
6.5 Prime usage over time for the concepts next bus and previous bus.
6.6 OOV, WER and CTC results for the different configurations tested.
6.7 Comparison between intrinsic prime usages in prompts between Data Driven (DD) and Rule Based (RB) prime selection.
6.8 Comparison between non-intrinsic prime usages in prompts between Data Driven (DD) and Rule Based (RB) prime selection.
A.1 Recognized BN interface in REAP.
A.2 Broadcast News Pipeline.
B.1 DET curves of the GSV-acoustic 256, GMM-prosodic and fusion between both systems.
C.1 Scenario 1.
C.2 Scenario 2.
C.3 Scenario 3.
C.4 Questionnaire used in the first week.
C.5 Questionnaire used in the second week.
C.6 Questionnaire used in the Rule Based tests.
C.7 Questionnaire used in the Data Driven tests.
List of Tables
4.1 Prime analysis in pilot experiments.
4.2 List of primes used in this study.
4.3 Distribution of calls and real success rate according to the device used.
4.4 Success rate of the system in each week.
4.5 Analysis of errors at the turn level.
4.6 WER for the different weeks.
4.7 Use of the primes and error rate in both weeks.
4.8 Example of uptake statistics taken from interaction for the prime próximo.
4.9 Analysis of the uptake of the primes.
5.1 Weights for the features used in the confidence annotator training algorithm.
5.2 Confidence annotation models trained with stepwise logistic regression.
5.3 Classification error rate for the different strategies.
6.1 Examples of the events used in the prime choice update. Primes are in bold.
6.2 Primes used by Noctívago in the heuristic method tests.
6.3 Dialog performance results.
6.4 WER and correctly transferred concepts results.
6.5 User satisfaction results.
6.6 Entrainment events relative frequency.
6.7 Primes used by Let's Go before and after the entrainment rules were implemented.
6.8 Excerpts of dialogs where entrainment rules changed the system's normal behavior. Primes affected in bold.
6.9 Results for Let's Go tests. Statistically significant differences in bold.
6.10 Primes selected according to the minimal and average acoustic distance for each language model.
6.11 Example of how the prime distance was computed.
6.12 Number of turns used to train the prime selection regression.
6.13 Noctívago models.
6.14 Let's Go models.
6.15 Primes used by Noctívago in the data-driven model tests.
6.16 OOV, WER and CTC results for the different versions. Statistically significant results in bold (one-way ANOVA with F(3) = 2.881 and p-value = 0.037).
6.17 Entrainment Events and Non-Understandings relative frequencies. One-way ANOVA tests revealed no statistical significance.
6.18 Dialog performance results.
6.19 Questionnaire results for the different versions.
B.1 Details of the training and test sets for Native and Non-Native data.
B.2 Detailed results for GSV-acoustic 64 and GSV-acoustic 256.
B.3 Results obtained using prosodic features (Accuracy and EER) and the fusion between prosodic systems and GSV-acoustic 256.
1 Introduction
Advances over the last 30 years, especially in Automatic Speech Recognition, have made it possible to pursue the dream of holding conversational dialogs with computers. The system behind the dialog capabilities of computers is usually called a Spoken Dialog System (SDS). SDSs that are able to talk to humans can make life more comfortable. Some services that nowadays rely on human-operator dialogs could in the future be handled by machines, whenever the human operators are not available. The use of SDSs in situations where typing is time-consuming or impossible (while driving, for instance) is also a way to provide safety and comfort in daily life. Personal agents like Apple's Siri [4] are examples of how people look at SDSs nowadays. There is a large potential in this type of technology.
1.1 Motivation
The study of human dialogs should inspire SDS developers to create systems that are able to
follow some human dialog protocols, in order to achieve successful communication.
Lexical Entrainment is a phenomenon that was first described by psychologists and psycholinguists, who analyzed dialogs between humans and observed that the participants converged in the terms they used as the dialogs progressed, in order to achieve successful communication [48, 29]. This implies that sometimes one of the subjects has to give up her/his own words and adopt the words of the other subject, either because the other subject is not familiar with the word presented or because there is a word with the same meaning that can lead to faster and more successful communication. One characteristic of lexical entrainment is that different pairs of subjects are very likely to use different terms to refer to the same object. Another is that subjects do not lexically entrain because they want to. Entrainment is not a conscious
process, which makes it even more difficult to model it within an SDS.
There is an obvious advantage in applying lexical entrainment to SDSs: if the system entrains, the communication will be more successful. There are several reasons for this. First, if the system is able to predict the user's lexical choices, speech recognition is likely to improve [30]. Second, from human-human dialogs it is known that the more two people are engaged with each other, the more entrainment exists [97]. And third, it has been shown that entrainment is critical for successful communication [106].
1.2 Goals
This thesis aims to add a contribution to all the other improvements made in this area over the last decades by integrating Lexical Entrainment in an SDS.
Despite the advances in speech recognition and understanding, there is still a lot of uncertainty in the Spoken Language Understanding (SLU) process, which may affect the success of the dialog. Since lexical entrainment occurs in both directions in a dialog, the human participant can also entrain to the system. Thus, systems may also try to make the user entrain to them whenever a term hinders the SLU process. Whichever direction the adaptation takes, it requires the system to change the terms used in its prompts appropriately. This thesis aims to contribute methods for detecting when and how the system should modify its prompts in order to make the user entrain, or to entrain to the user. By the end of the thesis, we hope to show that the methods used and the results achieved support our initial idea that incorporating lexical entrainment in SDSs increases task success and brings them closer to human-like dialog capabilities.
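As a deliberately schematic sketch of the kind of decision targeted here, the Python fragment below keeps the user's preferred term unless its observed recognition error rate exceeds a hypothetical threshold, in which case the best-recognized alternative is proposed. The threshold, statistics and candidate lists are invented; the actual features and selection methods are developed in Chapters 4 to 6.

def choose_prime(user_prime, candidates, error_rate, threshold=0.4):
    # error_rate: dict mapping each candidate prime to its observed recognition error rate.
    # Keep the user's term while it is not hurting recognition (schematic rule only).
    if error_rate.get(user_prime, 0.0) <= threshold:
        return user_prime
    # Otherwise propose the candidate with the lowest observed error rate;
    # unseen candidates are ranked pessimistically here.
    return min(candidates, key=lambda p: error_rate.get(p, 1.0))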
1.3 Structure of the document
In order to familiarize the reader with SDSs, the first part of Chapter 2 reviews the previous work in this area, with a special focus on techniques used to deal with the uncertainty
provoked by SLU errors. The second part reviews Lexical Entrainment in human-to-human dialogs, as well as its previous applications to human-machine dialogs. Chapter 3 describes the steps towards the creation of a new dialog system in Portuguese, Noctívago, contrasting it with the Let's Go system, which served as its role model. Chapter 4 describes the first tests done with Noctívago and the findings that led to an on-the-fly algorithm for prime selection. An important feature for prime selection is the confidence measure. The development of new confidence measures is presented in Chapter 5, showing that accurate confidence measures may be important to the algorithms for on-the-fly entrainment. Chapter 6 presents the methods and tests of a Rule-Based and a Data-Driven algorithm for prime selection. Finally, the concluding remarks and possible directions for future work close the main block of the thesis in Chapter 7.
Annexes A and B correspond to work developed in the scope of the initial direction of this
PhD thesis, the creation of an SDS for non-native students of Portuguese. The first of these
Annexes describes a set of tools for stimulating oral comprehension in students of European
Portuguese as L2, whereas the second Annex describes our efforts in terms of automatic
nativeness classification.
Annex C includes materials used in the experiments throughout this work.
2 Related Work
SDSs are typically composed of several modules that work as a pipeline to perform conversational dialogs with humans. Figure 2.1 shows the standard architecture of an SDS. The audio is fed into the speech recognizer, which generates text. This text is processed by the language understanding module, which generates a semantic representation of the text input and a confidence score. According to that semantic representation, the Dialog Manager (DM), the "brain" of the system, decides which action to perform next. Although not represented, dialog systems that work in the domain of information services can also have a connection to a backend to perform database queries. The natural language generation (NLG) module maps the action selected by the DM into text. Finally, the text generated by the NLG is synthesized into audio and returned to the user.
Figure 2.1: Standard architecture of an SDS.
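As a minimal illustration of this pipeline, the Python sketch below wires the modules of Figure 2.1 together for a single turn; all module interfaces (recognize, understand, decide, generate, synthesize) are hypothetical placeholders and do not correspond to any particular toolkit discussed in this thesis.

def run_turn(audio, asr, nlu, dm, nlg, tts, backend=None):
    # One pass through the pipeline of Figure 2.1; every module object is a placeholder.
    text, asr_score = asr.recognize(audio)                 # speech recognition
    semantics, confidence = nlu.understand(text)           # semantic representation + confidence
    if backend is not None and semantics.get("needs_lookup"):
        semantics["db_result"] = backend.query(semantics)  # optional database query
    action = dm.decide(semantics, confidence)              # dialog manager picks the next action
    prompt = nlg.generate(action)                          # action -> text
    return tts.synthesize(prompt)                          # text -> audio returned to the user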
Over the last decades, the advances especially in speech recognition have boosted the area
of spoken dialog systems. Rather than just converting the speech into text, spoken dialog
systems aim to perform the challenging task of having a computer using speech to interact
with a human. The domains chosen to perform these dialogs are generally constrained and
task-oriented. Examples of the domains served by research SDSs are travel planning (Communicator project [119]), weather forecasts (MIT Jupiter [155]), public transit information
(TOOT [130] and TRAINS [45]), flight schedule information (ATIS [111] and Mercury [123]),
real-estate information (AdApt [54] and ApartmentLine [137]), tourist information (Waxholm [15] and Voyager [154]), e-mail access (ELVIS [139]), banking systems (CTT-Bank [91]),
restaurant and bar information (DINEX [122] and Cambridge Restaurant Information system
[67]), movie information (MovieLine [137]), navigation (Higgins [126] and CityBrowser [53]),
tutoring (ITSPOKE [84]), and interactive collaborative planning (TRIPS [44]).
Spoken dialog systems are nowadays much closer to real users. They have been used to
perform simple tasks in real-life scenarios like dealing with missing or delayed flight luggage,
magazine subscriptions, or simple tasks in customer services. The Let's Go system [114], in
spite of being a research system, has been providing bus schedule information in Pittsburgh
since 2005. In this decade, technology companies like Apple, Google or Microsoft have devoted
attention to the creation of sophisticated spoken dialog systems. Apple’s personal agent Siri
[4] or Nuance’s Dragon Go! app are the most prominent examples of a new generation of
spoken dialog systems that can be used in mobile devices on a regular basis. Many of these
systems no longer rely on voice alone to interact with the user. Furhat [93], Greta [109], Max
[71] and Ada and Grace [129] are some of the most interesting multimodal dialog systems.
Another type of application that makes use of SDSs is service robots. Systems like Flo [117] or CARL [86] are examples of nursebots especially developed for taking care of elderly people. They often incorporate dialog managers.
Over the years, the research community has tried to build more and more complex spoken
dialog systems, enabling them to have conversational dialogs rather than just command-like
utterance dialogs. However, this has been one of the hardest trade-offs faced by the research
community: conversational dialogs with high speech recognition rates versus command-like
dialogs with lower speech recognition rates.
The need for trade-off solutions involving stricter interactions within narrower domains derives
from the many unsolved problems faced by the speech research community. Spontaneous
speech recognition is still a crucial problem for current speech recognizers, since they are not
prepared yet to deal with disfluencies, repairs, hesitations and filled pauses. In addition, many
of the current systems have to deal with a large population of users (different accents, age
range, speaking styles), with different acoustic backgrounds (telephones, cellphones, laptop
microphones, indoors, outdoors, etc.). Under these conditions, word error rates around 20-30% (or even higher) are common [18]. It has been shown that the word error rate (WER) is inversely correlated with task completion. The damage to task completion is especially critical for WER above 40% [16].
Over the years, different approaches have been used to reduce the impact of the WER. Some
focus on improving the robustness of the ASR front-end and/or acoustic models. Others try
constraining the acceptable inputs to the system. While the first ones often lead to very small
improvements, the second ones highly restrict the dialog capabilities of the system. Some of
the techniques that may be used to constrain the inputs to an SDS will be described in Section
2.1.1.
Targeting more complex problems is not possible with SDSs that can only deal with command-like utterances. Since the current state of the art in spontaneous speech recognition is far
from perfect, in order to be able to deal with more complex dialogs, the research community
has been working under the assumption that speech recognition errors are very likely. Confidence
annotation is used to detect errors in SDS inputs based on a set of context information that
is not used in the traditional methods of confidence scoring for speech recognizers [20, 17, 60,
120]. Examples of previous work in confidence annotation will be described in Section 2.1.2.
Detecting errors accurately is very important to find the correct strategy to recover from them.
Several recovery strategies have been studied for spoken dialog systems [18, 16]. The current
state-of-the-art is Dialog-State Tracking [75, 153], sometimes also called Belief Tracking, where
a statistical model tries to learn how to conduct a dialog in a specific domain. The related
work on this topic will be covered in Section 2.1.3.
This thesis proposes a different strategy to be used in SDSs: on-the-fly lexical entrainment.
This strategy aims not only to make the users entrain to the system whenever error detection indicates that a given term is negatively affecting ASR performance, but also to make
the system adapt to the user’s terms. To provide the necessary context to understand the
integration of lexical entrainment in a Spoken Dialog System, Section 2.2.1 will describe
entrainment in human-human dialogs, and Section 2.2.2 will cover the previous work with
entrainment in SDSs.
2.1 Improving robustness in SDSs
2.1.1 Speech recognition
The success of an interaction with an SDS is often affected by speech recognition errors.
One solution to minimize these errors is to have speaker- and noise-adapted models when
facing adverse conditions. However, in many dialog systems there is no information about
who is talking and from where she/he is talking. This fact prevents any kind of speaker or
noise-adaptation. Thus, alternate methods to minimize the errors need to be found.
The use of a restricted vocabulary and syntax to provide better recognition is a straightforward solution to increase speech recognition performance [80]. This could be done using previously developed techniques to find the most acoustically distinct options [116, 131, 3]. This solution does not take into account the user's lexical preferences, and may result in the system using terms that the users never pick up and use. Consequently, the system may sound less natural and the user may feel less engaged during the interaction. Another possible effect is that novice users who do not know how to address the system may use words that the system cannot recognize, thus making no progress in the dialog. In most cases the system
performance is likely to increase when the vocabulary is constrained, since recognition is likely
to work better. However, this comes at a great cost, since the dialogs that can be performed
by a spoken dialog system with constrained input are very limited and far from human-like
dialog. This is a solution that the SDS research community has tried to avoid over the last
few years, despite being the one adopted in some commercial systems where performance is
a key aspect.
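As a rough illustration of how acoustically distinct alternatives can be preferred, the sketch below compares two candidate vocabularies for the same slot by the average edit distance between their phone transcriptions; the candidate words and phone strings are made up for the example, and real systems would rely on proper pronunciation lexica and acoustic confusability measures such as those in the works cited above.

from itertools import combinations

def edit_distance(a, b):
    # Levenshtein distance between two phone sequences (lists of phone symbols).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def mean_pairwise_distance(vocabulary):
    # Average phone-level distance among all word pairs (needs at least two entries).
    pairs = list(combinations(vocabulary.values(), 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical phone transcriptions for two alternative wordings of the same slot values.
option_a = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
option_b = {"correct": ["k", "ax", "r", "eh", "k", "t"], "wrong": ["r", "ao", "ng"]}

# The vocabulary whose entries are, on average, further apart is less confusable.
best = max([option_a, option_b], key=mean_pairwise_distance)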
2.1.2 Confidence annotation
The use of different information sources to improve the confidence scoring in spoken dialog
systems has also been explored to improve the system performance. In SDSs there are context
information sources that can be more accurate than just the ASR confidence score based on
the acoustic features used to recognize the speech input. The set of features is computed
during live interaction to improve the error recovery strategies. In [16], these features were
provided by the ASR (e.g. acoustic confidence or speech rate), by the Dialog Manager (e.g.
the current dialog state, or if the received answer was more or less expected), or by the
Language Understanding module (e.g. the number of slots in the parse). They could also be
Dialog history features, such as whether the preceding turn was not understood, or prosodic
features, such as pitch or loudness. These features can be used to train a fully-supervised
[85, 60] or implicitly supervised [20] model for confidence annotation. Once the confidence
value is computed, the system can adjust the strategy to be taken in the next turn. For
instance, when a turn is marked with a low confidence score, the system could either repeat
the question or explicitly confirm the information given in the user turn [83]. Accurate
confidence scoring is a crucial item in SDSs to determine the best system action to be taken.
However, per se it only influences the system's next action without affecting the system's lexical choices.
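To make the idea of a feature-based confidence annotator concrete, the sketch below trains a logistic regression classifier on a few of the feature types mentioned above; the feature values and labels are synthetic and serve only to illustrate the training and scoring interface (the annotator actually developed in this work is described in Chapter 5).

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [asr_confidence, speech_rate, n_parsed_slots, answer_was_expected, prev_nonunderstanding]
# Labels: 1 = the turn was correctly understood, 0 = misunderstanding. Synthetic data only.
X = np.array([
    [0.92, 4.1, 2, 1, 0],
    [0.35, 6.0, 0, 0, 1],
    [0.71, 4.8, 1, 1, 0],
    [0.18, 5.5, 0, 0, 1],
    [0.88, 3.9, 2, 1, 0],
    [0.42, 5.2, 1, 0, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])

annotator = LogisticRegression().fit(X, y)

# At run time, the predicted probability is used as the turn-level confidence score,
# which the dialog manager can compare against acceptance/confirmation thresholds.
new_turn = np.array([[0.55, 4.5, 1, 1, 0]])
confidence = annotator.predict_proba(new_turn)[0, 1]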
2.1.3 Dialog Management
The use of dialog-state tracking (sometimes also called belief tracking) in the SDS dialog
manager is now the state-of-the-art. Dialog-State tracking (DST) is a statistical framework
for dialog management. Its creation was motivated by the need for a data-driven framework
that reduces the cost of laboriously hand-crafting complex dialog managers and that provides
robustness against the errors created by speech recognizers operating in noisy environments
[153]. Another possibility offered by DST is on-line learning, which was not possible with
non-statistical approaches to dialog management.
Dialog-state tracking systems combine belief tracking with reinforcement learning for dialog
policy optimization. An SDS is modeled by a Partially Observable Markov Decision Process
(POMDP). A POMDP is an extension of a Markov Decision Process (MDP) previously used
in SDS design [41, 51, 108]. The motivation for adopting POMDP derives from the uncertainty
associated with the ASR input, which does not allow the system to be sure about the user's real intention; thus, the real user intention is not observable to the system [148].
2.1.3.1 POMDP Review
According to [148], a POMDP is defined as a tuple {S, A, T, R, O, Z, λ, b_0}, where S is the state space, A is the action set, T defines the transition probabilities, R is the immediate reward function, O is the set of observations, Z defines the observation probabilities, λ is a geometric discount factor, and b_0 is the initial belief state. At each time step, the system is in some unobservable state s. A distribution over all the possible states, b, is maintained. Based on this so-called "belief state" b, the system selects an action a that has an associated reward r. In the next time step, the system transitions to a new state s', which depends on s and a.
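The following is a minimal numerical sketch of this belief update for a toy POMDP; the transition and observation tables are random placeholders, and the sketch only illustrates the generic computation b'(s') ∝ Z(o|s', a) · Σ_s T(s'|s, a) · b(s), not the SDS-specific factoring discussed below.

import numpy as np

# Toy POMDP with 3 hidden states, 2 actions and 2 observations (made-up tables).
n_states = 3
T = np.random.dirichlet(np.ones(n_states), size=(2, n_states))  # T[a][s][s'] = p(s'|s,a)
Z = np.random.dirichlet(np.ones(2), size=(2, n_states))         # Z[a][s'][o] = p(o|s',a)

def belief_update(b, a, o):
    # b'(s') = k * p(o|s',a) * sum_s p(s'|s,a) * b(s)
    predicted = T[a].T @ b              # sum over the previous states s
    unnormalized = Z[a][:, o] * predicted
    return unnormalized / unnormalized.sum()

b0 = np.full(n_states, 1.0 / n_states)  # uniform initial belief
b1 = belief_update(b0, a=0, o=1)        # belief after one (arbitrary) action/observation pair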
This statistical approach to dialog management associates a belief with any user input. That belief
is updated when an ASR result is produced, depending on the confidence score. Based on the
beliefs available, the dialog manager predicts the user action, which is not observable to the
system. Training a model for dialog management requires large amounts of data (according
to [153], O(10^5)). Often this data is generated from user simulators that were trained with
data from previous interactions with spoken dialog systems. The user simulator can reliably
generate the necessary amounts of data that can hardly be collected when developing an
experimental system. This type of dialog management achieves remarkable improvements
especially when dealing with low confidence turns.
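A user simulator in this setting maps the latest system action to a plausible user act and then corrupts it to mimic recognition errors. The toy sketch below only conveys that interface; the dialog acts, slots and error model are invented, and trained simulators are of course far more elaborate.

import random

def simulate_user_turn(user_goal, system_action, error_rate=0.2):
    # Return (true_user_act, observed_user_act, confidence) for one simulated turn.
    if system_action.startswith("request_"):
        slot = system_action[len("request_"):]
        true_act = ("inform", slot, user_goal.get(slot, "unknown"))
    else:
        true_act = ("affirm", None, None)
    if random.random() < error_rate:
        # Corrupt the observation to mimic ASR/SLU errors, with a low confidence score.
        observed = ("inform", true_act[1], "confused_value")
        return true_act, observed, random.uniform(0.1, 0.5)
    return true_act, true_act, random.uniform(0.6, 1.0)

goal = {"origin": "forbes avenue", "destination": "airport"}
print(simulate_user_turn(goal, "request_origin"))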
2.1.3.1.1 SDS-POMDP
In order to fit an SDS into a POMDP and significantly reduce the
complexity of the problem, the state S was divided into three factors: user goal, user intention
and dialog history. The factored state-space representation for a Spoken Dialog System was
originally presented by Williams and Young in [148]. In [151], an SDS was represented as in
Figure 2.2.
Figure 2.2: Spoken Dialog System architecture in [151].
The user has a user state S_u, which is the goal she/he is trying to accomplish. The previous user turns can be represented as S_d. A_u is the user intention, which will be converted into a speech signal Y_u. Once Y_u is recognized and parsed, it can be represented as (Ã_u, C), where Ã_u is the language understanding output and C is the confidence score associated with that output. S_m is the maintained system state. According to S_m and (Ã_u, C), the next system action A_m will be taken and mapped into an audio output Y_m.
Due to speech recognition errors, Ã_u can differ from A_u, so the real values of S_u, A_u and S_d are hidden from the system. The hidden state of the POMDP, s, is therefore composed of these three components:
s = (s_u, a_u, s_d)    (2.1)
The system state, S_m, now becomes the belief state over these three components, and it is
defined by:
s_m = b(s) = b(s_u, a_u, s_d)    (2.2)
The observations of the SDS-POMDP are given by the noisy language understanding input, Ã_u,
and the confidence score C:
o = (ã_u, c)    (2.3)
Applying these definitions to the original POMDP equations and performing some simplifications (details can be found in [148]), the transition function for an SDS-POMDP is given
by:
p(s'|s, a) = p(s'_u|s_u, a_m) p(a'_u|s'_u, a_m) p(s'_d|a'_u, s'_u, s_d, a_m)    (2.4)
and the observation function:
p(ã'_u, c|s'_u, s'_d, a'_u, a_m) = p(ã'_u, c|a'_u)    (2.5)
These two equations provide a statistical model of a spoken dialog system. The transition
function can predict future behavior, and the observation function tries to infer the hidden
state from the given observations. The user goal and user action models (first two terms of
equation 2.4) can be estimated from an annotated corpus. The dialog history model (last
term of equation 2.4) can be estimated from data, handcrafted or replaced by a deterministic
model. The observation model can also be estimated from corpora. The immediate reward
function will be defined according to the system objectives.
Equations 2.4 and 2.5 can be used to derive the belief state update equation:
b'(s'_u, s'_d, a'_u) = k · p(ã'_u, c'|a'_u) · p(a'_u|s'_u, a_m) · \sum_{s_u ∈ S_u} p(s'_u|s_u, a_m) · \sum_{s_d ∈ S_d} p(s'_d|a'_u, s'_u, s_d, a_m) · \sum_{a_u ∈ A_u} b(s_u, s_d, a_u)    (2.6)
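Read literally, Equation 2.6 is a set of nested sums over the previous user goal, dialog history and user action. The sketch below spells out this structure for a small discrete state space; the model components (observation, user action, user goal and dialog history models) are passed in as hypothetical callables, so this only illustrates how the factored update is organized, not an efficient or complete implementation.

def sds_belief_update(b, a_m, obs_model, action_model, goal_model, history_model,
                      goals, histories, actions):
    # b: dict mapping (s_u, s_d, a_u) -> probability (previous belief state).
    # obs_model(a_u_new):                          p(obs | a_u') for the current observation
    # action_model(a_u_new, s_u_new, a_m):         p(a_u' | s_u', a_m)
    # goal_model(s_u_new, s_u, a_m):               p(s_u' | s_u, a_m)
    # history_model(s_d_new, a_u_new, s_u_new, s_d, a_m): p(s_d' | a_u', s_u', s_d, a_m)
    new_b = {}
    for s_u_new in goals:
        for s_d_new in histories:
            for a_u_new in actions:
                total = 0.0
                for s_u in goals:
                    for s_d in histories:
                        # The innermost sum over the previous user action only touches b.
                        mass = sum(b.get((s_u, s_d, a_u), 0.0) for a_u in actions)
                        total += (goal_model(s_u_new, s_u, a_m)
                                  * history_model(s_d_new, a_u_new, s_u_new, s_d, a_m)
                                  * mass)
                new_b[(s_u_new, s_d_new, a_u_new)] = (obs_model(a_u_new)
                                                      * action_model(a_u_new, s_u_new, a_m)
                                                      * total)
    norm = sum(new_b.values())
    return {state: p / norm for state, p in new_b.items()} if norm > 0 else new_b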
The influence diagram of an SDS-POMDP in Figure 2.3 summarizes the modifications made.
Figure 2.3: Influence Diagram of a SDS-POMDP. From [148]
Young et al [153] emphasized that the POMDP-based model for dialog combines two ideas:
dialog state tracking and reinforcement learning. Dialog-state tracking provides an explicit
representation of uncertainty, leading to systems that are much more robust to speech recognition errors. The user behavior can be inferred from a distribution of recognition hypotheses provided by N-best lists or confusion networks. The system is in fact exploring all dialog
paths in parallel. The next action is not based on the most likely state, but on the probability distribution across all states. The rewards associated with state-action pairs will be the
objective measure that reinforcement learning methods will try to optimize. This could be
done using both off-line data or on-line through interaction with real users.
The main problem with this framework is scalability. As the complexity of the dialog grows,
the number of states in the dialog follows this growth and the optimization problem can
become intractable. To deal with this problem, efficient representation and manipulation of
the state-action space needs to be done using complex algorithms. Policy learning is also
challenging, motivating the use of approximation techniques.
2.1.3.2 Dialog-State tracking
The factored state-space approach is not sufficient to reduce the complexity of the problem.
Thus, further approximations are needed such as N-Best approaches and factored Bayesian
Networks. The N-Best approaches approximate the belief state by the most likely states. In
factored Bayesian Networks approaches, the user goal is factored into concepts that can be
spoken about by the system. The following two sections describe each approach in detail.
2.1.3.2.1 N-Best Approaches
The Hidden Information State model (HIS) [150] is one
The Hidden Information State model (HIS) [150] is one
example of an N-Best approach. In this approach, similar user goals are grouped into equivalent classes, called partitions, assuming that all the goals are equally probable if they are
put in the same partition. The dependencies between the states are defined according to
the domain ontology. The partitions are tree structured, and as the dialog progresses the
root partition is divided into smaller partitions, which reduces the problem complexity and
enables its implementation in a real-time SDS. To simplify, in the HIS the user intention
remains the same during the belief update stage, although this is not necessarily true for all
the belief update techniques that used an N-Best approach [69]. The N-Best approach can
be problematic when the dialogs are too long. The tree will have more nodes as the dialog
progresses. Some pruning techniques have been developed [147, 50]. An effective solution is
to compute a marginal probability for each slot. The low-probability slot-value pairs are pruned by recombining them with complementary slot-value pairs [50].
2.1.3.2.2 Factored Bayesian Networks Approaches
Another approach to update the
belief state is to factor the user goal into concepts, and model the dependencies between concepts with an incomplete distribution that handles a limited number of dependencies, but
models the complete distribution. The factoring process results in a Bayesian network,
where belief propagation algorithms are used in order to update the beliefs. The marginals
for conditionally independent states are directly computed from the belief propagation algorithms. However, for limited dependencies an approximation for the marginal needs to be
computed. Loopy belief propagation (like in the Bayesian Update of dialog state BUDS [133])
or particle filters can be used to solve this problem [1]. This factored approach can also be
combined with the N-best approach to take advantage of the benefits of each approach [134].
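As an illustration of what a per-concept marginal belief can look like, the following is a deliberately simplified sketch that keeps an independent distribution over the values of one slot and folds in confidence-scored SLU hypotheses turn by turn; the slot name and hypotheses are invented, and real trackers (such as those cited above) handle dependencies, pruning and "no observation" cases far more carefully.

from collections import defaultdict

def update_slot_belief(belief, nbest):
    # belief: dict value -> probability for one slot (may be empty before the first turn)
    # nbest:  list of (value, slu_confidence) hypotheses for this slot in the current turn
    new_belief = defaultdict(float, belief)
    for value, conf in nbest:
        # Accumulate confidence-weighted evidence for each hypothesized value.
        new_belief[value] += conf
    total = sum(new_belief.values())
    return {v: p / total for v, p in new_belief.items()}

# Example over a hypothetical "departure place" slot across two noisy turns.
b = {}
b = update_slot_belief(b, [("forbes avenue", 0.6), ("forward avenue", 0.3)])
b = update_slot_belief(b, [("forbes avenue", 0.5)])
best_value = max(b, key=b.get)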
2.1.3.3 Policy optimization
The policy optimization aims to maximize the reward function at the end of the dialog.
According to [153], a non-parametric policy must encode firstly a partitioning of belief space
such that all points within any partition map to the same action, and secondly it must encode
the optimal action to take for each partition. Since the use of an exact POMDP representation
in SDS is intractable, a compact representation for policy is needed. In most dialogs only a
subset of the dialog space is used. Thus, optimization can be achieved by computing belief
tracking in this subset, where decision-taking and policy optimization will be performed. In
order to achieve this behavior, the belief state in the dialog space b is mapped to a vector b̂
and a set of candidate actions â. The policy will be used to select the best action to take for
a set of candidate actions, and then map it back to the full action in the full dialog space.
This mapping requires two different steps: select the candidate actions in the master space,
and extract the features from the belief state and candidate actions. In the first step, the
selected action could simply be the action with highest belief [133]. However, this could
lead to inadequate action choices, such as “inform welcome” in the middle of the dialog. To
mitigate this problem, human knowledge has been incorporated to select the candidate action
[146]. Other approaches use the whole set of actions, but contain the slots that are connected
to each action, using handcrafted heuristics [152]. In the second step, one binary feature is
normally created for each dialog act. The dimensionality of this vector will typically depend
on the task. The typical state features according to [153] are: the belief in the top N user goals
or partitions; the top marginal belief in each slot; properties of top user goal or partition; the
system actions available; dialog history properties; most likely previous user actions [152, 81].
Some of these features are selected either by hand or through automated selection, and may
also be complemented with features like those used for confidence annotation.
For each summary space, the policy may follow a deterministic mapping π(b̂) → â, or a
conditional probability distribution π(b̂, â) = p(â|b̂) where the action is selected by sampling
the distribution. This policy is now a function of the summary space belief state and action.
Similarly, the Q-function, which provides the expected discounted sum of rewards, can be represented in the summary space. In this case, the approach to find an optimal policy is to
maximize the Q-function for the summary space:
π*(b̂) = arg max_â Q*(b̂, â)    (2.7)
Methods like Monte-Carlo optimization, least-squares policy iteration or natural actor-critic
have been used in SDSs. Currently, optimized methods like Q-Learning [121, 39, 107] or
SARSA [58, 47] are used in SDS policy optimization. The details of dialog-state update and
policy optimization in the statistical dialog manager used in this thesis will be described in
Section 3.2.3.2.
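As a minimal illustration of policy optimization in a summary space, the sketch below applies a standard tabular Q-Learning update over a crude, hand-made discretization of the belief; the summary features, candidate actions and rewards are invented for the example and are far simpler than those used in real systems.

import random
from collections import defaultdict

ACTIONS = ["accept", "confirm", "ask_again"]   # toy candidate summary actions
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1         # learning rate, discount, exploration rate

Q = defaultdict(float)                         # Q[(summary_state, action)]

def summarize(belief_top_prob):
    # Very crude summary space: bucket the probability of the top user goal.
    return round(belief_top_prob, 1)

def choose_action(b_hat):
    if random.random() < EPSILON:              # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(b_hat, a)])

def q_update(b_hat, action, reward, b_hat_next):
    # Q(b̂,â) <- Q(b̂,â) + alpha * [r + gamma * max_â' Q(b̂',â') - Q(b̂,â)]
    target = reward + GAMMA * max(Q[(b_hat_next, a)] for a in ACTIONS)
    Q[(b_hat, action)] += ALPHA * (target - Q[(b_hat, action)])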
2.1.3.4 Summary
Although the use of statistical dialog managers has produced enormous improvements in system performance, there are some drawbacks to their use. First, the models are very difficult to scale to more complex dialogs, and second, porting them to new domains requires a
considerable amount of work. There has been some recent work to mitigate this problem [49]
with promising results, where the adaptation is made from an already existing domain. There
has been also work on reducing the amount of training data required to create a new model.
In [78], the results have shown that using discriminative methods even with a limited set of
features can lead to improvements even when there is some mismatch between the training and the test data.
2.1.4 Language Generation
In most SDSs, language generation follows a template-based approach, which despite being
easier to handle, makes the creation of prompts for a new system time-consuming. Some work
has been done in this field to avoid this limitation using data-driven methods for language generation. Some of the techniques used were bi-gram language models [98], ranked rules from a corpus of manually ranked training examples [142] or, given a content plan, selecting a sentence from a set generated by clause-combining operations [127]. Another approach to language generation was the use of a machine translation engine to create sentences in natural language from an internal system representation, as was done in Mountain [74].
Recent approaches try to apply reinforcement learning to Language Generation. In [79],
reinforcement learning for Dialog Management is combined with reinforcement learning in
Language Generation to adjust the prompts according to the number of concepts that the
system was able to extract from the user input. There are three possibilities to place the
concepts in the system prompts: list, contrast and cluster. This strategy was trained and
tested on a simulated user and there was an increased reward when RL for Dialog Management
was combined with RL for Language Generation. RL for Language Generation was also used
to optimize the use of Referring Expressions (RE) in an Internet Service Provider customer
service SDS [66]. The goal was to adjust the words used in the system prompts to the level
of expertise of the customer. For this purpose two models were created: one to deal with
expert users and another to deal with novice users. The models were first tested with a
user simulator. This user simulator is somewhat different from those used to train DM. In
addition to the action level representation, it also has the RE level representation. There
were 90 different user models in the test set. The RL methods achieved better rewards and
a lower number of turns per dialog compared to rule-based adaptive methods or non-adaptive
methods. This data driven strategy was tested with a limited set of real users with promising
results in terms of objective system measure and subjective user feedback [65].
Although the methods described involve some sort of prompt adaptation to the user, they
have not used entrainment features in their learning methods. In the first case presented [79],
it is the way the system is presenting the information to the user that is adjusted. In the
second case [66], the system is adapting REs to the user's level of expertise. In our case, we are targeting an adaptation to each user that is supported by previous findings on Lexical Entrainment in human-human dialogs. The following sections give an insight into what has been studied in human-human entrainment and how this was transferred to human-computer
entrainment.
2.2 Entrainment
The approaches presented so far have produced improvements in SDS performance. But they also have drawbacks. The choice of longer and acoustically distinct words could lead to highly dispreferred lexical choices. The improvements in confidence scoring and statistical dialog management do not let the system's behavior be influenced by the user; that is, they do not adapt to the user during live interaction.
The improvements in natural language generation have already introduced adaptation to the type of user as a possible direction to improve system performance. But the question of how the system should adapt has not yet been answered. This section gives an insight into the research on entrainment in human-human dialogs, followed by recent approaches that use this knowledge in spoken dialog systems.
2.2.1
Entrainment in Human-Human Dialogs
Entrainment is beginning to be recognized as an important direction in SDS research [59].
It has been reported that in human-human dialogs entrainment occurs at various levels:
lexical, syntactic/semantic and acoustical, and the different levels elicit entrainment among
one another [106].
The lexical entrainment studies carried out for human-human dialogs have shown that subjects establish implicit conceptual pacts with one another in order to achieve success in task-oriented dialogs [29]. Sets of dialogs were studied where participants collaborated to
co-ordinate word choice rather than only using their own preferred words. They followed
the output/input coordination principle [48], which states that the next utterance is going
to be formulated according to the same principles of interpretation as the previous successful
utterances. This coordination is not reached by explicit negotiation of the lexical items to
be used, but rather through imitation during the interaction. Frequency is also important: the more common a particular conceptualization is, the stronger the conceptual pact [28]. Reitter and colleagues [115] introduced priming as the process of influencing the linguistic decisions of the other interlocutor. Hence, the linguistic structures that will be used to influence the linguistic decisions can also be called primes. In their study, evidence of priming is more visible in task-oriented dialogs, which is the domain of most SDSs. In addition, a mathematical entrainment measure was developed, and a high correlation was found between this measure and success in task-oriented human-human dialogs [94]. This constitutes a theoretical background that could be automated to try to establish conceptual pacts between the system and its users. When combined with dialog-state tracking dialog management and accurate confidence scoring, this is likely to increase the system performance.
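To make the idea of an automatable entrainment measure concrete, here is a minimal sketch of a lexical overlap score between the word distributions of two speakers. It is only illustrative of the kind of measure meant above, not the metric of [94]; the function name, tokenization and scoring choice are all assumptions.

```python
from collections import Counter

def lexical_entrainment(turns_a, turns_b, vocabulary=None):
    """Toy lexical entrainment score between two speakers.

    For each word w, compare its relative frequency in speaker A's turns
    with its relative frequency in speaker B's turns; the score is
    1 - 0.5 * sum_w |p_A(w) - p_B(w)|, i.e. the overlap of the two
    unigram distributions (1.0 = identical word usage, 0.0 = disjoint).
    """
    counts_a = Counter(w for t in turns_a for w in t.lower().split())
    counts_b = Counter(w for t in turns_b for w in t.lower().split())
    vocab = vocabulary or set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    divergence = sum(abs(counts_a[w] / total_a - counts_b[w] / total_b) for w in vocab)
    return 1.0 - 0.5 * divergence

# Example: a user who reuses the system's words scores higher.
system = ["qual o proximo autocarro", "deseja fazer uma nova pesquisa"]
user_entrained = ["o proximo autocarro", "nova pesquisa"]
user_distant = ["quero o seguinte", "outra procura"]
print(lexical_entrainment(system, user_entrained))  # closer to 1
print(lexical_entrainment(system, user_distant))    # closer to 0
```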
2.2.2
Entrainment in Human-Computer Dialogs
There were some studies that established differences between human-human and human-computer dialogs [26, 28, 24]. When communicating with machines, humans use abbreviated and telegraphic strings. Another difference found is that in human-human conversation a new term has to be repeated two or more times before uptake occurs and a conceptual pact is established [29]. In human-computer dialogs, new primes generally start to be used immediately after they are introduced by the system, including those that are less frequent in daily use, as reported in the literature [24]. Users tend to adopt the system's terms because they think that the system is not in a position where it can negotiate with them. They think that using the computer's terms is less likely to provoke errors [28]. As was verified in human-human dialogs, the main motivation for entrainment is successful communication [24]. This phenomenon could be stronger in human-machine dialogs, leading users to the use of highly dispreferred linguistic structures whenever they believe that is necessary for successful communication. Users see the computer's ability to understand them as limited and domain-constrained. One of the motivations to use lexical entrainment is to avoid the idea of an inflexible SDS and eliminate the use of highly dispreferred lexical items.
Lexical entrainment has already been tested in a text-based dialog system [90]. There were improvements in the system performance, and the users preferred the system that performed entrainment. Lexical entrainment was first tested in an SDS with the Let's Go system [128]. Some syntactic structures and terms were modified to study their impact on the users' choice of words. The different word choices did not correspond directly to words that affected the dialog, although an influence in concept acquisition was shown. In [99], the authors confirmed that real users also entrained to SDSs. They modified terms that the system had been using for a long time and that, in addition, affected the dialog. They observed that some terms were adopted more than others. This is evidence that a policy to select better terms can be a valuable resource during the interaction, since users show preferences for some terms over others.
Just as in human-human dialogs, entrainment at different levels has also been studied in human-machine dialogs. Users tend to follow the same syntactic pattern as the question they are answering [25]. Entrainment at the lexical and acoustic/prosodic levels was investigated to find correlations between entrainment and learning [143].
Acoustic entrainment was also used to modify the way users address the system [43]. This is especially useful when shouting or hyperarticulation is found. These speech styles negatively affect the speech recognition performance. First, the authors developed methods to automatically detect these speaking styles. They then tested different strategies to deal with what they had detected on the fly: explicitly asking users to revert to their normal style, changing the dialog slot, or changing the system's speaking style. This last strategy makes users acoustically entrain to the system, returning them to a “neutral” speaking style (softer, getting them to stop shouting, for example) that was more likely to be successfully recognized by the system. Evidence of acoustic entrainment was found. All of the strategies performed better than the baseline system, where the system did not detect the user's speaking style.
2.3
Summary
This chapter summarized the related work done to improve SDSs. We started by giving
examples of research SDSs previously developed in many different domains. Then we summarized previous work done to improve the performance of SDSs. We have shown that this
improvement could be achieved by reducing the system vocabulary, by improving confidence
annotation, by adapting the natural language generation or by using state-tracking dialog
management. This last approach was extensively described, since it is the current state-of-the-art in dialog management.
We have also introduced the theoretical background and motivations for the implementation of lexical entrainment in SDSs. The last section was dedicated to examples of previous experiments on entrainment carried out with dialog systems.
3
Creating Noctı́vago, a
Portuguese Let’s Go
This chapter will follow the steps in the creation of an SDS for European Portuguese. Our choice was to develop a Portuguese counterpart for Let's Go. In this chapter, this option will be explained and the two systems will be contrasted in order to highlight what was developed in this thesis.
3.1
Choosing a Framework for Noctı́vago
Two different frameworks were under consideration to develop an SDS for the domain that
we were targeting. The following two sections will describe DIGA and Olympus, in order to
establish a contrast between both frameworks and explain our preference towards Olympus.
3.1.1
DIGA
One of the frameworks under consideration was DIGA [95], whose block diagram is shown in
Figure 3.1. The architecture is based on Galaxy II HUB [124] that establishes a socket-based
protocol for communication between the different modules. Galaxy works as a hub that routes
messages between the different system components. In DIGA, the Audio Manager deals with
both input and output speech and the Service Manager is responsible for executing the user
requests.
The adopted speech recognizer is Audimus [96]. The text generated by Audimus is passed to
the Language Interpretation and then to the STAR dialog manager [87]. The information is
sent to the Service Manager to perform the backend query. The resulting action is transferred
to the NLG module to generate the text. Finally, the text is synthesized using the DIXI+
text-to-speech (TTS) system [101], and played using the audio manager.
Figure 3.1: Architecture of DIGA framework for SDSs.
This framework has some advantages for the adaptation of a system in European Portuguese (EP) to a new domain. The most obvious is that the ASR and TTS engines for EP had already been integrated in other systems. However, the adaptation to a new domain
also required modifications in the Service Manager and Language Interpretation modules.
The former processes the user input text in a pipeline that involves several stages. The final
result is a candidate list. The speech act is selected according to the services available in the
Services Manager. The lack of support and documentation would create many difficulties for
this adaptation task. For these reasons, a different option was also considered, the Olympus
framework for SDSs [22], which will be presented in the following section.
3.1.2
Olympus
The reference architecture for Olympus is pictured in Figure 3.2. Like DIGA, it also relies
on the Galaxy-II HUB to implement a centralized message-passing infrastructure for communication between the different modules. The reference architecture has a recognition server
that performs Voice Activity Detection (VAD), and sends the voiced frames to the connected
speech recognizers (several recognition engines can be connected and work in parallel). The
reference recognizer is Sphinx [63]. The result of the recognition is then passed to the NLU
module that is divided into two separate modules: the Phoenix semantic parser [145], and the
Helios confidence annotator [17]. Phoenix generates a semantic representation for the text
input, and Helios attributes a confidence score to that semantic representation. The result
is passed to the Ravenclaw [21] dialog manager, that interprets the semantic representation,
selects the next action and performs the query back-end to get the information requested.
The semantic representation of the system action is generated by Ravenclaw and transferred
to Rosetta, a template-based language generator [119], to create a text representation of the system action. This text is finally sent to Kalliope [22], a synthesis manager that is compatible with various synthesis engines such as Microsoft's SAPI [92], Festival [11] or Cepstral
Swift [36]. The TTYServer is an additional component that allows Olympus to work in a text
input/output mode, leaving out the speech recognition and synthesis. The text input version
is very useful not only for development purposes, but also to connect Olympus to an Embodied Conversational Agent, as will be detailed in Section 3.2.5.
Figure 3.2: Olympus reference architecture. From [22].
Olympus was developed to provide means for flexible adaptation to new domains. It is possible to replace any component by an equivalent component. In addition, a task specification
language was developed to easily create new Ravenclaw-based dialog managers. This is reflected in the dozens of systems that were created for different domains using Olympus. A few examples are RoomLine for room reservations [18], Let's Go for bus schedule information [114], LARRI for aircraft maintenance tasks [19] or TeamTalk for command and control of robots [55].
Besides domain adaptation, in this case we also needed a framework that could facilitate
the language transfer, since in the reference architecture the modules are prepared only for
English. This task would involve changes in the ASR, NLU, NLG and TTS modules. Despite
the fact that many modules would be affected, the changes were made at a high level: a new lexicon, new language and acoustic models, a new semantic grammar specification, new NLG templates, and the installation of a new synthesizer for Portuguese that works with Microsoft's SAPI synthesis engine. The only drawback was that SPHINX would have to be replaced by a different speech recognizer for Portuguese. This would require the total replacement of one of the system modules, and the implementation of the communication protocol for the Galaxy-II Hub in Audimus. However, these modifications were mostly straightforward, whereas the modifications needed in DIGA to adapt to a new domain were not. Another important factor was the fact that Let's Go has been using Olympus for several years. Given that Noctívago works in the same domain, the expertise gathered in Let's Go could be used in the development of Noctívago. These were the reasons why the Olympus framework for SDSs was used throughout the work developed in this thesis.
3.2
Modules Used
The experiments held in this thesis involved two different spoken dialog systems that, despite targeting the same domain, bus schedule information, differed in language and in the type and number of users. Let's Go is a system that provides bus schedule information to real users during off-peak hours. The system has been running live since 2005, and receives an average of 40 calls during weekdays and 90 during weekends. Noctívago was the experimental system developed for European Portuguese. It was inspired by Let's Go and provides bus schedule information for night buses in Lisbon. The choice to cover night buses was obvious due to the similarities between the frequencies of Lisbon's night buses and off-peak hour buses in Pittsburgh. Originally, Noctívago covered 8 night bus lines.
In this section, the components of each system will be described, compared and contrasted.
3.2.1
Speech Recognition
Although the recognizer in the original Olympus architecture was Sphinx-II, the current
version of Let’s Go uses PocketSphinx [64]. Noctı́vago uses the Audimus speech recognizer
[96] for European Portuguese.
3.2.1.1
PocketSphinx
Let’s Go uses two gender-specific acoustic models and context-dependent language models
in parallel. The 1-best hypotheses generated by each of the recognizers are parsed, and the
hypothesis annotated with the highest confidence score is the one selected.
Both the general-purpose and the context-dependent language models were trained with one year of Let's Go data, plus data collected from crowdsourcing. The resulting dataset comprises 18070 dialogs, corresponding to 257658 user turns. The acoustic model was trained using a corpus of
2000 real dialogs that were manually transcribed [113].
3.2.1.2
Audimus
As already mentioned in Section 3.1.2, the first step was to implement the communication protocol used for speech recognizers in Olympus.
The Audimus acoustic models for telephone speech were originally developed for a broadcast news task, with a 100k word vocabulary and a statistical backoff 4-gram language model. The acoustic models were gender-independent, so only one Audimus engine needed to be connected to the recognition server.
The original language model and vocabulary were inadequate for the current task, both in terms of processing time and word error rate. Building appropriate language models for this task was thus one of our first goals. Since there was no real data available, an automatically generated corpus based on the parser grammar description was used for this purpose. Our first attempt comprised a very small 30k-sentence context-independent corpus, with a vocabulary of around 300 words. The pronunciation lexicon was built using an in-house lexicon and a grapheme-to-phone rule-based conversion system [35]. In an early stage, the lexicon included a total of 397 pronunciation entries.
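As an illustration of how such an artificial corpus can be produced, the sketch below expands a tiny weighted grammar into sentences; the grammar format, non-terminal names and weights are invented for the example and do not reproduce the actual parser grammar description used for Noctívago.

```python
import random

# Illustrative context-free templates for the bus-information domain;
# each non-terminal expands to a list of (weight, expansion) pairs.
GRAMMAR = {
    "$QUERY": [(3, "quero ir de $STOP para $STOP"), (1, "qual o autocarro para $STOP")],
    "$STOP": [(2, "cais do sodre"), (2, "marques de pombal"), (1, "rato")],
}

def expand(symbol, rng):
    """Recursively expand a non-terminal by weighted random choice."""
    if symbol not in GRAMMAR:
        return symbol
    weights, options = zip(*[(w, o) for w, o in GRAMMAR[symbol]])
    expansion = rng.choices(options, weights=weights, k=1)[0]
    return " ".join(expand(token, rng) for token in expansion.split())

def generate_corpus(n_sentences, seed=0):
    rng = random.Random(seed)
    return [expand("$QUERY", rng) for _ in range(n_sentences)]

# A 30k-sentence corpus, the size used for the first Noctívago language model.
corpus = generate_corpus(30000)
print(corpus[:3])
```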
Since Noctívago is a newly created system, once new data was collected, it was used to iteratively improve the language model. In addition, new bus lines were added to the baseline system version, which increased the number of words in the vocabulary. Another improvement was the introduction of context-dependent language models for Place (when the system was requesting departure and arrival stops), Time (when the system was requesting the travel time), Confirmation (when the system was doing explicit confirmation) and Next Query (when the system was asking for the following action, after the result of the query was provided). Place and Time were statistical bi-gram language models, whereas Confirmation and Next Query were modeled as SRGS (Speech Recognition Grammar Specification) grammars, since the number of possible inputs was far smaller than in the two other context-dependent models. The 300k-sentence artificial corpora generated to train the bi-gram language models were created using context-dependent grammars based on the frequencies observed during the previous data collections. The number of different words in the Place and Time corpora was 654 and 144, respectively. The SRGS grammars were also designed according to the answers observed in previous data collections. In order to be used by Audimus, the SRGS grammars were converted to language models on the fly. The number of words that could be recognized with the context-dependent model for Confirmation was 25, whereas for Next Query it was 51. These last models were used in the tests that will be described in Section 6.2.3.1 and they had a positive impact on the system performance, as will be shown later.
3.2.2
Natural Language Understanding
For NLU, both systems use the same components: Phoenix for semantic parsing, and Helios
for confidence annotation. According to [145], Phoenix is designed for the development of
simple, robust natural language interfaces to applications, especially spoken language applications. Because spontaneous speech is often ill-formed, and because the recognizer will make
recognition errors, it is necessary that the parser be robust to errors in recognition, grammar,
and fluency. This parser is designed to enable robust partial parsing of these types of input.
Phoenix parses each input utterance into a sequence of one or more semantic frames.
A Phoenix frame is a named set of slots, where the slots represent related pieces of information.
Each slot has an associated context-free semantic grammar that specifies word string patterns
that can fill the slot. The grammars are compiled into recursive transition networks, which are
matched against the recognizer output to fill slots. Each filled slot contains a semantic parse
tree with the slot name as root. A new set of frames and grammar rules was specified for our system to fill each slot in a frame. In addition to producing a standard bracketed string
parse, Phoenix also produces an extracted representation of the parse that maps directly onto
the task concept structures.
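For illustration only, the sketch below mimics the frame-and-slot idea with simple regular expressions; it does not use Phoenix's actual grammar syntax, and the frame name, slot names and patterns are assumptions.

```python
import re

# Hypothetical frame definition: each slot has simple word-string patterns.
QUERY_FRAME = {
    "DeparturePlace": [r"de (?P<value>cais do sodre|rato|marques de pombal)"],
    "ArrivalPlace": [r"para (?P<value>cais do sodre|rato|marques de pombal)"],
    "TravelTime": [r"(?P<value>agora|\d{1,2} horas)"],
}

def parse(utterance, frame=QUERY_FRAME):
    """Fill whichever slots match; unmatched slots are simply left empty,
    which mimics the robust partial parsing behaviour described above."""
    filled = {}
    for slot, patterns in frame.items():
        for pattern in patterns:
            match = re.search(pattern, utterance.lower())
            if match:
                filled[slot] = match.group("value")
                break
    return filled

print(parse("quero ir de rato para cais do sodre agora"))
# {'DeparturePlace': 'rato', 'ArrivalPlace': 'cais do sodre', 'TravelTime': 'agora'}
```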
A new grammar was created for Noctívago based on the original Let's Go grammar. Some modifications were introduced in the grammar due to the dialog strategy that was followed.
Helios relies on a logistic regression model trained with a large number of features, as detailed in Section 5.2. The model used in Let's Go was trained on real data from collected dialogs. Noctívago used the model that is provided in the Olympus tutorial system for the development of general-purpose systems, since no previous data was available. It was trained with RoomLine data. The problem of training a new confidence model for a new system is challenging, since it is highly dependent on the system components. The creation of an
appropriate confidence model for Noctı́vago will be addressed in Chapter 5.
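A minimal sketch of a logistic-regression confidence annotator of the kind described above, using scikit-learn; the four features and the toy data are invented for the example and are not the actual Helios feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per user turn: ASR score, parse coverage,
# number of slots filled, and whether the turn followed a confirmation.
X = np.array([
    [0.92, 1.0, 2, 0],
    [0.35, 0.4, 1, 0],
    [0.80, 0.9, 3, 1],
    [0.20, 0.1, 0, 0],
    [0.65, 0.7, 1, 1],
    [0.15, 0.3, 1, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = concept correctly understood

model = LogisticRegression()
model.fit(X, y)

# Confidence score for a new turn with a mid-range ASR score.
new_turn = np.array([[0.55, 0.6, 2, 0]])
print(model.predict_proba(new_turn)[0, 1])
```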
3.2.3
Dialog Management
Two different frameworks for dialog management will be described in this section: Ravenclaw and Cornerstone. The first one is a plan-based dialog management framework, whereas the
second is a statistical framework for dialog management. The first experiments with Noctı́vago
described in Sections 4.3 and 6.1.2.2 used Ravenclaw as the dialog manager. The experiment
ran with Let’s Go (Section 6.1.2.3) and the final experiment done with Noctı́vago (Section
6.2.3) used the Cornerstone dialog manager. In this section, both approaches will be briefly
described.
3.2.3.1
Ravenclaw
Ravenclaw was designed to easily build new SDSs in different domains. The two-tier architecture encompasses a Dialog-Task specification, which is domain-dependent, and the Dialog Engine itself, which is domain-independent. The Dialog-Task specification is a hierarchical plan for the interactions, provided by the system developer. The Dialog Engine deals with error handling, timing and turn-taking, and other dialog resources such as providing help, repeating the last utterance, quitting, starting over, etc. These tasks are similar across different domains.
The task is specified as a tree of dialog agents, each one responsible for handling a subtask in
the dialog. Agencies are subtasks that will be further decomposed. Inform Agents represent
the system providing information to the user. Request Agents represent the system asking
a question to the user and understanding the answer. Execute Agents represent the nonconversational actions that the system must perform such as querying databases. Expect
Agents are similar to Request Agents except that they do not ask any question to the user, they
just perform the understanding part. The default unfolding of a dialog given a task specification is a depth-first, left-to-right traversal. That is, Agencies are immediately traversed towards their children agents, and the children agents are traversed from left to right. This is done using a dialog stack that captures the discourse structure during the interactions. During the Execution Phase, the Agencies are pushed onto the stack, and once they are completed they are popped from the stack. The control of the execution is then given back to the parent node of the tree. The completion or execution of Agencies can be triggered by a defined precondition, as well as by success or failure criteria.
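The sketch below illustrates the depth-first, left-to-right traversal with a stack of dialog agents; the classes and the small task tree are simplified stand-ins, not Ravenclaw's actual data structures or execution phases.

```python
class Agent:
    """Leaf dialog agent; executing it performs one interaction step."""
    def __init__(self, name):
        self.name = name
    def execute(self):
        print(f"executing {self.name}")

class Agency(Agent):
    """Non-leaf agent whose children are traversed left to right."""
    def __init__(self, name, children):
        super().__init__(name)
        self.children = children

def run_dialog(root):
    # A stack drives the traversal: expanding an agency pushes its children
    # so that the leftmost child is executed next (a simplification of
    # Ravenclaw's separate execution and completion phases).
    stack = [root]
    while stack:
        agent = stack.pop()
        agent.execute()
        if isinstance(agent, Agency):
            # Push children in reverse so the leftmost child runs first.
            stack.extend(reversed(agent.children))

tree = Agency("Noctivago", [
    Agent("GiveIntroduction"),
    Agency("GetQuerySpecs", [Agent("RequestOriginPlace"), Agent("RequestTime")]),
    Agent("ProcessQuery"),
    Agent("GreetGoodbye"),
])
run_dialog(tree)
```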
Figure 3.3 shows the final task specification developed for Noctı́vago. The system starts
by Giving Introduction to the users. Then the PerformTask agent will pass the control
to the GetQuerySpecs agent to collect the data for the query. The tree then starts to be traversed in a depth-first, left-to-right manner. Since the first Agent on the left-most end of the tree is an Expect Agent, the control is given to the GetOriginPlace Agency. It will start by
asking the origin place. If the answer given is a Neighborhood rather than a bus stop, the
control is passed to the GetOriginPlaceNeighborhood Agency to perform the leaf agents to
get the correct neighborhood stop. This is possible due to a trigger placed in the Agency
that only starts the Neighborhood strategy if the parsed input is indeed a Neighborhood.
Otherwise the DM will try to complete the next Request Agent: RequestDestinationPlace.
The ExpectRouteForContinuation is used when the user has chosen to continue the trip after
she/he was already given the first result. The Agency GetDepartureArrivalDisambiguation
will be triggered if the Departure and Arrival stops are the same, in order to detect which
one is correct. The next Request agent on the tree is the RequestTime that fills the time
slot in the query. Finally, before the control is returned to the PerformTask Agency, the
QueryConfirm agent checks if the data collected is correct. If not, the RequestWrongSlot will
ask which slots were mistakenly filled and clear them. This flow is repeated until the user
says that all the slots were correctly filled. The control is then passed to the ProcessQuery
that executes the database query. Once the result of the query is available, the GiveResults
Agency will be executed. If the system finds a bus schedule, the InformSuccess Agent will be executed; if not, the InformError Agent will be pushed onto the stack. After the system informs the user, the RequestNextQuery Agent will ask the user what to do next. If the user says goodbye, the GiveResults Agency is successfully completed and the Noctívago Agency executes the GreetGoodbye Inform Agent. If the system understands that the user wants a new query, the ConfirmNewQuery Agent will explicitly confirm this information. If the user confirms, the system informs that a new query is going to start, the concept values will be cleared and the GetQuerySpecs Agency will be executed again. If the user requests the Price, the InformPrice Agent will be executed. Once the price information is provided, the system will execute the RequestNextQuery again.
This tree must be specified in the RavenClaw Task Specification Language, which is a set of predefined C++ macros that are expanded into native code when the system is compiled.
Other fundamental elements that need to be specified for each task are the concepts. They
store all the data that the system manipulates during the interaction. The other fundamental
data structure associated with the concepts is the expectation agenda. This is the record of
the concepts that the system is expecting to receive at any given moment. During the Input
Phase of the dialog, the information provided by the user is transferred to the concepts using
the expectation agenda. This agenda is tightly connected with the task specification. For
instance, in the bus scheduling domain, if the system asks for the departing stop, the most
expected user input is a bus stop, rather than the time she/he wants to travel, which should
be placed at a lower level of the expectation agenda.
Ravenclaw also encapsulates a set of error recovery strategies (explicit or implicit confirmation, help, repeat, etc.) and task-independent dialog acts (quit, start over, timeout, etc.).
For more details about Ravenclaw, the reader should consult [16].
3.2.3.2
Cornerstone
In 2012, the Let’s Go Ravenclaw Dialog Manager was replaced by a POMDP-based DST
Dialog Manager, Cornerstone. The dialog agencies used in Let's Go with RavenClaw are now modeled as states. Instead of using an agenda, the transitions between states are given a probability model. The model was trained using 2718 real Let's Go dialogs, corresponding
to 23044 user turns. This model was used throughout the tests performed using Cornerstone.
Figure 3.3: Tree for Noctı́vago task implemented in Ravenclaw.
The complete description of the Dialog Manager, the results achieved and the comparison
with the previous Dialog Manager can be found in [75]. The description of the methods used
to train a DST Dialog Manager with unsupervised data can be found in [76]. In Section
6.1.2.3 the dialog manager uses the same dialog strategy that was used for Let’s Go with
the implementation of the prime selection algorithm described in Section 6.1.2.3. In Section
6.2.3.1 we introduced small changes in the dialog strategy, taking into account the strategy that was previously used with RavenClaw, and we also trained a new confidence calibration model with the data collected up to that point. Besides the Rule Based algorithm used in the Let's Go experiments, this DM also includes the implementation of the Data Driven algorithm that will be described in Section 6.2.3. This section summarizes the techniques used for dialog-state tracking and dialog policy optimization to train the dialog manager used in the tests held with Let's Go, as well as in the last test done with Noctívago.
3.2.3.2.1
Dialog-State Tracking
Previous approaches have used handcrafted procedures or simple maximum likelihood estimation to learn the parameters in partition-based models [68, 118, 133]. One of the parameters to be learned is the user action model, which defines the likelihood of the user action given a particular context. This model would be more refined if learned from real data. Taking advantage of the large collections of transcribed data available for Let's Go, a Dynamic Bayesian Network (DBN) was used in order to learn the user action model from that data. Assuming several conditional independencies, the graph for the DBN is shown in Figure 3.4. To perform inference over the graph model, two components need to be specified: the history model, $p(h_t \mid h_{t-1}, u_t, s_t)$, and the observation model $p(o_t \mid u_t)$.
The history model indicates how the dialog history changes and can be set deterministically,
for instance:
$$p(h_t = \text{informed} \mid h_{t-1}, u_t, s_t) = \begin{cases} 1, & \text{if } h_{t-1} = \text{informed or } u_t = \text{inform}(\cdot) \\ 0, & \text{otherwise} \end{cases} \qquad (3.1)$$
The observation model is computed using a deterministic function that measures the degree
of agreement between $u_t$ and $o_t$:
Figure 3.4: DBN graph for the Let’s Go state tracking. From [75].
$$p(o_t \mid u_t) = CS(o_t) \cdot \frac{|o_t \cap u_t|}{|o_t \cup u_t|} + \varepsilon \qquad (3.2)$$
where $CS(o_t)$ is the confidence score provided by Helios for a given observation.
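A minimal sketch of Equation 3.2, assuming the observation and the user action are represented as sets of concept=value pairs:

```python
def observation_model(observation, user_action, confidence_score, epsilon=1e-3):
    """p(o_t | u_t) from Equation 3.2: the Helios confidence score scaled
    by the Jaccard agreement between observed and true concept sets."""
    intersection = len(observation & user_action)
    union = len(observation | user_action) or 1
    return confidence_score * intersection / union + epsilon

o_t = {"departure=rato", "time=now"}     # what the system observed
u_t = {"departure=rato", "arrival=se"}   # a hypothesised true user action
print(observation_model(o_t, u_t, confidence_score=0.8))
```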
Finally, to estimate the user action model, the mean field [100] approach to Bayesian learning was used to avoid overfitting. The method results in the posterior probability over the parameters of the user action model $\phi$:
$$q^*(\phi) = \prod_{i,j} \mathrm{Dir}(\phi_{i,j} \mid \alpha_{i,j}) \qquad (3.3)$$
$$\alpha_{i,j,k} = \alpha^{0}_{i,j,k} + E_{H,U}[n_{i,j,k}] \qquad (3.4)$$
where $n_{i,j,k}$ represents the number of times that $s_t = i$, $h_{t-1} = j$ and $u_t = k$. $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution. $\alpha^{0}$ is the parameter of the uninformed symmetric prior Dirichlet distribution. The junction tree algorithm was used to estimate the quantity $E_{H,U}[n_{i,j,k}]$. Further details about the user action model learning process can be found in [76].
Another new feature of Cornerstone was the use of the explicitly confirmed turns to calibrate
the confidence score. The use of explicitly confirmed entries eliminates the need for annotated
corpora to calibrate the confidence score in the future. The system logs were parsed and the
entries that were explicitly confirmed and appeared in the database query were added to the
dataset to tune the confidence score. To determine the correctness, the observed data was
compared to the explicitly confirmed information (the error rate for explicit confirmation turns in Noctívago is around 15%, half of the value found for the other turns, around 30%). A Gaussian kernel-based density estimation was applied to the two sets of confidence scores collected with respect to correctness. The two densities were scaled by the number of elements to see how the ratio of the correct ones, $d_c(c)$, over the sum of correct and incorrect ones, $d_c(c) + d_{inc}(c)$, varies according to the confidence score. The calibrated confidence score is given as in Equation 3.5. In order to efficiently compute this value for a given confidence score, a sparse Bayesian regression with a Gaussian kernel was used [76].
$$c' = \frac{d_c(c)}{d_c(c) + d_{inc}(c)} \qquad (3.5)$$
3.2.3.2.2
Policy Optimization
The policy optimization was performed using a sparse
Bayesian learning (SBL) approach to rapid reinforcement learning. Lee and Eskenazi [75] chose this method because SBL can generate sparse models by endowing them with sparsity-enforcing priors, without requiring a sparsification threshold. The details of the application of SBL to solve Equation 2.7 can be found in [76].
Although new data is continuously being generated in Let's Go, only the latest data is used to update the model, keeping the learning procedure efficient.
To apply SBL-based reinforcement learning to policy optimization in this task, a simplified
representation of the belief state was adopted as the work mentioned in Section 2.1.3.3 suggested. The state representation for dialog strategy learning was a tuple with the belief of the
top hypothesis and the marginal belief of each concept. Since the actual value of the concept
of a system action can be easily determined, a set of concept level system actions was also
used to improve computational efficiency. The basis function consists of two sub-functions: a
Gaussian kernel to represent the proximity between two belief states and a discrete kernel to relate two system actions. The maximum history size was set to 1000. The reward function defines a -1 reward for each turn, except the last turn of a successful dialog, where a reward of 20 is given. During the learning process the system explores the state-action space randomly (with probability 0.1) or acts according to the partially optimized policy.
Finally, a set of handcrafted rules was implemented to avoid undesirable actions (such as
confirming empty concepts). To train the dialog policy the user simulator described in [77]
was used.
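A minimal sketch of this reward scheme and of the ε-greedy action selection; the state and action representations, and the placeholder policy, are assumptions made only for the example.

```python
import random

def turn_reward(is_final_turn, dialog_successful):
    """-1 per turn, +20 on the final turn of a successful dialog."""
    return 20 if (is_final_turn and dialog_successful) else -1

def select_action(state, policy, actions, epsilon=0.1, rng=random):
    """With probability 0.1 explore a random action, otherwise follow
    the (partially optimized) policy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return policy(state)

# Toy usage with a trivial policy that always asks for the next missing slot.
actions = ["request_origin", "request_time", "explicit_confirm", "give_results"]
policy = lambda state: "request_time" if "origin" in state else "request_origin"
print(select_action({"origin": "rato"}, policy, actions))
print(turn_reward(is_final_turn=True, dialog_successful=True))
```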
3.2.4
Natural Language Generation and Speech Synthesis
Both Let’s Go and Noctı́vago use Rosetta, a template-based natural language generation
module. The template is composed of hand-written entries, which can also be programmatic functions (to deal with missing slots in the template sentence). These are used to convert a semantic frame generated by the dialog manager into a sentence to be read by the synthesizer. We have created new templates for the Noctívago system. These templates, as well as the templates used in Let's Go, were later modified to be used in the entrainment studies. When a system was running with an entrainment policy, the primes were treated as slots to be filled in the available templates. The prime value is computed within the DM and the frame received by the NLG component is parsed to extract the action, the concept values and the prime to be used. Similarly to the other concepts, the prime is extracted and placed in the slot created for it.
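A minimal sketch of treating the prime as one more slot to fill in a template; the frame layout and template strings are illustrative assumptions, not the actual Rosetta templates used in the systems.

```python
# Hypothetical templates with ordinary concept slots and prime slots.
TEMPLATES = {
    "request_next_query": "Deseja fazer uma {start_over_prime} ou saber qual o autocarro {next_bus_prime}?",
    "inform_success": "O autocarro {route} parte de {departure} as {time}.",
}

def realize(frame):
    """Extract the action, concept values and prime from a DM frame and
    fill the matching template; primes are handled just like other slots."""
    action = frame["action"]
    slots = dict(frame.get("concepts", {}))
    slots.update(frame.get("primes", {}))
    return TEMPLATES[action].format(**slots)

frame = {
    "action": "request_next_query",
    "primes": {"start_over_prime": "nova procura", "next_bus_prime": "seguinte"},
}
print(realize(frame))
```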
Both systems use the Kalliope synthesis manager. Let's Go uses a Cepstral Swift engine with a domain-adapted voice developed with the techniques described in [12]. The voice was trained with a corpus of 1702 in-domain utterances and 1131 out-of-domain utterances recorded by the same speaker. The resulting voice quality is very high and the voice is often mistaken for prerecorded utterances.
Noctívago uses the general-purpose Festival-based DIXI synthesizer [101] with SAPI support, which was easily connected to the Kalliope synthesis manager.
3.2.5
Embodied Conversational Agents
The audio input in the Olympus reference architecture can be performed either using a telephone or a microphone connected to the computer where the system is running. The first tests performed with Noctívago (Chapter 4) raised the question of the best way to perform data collection with volunteer users for an experimental dialog system. In those tests a telephone interface was used; however, in most cases the volunteers would have costs associated with the phone call that they needed to make to reach the system. We thought that using a web-based interface would help us to easily recruit new users in future tests.
Embodied Conversational Agents (ECAs) were under development for other SDSs in our group. These ECAs can deal with input and output audio streams, performing the recognition and synthesis tasks. To connect them to the Olympus architecture, a Java class was created that implements the communication between the agent and the remaining modules of the system through the Galaxy HUB. This class implemented the interpretation of the speaking requests that were sent to the TTS engine and the creation of Galaxy frames with the ASR result, which included the text and the confidence score. The possibility of handling context-dependent language models was also implemented. This eliminated the need for the recognition and synthesis engines in the architecture, since all the operations performed by those modules were now performed within the ECA.
The ECA was equipped with a push-to-talk or press-to-talk button, which also eliminated the need for the Audio Server, since Voice Activity Detection was no longer running on-line.
Two different solutions were used for the ECA. Both of them used Audimus and DIXI for recognition and synthesis, respectively, just like the telephone-based version of Noctívago. The first ECA was used in the tests held with the rule-based prime selection that will be described in Section 6.1.2.2. The character used is depicted in Figure 3.5. This solution used Adobe's Flash Player to play the video and capture the audio using a push-to-talk button. The video is fully generated on the server side and then sent to the client. This requires a server with
an Advanced Graphics Processing Unit (AGPU) and, in addition, requires the client to have
a very fast internet connection. The use of Flash is an advantage, since it is also required by
other internet services, and in most cases does not require the user to install a new plug-in.
The face of this first ECA was not generally considered very appealing. In addition, we noticed
Figure 3.5: Flash-based ECA used in Noctı́vago.
some problems in the audio capturing that could hinder the ASR performance. The other
solution available in our group used the Unity 3D gaming platform and was being successfully
used in two other projects (the virtual therapist in ViThea [110] and the museum guide in
FalaComigo [46]). Unity 3D provides libraries for capturing the audio. A press-to-talk button
was used in this case. This platform offers the advantage that it does not require a server
with an AGPU since the video is generated on the client side, which also reduces the need
for a very fast internet connection. The synthetic character also looked more appealing than its predecessor, as can be seen in Figure 3.6. The only drawback was that in most cases the users had to install a plug-in on their computers.
Figure 3.6: Unity 3D-based ECA used in Noctı́vago.
Figures 3.9 and 3.7 show the different architectures used in the Noctı́vago experiments,
whereas Figure 3.8 shows the architecture used in the Let’s Go test held for this thesis.
The modules that were replaced from the original architecture have a blue contour, whereas
the modules that were modified are highlighted with a green contour. The new modules that
were integrated in the architecture are highlighted in red.
Figure 3.7: Olympus Architecture used in the first Noctı́vago tests (Section 4.3) with telephone
interface.
Figure 3.8: Olympus Architecture used in Let’s Go tests (Section 6.1.2.3).
3.3
Summary
In this chapter we have described the basic architecture of an SDS and its components, together with their main tasks. Two different frameworks were considered for use in this thesis, DIGA and Olympus. The reasons why we chose Olympus were presented.
The components used in the systems developed for this thesis, both of which were Olympus-based, were described in more detail, highlighting the modifications that were made relative to the original architecture.
(a) Olympus Architecture used in the Noctı́vago tests with
the Flash-based ECA (Section 6.1.2.2).
(b) Olympus Architecture used in the Noctı́vago tests with
the Unity-based ECA (Section 6.2.3).
Figure 3.9: Olympus Architectures used in Noctı́vago with ECA.
4
Towards Better Prime
Choices
4.1
Introduction
This chapter is devoted to our first approach to building an automated lexical entrainment algorithm for SDSs. We have used the Noctívago system to do so. Expert users tested a preliminary version of the system under tightly controlled conditions, using a landline and a very limited set of predefined scenarios. The users had to ask for specific bus schedules. In these early trials the system received 56 calls, which correspond to a total of 742 turns. These tests allowed us to make the system more robust and to evaluate which concepts could be included in the system prompts to make the system entrain to the users and vice-versa.
There were 143 different words in the system prompts, but according to Ward and Litman
[143] only words with synonyms can be potential primes. This means that even bus stops
and time expressions could be prime candidates. However, if they were selected as prime
candidates, several modifications would be needed in many system modules. For that reason,
the set of prime candidates was restricted to a list of concepts that did not require major
modifications. The set of prime candidates is a very important resource that will be used in
all the tests described in this thesis. It is a list of primes, one list for each concept, that the
system will use in the system prompts. Our goal with this study was to identify the prime candidates, find new synonyms for them that could be used in the system prompts, and evaluate the impact this had on the system performance. We hoped that this would give us clues to the automation of the prime selection process.
4.2
Creating a list of prime candidates
The criterion defined above led us to identify the prime candidates that are presented in Table
4.1. The table also shows the Prime Error Rate (PER) for each prime in the preliminary
version tests, computed at the prime level and not the word level, since some primes consist
of more than one word.
Concept       Prime            PER (%)   Frequency
Next Bus      próximo          19.0      21
Start Over    outro percurso   50.0      6
Start Over    nova pesquisa    36.4      11
Now           agora            60.6      33

Table 4.1: Prime analysis in pilot experiments.
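A minimal sketch of how a prime-level error rate can be computed from reference/hypothesis pairs; the substring-based matching is a simplification of the actual annotation procedure and the example pairs are invented.

```python
def prime_error_rate(prime, ref_hyp_pairs):
    """PER = fraction of reference turns containing the prime for which
    the recognizer hypothesis does not contain the prime."""
    occurrences = [(ref, hyp) for ref, hyp in ref_hyp_pairs if prime in ref]
    if not occurrences:
        return 0.0
    errors = sum(1 for _, hyp in occurrences if prime not in hyp)
    return 100.0 * errors / len(occurrences)

pairs = [
    ("quero o proximo autocarro", "quero o proximo autocarro"),
    ("o proximo por favor", "o rato por favor"),   # misrecognized turn
    ("nova pesquisa", "nova pesquisa"),
]
print(prime_error_rate("proximo", pairs))  # 50.0
```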
It was very interesting to notice that some users entrained to the erroneous pronunciation of the word pesquisa spoken by the synthesizer. In fact, the units chosen to concatenate the target pronunciation /p@Skiz6/ made it sound like /piSkiz6/, and some users entrained to that pronunciation, with negative effects on the system performance.
The next step in this study was to find synonyms for the concepts in Table 4.1. The new set of primes is shown in Table 4.2.
Concept       Old prime         New prime
Now           agora             imediatamente
                                neste momento
                                o mais brevemente possível
                                o mais rápido possível
Start Over    nova pesquisa     procurar novamente
              outro percurso    nova procura
                                outra procura
                                nova busca
Next bus      próximo           seguinte

Table 4.2: List of primes used in this study.
4.3
Experimental Set Up
During two weeks, a panel of 64 volunteers participated in the experiments. An e-mail was sent to the potential subjects with a description of the task, the phone number that could be used to reach the system and instructions about the pictorial items that could appear in the scenarios. In this e-mail they were also informed that they should do one test in each week. There were three bus schedule scenarios to be completed. Once a scenario was completed they could move to the next one. Figure 4.1 shows one of the scenarios used. The other scenarios can be found in Annex C.1. They were very pictorial in order to avoid biasing users towards a word written in a scenario. At the end of the scenarios the subjects had to fill in a small questionnaire where they were supposed to indicate whether they had received the correct information in each scenario and comment on the system performance. They should also provide information about the device they used to complete the test. During the second week, users who had already called the system in the first week were asked to compare the performance of the system in both weeks. Both questionnaires can be found in Annex C.2.1. The e-mail was sent both to fellows from the laboratory, who were supposed to have had previous contact with SDSs, and to people who did not have any previous contact with SDSs.
Figure 4.1: Example of scenario used.
The set of old primes was used during the first week, whereas the set of new primes was used
in the second week as detailed in Table 4.2. The change in the prime set was made without
informing the users.
4.4
Results
Unfortunately, since the tests were performed with volunteer users, only a few of them completed the task as intended. During the first week, the system received 45 calls, corresponding to 131 valid queries, since each call could contain 3 or more queries. 35 of these queries were performed by laboratory fellows, 51 were completed by female subjects and 6 by non-native subjects. In the second week, the number of callers decreased to 36, corresponding to 125 valid queries, 40 of them by laboratory fellows. 37 queries were made by female subjects and 6 queries by non-native subjects. 19 people called the system in both weeks, 26 called the system only in the first week, and 17 only in the second week. Many of the participants that took part in this study used an SDS for the first time in their lives.
In this evaluation, a session was considered successful if the system provided the user with
the information she/he had asked for. All the sessions were carefully analyzed at the utterance level, describing the type of error found in the incorrectly recognized utterances, and
transcribing the content of each utterance.
Table 4.3 shows the number of calls for each week and the real success rate (if the information
given matches the user request), according to the device used to interact with the system.
From the table it is not possible to establish a relation between the device used and the system performance, despite the fact that the acoustic models were trained mostly with landline telephone speech and broadcast news speech downsampled to the telephone speech sampling rate.
Regarding the system performance, the table shows that the real success rate increased from
the first to the second week although the Fisher’s Square test on session success showed that
this difference is not statistically significant. However, the number of calls in the first week
using cellphones is not enough to draw any conclusion about cellphone speech degrading the
global system performance.
Device      # Calls W1   Real Success Rate W1 (%)   # Calls W2   Real Success Rate W2 (%)
Landline    86           48.8                       52           57.6
Cellphone   12           66.7                       37           54.1
VoIP        24           54.1                       27           59.3
Unknown     9            11.1                       9            33.3
TOTAL       131          48.5                       125          54.8

Table 4.3: Distribution of calls and real success rate according to the device used.
Table 4.4 shows the evaluation measures used as standards in the Spoken Dialog Challenge
[14]. The challenge created these measures to compare the participating systems in the bus
schedule information task.
                     Week 1   Week 2
No output            31.3%    19.2%
Any output           68.7%    80.8%
Acceptable output    71.1%    68.3%
Incorrect output     28.9%    31.7%

Table 4.4: Success rate of the system in each week.
The standard defines two possible types of calls: ’No output’ and ’Any output’. ’No output’
means that the dialog did not provide any information to the user. ’Any output’ means that
it provided some sort of information to the user. The calls classified with ’Any output’ are
further divided into ’Acceptable output’ calls and ’Incorrect output’ calls. ’Acceptable output’
means that the correct information was provided to the user. ’Incorrect output’ means that
the system did not provide the correct information.
Table 4.4 shows an improvement from week 1 to week 2 in the percentage of calls that gave
any sort of output to the user. Although the percentage of successful dialogs increased in the
second week, the percentage of acceptable outputs shows a small decrease. However, since the
percentage of ’Any Output’ during the second week was also higher, the absolute number of
successful dialogs was also higher in the second week. This could be explained by the fact that in one of the scenarios the users had to ask for a bus for a specific weekday and departure time (e.g.: Saturday at 1:15 am). This scenario was completed 36 times in the first week, and 45 times in the second week. Users made a long pause between uttering the weekday and the time of day every time they were completing this scenario. The ASR module would output the day of the week immediately, and the language understanding module would bind the day of the week to the travel time concept, ignoring the departure time. Later, the default minimal inter-word pause value was increased to solve this problem.
In our detailed analysis, a few error sources were identified: VAD, Hung Up, Loss of Synchronism and Date-Time parsing. VAD errors resulted in incomplete utterances arriving at the speech recognizer due to VAD mis-detections. Hung Up (HU) errors were identified because many of the recognition outputs at the beginning of some sessions were very similar in the words recognized and in length. Later we found that in those sessions the telephone channel was left open for a few seconds after the user abandoned the call, and consequently all the audio was captured and held in the recognition queue until the next call came up. Sometimes the Galaxy frames generated by the ASR with the recognition output arrived a few milliseconds late at the HUB, causing a Loss of Synchronism, since the processed text output did not correspond to the system request at that time, but to the prior system request. Some errors also occurred during Date-Time Parsing.
The percentage of correctly recognized turns and the percentage of wrongly recognized turns are presented in Table 4.5. The percentage of turns that were not correctly recognized is divided
according to the source of the error.
                        Week 1 (%)   Week 2 (%)
Correct                 32.3         41.1
Recognition Errors      52.8         44.2
VAD                     4.0          4.0
Hung Up                 3.8          6.7
Loss of Synchronism     6.3          6.4
Date-Time Parsing       5.6          4.5

Table 4.5: Analysis of errors at the turn level.
The most important element to retain here is that the percentage of correctly recognized turns
was 10% higher in the second week. We believe that the new set of primes was crucial to this
improvement. The other sources of errors were not eliminated since they do not depend on
the lexical choice.
Another important evaluation measure in this experiment is the Word Error Rate (WER).
The information available in the system logs was compared with the reference to compute the
WER for the live tests. Later the off-line version of Audimus was applied to each session with
the same language and acoustic models used in the on-line system. This was done in order to
find how much the acoustic and language models contributed to the WER. Hence, the speech
boundaries for each utterance were given a priori. Results in Table 4.6 show WER, deletions
and insertions for both weeks.
                 on-line              off-line
                 Week 1    Week 2     Week 1    Week 2
WER              58.3%     52.3%      30.0%     30.6%
Insertions       99        177        137       192
Deletions        876       638        15        131
Substitutions    553       528        520       463

Table 4.6: WER for the different weeks.
These results again confirmed that the system performed better in the second week, at least in the on-line test. A 6% decrease in WER, without changing anything but the prompts, is quite remarkable. Some of the errors were due to the lack of real data to train the language model. The language model was the context-independent language model trained with only 30k sentences.
The difference found in the off-line recognition test between the WER in the two weeks was
very small. Later it was found that the device used to call the system, which had no major influence on the dialog success rate (as shown in Table 4.3), had an impact on the off-line WER.
The acoustic models used were trained mainly with landline telephone speech, down sampled
broadcast news speech and very few examples of cellphone speech. When compared with the
WER for landline telephone users, the WER for cellphone speech was about 8% higher, and
for VoIP speech 14-17% higher. The number of sessions completed with landline telephone
speech decreased from 86 to 57 in the second week. This can help explain why the off-line
WER did not decrease in the second week.
In order to prevent the impact of adverse acoustic conditions, the subsequent Noctívago tests were held with a different architecture that no longer uses the Audio Server VAD, and users interacted with the system via a web browser and a computer microphone. This solved the VAD, Hung Up and Loss of Synchronism problems reported in Table 4.5 and allowed us to focus only on the study of lexical entrainment.
4.5
Prime usage
The last section analyzed the system performance as a whole. In this section, the performance
of each prime will be analyzed. This will be particularly important to confirm if users have
prime preferences and also to find future directions to automate the prime choice. Table 4.7
presents the frequency (absolute and relative) of the primes in the data in both weeks and
the PER for each prime in both weeks.
Concept      Prime                           freq. W1 (% W1)   PER W1 (%)   freq. W2 (% W2)   PER W2 (%)   POS
now          W1: agora                       64 (100.0)        56.3         77 (59.7)         89.6         ADV
             W2: imediatamente               0                 –            26 (18.1)         61.5         ADV
             W2: neste momento               0                 –            15 (10.4)         20.0         PRO/ART + N
             W2: mais rápido possível        0                 –            5 (3.5)           100.0        ADV + ADJ + ADJ
             W2: mais brevemente possível    0                 –            6 (4.2)           16.6         ADV + ADV + ADJ
             W2: new primes together         0                 –            52 (40.3)         40.4         –
Start Over   W1: nova pesquisa               35 (85.4)         51.4         1 (3.1)           0.0          ADJ + N
             W1: outro percurso              6 (14.6)          16.7         0                 –            ART + N
             W2: procurar novamente          0                 –            15 (46.9)         0.0          V + ADV
             W2: nova procura                0                 –            8 (25.0)          25.0         ADJ + N
             W2: outra procura               0                 –            5 (15.6)          20.0         ADJ + N
             W2: nova busca                  0                 –            3 (9.4)           0.0          ADJ + N
             W2: new primes together         0                 –            31 (96.9)         9.7          –
next bus     W1: próximo                     62 (83.8)         35.5         3 (4.8)           100.0        N or ADJ
             W2: seguinte                    12 (16.2)         100.0        59 (95.2)         45.8         N or ADJ

Table 4.7: Use of the primes and error rate in both weeks.
The results presented in Table 4.7 show that the callers entrained to the system, incorporating
the new primes in their speech during the second week of trials. The only word that appears
with similar frequency in both weeks is agora. This is a very frequent word in Portuguese
and in addition the word was explicitly written in one of the scenarios, thus biasing callers
towards that word.
Words like agora (especially in week 2) and seguinte have error rates higher than the other primes. One possible source is the corpus generated to train the language model used in these tests. The algorithm used to generate the corpus tried to incorporate as much variety as possible for each concept described in the grammar whenever no observation probability was given a priori. Thus, the more possibilities a concept has, the less frequently each entry appears in the corpus, since the algorithm attributes the frequencies equally whenever they are not specified. This problem was later resolved by setting the probability of each entry according to the frequency observed in older data.
The phonetic distance seems to be a good metric to choose among prompts with the same meaning when considering only the performance of the ASR module. For instance, when the prime mais rápido possível was used, where the word rápido /Rapidu/ sounds very similar to Rato /Ratu/, a bus stop covered in our system, the ASR module always chose the bus stop name and never the word rápido. The same happened with seguinte /s@gi˜t@/ for the next bus prime. The latter was recognized several times as sim /si˜/ (yes) or sair /s6jr/ (quit).
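A minimal sketch of using a phone-level edit distance to screen prime candidates against acoustically similar in-vocabulary words; the phone strings follow the notation used above, and the distance threshold is an assumption.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance over phone symbols."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (pa != pb)))
        prev = curr
    return prev[-1]

def risky_candidates(candidate_phones, vocabulary_phones, max_distance=2):
    """Return vocabulary words whose pronunciation is within max_distance
    phones of the candidate prime, i.e. likely ASR confusions."""
    return [w for w, phones in vocabulary_phones.items()
            if edit_distance(candidate_phones, phones) <= max_distance]

# 'rapido' /Rapidu/ vs the bus stop 'Rato' /Ratu/, as in the confusion above.
vocabulary = {"Rato": list("Ratu"), "sim": list("si~"), "sair": list("s6jr")}
print(risky_candidates(list("Rapidu"), vocabulary, max_distance=3))  # ['Rato']
```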
The last column of Table 4.7 contains the part-of-speech (POS) tags of each of the primes. No evident connection can be observed between POS and the entrainment that users showed to the prime.
In order to better understand why users are more likely to adopt some primes than others,
our data was analyzed to find which primes were used immediately after being used in the
system prompts. This expectation comes from previous work [144, 99], where it was observed that primes are more likely to be used immediately after the system uses them. In addition, understanding why users adopt one prime more than another also motivated this analysis.
For every user turn where a prime was recognized, the previous system prompt was taken
into account. Four different features were computed to analyze the users’ entrainment to
the proposed primes. When a prime was used after being invoked in the system’s previous
utterance, this will be called an Uptake. The correspondent column on Table 4.9 shows the
number of times that that event occurred. This means that the users followed the conceptual
pact proposed by the system. The No Uptake column corresponds to number of times that
a different prime from the one used by the system was used. In this case, the user decided
not to use the term proposed by the system. The number between parentheses represents
the number of times that the prime was already used before in the same session. The '(%) of Uptakes' column stands for the percentage of user utterances that adopted that prime when it was present in the previous system prompt. Finally, the '# prompts' column shows the total number of system prompts in the corpus where that prime appears. Table 4.8 gives an example of how Uptakes, No Uptakes and '% of usage' were computed from our data.

Speaker   utterance                                                              Uptake   No Uptake   (%) usage
System    Pode pedir informações sobre outro percurso, saber qual o                -        -           -
          próximo autocarro ou o autocarro anterior.
User      Próximo.                                                                 X        ×           -
System    Podia repetir se deseja saber o horário do autocarro seguinte,           -        -           -
          do autocarro anterior ou se deseja fazer uma nova pesquisa?
User      Próximo autocarro.                                                       ×        X           -
System    Deseja fazer uma nova pesquisa, saber qual o próximo                     -        -           -
          autocarro ou o autocarro anterior.
User      Autocarro seguinte.                                                      ×        ×           1/2
Total     -                                                                        1        1           50 %

Table 4.8: Example of uptake statistics taken from an interaction, for the prime próximo.
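The following sketch illustrates, under simplifying assumptions, how Uptake, No Uptake and '% of usage' counts of this kind could be extracted from logged (system prompt, user turn) pairs. Prime detection by substring matching and the example turns are illustrative stand-ins for the actual log-processing pipeline.

from collections import Counter

PRIMES = {"próximo", "seguinte", "nova pesquisa"}

def entrainment_stats(turn_pairs):
    """turn_pairs: list of (system_prompt, user_utterance) strings, in dialog order.

    For each prime, count Uptakes (prime used right after appearing in the
    previous system prompt), No Uptakes (prime used although a different prime
    was in the prompt) and the number of prompts containing the prime."""
    uptake, no_uptake, prompts = Counter(), Counter(), Counter()
    for system, user in turn_pairs:
        system, user = system.lower(), user.lower()
        proposed = {p for p in PRIMES if p in system}
        used = {p for p in PRIMES if p in user}
        for p in proposed:
            prompts[p] += 1
        for p in used:
            if p in proposed:
                uptake[p] += 1
            elif proposed:
                no_uptake[p] += 1
    usage = {p: uptake[p] / prompts[p] for p in prompts if prompts[p]}
    return uptake, no_uptake, usage

pairs = [("Saber qual o próximo autocarro?", "Próximo."),
         ("Deseja o autocarro seguinte?", "Próximo autocarro.")]
print(entrainment_stats(pairs))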
The results are presented in Table 4.9. In most cases, the number of uses in system prompts is much higher than the number of uses in user turns. There are many cases where
the prime was used in the system prompt, but the user intention does not correspond to any
of the primable concepts. For instance, the prompts for the now concept are used in the
context of asking for the time the user wants to travel. One can answer with a time-relative
expression such as now or a specific hour. Again, the fact that the word agora was explicitly
written in one of the scenarios is reflected in the results, since the callers did not use it from
the previous system prompt in most cases.
During the first week, the differences in the (%) of usage found for the start over show a
clear preference for nova pesquisa in that week. A similar conclusion for the next bus prime
is not obvious, since the exposure to the word próximo was nearly five times the exposure to
the word seguinte.

                             Week 1                                        Week 2
Primes                       Uptake   No Uptake   (%) of     # prompts    Uptake   No Uptake   (%) of     # prompts
                                                  Uptakes                                      Uptakes
agora                        -        64          -          0            21       56(6)       63.3       60
imediatamente                -        -           -          -            14       12(3)       41.2       67
neste momento                -        -           -          -            11       4(1)        45.8       65
mais rápido possível         -        -           -          -            5        1(1)        33.3       30
mais brevemente possível     -        -           -          -            6        0           60.0       21
nova pesquisa                34       1           86.2       167          -        1           0          -
outro percurso               3        3(3)        42.9       106          -        0           -          -
procurar novamente           -        -           -          -            15       0           94.4       86
nova procura                 -        -           -          -            6        2           83.3       51
outra procura                -        -           -          -            4        0           80.0       54
nova busca                   -        -           -          -            3        0           33.3       54
próximo                      51       11(2)       100.0      198          2        1           100.0      38
seguinte                     8        3           53.8       39           57       2(1)        96.5       105

Table 4.9: Analysis of the uptake of the primes.
Moving to the second week, the variety of primes for each concept increases, and the users have more options. The primes were chosen randomly, although some primes are included in more than one prompt, which explains why some primes were used more than others. The results for now are inconclusive, although it is interesting to observe that the word imediatamente is used more than the others when there is a No Uptake. Unlike the now concept, the primes used for start over are taken up in the majority of the answers. The only prime that was used without being taken up from the system was nova procura. Finally, the users quickly adopted seguinte, the prime the system used for next bus in the second week.
4.6 Discussion

The results have shown that, in general, all the performance measures improved from the first to the second week and, according to Table 4.5, this was possible due to the increased number of correctly recognized turns, as the other sources of problems identified remained. This means that prime selection can improve the system performance by using primes that are less prone to incorrect recognition.
Despite the improvements achieved in the second week, the WER is still very high (Table
4.6). In future tests, instead of using a context-independent language model, we should use
context dependent language models trained with more data.
The majority of the primes used in the second week improved the speech recognition. However,
a prime should also be chosen taking into account the user preference. The combination of
the results in Tables 4.8 and 4.9 can provide information to set a tradeoff between the speech
recognition performance and the reaction of the users to the proposed primes.
In the case of now, agora was more used than other primes in both weeks. Despite being written in one of the scenarios, it is a very frequent word and therefore it should be chosen instead
of the new primes. The effort should be in improving the speech recognition performance for
this prime by using a better corpus to create the language model. Imediatamente, due to its usage in the No Uptake condition, should also be included in the set of primes for now. Mais brevemente possível has a very good PER; however, it is not very frequent in daily language.
For start over, Table 4.9 shows that the users made good use of nova pesquisa during the first week, despite the synthesis problem reported in Section 4.1. Among the options for primes available in the second week, procurar novamente, which has a very good PER, and nova procura, which was twice used without uptake and also has a low PER, can be considered good primes. The primes tested for next bus were very frequent in the weeks when they were more often incorporated in the prompts. However, the PER is lower for próximo. In addition, the usage of próximo without uptake was much higher in the first week compared to the use of seguinte in the same conditions in the second week. This indicates that próximo is a good prime for next bus. However, taking into account that the comments in the users' questionnaires valued the variety of prompts, it might be a good option to maintain both primes.
This analysis suggests that the use of a prime in the No Uptake condition and the user prime preference might be correlated. This is confirmed by the Pearson correlation value found between the number of No Uptakes and the number of hits for each prime in a web search engine, 0.61, whereas the correlation with Uptakes is only 0.14. The same type of correlation was found with the number of hits of each prime in a frequency corpus for spoken European Portuguese, the Português Fundamental corpus [6]. In this case the correlation for Uptakes was −0.23 and 0.99 for No Uptakes. This follows the intuition that those are the words that the users recall immediately when the system requests a given concept.
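A minimal sketch of this kind of check, using illustrative counts rather than the thesis data, could rely on scipy's Pearson correlation:

from scipy.stats import pearsonr

# Illustrative values only: No Uptake counts per prime and the corresponding
# frequencies in a reference corpus (or web hit counts); not the thesis data.
no_uptakes = [64, 12, 4, 1, 0]
corpus_freq = [15200, 3100, 900, 150, 40]

r, p = pearsonr(no_uptakes, corpus_freq)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")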
4.6.1 User's Feedback

The questionnaire at the end of the scenarios helped to understand how users perceived the changes in the system. It was also helpful for collecting their feedback in order to improve the system performance. During the first week, they were asked if they received the information correctly in each scenario, the type of device used, their country of origin and a general comment on the system. During the second week, they were also asked if the system understood them better, if they felt any evolution from the first to the second week and, if possible, to describe that evolution. 18 of the users who called the system in both weeks answered the questionnaire comparing both systems. 10 of them felt better understood by the system. 15 of them noticed some sort of evolution in the system. 3 users commented positively on the wider variety of prompts available in the second week. 5 users also felt that the system was answering faster than before, although it was not.
The users’ feedback emphasizes that the two modules that they interact directly with, the
ASR and the TTS, are of major importance in the users’ perception of the system. The new
prompts proved to be very effective, reducing the number of misunderstandings and removing
some incorrectly synthesized words.
4.7 Summary

In this Chapter we have presented our first approach to the use of Lexical Entrainment in an SDS. We started by identifying a set of prime candidates and finding synonyms which could replace them in the system prompts. Then we tested the system using the different sets of primes. The first set used was the same as in the preliminary version, and the second set is the one with the synonyms.
The results presented in this Chapter confirm the idea that lexical entrainment in human-computer dialogs occurs immediately after the system introduces a new prime. In addition, the system performance improved with the use of different primes, showing once more that prime choice can influence the system performance.
An a posteriori analysis of the results was made in order to understand which features could describe a good prime. The purpose is to automate the selection of the
best primes during the interaction. In order to do this, an algorithm that combines the system
prime preference and the user prime preference must be used. Events identified during the
analysis such as Uptakes and No Uptakes can be considered relevant to find the user prime
preference. The system prime preference must rely on events that could be detected during
the interaction such as non-understandings. The confidence measure computed by the system
during the interaction should also be used as an indicator in this process.
To summarize, these are the conclusions that should be taken into account when choosing
primes to be used in SDSs:
• Very frequent primes should have their ASR performance improved;
• Words that are not very frequent should be removed from the prime set;
• Multiple primes should be kept in the prime set to allow variety;
• No Uptakes seem to be events that could indicate user prime preference.
The next chapter will discuss the features that can be used to refine the computation of the confidence measure in order to choose better primes.
5 Refining confidence measure to improve prime selection
5.1 Introduction

An accurate confidence score is one of the keys to improving the robustness of SDSs. Previous approaches to this problem were introduced in Section 2.1.2. The confidence model used in the tests described in Section 4.3 was trained with data from a different system with a different domain. However, the changes made to the original architecture to adapt it to European Portuguese introduced modifications that influence which features will give the best confidence score computed by Helios. In this chapter, strategies to refine these confidence measures according to the new modules used will be presented.
5.2 Training a confidence annotator using logistic regression

The confidence model used so far, trained with RoomLine [18] data, computed the confidence score according to the following logistic regression:

Confidence = \frac{e^{1.69 - 5.55 \cdot \mathrm{Ratio\ of\ Uncovered\ Words}}}{1 + e^{1.69 - 5.55 \cdot \mathrm{Ratio\ of\ Uncovered\ Words}}}    (5.1)

where the Ratio of Uncovered Words is obtained during the interaction and corresponds to the ratio between the number of words from the recognized input that were not parsable and the total number of words in the input (a small sketch of this baseline computation is given after the feature list below). However, as described in detail in [16], there are other features available when training the model that could yield a better model for the architecture used in Noctívago. Here are some examples from different categories (from [20]):
• speech recognition features, e.g. acoustic and language model scores; number of
words and frames; turn-level confidence scores generated by the recognizer; signal and
noise-levels; speech rate; etc.
• prosody features, e.g. various pitch characteristics such as mean, max, min, standard
deviation, min and max slopes, etc.
• lexical features, e. g. presence or absence of the top-10 words most correlated with
misunderstandings (system-specific).
• language understanding features, e.g. number of new or repeated semantic slots in
the parse; measures of parse-fragmentation; etc.
• dialog management, e.g. match-score between the recognition result and the dialog
manager expectation; dialog state; etc.
• dialog history, e.g. number of previous consecutive non-understandings; ratio of nonunderstandings up to the current point in the dialog; etc.
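As announced above, the baseline computation of Equation 5.1 can be sketched in a few lines of Python; the coefficients are those of the RoomLine-trained model, while the function name is ours:

import math

def baseline_confidence(ratio_uncovered_words):
    """Baseline annotator of Equation 5.1: a logistic regression on a single
    feature, the ratio of recognized words that the parser could not cover."""
    z = 1.69 - 5.55 * ratio_uncovered_words
    return math.exp(z) / (1.0 + math.exp(z))

# A fully parsed hypothesis gets a high score; a half-uncovered one a low score.
print(baseline_confidence(0.0))   # ~0.84
print(baseline_confidence(0.5))   # ~0.25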
Since the speech recognition engine was replaced and the remaining modules underwent modifications, there is a chance that some of the features computed can be more helpful in predicting the confidence score than the features used in the model so far. It is important at this point to clarify that we will use the same terminology used by Bohus in [16] to distinguish between misunderstandings and non-understandings, since the confidence annotator treats them differently. A misunderstanding is considered when the recognized input is parsable but could not be matched with any of the expected concepts. A non-understanding is considered when the recognized input was not parsable. Unlike misunderstandings, in the case of a non-understanding some language understanding and dialog management features would not be given a value.
Our first attempt to train a new model tried to maximize the number of turns used in the training procedure, including non-understandings. Thus, only the features available in all of the 1592 user turns collected in the tests described in Section 4.4 were used, which significantly reduced the number of available features. Hence, the features grouped by category were:
• speech recognition features: Word-level confidence score; if word-level confidence
was greater than 0.5.
• language understanding features: Fragment ratio; if fragment ratio was above
the mean; ratio of the number of parses; number of uncovered words; if number of
uncovered words is greater than 0; if the number of uncovered words was greater than
1; the normalized number of uncovered words; the ratio of uncovered words and if the
ratio of uncovered words was above the mean value.
• dialog history: If last turn was a non-understanding.
The model used these features to predict whether the turn was correctly understood. The turns were labeled as correct or incorrect depending on whether the parsed transcription corresponded to the parsed input. Since this is a binary problem, logistic regression can be used. Among the algorithms available to perform logistic regression we chose three: maximum entropy (using MegaM [40]), prior-weighted logistic regression (using FoCal [31], with a prior of 0.5) and the stepwise logistic regression available in MATLAB's statistical toolbox. The available turns were divided into 1144 for training and 448 for testing. The weights of the features according to the algorithm used are shown in Table 5.1.
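A rough sketch of this training setup is given below. scikit-learn's LogisticRegression is used here only as a stand-in for MegaM, FoCal and the MATLAB stepwise routine, and the random matrices merely mimic the 1144/448 train/test split; the real features are those listed above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score

# X: one row per user turn with the features listed above (word-level confidence,
# fragment ratio, uncovered-word counts, ...); y: 1 if the turn was correctly
# understood (parsed ASR output equals parsed transcription), 0 otherwise.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1144, 11)), rng.integers(0, 2, 1144)
X_test, y_test = rng.random((448, 11)), rng.integers(0, 2, 448)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Sweep the rejection threshold, as done when comparing against the baseline.
for threshold in (0.2, 0.4, 0.6):
    accepted = model.predict_proba(X_test)[:, 1] >= threshold
    print(threshold,
          accuracy_score(y_test, accepted),
          precision_score(y_test, accepted, zero_division=0),
          recall_score(y_test, accepted, zero_division=0))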
Among the features used, the word-level confidence is the most heavily weighted feature. This confirms the accuracy of the confidence score provided by the new European Portuguese speech recognizer that we have integrated. Other parsing features associated with the number of uncovered words are also important to the model.
The new models' performance can be compared with that of the baseline model given in Equation 5.1 by varying the confidence threshold, in order to find the optimal trade-off between the model used and the confidence threshold set in the system. The threshold in the live system was fixed at 0.2.
Feature                                   MegaM    FoCal    Stepwise
Turn-Level Confidence                      2.95     4.09     0.71
Turn-Level Confidence > 0.5               -0.30    -0.97    -0.17
Fragment Ratio                            -0.11    -0.36     0.00
Fragment Ratio > mean                      0.38     0.57     0.00
Ratio of the number of parses             -0.62    -0.98     0.00
Number of uncovered words                 -0.88    -1.80    -0.12
Number of uncovered words > 0             -0.38    -0.38    -0.31
Number of uncovered words > 1             -0.71    -0.03     0.00
Normalized number of uncovered words       0.97     2.13     0.00
Ratio of uncovered words                  -0.62    -1.34    -0.20
Ratio of uncovered words > mean           -0.38    -0.39     0.00
Non-understanding in last turn            -0.57    -0.60    -0.10

Table 5.1: Weights for the features used in the confidence annotator training algorithm.
The best performance in terms of Accuracy (Figure 5.1a) is achieved by the maximum entropy
method, using 0.6 as the rejection threshold. In Figure 5.1b, Precision gives an idea of how
the system deals with mis-recognitions. In a dialog system, this is a very important measure
of performance, as mis-recognitions usually have a very high cost, since they often result
in dialog errors. In terms of Precision, the FoCal toolkit consistently achieved the best performance across the evaluated thresholds. It is also important to avoid false rejections, although they are not as costly as mis-recognitions. Recall, shown in Figure 5.1c, shows
how the different methods deal with rejections. In terms of Recall, the stepwise method and
the baseline performed better than the two other methods presented.
These new models outperformed the baseline confidence model. The models trained with MegaM and FoCal seem to be valuable alternatives to the baseline model for use in a live system. However, many other features were left out, since we were only considering the features available in every turn. Since the feature value will not influence the way the system deals with non-understandings (typically the system will repeat the previous request), the next section will try to use all the features available, leaving out the non-understood turns.
Figure 5.1: Accuracy, Precision and Recall for the tested methods (FoCal, Helios baseline, MegaM and stepwise), plotted against the confidence threshold: (a) Accuracy; (b) Precision; (c) Recall.
5.3 Training a confidence annotator with skewed data
Not considering the non-understood turns reduced the training corpus to 825 turns, divided
into 623 for training and 202 for test. The reduction brought another problem, as only 10%
of the turns corresponded to misunderstood turns, which means that the training is highly
biased. Our first attempt did not use any particular strategy to deal with this fact. The
procedure adopted was to train a model with a stepwise logistic regression using the same
procedure as in [16]. In each step, the next most informative feature was added to the model,
as long as the average data likelihood on the training set improved by a statistically significant
margin (p-value < 0.05). After each step, the features in the model were reassessed, and any feature with a p-value larger than 0.3 was eliminated from the model. To avoid
over-fitting, two different stopping criteria were used, Bayesian Information Criterion (BIC)
and log-likelihood increase with cross-validation (LLIK). Table 5.2 shows the most relevant
features found at this time with their weights, according to the different stopping criteria.
Stop Criteria            BIC        LLIK
last level touched       -1.5315    -1.3706
slots blocked            -          -1.8173
new slots num (bool)     -          -1.1683

Table 5.2: Confidence annotation models trained with stepwise logistic regression.
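The stepwise procedure sketched below is a simplified, hypothetical rendition of this forward feature selection: cross-validated log-likelihood with a minimum-gain stop stands in for the significance test and the BIC/LLIK criteria actually used.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, feature_names, min_gain=0.002):
    """Greedy forward selection: at each step add the feature that most improves
    cross-validated log-likelihood; stop when the improvement falls below a
    small margin (a stand-in for the thesis' significance / BIC criteria)."""
    selected, best = [], -np.inf
    while True:
        gains = []
        for i in range(X.shape[1]):
            if i in selected:
                continue
            cols = selected + [i]
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, cols], y,
                                    scoring="neg_log_loss", cv=5).mean()
            gains.append((score, i))
        if not gains:
            break
        score, i = max(gains)
        if score - best < min_gain:
            break
        best, selected = score, selected + [i]
    return [feature_names[i] for i in selected]

rng = np.random.default_rng(1)
X, y = rng.random((623, 8)), rng.integers(0, 2, 623)   # toy stand-in for the 623 training turns
print(forward_selection(X, y, [f"f{i}" for i in range(8)]))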
None of these features corresponds to the features found for the first models trained with Noctívago data. These are dialog manager features that were not used to train the first models. Figure 5.2 compares the performance of these models against the baseline model. Regardless of the stopping criterion, the trained models clearly outperformed the baseline model.

Figure 5.2: Performance (Accuracy, Precision and Recall as a function of the confidence threshold) of the stepwise logistic regression models (BIC and LLIK stopping criteria) compared with the Helios baseline model.

The number of features used in both models is very small. In fact, the stopping criteria were rapidly met, which can explain the low number of features selected. One possible reason for the fast convergence of the methods could be the skewed training corpus. To overcome this limitation, three different strategies were attempted:
• Use of the non-understood turns, filling the missing data points with the average value
for that feature, if the feature was missing in less than 20% of the training corpus
(NONU);
• Randomly replicating the minority dataset until the dataset is balanced between positive
and negative examples (BAL);
• Feature weighting with an additional cost for false positives (FoCal).
The cost for false positives can be modified by changing the prior used in the weighted logistic
regression using FoCal [31]. The cost for false negatives was set to 1, while several costs were
evaluated for false positives (1, 10, 100 and 1000).
The results were computed for stepwise logistic regressions using both BIC and LLIK as stopping criteria. The BAL and NONU datasets were used separately and combined. The combination of both strategies is possible by first populating the empty feature values of non-understood turns with mean values and then replicating the minority part of the corpus to achieve a balanced corpus.
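The NONU and BAL strategies can be sketched as follows; the arrays are toy data and the helper names are ours, not the scripts used in the thesis.

import numpy as np

def fill_missing_with_means(X):
    """NONU strategy: fill missing feature values (NaN) with the column mean,
    so that non-understood turns can be kept in the training set."""
    means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X = X.copy()
    X[idx] = np.take(means, idx[1])
    return X

def balance_by_replication(X, y, rng=np.random.default_rng(0)):
    """BAL strategy: randomly replicate the minority class until the dataset
    has as many positive as negative examples."""
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0
    need = abs(int((y == 0).sum()) - int((y == 1).sum()))
    pool = np.where(y == minority)[0]
    extra = rng.choice(pool, size=need, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X = np.array([[0.9, np.nan], [0.2, 0.5], [0.4, 0.1], [0.8, 0.7]])
y = np.array([1, 1, 1, 0])
Xb, yb = balance_by_replication(fill_missing_with_means(X), y)
print(Xb.shape, yb.mean())   # balanced corpus: half positive, half negative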
Table 5.3 shows the CER for the optimal rejection threshold, the CER for a 0.5 threshold
and the optimal threshold value. The two rightmost columns show the values for the CER in
the positive and negative part of the test dataset.
None of the attempted strategies outperformed the initial stepwise logistic regression, as confirmed by the results in Table 5.3. In [16], Bohus considered 500 training utterances as the minimum dataset size required to train a confidence annotation model using logistic regression, which may help explain why adding more data did not improve the models trained.
5.4 Summary
The goal of improving confidence annotation in this context was only to provide better confidence scores with which to find the system prime preferences. Despite the improvements achieved (all the new models presented in Table 5.3 outperformed the baseline model), the models trained in this section were never used in subsequent experiments. As mentioned before, the confidence annotation model is very system and component dependent. Since the subsequent tests were conducted with different architectures and dialog strategies, these models were no longer valid, and therefore they could not be used in live tests to show their effectiveness.

Model                        CER (opt.) (%)   CER (0.5) (%)   Opt. Th.   Positive CER (%)   Negative CER (%)
Helios Baseline              25.7             29.7            0.03       0.00               100
logit(LLIK)                  19.8             19.8            0.62       2.00               71.1
logit(LLIK) + BAL            20.3             20.3            0.33       2.00               73.1
logit(LLIK) + NONU           22.3             22.3            0.52       5.33               71.2
logit(LLIK) + BAL + NONU     22.3             22.3            0.58       5.33               71.2
logit(BIC)                   19.8             20.3            0.29       1.33               73.1
logit(BIC) + BAL             21.0             25.7            0.36       3.33               71.2
logit(BIC) + NONU            22.3             22.3            0.66       5.33               71.2
logit(BIC) + BAL + NONU      22.3             22.3            0.59       5.33               71.2
FoCal (1,1)                  21.3             30.7            0.10       4.00               73.1
FoCal (1,10)                 22.3             28.7            0.13       3.33               76.9
FoCal (1,100)                22.3             35.2            0.04       4.00               76.9
FoCal (1,1000)               22.8             36.6            0.05       5.33               76.9

Table 5.3: Classification error rate for the different strategies.
6 Automated Entrainment
In Chapter 4, a first step towards understanding how the prime choice can be automated was taken. This Chapter goes further by automating the prime selection process, instead of selecting primes randomly as was done in those experiments. Section 6.1 describes the first approach towards this automation: a rule-based method derived from heuristics of lexical entrainment theory and from the intuition gathered during the experiments presented in Chapter 4. The method was tested on-line in both Noctívago and Let's Go.
Section 6.2 presents our first attempt to create a Data-Driven method for prime selection.
The method was tested on off-line data from Noctı́vago and Let’s Go. A final on-line test with
Noctı́vago compared the effectiveness of the Data-Driven prime selection with other methods.
6.1 Two-Way Automated Rule-Based Entrainment

6.1.1 Entrainment Events
There is ample evidence in the literature of lexical, syntactic and/or acoustic entrainment in human-human and human-computer dialogs, as reported in Sections 2.2.1 and 2.2.2,
respectively. This evidence has to be mapped into something that an SDS can use to perform entrainment. Once that evidence is mapped, the system can be programmed to perform
lexical entrainment in both directions. The type of information that can be extracted from
live dialogs indicates whether the user accepted the prime the system proposed, or if she/he
decided to use a different prime. In each case, the system behavior has to be adjusted to the
evidence found. In the first case, the system should stay with the prime proposed, whereas
in the second case the system needs to decide if there is enough evidence that the prime recognized was indeed different from the prime proposed. If there is evidence, then the system
should incorporate the same prime chosen by the user. This evidence is built according to
the confidence score of the current turn, the information gathered in the previous turns of
the dialog, and the information collected from previous dialogs.
The system logs can be analyzed chronologically in order to find the user prime preferences,
i.e., if the users took up the system prime or used a different term. Some examples of Uptake
and No Uptake events were already given in Section 4.5, using data from Noctı́vago dialogs
(Table 4.8). The set of relevant events will be extended here, using examples from real Let’s
Go Dialogs to better illustrate them.
If the users take up the system prime, then entrainment has taken place. This was designated
as an Uptake (Uptk) event. The example of this event in a Let’s Go dialog can be found in
the first excerpt in Table 6.1, where the system proposed “new conversation” (S1) and the
user followed (U1). When the system detects a different prime from the one proposed, it can
be interpreted as the user deciding not to entrain to the system. This was called a No Uptake
(NUptk) event. This event can be found in the second excerpt of the dialog presented in Table
6.1, where the system proposed “new dialog” (S2), and the user answered with “new task”
(U2). A third class of events should be added to these two: when the user says a prime that has not been used until that point in the dialog. This is considered a No Previous Use (NPU) event. It is exemplified in the third excerpt in Table 6.1, where the user said "now" (U6) before the system used this concept. In these cases, the system should
try to follow the user prime choice whenever it is possible. These three events are designed
to cover the possible user behaviors in terms of prime adoption. They will be designated as
user prime events.
The fourth event illustrated in Table 6.1 cannot be considered an entrainment event, but it
will be used later to represent the system preference. Non-Understandings (NUnd), already
introduced in Section 5.2 when discussing which turns should be used to train the confidence
annotator model, occur whenever the dialog manager is unable to bind the parser output to
the concepts available in the expectation agenda for a given state. They will be treated as a system event. An example is given in the fourth dialog excerpt in Table 6.1. The parser outputs for utterances U7 and U8 were bus stops and routes, which were not acceptable concepts at that point of the dialog. If this is a recurring situation for a given prime, the system should be able to find a different alternative for that prime (S11) that might work better than the first prime proposed (U9).

S1: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U1: START A NEW CONVERSATION
...
S2: You can say, tell me about the later bus, when is the bus prior this one, start a new dialog, or say goodbye.
U2: START A NEW TASK
...
S3: Welcome to the CMU Let's Go bus information system.
S4: To get support at any time, just say support or press zero.
S5: Where would you like to start from?
U3: SHERADEN
S6: Where do you wanna go?
U4: DOWNTOWN
S7: Going to DOWNTOWN. Did I hear that correctly?
U5: YES
S8: When would you like to travel?
U6: NOW
...
S9: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U7: SPRING 94C DUQUESNE
S10: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U8: HILL CROSS TURTLE THANK .YOU.
S11: To ask about a different ride, you can say, start a new dialog. If you are finished, you can say goodbye.
U9: START NEW DIALOG

Table 6.1: Examples of the events used in the prime choice update. Primes are in bold.
6.1.2 Heuristic Entrainment Rules

This section describes the heuristic rules that were our first approach towards entrainment. These rules aim to find the most appropriate prime at any given instant. Ideally, the choice would be made using a data-driven method; however, the limited data resources initially available for Noctívago led us to develop a set of heuristics for prime selection that combine the user and system events mentioned above to find the best primes. The data-driven method was later implemented, as will be described in Section 6.2.
6.1.2.1 Implementing the Entrainment Heuristics
As described in the literature [48, 29], speakers may adjust their prime choice according
to the other interlocutor and consequently use different primes when talking to different
interlocutors. The ideal solution would be to have user-dependent models for prime selection. However, neither Noctívago nor Let's Go has user-dependent models. Neither system is currently equipped with caller-id or a speaker verification module that could trigger user-adapted
models on-the-fly. Thus, the solution adopted was a two-stage algorithm to rank the prime
candidate lists for each concept. In the first stage, “Long-Term Entrainment”, the system
tries to determine the best prime for any speaker, based on the past interactions with the
system. The prime selected at this stage will be used in the beginning of the session. In
the second stage, “Short-Term entrainment”, the system tries to coordinate the primes with
the user’s choices on the fly as the dialog progresses, trying to find if the primes used are
acceptable to the current user.
6.1.2.1.1 Long-Term Entrainment  The results presented in Section 4.7 pointed to a very high correlation between the number of No Uptake events and the most commonly used primes in daily language. This correlation was verified for the number of hits in a web search engine and for the frequency of the primes in a spoken European Portuguese corpus. A possible explanation is that the terms that are most frequent in general use are those that users employ even if the system did not use them. Thus, the primes were ranked according to the number of No Uptake events, normalized by the total number of uses of that prime by the system. The higher this value, the more suitable the prime. This resulted in the long-term prime ratio for prime i:

R(i) = \frac{count_{NUptk}(i)}{count_{system}(i)}    (6.1)

6.1.2.1.2 Short-Term Entrainment  The second phase of the algorithm aims to reduce
the number of primes used during each session; that is, making the user and the system employ the same terms. The system is expected to follow the user's prime choice whenever this choice does not degrade the system performance, and to propose a different prime once enough evidence is found that a prime is hindering the system performance. To map this kind of behavior, imagine the system has, for the new query concept, a prime candidate list ranging from prime i to prime z (prime i, prime j, prime k, ..., prime z). A set of heuristics was designed based on a set of prime update factors, one for each of the user prime events described: ϕ_Uptk, ϕ_NUptk and ϕ_NPU. These factors modify the initial long-term prime ratio R(i) according to the following heuristics (a schematic sketch of these updates in code is given after the list):
• If an Uptake event occurs for prime i, then R(i) is increased by ϕ_Uptk. Example: in the first excerpt from Table 6.1, R(new conversation) will be increased by ϕ_Uptk.
  Uptake:  S: "To ask about a different journey, you can say start [prime i]; if you are finished you can say goodbye."  U: "[prime i]"
• If prime i is used when prime j was proposed, then R(i) is increased by ϕ_NUptk and R(j) is decreased by the same amount. Example: in the second excerpt from Table 6.1, R(new task) will be increased by ϕ_NUptk and R(new dialog) will be reduced by ϕ_NUptk.
  No Uptake:  S: "To ask about a different journey, you can say start [prime i]; if you are finished you can say goodbye."  U: "[prime j]"
• If prime i is spoken without being previously used in that session, either by the user or the system, then R(i) is increased by ϕ_NPU. Example: in the third excerpt from Table 6.1, R(now) will be increased by ϕ_NPU.
  No Previous Use:  S: "Welcome to the CMU Let's Go bus information system."  S: "Where would you like to start from?"  U: "SHERADEN"  S: "Where do you wanna go?"  U: "DOWNTOWN"  S: "When would you like to travel?"  U: "11 PM"  S: "The 28X departs from WEST BUSWAY AT SHERADEN STATION C at 11:29 p.m. It will arrive at LIBERTY AVENUE AT MARKET STREET at 11:47 p.m. And it has many seats available."  U: "[prime i]"
• If prime i was proposed and a non-understanding was generated in the next user turn, then R(i) is reduced by count_NUnd(i), where count_NUnd(i) is the number of non-understandings for prime i in a session. Example: in the last excerpt from Table 6.1, the R(i) for "journey" and "new conversation" will be decreased by the number of non-understandings flagged so far, in that session, for each of them.
  Non-Understanding:  S: "To ask about a different journey, you can say, start [prime i]. If you are finished, you can say goodbye."  U: "4 BRIDGEVILLE HOMESTEAD BRIDGE BAUM" [parsed_str: NOT AVAILABLE]
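The sketch announced before the list combines Equation 6.1 with the session-level updates above. It is a schematic, hypothetical implementation: the class name, counts and usage are illustrative, and only the handcrafted update factors (1, 2 and 3) follow the values reported in Section 6.1.2.2.

class PrimeRanker:
    """Minimal sketch of the two-stage heuristics: R(i) is initialised from the
    long-term No Uptake ratio (Equation 6.1) and then updated during the
    session with the factors phi_uptk, phi_nuptk, phi_npu and the
    non-understanding counts."""

    def __init__(self, nuptk_counts, system_counts,
                 phi_uptk=1.0, phi_nuptk=2.0, phi_npu=3.0):
        # Long-term ranking, Equation 6.1: past No Uptake events over system uses.
        self.R = {p: nuptk_counts.get(p, 0) / max(system_counts.get(p, 1), 1)
                  for p in system_counts}
        self.phi_uptk, self.phi_nuptk, self.phi_npu = phi_uptk, phi_nuptk, phi_npu
        self.nund = {}          # per-session non-understanding counts

    def on_uptake(self, prime):
        # User repeated the prime proposed in the previous system prompt.
        self.R[prime] = self.R.get(prime, 0.0) + self.phi_uptk

    def on_no_uptake(self, used_prime, proposed_prime):
        # User answered with a different prime than the one proposed.
        self.R[used_prime] = self.R.get(used_prime, 0.0) + self.phi_nuptk
        self.R[proposed_prime] = self.R.get(proposed_prime, 0.0) - self.phi_nuptk

    def on_no_previous_use(self, prime):
        # User introduced a prime not yet seen in the session.
        self.R[prime] = self.R.get(prime, 0.0) + self.phi_npu

    def on_non_understanding(self, proposed_prime):
        # Penalise a prime whose prompt was followed by a non-understanding.
        self.nund[proposed_prime] = self.nund.get(proposed_prime, 0) + 1
        self.R[proposed_prime] = self.R.get(proposed_prime, 0.0) - self.nund[proposed_prime]

    def best_prime(self):
        return max(self.R, key=self.R.get)

ranker = PrimeRanker({"new conversation": 4, "new dialog": 1},
                     {"new conversation": 40, "new dialog": 30})
ranker.on_non_understanding("new conversation")   # e.g. after U1 in Table 6.8
ranker.on_uptake("new dialog")                    # e.g. after U2 in Table 6.8
print(ranker.best_prime())                        # -> "new dialog"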
6.1.2.2 Version 1: Testing the Entrainment Rules in Noctívago
The entrainment experiments described in this Chapter were performed using the new architecture for Noctı́vago presented in Figure 3.9a.
Noctı́vago’s new version uses the ECA flash-based interface, instead of the telephone interface
previously used, and still used in Let's Go. This version of Noctívago also uses context-dependent language models. The models for Place and Time are bi-gram language models,
trained with 30k automatically generated corpora. The models for Confirmation and Next
Query were created from SRGS grammars specified according to the data collected in the
experiments described in Chapter 4.
The heuristic rules described in the previous Section were implemented. The prime candidates that could be affected by the application of the rules can be found in Table 6.2. To the primable concepts used in the study presented in Chapter 4, three others were added
due to small changes in the dialog strategy. Once the values for the slots arrival place, origin
place and time were filled, the system explicitly asked the user if the values were correct
(QueryConfirm Request Agent in the Noctı́vago task presented in Figure 3.3). If the user
said no, the system asked which slots were not correct (RequestWrongSlot Request Agent in
the Noctı́vago task presented in Figure 3.3). It is important to emphasize that these concepts
correspond to the way the slot is referred to rather than the slot value itself. Another primable concept introduced was price, as the system offered the user the possibility of asking for the
price of the ticket (InformPrice Inform Agent in the Noctı́vago task presented in Figure 3.3).
The “Type of Prime” column in Table 6.2 specifies if the use of the prime will have influence
in the course of the dialog. For instance, in the example dialog presented in Table 6.1, in
utterance U1 the fact that “new conversation” was correctly recognized intrinsically affects
the course of the dialog. The primes whose incorrect recognition does not affect the course
of the dialog and that are used in system prompts will be typified as “Non-Intrinsic”. None
of the primes used in these Noctı́vago tests is typified as “Non-Intrinsic”. This type can be
found in the primes used in the Let’s Go tests in Table 6.7 and the Noctı́vago tests presented
in Section 6.2.3.1. To give an example, in a Let's Go dialog, if the user answers "start from downtown" and the system recognizes "stay from downtown", where "start" is a prime, the
slot is filled with the same value, “downtown”, in both cases.
Type of Prime   Concept         Primes
Intrinsic       Next            próximo / seguinte
                Now             agora / imediatamente / neste momento /
                                o mais rápido possível / o mais brevemente possível
                Price           preço / valor
                Start Over      outro percurso / nova procura / nova pesquisa /
                                procurar novamente / outra procura / nova busca
                Arrival Place   chegada / destino
                Origin Place    partida / origem
                Time            horas / horário

Table 6.2: Primes used by Noctívago in the heuristic method tests.
These tests had two main goals. The first was to compare the users’ behavior with and without
Short-Term entrainment. The second was to find the best configuration for the confidence
measure to be combined with the prime events detected. For this purpose, four different
test sets were created. In Set 1, the dialog confidence score generated in Helios was used to
threshold the Short-Term entrainment updates. In Set 2, the ASR confidence score was used
for the same purpose. In Set 3, Short-Term entrainment updates were performed without
the use of the confidence scores. Finally, Set 4 only performed the Long-Term entrainment
updates.
The update factors used in Sets 1, 2 and 3 were handcrafted. The values for ϕ_Uptk, ϕ_NUptk and ϕ_NPU were set to 1, 2 and 3 respectively. These values were chosen to give more importance to the least frequent events, as the previous findings and the intuition developed in previous studies pointed to their relevance for prime selection. Also, these values ensure that they outweigh the initial ratio R(i), as they are at least one order of magnitude higher than the average initial R(i) for the data collected in the tests described in Chapter 4, 0.09.
This was done in order to reinforce entrainment during each session.
6.1.2.2.1 Test Set  Users were recruited by e-mail or through a Facebook event to participate in the experiment. In both cases they were given a short explanation of the requirements to complete the task. Then they were given a web link to access the system via the Flash-based multi-modal web interface of Noctívago (Figure 3.9a). They were told to carry out
three consecutive requests to the system within one session. This made each dialog longer
and consequently gave the system more chances to apply the heuristics. The four Sets ran
alternately one after the other during the test period, each for the same amount of time, but
did not change within an individual session. The users were not aware that the system was
running different configurations.
At the end of each dialog, the users were asked to fill in a questionnaire based on the PARADISE framework for SDS evaluation [140]. The questionnaire also tried to
evaluate if the users noticed any difference in the lexical choice, by asking them ’if the system understood them better towards the end of the session’. We hoped that the use of
adequate primes would result in better recognition as the session progressed. The complete
questionnaire can be found in Annex C.2.2.
6.1.2.2.2 Results  Table 6.3 summarizes the details of the 160 sessions validated in terms of system performance (Dialog Success and Average Number of Turns). The tests were performed by 83 different people: 33 were female subjects and 46 were male subjects. It was
also tested by 5 non-native users. 13 users already participated in the tests described in Section
4.3. System performance includes both the estimated and real success. The estimated success
is computed live, since it only takes into account whether the system queried the backend and
provided any bus schedule information to the user. Real success is computed a posteriori,
after listening to each session and verifying if the schedule provided actually corresponded
to the user's request. Although most of the results were not statistically significant, Set 4, against initial expectations, achieved the best performance in terms of real dialog success and the second lowest average number of turns per session, although the Chi-square test for Dialog Success and the one-way ANOVA tests for Number of Turns revealed no statistically significant differences between versions. Since one of our goals was to compare the performance with different confidence measures, Table 6.3 shows that among the Sets that performed Short-Term entrainment, Set 1, which used the dialog confidence measure, was the one with
the best performance.
                                 Set 1   Set 2   Set 3   Set 4
Number of Sessions               40      42      44      34
Estimated Dialog Success (%)     92.5    95.2    95.5    91.2
Real Dialog Success (%)          70.3    63.2    67.2    74.5
Average Number of Turns          9.24    9.13    8.12    8.92

Table 6.3: Dialog performance results.
Table 6.4 shows Word Error Rate (WER) and percentage of correctly transferred concepts
(% CTC) for the same tests. Figure 6.1 shows a graphical representation of the same results.
A concept is considered correctly transferred when the parsed ASR result is equal to the
parsed transcription. Despite the fact that one-way ANOVA tests did not reveal statistical
significance, results show Set 4 achieved the lowest WER and the highest percentage of CTC.
The relation between the WER and the system performance reveals that Set 1, despite having
the highest WER, has the second best real success rate. We observe that the WER for primes
is lower than the global WER for all the Sets, except Set 3. Against our expectation, the %
CTC is lower for Primes than for the other concepts. However, Sets 1 and 2 achieved the
lowest loss in % CTC. The % CTC for intrinsic primes even increased in Set 1 compared to the other concepts' % CTC; this is the Set where Short-Term entrainment is performed and the dialog confidence threshold for the entrainment rules is set.
                      Set 1   Set 2   Set 3   Set 4
WER (%)               59.7    52.3    53.7    47.9
WER Primes (%)        52.6    50.1    54.9    43.4
WER Intrinsic (%)     53.6    48.3    58.2    44.8
CTC (%)               47.3    39.6    44.5    51.5
CTC Primes (%)        45.6    36.2    37.3    47.2
CTC Intrinsic (%)     48.3    39.1    37.1    47.3

Table 6.4: WER and correctly transferred concepts results.
Figure 6.1: WER and CTC results for the different configurations tested: (a) WER; (b) CTC.
The questionnaire results, given in Table 6.5 (Average Satisfaction and Adaptation), suggest that the users' evaluation is more correlated with the session success (0.41, computed using Pearson correlation, with p-value < 0.001) than with the adaptation of lexical choices (0.13, p-value = 0.14). Set 4 received the highest rating. However, Set 3 was the second highest rated, despite being the worst performing. These results could mean that the users did not notice that the system was adapting to them. They could also mean that the question regarding adaptation possibly induced the users to evaluate something other than the lexical choices.
                        Set 1   Set 2   Set 3   Set 4
Average Satisfaction    19.4    19.5    19.7    20.3
Adaptation              2.75    2.48    3.16    3.38

Table 6.5: User satisfaction results.
Table 6.6 presents the results of the entrainment events detected during live interactions. The event percentage was computed as:

Event(\%) = \frac{\sum_{i=1}^{P} count_{event}(i)}{\sum_{i=1}^{P} count_{system}(i)}    (6.2)
where the Events are Uptakes, No Uptakes, No Previous Use or Non-Understandings.
Despite one-way ANOVA tests revealing no statistical significance, we see that Uptake events are much more frequent than the other events, as can be seen in Figure 6.2, which shows the accumulated percentage of each event per Set. This signifies that users followed the prime proposed by the system much more often than they used terms of their own choice. The behavior of novice users confirmed our expectations [80]. We believe that if the tests were run on Let's Go,
with some experienced users, the result might be different.
                                Set-up 1   Set-up 2   Set-up 3   Set-up 4
Total Uptake (%)                16.8       20.3       18.4       17.6
Total No Uptake (%)             2.03       2.31       1.21       1.20
Total No Previous Usage (%)     0.38       0.12       0.13       0.33
Total Non Understanding (%)     9.77       5.78       6.85       6.50

Table 6.6: Entrainment events relative frequency.
A detailed analysis of the system logs showed that the initial prime rank given by Equation 6.1 rarely changed from session to session, unless a No Uptake event occurred, and such events were very rare. We also examined reactions to non-understandings. With the strategy proposed in Section 6.1.2.1.2, a prime could be changed before there was the necessary evidence that it was degrading the system performance [24].

Figure 6.2: Accumulated percentage of each event per Set.
The results of these tests do not show an improvement in the system performance when
Short-Term entrainment is used. This brings us to envisage future experiments using the
entrainment rules. The dialog confidence score could be used as a threshold for Short-Term
entrainment. Since Uptake events are much more frequent than any other prime event, they could also be used to compute the initial prime ratio in Long-Term entrainment. An increase in the update factor ϕ_Uptk should also enhance the convergence between user and system during the session. Finally, the approach to the update at a non-understanding event could also be modified to allow the system to gather more evidence of whether the prime is hindering ASR success.
6.1.2.3 Version 2: Testing Entrainment Rules in Let's Go
The tests with Noctı́vago revealed interesting trends. However, we do note that most of the
results obtained were not statistically significant, and consequently do not permit us to draw
any strong conclusions about the effect of on-line entrainment rules on SDS performance.
These trends motivated modifications in the entrainment rules. Firstly, since a threshold is
needed, we decided to use the dialog confidence score, since the Set that used this configuration
performed better than the set where the ASR confidence score was used and the set where the
Short-Term entrainment updates were performed regardless of the confidence score. Secondly,
the initial prime ratio was modified to a weighted sum of the normalized number of No Uptake
and Uptake events:
R(i) = \frac{count_{NUptk}(i)}{count_{system}(i)} + w_{Uptk} \times \frac{count_{Uptk}(i)}{count_{system}(i)}    (6.3)

where count_Uptk(i) is the past number of Uptakes for prime i and w_Uptk is given by the ratio between the total number of Uptake events and the total number of No Uptake events:

w_{Uptk} = \frac{\sum_{i=1}^{P} count_{Uptk}(i)}{\sum_{i=1}^{P} count_{NUptk}(i)}    (6.4)

where P is the total number of primes. Thirdly, the update factor for Uptake events, ϕ_Uptk, was increased to 2, in order to enhance convergence during the session. Finally, in order to gather the necessary evidence that a given prime is raising non-understandings, the update after a non-understanding was only performed after the second non-understanding. In addition, instead of simply subtracting count_NUnd(i) from R(i), w_NUnd × count_NUnd(i) is now subtracted from R(i), where w_NUnd is computed similarly to w_Uptk:

w_{NUnd} = \frac{\sum_{i=1}^{P} count_{NUnd}(i)}{\sum_{i=1}^{P} count_{NUptk}(i)}    (6.5)
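A compact sketch of this revised ranking, with illustrative counts, could look as follows (the dictionaries and the per-session rule for non-understandings are assumptions about how the counts are stored, not the actual implementation):

def modified_prime_ratio(uptk, nuptk, nund, system):
    """Sketch of the revised long-term ratio of Equations 6.3-6.5.  The inputs
    are dictionaries mapping each prime to its past Uptake, No Uptake,
    Non-Understanding and system-usage counts (illustrative structure)."""
    total_uptk = sum(uptk.values())
    total_nuptk = sum(nuptk.values()) or 1
    total_nund = sum(nund.values())
    w_uptk = total_uptk / total_nuptk                   # Equation 6.4
    w_nund = total_nund / total_nuptk                   # Equation 6.5
    R = {}
    for prime, sys_count in system.items():
        sys_count = max(sys_count, 1)
        R[prime] = (nuptk.get(prime, 0) / sys_count
                    + w_uptk * uptk.get(prime, 0) / sys_count)   # Equation 6.3
        # During a session, only after the second non-understanding is the
        # ratio further reduced by w_nund * count_nund(prime).
        if nund.get(prime, 0) >= 2:
            R[prime] -= w_nund * nund[prime]
    return R

print(modified_prime_ratio({"following": 30, "next": 5},
                           {"following": 2, "next": 1},
                           {"following": 0, "next": 3},
                           {"following": 60, "next": 20}))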
The algorithm with the modifications mentioned above was tested in Let's Go, a live system with real users. Cornerstone [75] was used as the dialog manager. Thus, the resulting architecture is the one represented in Figure 3.8. The set of primes used is shown in Table 6.7. This set was extended from the one used in [99]. This also increased the number of prompts available, which can make the system sound more natural.
Type of primes   Concept         Old Primes        New Primes
Intrinsic        next bus        next              following / subsequent / later / after
                 now             now               immediately / right now / right away / as soon as possible
                 previous bus    previous          preceding / prior / before
                 start over      route / schedule  itinerary / trip / ride / journey
                 confirm         right             alright / correct / okay
                 help            help              assistance / support / more information
                 new query       query             request / task / dialog / route / conversation
Non-Intrinsic    origin place    leaving / leave   departing / depart / starting / start

Table 6.7: Primes used by Let's Go before and after the entrainment rules were implemented.
Since Let’s Go has a set of regular users who may be familiar with the system’s lexical choices,
we added another restriction to the prime selection algorithm: the "Old Primes" were only used
when their prime ratio had reached a handcrafted threshold. This threshold was set to be
one order of magnitude higher than the average prime ratio for all primes.
The Let's Go version running with the entrainment rules was live for 31 consecutive days. Its performance was compared to that of the same system without the entrainment rules, which ran a few weeks before the new system went live, for approximately the same amount of time.
6.1.2.3.1 Results  To give the reader an idea of the type of behavior we are aiming at with the entrainment rules, Table 6.8 shows two different dialogs where the prime choice was affected. In the first dialog, the system tried to make the user entrain to a different prime, since the first attempt resulted in a non-understanding. In U1, the user answered with "new conversation", but the incorrect recognition output could not be bound to any concept. Once the system used "new dialog", the recognition output was correctly bound. In the second dialog, the system
entrained to the user. The system proposed “following” as prime for next bus, however the
user preferred to use “after”. In the subsequent prompt, the system used “after” instead of
“following” when it used a prime for next bus.
S1: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U1: 13 TREE STATION
S2: To ask about a different ride, you can say, start a new dialog. If you are finished, you can say goodbye.
U2: START A NEW DIALOG
...
S3: You can say, tell me about the following bus, when is the bus prior this one, start a new conversation, or goodbye.
U3: AFTER THAT
S4: Okay.
S5: There is a 71D that departs from HAY STREET AT PENN AVENUE at 8:13 a.m. It will arrive at FIFTH AVENUE AT MCMASTERS at 8:48 a.m.
S6: Please let me know how I did by pressing, four, if you got the information you were looking for, or six, if you did not. To get more information about buses related to this journey, you can say, what time is the bus after this one, or, when is the bus prior this one. To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.

Table 6.8: Excerpts of dialogs where entrainment rules changed the system's normal behavior. Primes affected in bold.

Table 6.9 compares the results between the Baseline version without the entrainment rules and the Entrainment Rules version. The estimated success rate has increased by more than
2%, which represents a relative 10% reduction in the number of failed sessions, although the
Fisher’s tests revealed no significant differences between versions. The average number of
turns per dialog has also decreased by almost one turn per dialog (6% relative reduction),
which in addition is statistically significant. We believe that this reduction in the number
of turns is due to dialogs like the first one shown in Table 6.8. The fact that the system is
able to change the choice of prime if non-understandings or low confidence turns are detected,
avoids the repetition of primes that might be hindering the system performance.
The user prime events relative frequency was also evaluated. There is a reduction in the
number of Uptake events, which may represent less familiarity of the users with the new primes
proposed. For the same reason, the number of No Uptake events may also have increased.
On the system side, the number of Non-Understandings increased in the entrainment rules version, contrary to what was initially expected. A possible explanation could be that some of the primes used were not available in the datasets used to train the acoustic and language models that the system was using. In order to have the system recognize
these primes, they were added to the lexicon and language models by hand. Nevertheless, the
improvement in Estimated Dialog Success and the decrease of the average number of turns
suggest that the system performance was better.
                                Baseline   Entrainment Rules   Statistical Significance Test Result
Number of sessions              1542       1792                -
Estimated Dialog Success (%)    75.11      77.64               Fisher's: no statistically significant difference
Avg. number of turns            12.24      11.47               Two-way ANOVA: F(1) = 8.131; p = 0.004
Total Uptake (%)                5.35       2.39                One-way ANOVA: F(1) = 120.579; p = 0.000
Total No Uptake (%)             0.56       0.78                One-way ANOVA: F(1) = 4.421; p = 0.036
Total No Previous Usage (%)     1.92       1.75                One-way ANOVA: F(1) = 4.496; p = 0.034
Total Non Understanding (%)     6.33       9.07                One-way ANOVA: F(1) = 0.240; p = 0.624

Table 6.9: Results for Let's Go tests. Statistically significant differences in bold.
6.1.3 Acoustic Distance and Prime Usage Evolution
The previous section showed that entrainment rules used for prime selection can improve the
system performance. However, the performance numbers do not show whether the system
always used the same prime or if the prime selected for each concept varied considerably. It is
also interesting to compare the way the primes are chosen according to the entrainment rules,
and how they would be chosen according to the acoustic distance, to see if the system is really
adapting to the users. The latter would be the choice of prime if the system preference
was the only factor taken into account, since those primes should be easier to recognize than
their synonyms.
In order to compute the acoustic distance between primes and all the entries in each of the
language models used in Let's Go, the primes and all the entries were synthesized with
3 different voices using Flite [13]. The acoustic distances between synthesized samples of the
primes and the remaining entries in each language model were computed using Dynamic Time
Warping with the method described in [135]. Finally, the average and minimum distances
were computed. Table 6.10 shows the resulting primes that would be selected for each dialog
state, according to minimum and average acoustic distances.
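The distance computation can be sketched as follows. This is a generic DTW over synthetic feature sequences, standing in for the Flite syntheses and the specific method of [135]; names and data are illustrative.

import numpy as np

def dtw_distance(a, b):
    """Plain Dynamic Time Warping distance between two feature sequences
    (frames x dims), with Euclidean local cost; a generic stand-in for the
    method of [135]."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

# Toy "acoustic" features standing in for the synthesized prime and LM entries.
rng = np.random.default_rng(0)
prime = rng.random((40, 13))
entries = {"downtown": rng.random((55, 13)), "next": rng.random((38, 13))}
dists = {w: dtw_distance(prime, feats) for w, feats in entries.items()}
print(min(dists.values()), sum(dists.values()) / len(dists))   # minimum and average distances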
Dialog State             Concept         Prime Min. Distance    Max Min. Dist. (dB)   Prime Avg. Distance   Max Avg. Dist. (dB)
request place            origin place    departing              4.35                  leave                 15.58
                         now             as soon as possible    5.57                  immediately           13.42
                         help            assistance             5.01                  support               13.58
request time             now             as soon as possible    5.07                  now                   12.93
                         origin place    departing              4.18                  leave                 14.56
                         help            assistance             4.99                  more information      12.00
explicit confirmation    confirmation    alright                4.46                  correct               12.51
                         help            assistance             4.11                  more information      12.00
request next query       next query      request                4.35                  query                 12.54
                         start over      itinerary              3.97                  ride                  10.76
                         next bus        next                   3.74                  next                  11.40
                         previous bus    preceding              4.48                  before                10.79
                         help            assistance             4.11                  more information      12.00

Table 6.10: Primes selected according to the minimal and average acoustic distance for each language model.
In order to capture the prime usage evolution over the 31 days of use of the entrainment rules, the percentage of usage of each prime among the primes used for the same concept was computed. Figures 6.3, 6.4 and 6.5 show the resulting prime usage.
6.1.3.1 Analysis
For the confirm concept, the system started by proposing “okay”; however, after “correct” was proposed, it remained the most used prime for the rest of the test period. In the case of confirm, the former prime, “right”, was rarely used. Compared to the acoustic prime choice, the most used prime coincides with the prime selected with the maximum average distance.
“Support” was always used as the help prime. There are two reasons for this. The first is that the concept only appeared twice in user turns during the test period and the users entrained to the system's choice of prime. The second is that the help prime was always followed by the system prompt asking for the place of departure.
[Figure 6.3 plots prime system usage per concept (%) against test day, in three panels: (a) confirm (okay, correct, right, alright), (b) help (help, assistance, more information, support), (c) now (right now, immediately, next, now, right away, as soon as possible).]
Figure 6.3: Prime usage over time for the concepts confirm, help and now.
[Figure 6.4 plots prime system usage per concept (%) against test day, in three panels: (a) new query, (b) origin place (leaving, departing, starting, leave, depart, start), (c) start over (new route, itinerary, schedule, trip, ride, journey).]
Figure 6.4: Prime usage over time for the concepts next query, origin place and start over.
[Figure 6.5 plots prime system usage per concept (%) against test day, in two panels: (a) next bus (next bus, following, subsequent, later, after), (b) previous bus (previous, preceding, prior, before).]
Figure 6.5: Prime usage over time for the concepts next bus and previous bus.
This means that the initial prime ratio R(i) will never be decreased by an update factor, since non-understandings and no-uptakes never occur for this concept. Unless the user picks a different prime, the initial prime will always remain the chosen prime. The acoustic prime choice would be “support”, “more information” or “assistance”, depending on the language model used and the average metric chosen.
The new query rule-based prime choice alternated between “dialog”, “conversation” and “route”, whereas the acoustic prime choice would be “request” or “query”. Figure 6.4a shows that the system tried “request”, but the attempt did not seem to be fruitful. “Query” was never used, because of the “old prime” restriction mentioned in Section 6.1.2.3.
The next bus primes show a similar behavior. The system first proposed “later” and “after”, but when “following” was proposed, the system kept it as the most used prime for the rest of the test period. The old prime, “next”, was only used for very limited periods due to the “Old prime” restriction. The acoustic distance would have recommended the use of “next”.
Despite the “Old prime” restriction, “now” was still the most used prime for the now concept, together with “next”. This can be explained by the fact that it was primarily detected in NPU events, which have a higher update factor. According to the average acoustic distance, “now” is also a possibility for the prime used by the system to ask for travel time. If the minimal distance is taken into account, then “as soon as possible” should be used. Figure 6.3c shows that this prime was never used.
The system started by proposing “start” as the origin place prime; however, once it changed to “depart”, it used any prime except that one. In this case, most of the primes listed were never used. Since it is a non-intrinsic concept, most of the time the concept does not appear in user turns, and the prime rank is not updated. None of these primes matches the acoustic distance prime choice.
The system alternated between “prior” and “preceding” for the previous bus concept. Occasionally, it used the prime “before”, and rarely the old prime “previous”. “Before” was the prime chosen according to the average distance. Apparently the system could not decide whether one
prime was better than the other. “Preceding” corresponds to the acoustic choice according
to the minimum distance.
Finally, there is no clear prime choice for the start over concept. The use of start over has the same effect in the dialog as the next query concept, i.e., after the user receives the schedule information the system asks for the next action, and if the answer is next query or start over, the system restarts the dialog from the beginning. At this point of the dialog there is a large number of non-understandings, which often results in the update of the prime ratio as described in Section 6.1.2.3 for non-understandings. In addition, since the prompt explicitly directs the user to the new query prime (see S6 in Table 6.8), this concept is less used and consequently the prime ratio is only updated after non-understandings, which results in an observable prime variance.
The comparison between the acoustic distance prime selection in Table 6.10 and the entrainment rules prime selection in Figures 6.3 to 6.5 shows that the two methods can lead to different prime choices, as is expected given the role of the users when the system is performing on-the-fly lexical entrainment. In addition, since prime selection with entrainment rules constantly adapts to each user, the prompts have more variety and the system sounds more natural. However, the methods sometimes selected the same primes. This means that the acoustic distance could be used to rank primes if no prior entrainment study had been carried out.
6.1.4 Discussion
The tests presented in this section have shown that entrainment can have an impact on SDS performance. In Let's Go, the estimated success rate increased and the average number of turns decreased when the system used entrainment rules. However, since the amount of data collected is much larger, a detailed analysis such as the one provided for Noctívago could not be performed. The fact that Let's Go has been running live since 2005 was an additional challenge. The experiment was designed to avoid the use of the “Old Primes” (see Table 6.7); however, the reduction in the percentage of Uptakes and the increase in the percentage of No Uptakes shown in Table 6.9 may indicate that some users continued to use the old primes.
Except for the case of “now”, none of the old primes was the most used, as Section 6.1.3 indicates. However, this restriction might be keeping the system from using the best prime, which could be in the Old Prime set.
Another hypothesis worth considering is whether the variety of primes, which generates different system prompts and makes the system behavior seem more natural, is also reflected in the improvement of system performance.
The results obtained for Noctívago did not show improvements in dialog performance as large as the ones for Let's Go. However, the fact that the former is an experimental system adds one more variable to the analysis. From the tests described in Section 6.1.2.2, the main conclusions were that the dialog confidence score should be used as the threshold score in the entrainment rules, that uptake events should be more valued in the rules, and that the prime ranking should only be affected after the second non-understanding. One interesting outcome from Tables 6.3 and 6.4 is that the Set with the highest WER was the second best in terms of system performance. That Set used the dialog confidence score as a threshold in the entrainment rules. This is already evidence that, under adverse conditions where WER is high, entrainment rules can improve system performance.
6.2 A Data-Driven Method for Prime Selection
It was already shown that entrainment can improve the system performance. However, a data-driven model could be a more robust approach to perform entrainment than the heuristic-driven rules. Also, if data were available, it would be easier to generalize a method to perform live entrainment for any system, rather than having each system developer create her/his own heuristics. In this section we explore a statistical model for prime selection. The model was first tested off-line on the data collected from previous experiments with Noctívago and Let's Go. Finally, a live test comparing the model with other methods for prime selection was held with Noctívago.
6.2.1 Prime selection model
Any statistical model for prime selection must share the goals of the entrainment rules: adapt the system's prime choice with a view to improving the system performance. The improvement can be achieved by combining the system and user prime preferences to lower the WER of the prime candidates. Thus, a model for prime selection should predict, at each point of the dialog, the prime with the lowest WER based on a set of entrainment-related features.
Using transcribed data, a supervised learning method can be used to train a regression on a feature set derived from the entrainment events described in Section 6.1.1 and on new features that can help to determine whether there was entrainment between the system and the user. This feature set is used to predict the WER for any given prime at each point of the dialog. The prime to be used in a system prompt should be the one with the lowest WER prediction.
The user entrainment events and non-understandings are represented as binary features in the feature vector. The dialog confidence, previously used to threshold the prime updates in Short-term entrainment, is now an element of the feature vector.
In addition to the data already used in the previous experiments, two more sources of information were used: the distance to the previous use of the prime and an entrainment measure previously developed for human-human dialogs [94].
Since entrainment is more likely to occur in the user turn immediately after the system used a prime, the distance to the last use of the prime could be an important feature to determine whether a prime was really used, or whether there was a recognition error and the prime was recognized by mistake. If a prime was recognized at a short distance from the last system use, the prime was more likely to be correctly recognized. This was confirmed by the Pearson correlation values found between distance and confidence score, −0.35, and between the average distance for all primes in one session and the estimated success of that session, −0.54, for the Let's Go data collected with the system running the entrainment rules (both values are statistically significant). This distance is simply the number of user turns between the last time the system used a prime and the user turn where the prime was recognized. Table 6.11 shows the computation of the distance for two different cases. In the first case, the system used “new conversation” and the user picked it up in the subsequent turn, so the distance is 1. In the second example, the user uses “now” before the system did, which means that in this case the distance is simply the number of user turns in that dialog so far.
[d_new conversation = 0] S1: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
[d_new conversation = 1] U1: START A NEW CONVERSATION
...
[d_now = 0] S3: Welcome to the CMU Let's Go bus information system.
[d_now = 0] S4: To get support at any time, just say support or press zero.
[d_now = 0] S5: Where would you like to start from?
[d_now = 1] U3: SHERADEN
[d_now = 1] S6: Where do you wanna go?
[d_now = 2] U4: DOWNTOWN
[d_now = 2] S7: Going to DOWNTOWN. Did I hear that correctly?
[d_now = 3] U5: YES
[d_now = 3] S8: When would you like to travel?
[d_now = 4] U6: NOW

Table 6.11: Example of how the prime distance was computed.
The entrainment measure for prime p, adapted from [94], in an SDS is given by:

Entr(p) = -\left| \frac{count_{user}(p)}{ALL_{user}} - \frac{count_{system}(p)}{ALL_{system}} \right|    (6.6)

where count_user(p) is the number of times that the user used prime p and ALL_user is the total number of words said by the user. This measure gives the similarity of use of prime p between the user and the system during a session, and it was correlated with task success in human-human task-oriented dialogs.
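As an illustration, the following sketch computes these two features for one prime; the function names, the turn-indexing convention and the way counts are passed in are assumptions made here for clarity, not the implementation used in the thesis.

    def prime_distance(user_turn_index, last_system_use_turn):
        # Number of user turns between the last system use of the prime and the user
        # turn where the prime was recognized; if the system never used the prime,
        # the distance is the number of user turns in the dialog so far.
        if last_system_use_turn is None:
            return user_turn_index
        return user_turn_index - last_system_use_turn

    def entrainment_measure(count_user, all_user, count_system, all_system):
        # Eq. 6.6: Entr(p) = -|count_user(p)/ALL_user - count_system(p)/ALL_system|.
        return -abs(count_user / all_user - count_system / all_system)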
6.2.2 Predicting WER with Off-line data
A prime selection model was trained for Noctı́vago and Let’s Go. The Noctı́vago model was
trained using the data transcribed from the first studies described in Section 6.1.2.2. The
Let’s Go model was trained with two months of data transcribed using crowdsourcing, which
was released for the Spoken Dialog Challenge (SDC) [14]. It is important to point out that the
data used for the Noctı́vago model came from a version of the system that already performed
automated prime selection, as described in Section 6.1.2.2, whereas the system used to collect
the SDC data did not.
System logs were analyzed to extract the features described in Section 6.2.1. One feature vector, F, was generated for each user turn where the presence of a prime was detected. The feature vectors were grouped by dialog state, in order to train a different model for each dialog state, since primes are used differently from state to state. However, to have enough data for each state, states with similar prime behavior were merged into a single dataset. For instance, the Request Origin Place and Request Destination Place states were merged into the Request Stop dataset. In the Noctívago dataset, since many states are under-resourced, the states with fewer than 300 samples were grouped to create a Generic dataset. The distribution of data per state in both corpora is given in Table 6.12.
(a) Noctívago
Dialog State         # Turns
Request Next Query   331
Generic              369

(b) Let's Go
Dialog State   # Turns
Request Next   1428
Request Time   1040
Request Stop   143
Inform         434
Generic        1733

Table 6.12: Number of turns used to train the prime selection regression.
Since the problem was to find the best weight for each feature given the target value and the feature values, regression seemed a straightforward method. The datasets were split into 75% for training and 25% for testing. Several regression methods from the scikit-learn toolbox for Python [102] were tested. Once each model was trained, the Pearson correlation between the predicted WER and the actual WER was computed for the test set, together with the coefficient of determination (R2), which measures the quality of the model (the closer to 1.0, the better the model). The best results, achieved using Linear Regression (LR) with the ordinary least squares method and Support Vector Regression (SVR) with a radial basis function kernel, are presented in Tables 6.13 and 6.14.
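For reference, a minimal sketch of this training and evaluation setup is given below. The feature matrix X and target vector y stand in for the per-turn entrainment features and WER values; the random stand-in data, the train/test split helper and the exact scikit-learn classes shown are assumptions about the setup, not the thesis code.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR

    X, y = np.random.rand(700, 6), np.random.rand(700)   # stand-in data

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    for name, model in [("LR", LinearRegression()), ("SVR", SVR(kernel="rbf"))]:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        corr, _ = pearsonr(y_test, y_pred)        # Pearson correlation with actual WER
        r2 = model.score(X_test, y_test)          # coefficient of determination (R2)
        print(f"{name}: r = {corr:.2f}, R2 = {r2:.2f}")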
The results for the Noctívago regressions show remarkable correlation values, especially for the Generic model, and in both cases the correlation is statistically significant.
Model                Measure            LR     SVR
Request Next Query   Correlation        0.32   0.36
                     Coeff. Det. (R2)   0.08   0.11
Generic              Correlation        0.63   0.62
                     Coeff. Det. (R2)   0.39   0.35

Table 6.13: Noctívago models.
Model          Measure            LR      SVR
Request Next   Correlation        0.35    0.34
               Coeff. Det. (R2)   0.11    0.05
Request Time   Correlation        -0.01   0.06
               Coeff. Det. (R2)   -0.13   -0.02
Request Stop   Correlation        0.16    0.04
               Coeff. Det. (R2)   -0.84   -0.11
Inform         Correlation        0.22    0.28
               Coeff. Det. (R2)   -0.03   0.03
Generic        Correlation        0.13    0.13
               Coeff. Det. (R2)   0.01    0.09

Table 6.14: Let's Go models.
Nevertheless, the coefficient of determination is not very high for the Request Next Query model. The two regression methods achieved very similar performance. Given that Request Next Query is the state where priming is most likely to occur, in an implementation scenario the SVR would be preferred over the LR.
The regressions trained for the Let's Go states, except for the Request Next state type, show lower correlation values when compared with the values obtained for Noctívago. Even though that state is probably the one where entrainment is most common, these results should be further analyzed. Several factors may have contributed to this result. The first was already mentioned: the Let's Go data was collected in a different context, where there was no entrainment policy running and the prime set used was static. The system prompts did not show any change over time and fewer examples of user prime events (especially No Uptake events) were found in the data. Secondly, the data collections were made in substantially different contexts. Whereas Noctívago is an experimental system tested by novice users, Let's Go was tested by real users, some of whom were believed to be regular users of the system. It has been shown that novice users tend to be more driven by the system prompts than experienced
users [80]. Finally, the modifications made in the Noctívago dialog strategy to have more dialog states where entrainment could happen may also help to explain this result. This is confirmed by the fact that in the Noctívago dataset an average of 2.1 turns with primes per query was found, whereas in Let's Go the average number of turns with primes per query was only 1.24. Further work is needed to find a model that better fits the Let's Go dialog strategy. Once the data collected with the system running the entrainment rules is transcribed and used to train a new model, the performance of the model should improve.
6.2.3 Testing the model in an experimental system
The SVR model trained with the Noctívago data was tested in the on-line system. The previously used Ravenclaw dialog manager was replaced by Cornerstone [75]. This dialog manager includes a user action model to predict the next best action the system should take, and a confidence calibration model trained to improve the confidence measure used in state-tracking dialog management. Since the Noctívago dataset is too small to train a user action model and both systems target the same domain, we opted to keep the same user action model adopted in Let's Go. The fact that the model was trained at the action level poses no difficulty to its use in a different language. However, to adapt the strategy to the type of users of Noctívago, while keeping it as close as possible to the one previously designed for Ravenclaw, minor modifications were made to the original dialog manager. The confidence calibration model, however, was trained with the Noctívago dataset.
In these experiments the language models were also updated. The details can be found in
Section 3.2.1.2.
In the tests performed with off-line data, each feature vector was considered as an isolated observation. However, since the prime selection depends on the evidence built in previous turns [48], the predicted WER for turn t, \widehat{WER}_t, should also incorporate a scaling factor representing those turns. The WER prediction from the previous turn, \widehat{WER}_{t-1}, was chosen to be the scaling factor. Thus, the predicted WER for prime p at turn t is given by:

\widehat{WER}_t(p) = \widehat{WER}_{t-1}(p) \times r(F)    (6.7)

where r(F) is the prediction value given by the trained regression for F = <f_1, ..., f_n>, the feature vector generated during live interaction for every turn where a prime was recognized. Since at the beginning of the dialog \widehat{WER}_{t-1} is not available, we assume that the most frequently used primes are more likely to have lower WER. Hence, the initial prediction value is given by:

\widehat{WER}_0(p) = 1 - \frac{count_{user}(p)}{count_{user}(C)}    (6.8)

where count_user(C) corresponds to the sum of the counts of all the primes used for concept C. The relative frequency is subtracted from 1, so that the more frequent words have the lowest \widehat{WER}_0. At the beginning of each session the primes are ranked according to this value.
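A minimal sketch of this on-line use of the model is given below, assuming a trained scikit-learn regression object and a dictionary of per-concept prime counts; the data structures and function names are illustrative only.

    def initial_wer_prediction(user_prime_counts):
        # Eq. 6.8: one minus the relative frequency of each prime for the concept,
        # so that the most frequently used primes start with the lowest predicted WER.
        total = sum(user_prime_counts.values())
        return {p: 1.0 - c / total for p, c in user_prime_counts.items()}

    def update_wer_prediction(previous_prediction, feature_vector, regression):
        # Eq. 6.7: the regression output r(F) scaled by the previous prediction.
        return previous_prediction * regression.predict([feature_vector])[0]

    def select_prime(predicted_wer_per_prime):
        # The prime used in the next system prompt is the one with the lowest prediction.
        return min(predicted_wer_per_prime, key=predicted_wer_per_prime.get)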
6.2.3.1 Test Set
In order to evaluate the new model for prime selection, a study was conducted where it was compared with other strategies for prime selection, all of them running exactly with the architecture shown in Figure 3.9b, with Cornerstone and the Unity-based ECA. These strategies were entrainment rules, random primes and fixed primes. The entrainment rules version performs the selection using the heuristics defined in Section 6.1.2.3. The random primes version randomly selected the primes to be used. The fixed primes version used only the prime that was most acoustically distinct from the remaining entries in the language model; this distance was found using a predefined acoustic distance between pairs of phones. The versions ran alternately on different days. The subjects were recruited via e-mail or a Facebook event. In either case they were given a web link to a page with the instructions. The task was to query for any bus schedule that they might actually take. Users were informed that different versions were running, although they did not know which one they were
testing. They were also given instructions about the press-to-talk interface that they were using. After they completed their request, they were asked to fill in a questionnaire based on the PARADISE questionnaire [140]. Two new questions were added to the questionnaire mentioned in Section 6.1.2.1: whether the system was able to propose alternatives when it encountered problems, and whether the user felt that the system was proposing words in a smart way. These new questions were used to help confirm whether users detected the system's adaptation of lexical choices. The complete questionnaire can be found in Annex C.2.3.
The list of primes used in this test is shown in Table 6.15. When comparing this table with Table 6.2, one may notice that some primes changed their type from intrinsic to non-intrinsic. This change was caused by the change of dialog manager. The confirmation prompt after all the slots are filled is no longer part of the dialog strategy. This means that Arrival Place, Origin Place and Time are only used when the system is requesting destination, origin and travel time, respectively. Thus, if they are incorrectly recognized, the course of the dialog should not change.
Type of Prime   Concept         Primes
Intrinsic       Next            próximo / seguinte
                Now             agora / imediatamente / neste momento / o mais rápido possível / o mais brevemente possível
                Price           preço / valor
                Start Over      outro percurso / nova procura / nova pesquisa / procurar novamente / outra procura / nova busca
Non-Intrinsic   Arrival Place   chegada / destino
                Origin Place    partida / origem
                Time            horas / horário

Table 6.15: Primes used by Noctívago in the data-driven model tests.
6.2.4 Results
A total of 214 dialogs were collected during these tests. 88 sessions were completed by female subjects, and 53 of the sessions were performed by people from the laboratory. 96 different users participated in these tests, 4 of whom were non-native speakers. 8 people that participated in these tests had already participated in the tests described in Section 4.3, and 23 had participated in the tests described in Section 6.1.2.2.1.
The dialogs were orthographically transcribed to compute prime- and word-level performance metrics, since these are expected to be greatly influenced by prime selection methods. The transcriptions were also parsed using an off-line version of the Phoenix semantic parser, with the same grammar used in the live system. Table 6.16 shows the performance in terms of Out-of-vocabulary words (OOV), WER and CTC. For WER and CTC, the results were further analyzed at the prime level and at the intrinsic prime level.
                    Data Driven   Rule Based   Random Primes   Fixed Primes
OOV (%)             8.92          11.10        11.68           22.18
WER (%)             27.84         33.05        35.68           33.29
WER Primes (%)      27.53         25.77        37.38           35.25
WER Intrinsic (%)   23.90         24.38        38.42           33.33
CTC (%)             68.88         65.24        65.78           66.18
CTC Primes (%)      65.22         72.58        48.20           55.28
CTC Intrinsic (%)   69.78         75.93        49.15           58.95

Table 6.16: OOV, WER and CTC results for the different versions. Statistically significant results in bold (one-way ANOVA with F(3) = 2.881 and p-value = 0.037).
Figure 6.6 shows a graphical representation of the results in Table 6.16.
These results show that both versions with a prime entrainment policy clearly outperformed the Random and Fixed Primes versions, both in reducing OOV and WER and in increasing the percentage of CTC in the dialog. The differences are particularly remarkable when the analysis is restricted to primes, and even more so for intrinsic primes, as can be observed in the graphs in Figure 6.6. Both the Data Driven and Rule Based methods were able to improve the system performance when dealing with turns containing primes. The differences achieved might have influenced the course of the interaction, improving the user experience.
Among the prime selection algorithms, the Rule Based method showed a slightly better
performance than the Data Driven one.
The relative frequency of User Prime Events is another important measure to evaluate in this study; it is shown in Table 6.17 for this dataset. The percentages are computed as in Equation 6.2.
Figure 6.6: OOV, WER and CTC results for the different configurations tested (panels: (a) OOV, (b) WER, (c) CTC).
                         Data Driven   Rule Based   Random Primes   Fixed Primes
Uptakes (%)              9.54          9.97         9.08            8.60
No Uptakes (%)           2.26          1.64         3.94            1.76
Non-Understandings (%)   0.57          1.03         0.72            1.14

Table 6.17: Entrainment Events and Non-Understandings relative frequencies. One-way ANOVA tests revealed no statistical significance.
The Entrainment Event results show a higher Uptake percentage for the versions with prime selection algorithms implemented, which could mean that the users felt more comfortable with the primes that the system was using. The results for No Uptakes interestingly show that the highest percentage occurs with Random Primes, where there was also more variance in the primes used. Although the differences were not statistically significant, the versions with prime selection algorithms were less prone to Non-Understandings than Fixed Primes.
Table 6.18 shows the distribution of the dialogs per version, the estimated dialog success, the
real dialog success, the average number of turns per dialog, and the average number of turns
with primes per dialog.
                                              Data Driven   Rule Based   Random Primes   Fixed Primes
Number of Dialogs                             57            53           52              52
Estimated Success Rate (%)                    96.5          98.1         94.2            98.1
Real Success Rate (%)                         82.4          82.6         84.9            86.9
Number of Turns (avg. per dialog)             14.75         13.21        13.73           13.15
Number of Turns w/ Primes (avg. per dialog)   2.82          2.34         2.67            2.37

Table 6.18: Dialog performance results.
These high-level measures show that Fixed Primes was the version with the best success rate and the lowest average number of turns per dialog. The statistical significance of the Real Success Rate, Number of Turns and Number of Turns with Primes was computed with one-way ANOVA tests, and the difference between methods was not statistically significant for any of these measures. However, since session success is computed based on the information returned to the user, it might not be the best measure of how effective the prime choice was. In fact, a session
could be successful even without the need to use any prime. For instance, if a user makes a
single request asking for a bus for a specific hour instead of asking for the next bus and ends
the dialog as soon as she/he receives the correct information, no prime is used in this dialog.
Compared to the previous tests held with Noctívago (Section 6.1.2.2), there was a significant improvement in the system performance. We believe that the new Dialog Manager, the new Language Models and the new ECA contributed to this improvement. These results show that the system has greatly increased in robustness since the first experiments presented in this thesis.
Finally, Table 6.19 shows the results of the questionnaire that many of the testers answered (184 questionnaires for 214 tests). The distribution of questionnaires per version, the average overall satisfaction and the average score in the entrainment-related questions are detailed. The maximum possible score for overall satisfaction was 33 and for entrainment satisfaction was 12.
                                     Data Driven   Rule Based   Random Primes   Fixed Primes
Number of questionnaires             50            46           47              41
Average Satisfaction Score           22.08         20.85        22.72           22.80
Average Entrainment Question Score   9.08          8.30         9.70            9.66

Table 6.19: Questionnaire results for the different versions.
These results confirm the informal comments from some users that they hardly noticed any difference between the versions. Even for the entrainment question set, the versions that had no algorithm implemented achieved higher scores, even though in the case of Fixed Primes the primes were not changing at all. Since there was a lot of variation in the Random Primes version, the users might have thought that the system was adapting in some way, although it was not. This could explain why the highest entrainment question score was achieved with this version.
6.2.4.1 Prime Usage Evolution
In the previous section the results showed that both the Data Driven and Rule Based methods outperformed the baseline methods. Our expectation was that the Data Driven method would perform slightly better than the Rule Based method; however, the opposite occurred. In order to have an overview of the prime selection, Figures 6.7 and 6.8 show the prime usage evolution during the testing days for the Data Driven (DD) and Rule Based (RB) methods, similarly to what was done in Section 6.1.3 for the prime evolution in Let's Go. Figure 6.7 shows the evolution for intrinsic primes, whereas Figure 6.8 shows it for non-intrinsic primes.
The choice of prime for Now and Next Bus almost never changed during the test period with either method. Próximo autocarro and agora are by far the primes most often chosen by both methods. This could mean that a large majority of the users prefer to use those primes instead of the other options available.
The case of the Price prime shows much variation in both methods, although the Data Driven method only used valor for the first few days. This could be a prime where each user has her/his own preference and the system is continuously changing to adapt to each user's preference. This is confirmed by the high number of No Uptake events for this concept.
The Start Over prime selection frequency shows some differences between the two methods.
The Rule Based method used Nova Pesquisa during the majority of the test period. On the
other hand, the Data Driven method modified the prime used towards the end of the test
period. Both methods used only a small subset of the prime candidates available.
As for the non-intrinsic primes, the choices for the Arrival Place and Origin Place primes show a similar distribution. There is a clear preference for one of the primes: chegada for Arrival Place and partida for Origin Place. For the Time prime, both methods kept changing the prime used during the test period. The reason for this change may be that, at the point of the dialog where the system uses this prime, a statistical language model is used, whereas at the other points where the system tries to do entrainment an SRGS grammar is used; since the grammar specification is based on previous data and forces the recognition output to be an in-grammar utterance, it does not allow any non-understandings.
[Figure 6.7 plots prime system usage (%) per test day for the Data Driven (DD) and Rule Based (RB) versions, in four panels: (a) Next Bus, (b) Now, (c) Price, (d) Start Over.]
Figure 6.7: Comparison between intrinsic prime usages in prompts between Data Driven (DD) and Rule Based (RB) prime selection.
[Figure 6.8 plots prime system usage (%) per test day for the Data Driven (DD) and Rule Based (RB) versions, in three panels: (a) Arrival Place, (b) Origin Place, (c) Time.]
Figure 6.8: Comparison between non-intrinsic prime usages in prompts between Data Driven (DD) and Rule Based (RB) prime selection.
6.2.5 Discussion
The results of these tests have shown that it is worth having an algorithm that performs lexical entrainment, although this still needs to be confirmed by a larger data collection. Fifty sessions per version seems insufficient to collect enough data for statistically significant results.
The performance of the Rule Based method was slightly better than that of the Data Driven method for the majority of the comparable items. This could mean that the intuition about entrainment that was implemented in the Rule Based method had not yet been learned by the Data Driven model. For instance, for the two concepts where there was more prime variance, Price (Figure 6.7c) and Time (Figure 6.8c), the Rule Based method's prime choice showed more variability than the Data Driven method's. This is probably due to the Short-Term entrainment phase within dialogs, where there is a stronger adaptation to the user.
Unlike the Let's Go test, none of the Noctívago tests with different entrainment policies resulted in the expected outcome. The configurations where the best performance (in terms of dialog success rate and average number of turns) was expected were not the ones that performed best. The first reason for this was already mentioned: a dialog can be successful without the use of any prime. But other reasons can explain why the dialog success rate might have been better with Fixed and Random Primes. One thing that might have made the algorithms for prime selection more effective in Let's Go than in Noctívago is that Let's Go uses context-dependent statistical language models in every dialog state. This increases the number of outputs that the ASR can produce and gives more freedom to the user. On the other hand, it also generates more non-understandings, which means more opportunities to update the prime rank. In Noctívago, in order to have a more robust system, we opted for SRGS grammars at some points of the dialog. The price paid for boosting the performance was the reduced chance of prime rank updates in
case of Non-understandings, as the results in Table 6.17 confirm. However, the goal of prime selection algorithms is precisely the reduction of non-understandings without limiting the user input. We believe that if users have more freedom in what they can say, the prime update algorithms will be even more effective than they were here.
The results in Table 6.19 show that the users were not able to distinguish between the systems that were adapting and those that were not. It might not be the easiest task for a user when a dialog is less than 14 turns long and primes are only used in 2 or 3 turns, which means that they only have 1 or 2 chances to perceive whether the prime was changed. It is curious that the users felt more adaptation in the Random Primes version. The explanation could be that the Random Primes version is very likely to modify the words used within the same dialog, unlike the other versions. Despite this observation, we strongly believe that automated entrainment can have a relevant role in the success of task-oriented dialog systems.
6.3 Summary
In this chapter we have presented two different approaches to automated entrainment: Rule-Based and Data-Driven. The Rule-Based approach (Section 6.1) was developed based on a set of heuristics created from lexical entrainment theory and the intuition developed during the tests described in Chapter 4. The Data-Driven approach (Section 6.2) was developed to learn how to entrain based on a set of entrainment-related features and other sources of information that could be extracted from live interaction. This method tries to predict the primes that are less likely to be incorrectly recognized based on this set of features.
The Rule-Based method was first implemented and tested in Noctívago. The results were inconclusive, but showed interesting trends that were used in the second version of the Rule-Based algorithm, tested with Let's Go. These tests have successfully shown that entrainment can improve the performance of a task-oriented SDS: the estimated success rate increased, accompanied by a decrease in the average number of turns per dialog.
The Data-Driven method was first tested in off-line data for both Noctı́vago and Let’s Go.
The off-line results were better for the Noctívago data than for the Let's Go data. A slight modification was made to the model when it was implemented in Noctívago: a scaling factor that tried to incorporate information about the previous turns into the prediction was added. This model was tested against three different methods for prime choice: Rule-Based, Random and Fixed choice. The results in terms of dialog success did not show any improvement from the automated prime selection methods, which was not surprising given the way it was computed. This was countered by the fact that prime WER and OOV percentage were lower, and prime CTC percentage higher, in the versions where automated prime selection was running than with Random or Fixed prime choice. The comparison between the methods for automated prime selection showed a slight advantage for the Rule-Based method in the majority of the comparable results. We believe that with a larger dataset to train the entrainment model these results could change.
7 Conclusions
This thesis has shown that lexical entrainment can influence the performance of spoken dialog systems. The adopted strategy was to make the system entrain to the user whenever this did not hinder the system performance, and to propose alternative primes when a prime was often followed by Non-Understandings or low-confidence turns. All the studies that were carried out have shown some sort of improvement when using a prime selection policy.
The studies were done with two parallel systems for the same domain, Noctívago and Let's Go, described in Chapter 3. The steps followed in the several iterations to create and improve Noctívago were described in detail. In fact, the process of creating a robust SDS for European Portuguese ran in parallel with the development of methods for on-the-fly prime selection. The robustness of the system has indeed increased progressively during the thesis, as can be seen from the results presented in Section 6.2.3.1. We have also managed to integrate dialog-state tracking in Noctívago, which corresponds to the state of the art in terms of dialog management, and we have contrasted the different modules used in Noctívago and Let's Go.
In Chapter 4, after identifying a set of prime candidates, we have created a list of synonyms
for the concepts included in the prime candidate set. A study was conducted during two
weeks, where in the first week the prime set used was fixed and in the second week the primes
were randomly picked from the synonyms in the prime candidate list. Despite the absence of
criteria in the prime selection in the second week, the system performance was better. There
was entrainment between the user and the system as we expected. In order to automate the
prime selection, the dialogs were carefully analyzed to observe the user behavior towards the
primes used. Two prime events were identified: Uptakes, when the user picked the prime from
the system prompt, and No Uptakes, when the user decided to use a different prime from the one used by the system. We observed that the primes with a higher number of No Uptakes corresponded to the most commonly used words in the language, which was confirmed by the correlation found between the number of No Uptakes and the number of hits of each prime in a web search engine, and also with the frequency of each prime in a spoken language corpus. This is already something that can be verified on the fly to decide whether the system should retain the prime or use a different one. Another important outcome of this study was the users' feedback: users appreciated the increased number of prompts available. How this could be combined with the improvement in the system performance was another problem raised by these tests.
The uncertainty in the SLU module strongly suggests that some sort of confidence score should validate the prime events detected. In Chapter 5, the dialog confidence score generated by the system was studied in order to improve its accuracy. Initially only one feature was
used in the confidence score computation. Since many of the system modules were replaced or
modified, the data collected in the previous Noctı́vago tests was used to train a new confidence
annotation model. Only the features that were common to every turn were used in the first
attempt. We observed that ASR confidence was the most weighted feature. The resulting
models outperformed the original model. However, since many features were left out, we
decided also not to use the non-understood turns in the training procedure. This reduced the
number of turns used to train the model, but on the other hand would allow the evaluation
of a larger set of features. In fact, the system behavior regarding Non-Understandings would
not change if the confidence model was different. The training procedure used also differed
and led to a different model than the one trained including the non-understood turns. In
this training procedure stepwise regression was used, which means that only helpful features
were added to the model. The proceeding of adding and testing a new feature was repeated
until the BIC or LLIK stop-criteria were satisfied. The resulting model only included dialoglevel features. The ASR confidence score was not amongst the features evaluated before the
model satisfied the stop-criterion. We have learned from this study that confidence models
are highly dependent on the system modules used. For instance, if the language model is
modified, the ASR performance is going to be different. Despite the high performance of the models trained over the baseline model, none of them was used in the tests held with automated prime selection, since there were no consecutive tests held with exactly the same system modules. The system was iteratively improving and consecutive experiments were held with different versions of the system, which did not let us re-train the model. Many system modules underwent modifications that were implemented in parallel with the development of the prime selection algorithm. Thus, these models were no longer valid for the configurations used after the moment they were trained.
Automating the prime selection was the next step to take. In Chapter 6 three studies were
conducted with two different methods for prime selection. Due to the limited data resources
available, our first approach to automated prime selection was rule based. The implemented
algorithm relied on User Prime Events, Non-Understandings and confidence score to update
the prime rank for every concept list. In order to be consistent with previous findings from the lexical entrainment literature and to circumvent the fact that neither system had user models, we designed an algorithm with two phases. The first phase, Long-Term Entrainment, captured the prime preference of the whole user population, whereas the second phase, Short-Term Entrainment, tried to adapt the system behavior to each user during the course of
the session. The algorithm was first tested in Noctı́vago with different configurations. The
results showed that the heuristics used to update the prime during Short-Term entrainment
were not improving the system performance. Thus, when envisaging new experiments with
Let’s Go, the algorithm was modified. The tests held with Let’s Go offered the possibility
of testing the Rule-Based algorithm with a large population of real users. The performance
of the system running the Rule-Based prime selection was compared to the performance of
the system before the rules were implemented. The results revealed a 2% absolute increase
(10 % relative) in the number of estimated successes and a reduction of one turn per dialog
(6 % relative reduction), which proves the effectiveness of the on-the-fly lexical entrainment.
The Rule-Based prime selection over the test period was compared to the acoustic distance
prime selection. The choice of the prime to be used was very clear for some concepts and less clear for others. This could mean that there are concepts where there is a prime that is used
by the large majority of the users, so there is no need for adaptation, whereas in other cases
the adaptation to the user during the session is more necessary. Despite the differences found,
the acoustic distance prime selection shared some of the primes that were the most common
when using the algorithm. For this reason, when building a new system from scratch, using
acoustic distance might be suitable to initially rank the prime candidates.
With the data collected from the Noctı́vago tests we were able to create a supervised learning
method for prime selection. The idea behind the method is to predict which of the listed
primes is more likely to be correctly recognized. In order to test this idea, an SVM-regression
was trained to predict the WER for the primes listed based on a set of features that included
the User Prime Events, Non-Understandings, confidence, the distance to the last system use
of the prime, and an entrainment measure that was correlated with success in human-human
task-oriented dialogs. The model required transcribed data in order to be trained. The Let's Go model was trained with data transcribed using crowdsourcing, whereas the Noctívago model used the previously collected data. The off-line test showed higher accuracy for the Noctívago model. The fact that it is an experimental system, tested by volunteer users, and used different primes may have contributed to a richer dataset for entrainment model training. The model was tested on the fly using a scaling factor representing the history of the dialog. The live tests compared this model with Rule-Based, Random and Fixed choice (the latter based on the acoustic distance). The dialog success rate of the system versions running a prime selection policy was below that of Random and Fixed prime choice. However,
the results for prime WER and prime CTC rate were better either using Rule-Based or
Data-Driven prime selection, which means that the acquisition of the concepts represented
by primes was better. The Rule-Based approach achieved a slightly better performance than
the Data-Driven one, probably because of the limited amount of training data.
Differences were observed in the behavior of the two types of users that tested our systems. These might have influenced the effectiveness of the algorithms for prime selection. The main concern of volunteer users was to get information from the system at any cost. They would change their initial intention if they believed that was necessary to complete the task.
They also picked up the terms from the system prompts more often than real users, since they believed that it was necessary to achieve session success. Real users, on the other hand, do not modify their initial intention, but they also adapt their strategy to address the system if they believe it is necessary, as long as it does not imply a modification of the initial intention. With both user types, we managed to exploit this adaptation to improve the system performance. In addition, the users' questionnaires show that in most cases they hardly noticed any lexical adaptation in the system. Since the primes that the system used are selected by a combination of user and system prime preference, the system should talk in a more natural way, and this was something that expert users reported as well.
The effectiveness of the prime selection algorithm was reflected in the Let's Go dialog success rate, unlike what happened for Noctívago. The Let's Go dialog is more mixed-initiative than the Noctívago dialog. Despite having context-dependent language models, it is possible to recognize concepts other than the most expected one at any point of a Let's Go dialog. On the one hand, this leads to an increased number of Non-Understandings, but on the other hand it increases the chances to apply the prime selection algorithms. When these conditions are met, the prime selection algorithms are more likely to have a positive impact on the dialog success rate.
Despite the positive results of this work, there are many things that can improve the way SDSs lexically entrain to users. Some of them are presented in the next section.
7.1 Future Work
Establishing a contrast between the different versions of the same SDS was one of our biggest difficulties. The amounts of data needed to obtain statistically significant results for some of the evaluation parameters are far beyond what we were able to collect. Performing the tests with volunteer users made it even more challenging, since we always depended on the willingness of the volunteers to test the system. A user simulator able to operate at the prompt level rather than at the action level could be a future solution to easily compare
different methodologies for prime selection. There is already some work on developing user simulators that operate at the prompt level [79, 66]. We believe that if they are enriched with some entrainment-based knowledge they can be a very helpful tool in the design of SDS lexical entrainment policies.
Another evaluation parameter that could be introduced in future tests is the naturalness of the prompt. Based on linguistic features and the speech synthesis accuracy, a naturalness measure could determine how natural a specific prompt is, and be compared with the observed Uptake percentage and the prime acquisition success.
In terms of the prime selection algorithms, the results from our last experiment show that the data-driven model's performance was slightly lower than the rule-based performance. To extend entrainment to other systems or to a different domain, a rule-based algorithm is not very advisable, since it requires the developer to create a different rule-based algorithm for each system developed. The lack of data used to train the model, and the fact that we privileged system robustness by using SRGS grammars in some contexts, might explain the better performance of the rule-based method. However, the model presented will always need transcribed data. Besides improving the current model with more data and new knowledge sources, another future challenge is the creation of an unsupervised method for learning prime selection. The use of reinforcement learning, for instance by integrating prime selection in the SDS-POMDP framework, is a possibility. However, this will add scalability problems when solving the reinforcement learning problem. Recent optimization techniques can make this solution possible.
One current limitation of the framework used for prime selection is that the prime candidate lists are limited to the synonyms found by the system developer for each concept. When analyzing the dialogs, we found that some users picked a different choice that the system was not ready to deal with, since it was not in the predefined prime list. It is unrealistic to think that a developer is capable of covering all the possible primes for a given concept. In the longer term, the system would benefit from an automatic prime learning procedure. In the shorter term, the extension of prime lists to concepts like place names or time events in the
domain used is a very likely possibility.
Porting prime selection to different domains is another future ambition. Different domains may offer longer dialogs and user-dependent dialog models, which will make adaptation easier. It would also be interesting to test these algorithms in dialogs that are not task-oriented. It has been reported that in human-human dialogs lexical entrainment is indeed stronger in task-oriented dialogs. Since humans' expectations of being understood by a machine are lower than of being understood by other humans, perhaps lexical entrainment could also be very helpful in non-task-oriented human-machine dialogs.
A Tools for Oral Comprehension in L2 Learning
A.1 Introduction
The vowel reduction and consonantal cluster simplification phenomena are well-known characteristics of European Portuguese that distinguish it from Brazilian Portuguese. These phenomena pose difficulties in oral comprehension for students of Portuguese as a second
language (L2). Thus, enhancing oral comprehension skills of L2 students became one of
the research goals of the REAP.PT project. The original REAP system [56], developed at
Carnegie Mellon University, focuses on vocabulary learning by presenting readings with the
target vocabulary words in context for students, either native or non-native. REAP.PT aims
at much more than just adapting the system to Portuguese [89]. The system explores all
types of language tools and corpora for automatically building language learning exercises,
adopting as much as possible a serious gaming perspective. Unlike the original REAP system,
which dealt exclusively with text documents, REAP.PT deals with audiovisual documents, in
particular Digital Talking Books (DTB) and Broadcast News (BN) stories. This exposure to
oral European Portuguese aims to make the L2 students more familiar with vowel reduction
and the simplification of consonantal clusters. Hence, besides the vocabulary learning contents
that were developed for the original REAP, and the new grammar drills and games that use
a plethora of NLP tools, REAP.PT also includes a number of features and exercises designed
to enhance oral comprehension skills.
This was the context for the initial work on oral comprehension of this thesis. This appendix describes the main steps, starting with the integration of a synthesizer to read words or sequences of highlighted words, and the use of audiovisual documents in REAP.PT.
A.1.1 Speech synthesis
The synthesizer was integrated as a webservice in the REAP.PT platform. It is available in
every text document. Students can highlight words or a sequence of words in the document
and select the “play” option to listen to the sentence read by the synthesizer. This generates
a request to the synthesizer webservice to read out loud the selected text. When searching
for the meaning of a particular word in the dictionary, the pop-up window also includes the
same “play” button to listen to the word selected. The synthesis is provided by DIXI, a
concatenative unit selection synthesizer [101] based on Festival [11].
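As an illustration, the client-side call that REAP.PT could issue when the student presses "play" might look roughly like the sketch below; the endpoint URL and parameter names are hypothetical placeholders, since the actual DIXI webservice interface is not detailed here.

import requests

def synthesize(text, url="http://example.org/dixi/tts"):
    """Ask the TTS webservice to read `text` aloud and return the audio bytes."""
    response = requests.get(url, params={"text": text, "lang": "pt-PT"}, timeout=10)
    response.raise_for_status()
    return response.content  # e.g. WAV data played back in the student's browser

audio = synthesize("uma sequência de palavras seleccionada pelo aluno")
with open("selection.wav", "wb") as f:
    f.write(audio)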
A.1.2 Digital Talking Books
Digital Talking Books, also known as audio books, were the first type of non-text materials
integrated in the multimedia REAP.PT platform. Audio books are mostly used for entertainment and e-inclusion applications, and their use for Computer-Assisted Language Learning
(CALL) is much more novel. They can be especially important for non-native students of
European Portuguese, giving them the possibility to listen to the whole story with the text
in front of them or to listen to a selected word sequence from the text. DTBs may have
drawbacks in terms of free accessibility, but their advantages in terms of controlled quality
recordings both at prosodic and segmental levels far outweigh these drawbacks.
In order to play a selection of highlighted words, the corresponding word boundaries need
to be determined. This was achieved using our automatic speech recognition system (ASR),
Audimus [96], in a forced alignment mode. Audimus is a hybrid recognizer whose acoustic
models combine the temporal modeling capabilities of hidden Markov models (HMM) with
the pattern classification capabilities of multi-layer perceptrons (MLP). Its decoder, which is
based on weighted finite state transducers (WFST), proved very robust even for aligning very
long recordings (a 2-hour-long audio book could be aligned in much less than real time).
For research purposes, we built a small repository of DTBs including a wide range of genres:
fiction, poetry, children's stories, and didactic textbooks.
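A minimal sketch of how the boundaries of a highlighted word sequence could then be recovered from the forced-alignment output is given below. It assumes CTM-style alignment lines ("file channel start duration word"), which is a common convention; the actual Audimus output format may differ.

def load_alignment(path):
    """Return a list of (word, start, end) tuples from a CTM-style alignment file."""
    words = []
    with open(path) as f:
        for line in f:
            _, _, start, dur, word = line.split()[:5]
            words.append((word, float(start), float(start) + float(dur)))
    return words

def span_boundaries(words, first_idx, last_idx):
    """Start and end time (in seconds) of the highlighted words[first_idx..last_idx]."""
    return words[first_idx][1], words[last_idx][2]

# Example: the audio player is then asked to play only this time interval.
# words = load_alignment("book_chapter1.ctm")
# start, end = span_boundaries(words, 10, 14)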
A later version of the multimedia REAP.PT platform improved on the original version developed
by the author, by integrating a karaoke-style presentation that highlights words as they are
spoken, as well as slow-down features.
A.1.3 Broadcast News
The repository of DTB stories may not cover the areas that will potentially interest
the students. In addition, their durations typically exceed the recommended content size for
REAP.PT materials. This motivated us to take advantage of a large corpus of manually
transcribed BN stories, used to train and test Audimus. Since transcriptions were available
at the utterance level, the alignment at word level was easily obtained using Audimus in
forced alignment mode. With the word-level alignment, the student can perform the
same actions previously performed with the DTBs. Besides widening the range of subjects of
the multimedia contents in REAP.PT, we believed that the use of automatically transcribed
recent broadcast news, instead of old news stories, might increase the student's motivation.
The next section describes the process that led to the integration of recent broadcast news
shows in the REAP.PT platform.
A.1.3.1 Integrating Automatically Transcribed News Shows in REAP.PT
The automatically transcribed broadcast news shows have several advantages for CALL: they
are recent, divided into short stories, topic-classified, and each story has an associated video.
The only drawback is the possible presence of recognition errors. The errors can either be marked,
to let the student know that the transcription may be incorrect, or filtered, removing the
stories with too many errors. This section describes the broadcast news pipeline and its
integration in the REAP.PT platform.
A.1.3.1.1 Broadcast News Pipeline The first module of the broadcast news pipeline
is the jingle detector, which detects the start and end times of the news show and also
removes advertising segments.
Audio Segmentation After the jingle detector, the audio is segmented using several components: three classifiers (Speech/Non-Speech, Gender, and Background), a speaker clustering module, an acoustic change detection module, and a speaker identification module that
identifies very frequent speakers for whom acoustic models have been previously trained (e.g.
anchors). All classifiers are multi-layer perceptron based. Errors in the segmentation
are propagated to the other modules of the pipeline. For REAP.PT, the impact of audio
segmentation errors is comparatively negligible. However, an error in identifying the most
frequent speaker (the anchor) may have a major impact on breaking the news show into stories, as
described below.
Automatic Speech Recognition The next step in the pipeline is speech recognition using
Audimus. The MLP/HMM architecture already mentioned combines posterior probabilities
from three phonetic classification branches, based on PLP (Perceptual Linear Prediction),
Log-RASTA (log-RelAtive SpecTrAl), and MSG (Modulation SpectroGram) features. The
WFST-based decoder has a large search space that results from the integration of the
HMM/MLP topology transducer, the lexicon transducer and the language model transducer. The
recognizer also yields a word confidence measure by combining several features in a maximum entropy classifier, whose output represents the probability of the word being correctly
recognized.
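The sketch below illustrates this kind of word-level confidence estimation with a logistic regression (maximum entropy) model; the feature names are illustrative placeholders and not the actual features used by Audimus.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_confidence_classifier(features, correct):
    """features: (n_words, n_features) decoder-derived features; correct: 1 if the word was right."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, correct)
    return clf

# word_features = np.array([[acoustic_score, lm_score, n_alternatives], ...])  # hypothetical features
# clf = train_confidence_classifier(word_features, labels)
# p_correct = clf.predict_proba(word_features)[:, 1]  # one confidence value per word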
The first acoustic model was trained on 46 hours of manually transcribed broadcast news.
Later, unsupervised training was adopted, using as training data all the words that were
recognized with a confidence score above 91.5%. The second acoustic model was trained with
378 hours, 332 of which were automatically transcribed. The final acoustic model was trained with
1000 hours, and resulted in 38 three-state monophones plus a single-state non-speech model
for silence, and 385 phone transition units which were chosen to cover the majority of the
word transitions in the training data.
The language model (LM) used is updated daily. It is a statistical 4-gram model and results
from the interpolation of three specific LMs: a back-off 4-gram LM, trained on a 700M-word
corpus of newspaper texts; a back-off 3-gram LM estimated on BN transcripts; and a back-off
4-gram LM estimated on web newspaper texts collected over the previous seven days.
The final interpolated language model is a 4-gram LM with modified Kneser-Ney smoothing.
The vocabulary is also updated on a daily basis from the web [88], which implies a re-estimation of the LM and the re-training of the word-level confidence measure classifier.
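The interpolation itself amounts to a weighted sum of the component model probabilities, as in the minimal sketch below; the component models, their interface and the interpolation weights are placeholders, since the actual toolkit and weights are not specified here.

def interpolated_prob(word, history, models, weights):
    """P(word | history) = sum_i lambda_i * P_i(word | history), with the lambdas summing to 1."""
    return sum(lam * lm.prob(word, history) for lm, lam in zip(models, weights))

# models = [newspaper_4gram, bn_transcripts_3gram, web_news_4gram]  # hypothetical LM objects
# weights = [0.5, 0.3, 0.2]                                         # hypothetical lambdas
# p = interpolated_prob("carreira", ("a", "próxima"), models, weights)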
Once the 100k vocabulary is selected, the pronunciation lexicon is created by dividing the words
into those that can be pronounced according to Portuguese pronunciation rules, and those
that do not follow those rules. The pronunciation of the first group is given either by
the in-house lexicon or by a rule-based grapheme-to-phoneme (GtoP) conversion module [35].
The remaining words, which might be acronyms or foreign words, are subsequently divided
into those whose pronunciation can be derived from the Carnegie Mellon public domain
lexicon, where a nativized pronunciation is derived by rule, and those whose pronunciation cannot
be derived from there. For the latter, grapheme nativization rules are applied before
their pronunciation is given by the GtoP module. The final multiple-pronunciation lexicon
generally includes 114k entries.
The Word Error Rate (WER) was computed on an evaluation set composed of six one-hour
news shows from 2007. Using a fixed LM/vocabulary, the result was 18.4%. The performance
differs significantly from clean read speech (typically below 10%) to spontaneous speech or
noisy environments (typically above 20%).
Punctuation and capitalization The next modules in the pipeline are responsible for
enriching the raw transcription provided by the speech recognizer. The first rich transcription
modules introduce punctuation and capitalization. The approach followed uses maximum
entropy models for both tasks [9].
Capitalization explores only lexical cues. The results were much worse for automatically transcribed BN (F-measure = 75.4%) than for manually transcribed BN (F-measure = 85.5%),
reflecting the effects of the recognition errors.
Besides lexical cues, punctuation also explores acoustic and prosodic cues. Initially, this
module only targeted commas and full stops. Results for full stop detection (F-measure =
69.7%) were better than for comma detection (F-measure = 50.9%). A later version of this
punctuation module targeted question marks as well.
Topic segmentation and indexation This module aims to split the BN shows into their
constituent short stories [8]. These shows typically consist of a sequence of segments that
are either stories or fillers (i.e. headlines/teasers detected by the audio segmentation
module). The division into constituent stories is done using a common heuristic: every
story starts with a segment spoken by the anchor, and is further developed by out-of-studio
reports and/or interviews. Hence, the simplest segmentation algorithm consists of defining
potential story boundaries at every non-anchor / anchor transition. This heuristic can be
refined using a Classification and Regression Tree based approach that deals with double
anchors, local commentators and thematic anchors (e.g. for sports or meteorology). The
segmentation obtained using this module achieved an F-measure of 84%.
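A minimal sketch of the baseline heuristic is shown below: a potential story boundary is placed at every transition from a non-anchor segment to an anchor segment, with speaker roles assumed to come from the segmentation and speaker identification modules.

def story_boundaries(segments):
    """segments: list of (start_time, role) pairs, where role is 'anchor' or 'other'."""
    boundaries = [segments[0][0]]  # treat the start of the show as the first boundary
    for prev, curr in zip(segments, segments[1:]):
        if prev[1] != "anchor" and curr[1] == "anchor":
            boundaries.append(curr[0])  # non-anchor -> anchor transition
    return boundaries

# story_boundaries([(0.0, "anchor"), (35.2, "other"), (92.7, "anchor"), (130.4, "other")])
# -> [0.0, 92.7]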
After segmentation, the topic indexation module assigns one or multiple topics to each story,
as detailed in [2]. The set of 10 topics is: Economy, Education, Environment, Health, Justice,
Meteorology, Politics, Security, Society, and Sports. A further classification into National
and International is also done. For each class, a topic and a non-topic unigram language model
were created using the stories of a media-watch corpus, which were pre-processed in order to
remove function words and lemmatize the remaining ones. The topic classification is based on
the log-likelihood ratio between the topic language model and the non-topic language model.
A predefined threshold is set to assign a story to a topic. A different threshold is set for each
topic in order to account for the differences in language model quality. The accuracy obtained
on a held-out test set was 91.8%. The use of support vector machines (SVMs) improved the
results of later versions of this module.
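A minimal sketch of this log-likelihood-ratio assignment is given below; the unigram probabilities, thresholds and preprocessing (function-word removal and lemmatization) are assumed to be available, and the dictionaries used here are placeholders.

import math

def llr_score(tokens, topic_lm, nontopic_lm, floor=1e-7):
    """Sum over tokens of log P(w | topic) - log P(w | non-topic), with a probability floor."""
    return sum(math.log(topic_lm.get(w, floor)) - math.log(nontopic_lm.get(w, floor))
               for w in tokens)

def assign_topics(tokens, topic_models, thresholds):
    """Return every topic whose LLR exceeds its topic-specific threshold."""
    return [topic for topic, (topic_lm, nontopic_lm) in topic_models.items()
            if llr_score(tokens, topic_lm, nontopic_lm) > thresholds[topic]]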
The final module follows a simple extractive summarization strategy in order to create a
summary of the story by choosing its first sentence.
A.1.3.1.2 Integration in REAP.PT REAP.PT is accessed via a web interface. Students
take a pre-test when they first log in to the platform. In the pre-test, they select the words
they know from a list, so that they are assigned one of the 12 school
levels. After the pre-test, the student has many options available: group reading, individual
reading, multimedia documents, and topic interests. The task in the first two options is
similar and text-based. The only difference is that in the first case the text is chosen by the
teacher and is available to all the students in the class, whereas in the second option the
student is free to choose a text from a given set. The topic interests tab allows the student
to set the preferred topics. This information is stored in the student database.
The multimedia documents include the DTBs and the BN shows. The interface for the BN is
shown in Figure A.1. The left part of the screen shows the news show divided into stories.
Navigation can also be done by topic. The right part of the screen shows the automatically
transcribed story, divided into speaker turns.
Figure A.1: Recognized BN interface in REAP.
The interface also has a teacher menu that allows the teacher to rate the quality of the document, estimate a readability level, select documents for group reading, discard documents,
and insert new questions to be answered at the end of each reading session. Teacher reports
can also be generated from the platform.
After each reading session, the system updates the student model to dynamically adjust the
contents to the student.
Document Filtering Unlike the text documents for group and individual reading, the
automatically transcribed texts are not filtered to satisfy the REAP restrictions. That is, text
documents have to respect some pedagogical constraints, such as text length (300 words), and
need to have a minimum number of words from the target list. Texts with profanity
are also filtered out, as are documents containing just word lists.
The text length filter was discarded when using BN, since the average number of words per
story in the evaluation corpus was 280. The topic segmentation module avoids too-short
stories by merging neighboring stories, in order to have enough evidence to assign the
topic. There are cases where the stories may be very long, such as major disasters. In those
cases, a warning to stop reading after 1000 words is given to the student. The profanity
filter is not needed, since those words are excluded from the output of the ASR. The same
applies to the word list filter.
The last filtering stages (topic and readability) are common to both text and multimedia
documents. The only difference is that the words with a low confidence level in the multimedia
documents are excluded from this filtering stage.
The readability classifier aims to assign to each document a grade according to the Portuguese school levels. The readability classifier used for the texts in the reading tasks was
trained with textbooks for native Portuguese students, from grades 5-12. The initial version is based
on lexical features, such as statistics of word unigrams. A Support Vector Machine was
trained using the SMO tool implemented in WEKA [149]. The multimedia documents were
also classified with this readability classifier, due to the difficulty of obtaining multimedia
documents classified according to the school levels. This classifier produced interesting results
on the held-out test set of textbooks [57]: root mean square error (0.448), Pearson's correlation
(0.994), and accuracy within 1 grade level (1.000).
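For illustration, a classifier in the same spirit can be sketched with scikit-learn instead of WEKA's SMO, using plain word-unigram counts as lexical features; the corpus variables and hyper-parameters below are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_readability_classifier(texts, grades):
    """texts: list of documents; grades: school level (5-12) of each training document."""
    model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
    model.fit(texts, grades)
    return model

# model = train_readability_classifier(textbook_texts, textbook_grades)  # hypothetical corpus
# predicted_grade = model.predict([bn_story_text])[0]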
Interestingly, the BN stories were classified between levels 7 and 11, with an average of 8.
However, some words in the BN stories could not be used in the classification, since
they are not covered by the vocabulary from the textbooks, which, despite being very large (95k
words), is still very different from the dynamic vocabulary used in the recognition of BN.
Document display The text placed on the right side of Figure A.1 is created from the XML
file produced by the BN processing chain in Figure A.2. Words that are underlined and
colored blue belong to the list of words that the student is supposed
to learn. That list, like similar lists for other languages such as English and French, was designed for
language learning. It is designated the Portuguese Academic Word List (P-AWL [7]) and
is composed of 2,019 different lemmas, together with their most common inflections, each
one tagged with the corresponding part-of-speech (POS). The resulting list has a total of
33,284 words.
Figure A.2: Broadcast News Pipeline.
A dictionary is also available by clicking on any word. A pop-up window shows the definition
obtained from an electronic dictionary (from Porto Editora) and the corresponding POS tag.
This window also includes the synthesizer button whose implementation was
described previously (Section A.1.1). The student can also play highlighted word sequences
in the same way that was implemented for DTBs (Section A.1.2).
To alert the student that the displayed text may have recognition errors, the words with a
confidence score below 82% are shown in red. Recently, new features have been added to the
computation of the confidence score [103]. The word-by-word comparison between two different ASR systems
and a feature based on the number of hits of the hypothesized bigrams
in a web search engine, when combined with the decoder-based features, led to remarkable
improvements in the detection of recognition errors (13.87% in the baseline, 12.16% adding
the new features to the baseline).
Vocabulary exercises Every action of the student is tracked during the reading session,
including accesses to the dictionary, in order to keep updating his/her progress. Independently of the type of document, each reading session is followed by a series of cloze,
or fill-in-the-blank, questions about the underlined blue-colored words. These vocabulary
questions are manually selected from a set of 6000 where the distractors are automatically
generated. The methods for distractor generation for REAP.PT are compared in [38].
Later versions of the REAP.PT platform included a number of grammar-based exercises and
games, as well as exercises with the specific focus of enhancing oral comprehension skills.
A.2 Summary
This appendix described a set of tools that were added to REAP.PT to offer students the
possibility of practicing their oral comprehension skills, namely a speech synthesizer
and multimedia contents such as DTBs and BN stories.
Two different types of BN stories were placed in the platform: manually transcribed and
automatically transcribed. The integration of the automatically transcribed stories was described
in detail. The first part of the process includes the generation of the news content by the BN
processing pipeline. Then the REAP.PT processing tools prepare the document (readability
classification and filtering) to be included in the multimedia materials section.
So far there has been no objective evaluation of the use of multimedia materials. Informal tests
held with more than 10 non-native students revealed that they found it a motivating feature
in their learning process. They also suggested that each word could be highlighted as it was
spoken (karaoke style), which was later implemented.
The current REAP.PT platform builds on the initial work with multimedia materials developed in this thesis, but includes many other innovations, namely serious games. One of them
places the students in a 3D environment, to try to make them learn the locative prepositions used
to describe the spatial position between objects [125]. The other is a competitive
translation game, where the student plays against an automated agent that uses a statistical
machine translation module to guess the hidden words in a sentence in the target language
[82].
Serious games were also developed for listening comprehension, using automatically selected
excerpts of BN stories. One of them asks the student to order the words as they appear in
the sentence [104]. The other asks the students to select from a list of words those that they
heard in the excerpt [37]. These games were highly appreciated by students.
One of the recently designed games is a version of the locative prepositions game including
speech input, but this version has not yet been tested with students. In fact, since the student
has to complete tasks to successfully finish the game, the scenario could be used to design
dialogs where the student could entrain to the game and improve his/her oral skills. This
would require a more complex type of dialog management.
B Nativeness Detection
B.1 Introduction
This annex describes some experiments regarding automatic nativeness detection. Generally, non-native speech may deviate from native speech in terms of morphology, syntax and
the lexicon, which is naturally more limited than for adult native speakers. As far as morphology is concerned, the main problems faced by non-native speakers concern producing correct forms
of verbs (namely when irregular), nouns, adjectives, articles, etc., especially when the morphological distinction hinges on subtle phonetic distinctions. The main difficulties in terms of
syntax concern the structure of sentences, the ordering of constituents and their omission
or insertion. In addition, non-native speech typically includes more disfluencies than native
speech, and is characterized by a lower speech rate [138].
None of the above difficulties are very prominent in highly proficient speakers. However, their
speech frequently maintains a foreign accent, denoting interference from the first language
(L1), both in terms of prosody and of segmental aspects.
The segmental deviations in non-native speech can be quite mild, when speakers use phonemes
from their L1 without compromising phonemic distinctions, but they may also have a strong
impact on intelligibility whenever phonemic distinctions are blurred, a frequent occurrence
when the phonetic features of L2 are not distinctive in L1.
The prosodic deviations may involve errors in word accent position [62]. Rhythmic traits,
however, are generally regarded as the main source of the low intelligibility of L2 speakers.
Although some authors attribute these difficulties to the differences in rhythm between L1 and
L2, namely when one of the languages is stress-timed and the other is syllable-timed [112, 52],
others claim more complex distinctions (for instance, syllables not carrying the word accent,
which are weak in stress-timed languages, are produced stronger in syllable-timed languages).
The literature on automatic nativeness detection is still scarce, but it is tightly connected to
accent detection in general [5, 70, 10]. Some of the approaches use acoustic features, whereas
others are based on prosodic cues. Kumpf and King, for instance, used HMM phone models
to identify foreign accents in Australian English speech [73]. Bouselmi et al. tried another approach, using discriminative phoneme sequences to detect speakers' origin through their accent when speaking
English [23]. Piat et al. [105] modeled variations in duration, energy and fundamental frequency at the syllable level for each word in the corpus. The accent identification was done
using one HMM per feature. They compared this approach with MFCCs. A combination of the best prosodic parameters with MFCC coefficients was then evaluated and achieved the best
results.
Features used in pronunciation scoring can also be useful for nativeness detection. Tepperman and Narayanan [132] tried to improve pronunciation scoring using information about
intonation and prosodic structure. In that study they used techniques often applied to tonal languages (f0
processing, theoretical grammars for the recognition of intonation events, and contextual effects).
Using HMMs to represent the intonation units, the correlation between the
predicted score and the annotated pronunciation score increased significantly. Hönig et al.
used a set of standard prosodic features to train a k-class classifier for word accent position
[61]. The feature set was later enriched with the speech rate, the duration of consecutive consonantal or vocalic segments, and the percentage of vocalic intervals. A multiple linear regression
with cross-validation was trained to select the relevant feature set. The goal was to find the
features that best described intelligibility, accent, melody and rhythm and, in addition, to find the
feature set that could classify at a supra-segmental level [62].
Prosodic features were also used in language identification. Lin and Wang approximated each
pitch contour segment by a set of Legendre polynomials, the polynomial coefficients being the
feature vector used to represent the pitch contour [32]. Several Gaussian Mixture Models were
trained with the features generated by the Legendre polynomials and other features, such as
the duration of the segment or the energy difference in the segment where the pitch contour
was extracted.
Here a different approach to the problem will be adopted, using state-of-the-art techniques
for language identification to distinguish between the characteristics of native and non-native
speakers. The goal is to be able to detect prosodic and segmental deviations in highly proficient speakers, like most speakers in TED Talks. The corpus will be described in the next
section. The main body of this appendix is devoted to the description of our approach, which
uses prosodic contour features to complement the acoustic ones.
Although the study has been conducted for English, some of the characteristics referenced
are common to non-native speakers regardless of their first language.
B.1.1 Corpus
The collected corpus comprises 139 TED talks. The subset of native English speakers includes
71 talks by US speakers. The remaining 68 talks are from non-native speakers, i.e., speakers
from any other country that does not have English as an official language. To simplify the
problem at this first stage, speakers from other English-speaking countries such as the UK and
Australia were not involved in the experiments. The fact that the non-native speakers may
have lived in an English-speaking country for some time was ignored in this classification,
place of birth being the major factor.
The first step in processing this corpus was to perform audio segmentation, separating speech
segments from non-speech segments. Next, speech segments that were at least 1 second long
were divided into train and test subsets, making sure that each talk was used only in one of
the subsets.
More details on the corpus can be found in Table B.1, which shows, for the training and test
sets, the duration, the total number of segments and the percentage of segments in each of the
duration ranges considered.
                     Train                       Test
                Native   Non-Native         Native   Non-Native
Duration (min)   683.4      616.8            182.9      218.1
Segments (#)      1276       1299              400        548
<3 s (%)           2.1        4.8              4.5        2.2
3s-10s (%)         4.5        7.1              6.8        4.4
10s-30s (%)       60.0       64.9             62.2       74.1
>30 s (%)         33.3       23.3             26.5       19.3

Table B.1: Details of the training and test sets for Native and Non-Native data.
B.1.2 Nativeness Classification
In this section, the acoustic and prosodic approaches to the nativeness classification problem will be described. Finally, the procedure used to combine both systems will be presented.
B.1.2.1 Acoustic Classifier Development
A method generally known as Gaussian supervectors (GSV) was first proposed for speaker
recognition in [33]. In the last few years, however, this approach has also been successfully
applied to language recognition tasks. GSV-based approaches map each speech utterance to
a high-dimensional vector space. Support Vector Machines (SVMs) are generally used for the
classification of test vectors within this space. The mapping to the high-dimensional space
is achieved by stacking all the parameters (usually the means) of an adapted GMM into a single
supervector, by means of a Bayesian adaptation of a universal background model (GMM-UBM) to the characteristics of a given speech segment. In this work, we apply the GSV
approach to the nativeness detection problem.
B.1.2.1.1 Feature Extraction The extracted features are shifted delta cepstra (SDC)
[136] of Perceptual Linear Prediction features with log-RelAtive SpecTrAl speech processing
(PLP-RASTA). First, 7 PLP-RASTA static features are obtained and mean and variance
normalization is applied on a per-segment basis. Then, SDC features (with a 7-1-3-7 configuration) are computed, resulting in a feature vector of 56 components. Finally, low-energy
frames (detected with the alignment generated by a simple bi-Gaussian model of the log-energy distribution computed for each speech segment) are removed.
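The sketch below computes SDC vectors with an N-d-P-k = 7-1-3-7 configuration; appending the 7 static coefficients to the 7x7 delta values yields 56 components, which is one common way of reaching the dimensionality mentioned above (the exact stacking used in the actual implementation is an assumption here).

import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """cepstra: (T, N) static features; returns (T, N * (k + 1)) static + SDC vectors."""
    T, N = cepstra.shape
    pad = d + (k - 1) * P
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    out = np.empty((T, N * (k + 1)))
    for t in range(T):
        blocks = [cepstra[t]]                                   # 7 static coefficients
        for j in range(k):                                      # 7 delta blocks
            blocks.append(padded[pad + t + j * P + d] - padded[pad + t + j * P - d])
        out[t] = np.concatenate(blocks)                         # 7 + 7 * 7 = 56 values
    return out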
B.1.2.1.2 Supervector Extraction In order to obtain the speech segment supervectors,
a Universal Background Model (UBM) must first be trained. The UBM is a single GMM
that represents the distribution of speaker-independent features. This is done in order
to deal with the variability that characterizes accent recognition. All the training data available (native and non-native segments together) were used to train GMM-UBMs with 64 and 256
mixtures, resulting in two different GSV-based systems.
A single iteration of Maximum a Posteriori (MAP) adaptation with relevance factor 16 is
performed for each speech segment to obtain the high-dimensional vectors.
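The sketch below shows one possible realization of this step: mean-only MAP adaptation of the UBM to a segment, followed by stacking of the adapted means. scikit-learn's GaussianMixture stands in for the UBM purely for illustration; the actual implementation used in the thesis is not specified here.

import numpy as np
from sklearn.mixture import GaussianMixture

def gsv_supervector(ubm: GaussianMixture, frames: np.ndarray, relevance: float = 16.0):
    """frames: (T, D) features of one segment; returns the stacked adapted means (M * D,)."""
    post = ubm.predict_proba(frames)                          # (T, M) mixture responsibilities
    n = post.sum(axis=0)                                      # soft counts per mixture
    ex = (post.T @ frames) / np.maximum(n[:, None], 1e-10)    # posterior-weighted means
    alpha = (n / (n + relevance))[:, None]                    # data-dependent adaptation weights
    adapted = alpha * ex + (1.0 - alpha) * ubm.means_         # single MAP iteration, means only
    return adapted.ravel()

# A linear SVM is then trained on these supervectors, with native segments as
# positive examples and non-native segments as negative examples.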
B.1.2.1.3 Nativeness modeling and scoring Our classification problem is binary.
Therefore only one classifier needed to be trained. The linear SVM kernel of [33], based
on the Kullback-Leibler (KL) divergence, is used to train the target models with the
LibLinear implementation of the libSVM tool [42].
The native supervector set provides the positive examples, whereas the non-native
set is used as background or negative samples. During testing, the supervectors of the test
speech utterances are fed to the binary classifier to generate a nativeness score.
B.1.2.2 Prosodic Classifier Development
In this work, we apply prosodic contour features for nativeness classification in
order to complement the acoustic system described above.
B.1.2.2.1 Prosodic contour extraction The Snack toolkit [72] is used to extract the
log-pitch and the log-energy of the voiced speech regions of every utterance. Log-energy
is normalized on an utterance basis. The prosodic contours are segmented into regions by
splitting the voiced regions whenever the energy signal reaches a local minimum (the minimum
length of the regions is 60 ms). We use a 3rd-order derivative function of the log-energy to
find local minima. For each region, the log-energy and log-pitch contours are approximated
with a Legendre polynomial of order 5, resulting in 6 coefficients for each contour. The final
feature vector is formed by the two sets of contour coefficients and the length of the syllable-like
region, which results in a total of 13 elements (6 + 6 + 1).
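One reasonable realization of this parameterization, using numpy's Legendre fit on a normalized time axis, is sketched below; the frame-shift value is an assumption.

import numpy as np
from numpy.polynomial import legendre

def region_features(log_pitch, log_energy, frame_shift=0.01, order=5):
    """log_pitch, log_energy: 1-D arrays over one voiced, syllable-like region."""
    t = np.linspace(-1.0, 1.0, len(log_pitch))            # normalized time axis
    pitch_coef = legendre.legfit(t, log_pitch, order)     # 6 coefficients
    energy_coef = legendre.legfit(t, log_energy, order)   # 6 coefficients
    duration = len(log_pitch) * frame_shift               # region length in seconds
    return np.concatenate([pitch_coef, energy_coef, [duration]])   # 13-dimensional vector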
B.1.2.2.2 Nativeness modeling and scoring Two different approaches were followed
to train the nativeness detector that uses prosodic features. First, GMM models were trained
for native and non-native speech following the conventional GMM-UBM approach
that is also applied in [32]. The UBM was estimated using all the training data, and the native
and non-native GMMs were obtained by MAP adaptation of the UBM with all the
native and non-native training data, respectively. In this case, only a 64-mixture UBM was
trained, due to the reduced amount of training vectors resulting from the fact that each feature
vector now corresponds to a syllable-like segment of variable duration. The MAP adaptation
was done with one iteration step. A second modeling approach was also developed, based on
the Gaussian supervector technique described in Section B.1.2.1.2, but now with the prosodic
features. The MAP adaptation step of this approach uses the same UBM as the
previous approach.
During testing, log-likelihood ratios of the native and non-native GMMs are computed for each
test speech utterance in the case of the GMM-UBM based system. In the GSV case, test
supervectors are fed to the binary classifier to generate a nativeness score.
B.1.2.3 Calibration
Each individual system was calibrated using the s-cal tool available with the Focal toolkit [31].
Additionally, Linear Logistic Regression is applied to the calibrated scores; it is also used
for the fusion of the acoustic and prosodic systems whenever necessary. A cross-validation
strategy is carried out on the test data set to simultaneously estimate the calibration and
fusion parameters and evaluate the system.
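A minimal sketch of such a linear logistic regression fusion is given below, using scikit-learn in place of the FoCal toolkit; the cross-validation protocol itself is not reproduced, and the variable names are placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(acoustic_scores, prosodic_scores, labels):
    """Scores: one value per segment from each system; labels: 1 = native, 0 = non-native."""
    X = np.column_stack([acoustic_scores, prosodic_scores])
    fusion = LogisticRegression()
    fusion.fit(X, labels)
    return fusion

# fusion = fuse_scores(train_acoustic, train_prosodic, train_labels)   # hypothetical folds
# fused = fusion.decision_function(np.column_stack([test_acoustic, test_prosodic]))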
B.1.3 Results and discussion
Table B.2 presents experimental results aimed at comparing the performance of the
acoustic-based Gaussian supervector systems with different numbers of mixtures: GSV-acoustic
64 and GSV-acoustic 256. Both Accuracy (Acc) and Equal Error Rate (EER) scores are computed for Native and Non-Native segments separately and for the overall test data. Results
are also broken down by segment duration.
            Native                Non-Native            Total
            GSV-64    GSV-256     GSV-64    GSV-256     GSV-acoustic 64        GSV-acoustic 256
            Acc (%)   Acc (%)     Acc (%)   Acc (%)     Acc (%)    EER (%)     Acc (%)    EER (%)
<3s          35.7      35.7        80.0      80.0        47.4       24.3        47.4       30.0
3s-10s       66.7      48.2        66.7      81.0        66.7       28.3        62.5       30.7
10s-30s      82.7      88.4        87.0      91.4        85.3       15.1        90.2       10.1
>30s         89.6      91.5        85.9      82.1        87.7       10.4        86.8        7.6
Total        81.8      84.6        85.9      89.0        84.2       16.1        87.2       13.1

Table B.2: Detailed results for GSV-acoustic 64 and GSV-acoustic 256 (abbreviated GSV-64 and GSV-256 in the Native and Non-Native columns).
These results show a generally better performance of the GSV-acoustic 256 system relative
to the GSV-acoustic 64 classifier. In fact, increasing the supervector dimensionality allows
considerable nativeness detection improvements. As expected, both classifiers perform better
on longer segments, which is a well-known result in language recognition and other
similar problems. Notice, however, that duration-independent calibration parameters were
estimated, and the performance loss on shorter segments could be partially alleviated if
duration-dependent calibrations had been estimated as well. This effect can be particularly
important given that most of the test data segments have durations between 10 and
30 seconds. Finally, it is worth noticing that the detectors seem to be biased towards the
non-native class for the shorter segment durations. This may also be related to a
miscalibration problem. For the longer segment durations, both Native and Non-Native
accuracies are quite balanced.
                    alone                     fusion
               Acc. (%)   EER (%)        Acc. (%)   EER (%)
GSV-prosodic     68.9       40.7           89.1       10.6
GMM-prosodic     71.1       30.1           89.4       10.6

Table B.3: Results obtained using prosodic features (Accuracy and EER) and the fusion
between the prosodic systems and GSV-acoustic 256.
Table B.3 shows the Accuracy and EER results obtained for the two prosodic systems: the
Gaussian supervector classifier (GSV-prosodic) and the GMM classifier (GMM-prosodic). In
addition to the individual performance (column "alone"), the detection results of the prosodic-based systems fused with the best acoustic system (GSV-acoustic 256) are also presented.
The results in Table B.3 show that the performance of both prosodic systems alone is far below
that of the acoustic systems, as is also clear from the DET curves in Figure B.1. However,
when fused with GSV-acoustic 256, the combined system considerably improves on the individual
acoustic system.
Against initial expectations, the GMM-prosodic system performs better than the GSV-prosodic one. This
may be related to the reduced amount of prosodic feature vectors, which may influence the
way MAP adaptation should be carried out for supervector extraction: number of iterations,
relevance factor, etc. This possibility, together with the use of mean- and variance-normalized
prosodic features, has been investigated, but no conclusive results were obtained so far. In
any case, only slight performance differences between the two prosodic systems are observed when
they are fused with the acoustic classifier.
Finally, the best-performing nativeness detector is the one based on the fusion of the GSV-acoustic 256 and the GMM-prosodic sub-systems, as the DET curve in Figure B.1 shows.
The incorporation of the prosodic features permits absolute performance improvements of
2.2% and 2.5% in terms of Accuracy and EER, respectively, relative to the original detector
based only on acoustic features.
Figure B.1: DET curves of the GSV-acoustic 256, GMM-prosodic and fusion between both
systems (miss probability vs. false alarm probability, in %).
B.1.4 Human Benchmark
The above results are below those for state-of-the-art language recognition, where the EER is around
3% [34]. The automatic detection of non-native accents in proficient speakers, however, seems to be a far
more difficult task. In order to find out whether this is also a difficult task for humans, we
compared the above results with human performance on the same task. For that purpose,
3 native Portuguese speakers fluent in English were asked to classify 45 segments. A set of 50
segments was randomly chosen from our test corpus, but 5 of them were discarded because
they contained semantic cues (e.g. one of the speakers actually said "I'm French").
Comparing the original segment classification with the classifications of each of the 3
subjects and the ones produced by the automatic classifiers produced discrepancies in only 10
files. Most differences occur in segments where the highly proficient speakers have a barely
discernible non-native accent that also eludes some listeners. On this subset, the accuracy of
the subjects averages 92.22%, whereas the best fused system achieves 89.80%. We
obtained a Cohen's kappa of 0.88 for the inter-rater agreement of the 3 subjects. The Cohen's
kappa between the 3 subjects and the best fused classifier varied from 0.80 to 0.87.
These results show that on this task the automatic performance is very close to the human
performance. The introduction of new prosodic features, together with the enlargement of the
data set, could help improve the system.
B.2 Summary
This annex reported an experiment with an automatic nativeness detector for English. The
first attempt used only acoustic features, with Gaussian supervectors used to train a
classifier based on support vector machines. The final result was a 13.11% equal error rate.
The combination with prosodic features reduced the equal error rate to 10.58%. Prosodic
features on their own were not very discriminative, possibly due to the reduced number of
frames per file used in their computation.
The comparison between the performance achieved and the human benchmark showed that
the current method's performance is very close to human performance.
The fact that the speakers in the database are very fluent made the task even more difficult.
Thus, the results are below what was found in language recognition with a similar
method.
Unfortunately, since there are very few TED talks in which non-native speakers do
not speak English, it was not possible to extend this work to other languages.
C Resources used in the experimental sets
C.1 Scenarios used in Noctívago tests
Expected input: from the airport to the Oriente train station, now (agora) and also get the
information of the following bus.
Figure C.1: Scenario 1.
Expected input: from Largo de Camões to Cais do Sodré, Saturday (sábado) at 1:15 am.
Figure C.2: Scenario 2.
Expected input: from Cais do Sodré to Alameda, at 3:00 am and also get the information of
the previous bus.
Figure C.3: Scenario 3.
C.2 Questionnaires
C.2.1 Questionnaires used in Section 4.3
Both questionnaires requested the e-mail address, the type of device used to call (landline, cellphone or VoIP),
whether the information was correctly received in the three scenarios, the country of origin, and a comment box for any suggestions the users might have to improve the system performance.
Figure C.4: Questionnaire used in the first week.
During the second week, two new questions were added: whether the users felt better understood
and whether they noticed that the system had evolved. If they answered positively to this last question,
they had another comment box to say where they noticed the evolution.
Figure C.5: Questionnaire used in the second week.
C.2.2 Questionnaire used in Section 6.1.2.2
The questionnaire used was based on the PARADISE framework for the evaluation of spoken
dialog systems [141]. It includes the statements:
• the system was easy to understand
• the system understood what you said
• you found the information you were looking for
• you found the pace of the interaction appropriate
• the system was slow giving the answer
• the system behaved as expected
• you would use this type of system to get transport schedule information
• the system became easier to understand throughout the interaction
This last statement was added to the PARADISE questions to capture whether entrainment was
helping the system to better understand the user. Users answered on a Likert scale
where 1 means that they totally disagree and 5 that they totally agree with the statement.
Since they were asked to make three consecutive requests to the system, they rated the
information they got on a 5-degree scale from "never correct" to "always correct".
The questionnaire also had a checkbox to schedule reminders to test the different configurations, a field for the country of origin, and a comment box.
Figure C.6: Questionnaire Rule Based tests.
C.2.3 Questionnaire used in Section 6.1.2.2
The questionnaire in Figure C.7 is similar to the one presented in the previous section. We
decided to remove the statements about how the system was perceived, the pace of the
interaction and the quickness of the answer. On the other hand, we increased the focus on
entrainment-related questions, explicitly asking the users whether they perceived entrainment effects,
without ever mentioning the word entrainment. The statements introduced were:
• the system was able to propose alternatives when it was not able to understand me
• some words were appropriately modified by the system
We also reduced the number of possibilities in the Likert scale to 4, to avoid middle-column
biasing.
Figure C.7: Questionnaire Data Driven tests.
Bibliography
[1] A Method for Evaluating and Comparing User Simulations: The Cramer-von Mises
Divergence, 2007. IEEE.
[2] Rui Amaral and Isabel Trancoso. Topic segmentation and indexation in a media watch
system. In INTERSPEECH, pages 2183–2186, 2008.
[3] Jan Anguita, Stephane Peillon, Javier Hernando, and Alexandre Bramoulle. Word
confusability prediction in automatic speech recognition. In INTERSPEECH 2004 ICSLP, 8th International Conference on Spoken Language Processing. ISCA, 2004.
[4] Apple. Siri, 2011. URL http://www.apple.com/ios/siri/.
[5] Levent M. Arslan and John H. L. Hansen. Language accent classification in American
English. Speech Commun., 18:353–367, June 1996.
[6] M. F. Bacelar do Nascimento, M. L. Garcia Marques, and M. L. Segura da Cruz. Português Fundamental, Métodos e Documentos, tomo 1, Inquérito de Frequência. Lisboa,
INIC, CLUL, 1987.
[7] Jorge Baptista, Neuza Costa, Joaquim Guerra, Marcos Zampieri, Maria Cabral, and
Nuno J. Mamede. P-AWL: Academic word list for Portuguese. In PROPOR, pages
120–123, 2010.
[8] Regina Barzilay, Michael Collins, Julia Hirschberg, and Steve Whittaker. The rules
behind roles: Identifying speaker role in radio broadcasts. In Proceeding of AAAI,
2000.
[9] F. Batista, D. Caseiro, N. Mamede, and I. Trancoso. Recovering capitalization and
punctuation marks for automatic speech recognition: Case study for portuguese broadcast news. Speech Commun., 50(10):847–862, October 2008.
[10] Fadi Biadsy, Julia Hirschberg, and Nizar Habash. Spoken arabic dialect identification
using phonotactic modeling. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, Semitic ’09, pages 53–61, Stroudsburg, PA,
USA, 2009.
[11] A. Black, P. Taylor, and R. Caley. The Festival synthesis system, December 2002.
[12] Alan W. Black and Kevin A. Lenzo. Limited domain synthesis. In ICSLP, pages
411–414, 2000.
[13] Alan W Black and Kevin A. Lenzo. Flite: A small fast run-time synthesis engine. In
4TH ISCA TUTORIAL AND RESEARCH WORKSHOP ON SPEECH SYNTHESIS,
pages 20–4, Perthshire, Scotland, 2001.
[14] Alan W. Black, Susanne Burger, Brian Langner, Gabriel Parent, and Maxine Eskenazi.
Spoken dialog challenge 2010. In SLT, pages 448–453, 2010.
[15] Mats Blomberg, Rolf Carlson, Kjell Elenius, Björn Granström, Joakim Gustafson, Sheri
Hunnicutt, Roger Lindell, and Lennart Neovius. An experimental dialogue system:
Waxholm. In EUROSPEECH, 1993.
[16] Dan Bohus. Error awareness and recovery in conversational spoken language interfaces.
PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2007. AAI3277260.
[17] Dan Bohus and Alex Rudnicky. Integrating multiple knowledge sources for utterance-level confidence annotation in the CMU Communicator spoken dialog system. Technical
report, Carnegie Mellon University, 2002.
[18] Dan Bohus and Alex Rudnicky. Sorry, I didn’t catch that! an investigation of nonunderstanding errors and recovery strategies. In SIGDial, 2005.
[19] Dan Bohus and Alexander Rudnicky. LARRI: A Language-Based Maintenance and
Repair Assistant, volume 28 of Text speech and language technology, pages 203–218.
Springer, 2005.
[20] Dan Bohus and Alexander I. Rudnicky. Implicitly-supervised learning in spoken language interfaces: an application to the confidence annotation problem. In SIGDIAL,
2007.
[21] Dan Bohus and Alexander I. Rudnicky. The Ravenclaw dialog management framework:
Architecture and systems. Comput. Speech Lang., 23(3):332–361, July 2009.
[22] Dan Bohus, Antoine Raux, Thomas K. Harris, Maxine Eskenazi, and Alexander I.
Rudnicky. Olympus: an open-source framework for conversational spoken language
interface research. In proceedings of HLT-NAACL 2007 workshop on Bridging the Gap:
Academic and Industrial Research in Dialog Technology, 2007.
[23] Ghazi Bouselmi, Dominique Fohr, Irina Illina, and Jean-Paul Haton. Discriminative
phoneme sequences extraction for non-native speaker’s origin classification. Computing
Research Repository, 2007.
[24] H. Branigan, J. Pickering, J. Pearson, and J. McLean. Linguistic alignment between
people and computers. Journal of Pragmatics, 42(9):2355–2368, September 2010.
[25] Holly P. Branigan, Martin J. Pickering, Jamie Pearson, Janet F. McLean, and Clifford
Nass. Syntactic alignment between computers and people: the role of belief about
mental states. In Proceedings of the Twenty-fifth Annual Conference of the Cognitive
Science Society, 2003.
[26] Susan E. Brennan. Conversation with and through computers. User Modeling and
User-Adapted Interaction, 1(1):67–86, March 1991. ISSN 0924-1868.
[27] Susan E. Brennan. Lexical entrainment in spontaneous dialog. In International Symposium on Spoken Dialog, pages 41–44, 1996.
[28] Susan E. Brennan. The grounding problem in conversations with and through computers. In SOCIAL AND COGNITIVE PSYCHOLOGICAL APPROACHES TO INTERPERSONAL COMMUNICATION, pages 201–225. Erlbaum, 1998.
[29] Susan E. Brennan and Herbert H. Clark. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22:
1482–1493, 1996.
[30] Susan E. Brennan, P. S. Ries, C. Rubman, and G. Lee. The vocabulary problem in
spoken language systems. In S. Luperfoy, editor, Automated Spoken Dialog Systems.
MIT Press, 1996, 1998.
[31] N. Brummer. Focal: Tools for Fusion and Calibration of automatic speaker detection
systems, 2011. URL https://sites.google.com/site/nikobrummer/focal.
[32] Lin C-Y and Wang H-C. Language identification using pitch contour information. In
Proc. ICASSP 2005, 2005.
[33] W. M. Campbell, D. E. Sturim, and D. A. Reynolds. Support vector machines using
gmm supervectors for speaker verification. Signal Processing Letters, IEEE, 13(5):308–
311, April 2006.
[34] William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Elliot Singer, and
Pedro A. Torres-Carrasquillo. Support vector machines for speaker and language recognition. Computer Speech & Language, 20(2-3):210–229, 2006.
[35] D. Caseiro, I. Trancoso, L. Oliveira, and C. Viana. Grapheme-to-phone using finite-state transducers. In Proc. 2002 IEEE Workshop on Speech Synthesis,
pages 1349–1360, 2002.
[36] LLC Cepstral. Swift™: Small Footprint Text-to-Speech Synthesizer, 2005. URL http:
//www.cepstral.com.
[37] R. Correia, T. Pellegrini, M. Eskenazi, I. Trancoso, J. Baptista, and N. Mamede. Listening comprehension games for portuguese: exploring the best features. In Proc. SLaTE,
2011.
[38] Rui Correia, Jorge Baptista, Nuno Mamede, Isabel Trancoso, and Maxine Eskenazi.
Automatic generation of cloze question distractors. In Proc. Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, 2010.
[39] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Hierarchical dialogue optimization using semi-Markov decision processes. In Proc. of INTERSPEECH, August 2007.
[40] Hal Daumé III. Notes on CG and LM-BFGS optimization of logistic regression, 2004.
URL http://hal3.name/megam/.
[41] Matthias Denecke, Kohji Dohsaka, and Mikio Nakano. Learning dialogue policies using
state aggregation in reinforcement learning. In INTERSPEECH. ISCA, 2004.
[42] R.-E Fan, K.-W. Chang, C.-J Hsieh, X.-R Wang, and C.-J Lin. LIBLINEAR - A
Library for Large Linear Classification, 2008. URL http://www.csie.ntu.edu.tw/
~cjlin/liblinear/.
[43] Andrew Fandrianto and Maxine Eskenazi. Prosodic entrainment in an informationdriven dialog system. In Proceedings of Interspeech 2012, Portland, Oregon, USA, 2012.
[44] George Ferguson and James F. Allen. Trips: An integrated intelligent problem-solving
assistant. In Jack Mostow and Chuck Rich, editors, AAAI/IAAI, pages 567–572. AAAI
Press / The MIT Press, 1998.
[45] George Ferguson, James F. Allen, and Bradford W. Miller. Trains-95: Towards a
mixed-initiative planning assistant. In AIPS, pages 70–77, 1996.
[46] Pedro Fialho, Luı́sa Coheur, Sérgio Curto, Pedro Cláudio, Ângela Costa, Alberto Abad,
Hugo Meinedo, and Isabel Trancoso. Meet Edgar, a tutoring agent at Monserrate. In
ACL Demo Session, 2013.
[47] Matthew Frampton and Oliver Lemon. Learning more effective dialogue strategies using
limited dialogue move features. In Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 185–192, Stroudsburg, PA, USA, 2006. Association
for Computational Linguistics.
[48] Simon Garrod and Anthony Anderson. Saying what you mean in dialogue: a study in
conceptual and semantic co-ordination. Cognition, 27(2):181–218, 1987.
[49] Milica Gasic, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer,
Blaise Thomson, Pirros Tsiakoulis, and Steve Young. Pomdp-based dialogue manager
adaptation to extended domains. In Proceedings of the SIGDIAL 2013 Conference,
Metz, France, August 2013. Association for Computational Linguistics.
[50] Milica Gašić and Steve Young. Effective handling of dialogue state in the hidden information state POMDP-based dialogue manager. ACM Trans. Speech Lang. Process., 7
(3):4:1–4:28, June 2011.
[51] David Goddeau and Joelle Pineau. Fast reinforcement learning of dialog strategies. In
ICASSP, 2000.
[52] E. Grabe and E. L. Low. Durational variability in speech and rhythm class hypothesis.
Laboratory of Phonology, VII:515–546, 2002.
[53] Alexander Gruenstein and Stephanie Seneff. Releasing a multimodal dialogue system
into the wild: User support mechanisms. In SIGdial Workshop on Discourse and Dialogue, 2007.
[54] Joakim Gustafson, Linda Bell, Jonas Beskow, Johan Boye, Rolf Carlson, Jens Edlund,
Björn Granström, David House, and Mats Wirén. Adapt - a multimodal conversational
dialogue system in an apartment domain. In INTERSPEECH, pages 134–137, 2000.
[55] Thomas K. Harris, Satanjeev Banerjee, and Alexander I. Rudnicky. Heterogeneous multi-robot dialogues for search tasks. In Proceedings of the AAAI Spring Symposium
Intelligence, 2005.
[56] Michael Heilman, Kevyn Collins-Thompson, Jamie Callan, and Maxine Eskenazi. Classroom success of an intelligent tutoring system for lexical practice and reading comprehension. In INTERSPEECH. ISCA, 2006.
[57] Michael Heilman, Kevyn Collins-Thompson, and Maxine Eskenazi. An analysis of statistical models and features for reading difficulty prediction. In Proceedings of the Third
Workshop on Innovative Use of NLP for Building Educational Applications, EANL ’08,
pages 71–79, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[58] James Henderson, Oliver Lemon, and Kallirroi Georgila. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Comput. Linguist., 34(4):
487–511, December 2008.
[59] Julia Hirschberg. Speaking more like you: Entrainment in conversational speech. In
Proc. INTERSPEECH, 2011.
[60] Julia Hirschberg, Diane J. Litman, and Marc Swerts. Prosodic and other cues to speech
recognition failures. Speech Communication, 43(1-2):155–175, 2004.
[61] Florian Hönig, Anton Batliner, Karl Weilhammer, and Elmar Nöth. Islands of failure:
Employing word accent information for pronunciation quality assessment of English L2
learners. In Proceedings of the ISCA Special Interest Group on Speech and Language
Technology in Education, 2009.
[62] Florian Hönig, Anton Batliner, Karl Weilhammer, and Elmar Nöth. Automatic Assessment of Non-Native Prosody for English as L2. In Proceedings of Speech Prosody 2010,
2010.
[63] Xuedong Huang, Fileno Alleva, Hsiao W. Hon, Mei Y. Hwang, and Ronald Rosenfeld. The SPHINX-II speech recognition system: an overview. Computer Speech and
Language, 7(2):137–148, 1993.
[64] David Huggins-daines, Mohit Kumar, Arthur Chan, Alan W Black, Mosur Ravishankar,
and Alex I. Rudnicky. Pocketsphinx: A free, real-time continuous speech recognition
system for hand-held devices. In Proceedings of ICASSP 2006, 2006.
[65] Srini Janarthanam, Oliver Lemon, Romain Laroche, and Ghislain Putois. Testing
learned NLG and TTS policies with real users, in self-help and appointment scheduling
systems. Technical report, University of Edinburgh, 2011.
[66] Srinivasan Janarthanam and Oliver Lemon. Learning to adapt to unknown users: referring expression generation in spoken dialogue systems. In Proceedings of the 48th
Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 69–
78, Morristown, NJ, USA, 2010.
[67] Filip Jurčı́ček, Blaise Thomson, and Steve Young. Reinforcement learning for parameter
estimation in statistical spoken dialogue systems. Comput. Speech Lang., 26(3):168–192,
June 2012. ISSN 0885-2308.
[68] Simon Keizer, Milica Gasic, François Mairesse, Blaise Thomson, Kai Yu, and Steve
Young. Modelling user behaviour in the HIS-POMDP dialogue manager. In SLT, pages
121–124, 2008.
[69] Kyungduk Kim, Cheongjae Lee, Sangkeun Jung, and Gary Geunbae Lee. A framebased probabilistic framework for spoken dialog management using dialog examples. In
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08, pages
120–127, Stroudsburg, PA, USA, 2008.
[70] Oscar Koller, Alberto Abad, Isabel Trancoso, and Céu Viana. Exploiting varietydependent phones in portuguese variety identification applied to broadcast news transcription. In INTERSPEECH, pages 749–752, 2010.
[71] Stefan Kopp, L Gesellensetter, NC Kramer, and Ipke Wachsmuth. A conversational
agent as museum guide - design and evaluation of a real-world application. volume
3661 of Intelligent Virtual Agents, Proceedings, pages 329–343. Springer, 2005.
[72] Music KTH Royal Institute of Technology, Department of Speech and Hearing. Snack
Toolkit v2.2.10, 2001. URL http://www.speech.kth.se/snack/.
[73] Karsten Kumpf and Robin W. King. Automatic accent classification of foreign accented
Australian English speech. In Proceedings of ICSLP, 1996.
[74] Brian Langner and Alan Black. Mountain: A translation-based approach to natural
language generation for dialog systems. In Proc. of IWSDS 2009, Irsee, Germany,
2009.
[75] Sungjin Lee and Maxine Eskenazi. POMDP-based Let’s Go system for spoken dialog
challenge. In Proc. IEEE SLT Workshop, 2012.
[76] Sungjin Lee and Maxine Eskenazi. Exploiting machine-transcribed dialog corpus to
improve multiple dialog states tracking methods. In Proceedings of the 13th Annual
Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’12, pages
189–196, Stroudsburg, PA, USA, 2012.
[77] Sungjin Lee and Maxine Eskenazi. An unsupervised approach to user simulation: toward
self-improving dialog systems. In Proceedings of the 13th Annual Meeting of the Special
Interest Group on Discourse and Dialogue, SIGDIAL ’12, pages 50–59, Stroudsburg,
PA, USA, 2012.
[78] Sungjin Lee and Maxine Eskenazi. Recipe for building robust spoken dialog state
trackers: Dialog state tracking challenge system description. In Proceedings of the
SIGDIAL 2013 Conference, Metz, France, August 2013. Association for Computational
Linguistics.
[79] Oliver Lemon. Learning what to say and how to say it: Joint optimisation of spoken
dialogue management and natural language generation. Comput. Speech Lang., 25(2):
210–221, April 2011.
[80] Gina-Anne Levow. Learning to speak to a spoken language system: Vocabulary convergence in novice users. In Proc. SIGdial, 2003.
[81] Lihong Li, Jason D. Williams, and Suhrid Balakrishnan. Reinforcement learning for
dialog management using least-squares policy iteration and fast feature selection. In
INTERSPEECH, pages 2475–2478. ISCA, 2009.
[82] W. Ling, I. Trancoso, and R. Prada. An agent based competitive translation game for
second language learning. In Proc. SLaTE, 2011.
[83] Diane J. Litman and Shimei Pan. Designing and evaluating an adaptive spoken dialogue
system. User Modeling and User-Adapted Interaction, 12(2-3):111–137, March 2002.
ISSN 0924-1868.
[84] Diane J. Litman and Scott Silliman. ITSPOKE: an intelligent tutoring spoken dialogue
system. In Demonstration Papers at HLT-NAACL 2004, HLT-NAACL–Demonstrations
’04, pages 5–8, Stroudsburg, PA, USA, 2004.
[85] Diane J. Litman, Marilyn A. Walker, and Michael S. Kearns. Automatic detection
of poor speech recognition at the dialogue level. In Proceedings of the 37th annual
meeting of the Association for Computational Linguistics on Computational Linguistics,
ACL ’99, pages 309–316, Stroudsburg, PA, USA, 1999. Association for Computational
Linguistics.
[86] Luís Seabra Lopes and António J. S. Teixeira. Human-robot interaction through spoken
language dialogue. In IROS. IEEE, 2000.
[87] P. Madeira, M. Mourão, and N. Mamede. STAR - a multi-domain dialog manager. In
ICEIS, 2003.
[88] Ciro Martins, António J. S. Teixeira, and João Paulo Neto. Dynamic language modeling
for a daily broadcast news transcription system. In ASRU, pages 165–170, 2007.
[89] Luı́s Marujo, José Lopes, Nuno J. Mamede, Isabel Trancoso, Juan Pino, Maxine Eskenazi, Jorge Baptista, and Céu Viana. Porting REAP to European Portuguese. In
Proceedings of the ISCA International Workshop on Speech and Language Technology in Education (SLaTE 2009), 2009.
[90] Michael Matessa. Measures of adaptive communication. In Second Workshop on Empirical Evaluation of Adaptive Systems, 2003.
[91] H. Melin, A. Sandell, and M. Ihse. CTT-bank: A speech controlled telephone banking
system - an initial evaluation. Technical report, KTH, 2001.
[92] Microsoft. Microsoft Speech Software Development Kit Developer’s Guide, 1996.
[93] Samer Al Moubayed, Gabriel Skantze, and Jonas Beskow. The Furhat back-projected
humanoid head: lip reading, gaze and multi-party interaction. I. J. Humanoid Robotics,
10(1), 2013.
[94] Ani Nenkova, Agustín Gravano, and Julia Hirschberg. High frequency word entrainment
in spoken dialogue. In Proceedings of ACL-08: HLT. Association for Computational
Linguistics, 2008.
[95] J. Neto, N. Mamede, R. Cassaca, and L. Oliveira. The development of a multi-purpose
spoken dialogue system. In Proceedings of EUROSPEECH, 2003.
[96] J. Neto, H. Meinedo, M. Viveiros, R. Cassaca, C. Martins, and D. Caseiro. Broadcast
news subtitling system in Portuguese. In ICASSP’08, pages 1561–1564, 2008.
[97] K. G. Niederhoffer and J. W. Pennebaker. Linguistic style matching in social interaction.
Journal of Language and Social Psychology, 21(4):337–360, 2002.
[98] Alice H. Oh and Alexander I. Rudnicky. Stochastic language generation for spoken
dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational systems - Volume 3, ANLP/NAACL-ConvSyst ’00, pages 27–32, Stroudsburg,
PA, USA, 2000.
[99] Gabriel Parent and Maxine Eskenazi. Lexical entrainment of real users in the Let’s Go
spoken dialog system. In INTERSPEECH, pages 3018–3021, 2010.
[100] Giorgio Parisi. Statistical Field Theory. Addison-Wesley, 1988.
[101] Sérgio Paulo, Luı́s C. Oliveira, Carlos Mendes, Luı́s Figueira, Renato Cassaca, Céu
Viana, and Helena Moniz. Dixi — a generic text-to-speech system for European Portuguese. In Proceedings of the 8th international conference on Computational Processing
of the Portuguese Language, PROPOR ’08, pages 91–100, 2008.
[102] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
[103] Thomas Pellegrini and Isabel Trancoso. Improving ASR error detection with non-decoder based features. In INTERSPEECH, pages 1950–1953, 2010.
[104] Thomas Pellegrini, Rui Correia, Isabel Trancoso, Jorge Baptista, and Nuno J. Mamede.
Automatic generation of listening comprehension learning material in European Portuguese. In INTERSPEECH, pages 1629–1632, 2011.
[105] Marina Piat, Dominique Fohr, and Irina Illina. Foreign accent identification based on
prosodic parameters. In INTERSPEECH 2008, pages 759–762, 2008.
[106] Martin J. Pickering and Simon Garrod. Towards a mechanistic psychology of dialogue.
Behavioral and Brain Sciences, 27(2):169–190, 2004.
[107] O. Pietquin and T. Dutoit. A probabilistic framework for dialog simulation and optimal
strategy learning. Trans. Audio, Speech and Lang. Proc., 14(2):589–599, December 2006.
[108] Olivier Pietquin. A Framework for Unsupervised Learning of Dialogue Strategies. PhD
thesis, Faculté Polytechnique de Mons, TCTS Lab (Belgique), apr 2004.
[109] Isabella Poggi, Catherine Pelachaud, Fiorella de Rosis, Valeria Carofiglio, and Berardina
de Carolis. Greta, a believable embodied conversational agent. Multimodal Communication in Virtual Environments, pages 27–45, 2005.
[110] Anna Pompili, Alberto Abad, Isabel Trancoso, José Fonseca, Isabel P. Martins, Gabriela
Leal, and Luisa Farrajota. An on-line system for remote treatment of aphasia. In
Proceedings of the Second Workshop on Speech and Language Processing for Assistive
Technologies, SLPAT ’11, pages 1–10, Stroudsburg, PA, USA, 2011. Association for
Computational Linguistics.
[111] P. J. Price. Evaluation of spoken language systems: the ATIS domain. In Proceedings
of the workshop on Speech and Natural Language, HLT ’90, pages 91–95, Stroudsburg,
PA, USA, 1990.
[112] F. Ramus. Acoustic correlates of linguistic rhythm: Perspectives. pages 115–120, 2002.
[113] Antoine Raux. Flexible Turn-Taking for Spoken Dialog Systems. PhD thesis, Carnegie
Mellon University, Pittsburgh, PA, USA, 2008.
[114] Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. Let’s
Go public! taking a spoken dialog system to the real world. In Proc. of Interspeech
2005, 2005.
[115] David Reitter, Frank Keller, and Johanna D. Moore. Computational modelling of
structural priming in dialogue. In Proceedings of the Human Language Technology
Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, pages
121–124, 2006.
[116] David B. Roe and Michael D. Riley. Prediction of word confusabilities for speech
recognition. In Proc. ICSLP, 1994.
[117] Nicholas Roy, Gregory Baltus, Dieter Fox, Francine Gemperle, Jennifer Goetz, Tad
Hirsch, Dimitris Margaritis, Michael Montemerlo, Joelle Pineau, Jamieson Schulte, and
Sebastian Thrun. Towards personal service robots for the elderly, May 2000.
[118] Nicholas Roy, Joelle Pineau, and Sebastian Thrun. Spoken dialogue management using
probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for
Computational Linguistics, ACL ’00, pages 93–100, Stroudsburg, PA, USA, 2000.
[119] Alexander I. Rudnicky, Eric H. Thayer, Paul C. Constantinides, Chris Tchou, R. Shern,
Kevin A. Lenzo, W. Xu, and Alice Oh. Creating natural dialogs in the Carnegie Mellon
Communicator system. In EUROSPEECH. ISCA, 1999.
[120] Ruhi Sarikaya, Yuqing Gao, Michael Picheny, and Hakan Erdogan. Semantic confidence
measurement for spoken dialog systems. IEEE Transactions on Speech and Audio Processing, (4):534–545.
[121] Konrad Scheffler and Steve Young. Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning. In Proceedings of the second international
conference on Human Language Technology Research, HLT ’02, pages 12–19, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
[122] Stephanie Seneff and Joseph Polifroni. A new restaurant guide conversational system:
issues in rapid prototyping for specialized domains. In The 4th International Conference
on Spoken Language Processing, Philadelphia, PA, USA, October 3-6, 1996. ISCA, 1996.
[123] Stephanie Seneff and Joseph Polifroni. Dialogue management in the Mercury flight
reservation system. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational systems - Volume 3, ANLP/NAACL-ConvSyst ’00, pages 11–16, Stroudsburg,
PA, USA, 2000.
[124] Stephanie Seneff, Ed Hurley, Raymond Lau, Christine Pao, Philipp Schmid, and Victor
Zue. Galaxy-II: A reference architecture for conversational system development. In
Proc. ICSLP, pages 931–934, 1998.
[125] André Silva, Nuno J. Mamede, Alfredo Ferreira Jr., Jorge Baptista, and João Fernandes.
Towards a serious game for Portuguese learning. In SGDA, pages 83–94, 2011.
[126] Gabriel Skantze, Jens Edlund, and Rolf Carlson. Talking with Higgins: Research challenges in a spoken dialogue system. In PIT, pages 193–196, 2006.
[127] Amanda Stent, Rashmi Prasad, and Marilyn Walker. Trainable sentence planning for
complex information presentation in spoken dialog systems. In Proceedings of the 42nd
Annual Meeting on Association for Computational Linguistics, ACL ’04, Stroudsburg,
PA, USA, 2004.
[128] Svetlana Stoyanchev and Amanda Stent. Lexical and syntactic priming and their impact in deployed spoken dialog systems. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association
for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09,
pages 189–192, Stroudsburg, PA, USA, 2009.
[129] William Swartout, David Traum, Ron Artstein, Dan Noren, Paul Debevec, Kerry Bronnenkant, Josh Williams, Anton Leuski, Shrikanth Narayanan, Diane Piepol, Chad Lane,
Jacquelyn Morie, Priti Aggarwal, Matt Liewer, Jen-Yuan Chiang, Jillian Gerten, Selina
Chu, and Kyle White. Ada and Grace: toward realistic and engaging virtual museum
guides. In Proceedings of the 10th international conference on Intelligent virtual agents,
IVA’10, pages 286–300, Berlin, Heidelberg, 2010. Springer-Verlag.
[130] Marc Swerts, Diane J. Litman, and Julia Hirschberg. Corrections in spoken dialogue
systems. In INTERSPEECH, pages 615–618, 2000.
[131] Beng T. Tan, Yong Gu, and Trevor Thomas. Word confusability measures for vocabulary selection in speech recognition. In Proc. ASRU, 1999.
[132] Joseph Tepperman and Shrikanth S. Narayanan. Better non-native intonation scores
through prosodic theory. In INTERSPEECH, pages 1813–1816, 2008.
[133] Blaise Thomson and Steve Young. Bayesian update of dialogue state: A POMDP
framework for spoken dialogue systems. Comput. Speech Lang., 24(4):562–588, October
2010.
[134] Blaise Thomson, Kai Yu, Simon Keizer, Milica Gasic, Filip Jurčíček, François Mairesse,
and Steve Young. Bayesian dialogue system for the Let’s Go spoken dialogue challenge.
In SLT, pages 460–465. IEEE, 2010.
[135] T. Toda, A.W. Black, and K. Tokuda. Voice conversion based on maximum-likelihood
estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and
Language Processing, 15(8):2222–2235, nov. 2007.
[136] Pedro A. Torres-Carrasquillo, Elliot Singer, Mary A. Kohler, and J. R. Deller. Approaches to language identification using Gaussian mixture models and shifted delta
cepstral features. In Proc. ICSLP 2002, pages 89–92, 2002.
[137] Arthur R. Toth, Thomas K. Harris, James Sanders, Stefanie Shriver, and Roni Rosenfeld. Towards every-citizen’s speech interface: An application generator for speech
interfaces to databases. In Proceedings of ICSLP, pages 1497–1500, 2002.
[138] Joost Van Doremalen, Catia Cucchiarini, and Helmer Strik. Optimizing automatic
speech recognition for low-proficient non-native speakers. EURASIP J. Audio Speech
Music Process., 2010:2:1–2:13, January 2010.
[139] Marilyn A. Walker, Jeanne Fromer, Giuseppe Di Fabbrizio, Craig Mestel, and Donald
Hindle. What can I say? Evaluating a spoken language interface to email. In CHI, pages
582–589, 1998.
[140] Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia
Abella. Evaluating spoken dialogue agents with PARADISE: Two case studies, 1998.
[141] Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman. Towards developing
general models of usability with PARADISE. Natural Language Engineering, 6(3-4):
363–377, 2000.
[142] Marilyn A. Walker, Owen Rambow, and Monica Rogati. SPoT: a trainable sentence
planner. In Proceedings of the second meeting of the North American Chapter of the
Association for Computational Linguistics on Language technologies, NAACL ’01, pages
1–8, Stroudsburg, PA, USA, 2001.
[143] A. Ward and D. Litman. Measuring convergence and priming in tutorial dialog. Technical
Report TR-07-148, University of Pittsburgh, 2007.
[144] Arthur Ward and Diane Litman. Automatically measuring lexical and acoustic/prosodic
convergence in tutorial. In Proceedings of SLaTE 2007, Farmington, Pennsylvania, USA,
2007.
[145] Wayne Ward and Sunil Issar. Recent improvements in the CMU spoken language understanding system. In Proceedings of the workshop on Human Language Technology, HLT
’94, pages 213–216, 1994.
[146] Jason D. Williams. The best of both worlds: unifying conventional dialog systems and
POMDPs. In INTERSPEECH, pages 1173–1176, 2008.
[147] Jason D. Williams. Incremental partition recombination for efficient tracking of multiple
dialog states. In ICASSP, pages 5382–5385, 2010.
[148] Jason D. Williams and Steve Young. Partially observable Markov decision processes for
spoken dialog systems. Comput. Speech Lang., 21(2):393–422, 2007.
[149] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and
Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems).
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005. ISBN 0120884070.
[150] S. Young, J. Schatzmann, K. Weilhammer, and Hui Ye. The Hidden Information State
approach to dialog management. In IEEE International Conference on Acoustics, Speech
and Signal Processing, 2007. ICASSP 2007, volume 4, pages IV–149–IV–152, 2007.
[151] Steve Young. Probabilistic methods in spoken dialogue systems. Philosophical Transactions of the Royal Society (Series A), 358(1769):1389–1402, 2000.
[152] Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise
Thomson, and Kai Yu. The Hidden Information State model: A practical framework
for POMDP-based spoken dialogue management. Comput. Speech Lang., 24(2):150–174,
April 2010.
[153] Steve Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. POMDP-based
statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179,
2013.
[154] Victor Zue, James Glass, David Goodine, Hong Leung, Michael Phillips, Joseph Polifroni, and Stephanie Seneff. The VOYAGER speech understanding system: a progress
report. In Proceedings of the workshop on Speech and Natural Language, HLT ’89, pages
51–59, Stroudsburg, PA, USA, 1989. ISBN 1-55860-112-0.
[155] Victor Zue, Stephanie Seneff, James R. Glass, Joseph Polifroni, Christine Pao, Timothy J. Hazen, and I. Lee Hetherington. JUPITER: a telephone-based conversational
interface for weather information. IEEE Transactions on Speech and Audio Processing,
8(1):85–96, 2000.