UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

Lexical Entrainment in Spoken Dialog Systems

José David Águas Lopes

Supervisor: Doctor Isabel Maria Martins Trancoso
Co-Supervisor: Doctor Maxine Eskenazi

Thesis approved in public session to obtain the PhD Degree in Electrical and Computer Engineering

Jury Final classification: Pass With Merit.

Jury
Chairperson: Chairman of the IST Scientific Board
Members of the Committee:
Doctor Joakim Gustafson, Professor, School of Computer Science and Communication, KTH, Royal Institute of Technology, Sweden.
Doctor Isabel Maria Martins Trancoso, Professora Catedrática do Instituto Superior Técnico, da Universidade de Lisboa.
Doctor Ana Maria Severino de Almeida e Paiva, Professora Associada (com Agregação) do Instituto Superior Técnico, da Universidade de Lisboa.
Doctor Nuno João Neves Mamede, Professor Associado (com Agregação) do Instituto Superior Técnico, da Universidade de Lisboa.
Doctor António Joaquim da Silva Teixeira, Professor Auxiliar da Universidade de Aveiro.
Doctor Maxine Eskenazi, Principal Systems Scientist, Language Technologies Institute, Carnegie Mellon University, USA.

This research work was funded by "Fundação para a Ciência e a Tecnologia" (FCT, Portugal) through the Ph.D. grant with reference SFRH/BD/47039/2008.

2013

Resumo

A literatura [27] mostrou que os locutores participantes em diálogos falados usam os termos uns dos outros para estabelecer uma base de conhecimento comum. Os termos utilizados são correntemente conhecidos como primes, sendo termos capazes de influenciar o processo de decisão lexical dos interlocutores. O objectivo desta tese é estudar um sistema de diálogo capaz de imitar as relações estabelecidas em interacção humana no que diz respeito ao processo de escolha de primes. O objectivo final é que o sistema seja capaz de escolher primes no decorrer da interacção, baseando essa escolha num conjunto de características que se julga indicarem os primes adequados que os utilizadores pretendem usar. A estratégia seguida foi preparar um sistema que privilegiasse a preferência do utilizador, desde que isso não afectasse negativamente o desempenho desse sistema. Quando o desempenho do sistema saísse afectado, o sistema deveria estar à altura de propor um prime alternativo, de modo a que o utilizador o pudesse usar e o seu desempenho melhorasse. Consideramos que esta estratégia traz benefícios na qualidade do desempenho do sistema. O cenário escolhido para o trabalho experimental levado a cabo foi um sistema de informação de horários de autocarros, na cidade de Pittsburgh, conhecido por sistema Let's Go, e em Lisboa com uma versão portuguesa desse sistema, denominada Noctívago, a fornecer informações dos horários das carreiras nocturnas da CARRIS.
No que se refere a Lisboa, trata-se de um sistema experimental desenvolvido neste trabalho de doutoramento, privilegiando sobretudo o estudo das características que determinam um bom prime. A análise do resultado dos primeiros testes com este último permitiu-nos identificar características que posteriormente foram utilizadas para criar um primeiro método on-line para escolha de primes. O método foi testado em ambos os sistemas. No sistema Let's Go verificaram-se melhorias no desempenho do sistema, bem como uma redução do número de turnos por diálogo. O conjunto de características foi então estendido para a criação de um modelo estatístico para selecção de primes, que foi testado no sistema Noctívago. Os resultados revelaram uma redução na taxa de erro de reconhecimento de primes e um aumento no sucesso na aquisição de conceitos relacionados com os mesmos. Deste modo, pode-se concluir que as decisões lexicais podem influenciar positivamente o desempenho dos sistemas de diálogo. No entanto, estes foram apenas os primeiros testes. Melhorar os métodos desenvolvidos e fazer testes em maior escala são passos necessários para reforçar a hipótese defendida.

Abstract

The literature [27] has shown that speakers engaged in a spoken dialog use one another's terms (entrain) when trying to create a common ground. Those terms are commonly called primes, since they influence the interlocutors' linguistic decision-making. The goal of this thesis is to develop a Spoken Dialog System (SDS) that is able to imitate human interaction by proposing primes to the user. The final goal is to have these primes chosen on the fly during the interaction, based on a set of features that are believed to indicate good candidate terms that the speaker would want to use. The strategy was to develop a system that follows the user's choice of prime, provided that system performance is not negatively affected. When system performance is affected, the system proposes a new prime, the next best candidate for the speaker to use. We believe that this strategy may improve system performance. The scenario for the entrainment experiments is an information system providing schedules for buses in Pittsburgh, Let's Go, and its Portuguese counterpart, Noctívago, an experimental system operating for the night buses in Lisbon that was developed especially for studying the features that make good primes. The analysis of the results from the first tests yielded a set of features that were used to create a first on-the-fly method for prime selection. The method was tested in both systems. In Let's Go there was an improvement in system performance and a reduction of the total number of turns per dialog. Using an extended set of features, a data-driven method was created and tested with Noctívago. We observed a reduction in the error rate of prime recognition and an increase in the successful acquisition of prime concepts. Therefore, we can conclude that lexical entrainment can play a role in improving the performance of SDSs. However, these are only the first results. Further improvements in the methods developed and larger-scale tests are needed to strengthen our hypothesis.
Palavras Chave

Adaptação Lexical
Selecção automática de primes
Sistemas de Diálogo
Primes
Captação
Medida de confiança
Gestão de Diálogo
Interacção Homem-Máquina
Noctívago
Let's Go

Keywords

Lexical Entrainment
Automated Prime Selection
Spoken Dialog Systems
Primes
Uptake
Confidence score
Dialog Management
Human Machine Interaction
Noctívago
Let's Go

Acknowledgements

I would like to thank my advisors, Isabel Trancoso and Maxine Eskenazi, for their constant availability, interest, support and encouragement in my research. I was blessed with two great advisors with an immense sense of justice. I also thank their support in the research direction I like the most.

I would like to thank all my colleagues from the Spoken Language Laboratory for the motivating atmosphere for research felt throughout these years. To all the people from Carnegie Mellon that I have worked with throughout my visits, thanks for making me feel at home. Special thanks to Alberto Abad, Sungjin Lee and Alan W. Black for their contributions to my research. I would also like to mention Renato Cassaca and Pedro Fialho, for their patience and technical support, and Cláudia Figueiredo for the help with the stats. To the members of the committee, for accepting the invitation and for their contributions to make this document better for the community. I would also like to thank Professor Fernando Perdigão for introducing me to the world of language technologies and for the warm welcome when I asked him to use the laboratory in Coimbra.

A word to those with whom I've shared a home over all these years and who sometimes had to deal with my research frustration at the end of the day. A special thanks to those that lived in Casa do Babar, for being my family in Lisbon. To my JSC friends for all the memorable moments shared and for being so inspiring to me despite all our differences. To the friends I've made at CUMN and other Jesuit initiatives over the years: thanks for the truth, for taking care of me and for making me feel real freedom. To my Christian Life Community, for being side by side with me during this thesis. To the Society of Jesus for the spiritual support and for the tools that help me to become a better human being day after day. Thanks to all the kids that I tried to serve in Rabo de Peixe (Azores), Fonte da Prata (Moita) and Bela Vista (Setúbal) for their genuine life, and for reminding me of the greatest need of humankind: being loved.

To my parents, Fernanda and Orlando, and my brother, João, for their unconditional love and support during the rough moments of these nearly five years.

To God, the One I've tried to serve with this work and want to continue serving for the rest of my life.

Lisboa, December 15, 2013
José David Águas Lopes

"Para ser grande, sê inteiro: nada
Teu exagera ou exclui.
Sê todo em cada coisa. Põe quanto és
No mínimo que fazes.
Assim em cada lago a lua toda
Brilha, porque alta vive"
Ricardo Reis

Contents

1 Introduction
1.1 Motivation
1.2 Goals
1.3 Structure of the document

2 Related Work
2.1 Improving robustness in SDSs
2.1.1 Speech recognition
2.1.2 Confidence annotation
2.1.3 Dialog Management
2.1.3.1 POMDP Review
2.1.3.1.1 SDS-POMDP
2.1.3.2 Dialog-State tracking
2.1.3.2.1 N-Best Approaches
2.1.3.2.2 Factored Bayesian Networks Approaches
2.1.3.3 Policy optimization
2.1.3.4 Summary
2.1.4 Language Generation
2.2 Entrainment
2.2.1 Entrainment in Human-Human Dialogs
2.2.2 Entrainment in Human-Computer Dialogs
2.3 Summary

3 Creating Noctívago, a Portuguese Let's Go
3.1 Choosing a Framework for Noctívago
3.1.1 DIGA
3.1.2 Olympus
3.2 Modules Used
3.2.1 Speech Recognition
3.2.1.1 Pocketsphinx
3.2.1.2 Audimus
3.2.2 Natural Language Understanding
3.2.3 Dialog Management
3.2.3.1 Ravenclaw
3.2.3.2 Cornerstone
3.2.3.2.1 Dialog-State Tracking
3.2.3.2.2 Policy Optimization
3.2.4 Natural Language Generation and Speech Synthesis
3.2.5 Embodied Conversational Agents
3.3 Summary

4 Towards Better Prime Choices
4.1 Introduction
4.2 Creating a list of prime candidates
4.3 Experimental Set Up
4.4 Results
4.5 Prime usage
4.6 Discussion
4.6.1 User's Feedback
4.7 Summary

5 Refining confidence measure to improve prime selection
5.1 Introduction
5.2 Training a confidence annotator using logistic regression
5.3 Training a confidence annotator with skewed data
5.4 Summary

6 Automated Entrainment
6.1 Two-Way Automated Rule-Based entrainment
6.1.1 Entrainment Events
6.1.2 Heuristic Entrainment Rules
6.1.2.1 Implementing the Entrainment Heuristics
6.1.2.1.1 Long-Term Entrainment
6.1.2.1.2 Short-Term Entrainment
6.1.2.2 Version 1: Testing the Entrainment Rules in Noctívago
6.1.2.2.1 Test Set
6.1.2.2.2 Results
6.1.2.3 Version 2: Testing Entrainment Rules in Let's Go
6.1.2.3.1 Results
6.1.3 Acoustic Distance and Prime Usage Evolution
6.1.3.1 Analysis
6.1.4 Discussion
6.2 A Data-Driven Method for Prime Selection
6.2.1 Prime selection model
6.2.2 Predicting WER with Off-line data
6.2.3 Testing the model in an experimental system
6.2.3.1 Test Set
6.2.4 Results
6.2.4.1 Prime Usage Evolution
6.2.5 Discussion
6.3 Summary

7 Conclusions
7.1 Future Work

A Tools for Oral Comprehension in L2 Learning
A.1 Introduction
A.1.1 Speech synthesis
A.1.2 Digital Talking Books
A.1.3 Broadcast News
A.1.3.1 Integrating Automatically Transcribed news shows in REAP.PT
A.1.3.1.1 Broadcast News Pipeline
A.1.3.1.2 Integration in REAP.PT
A.2 Summary

B Nativeness Detection
B.1 Introduction
B.1.1 Corpus
B.1.2 Nativeness Classification
B.1.2.1 Acoustic Classifier Development
B.1.2.1.1 Feature Extraction
B.1.2.1.2 Supervector Extraction
B.1.2.1.3 Nativeness modeling and scoring
B.1.2.2 Prosodic Classifier Development
B.1.2.2.1 Prosodic contour extraction
B.1.2.2.2 Nativeness modeling and scoring
B.1.2.3 Calibration
B.1.3 Results and discussion
B.1.4 Human Benchmark
B.2 Summary

C Resources used in the experimental sets
C.1 Scenarios used in Noctívago tests
C.2 Questionnaires
C.2.1 Questionnaires used in Section 4.3
C.2.2 Questionnaire used in Section 6.1.2.2
C.2.3 Questionnaire used in Section 6.1.2.2

List of Figures

2.1 Standard architecture of an SDS.
2.2 Spoken Dialog System architecture in [151].
2.3 Influence Diagram of a SDS-POMDP. From [148].
3.1 Architecture of DIGA framework for SDSs.
3.2 Olympus reference architecture. From [22].
3.3 Tree for Noctívago task implemented in Ravenclaw.
3.4 DBN graph for the Let's Go state tracking. From [75].
3.5 Flash-based ECA used in Noctívago.
3.6 Unity 3D-based ECA used in Noctívago.
3.7 Olympus Architecture used in the first Noctívago tests (Section 4.3) with telephone interface.
3.8 Olympus Architecture used in Let's Go tests (Section 6.1.2.3).
3.9 Olympus Architectures used in Noctívago with ECA.
4.1 Example of scenario used.
5.1 Accuracy, Precision and Recall for the tested methods.
5.2 Performance of the stepwise logistic regression compared with the baseline model.
6.1 WER and CTC results for the different configurations tested.
6.2 Accumulated percentage of events.
6.3 Prime Usage over time for the concepts confirmation, help and now.
6.4 Prime Usage over time for the concepts next query, origin place and start over.
6.5 Prime Usage over time for the concepts next bus and previous bus.
6.6 OOV, WER and CTC results for the different configurations tested.
6.7 Comparison between intrinsic prime usages in prompts between Data Driven (DD) and Rule Based (RB) prime selection.
6.8 Comparison between non-intrinsic prime usages in prompts between Data Driven (DD) and Rule Based (RB) prime selection.
A.1 Recognized BN interface in REAP.
A.2 Broadcast News Pipeline.
B.1 DET curves of the GSV-acoustic 256, GMM-prosodic and fusion between both systems.
C.1 Scenario 1.
C.2 Scenario 2.
C.3 Scenario 3.
C.4 Questionnaire used in the first week.
C.5 Questionnaire used in the second week.
C.6 Questionnaire used in the Rule-Based tests.
C.7 Questionnaire used in the Data-Driven tests.

List of Tables

4.1 Prime analysis in pilot experiments.
4.2 List of primes used in this study.
4.3 Distribution of calls and real success rate according to the device used.
4.4 Success rate of the system in each week.
4.5 Analysis of errors at the turn level.
4.6 WER for the different weeks.
4.7 Use of the primes and error rate in both weeks.
4.8 Example of uptake stats taken from interaction for the prime próximo.
4.9 Analysis of the uptake of the primes.
5.1 Weights for the features used in the confidence annotator training algorithm.
5.2 Confidence annotation models trained with stepwise logistic regression.
5.3 Classification error rate for the different strategies.
6.1 Examples of the events used in the prime choice update. Primes are in bold.
6.2 Primes used by Noctívago in the heuristic method tests.
6.3 Dialog performance results.
6.4 WER and correctly transferred concepts results.
6.5 User satisfaction results.
6.6 Entrainment events relative frequency.
6.7 Primes used by Let's Go before and after the entrainment rules were implemented.
6.8 Excerpts of dialogs where entrainment rules changed the system's normal behavior. Primes affected in bold.
6.9 Results for Let's Go tests. Statistically significant differences in bold.
6.10 Primes selected according to the minimal and average acoustic distance for each language model.
6.11 Example of how the prime distance was computed.
6.12 Number of turns used to train the prime selection regression.
6.13 Noctívago models.
6.14 Let's Go models.
6.15 Primes used by Noctívago in the data-driven model tests.
6.16 OOV, WER and CTC Results for the different versions. Statistically significant results in bold (one-way ANOVA with F(3) = 2.881 and p-value = 0.037).
6.17 Entrainment Events and Non-Understandings relative frequencies. One-way ANOVA tests revealed no statistical significance.
6.18 Dialog performance results.
6.19 Questionnaire results for the different versions.
B.1 Details of the training and test sets for Native and Non-Native data.
B.2 Detailed results for GSV-acoustic 64 and GSV-acoustic 256.
B.3 Results obtained using prosodic features (Accuracy and EER) and the fusion between prosodic systems and GSV-acoustic 256.

1 Introduction

Advances in the last 30 years, especially in Automatic Speech Recognition, have allowed the pursuit of the dream of having conversational dialogs with computers. The systems behind the dialog capabilities of computers are usually named Spoken Dialog Systems (SDSs). The use of SDSs that are able to talk to humans can make life more comfortable. Some services that nowadays rely on human-operator dialogs could in the future be performed by machines, whenever the human operators are not available. The use of SDSs in situations where typing is time-consuming or impossible (while driving, for instance) is also a way to provide safety and comfort in daily life. Personal agents like Apple's Siri [4] are examples of how people look at SDSs nowadays. There is a large potential in this type of technology.

1.1 Motivation

The study of human dialogs should inspire SDS developers to create systems that are able to follow some human dialog protocols, in order to achieve successful communication. Lexical Entrainment is a phenomenon that was first described by psychologists and psycholinguists, who analyzed dialogs between humans and observed that the participants converged on the terms they used as the dialogs progressed, in order to achieve successful communication [48, 29]. This implies that sometimes one of the subjects has to give up her/his own words and adopt the words of the other subject, either because the other subject is not familiar with the word presented or because there is a word with the same meaning that can lead to faster and more successful communication. One characteristic of lexical entrainment is that different pairs of subjects are very likely to use different terms to refer to the same object. Another is that subjects do not lexically entrain because they want to: entrainment is not a conscious process, which makes it even more difficult to model within an SDS.

There is an obvious advantage in the application of lexical entrainment to SDSs: if the system entrains, the communication will be more successful. There are several reasons for this. First, if the system is able to predict the user's lexical choices, speech recognition is likely to improve [30]. Second, from human-human dialogs it is known that the more two people are engaged with each other, the more entrainment exists [97]. And third, it has been shown that entrainment is critical for successful communication [106].

1.2 Goals

This thesis aims to add a contribution to all the other improvements made in this area in the last decades, by integrating Lexical Entrainment in an SDS. Despite the advances in speech recognition and understanding, there is still a lot of uncertainty in the Spoken Language Understanding (SLU) process, which may affect the success of the dialog. Since lexical entrainment occurs in both directions in a dialog, the human participant can also entrain to the system. Thus, systems may also try to make the user entrain to them whenever a term hinders the SLU process. Whichever way the adaptation occurs, it will require the system to change the terms used in the system prompts to adapt them appropriately.
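To make this last point concrete, the following minimal Python sketch shows one way a system could swap the primes it uses in its prompt templates: following the user's term while recognition goes well, and proposing the next-best candidate when it does not. The concept names, templates and candidate primes are invented for illustration; this is a sketch of the idea, not the mechanism implemented later in this thesis.

```python
# Toy two-way prompt adaptation. All concepts, templates and candidate
# primes below are hypothetical examples.

PRIME_CANDIDATES = {                  # concept -> ranked candidate primes
    "departure": ["leaving from", "departing from"],
    "next_bus": ["next bus", "following bus"],
}
current_prime = {c: ranked[0] for c, ranked in PRIME_CANDIDATES.items()}

def update_prime(concept, user_term, recognized_well):
    """Entrain to the user's term unless it hurts recognition."""
    if recognized_well and user_term in PRIME_CANDIDATES[concept]:
        current_prime[concept] = user_term        # follow the user's choice
    elif not recognized_well:
        alternatives = [p for p in PRIME_CANDIDATES[concept]
                        if p != current_prime[concept]]
        if alternatives:                          # propose a new prime instead
            current_prime[concept] = alternatives[0]

def render_prompt(concept, template="Where are you {prime}?"):
    return template.format(prime=current_prime[concept])
```

Deciding when to follow the user and when to propose an alternative is exactly the question investigated in this thesis, with rule-based and data-driven prime selection methods presented in Chapter 6.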
This thesis aims to contribute methods to detect when and how the system should modify its prompts in order to make the user entrain, or to be entrained by the user. By the end of the thesis, we hope to be able to show that the methods used and the results achieved support our initial idea that incorporating lexical entrainment in SDSs increases task success and brings them closer to human-like dialog capabilities.

1.3 Structure of the document

In order to acquaint the reader with SDSs, the first part of Chapter 2 reviews the previous work in this area, with a special focus on techniques used to deal with the uncertainty provoked by SLU errors. The second part reviews Lexical Entrainment in human-to-human dialogs, as well as its previous applications to human-machine dialogs. Chapter 3 describes the steps towards the creation of a new dialog system in Portuguese, Noctívago, contrasting it with the Let's Go system, which served as its role model. Chapter 4 describes the first tests done with Noctívago and the findings that led to an on-the-fly algorithm for prime selection. An important feature for prime selection is the confidence measure. The development of new confidence measures is presented in Chapter 5, showing that accurate confidence measures may be important to the algorithms for on-the-fly entrainment. Chapter 6 presents the methods and tests of a Rule-Based and a Data-Driven algorithm for prime selection. Finally, the concluding remarks and possible directions for future work close the main block of the thesis in Chapter 7.

Annexes A and B correspond to work developed in the scope of the initial direction of this PhD thesis, the creation of an SDS for non-native students of Portuguese. The first of these Annexes describes a set of tools for stimulating oral comprehension in students of European Portuguese as L2, whereas the second Annex describes our efforts in terms of automatic nativeness classification. Annex C includes materials used in the experiments throughout this work.

2 Related Work

SDSs are typically composed of several modules that work as a pipeline to perform conversational dialogs with humans. Figure 2.1 shows the standard architecture of an SDS. The audio is fed into the speech recognizer, which generates text. This text is processed by the language understanding module, which generates a semantic representation of the text input and a confidence score. According to that semantic representation, the Dialog Manager (DM), the "brain" of the system, decides the next action to be performed. Although not represented in the figure, dialog systems that work in the domain of information services can also have a connection to a backend to perform database queries. The natural language generation (NLG) module maps the action selected by the DM into text. Finally, the text generated by the NLG is synthesized into audio and returned to the user. A minimal code sketch of this pipeline is given below.

Figure 2.1: Standard architecture of an SDS.

Over the last decades, the advances especially in speech recognition have boosted the area of spoken dialog systems. Rather than just converting speech into text, spoken dialog systems aim to perform the challenging task of having a computer use speech to interact with a human.
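As an illustration of the pipeline in Figure 2.1, the sketch below wires the modules into a single turn of interaction. It is only a schematic view: the module names and method signatures (recognize, parse, next_action, generate, synthesize) are hypothetical, not the API of any particular framework.

```python
# Schematic single turn through the standard SDS pipeline of Figure 2.1.
# All module interfaces here are invented for illustration.

def dialog_turn(audio_in, asr, nlu, dm, nlg, tts, backend=None):
    text = asr.recognize(audio_in)                 # speech -> text
    semantics, confidence = nlu.parse(text)        # text -> semantic frame + score
    if backend is not None and semantics.get("needs_query"):
        semantics["db_result"] = backend.query(semantics)  # optional DB lookup
    action = dm.next_action(semantics, confidence)  # DM decides the next action
    prompt_text = nlg.generate(action)             # action -> text
    return tts.synthesize(prompt_text)             # text -> audio for the user
```

In a Galaxy-style architecture such as the ones described in Chapter 3, each of these direct calls would instead be a message routed through a central hub.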
The domains chosen to perform these dialogs are generally constrained and task-oriented. Examples of the domains served by research SDSs are travel planning (Communicator project [119]), weather forecast (MIT Jupiter [155]), public transit information (TOOT [130] and TRAINS [45]), flight schedule information (ATIS [111] and Mercury [123]), real-estate information (AdApt [54] and ApartmentLine [137]), tourist information (Waxholm [15] and Voyager [154]), e-mail access (ELVIS [139]), banking systems (CTT-Bank [91]), restaurant and bar information (DINEX [122] and the Cambridge Restaurant Information system [67]), movie information (MovieLine [137]), navigation (Higgins [126] and CityBrowser [53]), tutoring (ITSPOKE [84]), and interactive collaborative planning (TRIPS [44]).

Spoken dialog systems are nowadays much closer to real users. They have been used to perform simple tasks in real-life scenarios like dealing with missing or delayed flight luggage, magazine subscriptions, or simple tasks in customer services. The Let's Go system [114], in spite of being a research system, has been providing bus schedule information in Pittsburgh since 2005. In this decade, technology companies like Apple, Google or Microsoft have devoted attention to the creation of sophisticated spoken dialog systems. Apple's personal agent Siri [4] and Nuance's Dragon Go! app are the most prominent examples of a new generation of spoken dialog systems that can be used in mobile devices on a regular basis. Many of these systems no longer rely on voice alone to interact with the user. Furhat [93], Greta [109], Max [71] and Ada and Grace [129] are some of the most interesting multimodal dialog systems. Another type of application that makes use of SDSs is service robots. Systems like Flo [117] or CARL [86] are examples of nursebots especially developed for taking care of elderly people, and they often incorporate dialog managers.

Over the years, the research community has tried to build more and more complex spoken dialog systems, enabling them to have conversational dialogs rather than just command-like dialogs. However, this has been one of the hardest trade-offs faced by the research community: conversational dialogs with high speech recognition error rates versus command-like dialogs with lower speech recognition error rates.

The need for trade-off solutions involving stricter interactions within narrower domains derives from the many unsolved problems faced by the speech research community. Spontaneous speech recognition is still a crucial problem for current speech recognizers, since they are not yet prepared to deal with disfluencies, repairs, hesitations and filled pauses. In addition, many of the current systems have to deal with a large population of users (different accents, age ranges, speaking styles), with different acoustic backgrounds (telephones, cellphones, laptop microphones, indoors, outdoors, etc.). Under these conditions, word error rates around 20-30% (or even higher) are common [18]. It has been shown that the word error rate (WER) is inversely correlated with task completion. The damage to task completion is especially critical for WERs above 40% [16]. Over the years, different approaches have been used to reduce the impact of the WER. Some focus on improving the robustness of the ASR front-end and/or acoustic models. Others try constraining the acceptable inputs to the system.
While the former often lead to very small improvements, the latter highly restrict the dialog capabilities of the system. Some of the techniques that may be used to constrain the inputs to an SDS will be described in Section 2.1.1.

Targeting more complex problems is not possible with SDSs that can only deal with command-like utterances. Since the current state of the art in spontaneous speech recognition is far from perfect, in order to be able to deal with more complex dialogs the research community has been working under the assumption that speech recognition errors are very likely. Confidence annotation is used to detect errors in SDS inputs based on a set of context information that is not used in the traditional methods of confidence scoring for speech recognizers [20, 17, 60, 120]. Examples of previous work in confidence annotation will be described in Section 2.1.2. Detecting errors accurately is very important to find the correct strategy to recover from them. Several recovery strategies have been studied for spoken dialog systems [18, 16]. The current state of the art is Dialog-State Tracking [75, 153], sometimes also called Belief Tracking, where a statistical model tries to learn how to conduct a dialog in a specific domain. The related work on this topic will be covered in Section 2.1.3.

This thesis proposes a different strategy to be used in SDSs: on-the-fly lexical entrainment. This strategy aims not only to make the user entrain to the system whenever error detection indicates that a given term is negatively affecting ASR performance, but also to make the system adapt to the user's terms. To provide the necessary context to understand the integration of lexical entrainment in a Spoken Dialog System, Section 2.2.1 will describe entrainment in human-human dialogs, and Section 2.2.2 will cover the previous work on entrainment in SDSs.

2.1 Improving robustness in SDSs

2.1.1 Speech recognition

The success of an interaction with an SDS is often affected by speech recognition errors. One solution to minimize these errors is to have speaker- and noise-adapted models when facing adverse conditions. However, in many dialog systems there is no information about who is talking and from where she/he is talking. This fact prevents any kind of speaker or noise adaptation. Thus, alternative methods to minimize the errors need to be found. The use of a restricted vocabulary and syntax is a straightforward solution to increase speech recognition performance [80]. This could be done using previously developed techniques to find the most acoustically distinct options [116, 131, 3]. This solution does not take into account the user's lexical preference, and may result in the system using terms that the users never pick up and use. Consequently, the system may sound less natural and the user may feel less engaged during the interaction. Another possible effect is that novice users who do not know how to address the system may use words that the system cannot recognize, thus making no progress in the dialog. In most cases the system performance is likely to increase when the vocabulary is constrained, since recognition is likely to work better. However, this comes at a great cost, since the dialogs that can be performed by a spoken dialog system with constrained input are very limited and far from human-like dialog. This is a solution that the SDS research community has tried to avoid over the last
few years, despite being the one adopted in some commercial systems where performance is a key aspect.

2.1.2 Confidence annotation

The use of different information sources to improve confidence scoring in spoken dialog systems has also been explored to improve system performance. In SDSs there are context information sources that can be more accurate than just the ASR confidence score based on the acoustic features used to recognize the speech input. A set of such features is computed during live interaction to improve the error recovery strategies. In [16], these features were provided by the ASR (e.g. acoustic confidence or speech rate), by the Dialog Manager (e.g. the current dialog state, or whether the received answer was more or less expected), or by the Language Understanding module (e.g. the number of slots in the parse). They could also be dialog history features, such as whether the preceding turn was not understood, or prosodic features, such as pitch or loudness. These features can be used to train a fully supervised [85, 60] or implicitly supervised [20] model for confidence annotation. Once the confidence value is computed, the system can adjust the strategy to be taken in the next turn. For instance, when a turn is marked with a low confidence score, the system could either repeat the question or explicitly confirm the information given in the user turn [83]. Accurate confidence scoring is a crucial item in SDSs to determine the best system action to be taken. However, per se it only influences the system's next action without affecting the system's lexical choices.

2.1.3 Dialog Management

The use of dialog-state tracking (sometimes also called belief tracking) in the SDS dialog manager is now the state of the art. Dialog-State Tracking (DST) is a statistical framework for dialog management. Its creation was motivated by the need for a data-driven framework that reduces the cost of laboriously hand-crafting complex dialog managers and that provides robustness against the errors created by speech recognizers operating in noisy environments [153]. Another possibility offered by DST is on-line learning, which was not possible with non-statistical approaches to dialog management. Dialog-state tracking systems combine belief tracking with reinforcement learning for dialog policy optimization. An SDS is modeled by a Partially Observable Markov Decision Process (POMDP). A POMDP is an extension of a Markov Decision Process (MDP), previously used in SDS design [41, 51, 108]. The motivation for adopting the POMDP derives from the uncertainty associated with the ASR input, which does not allow the system to be sure about the user's real intention; thus, the real user intention is not observable to the system [148].

2.1.3.1 POMDP Review

According to [148], a POMDP is defined as a tuple $\{S, A, T, R, O, Z, \lambda, b_0\}$, where $S$ is the state space, $A$ is the action space, $T$ defines the transition probability, $R$ is the immediate reward function, $O$ is the set of observations, $Z$ defines the observation probability, $\lambda$ is a geometric discount factor, and $b_0$ is the initial belief state. At each time step, the system is in some unobservable state $s$. A distribution over all the possible states, $b$, is maintained. Based on this so-called "belief state" $b$, the system selects an action $a$ with an associated reward $r$. In the next time step, the system transitions to $s'$, which depends on $s$ and $a$.
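To make this definition concrete, the sketch below implements the standard exact belief update that the tuple implies, $b'(s') = k \cdot Z(o|s',a) \sum_s T(s'|s,a)\, b(s)$, where $k$ normalizes the new belief. The two-state bus-information example and all probability values are invented purely for illustration.

```python
# Toy exact POMDP belief update over a small discrete state space.
# States, actions, observations and probabilities are illustrative only.

def belief_update(b, a, o, T, Z, states):
    b_new = {}
    for s_next in states:
        pred = sum(T[(s, a)][s_next] * b[s] for s in states)  # prediction step
        b_new[s_next] = Z[(s_next, a)][o] * pred              # correction step
    k = sum(b_new.values())                                   # normalizer
    return {s: v / k for s, v in b_new.items()}

states = ["wants_61C", "wants_28X"]
b0 = {"wants_61C": 0.5, "wants_28X": 0.5}                     # initial belief
T = {(s, "ask_bus"): {s2: 0.9 if s2 == s else 0.1 for s2 in states}
     for s in states}                                         # goals rarely change
Z = {("wants_61C", "ask_bus"): {"heard_61C": 0.7, "heard_28X": 0.3},
     ("wants_28X", "ask_bus"): {"heard_61C": 0.2, "heard_28X": 0.8}}

b1 = belief_update(b0, "ask_bus", "heard_61C", T, Z, states)
# b1 shifts toward wants_61C (about 0.78) after a noisy "61C" observation.
```

Even in this toy example, the updated belief keeps probability on both hypotheses rather than committing to one, which is precisely the robustness to recognition errors that motivates POMDP-based dialog managers.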
This statistical approach to dialog management associates a belief with any user input. That belief is updated when an ASR result is produced, depending on the confidence score. Based on the beliefs available, the dialog manager predicts the user action, which is not observable to the system. Training a model for dialog management requires large amounts of data (according to [153], $O(10^5)$). Often this data is generated by user simulators that were trained with data from previous interactions with spoken dialog systems. A user simulator can reliably generate the necessary amounts of data, which can hardly be collected when developing an experimental system. This type of dialog management achieves remarkable improvements, especially when dealing with low-confidence turns.

2.1.3.1.1 SDS-POMDP

In order to fit an SDS into a POMDP and significantly reduce the complexity of the problem, the state $S$ was divided into three factors: user goal, user intention and dialog history. The factored state-space representation for a Spoken Dialog System was originally presented by Williams and Young in [148]. In [151], an SDS was represented as in Figure 2.2.

Figure 2.2: Spoken Dialog System architecture in [151].

The user has a user state $S_u$, which is the goal she/he is trying to accomplish. The previous user turns can be represented as $S_d$. $A_u$ is the user intention, which will be converted into a speech signal $Y_u$. Once $Y_u$ is recognized and parsed, it can be represented as $(\tilde{A}_u, C)$, where $\tilde{A}_u$ is the language understanding output and $C$ is the confidence score associated with that output. $S_m$ is the maintained system state. According to $S_m$ and $(\tilde{A}_u, C)$, the next system action $A_m$ will be taken and mapped into an audio output $Y_m$. Due to speech recognition errors, $\tilde{A}_u$ can be different from $A_u$, so the real values of $S_u$, $A_u$ and $S_d$ are hidden to the system. The hidden state of the POMDP, $S$, is therefore composed of these three components:

$$ s = (s_u, a_u, s_d) \quad (2.1) $$

The system state $S_m$ now becomes the belief state over these three components, and it is defined by:

$$ s_m = b(s) = b(s_u, a_u, s_d) \quad (2.2) $$

The observations of the SDS-POMDP are given by the noisy language understanding input $\tilde{a}_u$ and the confidence score $c$:

$$ o = (\tilde{a}_u, c) \quad (2.3) $$

Applying these definitions to the original POMDP equations and performing some simplifications (details can be found in [148]), the transition function for an SDS-POMDP is given by:

$$ p(s'|s, a) = p(s'_u|s_u, a_m)\, p(a'_u|s'_u, a_m)\, p(s'_d|a'_u, s'_u, s_d, a_m) \quad (2.4) $$

and the observation function by:

$$ p(\tilde{a}'_u, c|s'_u, s'_d, a'_u, a_m) = p(\tilde{a}'_u, c|a'_u) \quad (2.5) $$

These two equations provide a statistical model of a spoken dialog system. The transition function can predict future behavior, and the observation function tries to infer the hidden state from the given observations. The user goal and user action models (the first two terms of Equation 2.4) can be estimated from an annotated corpus. The dialog history model (the last term of Equation 2.4) can be estimated from data, handcrafted, or replaced by a deterministic model. The observation model can also be estimated from corpora. The immediate reward function is defined according to the system objectives. Equations 2.4 and 2.5 can be used to derive the belief state update equation:
$$ b'(s'_u, s'_d, a'_u) = k \cdot p(\tilde{a}'_u, c'|a'_u) \cdot p(a'_u|s'_u, a_m) \sum_{s_u \in S_u} p(s'_u|s_u, a_m) \sum_{s_d \in S_d} p(s'_d|a'_u, s'_u, s_d, a_m) \sum_{a_u \in A_u} b(s_u, s_d, a_u) \quad (2.6) $$

The influence diagram of the SDS-POMDP in Figure 2.3 summarizes the modifications made.

Figure 2.3: Influence Diagram of a SDS-POMDP. From [148].

Young et al. [153] emphasized that the POMDP-based model for dialog combines two ideas: dialog-state tracking and reinforcement learning. Dialog-state tracking provides an explicit representation of uncertainty, leading to systems that are much more robust to speech recognition errors. The user behavior can be inferred from a distribution of recognition hypotheses provided by an N-Best list or confusion networks. The system is in fact exploring all dialog paths in parallel. The next action is not based on the most likely state, but on the probability distribution across all states. The rewards associated with state-action pairs are the objective measure that reinforcement learning methods will try to optimize. This can be done using either off-line data or on-line interaction with real users.

The main problem with this framework is scalability. As the complexity of the dialog grows, the number of states in the dialog follows this growth and the optimization problem can become intractable. To deal with this problem, efficient representation and manipulation of the state-action space needs to be done using complex algorithms. Policy learning is also challenging, motivating the use of approximation techniques.

2.1.3.2 Dialog-State tracking

The factored state-space approach is not sufficient to reduce the complexity of the problem. Thus, further approximations are needed, such as N-Best approaches and factored Bayesian Networks. The N-Best approaches approximate the belief state by the most likely states. In factored Bayesian Network approaches, the user goal is factored into the concepts that can be spoken about by the system. The following two sections describe each approach in detail.

2.1.3.2.1 N-Best Approaches

The Hidden Information State model (HIS) [150] is one example of an N-Best approach. In this approach, similar user goals are grouped into equivalence classes, called partitions, assuming that all the goals in the same partition are equally probable. The dependencies between the states are defined according to the domain ontology. The partitions are tree-structured, and as the dialog progresses the root partition is divided into smaller partitions, which reduces the problem complexity and enables its implementation in a real-time SDS. To simplify, in the HIS the user intention remains the same during the belief update stage, although this is not necessarily true for all belief update techniques that use an N-Best approach [69]. The N-Best approach can be problematic when the dialogs are too long, since the tree will have more nodes as the dialog progresses. Some pruning techniques have been developed [147, 50]. An effective solution is to compute a marginal probability for each slot; low-probability slot-value pairs are then pruned by recombining them with their complementary slot-value pairs [50].

2.1.3.2.2 Factored Bayesian Networks Approaches

Another approach to updating the belief state is to factor the user goal into concepts, and model the dependencies between concepts with an incomplete distribution that handles a limited number of dependencies while approximating the complete distribution. The factoring process results in a Bayesian network,
From the factoring process results a Bayesian network, 2.1. IMPROVING ROBUSTNESS IN SDSS 15 where belief propagations algorithms are used in order to update the beliefs. The marginals for conditionally independent states are directly computed from the belief propagation algorithms. However, for limited dependencies an approximation for the marginal needs to be computed. Loopy belief propagation (like in the Bayesian Update of dialog state BUDS [133]) or particle filters can be used to solve this problem [1]. This factored approach can also be combined with the N-best approach to take advantage of the benefits of each approach [134]. 2.1.3.3 Policy optimization The policy optimization aims to maximize the reward function at the end of the dialog. According to [153], a non-parametric policy must encode firstly a partitioning of belief space such that all points within any partition map to the same action, and secondly it must encode the optimal action to take for each partition. Since the use of an exact POMDP representation in SDS is intractable, a compact representation for policy is needed. In most dialogs only a subset of the dialog space is used. Thus, optimization can be achieved by computing belief tracking in this subset, where decision-taking and policy optimization will be performed. In order to achieve this behavior, the belief state in the dialog space b is mapped to a vector b̂ and a set of candidate actions â. The policy will be used to select the best action to take for a set of candidate actions, and then map it back to the full action in the full dialog space. This mapping requires two different steps: select the candidate actions in the master space, and extract the features from the belief state and candidate actions. In the first step, the selected action could simply be the action with highest belief [133]. However, this could lead to inadequate action choices, such as “inform welcome” in the middle of the dialog. To mitigate this problem, human knowledge has been incorporated to select the candidate action [146]. Other approaches use the whole set of actions, but contain the slots that are connected to each action, using handcrafted heuristics [152]. In the second step, one binary feature is normally created for each dialog act. The dimensionality of this vector will typically depend of the task. The typical state features according to [153] are: the belief in the top N user goals or partitions; the top marginal belief in each slot; properties of top user goal or partition; the 16 CHAPTER 2. RELATED WORK system actions available; dialog history properties; most likely previous user actions [152, 81]. Some of these features are selected either by hand or through automated selection, and may also be complemented with features like those used for confidence annotation. For each summary space, the policy may follow a deterministic mapping π(b̂) → â, or a conditional probability distribution π(b̂, â) = p(â|b̂) where the action is selected by sampling the distribution. This policy is now a function of the summary space belief state and action. Similarly, the Q-function which provides the expected discount sum of rewards, can be represented for the summary space. In this case, the approach to find an optimal policy, is to maximize the Q-function for the summary space: π ∗ (b̂) = arg max Q∗ (b̂, â) â (2.7) Methods like Monte-Carlo optimization, least-squares policy iteration or natural actor-critic have been used in SDSs. 
Currently, optimized methods like Q-Learning [121, 39, 107] or SARSA [58, 47] are used in SDSs policy optimization. The details of dialog-state update and policy optimization in the statistical dialog manager used in this thesis will be described in Section 3.2.3.2. 2.1.3.4 Summary Although this the use of statistical dialog manager has produced enormous improvements in the system performances there are some drawbacks in their use. First, the models are very difficult to scale to more complex dialogs and second the portability to new domains requires considerable amount of work. There has been some recent work to mitigate this problem [49] with promising results, where the adaptation is made from an already existing domain. There has been also work on reducing the amount of training data required to create a new model. In [78], the results have shown that using discriminative methods even with a limited set of features can lead to improvements even with there is some mismatch between the train and the test data. 2.1. IMPROVING ROBUSTNESS IN SDSS 2.1.4 17 Language Generation In most SDSs, language generation follows a template-based approach, which despite being easier to handle, makes the creation of prompts for a new system time-consuming. Some work has been done in this field to avoid this limitation using data driven methods for language generation. Some of the techniques used were bi-gram language models [98], ranked rules from a corpus of manually ranked training examples [142] or a given a content plan select a sentence for a set generated from clause-combining operation [127]. Another approach to language generation was the use of a machine translation engine to create sentences in Natural Language from a Internal System representation, like what was done in Mountain [74]. Recent approaches try to apply reinforcement learning to Language Generation. In [79], reinforcement learning for Dialog Management is combined with reinforcement learning in Language Generation to adjust the prompts according to the number of concepts that the system was able to extract from the user input. There are three possibilities to place the concepts in the system prompts: list, contrast and cluster. This strategy was trained and tested on a simulated user and there was an increased reward when RL for Dialog Management was combined with RL for Language Generation. RL for Language Generation was also used to optimize the use of Referring Expressions (RE) in an Internet Service Provider customer service SDS [66]. The goal was to adjust the words used in the system prompts to the level of expertise of the customer. For this purpose two models were created: one to deal with expert users and another to deal with novice users. The models were first tested with a user simulator. This user simulator is somewhat different from those used to train DM. In addition to the action level representation, it also has the RE level representation. There were 90 different user models in the test set. The RL methods achieved better rewards and lower number of turns per dialog comparing to rule-based adaptive methods or non-adaptive methods. This data driven strategy was tested with a limited set of real users with promising results in terms of objective system measure and subjective user feedback [65]. Although the methods described involve some sort of prompt adaptation to the user, they have not used entrainment features in their learning methods. In the first case presented [79], 18 CHAPTER 2. 
RELATED WORK it is the way the system is presenting the information to the user that is adjusted. In the second case [66], the system is adapting RE to the user level of expertise. In our case we are targeting for an adaptation to each user that is supported from previous findings from Lexical Entrainment in human-human dialogs. The following sections will give an insight of what has been studied human-human entrainment, and how this was transferred to human-computer entrainment. 2.2 Entrainment The approaches presented so far have produced improvements in SDS’s performance. But they also have drawbacks. The choice for longer and acoustically distinct words could lead to highly dispreferred lexical choices. The improvements on confidence scoring and statistical dialog management do not let the system behavior be influenced by the user, that is they do not adapt to the user during live interaction. The improvements in natural language generation have already introduced the adaptation to the type of user as a possible direction to improve the system performance. But the question of how should the system is not yet given. In this section, an insight on the research performed for entrainment in human-human dialogs followed by recent approaches to use this knowledge in spoken dialog systems is going to be given. 2.2.1 Entrainment in Human-Human Dialogs Entrainment is beginning to be recognized as an important direction in SDS research [59]. It has been reported that in human-human dialogs entrainment occurs at various levels: lexical, syntactic/semantic and acoustical, and the different levels elicit entrainment among one another [106]. The lexical entrainment studies carried out for human-human dialogs have showed that subjects establish implicit conceptual pacts with one another in order to achieve success in task-oriented dialogs [29]. Sets of dialogs were studied where participants collaborated to 2.2. ENTRAINMENT 19 co-ordinate word choice rather than only using their own preferred words. They followed the output/input coordination principle [48], which states that the next utterance is going to be formulated according to the same principles of interpretation as the previous successful utterances. This coordination is not reached by explicit negotiation of the lexical items to be used, but rather through imitation during the interaction. Frequency is also important. The more common is a particular conceptualization, the stronger is the conceptual pact [28]. Reitter and colleagues [115] introduced priming as the processing of influencing the linguistic decisions of the other interlocutor. Hence, the linguistic structures that will be used to influence the linguistic decisions can also be called primes. In their study, evidences of priming are more visible in task-oriented dialogs, which is the domain of most SDSs. In addition, a mathematical entrainment measure was developed, and high correlation was found between this measure and success in task-oriented human-human dialogs [94]. This constitutes a theoretical background that could be automated to try to establish conceptual pacts between the system and its users. When combined with dialog-state tracking dialog management and accurate confidence scoring, this is likely to increase the system performance. 2.2.2 Entrainment in Human-Computer Dialogs There were some studies that established difference between human-human and humancomputer dialogs [26, 28, 24]. When communicating with machines, humans use abbreviated and telegraphic strings. 
2.2.2 Entrainment in Human-Computer Dialogs

Some studies have established differences between human-human and human-computer dialogs [26, 28, 24]. When communicating with machines, humans use abbreviated and telegraphic strings. Another difference found is that in human-human conversation a new term has to be repeated two or more times before uptake occurs and a conceptual pact is established [29]. In human-computer dialogs, new primes generally start to be used immediately after they are introduced by the system, including those that are less frequent in daily use, as reported in the literature [24]. Users tend to adopt the system's terms because they think that the system is not in a position where it can negotiate with them, and that using the computer's terms is less likely to provoke errors [28]. As was verified in human-human dialogs, the main motivation for entrainment is successful communication [24]. This phenomenon could be stronger in human-machine dialogs, leading users to use highly dispreferred linguistic structures whenever they believe that this is necessary for successful communication. Users see the computer's ability to understand them as limited and domain-constrained. One of the motivations to use lexical entrainment is to avoid the idea of an inflexible SDS and to eliminate the use of highly dispreferred lexical items.

Lexical entrainment has already been tested in a text-based dialog system [90]. There were improvements in system performance, and the users preferred the system that performed entrainment. Lexical entrainment was first tested in an SDS with the Let's Go system [128]. Some syntactic structures and terms were modified to study their impact on the user's choice of words. The different word choices did not correspond directly to words that affected the dialog, although an influence on concept acquisition was shown. In [99], the authors confirmed that real users also entrained to SDSs. They modified terms that the system had been using for a long time and that, in addition, affected the dialog. They observed that some terms were adopted more than others. This is evidence that a policy to select better terms can be a valuable resource during the interaction, since users show stronger preferences for some terms than for others.

Just as in human-human dialogs, entrainment at different levels has also been studied in human-machine dialogs. Users tend to follow the same syntactic pattern as the question they are answering [25]. Entrainment at the lexical and acoustic/prosodic levels was investigated to find correlations between entrainment and learning [143]. Acoustic entrainment was also used to modify the way users address the system [43]. This is especially useful when shouting or hyper-articulation is found, since these speech styles negatively affect speech recognition performance. First, the authors developed methods to automatically detect these speaking styles. They then tested different strategies to deal with what was detected on-the-fly: explicitly asking users to revert to their normal style, changing the dialog slot, or changing the system's speaking style. This last strategy makes users acoustically entrain to the system, returning them to a "neutral" speaking style (softer, getting them to stop shouting, for example) that was more likely to be successfully recognized by the system. Evidence of acoustic entrainment was found. All of the strategies performed better than the baseline system, in which the system did not detect the user's speaking style.
2.3 Summary

This chapter summarized the related work done to improve SDSs. We started by giving examples of research SDSs previously developed in many different domains. Then we summarized previous work done to improve the performance of SDSs. We have shown that this improvement can be achieved by reducing the system vocabulary, by improving confidence annotation, by adapting the natural language generation or by using state-tracking dialog management. This last approach was extensively described, since it is the current state-of-the-art in dialog management. We have also introduced the theoretical background and motivations for the implementation of lexical entrainment in SDSs. The last section was dedicated to examples of previous entrainment experiments conducted with dialog systems.

3 Creating Noctívago, a Portuguese Let's Go

This chapter will follow the steps in the creation of an SDS for European Portuguese. Our choice was to develop a Portuguese counterpart for Let's Go. In this chapter, this option will be explained and the two systems will be contrasted in order to highlight what was developed in this thesis.

3.1 Choosing a Framework for Noctívago

Two different frameworks were under consideration to develop an SDS for the domain we were targeting. The following two sections will describe DIGA and Olympus, in order to establish a contrast between the two frameworks and explain our preference for Olympus.

3.1.1 DIGA

One of the frameworks under consideration was DIGA [95], whose block diagram is shown in Figure 3.1. The architecture is based on the Galaxy-II HUB [124], which establishes a socket-based protocol for communication between the different modules. Galaxy works as a hub that routes messages between the different system components. In DIGA, the Audio Manager deals with both input and output speech, and the Service Manager is responsible for executing the user requests. The adopted speech recognizer is Audimus [96]. The text generated by Audimus is passed to the Language Interpretation module and then to the STAR dialog manager [87]. The information is sent to the Service Manager to perform the backend query. The resulting action is transferred to the NLG module to generate the text. Finally, the text is synthesized using the DIXI+ text-to-speech (TTS) system [101] and played using the Audio Manager.

Figure 3.1: Architecture of the DIGA framework for SDSs.

This framework has some advantages in view of a domain adaptation to a new system in European Portuguese (EP). The most obvious is the fact that the ASR and TTS for EP had already been integrated for other systems. However, the adaptation to a new domain also required modifications in the Service Manager and Language Interpretation modules. The former processes the user input text in a pipeline that involves several stages, whose final result is a candidate list; the speech act is selected according to the services available in the Service Manager. The lack of support and documentation would create many difficulties for this adaptation task. For these reasons, a different option was also considered, the Olympus framework for SDSs [22], which will be presented in the following section.
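In this HUB-centered design, modules never call each other directly; they only exchange frames through a central router. The schematic sketch below illustrates the idea. The module names and the frame format are invented, and the real Galaxy-II HUB uses a socket-based protocol driven by a routing script:

    # Schematic sketch of centralized message routing between SDS modules.
    class Hub:
        def __init__(self):
            self.servers = {}      # module name -> callable handling a frame

        def register(self, name, handler):
            self.servers[name] = handler

        def route(self, frame):
            """Deliver the frame to the module named in its 'to' field and
            keep routing whatever frames the modules return in response."""
            while frame is not None:
                handler = self.servers[frame["to"]]
                frame = handler(frame)

    hub = Hub()
    hub.register("parser", lambda f: {"to": "dm", "parse": f["text"].split()})
    hub.register("dm", lambda f: print("DM received:", f["parse"]))
    hub.route({"to": "parser", "text": "próximo autocarro"})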
3.1.2 Olympus

The reference architecture for Olympus is pictured in Figure 3.2. Like DIGA, it also relies on the Galaxy-II HUB to implement a centralized message-passing infrastructure for communication between the different modules. The reference architecture has a recognition server that performs Voice Activity Detection (VAD) and sends the voiced frames to the connected speech recognizers (several recognition engines can be connected and work in parallel). The reference recognizer is Sphinx [63]. The result of the recognition is then passed to the NLU module, which is divided into two separate modules: the Phoenix semantic parser [145] and the Helios confidence annotator [17]. Phoenix generates a semantic representation of the text input, and Helios attributes a confidence score to that semantic representation. The result is passed to the Ravenclaw [21] dialog manager, which interprets the semantic representation, selects the next action and performs the back-end query to get the information requested. The semantic representation of the system action is generated by Ravenclaw and transferred to Rosetta, a template-based language generator [119], to create a text representation of the system action. This text is finally sent to Kalliope [22], a synthesis manager that is compatible with various synthesis engines such as Microsoft's SAPI [92], Festival [11] or Cepstral Swift [36]. The TTYServer is an additional component that allows Olympus to work in a text input/output mode, leaving out the speech recognition and synthesis. The text input version is very useful not only for development purposes, but also to connect Olympus to an Embodied Conversational Agent, as will be detailed in Section 3.2.5.

Figure 3.2: Olympus reference architecture. From [22].

Olympus was developed to provide means for flexible adaptation to new domains. It is possible to replace any component by an equivalent one. In addition, a task specification language was developed to easily create new Ravenclaw-based dialog managers. This is reflected in the dozen or so systems that were created for different domains using Olympus. A few examples are RoomLine for room reservations [18], Let's Go for bus schedule information [114], LARRI for aircraft maintenance tasks [19] and TeamTalk for command and control operation of robots [55].

Besides domain adaptation, in this case we also needed a framework that could facilitate the language transfer, since in the reference architecture the modules are prepared only for English. This task would involve changes in the ASR, NLU, NLG and TTS modules. Despite the fact that many modules would be affected, the changes were made at a high level: new lexicon, language and acoustic models; a new semantic grammar specification; new NLG templates; and the installation of a new synthesizer for Portuguese that works with Microsoft's SAPI synthesis engine. The only drawback was that Sphinx would have to be replaced by a different speech recognizer for Portuguese. This would require the total replacement of one of the system modules and the implementation of the communication protocol for the Galaxy-II HUB in Audimus. However, these modifications were mostly straightforward, whereas the modifications needed in DIGA to adapt it to a new domain were not. Another important factor was the fact that Let's Go had been using Olympus for several years. Given that Noctívago works in the same domain, the expertise gathered in Let's Go could be used in the development of Noctívago. These were the reasons why the Olympus framework for SDSs was used throughout the work developed in this thesis.
3.2 Modules Used

The experiments held in this thesis involved two different spoken dialog systems that, despite targeting the same domain, bus schedule information, differed in language, type and number of users. Let's Go is a system that provides bus schedule information to real users during off-peak hours. The system has been running live since 2005, and receives an average of 40 calls during weekdays and 90 during weekends. Noctívago was the experimental system developed for European Portuguese. It was inspired by Let's Go and provides bus schedule information for night buses in Lisbon. The choice to cover night buses was natural due to the similarities between the bus frequencies of Lisbon's night buses and off-peak-hour buses in Pittsburgh. Originally, Noctívago covered 8 night bus lines. In this section, the components of each system will be described, compared and contrasted.

3.2.1 Speech Recognition

Although the recognizer in the original Olympus architecture was Sphinx-II, the current version of Let's Go uses PocketSphinx [64]. Noctívago uses the Audimus speech recognizer [96] for European Portuguese.

3.2.1.1 PocketSphinx

Let's Go uses two gender-specific acoustic models and context-dependent language models in parallel. The 1-best hypotheses generated by each of the recognizers are parsed, and the hypothesis annotated with the highest confidence score is selected. Both the general-purpose and the context-dependent language models were trained with one year of Let's Go data, plus data collected through crowdsourcing. The resulting dataset comprises 18070 dialogs, corresponding to 257658 user turns. The acoustic model was trained using a corpus of 2000 real dialogs that were manually transcribed [113].

3.2.1.2 Audimus

As already mentioned in Section 3.1.2, the first step was to implement the communication protocol used for speech recognizers in Olympus. The Audimus acoustic models for telephone speech were originally developed for a broadcast news task, with a 100k-word vocabulary and a statistical backoff 4-gram language model. The acoustic models were gender-independent, thus only one Audimus engine needed to be connected to the recognition server.

The original language model and vocabulary were inefficient for the current task, both in terms of processing time and word error rate. Building appropriate language models for this task was thus one of our first goals. Since there was no real data available, an automatically generated corpus based on the parser grammar description was used for this purpose. Our first attempt comprised a very small context-independent corpus of 30k sentences, with a vocabulary of around 300 words. The pronunciation lexicon was built using an in-house lexicon and a rule-based grapheme-to-phone conversion system [35]. In an early stage, the lexicon included a total of 397 pronunciation entries. Since Noctívago is a newly created system, once new data was collected, it was used to iteratively improve the language model. In addition, new bus lines were added to the baseline system version, which increased the number of words in the vocabulary.
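The corpus generation just described amounts to sampling sentences from a weighted grammar. A minimal sketch is shown below; the rules, words and weights are invented, whereas the real corpora were generated from the parser grammar description mentioned above:

    import random

    # Toy weighted grammar: non-terminals expand into weighted rules,
    # anything not in the grammar is a terminal word.
    GRAMMAR = {
        "S": [(["quero", "ir", "de", "STOP", "para", "STOP"], 0.6),
              (["próximo", "autocarro"], 0.4)],
        "STOP": [(["cais", "do", "sodré"], 0.5), (["rato"], 0.5)],
    }

    def expand(symbol):
        if symbol not in GRAMMAR:
            return [symbol]                      # terminal word
        rules, weights = zip(*GRAMMAR[symbol])
        rule = random.choices(rules, weights=weights)[0]
        return [w for s in rule for w in expand(s)]

    # Generate an artificial corpus (30k sentences, as in our first attempt).
    corpus = [" ".join(expand("S")) for _ in range(30000)]
    print(corpus[:3])

Setting the rule weights from frequencies observed in collected data is what later replaced the uniform defaults, as discussed in Section 4.5.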
Another improvement was the introduction of context-dependent language models for Place (when the system was requesting departure and arrival stops), Time (when the system was requesting the travel time), Confirmation (when the system was doing explicit confirmation) and Next Query (when the system was asking for the following action, after the result of the query was provided). Place and Time were statistical bi-gram language models, whereas Confirmation and Next Query were modeled as SRGS (Speech Recognition Grammar Specification) grammars, since the number of possible inputs was far smaller than in the two other context-dependent models. The 300k-sentence artificial corpora generated to train the bi-gram language models were created using context-dependent grammars based on the frequencies observed during the previous data collections. The number of different words in the Place and Time corpora was 654 and 144, respectively. The SRGS grammars were also designed according to the answers observed in previous data collections. In order to be used by Audimus, the SRGS grammars were converted to language models on the fly. The number of words that could be recognized with the context-dependent model for Confirmation was 25, whereas for Next Query it was 51. These last models were used in the tests that will be described in Section 6.2.3.1, and they had a positive impact on the system performance, as will be shown later.

3.2.2 Natural Language Understanding

For NLU, both systems use the same components: Phoenix for semantic parsing and Helios for confidence annotation. According to [145], Phoenix is designed for the development of simple, robust natural language interfaces to applications, especially spoken language applications. Because spontaneous speech is often ill-formed, and because the recognizer will make recognition errors, it is necessary that the parser be robust to errors in recognition, grammar and fluency. This parser is designed to enable robust partial parsing of these types of input.

Phoenix parses each input utterance into a sequence of one or more semantic frames. A Phoenix frame is a named set of slots, where the slots represent related pieces of information. Each slot has an associated context-free semantic grammar that specifies the word string patterns that can fill the slot. The grammars are compiled into recursive transition networks, which are matched against the recognizer output to fill slots. Each filled slot contains a semantic parse tree with the slot name as root. A new set of frames and grammar rules was specified for our system to fill each slot in a frame. In addition to producing a standard bracketed string parse, Phoenix also produces an extracted representation of the parse that maps directly onto the task concept structures. A new grammar was created for Noctívago based on the original Let's Go grammar. Some modifications were introduced in the grammar due to the dialog strategy that was followed.

Helios relies on a logistic regression model trained with a large number of features, as detailed in Section 5.2. The model used in Let's Go was trained on real data from collected dialogs. Noctívago used the model provided with the Olympus tutorial system for the development of general-purpose systems, since no previous data was available; it was trained with RoomLine data. The problem of training a new confidence model for a new system is challenging, since the model is highly dependent on the system components. The creation of an appropriate confidence model for Noctívago will be addressed in Chapter 5.
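To make the frame/slot mechanism described above more tangible, the sketch below fills the slots of a single frame with regular-expression patterns. The slot names and patterns are invented and far simpler than real Phoenix grammars, which are compiled into recursive transition networks and support robust partial parses:

    import re

    # Hypothetical slot patterns for a bus-information frame.
    SLOTS = {
        "origin": re.compile(r"\bfrom (\w+)"),
        "destination": re.compile(r"\bto (\w+)"),
        "time": re.compile(r"\bat (\d{1,2}(?::\d{2})?)"),
    }

    def parse(utterance):
        """Fill whichever slots match; unmatched material is ignored,
        mimicking the robustness to ill-formed input described above."""
        frame = {}
        for slot, pattern in SLOTS.items():
            m = pattern.search(utterance)
            if m:
                frame[slot] = m.group(1)
        return frame

    print(parse("uh I need the bus to oakland from downtown at 10:30 please"))
    # {'origin': 'downtown', 'destination': 'oakland', 'time': '10:30'}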
3.2.3 Dialog Management

Two different frameworks for dialog management will be described in this section: Ravenclaw and Cornerstone. The first is a plan-based dialog management framework, whereas the second is a statistical framework for dialog management. The first experiments with Noctívago, described in Sections 4.3 and 6.1.2.2, used Ravenclaw as the dialog manager. The experiment run with Let's Go (Section 6.1.2.3) and the final experiment done with Noctívago (Section 6.2.3) used the Cornerstone dialog manager. In this section, both approaches will be briefly described.

3.2.3.1 Ravenclaw

Ravenclaw was designed to make it easy to build new SDSs in different domains. Its two-tier architecture encompasses a Dialog-Task Specification, which is domain-dependent, and the Dialog Engine itself, which is domain-independent. The Dialog-Task Specification is a hierarchical plan for the interactions, provided by the system developer. The Dialog Engine deals with error handling, timing and turn-taking, and other dialog resources such as providing help, repeating the last utterance, quitting, starting over, etc. These tasks are similar across different domains.

The task is specified as a tree of dialog agents, each one responsible for handling a subtask of the dialog. Agencies are subtasks that will be further decomposed. Inform Agents represent the system providing information to the user. Request Agents represent the system asking the user a question and understanding the answer. Execute Agents represent the non-conversational actions that the system must perform, such as querying databases. Expect Agents are similar to Request Agents, except that they do not ask the user any question; they just perform the understanding part. By default, a dialog given a task specification unfolds by depth-first, left-to-right traversal: Agencies are immediately traversed towards their children agents, and the children agents are traversed from left to right. This is done using a dialog stack that captures the discourse structure during the interactions. During the Execution Phase, Agencies are pushed onto the stack and, once they are completed, they are popped from the stack; control of the execution is then given back to the parent node of the tree. The completion or execution of Agencies can be triggered by a defined precondition, as well as by success or failure criteria.
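The sketch below illustrates the stack-based, depth-first left-to-right execution scheme in a few lines. The agent names are invented, and the sketch omits preconditions, success and failure criteria, and the return of control to the parent Agency:

    # Minimal sketch of depth-first, left-to-right, stack-based execution.
    class Agent:
        def __init__(self, name, children=None, execute=None):
            self.name, self.children, self.execute = name, children or [], execute

    def run(root):
        stack = [root]                              # the dialog stack
        while stack:
            agent = stack.pop()                     # agent on top of the stack
            if agent.children:                      # an Agency: push children
                stack.extend(reversed(agent.children))  # leftmost runs first
            elif agent.execute:                     # a leaf agent
                agent.execute()

    tree = Agent("PerformTask", [
        Agent("RequestOriginPlace", execute=lambda: print("Where from?")),
        Agent("RequestDestinationPlace", execute=lambda: print("Where to?")),
        Agent("RequestTime", execute=lambda: print("At what time?")),
    ])
    run(tree)   # Where from? / Where to? / At what time?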
Figure 3.3 shows the final task specification developed for Noctívago. The system starts by giving the introduction to the users. Then the PerformTask agent passes control to the GetQuerySpecs agent to collect the data for the query. The tree starts to be traversed depth-first, left to right. Since the first agent on the left-most end of the tree is an Expect Agent, control is given to the GetOriginPlace Agency. It starts by asking for the origin place. If the answer given is a neighborhood rather than a bus stop, control is passed to the GetOriginPlaceNeighborhood Agency, which executes its leaf agents to get the correct neighborhood stop. This is possible due to a trigger placed in the Agency that only starts the neighborhood strategy if the parsed input is indeed a neighborhood. Otherwise, the DM tries to complete the next Request Agent, RequestDestinationPlace. The ExpectRouteForContinuation agent is used when the user has chosen to continue the trip after already having been given the first result. The GetDepartureArrivalDisambiguation Agency is triggered if the departure and arrival stops are the same, in order to detect which one is correct. The next Request Agent in the tree is RequestTime, which fills the time slot in the query. Finally, before control is returned to the PerformTask Agency, the QueryConfirm agent checks whether the data collected is correct. If not, RequestWrongSlot asks which slots were mistakenly filled and clears them. This flow is repeated until the user says that all the slots were correctly filled. Control is then passed to ProcessQuery, which executes the database query. Once the result of the query is available, the GiveResults Agency is executed. If the system finds a bus schedule, the InformSuccess Agent is executed; if not, InformError is pushed onto the stack. After informing the user, RequestNextQuery asks the user what to do next. If the user says goodbye, the GiveResults Agency is successfully completed and the Noctívago Agency executes the GreetGoodbye Inform Agent. If the system understands that the user wants a new query, ConfirmNewQuery explicitly confirms this information. If the user confirms, the system informs the user that a new query is going to start, the concept values are cleared, and the GetQuerySpecs Agency is executed again. If the user requests the price, the InformPrice Agent is executed. Once the price information has been provided, the system executes RequestNextQuery again.

This tree must be specified in the RavenClaw Task Specification Language, a set of predefined C++ macros that are expanded to native code when the system is compiled. Other fundamental elements that need to be specified for each task are the concepts, which store all the data that the system manipulates during the interaction. The other fundamental data structure associated with the concepts is the expectation agenda. This is the record of the concepts that the system is expecting to receive at any given moment. During the Input Phase of the dialog, the information provided by the user is transferred to the concepts using the expectation agenda. This agenda is tightly connected to the task specification. For instance, in the bus schedule domain, if the system asks for the departure stop, the most expected user input is a bus stop, rather than the time she/he wants to travel, which should be placed at a lower level of the expectation agenda. Ravenclaw also encapsulates a set of error recovery strategies (explicit or implicit confirmation, help, repeat, etc.) and task-independent dialog acts (quit, start over, timeout, etc.). For more details about Ravenclaw, the reader should consult [16].

3.2.3.2 Cornerstone

In 2012, the Let's Go Ravenclaw dialog manager was replaced by a POMDP-based DST dialog manager, Cornerstone. The dialog agencies used in Let's Go with RavenClaw are now modeled as states. Instead of using an agenda, the transitions between states are given a probability model. The model was trained using 2718 real Let's Go dialogs, corresponding to 23044 user turns. This model was used throughout the tests performed using Cornerstone.
Figure 3.3: Tree for the Noctívago task implemented in Ravenclaw. (The figure shows the agent hierarchy under the Noctívago Agency — GetQuerySpecs, ProcessQuery, GiveResults and their child agents — with a key distinguishing Execute, Request, Inform and Expect Agents and Agencies.)

The complete description of the dialog manager, the results achieved and the comparison with the previous dialog manager can be found in [75]. The description of the methods used to train a DST dialog manager with unsupervised data can be found in [76]. In Section 6.1.2.3 the dialog manager uses the same dialog strategy that was used for Let's Go, with the implementation of the prime selection algorithm described in Section 6.1.2.3. In Section 6.2.3.1 we introduced small changes in the dialog strategy, taking into account the strategy previously used with RavenClaw, and we also trained a new confidence calibration model with the data collected up to that point. Besides the rule-based algorithm used in the Let's Go experiments, this DM also includes the implementation of the data-driven algorithm that will be described in Section 6.2.3. This section summarizes the techniques used for dialog-state tracking and dialog policy optimization to train the dialog manager used in the tests held with Let's Go, as well as in the last test done with Noctívago.

3.2.3.2.1 Dialog-State Tracking

Previous approaches have used handcrafted procedures or simple maximum likelihood estimation to learn the parameters of partition-based models [68, 118, 133]. One of the parameters to be learned is the user action model, which defines the likelihood of the user action given a particular context. This model would be more refined if learned from real data. Taking advantage of the large collections of transcribed data available for Let's Go, a Dynamic Bayesian Network (DBN) was used in order to learn the user action model from that data. Assuming several conditional independencies, the graph for the DBN is shown in Figure 3.4. To perform inference over the graph model, two components need to be specified: the history model, p(h_t | h_{t-1}, u_t, s_t), and the observation model, p(o_t | u_t). The history model indicates how the dialog history changes and can be set deterministically, for instance:

    p(h_t = \mathrm{informed} \mid h_{t-1}, u_t, s_t) = \begin{cases} 1, & \text{if } h_{t-1} = \mathrm{informed} \text{ or } u_t = \mathrm{inform}(\cdot) \\ 0, & \text{otherwise} \end{cases}    (3.1)

Figure 3.4: DBN graph for the Let's Go state tracking. From [75].

The observation model is computed using a deterministic function that measures the degree of agreement between u_t and o_t:

    p(o_t \mid u_t) = CS(o_t) \cdot \frac{|o_t \cap u_t|}{|o_t \cup u_t|} + \varepsilon    (3.2)

where CS(o_t) is the confidence score provided by Helios for a given observation.
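Equation 3.2 can be transcribed almost literally into code. In the sketch below, an observation and a hypothesized user action are represented as sets of concept-value pairs; the example values and the confidence score are invented:

    EPS = 1e-4

    def observation_model(o_t, u_t, confidence):
        """p(o_t | u_t) = CS(o_t) * |o_t & u_t| / |o_t | u_t| + eps"""
        if not o_t and not u_t:
            return confidence + EPS   # degenerate case: nothing observed
        return confidence * len(o_t & u_t) / len(o_t | u_t) + EPS

    o = {("origin", "forbes")}
    u = {("origin", "forbes"), ("time", "10:30")}
    print(observation_model(o, u, confidence=0.8))   # 0.8 * 1/2 + eps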
Finally, to estimate the user action model, the mean field theory approach [100] was used for Bayesian learning, in order to avoid overfitting. The method results in the posterior probability over the parameters of the user action model φ:

    q^*(\phi) = \prod_{i,j} \mathrm{Dir}(\phi_{i,j} \mid \alpha_{i,j})    (3.3)

    \alpha_{i,j,k} = \alpha^{0} + \mathbb{E}_{H,U}[n_{i,j,k}]    (3.4)

where n_{i,j,k} represents the number of times that s_t = i, h_{t-1} = j and u_t = k, and Dir(·) denotes the Dirichlet distribution. α^0 is the parameter of the uninformative prior Dirichlet distribution, the same for all components for symmetry. The junction tree algorithm was used to estimate the quantity E_{H,U}[n_{i,j,k}]. Further details about the user action model learning process can be found in [76].

Another new feature of Cornerstone was the use of explicitly confirmed turns to calibrate the confidence score. The use of explicitly confirmed entries eliminates the need for annotated corpora to calibrate the confidence score in the future. The system logs were parsed, and the entries that were explicitly confirmed and appeared in the database query were added to the dataset used to tune the confidence score. To determine correctness, the observed data was compared to the explicitly confirmed information (the error rate for explicit confirmation turns in Noctívago is around 15%, half of the value found in the other turns, around 30%). A Gaussian kernel-based density estimation was applied to the two sets of confidence scores collected with respect to correctness. The two densities were scaled by the number of elements in order to see how the ratio of the density of correct scores, d_c(c), to the sum of the densities of correct and incorrect scores, d_c(c) + d_{inc}(c), varies with the confidence score. The calibrated confidence score is given by Equation 3.5:

    c' = \frac{d_c(c)}{d_c(c) + d_{inc}(c)}    (3.5)

In order to efficiently compute this value for a given confidence score, a sparse Bayesian regression with a Gaussian kernel was used [76].

3.2.3.2.2 Policy Optimization

The policy optimization was performed using a sparse Bayesian learning (SBL) approach to rapid reinforcement learning. Lee and Eskenazi [75] chose this method because SBL can generate sparse models by endowing sparsity-enforcing priors, without a sparsification threshold. The details of the application of SBL to solve Equation 2.7 can be found in [76]. Although new data is continuously being generated in Let's Go, only the latest data is used to update the model, keeping the learning procedure efficient. To apply SBL-based reinforcement learning to policy optimization in this task, a simplified representation of the belief state was adopted, as suggested by the work mentioned in Section 2.1.3.3. The state representation for dialog strategy learning was a tuple with the belief of the top hypothesis and the marginal belief of each concept. Since the actual value of the concept of a system action can be easily determined, a set of concept-level system actions was also used to improve computational efficiency. The basis function consists of two sub-functions: a Gaussian kernel to represent the proximity between two belief states, and a discrete kernel to relate two system actions. The maximum history size was set to 1000. The reward function defines a -1 reward for each turn, except the last turn of a successful dialog, where a reward of 20 was given. During the learning process the system explores the state-action space randomly (with probability 0.1) or acts according to the partially optimized policy. Finally, a set of handcrafted rules was implemented to avoid undesirable actions (such as confirming empty concepts). To train the dialog policy, the user simulator described in [77] was used.
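Returning to the confidence calibration of Equation 3.5, the sketch below builds the two scaled Gaussian kernel density estimates and computes the calibrated score as their ratio. The score lists and the bandwidth are invented, and the real system replaces the direct ratio with a sparse Bayesian regression for efficiency:

    import math

    # Invented confidence scores of explicitly confirmed turns.
    correct   = [0.9, 0.85, 0.8, 0.7, 0.95, 0.6]
    incorrect = [0.3, 0.4, 0.55, 0.2, 0.5]

    def kde(points, x, bandwidth=0.1):
        """Gaussian kernel density estimate; the sum over points scales
        each density by its number of elements, as described above."""
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    def calibrate(c):
        """Equation 3.5: c' = d_c(c) / (d_c(c) + d_inc(c))."""
        dc, dinc = kde(correct, c), kde(incorrect, c)
        return dc / (dc + dinc)

    print(calibrate(0.75))   # raw score 0.75 -> roughly 0.95 calibrated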
3.2.4 Natural Language Generation and Speech Synthesis

Both Let's Go and Noctívago use Rosetta, a template-based natural language generation module. The template is composed of hand-written entries, which can also be programmatic functions (to deal with missing slots in a template sentence). These are used to convert a semantic frame generated by the dialog manager into a sentence to be read by the synthesizer. We created new templates for the Noctívago system. These templates, as well as the templates used in Let's Go, were later modified to be used in the entrainment studies. When a system runs with an entrainment policy, the primes are treated as slots to be filled in the available templates. The prime value is computed within the DM, and the frame received by the NLG component is parsed to extract the action, the concept values and the prime to be used. Similarly to the other concepts, the prime is extracted and placed in the slot created for it.

Both systems use the Kalliope synthesis manager. The synthesis engine used in Let's Go is a Cepstral Swift engine with a domain-adapted voice developed with the techniques described in [12]. The voice was trained on a corpus of 1702 in-domain utterances plus 1131 out-of-domain utterances recorded by the same speaker. The resulting voice quality is very high and often mistakable for prerecorded utterances. Noctívago uses the general-purpose Festival-based DIXI synthesizer [101] with SAPI support, which was easily connected to the Kalliope synthesis manager.

3.2.5 Embodied Conversational Agents

The audio input in the Olympus reference architecture can be provided either through a telephone or through a microphone connected to the computer where the system is running. The first tests performed with Noctívago (Chapter 4) raised the question of the best way to perform data collection with volunteer users for an experimental dialog system. In those tests a telephone interface was used; however, in most cases the volunteers would incur costs for the phone call needed to reach the system. We thought that using a web-based interface would help us to more easily recruit new users in future tests.

Embodied Conversational Agents (ECAs) were under development for different SDSs in our group. These ECAs can deal with input and output audio streams, performing the recognition and synthesis tasks. To connect them to the Olympus architecture, a Java class was created that implements the communication between the agent and the remaining modules of the system through the Galaxy HUB. This class implemented the interpretation of the speaking requests sent to the TTS engine and the creation of Galaxy frames with the ASR result, which included the text and the confidence score. The possibility of handling context-dependent language models was also implemented. This eliminated the need for the recognition and synthesis engines in the architecture, since all the operations performed by those modules were now performed within the ECA. The ECA was equipped with a push-to-talk or press-to-talk button, which also eliminated the need for the Audio Server, since Voice Activity Detection was no longer running on-line. Two different solutions were used for the ECA. Both of them used Audimus and DIXI for recognition and synthesis, respectively, just like their telephone-based predecessor Noctívago.
The first ECA was used in the tests held with the rule-based prime selection that will be described in Section 6.1.2.2. The character used is depicted in Figure 3.5. This solution used Adobe's Flash Player to play the video and capture the audio using a push-to-talk button. The video is fully generated on the server side and then sent to the client. This requires a server with an Advanced Graphics Processing Unit (AGPU) and, in addition, requires the client to have a very fast internet connection. The use of Flash is an advantage, since it is also required by other internet services and, in most cases, does not require the user to install a new plug-in. However, the face of this first ECA was not generally considered very appealing. In addition, we noticed some problems in the audio capture that could hinder the ASR performance.

Figure 3.5: Flash-based ECA used in Noctívago.

The other solution available in our group used the Unity 3D gaming platform and was being successfully used in two other projects (the virtual therapist in ViThea [110] and the museum guide in FalaComigo [46]). Unity 3D provides libraries for capturing the audio; a press-to-talk button was used in this case. This platform offers the advantage that it does not require a server with an AGPU, since the video is generated on the client side, which also reduces the need for a very fast internet connection. The synthetic character also looked more appealing than its predecessor, as can be seen in Figure 3.6. The only drawback was that in most cases the users had to install the plug-in on their computers.

Figure 3.6: Unity 3D-based ECA used in Noctívago.

Figures 3.9 and 3.7 show the different architectures used in the Noctívago experiments, whereas Figure 3.8 shows the architecture used in the Let's Go test held for this thesis. The modules that were replaced from the original architecture have a blue contour, the modules that were modified are highlighted with a green contour, and the new modules that were integrated in the architecture are highlighted in red.

Figure 3.7: Olympus architecture used in the first Noctívago tests (Section 4.3), with the telephone interface.

Figure 3.8: Olympus architecture used in the Let's Go tests (Section 6.1.2.3).

Figure 3.9: Olympus architectures used in Noctívago with an ECA: (a) with the Flash-based ECA (Section 6.1.2.2); (b) with the Unity-based ECA (Section 6.2.3).

3.3 Summary

In this chapter we have described the basic architecture of an SDS and its components, together with their main tasks. Two different frameworks were considered for use in this thesis, DIGA and Olympus, and the reasons why we chose Olympus were presented. The components used in the systems developed for this thesis, both Olympus-based, were described in more detail, highlighting the modifications that were made relative to the original architecture.

4 Towards Better Prime Choices

4.1 Introduction

This chapter is devoted to our first approach to building an automated lexical entrainment algorithm for SDSs. We used the Noctívago system to do so. Expert users tested a preliminary version of the system under tightly controlled conditions, using a landline and a very limited set of predefined scenarios. The users had to ask for specific bus schedules.
In these early trials the system received 56 calls, corresponding to a total of 742 turns. These tests allowed us to make the system more robust and to evaluate which concepts could be used in the system prompts to make the system entrain to the users and vice-versa. There were 143 different words in the system prompts but, according to Ward and Litman [143], only words with synonyms can be potential primes. This means that even bus stops and time expressions could be prime candidates. However, if they were selected as prime candidates, several modifications would be needed in many system modules. For that reason, the set of prime candidates was restricted to a list of concepts that did not require major modifications. The set of prime candidates is a very important resource that will be used in all the tests described in this thesis. It is a list of primes, one list for each concept, that the system will use in the system prompts. Our goal with this study was to identify the prime candidates, find new synonyms that could be used in the system prompts, and evaluate the impact they had on the system performance. We hoped that this would give us clues for the automation of the prime selection process.

4.2 Creating a list of prime candidates

The criterion defined above led us to identify the prime candidates presented in Table 4.1. The table also shows the Prime Error Rate (PER) for each prime in the tests of the preliminary version, computed at the prime level and not at the word level, since some primes consist of more than one word.

    Concept      Prime            PER (%)   Frequency
    Next Bus     próximo          19.0      21
    Start Over   outro percurso   50.0      6
                 nova pesquisa    36.4      11
    Now          agora            60.6      33

Table 4.1: Prime analysis in the pilot experiments.

It was very interesting to notice that some users entrained to the erroneous pronunciation of the word pesquisa spoken by the synthesizer. In fact, the units chosen to concatenate the target pronunciation /p@Skiz6/ made it sound like /piSkiz6/, and some users entrained to that pronunciation, with negative effects on the system performance. The next step in this study was to find synonyms for the concepts in Table 4.1. The new set of primes is shown in Table 4.2.

    Concept      Old prime(s)                      New prime(s)
    Now          agora                             imediatamente; neste momento;
                                                   o mais brevemente possível;
                                                   o mais rápido possível
    Start Over   nova pesquisa; outro percurso     procurar novamente; nova procura;
                                                   outra procura; nova busca
    Next bus     próximo                           seguinte

Table 4.2: List of primes used in this study.
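Since PER is computed at the prime level, a multi-word prime such as nova pesquisa counts as a single unit. A minimal sketch of the computation is given below, over hypothetical per-turn annotations of the prime spoken and the prime recognized:

    # Hypothetical per-turn annotations.
    turns = [
        {"spoken": "nova pesquisa", "recognized": "nova pesquisa"},
        {"spoken": "nova pesquisa", "recognized": "rua da prata"},
        {"spoken": "agora",         "recognized": "ontem"},
    ]

    def prime_error_rate(turns, prime):
        """PER for one prime: misrecognized uses over total uses,
        counting each (possibly multi-word) prime as one unit."""
        uses = [t for t in turns if t["spoken"] == prime]
        errors = sum(t["recognized"] != prime for t in uses)
        return 100.0 * errors / len(uses) if uses else None

    print(prime_error_rate(turns, "nova pesquisa"))   # 50.0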
4.3 Experimental Set Up

Over two weeks, a panel of 64 volunteers participated in the experiments. An e-mail was sent to the potential subjects with a description of the task, the phone number that could be used to reach the system, and instructions about the pictorial items that could appear in the scenarios. In this e-mail they were also informed that they should do one test in each week. There were three bus schedule scenarios to be completed; once a scenario was completed, they could move on to the next one. Figure 4.1 shows one of the scenarios used; the other scenarios can be found in Annex C.1. The scenarios were very pictorial in order to avoid biasing users towards a word written in a scenario. At the end of the scenarios the subjects had to fill in a small questionnaire, where they were asked whether they had received the correct information in each scenario and to comment on the system performance. They were also asked to provide information about the device they used to complete the test. During the second week, users who had already called the system in the first week were asked to compare the performance of the system in the two weeks. Both questionnaires can be found in Annex C.2.1. The e-mail was sent both to fellows from the laboratory, who were assumed to have had previous contact with SDSs, and to people who had not had any previous contact with SDSs.

Figure 4.1: Example of a scenario used.

The set of old primes was used during the first week, whereas the set of new primes was used in the second week, as detailed in Table 4.2. The change in the prime set was made without informing the users.

4.4 Results

Unfortunately, since the tests were performed with volunteer users, only a few of them completed the task as instructed. During the first week, the system received 45 calls corresponding to 131 valid queries, since each call could contain 3 or more queries. 35 of these queries were performed by laboratory fellows, 51 were completed by female subjects and 6 by non-native subjects. In the second week, the number of callers decreased to 36, corresponding to 125 valid queries, 40 of them by laboratory fellows; 37 queries were made by female subjects and 6 by non-native subjects. 19 people called the system in both weeks, 26 called the system only in the first week, and 17 only in the second week. Many of the participants in this study used an SDS for the first time in their lives.

In this evaluation, a session was considered successful if the system provided the user with the information she/he had asked for. All the sessions were carefully analyzed at the utterance level, describing the type of error found in the incorrectly recognized utterances and transcribing the content of each utterance. Table 4.3 shows the number of calls for each week and the real success rate (whether the information given matched the user request), according to the device used to interact with the system. From the table it is not possible to make a statement about the relation between the device used and the system performance, despite the fact that the acoustic models were trained mostly with landline telephone speech and broadcast news speech downsampled to the telephone speech sampling rate. Regarding the system performance, the table shows that the real success rate increased from the first to the second week, although Fisher's exact test on session success showed that this difference is not statistically significant. However, the number of calls in the first week using cellphones is not enough to draw any conclusion about cellphone speech degrading the global system performance.

    Device      # Calls W1   Real Success Rate W1 (%)   # Calls W2   Real Success Rate W2 (%)
    Landline    86           48.8                       52           57.6
    Cellphone   12           66.7                       37           54.1
    VoIP        24           54.1                       27           59.3
    Unknown     9            11.1                       9            33.3
    TOTAL       131          48.5                       125          54.8

Table 4.3: Distribution of calls and real success rate according to the device used.

Table 4.4 shows the evaluation measures used as standards in the Spoken Dialog Challenge [14]. The challenge created these measures to compare the participating systems in the bus schedule information task.

    Measure             Week 1   Week 2
    No output           31.3%    19.2%
    Any output          68.7%    80.8%
    Acceptable output   71.1%    68.3%
    Incorrect output    28.9%    31.7%

Table 4.4: Success rate of the system in each week.

The standard defines two possible types of calls: 'No output' and 'Any output'. 'No output' means that the dialog did not provide any information to the user.
'Any output' means that it provided some sort of information to the user. The calls classified as 'Any output' are further divided into 'Acceptable output' calls and 'Incorrect output' calls: 'Acceptable output' means that the correct information was provided to the user, whereas 'Incorrect output' means that the system did not provide the correct information. Table 4.4 shows an improvement from week 1 to week 2 in the percentage of calls that gave any sort of output to the user. Although the percentage of successful dialogs increased in the second week, the percentage of acceptable outputs shows a small decrease. However, since the percentage of 'Any output' during the second week was also higher, the absolute number of successful dialogs was still higher in the second week. The decrease could be explained by the fact that in one of the scenarios the users had to ask for a bus for a specific weekday and departure time (e.g., Saturday at 1:15 am). This scenario was completed 36 times in the first week and 45 times in the second week. Users made a long pause between uttering the weekday and the time of day every time they were completing this scenario. The ASR module would output the day of the week immediately, and the language understanding module would bind the day of the week to the travel time concept, ignoring the departure time. Later, the default minimal inter-word pause value was increased to solve this problem.

In our detailed analysis, a few error sources were identified: VAD, Hung Up, Loss of Synchronism and Date-Time Parsing. VAD errors resulted in incomplete utterances arriving at the speech recognizer due to VAD mis-detections. Hung Up (HU) errors were identified because many of the recognition outputs at the beginning of some sessions were very similar in recognized words and length. We later found that in those sessions the telephone channel had been left open for a few seconds after the user abandoned the call; consequently, all the audio was captured and held in the recognition queue until the next call came in. Sometimes the Galaxy frames generated by the ASR with the recognition output arrived a few milliseconds late at the HUB, causing a Loss of Synchronism, since the processed text output did not correspond to the system request at that time, but to the prior system request. Some errors also occurred during Date-Time Parsing. The percentage of correctly recognized turns and the percentage of wrongly recognized turns are presented in Table 4.5, with the percentage of turns that were not correctly recognized divided according to the source of the error.

    Category              Week 1 (%)   Week 2 (%)
    Correct               32.3         41.1
    Recognition Errors    52.8         44.2
    VAD                   4.0          4.0
    Hung Up               3.8          6.7
    Loss of Synchronism   6.3          6.4
    Date-Time Parsing     5.6          4.5

Table 4.5: Analysis of errors at the turn level.

The most important element to retain here is that the percentage of correctly recognized turns was 10% higher in the second week. We believe that the new set of primes was crucial to this improvement. The other sources of errors were not eliminated, since they do not depend on the lexical choice.
Another important evaluation measure in this experiment is the Word Error Rate (WER). The information available in the system logs was compared with the reference transcriptions to compute the WER for the live tests. Later, the off-line version of Audimus was applied to each session with the same language and acoustic models used in the on-line system. This was done in order to find out how much the acoustic and language models contributed to the WER; in the off-line setting, the speech boundaries for each utterance were given a priori. Table 4.6 shows the WER, insertions, deletions and substitutions for both weeks.

                     on-line               off-line
                     Week 1    Week 2      Week 1    Week 2
    WER              58.3%     52.3%       30.0%     30.6%
    Insertions       99        177         137       192
    Deletions        876       638         15        131
    Substitutions    553       528         520       463

Table 4.6: WER for the different weeks.

These results again confirmed that the system performed better in the second week, at least in the on-line test. A 6% decrease in WER without changing anything but the prompts is quite remarkable. Some of the errors were due to the lack of real data to train the language model: the language model was the context-independent model trained with only 30k sentences. The difference found between the two weeks in the off-line recognition test was very small. It was later found that the device used to call the system, which had no major influence on the dialog success rate (as shown in Table 4.3), did have an impact on the off-line WER. The acoustic models used were trained mainly with landline telephone speech, downsampled broadcast news speech and very few examples of cellphone speech. When compared with the WER for landline telephone users, the WER for cellphone speech was about 8% higher, and for VoIP speech 14-17% higher. The number of sessions completed over landline telephone decreased from 86 in the first week to 52 in the second week. This can help explain why the off-line WER did not decrease in the second week.

In order to prevent the impact of adverse acoustic conditions, the subsequent Noctívago tests were held with a different architecture that no longer uses the Audio Server VAD; users interacted with the system via a web browser and a computer microphone. This solved the VAD, Hung Up and Loss of Synchronism problems reported in Table 4.5 and allowed us to focus only on the study of lexical entrainment.
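For reference, WER is computed from the Levenshtein distance between the reference and hypothesis word sequences. A minimal implementation is sketched below, with invented utterances:

    def wer(reference, hypothesis):
        """Word Error Rate: (substitutions + insertions + deletions)
        divided by the number of reference words."""
        r, h = reference.split(), hypothesis.split()
        # d[i][j]: edit distance between r[:i] and h[:j]
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return 100.0 * d[len(r)][len(h)] / len(r)

    print(wer("o próximo autocarro para o rato",
              "o próximo autocarro para o rato"))   # 0.0
    print(wer("quero ir agora", "quero ontem"))     # ~66.7 (1 sub + 1 del / 3)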
4.5 Prime usage

The last section analyzed the system performance as a whole. In this section, the performance of each prime will be analyzed. This is particularly important to confirm whether users have prime preferences, and also to find future directions for automating the prime choice. Table 4.7 presents the frequency (absolute and relative) of the primes in the data in both weeks, the PER for each prime in both weeks, and the part of speech (POS) of each prime.

    Concept      Prime                           freq. W1 (% W1)   PER W1 (%)   freq. W2 (% W2)   PER W2 (%)   POS
    now          W1: agora                       64 (100.0)        56.3         77 (59.7)         89.6         ADV
                 W2: imediatamente               0                 -            26 (18.1)         61.5         ADV
                 W2: neste momento               0                 -            15 (10.4)         20.0         PRO/ART + N
                 W2: mais rápido possível        0                 -            5 (3.5)           100.0        ADV + ADJ + ADJ
                 W2: mais brevemente possível    0                 -            6 (4.2)           16.6         ADV + ADV + ADJ
                 W2: new primes together         0                 -            52 (40.3)         40.4         -
    Start Over   W1: nova pesquisa               35 (85.4)         51.4         1 (3.1)           0.0          ADJ + N
                 W1: outro percurso              6 (14.6)          16.7         0                 -            ART + N
                 W2: procurar novamente          0                 -            15 (46.9)         0.0          V + ADV
                 W2: nova procura                0                 -            8 (25.0)          25.0         ADJ + N
                 W2: outra procura               0                 -            5 (15.6)          20.0         ADJ + N
                 W2: nova busca                  0                 -            3 (9.4)           0.0          ADJ + N
                 W2: new primes together         0                 -            31 (96.9)         9.7          -
    next bus     W1: próximo                     62 (83.8)         35.5         3 (4.8)           100.0        N or ADJ
                 W2: seguinte                    12 (16.2)         100.0        59 (95.2)         45.8         N or ADJ

Table 4.7: Use of the primes and error rate in both weeks.

The results presented in Table 4.7 show that the callers entrained to the system, incorporating the new primes into their speech during the second week of trials. The only word that appears with a similar frequency in both weeks is agora. This is a very frequent word in Portuguese and, in addition, it was explicitly written in one of the scenarios, thus biasing callers towards that word.

Words like agora (especially in week 2) and seguinte have error rates higher than the other primes. One possible source is the corpus generated to train the language model used in these tests. The algorithm used to generate the corpus tried to incorporate as much variety as possible for each concept described in the grammar whenever no observation probability was given a priori. Thus, the more possibilities a concept has, the less frequently each entry appears in the corpus, since the algorithm distributes the frequency uniformly whenever it is not specified. This problem was later resolved by setting the probability of each entry according to the frequency observed in older data.

The phonetic distance seems to be a good metric for choosing among prompts with the same meaning when considering only the performance of the ASR module. For instance, when the prime mais rápido possível was used, where the word rápido /Rapidu/ sounds very similar to Rato /Ratu/, a bus stop covered by our system, the ASR module always chose the bus stop name and never the word rápido. The same happened with seguinte /s@gi~t@/ for the next bus prime: the latter was recognized several times as sim /si~/ (yes) or sair /s6jr/ (quit). The last column of Table 4.7 contains the POS of each of the primes; no evident connection can be observed between POS and the entrainment that users showed to the primes.

In order to better understand why users are more likely to adopt some primes than others, our data was analyzed to find which primes were used immediately after appearing in the system prompts. This analysis is motivated by previous work [144, 99], where it was observed that primes are more likely to be used immediately after the system uses them, and by the need to understand why users adopt some primes more than others. For every user turn where a prime was recognized, the previous system prompt was taken into account. Four different features were computed to analyze the users' entrainment to the proposed primes. When a prime was used after being invoked in the system's previous utterance, this is called an Uptake; the corresponding column in Table 4.9 shows the number of times that event occurred, meaning that the user followed the conceptual pact proposed by the system. The No Uptake column corresponds to the number of times that a prime different from the one used by the system was used; in this case, the user decided not to use the term proposed by the system. The number between parentheses represents the number of times that the prime had already been used before in the same session. The '(%) of Uptakes' (shown as '(%) usage' in Table 4.8) stands for the percentage of user utterances that adopted a prime when it was present in the previous system prompt. Finally, the '# prompts' column shows the total number of system prompts in the corpus where that prime appears. Table 4.8 shows an example of how Uptakes, No Uptakes and '(%) usage' were computed from our data, for the prime próximo.

    Speaker   Utterance                                                    Uptake   No Uptake   (%) usage
    System    Pode pedir informações sobre outro percurso, saber qual
              o próximo autocarro ou o autocarro anterior.                 -        -           -
    User      Próximo.                                                     X        -           1/2
    System    Podia repetir se deseja saber o horário do autocarro
              seguinte, do autocarro anterior ou se deseja fazer uma
              nova pesquisa?                                               -        -           -
    User      Próximo autocarro.                                           -        X           -
    System    Deseja fazer uma nova pesquisa, saber qual o próximo
              autocarro ou o autocarro anterior.                           -        -           -
    User      Autocarro seguinte.                                          -        -           1/2
    Total                                                                  1        1           50%

Table 4.8: Example of uptake statistics taken from an interaction, for the prime próximo.
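Counting these events from the session logs is straightforward once each turn is labeled with its speaker and text. The sketch below does so for a single prime; the log excerpt is invented:

    # Invented log excerpt: (speaker, utterance) pairs in dialog order.
    log = [
        ("system", "saber qual o próximo autocarro"),
        ("user",   "próximo"),
        ("system", "deseja o autocarro seguinte"),
        ("user",   "próximo autocarro"),
    ]

    def uptake_stats(log, prime):
        """Uptake: the user says the prime right after a system prompt
        containing it; No Uptake: the user says it when the previous
        prompt did not contain it. Also counts prompts with the prime."""
        uptakes = no_uptakes = prompts = 0
        prev_system = ""
        for speaker, text in log:
            if speaker == "system":
                prev_system = text
                prompts += prime in text
            elif prime in text:
                if prime in prev_system:
                    uptakes += 1
                else:
                    no_uptakes += 1
        return uptakes, no_uptakes, prompts

    print(uptake_stats(log, "próximo"))   # (1, 1, 1)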
The results are presented in Table 4.9.

    Prime                        Week 1                                      Week 2
                                 Uptake   No Uptake   (%) of    # prompts    Uptake   No Uptake   (%) of    # prompts
                                                      Uptakes                                     Uptakes
    agora                        -        64          -         0            21       56 (6)      63.3      60
    imediatamente                -        -           -         -            14       12 (3)      41.2      67
    neste momento                -        -           -         -            11       4 (1)       45.8      65
    mais rápido possível         -        -           -         -            5        1 (1)       33.3      30
    mais brevemente possível     -        -           -         -            6        0           60.0      21
    nova pesquisa                34       1           86.2      167          1        0           0         -
    outro percurso               3        3 (3)       42.9      106          -        -           -         -
    procurar novamente           -        -           -         -            15       0           94.4      86
    nova procura                 -        -           -         -            6        2           83.3      51
    outra procura                -        -           -         -            4        0           80.0      54
    nova busca                   -        -           -         -            3        0           33.3      54
    próximo                      51       11 (2)      100.0     198          2        1           100.0     38
    seguinte                     8        3           53.8      39           57       2 (1)       96.5      105

Table 4.9: Analysis of the uptake of the primes.

In most cases, the number of usages in system prompts is much higher than the number of usages in user turns. There are many cases where the prime was used in the system prompt, but the user intention did not correspond to any of the primable concepts. For instance, the prompts for the now concept are used in the context of asking for the time the user wants to travel; one can answer with a time-relative expression such as now or with a specific hour. Again, the fact that the word agora was explicitly written in one of the scenarios is reflected in the results, since in most cases the callers did not take it up from the previous system prompt. During the first week, the differences in the (%) of Uptakes found for start over show a clear preference for nova pesquisa in that week. A similar conclusion for the next bus prime is not obvious, since the exposure to the word próximo was nearly five times the exposure to the word seguinte.

Moving to the second week, the variety of primes for each concept increases and the users have more options. The primes were chosen randomly, although some primes are included in more than one prompt, which explains why some primes were used more than others. The results for now are inconclusive, although it is interesting to observe that the word imediatamente is used more than the others when there is a No Uptake. Unlike the now concept, the primes used for start over were uptaken in the majority of the answers. The only prime that was used without being uptaken from the system was nova procura. Finally, the users quickly adopted seguinte, the prime the system used for next bus in the second week.

4.6 Discussion

The results have shown that, in general, all the performance measures improved from the first to the second week and, according to Table 4.5, this was possible due to the increased number of correctly recognized turns, as the other identified sources of problems remained. This means that prime selection can improve the system performance by using primes that are less prone to incorrect recognition.

Despite the improvements achieved in the second week, the WER is still very high (Table 4.6). In future tests, instead of using a context-independent language model, we should use context-dependent language models trained with more data. The majority of the primes used in the second week improved the speech recognition. However, a prime should also be chosen taking the user preference into account.
The combination of the results in Tables 4.8 and 4.9 can provide information to set a tradeoff between the speech recognition performance and the reaction of the users to the proposed primes. In the case of now, agora was used more than the other primes in both weeks. Despite being written in one of the scenarios, it is a very frequent word and therefore it should be chosen instead of the new primes; the effort should go into improving the speech recognition performance for this prime, by using a better corpus to create the language model. Imediatamente, due to its usage in the No Uptake condition, should also be included in the set of primes for now. Mais brevemente possível has a very good PER; however, it is not very frequent in daily language.

For start over, Table 4.9 shows that the users made good use of nova pesquisa during the first week, despite the synthesis problem reported in Section 4.1. Among the prime options available in the second week, procurar novamente, which has a very good PER, and nova procura, which was twice used without uptake and also has a low PER, can be considered good primes.

The primes tested for next bus were very frequent in the weeks when they were more often incorporated in the prompts. However, the PER is lower for próximo. In addition, the usage of próximo without uptake was much higher in the first week compared to the use of seguinte in the same conditions in the second week. This indicates that próximo is a good prime for next bus. However, taking into account that the comments in the users' questionnaires valued the variety of prompts, it might be a good option to keep both primes.

This analysis suggests that the use of a prime in the No Uptake condition and the user prime preference might be correlated. This is confirmed by the Pearson correlation found between the number of No Uptakes and the number of hits for each prime in a web search engine, 0.61, whereas the correlation with Uptakes is only 0.14. The same type of correlation was found with the number of hits of each prime in a frequency corpus for spoken European Portuguese, the Português Fundamental corpus [6]; in this case the correlation for Uptakes was −0.23, and 0.99 for No Uptakes. This follows the intuition that these are the words that users recall immediately when the system requests a given concept.

4.6.1 User's Feedback

The questionnaire at the end of the scenarios helped us understand how users perceived the changes in the system. It was also helpful for collecting feedback in order to improve the system performance. During the first week, the users were asked whether they received the information correctly in each scenario, the type of device used, their country of origin, and a general comment on the system. During the second week, they were also asked whether the system understood them better, whether they felt any evolution from the first to the second week, and, if possible, to describe that evolution. 18 of the users who called the system in both weeks answered the questionnaire comparing the two systems. 10 of them felt better understood by the system. 15 of them noticed some sort of evolution in the system. 3 users commented positively on the wider variety of prompts available in the second week. 5 users also felt that the system was answering faster than before, although it was not. The users' feedback emphasizes that the two modules that they interact with directly, the ASR and the TTS, are of major importance in the users' perception of the system.
The new prompts proved to be very effective, reducing the number of misunderstandings and removing some incorrectly synthesized words.

4.7 Summary

In this Chapter we presented our first approach to the use of Lexical Entrainment in an SDS. We started by identifying a set of prime candidates and finding synonyms which could replace them in the system prompts. We then tested the system using the different sets of primes: the first set was the same one used in the preliminary version, and the second set is the one with the synonyms.

The results presented in this Chapter confirm the idea that lexical entrainment in human-computer dialogs occurs immediately after the system introduces a new prime. In addition, the system performance improved with the use of different primes, showing once more that prime choice can influence the system performance.

An a posteriori analysis of the results was made in order to understand which features could describe a good prime. The purpose is to automate the selection of the best primes during the interaction. In order to do this, an algorithm that combines the system prime preference and the user prime preference must be used. Events identified during the analysis, such as Uptakes and No Uptakes, can be considered relevant to find the user prime preference. The system prime preference must rely on events that can be detected during the interaction, such as non-understandings. The confidence measure computed by the system during the interaction should also be used as an indicator in this process.

To summarize, these are the conclusions that should be taken into account when choosing primes to be used in SDSs:

• Very frequent primes should have their ASR performance improved;
• Words that are not very frequent should be removed from the prime set;
• Multiple primes should be kept in the prime set to allow variety;
• No Uptakes seem to be events that could indicate user prime preference.

The next chapter will discuss the features that can be used to refine the computation of the confidence measure in order to choose better primes.

5 Refining confidence measure to improve prime selection

5.1 Introduction

An accurate confidence score is one of the keys to improving the robustness of SDSs. Previous approaches to this problem were introduced in Section 2.1.2. The confidence model used in the tests described in Section 4.3 was trained with data from a different system with a different domain. Moreover, the changes made to the original architecture to adapt it to European Portuguese introduced modifications that influence which features will give the best confidence score computed by Helios. In this chapter, strategies to refine these confidence measures according to the new modules used will be presented.

5.2 Training a confidence annotator using logistic regression

The confidence model used so far, trained with RoomLine [18] data, computed the confidence score according to the following logistic regression:

    Confidence = exp(1.69 − 5.55 · RatioOfUncoveredWords) / (1 + exp(1.69 − 5.55 · RatioOfUncoveredWords))    (5.1)

where RatioOfUncoveredWords is obtained during the interaction and corresponds to the ratio between the number of words from the recognized input that were not parsable and the total number of words in the input. However, as described in detail in [16], there are other features available when training the model that could give a better model for the architecture used in Noctívago.
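As a point of reference, the baseline annotator of Equation 5.1 amounts to a one-feature logistic regression, simple enough to sketch directly (the function name is ours):

    import math

    def baseline_confidence(ratio_uncovered_words):
        """Baseline annotator of Equation 5.1: a logistic function of the
        ratio of unparsable (uncovered) words in the recognized input."""
        z = 1.69 - 5.55 * ratio_uncovered_words
        return math.exp(z) / (1.0 + math.exp(z))

    # A fully parsable hypothesis (ratio 0.0) receives ~0.84 confidence,
    # a fully uncovered one (ratio 1.0) only ~0.02.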
Here are some examples from different categories (from [20]):

• speech recognition features, e.g. acoustic and language model scores; number of words and frames; turn-level confidence scores generated by the recognizer; signal and noise levels; speech rate; etc.

• prosody features, e.g. various pitch characteristics such as mean, max, min, standard deviation, min and max slopes, etc.

• lexical features, e.g. presence or absence of the top-10 words most correlated with misunderstandings (system-specific).

• language understanding features, e.g. number of new or repeated semantic slots in the parse; measures of parse fragmentation; etc.

• dialog management features, e.g. match score between the recognition result and the dialog manager expectation; dialog state; etc.

• dialog history features, e.g. number of previous consecutive non-understandings; ratio of non-understandings up to the current point in the dialog; etc.

Since the speech recognition engine was replaced and the remaining modules were modified, there is a chance that some of the features computed can be more helpful in predicting the confidence score than the features used in the model adopted so far. It is important at this point to clarify that we will use the same terminology used by Bohus in [16] to distinguish between misunderstandings and non-understandings, since the confidence annotator treats them differently. A misunderstanding occurs when the recognized input is parsable but could not be matched with any of the expected concepts; a non-understanding occurs when the recognized input was not parsable. Unlike misunderstandings, in the case of a non-understanding some language understanding and dialog management features are not given a value.

Our first attempt to train a new model tried to maximize the number of turns used in the training procedure, including non-understandings. Thus, only the features available in all the 1592 user turns collected in the tests described in Section 4.4 were used, which significantly reduced the number of available features. The features, grouped by category, were:

• speech recognition features: word-level confidence score; whether the word-level confidence was greater than 0.5.

• language understanding features: fragment ratio; whether the fragment ratio was above the mean; ratio of the number of parses; number of uncovered words; whether the number of uncovered words was greater than 0; whether the number of uncovered words was greater than 1; the normalized number of uncovered words; the ratio of uncovered words; and whether the ratio of uncovered words was above the mean value.

• dialog history features: whether the last turn was a non-understanding.

The model used these features to predict whether the turn was correctly understood. A turn was labeled as correct if the parsed transcription corresponded to the parsed input, and as incorrect otherwise. Since this is a binary problem, logistic regression can be used. Among the algorithms available to perform logistic regression we chose three: maximum entropy (using MegaM [40]), prior-weighted logistic regression (using FoCal [31], with a prior of 0.5), and the stepwise logistic regression available in MATLAB's statistical toolbox. The available turns were divided into 1144 used for training and 448 used for testing. The weights of the features according to the algorithm used are shown in Table 5.1.
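For concreteness, the sketch below illustrates how such a turn-level annotator can be trained and applied. scikit-learn stands in here for MegaM, FoCal and the MATLAB toolbox, and the feature files are placeholders, so this is an illustration of the procedure rather than the pipeline actually used:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per user turn; columns are the features listed above.
    # y = 1 when the parsed transcription matches the parsed input.
    X_train, y_train = np.load("train_X.npy"), np.load("train_y.npy")  # 1144 turns
    X_test, y_test = np.load("test_X.npy"), np.load("test_y.npy")      # 448 turns

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    confidence = model.predict_proba(X_test)[:, 1]  # per-turn confidence score

    threshold = 0.2                 # fixed rejection threshold of the live system
    accepted = confidence >= threshold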
Among the features used, the word-level confidence is the most heavily weighted. This confirms the accuracy of the confidence score provided by the new European Portuguese speech recognizer that we integrated. Other parsing features associated with the number of uncovered words are also important to the model. The performance of the new models may be compared with that of the baseline model given in Equation 5.1 by varying the confidence threshold, to find the optimal trade-off between the model used and the confidence threshold set in the system. The threshold in the live system was fixed at 0.2.

                                      MegaM   FoCal   Stepwise
Turn-Level Confidence                  2.95    4.09    0.71
Turn-Level Confidence > 0.5           -0.30   -0.97   -0.17
Fragment Ratio                        -0.11   -0.36    0.00
Fragment Ratio > mean                  0.38    0.57    0.00
Ratio of the number of parses         -0.62   -0.98    0.00
Number of uncovered words             -0.88   -1.80   -0.12
Number of uncovered words > 0         -0.38   -0.38   -0.31
Number of uncovered words > 1         -0.71   -0.03    0.00
Normalized number of uncovered words   0.97    2.13    0.00
Ratio of uncovered words              -0.62   -1.34   -0.20
Ratio of uncovered words > mean       -0.38   -0.39    0.00
Non-understanding in last turn        -0.57   -0.60   -0.10

Table 5.1: Weights of the features used in the confidence annotator training algorithms.

The best performance in terms of Accuracy (Figure 5.1a) is achieved by the maximum entropy method, using 0.6 as the rejection threshold. In Figure 5.1b, Precision gives an idea of how the system deals with mis-recognitions. In a dialog system this is a very important measure of performance, as mis-recognitions usually have a very high cost, since they often result in dialog errors. In terms of Precision, the FoCal toolkit consistently achieved the best performance across the evaluated thresholds. It is also important to avoid false rejections, although they are not as costly as mis-recognitions. The Recall, shown in Figure 5.1c, shows how the different methods deal with rejections; in terms of Recall, the stepwise method and the baseline performed better than the two other methods presented.

[Figure 5.1: Accuracy, Precision and Recall of the tested methods (FoCal, Helios baseline, MegaM, Stepwise) as a function of the confidence threshold — panels (a) Accuracy, (b) Precision, (c) Recall.]

These new models outperformed the baseline confidence model. The models trained with MegaM and FoCal seem to be valuable alternatives to the baseline model for use in a live system. However, many other features were left out, since we were only considering the features available in every turn. Since the confidence value will not influence the way the system deals with non-understandings (typically the system simply repeats the previous request), the next section tries to use all the available features, leaving out the non-understood turns.

5.3 Training a confidence annotator with skewed data

Not considering the non-understood turns reduced the training corpus to 825 turns, divided into 623 for training and 202 for testing.
The reduction brought another problem: only 10% of the turns corresponded to misunderstood turns, which means that the training data is highly skewed. Our first attempt did not use any particular strategy to deal with this fact. The procedure adopted was to train a model with stepwise logistic regression, using the same procedure as in [16]. In each step, the next most informative feature was added to the model, as long as the average data likelihood on the training set improved by a statistically significant margin (p-value < 0.05). After each step, the features in the model were reassessed, and any feature with a p-value larger than 0.3 was eliminated from the model. To avoid over-fitting, two different stopping criteria were used: the Bayesian Information Criterion (BIC) and the log-likelihood increase with cross-validation (LLIK). Table 5.2 shows the most relevant features found at this point, with their weights, according to the different stopping criteria.

Feature                  BIC       LLIK
last level touched      -1.5315   -1.3706
slots blocked            -        -1.8173
new slots num (bool)     -        -1.1683

Table 5.2: Confidence annotation models trained with stepwise logistic regression.

None of these features corresponds to the features found for the first models trained with Noctívago data; these are dialog manager features that were not used to train the first models. Figure 5.2 compares the performance of these models against the baseline model. Regardless of the stopping criterion, the trained models clearly outperformed the baseline model.

[Figure 5.2: Accuracy, Precision and Recall, as a function of the confidence threshold, of the stepwise logistic regression models (BIC, LLIK) compared with the Helios baseline model.]

The number of features used in both models is very small. In fact, the stopping criteria were met very quickly, which can explain the low number of features selected. One possible reason for the fast convergence of the methods could be the skewed training corpus. To overcome this limitation, three different strategies were attempted:

• using the non-understood turns, filling the missing data points with the average value for that feature, if the feature was missing in less than 20% of the training corpus (NONU);

• randomly replicating the minority dataset until the dataset is balanced between positive and negative examples (BAL);

• feature weighting with an additional cost for false positives (FoCal).

The cost for false positives can be modified by changing the prior used in the weighted logistic regression with FoCal [31]. The cost for false negatives was set to 1, while several costs were evaluated for false positives (1, 10, 100 and 1000). The results were computed for stepwise logistic regressions using both BIC and LLIK as stopping criteria. The BAL and NONU datasets were used separately and combined. The combination of the two strategies is possible by first populating the empty feature values of the non-understood turns with mean values, and then replicating the minority part of the corpus to achieve a balanced corpus.
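A minimal sketch of the NONU and BAL preprocessing steps follows; the array-based representation (NaN marking features that are missing in non-understood turns) and the helper names are assumptions made for illustration:

    import numpy as np

    def fill_nonu(X, max_missing=0.2):
        """NONU: replace missing values with the column mean, for features
        missing in less than max_missing of the training corpus."""
        X = X.copy()
        for j in range(X.shape[1]):
            missing = np.isnan(X[:, j])
            if 0 < missing.mean() < max_missing:
                X[missing, j] = np.nanmean(X[:, j])
        return X  # columns above the missing limit are left untouched

    def balance(X, y, rng=np.random.default_rng(0)):
        """BAL: randomly replicate minority-class rows until the classes
        are balanced between positive and negative examples."""
        labels, counts = np.unique(y, return_counts=True)
        idx = np.flatnonzero(y == labels[np.argmin(counts)])
        extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
        return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

Combining the two strategies (BAL + NONU) is then simply balance(fill_nonu(X), y): fill the missing points first, then replicate the minority class.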
Table 5.3 shows the Classification Error Rate (CER) at the optimal rejection threshold, the CER at a 0.5 threshold, and the optimal threshold value. The two rightmost columns show the CER on the positive and negative parts of the test dataset.

                           CER (opt.) (%)  CER (0.5) (%)  Opt. Th.  Positive CER (%)  Negative CER (%)
Helios Baseline                 25.7           29.7         0.03         0.00             100
logit(LLIK)                     19.8           19.8         0.62         2.00             71.1
logit(LLIK) + BAL               20.3           20.3         0.33         2.00             73.1
logit(LLIK) + NONU              22.3           22.3         0.52         5.33             71.2
logit(LLIK) + BAL + NONU        22.3           22.3         0.58         5.33             71.2
logit(BIC)                      19.8           20.3         0.29         1.33             73.1
logit(BIC) + BAL                21.0           25.7         0.36         3.33             71.2
logit(BIC) + NONU               22.3           22.3         0.66         5.33             71.2
logit(BIC) + BAL + NONU         22.3           22.3         0.59         5.33             71.2
FoCal (1,1)                     21.3           30.7         0.10         4.00             73.1
FoCal (1,10)                    22.3           28.7         0.13         3.33             76.9
FoCal (1,100)                   22.3           35.2         0.04         4.00             76.9
FoCal (1,1000)                  22.8           36.6         0.05         5.33             76.9

Table 5.3: Classification error rate for the different strategies.

None of the attempted strategies outperformed the initial stepwise logistic regression, as confirmed by the results in Table 5.3. In [16], Bohus considered 500 training utterances to be the minimum dataset size required to train a confidence annotation model using logistic regression, which may help explain why adding more data did not improve the trained models.

5.4 Summary

The goal of improving confidence annotation in this context was only to provide better confidence scores with which to find the system prime preferences. Despite the improvements achieved (all the new models presented in Table 5.3 outperformed the baseline model), the models trained in this chapter were never used in subsequent experiments. As mentioned before, the confidence annotation model is very system- and component-dependent. Since the subsequent tests were conducted with different architectures and dialog strategies, these models were no longer valid, and therefore they could not be used in live tests to show their effectiveness.

6 Automated Entrainment

In Chapter 4 a first step was taken towards understanding how the prime choice can be automated. This Chapter goes beyond what was done at that point by automating the prime selection process, instead of randomly selecting primes as was done in those experiments. Section 6.1 describes the first approach towards this automation: a rule-based method derived from heuristics of lexical entrainment theory and from the intuition gathered during the experiments presented in Chapter 4. The method was tested on-line in both Noctívago and Let's Go. Section 6.2 presents our first attempt to create a data-driven method for prime selection. The method was tested on off-line data from Noctívago and Let's Go, and a final on-line test with Noctívago compared the effectiveness of the data-driven prime selection with that of other methods.

6.1 Two-Way Automated Rule-Based entrainment

6.1.1 Entrainment Events

There is considerable evidence in the literature of lexical, syntactic and/or acoustic entrainment in human-human and human-computer dialogs, as reported in Sections 2.2.1 and 2.2.2, respectively. This evidence has to be mapped into something that an SDS can use to perform entrainment. Once that evidence is mapped, the system can be programmed to perform lexical entrainment in both directions. The type of information that can be extracted from live dialogs indicates whether the user accepted the prime the system proposed, or decided to use a different prime.
In each case, the system behavior has to be adjusted to the evidence found. In the first case, the system should stay with the prime proposed, whereas in the second case the system needs to decide whether there is enough evidence that the prime recognized was indeed different from the prime proposed. If there is, the system should incorporate the prime chosen by the user. This evidence is built according to the confidence score of the current turn, the information gathered in the previous turns of the dialog, and the information collected from previous dialogs.

The system logs can be analyzed chronologically in order to find the user prime preferences, i.e., whether the users took up the system prime or used a different term. Some examples of Uptake and No Uptake events were already given in Section 4.5, using data from Noctívago dialogs (Table 4.8). The set of relevant events will be extended here, using examples from real Let's Go dialogs to better illustrate them.

If the users take up the system prime, then entrainment has taken place. This is designated an Uptake (Uptk) event. An example of this event in a Let's Go dialog can be found in the first excerpt in Table 6.1, where the system proposed "new conversation" (S1) and the user followed (U1).

When the system detects a different prime from the one proposed, it can be interpreted as the user deciding not to entrain to the system. This is called a No Uptake (NUptk) event. This event can be found in the second excerpt of the dialog presented in Table 6.1, where the system proposed "new dialog" (S2) and the user answered with "new task" (U2).

A third class of events should be added to the two previously defined: when the user says a prime that was not used until that point in the dialog. This is considered a No Previous Use (NPU) event. It is exemplified in the third excerpt in Table 6.1, where the user said "now" (U6) before the system used this concept. In these cases, the system should try to follow the user prime choice whenever possible. These three events are designed to cover the possible user behaviors in terms of prime adoption, and will be designated user prime events.

The fourth event illustrated in Table 6.1 cannot be considered an entrainment event, but it will be used later to represent the system preference. Non-Understandings (NUnd), already introduced in Section 5.2 when discussing which turns should be used to train the confidence annotator model, occur whenever the dialog manager is unable to bind the parser output to the concepts available in the expectation agenda for a given state. They will be treated as a system event. An example is given in the fourth dialog excerpt in Table 6.1: the parser outputs for utterances U7 and U8 were bus stops and routes, which were not acceptable concepts at that point of the dialog. If this is a recurring situation for a given prime, the system should be able to find an alternative to that prime (S11) that might work better than the one first proposed, as the user's subsequent uptake (U9) suggests.

S1: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U1: START A NEW CONVERSATION
...
S2: You can say, tell me about the later bus, when is the bus prior this one, start a new dialog, or say goodbye.
U2: START A NEW TASK
...
S3: Welcome to the CMU Let's Go bus information system.
S4: To get support at any time, just say support or press zero.
S5: Where would you like to start from?
U3: SHERADEN
S6: Where do you wanna go?
U4: DOWNTOWN
S7: Going to DOWNTOWN. Did I hear that correctly?
U5: YES
S8: When would you like to travel?
U6: NOW
...
S9: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U7: SPRING 94C DUQUESNE
S10: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U8: HILL CROSS TURTLE THANK YOU
S11: To ask about a different ride, you can say, start a new dialog. If you are finished, you can say goodbye.
U9: START NEW DIALOG

Table 6.1: Examples of the events used in the prime choice update. Primes are in bold.

6.1.2 Heuristic Entrainment Rules

This section describes the heuristic rules that were our first approach towards entrainment. These rules aim to find the most appropriate prime at any given instant. Ideally, the choice would be made using a data-driven method; however, the limited data resources initially available for Noctívago led us to develop a set of heuristics for prime selection that combine the user and system events mentioned above to find the best primes. The data-driven method was implemented later, as described in Section 6.2.

6.1.2.1 Implementing the Entrainment Heuristics

As described in the literature [48, 29], speakers may adjust their prime choice according to the other interlocutor, and consequently use different primes when talking to different interlocutors. The ideal solution would therefore be to have user-dependent models for prime selection. However, neither Noctívago nor Let's Go has user-dependent models; neither system is currently equipped with caller-id or a speaker verification module that could trigger user-adapted models on the fly. Thus, the solution adopted was a two-stage algorithm to rank the prime candidate lists for each concept. In the first stage, "Long-Term Entrainment", the system tries to determine the best prime for any speaker, based on past interactions with the system. The prime selected at this stage is used at the beginning of the session. In the second stage, "Short-Term Entrainment", the system tries to coordinate the primes with the user's choices on the fly as the dialog progresses, trying to find out whether the primes used are acceptable to the current user.

6.1.2.1.1 Long-Term Entrainment

The results presented in Section 4.7 pointed to a very high correlation between the number of No Uptake events and the most commonly used primes in daily language. This correlation was verified for the number of hits in a web search engine and for the frequency of the primes in a spoken European Portuguese corpus. A possible explanation is that the terms that are most frequent in general use are those that users employ even if the system did not use them. Thus, the primes were ranked according to the number of No Uptake events, normalized by the total number of uses of that prime by the system: the higher this value, the more suitable the prime. This resulted in the long-term prime ratio for prime i:

    R(i) = count_NUptk(i) / count_system(i)    (6.1)

6.1.2.1.2 Short-Term Entrainment

The second phase of the algorithm aims to reduce the number of primes used during each session, that is, to make the user and the system employ the same terms. The system is expected to follow the user prime choice whenever this choice does not degrade the system performance, and to propose a different prime once enough evidence is found that a prime is hindering the system performance.
To map this kind of behavior, imagine that for the new query concept the system has a prime candidate list from prime i to prime z. A set of heuristics was designed based on a set of prime update factors, one for each of the user prime events described: ϕUptk, ϕNUptk and ϕNPU. These factors modify the initial long-term prime ratio R(i) according to the following heuristics, summarized in the code sketch after the examples:

• If an Uptake event occurs for prime i, then R(i) is increased by ϕUptk. Example: in the first excerpt from Table 6.1, R(new conversation) will be increased by ϕUptk.

  Uptake
  S: To ask about a different journey, you can say start <Prime i>, if you are finished you can say goodbye.
  U: <Prime i>

• If prime i is used when prime j was proposed, then R(i) is increased by ϕNUptk and R(j) is decreased by the same amount. Example: in the second excerpt from Table 6.1, R(new task) will be increased by ϕNUptk and R(new dialog) will be reduced by ϕNUptk.

  No Uptake
  S: To ask about a different journey, you can say start <Prime i>, if you are finished you can say goodbye.
  U: <Prime j>

• If prime i is spoken without having been previously used in that session, either by the user or by the system, then R(i) is increased by ϕNPU. Example: in the third excerpt from Table 6.1, R(now) will be increased by ϕNPU.

  No Previous Use
  S: Welcome to the CMU Let's Go bus information system.
  S: Where would you like to start from?
  U: SHERADEN
  S: Where do you wanna go?
  U: DOWNTOWN
  S: When would you like to travel?
  U: 11 PM
  S: The 28X departs from WEST BUSWAY AT SHERADEN STATION C at 11:29 p.m. It will arrive at LIBERTY AVENUE AT MARKET STREET at 11:47 p.m. And it has many seats available.
  U: <Prime i>

• If prime i was proposed and a non-understanding was generated in the next user turn, then R(i) is reduced by count_NUnd(i), where count_NUnd(i) is the number of non-understandings for prime i in a session. Example: in the last excerpt from Table 6.1, the R(i) for "journey" and "new conversation" will be decreased by the number of non-understandings flagged so far, in that session, for each of them.

  Non Understanding
  S: To ask about a different journey, you can say, start <Prime i>. If you are finished, you can say goodbye.
  U: 4 BRIDGEVILLE HOMESTEAD BRIDGE BAUM [parsed_str: NOT AVAILABLE]
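The update heuristics above can be summarized in a short sketch. The dictionary-based representation and function names are illustrative; the factor values are the handcrafted ones used in the Noctívago tests described below (ϕUptk = 1, ϕNUptk = 2, ϕNPU = 3):

    # Sketch of the short-term entrainment updates; R maps each prime of a
    # concept to its current ratio (initialized by Equation 6.1).
    PHI_UPTK, PHI_NUPTK, PHI_NPU = 1, 2, 3

    def on_uptake(R, i):
        R[i] += PHI_UPTK                 # user adopted the proposed prime i

    def on_no_uptake(R, i, j):
        R[i] += PHI_NUPTK                # user answered with prime i ...
        R[j] -= PHI_NUPTK                # ... when prime j was proposed

    def on_no_previous_use(R, i):
        R[i] += PHI_NPU                  # user volunteered prime i unprompted

    def on_non_understanding(R, i, nund_count):
        R[i] -= nund_count               # penalize by the session's NUnd count

    def best_prime(R):
        return max(R, key=R.get)         # the highest-ranked candidate is used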
6.1.2.2 Version 1: Testing the Entrainment Rules in Noctívago

The entrainment experiments described in this Chapter were performed using the new architecture for Noctívago presented in Figure 3.9a. Noctívago's new version uses the ECA flash-based interface instead of the telephone interface previously used, and still used in Let's Go. This version of Noctívago also uses context-dependent language models. The models for Place and Time are bi-gram language models, trained with 30k automatically generated corpora. The models for Confirmation and Next Query were created from SRGS grammars specified according to the data collected in the experiments described in Chapter 4.

The heuristic rules described in the previous Section were implemented. The prime candidates that could be affected by the application of the rules can be found in Table 6.2. To the primable concepts used in the study presented in Chapter 4, three others were added due to small changes in the dialog strategy. Once the values for the slots arrival place, origin place and time were filled, the system explicitly asked the user if the values were correct (QueryConfirm Request Agent in the Noctívago task presented in Figure 3.3). If the user said no, the system asked which slots were not correct (RequestWrongSlot Request Agent in the Noctívago task presented in Figure 3.3). It is important to emphasize that these concepts correspond to the way the slot is referred to, rather than to the slot value itself. Another primable concept introduced was price, as the system offered the user the possibility to ask for the price of the ticket (InformPrice Inform Agent in the Noctívago task presented in Figure 3.3).

The "Type of Prime" column in Table 6.2 specifies whether the use of the prime influences the course of the dialog. For instance, in the example dialog presented in Table 6.1, in utterance U1 the fact that "new conversation" was correctly recognized intrinsically affects the course of the dialog. The primes whose incorrect recognition does not affect the course of the dialog and that are used in system prompts are typified as "Non-Intrinsic". None of the primes used in these Noctívago tests is typified as "Non-Intrinsic"; this type can be found among the primes used in the Let's Go tests in Table 6.7 and in the Noctívago tests presented in Section 6.2.3.1. To give an example, in a Let's Go dialog, if the user answers "start from downtown" and the system recognizes "stay from downtown", where "start" is a prime, the slot is filled with the same value, "downtown", in both cases.

Type of Prime  Concept        Primes
Intrinsic      Next           próximo / seguinte
               Now            agora / imediatamente / neste momento /
                              o mais rápido possível / o mais brevemente possível
               Price          preço / valor
               Start Over     outro percurso / nova procura / nova pesquisa /
                              procurar novamente / outra procura / nova busca
               Arrival Place  chegada / destino
               Origin Place   partida / origem
               Time           horas / horário

Table 6.2: Primes used by Noctívago in the heuristic method tests.

These tests had two main goals. The first was to compare the users' behavior with and without Short-Term entrainment. The second was to find the best configuration for the confidence measure to be combined with the prime events detected. For this purpose, four different test sets were created. In Set 1, the dialog confidence score generated by Helios was used to threshold the Short-Term entrainment updates. In Set 2, the ASR confidence score was used for the same purpose. In Set 3, Short-Term entrainment updates were performed without the use of the confidence scores. Finally, Set 4 only performed the Long-Term entrainment updates.

The update factors used in Sets 1, 2 and 3 were handcrafted. The values for ϕUptk, ϕNUptk and ϕNPU were set to 1, 2 and 3 respectively. These values were chosen to give more importance to the least frequent events, as the previous findings and the intuition developed in previous studies pointed to their relevance for prime selection. These values also dominate the initial ratio R(i), as they are at least one order of magnitude higher than the average initial R(i) for the data collected in the tests described in Chapter 4 (0.09). This was done in order to reinforce entrainment during each session.

6.1.2.2.1 Test Set

Users were recruited by e-mail or through a Facebook event to participate in the experiment.
In both cases they were given a short explanation of the requirements to complete the task. They were then given a web link to access the system via the Flash-based multi-modal web interface of Noctívago (Figure 3.9a). They were told to carry out three consecutive requests to the system within one session. This made each dialog longer and consequently gave the system more chances to apply the heuristics. The four Sets ran alternately one after the other during the test period, each for the same amount of time, but the Set did not change within an individual session. The users were not aware that the system was running different configurations.

At the end of each dialog, the users were asked to fill in a questionnaire based on the PARADISE framework for SDS evaluation [140]. The questionnaire also tried to evaluate whether the users noticed any difference in the lexical choice, by asking them if the system understood them better towards the end of the session. We hoped that the use of adequate primes would result in better recognition as the session progressed. The complete questionnaire can be found in Annex C.2.2.

6.1.2.2.2 Results

Table 6.3 summarizes the 160 validated sessions in terms of system performance (Dialog Success and Average Number of Turns). The tests were performed by 83 different people: 33 were female subjects and 46 were male subjects. The system was also tested by 5 non-native users. 13 users had already participated in the tests described in Section 4.3.

System performance includes both the estimated and the real success. The estimated success is computed live, since it only takes into account whether the system queried the backend and provided any bus schedule information to the user. Real success is computed a posteriori, after listening to each session and verifying whether the schedule provided actually corresponded to the user's request. Although most of the results were not statistically significant, Set 4, against initial expectations, achieved the best performance in terms of real dialog success and the second lowest average number of turns per session; the Chi-square test for Dialog Success and the one-way ANOVA tests for Number of Turns revealed no statistically significant differences between versions. Since one of our goals was to compare the performance with different confidence measures, Table 6.3 shows that, among the Sets that performed Short-Term entrainment, Set 1, which used the dialog confidence measure, was the one with the best performance.

                               Set 1   Set 2   Set 3   Set 4
Number of Sessions              40      42      44      34
Estimated Dialog Success (%)    92.5    95.2    95.5    91.2
Real Dialog Success (%)         70.3    63.2    67.2    74.5
Average Number of Turns         9.24    9.13    8.12    8.92

Table 6.3: Dialog performance results.

Table 6.4 shows the Word Error Rate (WER) and the percentage of correctly transferred concepts (% CTC) for the same tests; Figure 6.1 shows a graphical representation of the same results. A concept is considered correctly transferred when the parsed ASR result is equal to the parsed transcription. Although one-way ANOVA tests did not reveal statistical significance, the results show that Set 4 achieved the lowest WER and the highest percentage of CTC. The relation between the WER and the system performance reveals that Set 1, despite having the highest WER, has the second best real success rate. We observe that the WER for primes is lower than the global WER for all the Sets, except Set 3.
Against our expectation, the % CTC is lower for primes than for the other concepts. However, Sets 1 and 2 achieved the lowest loss in % CTC, and in Set 1, where Short-Term entrainment is performed and the threshold for the entrainment rules is set, the % CTC for intrinsic primes even increased compared to the other concepts' % CTC.

                    Set 1   Set 2   Set 3   Set 4
WER (%)             59.7    52.3    53.7    47.9
WER Primes (%)      52.6    50.1    54.9    43.4
WER Intrinsic (%)   53.6    48.3    58.2    44.8
CTC (%)             47.3    39.6    44.5    51.5
CTC Primes (%)      45.6    36.2    37.3    47.2
CTC Intrinsic (%)   48.3    39.1    37.1    47.3

Table 6.4: WER and correctly transferred concepts results.

[Figure 6.1: WER and CTC results for the different configurations tested — panels (a) WER, (b) CTC.]

The questionnaire results for Adaptation, given in Table 6.5, suggest that the users' evaluation is more correlated with the session success (Pearson correlation of 0.41, p-value < 0.001) than with the adaptation of lexical choices (0.13, p-value = 0.14). Set 4 received the highest rating. However, Set 3 was the second highest rated, despite being the worst performing. These results could mean that the users did not notice that the system was adapting to them. They could also mean that the question regarding adaptation induced the users to evaluate something other than the lexical choices.

                       Set 1   Set 2   Set 3   Set 4
Average Satisfaction   19.4    19.5    19.7    20.3
Adaptation             2.75    2.48    3.16    3.38

Table 6.5: User satisfaction results.

Table 6.6 presents the relative frequency of the entrainment events detected during the live interactions. The event percentage was computed as:

    Event(%) = Σ_{i=1}^{P} count_event(i) / Σ_{i=1}^{P} count_system(i)    (6.2)

where the events are Uptakes, No Uptakes, No Previous Uses or Non-Understandings. Although one-way ANOVA tests revealed no statistical significance, Uptake events are much more frequent than the other events, as can be seen in Figure 6.2, which shows the accumulated percentage of each event per Set. This signifies that users followed the prime proposed by the system much more often than they used terms of their own choice. This behavior of novice users confirmed our expectation [80]. We believe that if the tests were run on Let's Go, with some experienced users, the result might be different.

                               Set 1   Set 2   Set 3   Set 4
Total Uptake (%)               16.8    20.3    18.4    17.6
Total No Uptake (%)            2.03    2.31    1.21    1.20
Total No Previous Usage (%)    0.38    0.12    0.13    0.33
Total Non Understanding (%)    9.77    5.78    6.85    6.50

Table 6.6: Entrainment events relative frequency.

[Figure 6.2: Accumulated event percentages per Set.]

A detailed analysis of the system logs showed that the initial prime rank given by Equation 6.1 rarely changed from session to session, unless a No Uptake event occurred, and such events were very rare. We also examined the reactions to non-understandings: with the strategy proposed in Section 6.1.2.1.2, a prime could be changed before there was the necessary evidence that it was degrading the system performance [24].

The results of these tests do not show an improvement in the system performance when Short-Term entrainment is used. This leads us to envisage future experiments using the entrainment rules. The dialog confidence score could be used as a threshold for Short-Term entrainment. Since Uptake events are much more frequent than any other prime event, they could also be used to compute the initial prime ratio in Long-Term entrainment.
An increase in the update factor ϕUptk should also enhance the convergence between user and system during the session. Finally, the approach to the update at a non-understanding event could also be modified, to allow the system to gather more evidence of whether the prime is hindering ASR success.

6.1.2.3 Version 2: Testing Entrainment Rules in Let's Go

The tests with Noctívago revealed interesting trends. However, we note that most of the results obtained were not statistically significant, and consequently do not permit us to draw any strong conclusions about the effect of on-line entrainment rules on SDS performance. These trends nevertheless motivated modifications to the entrainment rules.

Firstly, since a threshold is needed, we decided to use the dialog confidence score, since the Set that used this configuration performed better than the Set where the ASR confidence score was used and the Set where the Short-Term entrainment updates were performed regardless of the confidence score.

Secondly, the initial prime ratio was modified to a weighted sum of the normalized numbers of No Uptake and Uptake events:

    R(i) = count_NUptk(i) / count_system(i) + w_Uptk · count_Uptk(i) / count_system(i)    (6.3)

where count_Uptk(i) is the past number of Uptakes for prime i and w_Uptk is given by the ratio between the total Uptake events and the total No Uptake events:

    w_Uptk = Σ_{i=1}^{P} count_Uptk(i) / Σ_{i=1}^{P} count_NUptk(i)    (6.4)

where P is the total number of primes.

Thirdly, the update factor for Uptake events, ϕUptk, was increased to 2, in order to enhance convergence during the session.

Finally, in order to ensure the necessary evidence that a given prime is raising non-understandings, the update after a non-understanding was only performed after the second non-understanding. In addition, instead of simply subtracting count_NUnd(i) from R(i), w_NUnd · count_NUnd(i) is now subtracted from R(i), where w_NUnd is computed similarly to w_Uptk:

    w_NUnd = Σ_{i=1}^{P} count_NUnd(i) / Σ_{i=1}^{P} count_NUptk(i)    (6.5)

The algorithm with the modifications mentioned above was tested in Let's Go, a live system with real users. Cornerstone [75] was used as the dialog manager; the resulting architecture is the one represented in Figure 3.8. The set of primes used is shown in Table 6.7. This set was extended from the one used in [99], which also increased the number of prompts available and can make the system sound more natural.

Type of primes  Concept        Old Primes        New Primes
Intrinsic       next bus       next              following / subsequent / later / after
                now            now               immediately / right now / right away /
                                                 as soon as possible
                previous bus   previous          preceding / prior / before
                start over     route / schedule  itinerary / trip / ride / journey
                confirm        right             alright / correct / okay
                help           help              assistance / support / more information
                new query      query             request / task / dialog / route / conversation
Non-Intrinsic   origin place   leaving / leave   departing / depart / starting / start

Table 6.7: Primes used by Let's Go before and after the entrainment rules were implemented.

Since Let's Go has a set of regular users who may be familiar with the system's lexical choices, we added another restriction to the prime selection algorithm: the "Old Primes" were only used when their prime ratio had reached a handcrafted threshold. This threshold was set to be one order of magnitude higher than the average prime ratio for all primes.
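A sketch of the revised long-term ranking (Equations 6.3 to 6.5) is given below; counts is a hypothetical mapping from each prime to its accumulated event totals, and no care is taken here with empty denominators:

    def w_uptk(counts):                                   # Equation 6.4
        return sum(c["uptk"] for c in counts.values()) / \
               sum(c["nuptk"] for c in counts.values())

    def w_nund(counts):                                   # Equation 6.5
        return sum(c["nund"] for c in counts.values()) / \
               sum(c["nuptk"] for c in counts.values())

    def prime_ratio(counts):                              # Equation 6.3
        w = w_uptk(counts)
        return {p: c["nuptk"] / c["system"] + w * c["uptk"] / c["system"]
                for p, c in counts.items()}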
The Let's Go version running with the entrainment rules was live for 31 consecutive days. Its performance was compared to that of the same system without the entrainment rules, which had run a few weeks before the new system went live, for approximately the same amount of time.

6.1.2.3.1 Results

To give the reader an idea of the type of behavior we are aiming at with the entrainment rules, Table 6.8 shows two different dialogs where the prime choice was affected. In the first dialog, the system tried to make the user entrain to a different prime, since the first attempt resulted in a non-understanding: in U1 the user presumably said "new conversation", but the wrong recognition output could not be bound to any concept; once the system used "new dialog" instead, the recognition output was correctly bound. In the second dialog, the system entrained to the user. The system proposed "following" as the prime for next bus, but the user preferred to use "after"; in the subsequent prompt, the system used "after" instead of "following" when it used a prime for next bus.

S1: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
U1: 13 TREE STATION
S2: To ask about a different ride, you can say, start a new dialog. If you are finished, you can say goodbye.
U2: START A NEW DIALOG
...
S3: You can say, tell me about the following bus, when is the bus prior this one, start a new conversation, or goodbye.
U3: AFTER THAT
S4: Okay.
S5: There is a 71D that departs from HAY STREET AT PENN AVENUE at 8:13 a.m. It will arrive at FIFTH AVENUE AT MCMASTERS at 8:48 a.m.
S6: Please let me know how I did by pressing, four, if you got the information you were looking for, or six, if you did not. To get more information about buses related to this journey, you can say, what time is the bus after this one, or, when is the bus prior this one. To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.

Table 6.8: Excerpts of dialogs where the entrainment rules changed the system's normal behavior. Primes affected in bold.

Table 6.9 compares the results between the Baseline version, without the entrainment rules, and the Entrainment Rules version. The estimated success rate increased by more than 2%, which represents a relative 10% reduction in the number of failed sessions, although Fisher's test revealed no significant difference between versions. The average number of turns per dialog also decreased by almost one turn per dialog (a 6% relative reduction), a difference that is statistically significant. We believe that this reduction in the number of turns is due to dialogs like the first one shown in Table 6.8: the fact that the system is able to change the choice of prime when non-understandings or low-confidence turns are detected avoids the repetition of primes that might be hindering the system performance.

The relative frequency of the user prime events was also evaluated. There is a reduction in the number of Uptake events, which may reflect the users' lower familiarity with the new primes proposed; for the same reason, the number of No Uptake events may have increased. On the system side, the number of Non-Understandings increased in the entrainment rules version, counter to what was initially expected. A possible explanation could reside in the fact that some of the primes used were not available in the datasets used to train the acoustic and language models that the system was using.
In order to have the system recognize these primes, they were added to the lexicon and language models by hand. Nevertheless, the improvement in Estimated Dialog Success and the decrease in the average number of turns suggest that the system performance was better.

                              Baseline  Entrainment Rules  Statistical Significance Test Result
Number of sessions             1542      1792               -
Estimated Dialog Success (%)   75.11     77.64              Fisher's. No statistically significant difference.
Avg. number of turns           12.24     11.47              Two-way ANOVA. F(1) = 8.131; p = 0.004.
Total Uptake (%)               5.35      2.39               One-way ANOVA. F(1) = 120.579; p = 0.000.
Total No Uptake (%)            0.56      0.78               One-way ANOVA. F(1) = 4.421; p = 0.036.
Total No Previous Usage (%)    1.92      1.75               One-way ANOVA. F(1) = 4.496; p = 0.034.
Total Non Understanding (%)    6.33      9.07               One-way ANOVA. F(1) = 0.240; p = 0.624.

Table 6.9: Results for the Let's Go tests. Statistically significant differences in bold.

6.1.3 Acoustic Distance and Prime Usage Evolution

The previous section showed that entrainment rules used for prime selection can improve the system performance. However, the performance numbers do not show whether the system always used the same prime, or whether the prime selected for each concept varied considerably. It is also interesting to compare the way the primes are chosen according to the entrainment rules with how they would be chosen according to the acoustic distance, to see if the system is really adapting to the users. The latter would be the choice of prime if the system preference were the only factor taken into account, since those primes should be easier to recognize than their synonyms.

In order to compute the acoustic distance between the primes and all the entries in each of the language models used in Let's Go, the primes and all the entries were synthesized with 3 different voices using Flite [13]. The acoustic distances between the synthesized samples of the primes and the remaining entries in each language model were computed using Dynamic Time Warping with the method described in [135]. Finally, the average and minimum distances were computed. Table 6.10 shows the resulting primes that would be selected for each dialog state, according to the minimum and average acoustic distances.

Dialog State           Concept       Prime Min.            Max Min.    Prime Avg.        Max Avg.
                                     Distance              Dist. (dB)  Distance          Dist. (dB)
request place          origin place  departing             4.35        leave             15.58
                       now           as soon as possible   5.57        immediately       13.42
                       help          assistance            5.01        support           13.58
request time           now           as soon as possible   5.07        now               12.93
                       origin place  departing             4.18        leave             14.56
                       help          assistance            4.99        more information  12.00
explicit confirmation  confirmation  alright               4.46        correct           12.51
                       help          assistance            4.11        more information  12.00
request next query     new query     request               4.35        query             12.54
                       start over    itinerary             3.97        ride              10.76
                       next bus      next                  3.74        next              11.40
                       previous bus  preceding             4.48        before            10.79
                       help          assistance            4.11        more information  12.00

Table 6.10: Primes selected according to the minimal and average acoustic distance for each language model.
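The sketch below illustrates the kind of computation behind Table 6.10: each prime and each language-model entry is synthesized and turned into frame-level feature matrices, and the pairwise alignment cost is obtained with Dynamic Time Warping. The feature extraction is left abstract, and this plain DTW is a stand-in for the method of [135]:

    import numpy as np

    def dtw_distance(a, b):
        """Alignment cost between two (frames x dims) feature matrices,
        normalized by the summed sequence lengths."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)

    def min_and_avg_distance(prime_feats, entry_feats):
        """Distances from one synthesized prime to all other LM entries."""
        d = [dtw_distance(prime_feats, e) for e in entry_feats]
        return min(d), sum(d) / len(d)

The prime selected per concept would then be the candidate whose minimum (or average) distance to the rest of the language model is largest, i.e. the acoustically least confusable one.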
In order to capture the prime usage evolution along the 31 days of use of the entrainment rules, the percentage of usage of each prime among the primes used for the same concept was computed. Figures 6.3, 6.4 and 6.5 show the resulting prime usage.

[Figure 6.3: Prime usage over time for the concepts confirm, help and now — panels (a) confirm, (b) help, (c) now; each panel plots the prime system usage per concept (%) by day.]

[Figure 6.4: Prime usage over time for the concepts new query, origin place and start over — panels (a) new query, (b) origin place, (c) start over.]

[Figure 6.5: Prime usage over time for the concepts next bus and previous bus — panels (a) next bus, (b) previous bus.]

6.1.3.1 Analysis

For the confirm concept, the system started by proposing "okay"; however, once "correct" was proposed, it stayed the most used prime for the rest of the test period. The former prime for confirm, "right", was rarely used. Compared to the acoustic prime choice, the most used prime coincides with the prime selected with the maximum average distance.

"Support" was always used as the help prime. There are two reasons for this. The first is that the concept only appeared twice in user turns during the test period, and the users entrained to the system's choice of prime. The second is that the help prime was always followed by the system prompt asking for the place of departure. This means that the initial prime ratio R(i) will never have an update factor subtracted from it, since non-understandings and No Uptakes will never occur. Unless the user picks a different prime, the initial prime will always remain the chosen prime. The acoustic prime choice would be "support", "more information" or "assistance", depending on the language model used and the metric chosen.

The new query rule-based prime choice alternated between "dialog", "conversation" and "route", whereas the acoustic prime choice would be "request" or "query". Figure 6.4a shows that the system tried "request", but the attempt did not seem to be fruitful. "Query" was never used, because of the "Old Prime" restriction mentioned in Section 6.1.2.3.

The next bus primes show a similar behavior. The system first proposed "later" and "after", but once "following" was proposed, the system kept it as the most used prime for the rest of the test period. The old prime "next" was only used for very limited periods, due to the "Old Prime" restriction. The acoustic distance would have recommended the use of "next".

Despite the "Old Prime" restriction, "now" was still the most used prime for the now concept, together with "next". This can be explained by the fact that it was primarily detected in NPU events, which have a higher update factor.
According to the average acoustic distance, "now" is also a possibility for the prime used by the system to ask for the travel time. If the minimal distance is taken into account, then "as soon as possible" should be used; Figure 6.3c shows that this prime was never used.

The system started by proposing "start" as the origin place prime; however, once it changed to "depart", it used any prime except that one. In this case, most of the primes listed were never used. Since this is a non-intrinsic concept, most of the time the concept is not used in user turns and the prime rank is not updated. None of these primes matches the acoustic distance prime choice.

The system alternated between "prior" and "preceding" for the previous bus concept. Occasionally it used the prime "before", and rarely the old prime "previous". "Before" was the prime chosen according to the average distance; apparently the system could not decide whether one prime was better than the other. "Preceding" corresponds to the acoustic choice according to the minimum distance.

Finally, there is no clear prime choice for the start over prime. The use of start over has the same effect on the dialog as the next query concept, i.e., after the user receives the schedule information the system asks for the next action, and if the answer is next query or start over, the system restarts the dialog from the beginning. At this point of the dialog there is a large number of non-understandings, which often results in the update of the prime ratio described in Section 6.1.2.3 for non-understandings. In addition, since the prompt explicitly directs the user to the new query prime (see S6 in Table 6.8), this concept is less used, and consequently the prime ratio is only updated after non-understandings, which results in an observable prime variance.

The comparison between the acoustic distance prime selection in Table 6.10 and the entrainment rules prime selection in Figures 6.3-6.5 shows that the two methods can lead to different prime choices, as is expected given the role of the users when the system is performing on-the-fly lexical entrainment. In addition, since prime selection with entrainment rules constantly adapts to each user, the prompts have more variety and the system sounds more natural. However, the methods sometimes selected the same primes. This means that the acoustic distance could be used to rank primes if no prior entrainment study had been carried out.

6.1.4 Discussion

The tests presented in this section have shown that entrainment can have an impact on SDS performance. In Let's Go, the estimated success rate increased and the average number of turns decreased when the system used entrainment rules. However, since the amounts of data collected are much larger, a detailed analysis such as the one provided for Noctívago could not be performed. The fact that Let's Go has been running live since 2005 was also an additional challenge. The experiment was designed to avoid the use of the "Old Primes" (see Table 6.7); however, the reduction in the % of Uptakes and the increase in the % of No Uptakes shown in Table 6.9 may indicate that some users continued to use the old primes.
Another hypothesis worth testing is whether the variety of primes, which generates different system prompts and makes the system behavior seem more natural, is also reflected in the improvement of system performance. The results obtained for Noctívago did not show improvements in dialog performance as large as those for Let’s Go. However, the fact that the former is an experimental system adds one more variable to the analysis. From the tests described in Section 6.1.2.2, the main conclusions were that the dialog confidence score should be used as the threshold score in the entrainment rules, that uptake events should be valued more in the rules, and that the prime ranking should only be affected after the second non-understanding. One interesting outcome from Tables 6.3 and 6.4 is that the Set with the highest WER was the second best in terms of system performance. That Set used the dialog confidence score as a threshold in the entrainment rules. This is already evidence that, under adverse conditions where the WER is high, entrainment rules can improve system performance.

6.2 A Data-Driven Method for Prime Selection

It was already shown that entrainment can improve system performance. However, a data-driven model could be a more robust approach to perform entrainment than the heuristic-driven rules. Also, if there were data available, it would be easier to generalize a method to perform live entrainment for any system, rather than each system developer creating her/his own heuristics. In this section we explore a statistical model for prime selection. The model was first tested off-line on the data collected from previous experiments with Noctívago and Let’s Go. Finally, a live test comparing the model with other methods for prime selection was held with Noctívago.

6.2.1 Prime selection model

Any statistical model for prime selection must share the goals of the entrainment rules: adapting the system’s prime choice with a view to improving system performance. The improvement can be achieved by combining the system and user prime preferences to lower the WER of the prime candidates. Thus, a model for prime selection should predict, at each point of the dialog, the prime with the lowest WER, based on a set of entrainment-related features. Employing transcribed data, a supervised learning method could be used to train a regression on a feature set derived from the entrainment events described in Section 6.1.1 and on new features that can help to determine whether there was entrainment between the system and the user. This feature set will be used to predict the WER for any given prime at each point of the dialog. The prime to be used in a system prompt should be the one with the lowest WER prediction.

The user entrainment events and non-understandings will be represented as binary features in the feature vector. The dialog confidence, previously used to threshold the prime updates in Short-Term entrainment, will now be an element of the feature vector. In addition to the data already used in the previous experiments, two more sources of information were used: the distance to the previous use of the prime and an entrainment measure previously developed for human-human dialogs [94]. Since entrainment is more likely to occur in the user turn immediately after the system used a prime, the distance to the last use of the prime could be an important feature to determine whether a prime was really used, or whether a recognition error caused the prime to be recognized by mistake.
If a prime was recognized at a short distance from the last system use, the prime was more likely to be correctly recognized. This was confirmed by the Pearson correlation values found between distance and confidence score, −0.35, and between the average distance for all primes in one session and the estimated success of that session, −0.54, for the Let’s Go data collected with the system running with the entrainment rules (both values are statistically significant). This distance is simply the number of user turns between the last time the system used a prime and the user turn where the prime was recognized. Table 6.11 shows the computation of the distance for two different cases. In the first case, the system used “new conversation” and the user picked it up in the subsequent turn, thus the distance is 1. In the second example, the user uses “now” before the system did, which means that in this case the distance is simply the number of user turns in that dialog so far.

[d_new conversation = 0] S1: To ask about a different journey, you can say, start a new conversation. If you are finished, you can say goodbye.
[d_new conversation = 1] U1: START A NEW CONVERSATION
...
[d_now = 0] S3: Welcome to the CMU Let’s Go bus information system.
[d_now = 0] S4: To get support at any time, just say support or press zero.
[d_now = 0] S5: Where would you like to start from?
[d_now = 1] U3: SHERADEN
[d_now = 1] S6: Where do you wanna go?
[d_now = 2] U4: DOWNTOWN
[d_now = 2] S7: Going to DOWNTOWN. Did I hear that correctly?
[d_now = 3] U5: YES
[d_now = 3] S8: When would you like to travel?
[d_now = 4] U6: NOW

Table 6.11: Example of how the prime distance was computed.

The entrainment measure for prime p, adapted from [94], in an SDS is given by:

Entr(p) = −|count_user(p)/ALL_user − count_system(p)/ALL_system|    (6.6)

where count_user(p) is the number of times the user used prime p and ALL_user is the total number of words said by the user (count_system(p) and ALL_system are defined analogously for the system). This measure gives the similarity of use of prime p between the user and the system during a session, and it was correlated with task success in human-human task-oriented dialogs.

6.2.2 Predicting WER with Off-line data

A prime selection model was trained for Noctívago and Let’s Go. The Noctívago model was trained using the data transcribed from the first studies described in Section 6.1.2.2. The Let’s Go model was trained with two months of data transcribed using crowdsourcing, which was released for the Spoken Dialog Challenge (SDC) [14]. It is important to point out that the data used for the Noctívago model came from a version of the system that already performed automated prime selection, as described in Section 6.1.2.2, whereas the system used to collect the SDC data did not.

System logs were analyzed to extract the features described in Section 6.2.1. One feature vector, F, was generated for each user turn where the presence of a prime was detected. The feature vectors were grouped by dialog state, to train a different model for each dialog state, since primes are used differently from state to state. However, to have enough data for each state, states with similar prime behavior were merged into a single dataset. For instance, the Request Origin Place and Request Destination Place states were merged into the Request Stop dataset. In the Noctívago dataset, however, since many states are under-resourced, the states with fewer than 300 samples were grouped to create a Generic dataset.
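For illustration, the following is a minimal sketch of how the distance of Table 6.11 and the measure of Equation 6.6 could be computed over a session represented as a list of (speaker, text) pairs; the function names and data layout are assumptions of this sketch, not the implementation used in the thesis.

def prime_distance(turns, prime, turn_index):
    """Number of user turns between the last system use of `prime` and the
    user turn at `turn_index` where it was recognized; if the system never
    used the prime, the count covers all user turns so far (cf. Table 6.11)."""
    distance = 1  # the current user turn counts as one
    for speaker, text in reversed(turns[:turn_index]):
        if speaker == "system" and prime in text:
            return distance
        if speaker == "user":
            distance += 1
    return distance

def entrainment_measure(turns, prime):
    """Entr(p) from Equation 6.6: the negative absolute difference between
    the user's and the system's relative frequency of use of `prime`."""
    count = {"user": 0, "system": 0}
    total = {"user": 0, "system": 0}
    for speaker, text in turns:
        count[speaker] += text.count(prime)
        total[speaker] += len(text.split())
    return -abs(count["user"] / max(total["user"], 1)
                - count["system"] / max(total["system"], 1))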
The distribution of data per state in both corpora is given in Table 6.12.

(a) Noctívago
Dialog State        # Turns
Request Next Query  331
Generic             369

(b) Let’s Go
Dialog State   # Turns
Request Next   1428
Request Time   1040
Request Stop   143
Inform         434
Generic        1733

Table 6.12: Number of turns used to train the prime selection regression.

Since the problem was to find the best weight for each feature given the target value and the feature values, regression seemed a straightforward method. The datasets were split into 75% for training and 25% for testing. Several regression methods from the scikit-learn toolbox for Python [102] were tested. Once each model was trained, the Pearson correlation between the predicted WER and the actual WER was computed for the test set, together with the coefficient of determination (R²), which measures the quality of the model (the closer to 1.0, the better the model). The best results, achieved using Linear Regression (LR) with the ordinary least squares method and Support Vector Regression (SVR) with a radial basis function kernel, are presented in Tables 6.13 and 6.14.

Model               Measure            LR     SVR
Request Next Query  Correlation        0.32   0.36
                    Coeff. Det. (R²)   0.08   0.11
Generic             Correlation        0.63   0.62
                    Coeff. Det. (R²)   0.39   0.35

Table 6.13: Noctívago models.

Model          Measure            LR      SVR
Request Next   Correlation        0.35    0.34
               Coeff. Det. (R²)   0.11    0.05
Request Time   Correlation        -0.01   0.06
               Coeff. Det. (R²)   -0.13   -0.02
Request Stop   Correlation        0.16    0.04
               Coeff. Det. (R²)   -0.84   -0.11
Inform         Correlation        0.22    0.28
               Coeff. Det. (R²)   -0.03   0.03
Generic        Correlation        0.13    0.13
               Coeff. Det. (R²)   0.01    0.09

Table 6.14: Let’s Go models.

The results for the Noctívago regressions show remarkable correlation values, especially in the Generic model, and in both cases the correlation is statistically significant. Nevertheless, the coefficient of determination is not very high in the Request Next Query model. The two regression methods achieved a very similar performance. Given that Request Next Query is the state where priming is most likely to occur, in an implementation scenario the SVR would be preferred over the LR.

The regressions trained for the Let’s Go states, except for the Request Next state-type, show lower correlation values when compared with the values obtained for Noctívago. Despite the fact that this state is probably the one where entrainment is most common, these results should be further analyzed. Several factors may have contributed to this result. The first was already mentioned: the fact that the Let’s Go data was collected in a different context, where there was no entrainment policy running and the prime set used was static. The system prompts did not show any change over time, and fewer examples of user prime events (especially No Uptake events) were found in the data. Secondly, the data collections were made in substantially different contexts. Whereas Noctívago is an experimental system tested by novice users, Let’s Go was tested by real users, some of whom were believed to be regular users of the system. It was shown that novice users tend to be more driven by the system prompts than experienced users [80]. Finally, the modifications made to the Noctívago dialog strategy to create more dialog states where entrainment could happen may also help to explain this result.
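A minimal sketch of this training and evaluation setup with scikit-learn, assuming the per-state feature matrix X and the observed WER targets y have already been extracted from the logs; the estimator choices mirror the ones reported above (ordinary least squares LinearRegression and SVR with an RBF kernel).

from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def train_and_evaluate(X, y):
    # 75% / 25% split, as in the experiments described above.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)
    for name, model in (("LR", LinearRegression()),
                        ("SVR", SVR(kernel="rbf"))):
        model.fit(X_tr, y_tr)
        predicted = model.predict(X_te)
        corr, p = pearsonr(y_te, predicted)   # correlation with actual WER
        r2 = model.score(X_te, y_te)          # coefficient of determination
        print(f"{name}: correlation = {corr:.2f} (p = {p:.3f}), R2 = {r2:.2f}")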
This is confirmed by the fact that in the Noctívago dataset an average of 2.1 turns with primes per query was found, whereas in Let’s Go the average number of turns with primes per query was only 1.24. Further work is needed to find a model that better fits the Let’s Go dialog strategy. Once the data collected with the system running the entrainment rules is transcribed and used to train a new model, the performance of the model should improve.

6.2.3 Testing the model in an experimental system

The SVR model trained with the Noctívago data was tested in the on-line system. The previously used Ravenclaw dialog manager was replaced by Cornerstone [75]. This dialog manager includes a user action model to predict the next best action the system should take, and a confidence calibration model trained to improve the confidence measure used in state-tracking dialog management. Since the Noctívago dataset is too small to train a user action model and both systems target the same domain, we opted to keep the user action model adopted in Let’s Go. The fact that the model was trained at the action level poses no difficulty to its use in a different language. However, to adapt the strategy to the type of users of Noctívago, while keeping it as close as possible to the one previously designed for Ravenclaw, minor modifications were made to the original dialog manager. The confidence calibration model, however, was trained with the Noctívago dataset. In these experiments the language models were also updated. The details can be found in Section 3.2.1.2.

In the tests performed with off-line data, each feature vector was considered as an isolated observation. However, since the prime selection depends on the evidence built in previous turns [48], the predicted WER for turn t, ŴER_t, should also incorporate a scaling factor representing those turns. The WER prediction from the previous turn, ŴER_{t−1}, was chosen as the scaling factor. Thus, the predicted WER for prime p at turn t is given by:

ŴER_t(p) = ŴER_{t−1}(p) × r(F)    (6.7)

where r(F) is the prediction value given by the trained regression for F = <f_1, ..., f_n>, the feature vector generated during live interaction for every turn where a prime was recognized. Since at the beginning of the dialog ŴER_{t−1} is not available, we assume that the most frequently used primes are more likely to have lower WER. Hence, the initial prediction value is given by:

ŴER_0(p) = 1 − count_user(p)/count_user(C)    (6.8)

where count_user(C) corresponds to the sum of the counts of all the primes used for concept C. The relative frequency is subtracted from 1, so that the more frequent words have the lowest ŴER_0. At the beginning of each session the primes are ranked according to this value.

6.2.3.1 Test Set

In order to evaluate the new model for prime selection, a study was conducted where it was compared with other strategies for prime selection, all of them running with exactly the architecture shown in Figure 3.9b, with Cornerstone and the Unity-based ECA. These strategies were entrainment rules, random primes and fixed primes. The entrainment rules version performs the selection using the heuristics defined in Section 6.1.2.3. The random primes version randomly selected the primes to be used. The fixed primes version used only the prime that was most acoustically distinct from the remaining entries in the language model. This distance was found using a predefined acoustic distance between pairs of phones, as illustrated in the sketch below.
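A minimal sketch of how such an acoustically most distinct prime could be found, assuming a hypothetical phone-pair distance table and phone-sequence representations of the vocabulary entries; the actual table and alignment procedure used in the thesis may differ.

def weighted_edit_distance(a, b, phone_dist):
    """Levenshtein distance between two phone sequences, with substitution
    costs taken from the predefined phone-pair distance table."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] \
                else phone_dist.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[n][m]

def most_distinct_prime(primes, vocabulary, phone_dist):
    """Pick the prime whose minimum distance to any other entry in the
    language model is largest (i.e., the most acoustically distinct one)."""
    return max(primes, key=lambda p: min(
        weighted_edit_distance(p, w, phone_dist)
        for w in vocabulary if w != p))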
The versions ran alternately on different days. The subjects were recruited via e-mail or a Facebook event. In either case they were given a web link to a page with the instructions. The task given was to query for any bus schedule that they might actually take. Users were informed that different versions were running, although they did not know which one they were testing. They were also given instructions about the press-to-talk interface that they were using. After they completed their request, they were asked to fill in a questionnaire based on the PARADISE questionnaire [140]. Two new questions were added to the questionnaire mentioned in Section 6.1.2.1: whether ‘the system was able to propose alternatives when it encountered problems’ and whether ‘you felt that the system was proposing words in a smart way’. These new questions were used to help confirm whether users detected the system’s adaptation of lexical choices. The complete questionnaire can be found in Annex C.2.3.

The list of primes used in this test is shown in Table 6.15. When comparing this table with Table 6.2, one may notice that some primes changed their type from intrinsic to non-intrinsic. This change was caused by the change of dialog manager. The confirmation prompt after all the slots are filled is no longer part of the dialog strategy. This means that Arrival Place, Origin Place and Time are only used when the system is requesting destination, origin and travel time, respectively. Thus, if they are incorrectly recognized, the course of the dialog should not change.

Type of Prime  Concept        Primes
Intrinsic      Next           próximo / seguinte
               Now            agora / imediatamente / neste momento / o mais rápido possível / o mais brevemente possível
               Price          preço / valor
               Start Over     outro percurso / nova procura / nova pesquisa / procurar novamente / outra procura / nova busca
Non-Intrinsic  Arrival Place  chegada / destino
               Origin Place   partida / origem
               Time           horas / horário

Table 6.15: Primes used by Noctívago in the data-driven model tests.

6.2.4 Results

A total of 214 dialogs were collected during these tests. 88 sessions were completed by female subjects. 53 of these sessions were performed by people from the laboratory. 96 different users participated in these tests, 4 of them non-native speakers. 8 people that participated in these tests had already participated in the tests described in Section 4.3, and 23 had participated in the tests described in Section 6.1.2.2.1.

The dialogs were orthographically transcribed to compute prime- and word-level performance metrics, since these are expected to be greatly influenced by prime selection methods. The transcriptions were also parsed using an off-line version of the Phoenix semantic parser and the same grammar used in the live system. Table 6.16 shows the performance in terms of out-of-vocabulary words (OOV), WER and CTC. For WER and CTC, the results were further analyzed at the prime level and at the intrinsic prime level.

                    Data Driven  Rule Based  Random Primes  Fixed Primes
OOV (%)             8.92         11.10       11.68          22.18
WER (%)             27.84        33.05       35.68          33.29
WER Primes (%)      27.53        25.77       37.38          35.25
WER Intrinsic (%)   23.90        24.38       38.42          33.33
CTC (%)             68.88        65.24       65.78          66.18
CTC Primes (%)      65.22        72.58       48.20          55.28
CTC Intrinsic (%)   69.78        75.93       49.15          58.95

Table 6.16: OOV, WER and CTC results for the different versions. Statistically significant results in bold (one-way ANOVA with F(3) = 2.881 and p-value = 0.037).
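The significance test reported in the caption of Table 6.16 can be reproduced with SciPy; a minimal sketch, where the four lists stand for per-dialog measurements of one metric (the values below are placeholders, not the collected data):

from scipy.stats import f_oneway

# Hypothetical per-dialog values of one metric for each version.
data_driven  = [27.1, 28.3, 26.5, 29.0]
rule_based   = [33.2, 32.7, 34.1, 31.9]
random_prime = [35.4, 36.2, 34.9, 36.1]
fixed_prime  = [33.5, 32.8, 34.0, 33.1]

f_stat, p_value = f_oneway(data_driven, rule_based, random_prime, fixed_prime)
# With four groups there are 3 numerator degrees of freedom, matching the
# F(3) reported in the table; p < 0.05 marks a significant difference.
print(f"F(3) = {f_stat:.3f}, p = {p_value:.3f}")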
Figure 6.6 shows a graphical representation of the results in Table 6.16. These results show that both versions with a prime entrainment policy clearly outperformed Random and Fixed Primes, both in terms of reducing OOV and WER and in terms of increasing the percentage of CTC in the dialog. The differences are particularly remarkable when the analysis is restricted to primes, and even more so for intrinsic primes, as can be observed in the graphs in Figure 6.6. Both the Data Driven and Rule Based methods were able to improve the system performance when dealing with turns containing primes. The differences achieved might have influenced the course of the interaction, improving the user experience. Among the prime selection algorithms, the Rule Based method showed a slightly better performance than the Data Driven one.

The relative frequency of User Prime Events is another important measure to evaluate in this study, shown in Table 6.17 for this dataset. The percentage is computed as in Equation 6.2.

[Figure 6.6: OOV, WER and CTC results for the different configurations tested. Panels: (a) OOV; (b) WER; (c) CTC.]

                        Data Driven  Rule Based  Random Primes  Fixed Primes
Uptakes (%)             9.54         9.97        9.08           8.60
No Uptakes (%)          2.26         1.64        3.94           1.76
Non-Understandings (%)  0.57         1.03        0.72           1.14

Table 6.17: Entrainment Events and Non-Understandings relative frequencies. One-way ANOVA tests revealed no statistical significance.

The Entrainment Event results show a higher Uptake percentage for the versions with prime selection algorithms implemented, which could mean that the users felt more comfortable with the primes that the system was using. The results for No Uptakes interestingly show that the highest percentage occurs with Random Primes, where there was also more variance in the primes used. Regarding Non-Understandings, the prime selection algorithms were less prone to Non-Understandings than Fixed Primes.

Table 6.18 shows the distribution of the dialogs per version, the estimated dialog success, the real dialog success, the average number of turns per dialog, and the average number of turns with primes per dialog.

                                             Data Driven  Rule Based  Random Primes  Fixed Primes
Number of Dialogs                            57           53          52             52
Estimated Success Rate (%)                   96.5         98.1        94.2           98.1
Real Success Rate (%)                        82.4         82.6        84.9           86.9
Number of Turns (avg. per dialog)            14.75        13.21       13.73          13.15
Number of Turns w/ Primes (avg. per dialog)  2.82         2.34        2.67           2.37

Table 6.18: Dialog performance results.

These high-level measures show that Fixed Primes was the version with the best success rate and the lowest average number of turns per dialog. Statistical significance of Real Success Rate, Number of Turns and Number of Turns with Primes was computed with one-way ANOVA tests, and the difference between methods was not statistically significant for any of these measures. However, since session success is computed based on the information returned to the user, it might not be the best measure of how effective the prime choice was. In fact, a session could be successful even without the need to use any prime. For instance, if a user makes a single request asking for a bus at a specific hour instead of asking for the next bus, and ends the dialog as soon as she/he receives the correct information, no prime is used in this dialog. Compared to previous tests held with Noctívago (Section 6.1.2.2), there was a significant improvement in system performance.
We believe that the new Dialog Manager, the new Language Models and the new ECA have contributed to this improvement. These results show that the system has greatly increased in robustness since the first experiments presented in this thesis.

Finally, Table 6.19 shows the results of the questionnaire that many of the testers answered (184 questionnaires for 214 tests). The distribution of questionnaires per version, the average overall satisfaction and the average score in the entrainment-related questions are detailed. The maximum possible value for overall satisfaction was 33, and for entrainment satisfaction it was 12.

                                    Data Driven  Rule Based  Random Primes  Fixed Primes
Number of questionnaires            50           46          47             41
Average Satisfaction Score          22.08        20.85       22.72          22.80
Average Entrainment Question Score  9.08         8.30        9.70           9.66

Table 6.19: Questionnaire results for the different versions.

These results confirm the informal comments from some users that they hardly noticed any difference between the different versions. Even regarding the entrainment question set, the versions that had no algorithm implemented achieved higher scores, even though in the case of Fixed Primes the primes were not changing at all. Since there was a lot of variation in the Random Primes version, the users might have thought that the system was adapting in some way, although it was not. This could explain why the highest entrainment question score was achieved in this version.

6.2.4.1 Prime Usage Evolution

In the previous section the results have shown that both the Data Driven and Rule Based methods outperformed the baseline methods. Our expectation was that the Data Driven method would perform slightly better than the Rule Based method; however, the opposite occurred. In order to have an overview of the prime selection, Figures 6.7 and 6.8 show the evolution of prime usage during the testing days for the Data Driven (DD) and Rule Based (RB) methods, similarly to what was done in Section 6.1.3 for the evolution of the Let’s Go primes. Figure 6.7 shows the evolution for intrinsic primes, whereas Figure 6.8 shows it for non-intrinsic primes.

The choice of prime for Now and Next Bus almost never changed during the test period with either method. “Próximo autocarro” and “agora” are by far the primes most often chosen by both methods. This could mean that a large majority of the users prefer to use those primes instead of the other options available. The case of the Price prime shows much variation in both methods, although the Data Driven method only used “valor” for the first few days. This could be a prime where each user has her/his own preference, and the system is continuously changing to adapt to each user’s preference. This is confirmed by the high number of No Uptake events for this concept. The Start Over prime selection frequency shows some differences between the two methods. The Rule Based method used “nova pesquisa” during the majority of the test period. On the other hand, the Data Driven method modified the prime used towards the end of the test period. Both methods used only a small subset of the prime candidates available. As for the non-intrinsic primes, the choice for the Arrival Place and Origin Place primes shows a similar distribution. There is a clear preference for one of the primes: “chegada” for Arrival Place and “partida” for Origin Place. For the Time prime, both methods kept changing the prime used during the test period.
The reason for this change can be associated with the fact that at the point of the dialog where the system uses this prime a statistical language model is used, whereas at the other points where the system tries to entrain an SRGS grammar is used, which does not allow any non-understandings, since the grammar specification is based on previous data and forces the recognition output to be an in-grammar utterance.

[Figure 6.7: Comparison of intrinsic prime usage in prompts between Data Driven (DD) and Rule Based (RB) prime selection. Panels: (a) Next Bus; (b) Now; (c) Price; (d) Start Over.]

[Figure 6.8: Comparison of non-intrinsic prime usage in prompts between Data Driven (DD) and Rule Based (RB) prime selection. Panels: (a) Arrival Place; (b) Origin Place; (c) Time.]

6.2.5 Discussion

The results of these tests have shown that it is worth having an algorithm that performs lexical entrainment, although this still needs to be confirmed by a larger data collection. Around 50 sessions per version seems insufficient to collect enough data to obtain statistically significant results. The performance of the Rule Based method was slightly better than that of the Data Driven method for the majority of the comparable items. This could mean that the intuition about entrainment that was implemented in the Rule Based method was not yet learned by the Data Driven model. For instance, for the two concepts with the most prime variance, Price (Figure 6.7c) and Time (Figure 6.8c), the Rule Based method’s prime choice showed more variability than the Data Driven method’s. This is probably due to the Short-Term entrainment phase within dialogs, where there is a stronger adaptation to the user. Unlike the Let’s Go test, none of the Noctívago tests with different entrainment policies resulted in the expected outcome.
The configurations where the best performance (in terms of dialog success rate and average number of turns) was expected were not the ones that performed best. The first reason for this was already mentioned: a dialog can be successful without the use of any prime. But other reasons can explain why the dialog success rate might have been better with Fixed and Random Primes. One thing that might have made the algorithms for prime selection more effective in Let’s Go than in Noctívago is the fact that Let’s Go uses context-dependent statistical language models in every dialog stage. This increases the number of outputs that the ASR can produce and gives more freedom to the user. On the other hand, this also generates more non-understandings, which means more opportunities to update the prime rank. In Noctívago, in order to have a more robust system, we opted for SRGS grammars at some points of the dialog. The price paid for boosting the performance was the reduced chance of prime rank updates in case of non-understandings, as the results in Table 6.17 confirm. However, the goal of prime selection algorithms is precisely the reduction of non-understandings without limiting the user input. We believe that if the users have more freedom in what they can say, the prime update algorithms will be even more effective than they were here.

The results in Table 6.19 show that the users were not able to distinguish between systems that were adapting and those that were not. This might not be an easy task for a user when a dialog is less than 14 turns long and primes are only used in 2 or 3 turns, which means that they only have 1 or 2 chances to perceive whether the prime was changed. It is curious that the users felt more adaptation in the Random Primes version. The explanation could lie in the fact that the Random Primes version is very likely to modify the words used during the same dialog, unlike the other versions. Despite this observation, we strongly believe that automated entrainment can have a relevant role in the success of task-oriented dialog systems.

6.3 Summary

In this chapter we have presented two different approaches to automated entrainment: Rule-Based and Data-Driven. The Rule-Based approach (Section 6.1) was developed based on a set of heuristics created from lexical entrainment theory and the intuition developed during the tests described in Chapter 4. The Data-Driven approach (Section 6.2) was developed to learn how to entrain based on a set of entrainment-related features and other sources of information that could be extracted from live interaction. This method tries to predict the primes that are less likely to be incorrectly recognized based on this set of features.

The Rule-Based method was first implemented and tested in Noctívago. The results were inconclusive, but revealed interesting trends that were used in the second version of the Rule-Based algorithm, which was tested with Let’s Go. These tests successfully showed that entrainment can improve the performance of a task-oriented SDS. The estimated success rate increased, accompanied by a decrease in the average number of turns per dialog.

The Data-Driven method was first tested on off-line data for both Noctívago and Let’s Go. The off-line results were better for the Noctívago data than for the Let’s Go data. A slight modification was made to the model when it was implemented in Noctívago.
A scaling factor that tried to incorporate information about the previous turns in the prediction was added. This model was tested against three different methods for prime choice: Rule-Based, Random and Fixed choice. The results in terms of dialog success did not show any improvement from the automated prime selection methods, which was not surprising given the way success was computed. However, the prime WER, the OOV percentage and the prime CTC percentage were indeed better in the versions running automated prime selection than with Random or Fixed prime choice. The comparison between methods for automated prime selection showed a slight advantage for the Rule-Based method in the majority of the comparable results. We believe that with a larger dataset to train the entrainment model, these results could change.

7 Conclusions

This thesis has shown that lexical entrainment can influence the performance of spoken dialog systems. The adopted strategy was to make the system entrain to the user whenever this does not hinder the system performance, and to propose alternative primes when a prime is often followed by Non-Understandings or low-confidence turns. All the studies that were made have shown some sort of improvement when using a prime selection policy.

The studies were done with two parallel systems for the same domain, Noctívago and Let’s Go, described in Chapter 3. The steps followed in the several iterations to create and improve Noctívago were described in detail. In fact, the process of creating a robust SDS for European Portuguese ran in parallel with the development of methods for on-the-fly prime selection. The robustness of the system has indeed increased progressively during the thesis, as can be seen from the results presented in Section 6.2.3.1. We have also managed to integrate dialog-state tracking in Noctívago, which corresponds to the state of the art in terms of dialog management. We have also contrasted the different modules used in Noctívago and Let’s Go.

In Chapter 4, after identifying a set of prime candidates, we created a list of synonyms for the concepts included in the prime candidate set. A study was conducted over two weeks: in the first week the prime set used was fixed, and in the second week the primes were randomly picked from the synonyms in the prime candidate list. Despite the absence of criteria in the prime selection in the second week, the system performance was better. There was entrainment between the user and the system, as we expected. In order to automate the prime selection, the dialogs were carefully analyzed to observe the user behavior towards the primes used. Two prime events were identified: Uptakes, when the user picked the prime from the system prompt, and No Uptakes, when the user decided to use a different prime from the one used by the system. We observed that the primes with a higher number of No Uptakes corresponded to the most commonly used words in the language, which was confirmed by the correlation found between the number of No Uptakes and the number of hits of each prime in a web search engine, and also with the frequency of each prime in a spoken language corpus. This already constitutes something that can be verified on the fly to decide whether the system should retain the prime or use a different one. Another important outcome of this study was the users’ feedback. Users appreciated the increased variety of prompts available.
How to combine this with the improvement of system performance was another problem that came up in these tests. The uncertainty in the SLU module strongly suggests that some sort of confidence score should validate the detected prime events.

In Chapter 5, the dialog confidence score generated by the system was studied in order to improve its accuracy. Initially, only one feature was used in the confidence score computation. Since many of the system modules were replaced or modified, the data collected in the previous Noctívago tests was used to train a new confidence annotation model. Only the features that were common to every turn were used in the first attempt. We observed that ASR confidence was the most heavily weighted feature. The resulting models outperformed the original model. However, since many features were left out, we also decided not to use the non-understood turns in the training procedure. This reduced the number of turns used to train the model, but on the other hand allowed the evaluation of a larger set of features. In fact, the system behavior regarding Non-Understandings would not change if the confidence model were different. The training procedure used also differed, and led to a different model than the one trained including the non-understood turns. In this training procedure stepwise regression was used, which means that only helpful features were added to the model. The procedure of adding and testing a new feature was repeated until the BIC or LLIK stop criteria were satisfied. The resulting model only included dialog-level features. The ASR confidence score was not amongst the features evaluated before the model satisfied the stop criterion. We have learned from this study that confidence models are highly dependent on the system modules used. For instance, if the language model is modified, the ASR performance will be different. Despite the higher performance of the trained models over the baseline model, none of them were used in the tests held with automated prime selection, since there were no consecutive tests held with exactly the same system modules. The system was iteratively improving, and consecutive experiments were held with different versions of the system, which did not let us re-train the model. Many system modules underwent modifications that were implemented in parallel with the development of the prime selection algorithm. Thus, these models were no longer valid for configurations other than the ones they were trained on.

Automating the prime selection was the next step to take. In Chapter 6, three studies were conducted with two different methods for prime selection. Due to the limited data resources available, our first approach to automated prime selection was rule-based. The implemented algorithm relied on User Prime Events, Non-Understandings and the confidence score to update the prime rank for every concept list. In order to satisfy the previous findings from the lexical entrainment literature, and to circumvent the fact that none of the systems had user models, we designed an algorithm with two phases. The first phase, Long-Term Entrainment, captured the prime preference of the whole user population, whereas the second phase, Short-Term Entrainment, tried to adapt the system behavior to each user during the course of the session. The algorithm was first tested in Noctívago with different configurations.
The results showed that the heuristics used to update the prime during Short-Term entrainment were not improving the system performance. Thus, when envisaging new experiments with Let’s Go, the algorithm was modified. The tests held with Let’s Go offered the possibility of testing the Rule-Based algorithm with a large population of real users. The performance of the system running the Rule-Based prime selection was compared to the performance of the system before it was implemented. The results revealed a 2% absolute increase (10% relative) in the number of estimated successes and a reduction of one turn per dialog (6% relative reduction), which demonstrates the effectiveness of on-the-fly lexical entrainment. The Rule-Based prime selection over the test period was compared to the acoustic distance prime selection. The choice of the prime to be used was very clear for some concepts and less clear for others. This could mean that there are concepts where there is a prime that is used by the large majority of the users, so there is no need for adaptation, whereas in other cases the adaptation to the user during the session is more necessary. Despite the differences found, the acoustic distance prime selection shared some of the primes that were the most common when using the algorithm. For this reason, when building a new system from scratch, using acoustic distance might be suitable to initially rank the prime candidates.

With the data collected from the Noctívago tests we were able to create a supervised learning method for prime selection. The idea behind the method is to predict which of the listed primes is more likely to be correctly recognized. In order to test this idea, an SVM regression was trained to predict the WER for the primes listed, based on a set of features that included the User Prime Events, Non-Understandings, confidence, the distance to the last system use of the prime, and an entrainment measure that was correlated with success in human-human task-oriented dialogs. The model required transcribed data in order to be trained. The Let’s Go model was trained with data transcribed using crowdsourcing, whereas the Noctívago model used the previously collected data. The off-line test showed higher accuracy for the Noctívago model. The fact that it is an experimental system, tested by volunteer users and using different primes, may have contributed to the generation of a richer dataset for entrainment model training. The model was tested on the fly using a scaling factor representing the history of the dialog. The live tests compared this model with Rule-Based, Random and Fixed choice, the latter based on the acoustic distance. The dialog success rate of the system versions running with a prime selection policy was below that of Random and Fixed prime choice. However, the results for prime WER and prime CTC rate were better using either Rule-Based or Data-Driven prime selection, which means that the acquisition of the concepts represented by primes was better. The Rule-Based approach achieved a slightly better performance than the Data-Driven one, probably because of the limited amount of training data.

Differences were observed in the behavior of the two types of users that tested our systems. These might have influenced the effectiveness of the algorithms for prime selection. The main concern of volunteer users was to get information from the system at any cost. They would change their initial intention if they believed that was necessary to complete the task.
They also picked up the terms from the system prompts more often than real users, since they believed it was necessary to achieve session success. Real users, on the other hand, do not modify their initial intention, but they do adapt their strategy to address the system if they believe it is necessary, as long as it does not imply a modification of the initial intention. For both user types, we have managed to exploit this adaptation to improve the system performance. In addition, the users’ questionnaires show that in most cases they hardly noticed any sort of lexical adaptation in the system. Since the primes that the system used are selected by the combination of user and system prime preference, the system should talk in a more natural way, and this was something that expert users reported as well.

The prime selection algorithm’s effectiveness was reflected in the Let’s Go dialog success rate, unlike what happened for Noctívago. The Let’s Go dialog is more mixed-initiative than Noctívago’s. Despite having context-dependent language models, it is possible to recognize concepts other than the most expected one at any point of a Let’s Go dialog. On the one hand, this will lead to an increased number of Non-Understandings, but on the other hand it will increase the chances to apply the prime selection algorithms. If these conditions are verified, the prime selection algorithms are more likely to have a positive impact on the dialog success rate.

Despite the positive results of this work, there are many things that can improve the way SDSs lexically entrain to users. Some of them will be presented in the next section.

7.1 Future Work

Establishing a contrast between the different versions of the same SDS was one of our biggest difficulties. The amounts of data needed to obtain statistically significant results for some of the evaluation parameters are far from what we were able to collect. Performing the tests with volunteer users made it even more challenging, since we always depended on the willingness of the volunteers to test the system. A user simulator able to operate at the prompt level rather than at the action level could be a future solution to easily compare different methodologies for prime selection. There is already some work in developing user simulators that operate at the prompt level [79, 66]. We believe that if they are enriched with some entrainment-based knowledge, they can be a very helpful tool in the design of SDS lexical entrainment policies. Another evaluation parameter that could be introduced in future tests is the naturalness of the prompt. Based on linguistic features and speech synthesis accuracy, a naturalness measure could determine how natural a specific prompt is, to be compared with the observed Uptake percentage and the prime acquisition success.

In terms of the prime selection algorithms, the results from our last experiment show that the data-driven model performance was slightly lower than the rule-based performance. To extend this approach to other systems or to a different domain, a rule-based algorithm is not recommendable, since it requires the developer to create a different set of rules for each system developed. The scarcity of data used to train the model, and the fact that we privileged system robustness by using SRGS grammars in some contexts, might explain the better performance of the rule-based method. However, the model presented will always need transcribed data.
Besides improving the current model with more data and new knowledge sources, another future challenge is the creation of an unsupervised method for learning prime selection. The use of reinforcement learning, for instance by integrating prime selection in the SDS-POMDP framework, is a possibility. However, this will add scalability problems to the solution of the reinforcement learning problem. Recent optimization techniques could make this solution feasible.

One current limitation of the framework used for prime selection is that the prime candidate lists are limited to the synonyms found by the system developer for each concept. When analyzing the dialogs, we found that some users picked a different choice that the system was not ready to deal with, since it was not in the predefined prime list. It is unrealistic to think that a developer is capable of covering all the possible primes for a given concept. On a longer horizon, the system would benefit from an automatic prime learning procedure. On a closer horizon, the extension of prime lists to concepts like place names or time events in the domain used is a very likely possibility.

Porting prime selection to different domains is another future ambition. Different domains may offer longer dialogs and user-dependent dialog models, which will make adaptation easier. It would also be interesting to test these algorithms in dialogs that are not task-oriented. It has been reported for human-human dialogs that lexical entrainment is indeed stronger in task-oriented dialogs. Since the expectations of humans of being understood by a machine are lower when compared to being understood by other humans, perhaps lexical entrainment could also be very helpful in non-task-oriented human-machine dialogs.

A Tools for Oral Comprehension in L2 Learning

A.1 Introduction

The vowel reduction and simplification of consonantal clusters phenomena are well-known characteristics of European Portuguese that distinguish it from Brazilian Portuguese. These phenomena pose difficulties in oral comprehension for students of Portuguese as a second language (L2). Thus, enhancing the oral comprehension skills of L2 students became one of the research goals of the REAP.PT project. The original REAP system [56], developed at Carnegie Mellon University, focuses on vocabulary learning by presenting readings with the target vocabulary words in context for students, either native or non-native. REAP.PT aims at much more than just adapting the system to Portuguese [89]. The system explores all types of language tools and corpora for automatically building language learning exercises, adopting as much as possible a serious gaming perspective. Unlike the original REAP system, which dealt exclusively with text documents, REAP.PT deals with audiovisual documents, in particular Digital Talking Books (DTB) and Broadcast News (BN) stories. This exposure to oral European Portuguese aims to make the L2 students more familiar with vowel reduction and the simplification of consonantal clusters. Hence, besides the vocabulary learning contents that were developed for the original REAP, and the new grammar drills and games that use a plethora of NLP tools, REAP.PT also includes a number of features and exercises designed to enhance oral comprehension skills. This was the context for the initial work on oral comprehension of this thesis.
This appendix describes the main steps, starting with the integration of a synthesizer to read words or sequences of highlighted words, and the use of audiovisual documents in REAP.PT.

A.1.1 Speech synthesis

The synthesizer was integrated as a webservice in the REAP.PT platform. It is available in every text document. Students can highlight words or a sequence of words in the document and select the “play” option to listen to the sentence read by the synthesizer. This generates a request to the synthesizer webservice to read the selected text out loud. When searching for the meaning of a particular word in the dictionary, the pop-up window also includes the same “play” button to listen to the selected word. The synthesis is provided by DIXI, a concatenative unit selection synthesizer [101] based on Festival [11].

A.1.2 Digital Talking Books

Digital Talking Books, also known as audio books, were the first type of non-text materials integrated in the multimedia REAP.PT platform. Audio books are mostly used for entertainment and e-inclusion applications, and their use for Computer-Assisted Language Learning (CALL) is much more novel. They can be especially important for non-native students of European Portuguese, giving them the possibility to listen to the whole story with the text in front of them or to listen to a selected word sequence from the text. DTBs may have drawbacks in terms of free accessibility, but their advantages in terms of controlled-quality recordings, both at the prosodic and segmental levels, far outweigh these drawbacks.

In order to play a selection of highlighted words, the corresponding word boundaries need to be determined. This was achieved using our automatic speech recognition system (ASR), Audimus [96], in forced alignment mode. Audimus is a hybrid recognizer whose acoustic models combine the temporal modeling capabilities of hidden Markov models (HMM) with the pattern classification capabilities of multi-layer perceptrons (MLP). Its decoder, which is based on weighted finite state transducers (WFST), proved very robust even for aligning very long recordings (a 2-hour audio book recording could be aligned in much less than real time). For research purposes, we built a small repository of DTBs covering a wide range of genres: fiction, poetry, children’s stories, and didactic textbooks.

A later version of the multimedia REAP.PT platform improved the original version developed by the author, integrating a karaoke-style presentation that highlights words as they are spoken, as well as slow-down features.

A.1.3 Broadcast News

The repository of DTB stories may not cover the areas that will potentially interest the students. In addition, their durations typically exceed the recommended content size for REAP.PT materials. This motivated us to take advantage of a large corpus of manually transcribed BN stories, used to train and test Audimus. Since transcriptions were available at the utterance level, the alignment at the word level was easily obtained using Audimus in forced alignment mode. With the alignment at the word level, the student could perform the same actions previously performed with the DTBs. Despite widening the range of subjects of the multimedia contents in REAP.PT, we believed that the use of automatically transcribed recent broadcast news, instead of old news stories, might increase the students’ motivation.
The next section describes the process that led to the integration of recent broadcast news shows in the REAP.PT platform.

A.1.3.1 Integrating Automatically Transcribed News Shows in REAP.PT

The automatically transcribed broadcast news shows have several advantages for CALL: they are recent, divided into short stories, topic-classified, and have a video associated with each story. The only drawback is the possible presence of errors. The errors can either be marked, to let the student know that the transcription may be incorrect, or filtered, removing the stories with too many errors. This section describes the broadcast news pipeline and its integration in the REAP.PT platform.

A.1.3.1.1 Broadcast News Pipeline

The first module of the broadcast news pipeline is the jingle detector, which detects the start and end times of the news show and also removes advertising segments.

Audio Segmentation
After the jingle detector, the audio is segmented using several components: three classifiers (Speech/Non-Speech, Gender, and Background), a speaker clustering module, an acoustic change detection module, and a speaker identification module that identifies very frequent speakers for whom acoustic models have been previously trained (e.g. anchors). All classifiers are multi-layer perceptron based. The errors in the segmentation are propagated to the other modules of the pipeline. For REAP.PT, the impact of audio segmentation errors is comparatively negligible. However, an error in identifying the most frequent speaker (the anchor) may have a major impact on breaking the news into stories, as described below.

Automatic Speech Recognition
The next step in the pipeline is speech recognition using Audimus. The MLP/HMM architecture already mentioned combines posterior probabilities from three phonetic classification branches, based on PLP (Perceptual Linear Prediction), Log-RASTA (log-RelAtive SpecTrAl), and MSG (Modulation SpectroGram) features. The WFST-based transducer has a large search space that results from the integration of the HMM/MLP topology transducer, the lexicon transducer and the language model transducer. The recognizer also yields a word confidence measure by combining several features into a maximum entropy classifier, whose output represents the probability of the word being correctly recognized.

The first acoustic model was trained from 46 hours of manually transcribed broadcast news. Later, unsupervised training was adopted, using as training data all the words that were recognized with a confidence score above 91.5%. The second acoustic model was trained with 378 hours, of which 332 were automatically transcribed. The final acoustic model was trained with 1000 hours, and resulted in 38 three-state monophones, plus a single-state non-speech model for silence, and 385 phone transition units chosen to cover the majority of the word transitions in the training data.

The language model (LM) used is updated daily. It is a statistical 4-gram model and results from the interpolation of three specific LMs: a back-off 4-gram LM trained on a 700M-word corpus of newspaper texts; a back-off 3-gram LM estimated on BN transcripts; and a back-off 4-gram LM estimated on web newspaper texts collected during the previous seven days. The final interpolated language model is a 4-gram LM with modified Kneser-Ney smoothing.
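A minimal sketch of the linear interpolation behind such a combined model; the weights are hypothetical and would in practice be tuned, e.g. by minimizing perplexity on held-out data, and each component model is assumed to expose a back-off n-gram probability function.

def interpolated_prob(word, history, component_lms, weights):
    """P(word | history) as a weighted sum of the component LM estimates.

    component_lms: callables returning back-off n-gram probabilities.
    weights: interpolation weights summing to 1 (hypothetical values).
    """
    return sum(w * lm(word, history)
               for lm, w in zip(component_lms, weights))

# Usage, assuming newspaper_lm, bn_lm and web_lm are such callables:
# p = interpolated_prob("autocarro", ("o", "próximo"),
#                       [newspaper_lm, bn_lm, web_lm], [0.5, 0.3, 0.2])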
The vocabulary is also updated on a daily basis from the web [88], which implies a re-estimation of the LM and the re-training of the word-level confidence measure classifier. Once the 100k vocabulary is selected, the pronunciation lexicon is created by dividing the words into those that can be pronounced according to the Portuguese pronunciation rules and those that do not follow those rules. The pronunciation for the first group is given either by the in-house lexicon or by a rule-based grapheme-to-phoneme (GtoP) conversion module [35]. The remaining words, which might be acronyms or foreign words, are subsequently divided into those whose pronunciation can be derived from the Carnegie Mellon public domain lexicon, where a nativized pronunciation is derived by rule, and those whose pronunciation cannot be derived from there. For the latter, grapheme nativization rules are applied before their pronunciation is given by the GtoP module. The final multiple-pronunciation lexicon generally includes 114k entries.

The Word Error Rate (WER) was computed for an evaluation set composed of six one-hour news shows from 2007. Using a fixed LM/vocabulary, the result was 18.4%. The performance differs significantly from clean read speech (typically below 10%) to spontaneous speech or noisy environments (typically above 20%).

Punctuation and capitalization
The next modules in the pipeline are responsible for enriching the raw transcription provided by the speech recognizer. The first rich transcription modules introduce punctuation and capitalization. The approach followed uses maximum entropy models for both tasks [9]. Capitalization explores only lexical cues. The results were much worse for automatically transcribed BN (F-measure = 75.4%) than for manually transcribed BN (F-measure = 85.5%), reflecting the effects of the recognition errors. Besides lexical cues, punctuation also explores acoustic and prosodic cues. Initially, this module only targeted commas and full stops. Results for full stop detection (F-measure = 69.7%) were better than for comma detection (F-measure = 50.9%). A later version of this punctuation module targeted question marks as well.

Topic segmentation and indexation
This module aims to split the BN shows into their constituent short stories [8]. These shows typically consist of a sequence of segments that are either stories or fillers (i.e., headlines/teasers detected by the audio segmentation module). The division into constituent stories is done using a common heuristic: every story starts with a segment spoken by the anchor, and is further developed by out-of-studio reports and/or interviews. Hence, the simplest segmentation algorithm consists of defining potential story boundaries at every non-anchor/anchor transition, as illustrated in the sketch below. This heuristic can be refined using a Classification and Regression Tree based approach that deals with double anchors, local commentators and thematic anchors (e.g. for sports or meteorology). The segmentation obtained using this module achieved an F-measure of 84%. After segmentation, the topic indexation module assigns one or multiple topics to each story, as detailed in [2]. The set of 10 topics is: Economy, Education, Environment, Health, Justice, Meteorology, Politics, Security, Society, and Sports. A further classification into National and International is also done.
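A minimal sketch of the simplest boundary heuristic mentioned above, before the CART-based refinement; segments are assumed to be (speaker, text) pairs, with anchor identities supplied by the speaker identification module.

def segment_stories(segments, anchor_ids):
    """Split a news show into stories, opening a new story at every
    non-anchor -> anchor transition."""
    stories, current = [], []
    previous_was_anchor = True   # the show is assumed to open with the anchor
    for speaker, text in segments:
        is_anchor = speaker in anchor_ids
        if is_anchor and not previous_was_anchor and current:
            stories.append(current)   # boundary: non-anchor -> anchor
            current = []
        current.append((speaker, text))
        previous_was_anchor = is_anchor
    if current:
        stories.append(current)
    return stories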
The final module follows a simple extractive summarization strategy, creating a summary of each story by choosing its first sentence.

A.1.3.1.2 Integration in REAP.PT

REAP.PT is accessed via a web interface. Students take a pre-test when they first log in to the platform. In the pre-test, they select the words they know from a list, so that they can be assigned one of the 12 school levels. After the pre-test, the student has several options available: group reading, individual reading, multimedia documents, and topic interests. The task in the first two options is similar and text-based. The only difference is that in the first case the text is chosen by the teacher and is available to all the students in the class, whereas in the second option the student is free to choose a text from a given set. The topic interests tab allows the student to set their preferred topics. This information is stored in the student database. The multimedia documents include the DTBs and the BN shows. The interface for the BN is shown in Figure A.1. The left part of the screen shows the news show divided into stories. Navigation can also be done by topic. The right part of the screen shows the automatically transcribed story, divided into speaker turns.

Figure A.1: Recognized BN interface in REAP.

The interface also has a teacher menu that allows the teacher to rate the quality of the document, estimate a readability level, select documents for group reading, discard documents, and insert new questions to be answered at the end of each reading session. Teacher reports can also be generated from the platform. After each reading session, the system updates the student model to dynamically adjust the contents to the student.

Document Filtering

Unlike the text documents for group and individual reading, the automatically transcribed texts are not filtered to satisfy all the REAP restrictions; that is, text documents have to respect some pedagogical constraints, such as text length (300 words) and a minimum number of words from the target list. Texts containing profanity are also filtered out, as well as documents containing just word lists. The text length filter was discarded when using BN, since the average number of words per story in the evaluation corpus was 280 words. The topic segmentation module avoids overly short stories by merging neighboring stories, in order to have enough evidence to assign the topic. There are cases where the stories may be very long, such as major disasters. In those cases, a warning to stop reading after 1000 words is given to the student. The profanity filter is not needed, since those words are excluded from the output of the ASR. The same applies to the word list filter.
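Before moving to the shared filtering stages, a minimal sketch of the adapted rules for BN stories follows. Only the minimum target-word requirement and the 1000-word warning come from the description above; the target-list threshold and the function names are hypothetical.

```python
# Hypothetical eligibility check for a BN story, assuming a story is a
# list of word tokens and `target_list` holds the pedagogical word list.

MIN_TARGET_WORDS = 3     # hypothetical threshold on target-list words
WARN_AFTER_WORDS = 1000  # warning point for very long stories

def check_story(tokens, target_list):
    n_target = sum(1 for t in tokens if t.lower() in target_list)
    return {
        "eligible": n_target >= MIN_TARGET_WORDS,
        "warn_long": len(tokens) > WARN_AFTER_WORDS,
        "target_words": n_target,
    }
```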
The last filtering stages (topic and readability) are common to both text and multimedia documents. The only difference is that words with a low confidence level in the multimedia documents are excluded from this filtering stage. The readability classifier aims to assign each document a grade according to the Portuguese school levels. The readability classifier used for the texts in the reading tasks was trained with textbooks for native Portuguese students, from grades 5-12. The initial version is based on lexical features, such as statistics of word unigrams. A Support Vector Machine was trained using the SMO tool implemented in WEKA [149]. The multimedia documents were also classified with this readability classifier, due to the difficulty of obtaining multimedia documents classified according to the school levels. This classifier produced interesting results on the held-out test set of textbooks [57]: root mean square error (0.448), Pearson's correlation (0.994), and accuracy within 1 grade level (1.000). Interestingly, the BN stories were classified between levels 7 and 11, with an average of 8. However, some words in the BN stories could not be used in the classification, since they are not covered by the vocabulary from the textbooks, which, despite being very large (95k words), is still very different from the dynamic vocabulary used in the recognition of BN.

Document display

The text placed on the right side of Figure A.1 is created from the XML file produced by the BN processing chain in Figure A.2. Words that are underlined and colored blue belong to the list of words that the student is supposed to learn. That list, similarly to those of other languages such as English and French, was designed for language learning. It is designated the Portuguese Academic Word List (P-AWL [7]) and is composed of 2,019 different lemmas, together with their most common inflections, each one tagged with the corresponding part-of-speech (POS). The resulting list has a total of 33,284 words.

Figure A.2: Broadcast News Pipeline.

A dictionary is also available by clicking on any word. A pop-up window is shown with the definition obtained from an electronic dictionary (from Porto Editora) and the corresponding POS tag. This window also includes the synthesizer button whose implementation was previously described (Section A.1.1). The student can also play highlighted word sequences in the same way that was implemented for DTBs (Section A.1.2). To alert the student that the displayed text may have recognition errors, words with a confidence score below 82% are shown in red, as illustrated in the sketch below. Recently, new features have been added to the computation of the confidence score [103]. The comparison between two different ASR systems (word by word) and a feature based on the number of hits on the hypothesized bigrams in a web search engine, when combined with the decoder-based features, led to remarkable improvements in the detection of recognition errors (13.87% for the baseline, 12.16% when adding the new features to the baseline).
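The sketch below illustrates the display rules just described, assuming each recognized word carries a confidence score and that P-AWL membership can be tested against a set of lemmas; the HTML markup and the tuple layout are simplifications of the actual interface.

```python
# Minimal sketch of the transcript display rules: P-AWL words are
# underlined and blue, low-confidence words are shown in red.

CONF_THRESHOLD = 0.82  # words below this confidence are flagged

def render_word(word, confidence, lemma, pawl_lemmas):
    if confidence < CONF_THRESHOLD:
        return f'<span style="color:red">{word}</span>'
    if lemma in pawl_lemmas:
        return f'<u><span style="color:blue">{word}</span></u>'
    return word

def render_story(words, pawl_lemmas):
    # `words` is assumed to be a list of (surface, confidence, lemma)
    # tuples extracted from the XML produced by the BN pipeline.
    return " ".join(render_word(w, c, l, pawl_lemmas) for w, c, l in words)
```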
Vocabulary exercises

Every action of the student is tracked during the reading session, including accesses to the dictionary, in order to keep updating his/her progress. Independently of the type of document, each reading session is followed by a series of cloze, or fill-in-the-blank, questions about the underlined blue-colored words. These vocabulary questions are manually selected from a set of 6000 questions whose distractors were automatically generated. The methods for distractor generation for REAP.PT are compared in [38]. Later versions of the REAP.PT platform included a number of grammar-based exercises and games, as well as exercises with the specific focus of enhancing oral comprehension skills.

A.2 Summary

This chapter described a set of tools that were added to REAP.PT to give students the possibility of practicing their oral comprehension skills, namely a speech synthesizer and multimedia content such as DTBs and BN stories. Two different types of BN stories were placed in the platform: manually transcribed and automatically transcribed. The integration of the automatically transcribed stories was described in detail. The first part of the process includes the generation of the news content by the BN processing pipeline. Then the REAP.PT processing tools prepare the document (readability classification and filtering) to be included in the multimedia materials section. So far, there has been no objective evaluation of the use of multimedia materials. Informal tests held with more than 10 non-native students revealed that they found it a motivating feature in their learning process. They also suggested that each word could be highlighted as it was spoken (karaoke style), which was later implemented.

The current REAP.PT platform builds on the initial work with multimedia materials developed in this thesis, but includes many other innovations, namely serious games. One of them places the students in a 3D environment, to try to make them learn the locative prepositions used to describe spatial relations between objects [125]. Another is a competitive translation game, where the student plays against an automated agent that uses a statistical machine translation module to guess the hidden words in a sentence in the target language [82]. Serious games were also developed for listening comprehension, using automatically selected excerpts of BN stories. One of them asks the student to order the words as they appear in the sentence [104]. The other asks the students to select, from a list of words, those that they heard in the excerpt [37]. These games were highly appreciated by students. One of the recently designed games is a version of the locative prepositions game including speech input, but this version has not yet been tested with students. In fact, since the student has to complete tasks to successfully finish the game, the scenario could be used to design dialogs in which the student could entrain to the game and improve her/his oral skills. This would require a more complex type of dialog management.

B Nativeness Detection

B.1 Introduction

This annex will describe some experiments regarding automatic nativeness detection. Generally, non-native speech may deviate from native speech in terms of morphology, syntax, and the lexicon, which is naturally more limited than for adult native speakers. Concerning morphology, the main problems faced by non-native speakers involve producing correct forms of verbs (namely irregular ones), nouns, adjectives, articles, etc., especially when the morphological distinction hinges on subtle phonetic distinctions. The main difficulties in terms of syntax concern the structure of sentences, the ordering of constituents, and their omission or insertion.
In addition, non-native speech typically includes more disfluencies than native speech, and is characterized by a lower speech rate [138]. None of the above difficulties is very prominent in highly proficient speakers. However, their speech frequently retains a foreign accent, denoting interference from the first language (L1), in terms of both prosody and segmental aspects. The segmental deviations in non-native speech can be quite mild, when speakers use phonemes from their L1 without compromising phonemic distinctions, but they may also have a strong impact on intelligibility whenever phonemic distinctions are blurred, a frequent occurrence when the phonetic features of the L2 are not distinctive in the L1. The prosodic deviations may involve errors in word accent position [62]. Rhythmic traits, however, are generally regarded as the main source of the low intelligibility of L2 speakers. Although some authors attribute these difficulties to the differences in rhythm between L1 and L2, namely when one of the languages is stress-timed and the other is syllable-timed [112, 52], others claim more complex distinctions (for instance, syllables not carrying the word accent, which are weak in stress-timed languages, are produced stronger in syllable-timed languages).

The literature on automatic nativeness detection is still scarce, but it is tightly connected to accent detection in general [5, 70, 10]. Some of the approaches use acoustic features, whereas others are based on prosodic cues. Kumpf and King, for instance, used HMM phone models to identify Australian English speech [73]. Bouselmi et al. tried another approach, using discriminative phoneme sequences to detect a speaker's origin through their accent when speaking English [23]. Piat et al. [105] modeled variations in duration, energy, and fundamental frequency at the syllable level for each word in the corpus. The accent identification was done using one HMM per feature. They compared this approach with MFCCs; a combination of the best prosodic parameters with MFCC coefficients achieved the best results. Features used in pronunciation scoring can also be useful for nativeness detection. Tepperman and Narayanan [132] tried to improve pronunciation scoring using information about intonation and prosodic structure. They used techniques often applied to tonal languages (f0 processing, theoretical grammars for the recognition of intonation events, and contextual effects). Using HMMs to represent the intonation units, the correlation between the predicted score and the annotated pronunciation score increased significantly. Hönig et al. used a set of standard prosodic features to train a k-class classifier for word accent position [61]. The feature set was later enriched with the speech rate, the duration of consecutive consonantal or vocalic segments, and the percentage of vocalic intervals. A multiple linear regression with cross-validation was trained to select the relevant feature set. The goal was to find the features that best described intelligibility, accent, melody, and rhythm and, in addition, to find the feature set that could classify at a supra-segmental level [62]. Prosodic features were also used in language identification. Lin and Wang approximated each pitch contour segment with a set of Legendre polynomials, using the polynomial coefficients as the feature vector representing the pitch contour [32]. Several Gaussian mixture models were trained with the features generated by the Legendre polynomials, together with other features such as the duration of the segment or the energy difference within the segment where the pitch contour was extracted.
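To illustrate this contour representation, the sketch below fits a Legendre polynomial to a pitch contour with NumPy. The contour values are synthetic placeholders; the order-5 fit matches the representation reused for the prosodic classifier in Section B.1.2.2.

```python
import numpy as np
from numpy.polynomial import legendre

# Minimal sketch: represent a voiced pitch contour by the coefficients
# of a low-order Legendre polynomial fit. The contour below is synthetic.
f0 = np.array([120., 128., 135., 140., 138., 131., 124., 119.])  # Hz
t = np.linspace(-1, 1, len(f0))  # Legendre polynomials live on [-1, 1]

order = 5                                       # 6 coefficients
coefs = legendre.legfit(t, np.log(f0), order)   # fit on the log-pitch

# `coefs` can be stacked with other cues (e.g. segment duration) to form
# the feature vector representing this contour segment.
reconstructed = legendre.legval(t, coefs)
```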
Here, a different approach to the problem will be adopted, using state-of-the-art language identification techniques to distinguish between the characteristics of native and non-native speakers. The goal is to be able to detect prosodic and segmental deviations in highly proficient speakers, like most speakers in TED Talks. The corpus will be described in the next section. The main body of this appendix is devoted to the description of our approach, which uses prosodic contour features to complement the acoustic ones. Although the study was conducted for English, some of the characteristics referenced are common to every non-native speaker, whatever the original language.

B.1.1 Corpus

The collected corpus comprises 139 TED talks. The subset of native English speakers includes 71 talks, by US speakers. The remaining 68 talks are from non-native speakers, i.e., speakers from any other country that does not have English as its official language. To simplify the problem at this first stage, speakers from other English-speaking countries, such as the UK and Australia, were not involved in the experiments. The fact that the non-native speakers may have lived in an English-speaking country for some time was ignored in this classification, place of birth being the major factor. The first step in processing this corpus was to perform audio segmentation, separating speech segments from non-speech segments. Next, the speech segments that were at least 1 second long were divided into train and test subsets, making sure that each talk was used in only one of the subsets. More details on the corpus can be found in Table B.1, which shows, for the training and test sets, the total duration, the total number of segments, and the percentage of segments in each of the duration ranges considered.

                   Train                        Test
              Native  Non-Native          Native  Non-Native
Duration (min)  683.4      616.8           182.9       218.1
Segments (#)     1276       1299             400         548
<3 s (%)          2.1        4.8             4.5         2.2
3 s-10 s (%)      4.5        7.1             6.8         4.4
10 s-30 s (%)    60.0       64.9            62.2        74.1
>30 s (%)        33.3       23.3            26.5        19.3

Table B.1: Details of the training and test sets for Native and Non-Native data.

B.1.2 Nativeness Classification

This section describes the acoustic and prosodic approaches to the nativeness classification problem, followed by the procedures used to combine both systems.

B.1.2.1 Acoustic Classifier Development

A method generally known as Gaussian supervectors (GSV) was first proposed for speaker recognition in [33]. In the last few years, however, this approach has also been successfully applied to language recognition tasks. GSV-based approaches map each speech utterance to a high-dimensional vector space. Support Vector Machines (SVMs) are generally used for the classification of test vectors within this space. The mapping to the high-dimensional space is achieved by stacking all the parameters (usually the means) of an adapted GMM in a single supervector, by means of a Bayesian adaptation of a universal background model (GMM-UBM) to the characteristics of a given speech segment. In this work, we apply the GSV approach to the nativeness detection problem.
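A minimal sketch of the supervector extraction follows, using scikit-learn as a stand-in for the toolkit actually used in the experiments. It implements the single-iteration, mean-only MAP adaptation with relevance factor 16 described in the next subsections; the toy UBM and random frames are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gsv(ubm: GaussianMixture, frames: np.ndarray, relevance: float = 16.0):
    """frames: (T, D) feature matrix for one segment -> (K*D,) supervector."""
    gamma = ubm.predict_proba(frames)            # (T, K) responsibilities
    n = gamma.sum(axis=0)                        # soft counts per mixture
    # First-order statistics normalized by the counts (E_i in MAP formulas).
    ex = (gamma.T @ frames) / np.maximum(n, 1e-10)[:, None]
    alpha = n / (n + relevance)                  # data-dependent weight
    means = alpha[:, None] * ex + (1 - alpha)[:, None] * ubm.means_
    return means.ravel()                         # stacked adapted means

# Example with a toy UBM fitted on random data (stand-in for the
# PLP-RASTA SDC features described below).
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(500, 8)))
sv = gsv(ubm, rng.normal(size=(120, 8)))  # (4*8,) supervector
```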
B.1.2.1.1 Feature Extraction

The extracted features are shifted delta cepstra (SDC) [136] of Perceptual Linear Prediction features with log-RelAtive SpecTrAl speech processing (PLP-RASTA). First, 7 static PLP-RASTA features are obtained, and mean and variance normalization is applied on a per-segment basis. Then, SDC features (with a 7-1-3-7 configuration) are computed, resulting in a feature vector of 56 components. Finally, low-energy frames (detected with the alignment generated by a simple bi-Gaussian model of the log-energy distribution computed for each speech segment) are removed.

B.1.2.1.2 Supervector Extraction

In order to obtain the speech segment supervectors, a Universal Background Model (UBM) must first be trained. The UBM is a single GMM that represents the distribution of speaker-independent features. This is done in order to deal with the variability that characterizes accent recognition. All the available training data (native and non-native segments together) were used to train GMM-UBMs with 64 and 256 mixtures, resulting in two different GSV-based systems. A single iteration of Maximum a Posteriori (MAP) adaptation with relevance factor 16 is performed for each speech segment to obtain the high-dimensional vectors.

B.1.2.1.3 Nativeness modeling and scoring

Our classification problem is binary; therefore, only one classifier needed to be trained. The linear SVM kernel of [33], based on the Kullback-Leibler (KL) divergence, is used to train the target language models with the LIBLINEAR library [42]. The native supervector set provides the positive examples, whereas the non-native set is used as background or negative samples. During testing, the supervectors of the test speech utterances are fed to the binary classifier to generate a nativeness score.

B.1.2.2 Prosodic Classifier Development

In this work, we apply prosodic contour features to nativeness classification, in order to complement the acoustic system described above.

B.1.2.2.1 Prosodic contour extraction

The Snack toolkit [72] is used to extract the log-pitch and the log-energy of the voiced speech regions of every utterance. Log-energy is normalized on an utterance basis. The prosodic contours are segmented into regions by splitting the voiced regions whenever the energy signal reaches a local minimum (the minimum length of the regions is 60 ms). We use a 3rd-order derivative function of the log-energy to find local minima. For each region, the log-energy and log-pitch contours are approximated with a Legendre polynomial of order 5, resulting in 6 coefficients for each contour. The final feature vector is formed by the two sets of contour coefficients plus the length of the syllable-like region, for a total of 13 elements.

B.1.2.2.2 Nativeness modeling and scoring

Two different approaches were followed to train the nativeness detector that uses prosodic features. First, GMM models were trained for native and non-native speech, following the conventional GMM-UBM approach that is also applied in [32]. The UBM was estimated using all the training data, and the native and non-native GMMs were obtained by MAP adaptation of the UBM with all the native and non-native training data, respectively. In this case, only a 64-mixture UBM was trained, due to the reduced amount of training vectors that results from the fact that each feature vector now corresponds to a syllable-like segment of variable duration. The MAP adaptation was done with a single iteration step.
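A minimal sketch of this GMM-UBM scoring scheme follows, again with scikit-learn as a stand-in for the actual toolkit. It assumes the native and non-native models were obtained by MAP-adapting the UBM (the mean adaptation sketched in Section B.1.2.1 can produce their means) and implements the log-likelihood-ratio scoring described at the end of this subsection.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical GMM-UBM nativeness scoring: `native_gmm` and
# `nonnative_gmm` are assumed to be MAP-adapted copies of the UBM.
# Higher scores favor the native hypothesis.

def llr_nativeness(native_gmm: GaussianMixture,
                   nonnative_gmm: GaussianMixture,
                   frames: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio for one test utterance."""
    # score_samples returns per-frame log-densities under each model.
    ll_nat = native_gmm.score_samples(frames).mean()
    ll_non = nonnative_gmm.score_samples(frames).mean()
    return ll_nat - ll_non
```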
A second modeling approach was also developed, based on the Gaussian supervector technique described in Section B.1.2.1.2, but now with the prosodic features. The MAP adaptation step for this approach uses the same UBM model as the previous approach. During testing, log-likelihood ratios of the native and non-native GMMs are computed for each test speech utterance in the case of the GMM-UBM based system. In the GSV case, the test supervectors are fed to the binary classifier to generate a nativeness score.

B.1.2.3 Calibration

Each individual system was calibrated using the s-cal tool available with the Focal toolkit [31]. Additionally, Linear Logistic Regression is applied to the calibrated scores; it is also used for the fusion of the acoustic and prosodic systems whenever necessary. A cross-validation strategy is carried out with the test data set to simultaneously estimate the calibration and fusion parameters and evaluate the system.

B.1.3 Results and discussion

Table B.2 presents experimental results aimed at comparing the performance of the acoustic-based Gaussian supervector system with different numbers of mixtures: GSV-acoustic 64 and GSV-acoustic 256. Both Accuracy (Acc) and Equal Error Rate (EER) scores are computed for Native and Non-Native segments separately and for the overall test data. Results are also broken down by segment duration.

             Native             Non-Native        Total (GSV-acoustic 64)  Total (GSV-acoustic 256)
          GSV-64  GSV-256     GSV-64  GSV-256       Acc (%)   EER (%)        Acc (%)   EER (%)
          Acc (%) Acc (%)     Acc (%) Acc (%)
<3 s        35.7    35.7        80.0    80.0          47.4      24.3           47.4      30.0
3 s-10 s    66.7    48.2        66.7    81.0          66.7      28.3           62.5      30.7
10 s-30 s   82.7    88.4        87.0    91.4          85.3      15.1           90.2      10.1
>30 s       89.6    91.5        85.9    82.1          87.7      10.4           86.8       7.6
Total       81.8    84.6        85.9    89.0          84.2      16.1           87.2      13.1

Table B.2: Detailed results for GSV-acoustic 64 and GSV-acoustic 256.

These results show a generally better performance of the GSV-acoustic 256 system relative to the GSV-acoustic 64 classifier. In fact, increasing the supervector dimensionality allows considerable nativeness detection improvements. As expected, both classifiers perform better on longer segments, which is a well-known effect in language recognition and other similar problems. Notice, however, that duration-independent calibration parameters were estimated, and the performance loss on shorter segments could have been partially alleviated if duration-dependent calibrations had been estimated as well. This effect can be particularly important given that most of the test data segments have durations between 10 and 30 seconds. Finally, it is worth noticing that the detectors seem to be biased towards the non-native class for the shorter segment durations. This fact may also be related to a mis-calibration problem. For the longer segment durations, the Native and Non-Native accuracies are quite balanced.

Table B.3 shows the Accuracy and EER results obtained for the two prosodic systems: the Gaussian supervector classifier (GSV-prosodic) and the GMM classifier (GMM-prosodic). In addition to the individual performance (column "alone"), the detection results of the prosodic-based systems fused with the best acoustic system (GSV-acoustic 256) are also presented.

                     alone               fusion
               Acc (%)   EER (%)    Acc (%)   EER (%)
GSV-prosodic     68.9      40.7       89.1      10.6
GMM-prosodic     71.1      30.1       89.4      10.6

Table B.3: Results obtained using prosodic features (Accuracy and EER), alone and fused with GSV-acoustic 256.
Results from Table B.3 show that the performance of both prosodic systems alone is far below that of the acoustic systems, as is also clear from the DET curves in Figure B.1. However, when fused with GSV-acoustic 256, the combined system considerably improves on the individual acoustic system. Contrary to initial expectations, the GMM-prosodic system performs better than the GSV-prosodic one. This may be related to the reduced amount of prosodic feature vectors, which may influence the way MAP adaptation should be carried out for supervector extraction: number of iterations, relevance factor, etc. This possibility, together with the use of mean and variance normalized prosodic features, has been investigated, but no conclusive results have been obtained so far. In any case, only slight performance differences between the two prosodic systems are observed when they are fused with the acoustic classifier. Finally, the best performing nativeness detector is the one based on the fusion of the GSV-acoustic 256 and GMM-prosodic sub-systems, as the DET curves in Figure B.1 show. The incorporation of the prosodic features permits an absolute performance improvement of 2.2% and 2.5% in terms of Accuracy and EER, respectively, relative to the original detector based only on acoustic features.

Figure B.1: DET curves of the GSV-acoustic 256, GMM-prosodic, and fusion of both systems.

B.1.4 Human Benchmark

The above results are below those of state-of-the-art language recognition, where the EER is around 3% [34]. The automatic detection of non-native accents in proficient speakers, however, seems to be a far more difficult task. In order to find out whether this is also a difficult task for humans, we compared the above results with human performance on the same task. For that purpose, 3 native Portuguese speakers fluent in English were asked to classify 45 segments. A set of 50 segments was randomly chosen from our test corpus, but 5 of them were discarded because they contained semantic cues (e.g. one of the speakers actually said "I'm French"). The comparison of the original segment classification with the classifications of each of the 3 subjects and the ones produced by the automatic classifiers revealed discrepancies in only 10 files. Most differences occur in segments where the highly proficient speakers have a barely discernible non-native accent that also eludes some listeners. On this subset, the accuracy of the subjects averages 92.22%, whereas that of the best fused system is 89.80%. We obtained a Cohen's kappa of 0.88 for the inter-rater agreement of the 3 subjects. The Cohen's kappa between the 3 subjects and the best fused classifier varied from 0.80 to 0.87. These results show that, in this task, the automatic performance is very close to the human performance. The introduction of new prosodic features, together with an enlarged data set, could help improve the system.
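For reference, inter-rater agreement of the kind reported above can be computed as in the sketch below; the label vectors are illustrative placeholders, not the actual 45 human and system judgments.

```python
from sklearn.metrics import cohen_kappa_score

# Minimal sketch of the agreement computation, assuming binary labels
# (1 = native, 0 = non-native) for the same set of segments.
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical judgments
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # hypothetical judgments

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```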
B.2 Summary

This annex reported experiments with an automatic nativeness detector for English. The first attempt used only acoustic features, with Gaussian supervectors, to train a classifier based on support vector machines. The final result was a 13.11% equal error rate. The combination with prosodic features reduced the equal error rate to 10.58%. Prosodic features on their own were not very discriminative, possibly due to the reduced number of frames per file used in their computation. The comparison between the performance achieved and the human benchmark showed that the current method's performance is very close to the human performance. The fact that the speakers in the database are very fluent made the task even more difficult. Thus, the results are lower than what was found in language recognition with a similar method. Unfortunately, since there are very few TED talks in which the speakers do not speak English, it was not possible to extend this work to other languages.

C Resources used in the experimental sets

C.1 Scenarios used in Noctívago tests

Expected input: from the airport to the Oriente train station, now (agora), and also get the information for the following bus.

Figure C.1: Scenario 1.

Expected input: from Largo de Camões to Cais do Sodré, Saturday (sábado) at 1:15 am.

Figure C.2: Scenario 2.

Expected input: from Cais do Sodré to Alameda, at 3:00 am, and also get the information for the previous bus.

Figure C.3: Scenario 3.

C.2 Questionnaires

C.2.1 Questionnaires used in Section 4.3

Both questionnaires requested the user's e-mail, the type of device used to call (landline, cellphone, or VoIP), whether the information was correctly received in the three scenarios, the country of origin, and a comment box for any suggestions the users had to improve the system performance.

Figure C.4: Questionnaire used in the first week.

During the second week, two new questions were added: whether the users felt better understood, and whether they noticed that the system had evolved. If they answered positively to this last question, they had another comment box to say where they noticed the evolution.

Figure C.5: Questionnaire used in the second week.

C.2.2 Questionnaire used in Section 6.1.2.2

The questionnaire used was based on the PARADISE framework for the evaluation of spoken dialog systems [141]. It includes the statements:

• the system was easy to understand
• the system understood what you said
• you found the information you were looking for
• you found the pace of the interaction appropriate
• you found the system slow in giving the answer
• the system behaved as expected
• you would use this type of system to get transport schedule information
• the system became easier to understand throughout the interaction

This last statement was added to the PARADISE questions to capture whether entrainment was helping the system to better understand the user. Subjects answered on a Likert scale, where 1 means that they totally disagree and 5 that they totally agree with the statement. Since they were asked to make three consecutive requests to the system, they also classified the information they got on a 5-point scale from "never correct" to "always correct". The questionnaire also had a checkbox to schedule reminders to test the different configurations, a field for the country of origin, and a comment box.
Figure C.6: Questionnaire used in the Rule-Based tests.

C.2.3 Questionnaire used in Section 6.1.2.2

The questionnaire in Figure C.7 is similar to the one presented in the previous section. We decided to remove the statements about how the system was perceived, the pace of the interaction, and the quickness of the answer. On the other hand, we increased the focus on entrainment-related questions, explicitly asking the users whether they perceived entrainment effects, without ever mentioning the word entrainment. The statements introduced were:

• the system was able to propose alternatives when it was not able to understand me
• some words were appropriately modified by the system

We also reduced the number of possibilities in the Likert scale to 4, to avoid middle-column biasing.

Figure C.7: Questionnaire used in the Data-Driven tests.

Bibliography

[1] A Method for Evaluating and Comparing User Simulations: The Cramer-von Mises Divergence, 2007. IEEE. [2] Rui Amaral and Isabel Trancoso. Topic segmentation and indexation in a media watch system. In INTERSPEECH, pages 2183–2186, 2008. [3] Jan Anguita, Stephane Peillon, Javier Hernando, and Alexandre Bramoulle. Word confusability prediction in automatic speech recognition. In INTERSPEECH 2004 ICSLP, 8th International Conference on Spoken Language Processing. ISCA, 2004. [4] Apple. Siri, 2011. URL http://www.apple.com/ios/siri/. [5] Levent M. Arslan and John H. L. Hansen. Language accent classification in American English. Speech Commun., 18:353–367, June 1996. [6] M. F. Bacelar do Nascimento, M. L. Garcia Marques, and M. L. Segura da Cruz. Português Fundamental, Métodos e Documentos, tomo 1, Inquérito de Frequência. Lisboa, INIC, CLUL, 1987. [7] Jorge Baptista, Neuza Costa, Joaquim Guerra, Marcos Zampieri, Maria Cabral, and Nuno J. Mamede. P-AWL: Academic word list for Portuguese. In PROPOR, pages 120–123, 2010. [8] Regina Barzilay, Michael Collins, Julia Hirschberg, and Steve Whittaker. The rules behind roles: Identifying speaker role in radio broadcasts. In Proceedings of AAAI, 2000. [9] F. Batista, D. Caseiro, N. Mamede, and I. Trancoso. Recovering capitalization and punctuation marks for automatic speech recognition: Case study for portuguese broadcast news. Speech Commun., 50(10):847–862, October 2008. [10] Fadi Biadsy, Julia Hirschberg, and Nizar Habash. Spoken arabic dialect identification using phonotactic modeling. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, Semitic ’09, pages 53–61, Stroudsburg, PA, USA, 2009. [11] A. Black, P. Taylor, and R. Caley. The Festival synthesis system, December 2002. [12] Alan W. Black and Kevin A. Lenzo. Limited domain synthesis. In ICSLP, pages 411–414, 2000. [13] Alan W. Black and Kevin A. Lenzo. Flite: A small fast run-time synthesis engine. In 4th ISCA Tutorial and Research Workshop on Speech Synthesis, pages 20–4, Perthshire, Scotland, 2001. [14] Alan W. Black, Susanne Burger, Brian Langner, Gabriel Parent, and Maxine Eskenazi. Spoken dialog challenge 2010. In SLT, pages 448–453, 2010. [15] Mats Blomberg, Rolf Carlson, Kjell Elenius, Björn Granström, Joakim Gustafson, Sheri Hunnicutt, Roger Lindell, and Lennart Neovius. An experimental dialogue system: Waxholm. In EUROSPEECH, 1993. [16] Dan Bohus. Error awareness and recovery in conversational spoken language interfaces.
PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2007. AAI3277260. [17] Dan Bohus and Alex Rudnicky. Integrating multiple knowledge sources for utterancelevel confidence annotation in the cmu communicator spoken dialog system. Technical report, Roots in the Town. In 2nd International Workshop on Community Networking. 1995. Princeton (NJ): IEEE Comm. Soc, 2002. [18] Dan Bohus and Alex Rudnicky. Sorry, I didn’t catch that! an investigation of nonunderstanding errors and recovery strategies. In SIGDial, 2005. [19] Dan Bohus and Alexander Rudnicky. LARRI: A Language-Based Maintenance and Repair Assistant, volume 28 of Text speech and language technology, pages 203–218. Springer, 2005. [20] Dan Bohus and Alexander I. Rudnicky. Implicitly-supervised learning in spoken language interfaces: an application to the confidence annotation problem. In SIGDIAL, 2007. [21] Dan Bohus and Alexander I. Rudnicky. The Ravenclaw dialog management framework: Architecture and systems. Comput. Speech Lang., 23(3):332–361, July 2009. [22] Dan Bohus, Antoine Raux, Thomas K. Harris, Maxine Eskenazi, and Alexander I. Rudnicky. Olympus: an open-source framework fro conversational spoken language interface research. In proceedings of HLT-NAACL 2007 workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technology, 2007. [23] Ghazi Bouselmi, Dominique Fohr, Irina Illina, and Jean-Paul Haton. Discriminative phoneme sequences extraction for non-native speaker’s origin classification. Computing Research Repository, 2007. [24] H. Branigan, J. Pickering, J. Pearson, and J. McLean. Linguistic alignment between people and computers. Journal of Pragmatics, 42(9):2355–2368, September 2010. [25] Holly P. Branigan, Martin J. Pickering, Jamie Pearson, Janet F. McLean, and Clifford Nass. Syntactic alignment between computers and people: the role of belief about mental states. In In Proceedings of the Twenty-fifth Annual Conference of the Cognitive Science Society, 2003. [26] Susan E. Brennan. Conversation with and through computers. User Modeling and User-Adapted Interaction, 1(1):67–86, March 1991. ISSN 0924-1868. [27] Susan E. Brennan. Lexical entrainment in spontaneous dialog. In International Symposium on Spoken Dialog, pages 41–44, 1996. BIBLIOGRAPHY 151 [28] Susan E. Brennan. The grounding problem in conversations with and through computers. In SOCIAL AND COGNITIVE PSYCHOLOGICAL APPROACHES TO INTERPERSONAL COMMUNICATION, pages 201–225. Erlbaum, 1998. [29] Susan E. Brennan and Herbert H. Clark. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22: 1482–1493, 1996. [30] Susan E. Brennan, P. S. Ries, C. Rubman, and G. Lee. The vocabulary problem in spoken language systems. In S. Luperfoy, editor, Automated Spoken Dialog Systems. MIT Press, 1996, 1998. [31] N. Brummer. Focal: Tools for Fusion and Calibration of automatic speaker detection systems, 2011. URL https://sites.google.com/site/nikobrummer/focal. [32] Lin C-Y and Wang H-C. Language identification using pitch contour information. In Proc. ICASSP 2005, 2005. [33] W. M. Campbell, D. E. Sturim, and D. A. Reynolds. Support vector machines using gmm supervectors for speaker verification. Signal Processing Letters, IEEE, 13(5):308– 311, April 2006. [34] William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Elliot Singer, and Pedro A. Torres-Carrasquillo. Support vector machines for speaker and language recognition. Computer Speech & Language, 20(2-3):210–229, 2006. 
[35] D. Caseiro, I. Trancoso, L. Oliveira, and C. Viana. Grapheme-to-phone using finitestate transducers. In In: Proc. 2002 IEEE Workshop on Speech Synthesis. Volume, pages 1349–1360, 2002. [36] LLC Cepstral. Swift™: Small Footprint Text-to-Speech Synthesizer, 2005. URL http: //www.cepstral.com. [37] R. Correia, T. Pellegrini, M. Eskenazi, I. Trancoso, J. Baptista, and N. Mamede. Listening comprehension games for portuguese: exploring the best features. In Proc. SLaTE, 2011. [38] Rui Correia, Jorge Baptista, Nuno Mamede, Isabel Trancoso, and Maxine Eskenazi. Automatic generation of cloze question distractors. In Proc. Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, 2010. [39] Heriberto Cuayhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Hierarchical dialogue optimization using semi-Markov decision processes. In Proc. of INTERSPEECH, August 2007. [40] Hal Daumé III. Notes on CG and LM-BFGS optimization of logistic regression, 2004. URL http://hal3.name/megam/. [41] Matthias Denecke, Kohji Dohsaka, and Mikio Nakano. Learning dialogue policies using state aggregation in reinforcement learning. In INTERSPEECH. ISCA, 2004. 152 BIBLIOGRAPHY [42] R.-E Fan, K.-W. Chang, C.-J Hsieh, X.-R Wang, and C.-J Lin. LIBLINEAR - A Library for Large Linear Classification, 2008. URL http://www.csie.ntu.edu.tw/ ~cjlin/liblinear/. [43] Andrew Fandrianto and Maxine Eskenazi. Prosodic entrainment in an informationdriven dialog system. In Proceedings of Interspeech 2012, Portland, Oregon, USA, 2012. [44] George Ferguson and James F. Allen. Trips: An integrated intelligent problem-solving assistant. In Jack Mostow and Chuck Rich, editors, AAAI/IAAI, pages 567–572. AAAI Press / The MIT Press, 1998. [45] George Ferguson, James F. Allen, and Bradford W. Miller. Trains-95: Towards a mixed-initiative planning assistant. In AIPS, pages 70–77, 1996. [46] Pedro Fialho, Luı́sa Coheur, Sérgio Curto, Pedro Cláudio, Ângela Costa, Alberto Abad, Hugo Meinedo, and Isabel Trancoso. Meet Edgar, a tutoring agent at Monserrate. In ACL Demo Session, 2013. [47] Matthew Frampton and Oliver Lemon. Learning more effective dialogue strategies using limited dialogue move features. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, pages 185–192, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. [48] Simon Garrod and Anthony Anderson. Saying what you mean in dialogue: a study in conceptual and semantic co-ordination. Cognition, 27(2):181–218, 1987. [49] Milica Gasic, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. Pomdp-based dialogue manager adaptation to extended domains. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, August 2013. Association for Computational Linguistics. [50] Milica Gašić and Steve Young. Effective handling of dialogue state in the hidden information state POMDP-based dialogue manager. ACM Trans. Speech Lang. Process., 7 (3):4:1–4:28, June 2011. [51] David Goddeau and Joelle Pineau. Fast reinforcement learning of dialog strategies. In ICASSP, 2000. [52] E. Grabe and E. L. Low. Durational variability in speech and rhythm class hypothesis. Laboratory of Phonology, VII:515–546, 2002. [53] Alexander Gruenstein and Stephanie Seneff. Releasing a multimodal dialogue system into the wild: User support mechanisms. In SIGdial Workshop on Discourse and Dialogue, 2007. 
[54] Joakim Gustafson, Linda Bell, Jonas Beskow, Johan Boye, Rolf Carlson, Jens Edlund, Björn Granström, David House, and Mats Wirén. Adapt - a multimodal conversational dialogue system in an apartment domain. In INTERSPEECH, pages 134–137, 2000. BIBLIOGRAPHY 153 [55] Thomas K. Harris, Satanjeev Banerjee, and Er I. Rudnicky. Heterogeneous multirobot dialogues for search tasks. In In Proceedings of the AAAI Spring Symposium Intelligence, 2005. [56] Michael Heilman, Kevyn Collins-Thompson, Jamie Callan, and Maxine Eskenazi. Classroom success of an intelligent tutoring system for lexical practice and reading comprehension. In INTERSPEECH. ISCA, 2006. [57] Michael Heilman, Kevyn Collins-Thompson, and Maxine Eskenazi. An analysis of statistical models and features for reading difficulty prediction. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, EANL ’08, pages 71–79, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. [58] James Henderson, Oliver Lemon, and Kallirroi Georgila. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Comput. Linguist., 34(4): 487–511, December 2008. [59] Julia Hirschberg. Speaking more like you: Entrainment in conversational speech. In Proc. INTERSPEECH, 2011. [60] Julia Hirschberg, Diane J. Litman, and Marc Swerts. Prosodic and other cues to speech recognition failures. Speech Communication, 43(1-2):155–175, 2004. [61] Florian Hönig, Anton Batliner, Karl Weilhammer, and Elmar Nöth. Islands of failure: Employing word accent information for pronunciation quality assessment of English L2 learners. In Proceedings of the ISCA Special Interest Group on Speech and Language Technology in Education, 2009. [62] Florian Hönig, Anton Batliner, Karl Weilhammer, and Elmar Nöth. Automatic Assessment of Non-Native Prosody for English as L2. In Proceedings of Speech Prosody 2010, 2010. [63] Xuedong Huang, Fileno Alleva, Hsiao W. Hon, Mei Y. Hwang, and Ronald Rosenfeld. The SPHINX-II speech recognition system: an overview. Computer Speech and Language, 7(2):137–148, 1993. [64] David Huggins-daines, Mohit Kumar, Arthur Chan, Alan W Black, Mosur Ravishankar, and Alex I. Rudnicky. Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. In Proceedings of ICASSP 2006, 2006. [65] Srini Janarthanam, Oliver Lemon, Romain Laroche, and Ghislain Putois. Testing learned NLG and TTS policies with real users, in self-help and appointment scheduling systems. Technical report, University of Edinburgh, 2011. [66] Srinivasan Janarthanam and Oliver Lemon. Learning to adapt to unknown users: referring expression generation in spoken dialogue systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 69– 78, Morristown, NJ, USA, 2010. [67] Filip Jurčı́ček, Blaise Thomson, and Steve Young. Reinforcement learning for parameter estimation in statistical spoken dialogue systems. Comput. Speech Lang., 26(3):168–192, June 2012. ISSN 0885-2308. 154 BIBLIOGRAPHY [68] Simon Keizer, Milica Gasic, François Mairesse, Blaise Thomson, Kai Yu, and Steve Young. Modelling user behaviour in the HIS-POMDP dialogue manager. In SLT, pages 121–124, 2008. [69] Kyungduk Kim, Cheongjae Lee, Sangkeun Jung, and Gary Geunbae Lee. A framebased probabilistic framework for spoken dialog management using dialog examples. 
In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08, pages 120–127, Stroudsburg, PA, USA, 2008. [70] Oscar Koller, Alberto Abad, Isabel Trancoso, and Céu Viana. Exploiting varietydependent phones in portuguese variety identification applied to broadcast news transcription. In INTERSPEECH, pages 749–752, 2010. [71] Stefan Kopp, L Gesellensetter, NC Kramer, and Ipke Wachsmuth. A conversational agent as museum guide - design and evaluation of a real-world application. volume 3661 of Intelligent Virtual Agents, Proceedings, pages 329–343. Springer, 2005. [72] Music KTH Royal Institute of Technology, Department of Speech and Hearing. Snack Toolkit v2.2.10, 2001. URL http://www.speech.kth.se/snack/. [73] Karsten Kumpf and Robin W. King. Automatic accent classification of foreign accented Australian English speech. In Proceedings of ICSLP, 1996. [74] Brian Langner and Alan Black. Mountain: A translation-based approach to natural language generation for dialog systems. In In Proc. of IWSDS 2009, Irsee, Germany, 2009. [75] Sungjin Lee and Maxine Eskenazi. POMDP-based Let’s Go system for spoken dialog challenge. In Proc. IEEE SLT Workshop, 2012. [76] Sungjin Lee and Maxine Eskenazi. Exploiting machine-transcribed dialog corpus to improve multiple dialog states tracking methods. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’12, pages 189–196, Stroudsburg, PA, USA, 2012. [77] Sungjin Lee and Maxine Eskenazi. An unsupervised approach to user simulation: toward self-improving dialog systems. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’12, pages 50–59, Stroudsburg, PA, USA, 2012. [78] Sungjin Lee and Maxine Eskenazi. Recipe for building robust spoken dialog state trackers: Dialog state tracking challenge system description. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, August 2013. Association for Computational Linguistics. [79] Oliver Lemon. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation. Comput. Speech Lang., 25(2): 210–221, April 2011. [80] Gina-Anne Levow. Learning to speak to a spoken language system: Vocabulary convergence in novice users. In Proc. SIGDIAl, 2003. BIBLIOGRAPHY 155 [81] Lihong Li, Jason D. Williams, and Suhrid Balakrishnan. Reinforcement learning for dialog management using least-squares policy iteration and fast feature selection. In INTERSPEECH, pages 2475–2478. ISCA, 2009. [82] W. Ling, I. Trancoso, and R. Prada. An agent based competitive translation game for second language learning. In Proc. SLaTE, 2011. [83] Diane J. Litman and Shimei Pan. Designing and evaluating an adaptive spoken dialogue system. User Modeling and User-Adapted Interaction, 12(2-3):111–137, March 2002. ISSN 0924-1868. [84] Diane J. Litman and Scott Silliman. ITSPOKE: an intelligent tutoring spoken dialogue system. In Demonstration Papers at HLT-NAACL 2004, HLT-NAACL–Demonstrations ’04, pages 5–8, Stroudsburg, PA, USA, 2004. [85] Diane J. Litman, Marilyn A. Walker, and Michael S. Kearns. Automatic detection of poor speech recognition at the dialogue level. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 309–316, Stroudsburg, PA, USA, 1999. Association for Computational Linguistics. [86] Lus Seabra Lopes and Antnio J. S. Teixeira. 
Human-robot interaction through spoken language dialogue. In IROS. IEEE, 2000. [87] P. Madeira, M. Mourao, and N. Mamede. STAR - a multi domain dialog manager. In ICEIS, 2003. [88] Ciro Martins, António J. S. Teixeira, and João Paulo Neto. Dynamic language modeling for a daily broadcast news transcription system. In ASRU, pages 165–170, 2007. [89] Luı́s Marujo, José Lopes, Nuno J. Mamede, Isabel Trancoso, Juan Pino, Maxine Eskenazi, Jorge Baptista, and Céu Viana. Porting REAP to european portuguese. In Procedings of SCA International Workshop on Speech and Language Technology in Education (SLaTE 2009), 2009. [90] Michael Matessa. Measures of adaptive communication. In Second Workshop on Empirical Evaluation of Adaptive Systems, 2003. [91] H. Melin, A. Sandell, and M. Ihse. CTT-bank: A speech controlled telephone banking system - an initial evaluation. Technical report, KTH, 2001. [92] Microsoft. Microsoft Speech Software Development Kit Developer’s Guide, 1996. [93] Samer Al Moubayed, Gabriel Skantze, and Jonas Beskow. The Furhat back-projected humanoid head-lip reading, gaze and multi-party interaction. I. J. Humanoid Robotics, 10(1), 2013. [94] Ani Nenkova, Agustn Gravano, and Julia Hirschberg. High frequency word entrainment in spoken dialogue. In In Proceedings of ACL-08: HLT. Association for Computational Linguistics, 2008. 156 BIBLIOGRAPHY [95] J. Neto, N. Mamede, R. Cassaca, and L. Oliveira. The development of a multi-purpose spoken dialogue system. In Proceedings of EUROSPEECH, 2003. [96] J. Neto, H. Meinedo, M. Viveiros, R. Cassaca, C. Martins, and D. Caseiro. Broadcast news subtitling system in portuguese. In ICASSP’08, pages 1561–1564, 2008. [97] K. G. Niederhoffer and J. W. Pennebaker. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337–360, 2002. [98] Alice H. Oh and Alexander I. Rudnicky. Stochastic language generation for spoken dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational systems - Volume 3, ANLP/NAACL-ConvSyst ’00, pages 27–32, Stroudsburg, PA, USA, 2000. [99] Gabriel Parent and Maxine Eskenazi. Lexical entrainment of real users in the Let’s Go spoken dialog system. In INTERSPEECH, pages 3018–3021, 2010. [100] Giorgio Parisi. Statistical Field Theory. Addison-Wesley, 1988. [101] Sérgio Paulo, Luı́s C. Oliveira, Carlos Mendes, Luı́s Figueira, Renato Cassaca, Céu Viana, and Helena Moniz. Dixi — a generic text-to-speech system for European Portuguese. In Proceedings of the 8th international conference on Computational Processing of the Portuguese Language, PROPOR ’08, pages 91–100, 2008. [102] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825–2830, 2011. [103] Thomas Pellegrini and Isabel Trancoso. Improving ASR error detection with nondecoder based features. In INTERSPEECH, pages 1950–1953, 2010. [104] Thomas Pellegrini, Rui Correia, Isabel Trancoso, Jorge Baptista, and Nuno J. Mamede. Automatic generation of listening comprehension learning material in european portuguese. In INTERSPEECH, pages 1629–1632, 2011. [105] Marina Piat, Dominique Fohr, and Irina Illina. Foreign accent identification based on prosodic parameters. In INTERSPEECH 2008, pages 759–762, 2008. [106] Martin J. Pickering and Simon Garrod. 
Towards a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2):169–190, 2004. [107] O. Pietquin and T. Dutoit. A probabilistic framework for dialog simulation and optimal strategy learning. Trans. Audio, Speech and Lang. Proc., 14(2):589–599, December 2006. [108] Olivier Pietquin. A Framework for Unsupervised Learning of Dialogue Strategies. PhD thesis, Faculté Polytechnique de Mons, TCTS Lab (Belgique), apr 2004. [109] Isabella Poggi, Catherine Pelachaud, Fiorella de Rosis, Valeria Carofiglio, and Berardina de Carolis. Greta, a believable embodied conversational agent. Multimodal Communication in Virtual Environments, pages 27–45, 2005. BIBLIOGRAPHY 157 [110] Anna Pompili, Alberto Abad, Isabel Trancoso, José Fonseca, Isabel P. Martins, Gabriela Leal, and Luisa Farrajota. An on-line system for remote treatment of aphasia. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, SLPAT ’11, pages 1–10, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [111] P. J. Price. Evaluation of spoken language systems: the ATIS domain. In Proceedings of the workshop on Speech and Natural Language, HLT ’90, pages 91–95, Stroudsburg, PA, USA, 1990. [112] F Ramus. Acoustic correlates of linguistic rhythm: Perspectives, volume pp, pages 115–120. 2002. [113] Antoine Raux. Flexible Turn-Taking for Spoken Dialog Systems. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2008. [114] Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. Let’s Go public! taking a spoken dialog system to the real world. In Proc. of Interspeech 2005, 2005. [115] David Reitter, Frank Keller, and Johanna D. Moore. Computational modelling of structural priming in dialogue. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, pages 121–124, 2006. [116] David B. Roe and Michael D. Riley. Prediction of word confusabilities for speech recognition. In Proc. ICSLP, 1994. [117] Nicholas Roy, Gregory Baltus, Dieter Fox, Francine Gemperle, Jennifer Goetz, Tad Hirsch, Dimitris Margaritis, Michael Montemerlo, Joelle Pineau, Jamieson Schulte, and Sebastian Thrun. Towards personal service robots for the elderly, May 2000. [118] Nicholas Roy, Joelle Pineau, and Sebastian Thrun. Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, pages 93–100, Stroudsburg, PA, USA, 2000. [119] Alexander I. Rudnicky, Eric H. Thayer, Paul C. Constantinides, Chris Tchou, R. Shern, Kevin A. Lenzo, W. Xu, and Alice Oh. Creating natural dialogs in the Carnegie Mellon Communicator system. In EUROSPEECH. ISCA, 1999. [120] Ruhi Sarikaya, Yuqing Gao, Michael Picheny, and Hakan Erdogan. Semantic confidence measurement for spoken dialog systems. IEEE Transactions on Speech and Audio Processing, (4):534–545. [121] Konrad Scheffler and Steve Young. Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning. In Proceedings of the second international conference on Human Language Technology Research, HLT ’02, pages 12–19, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. [122] Stephanie Seneff and Joseph Polifroni. A new restaurant guide conversational system: issues in rapid prototyping for specialized domains. In The 4th International Conference on Spoken Language Processing, Philadelphia, PA, USA, October 3-6, 1996. ISCA, 1996. 
158 BIBLIOGRAPHY [123] Stephanie Seneff and Joseph Polifroni. Dialogue management in the Mercury flight reservation system. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational systems - Volume 3, ANLP/NAACL-ConvSyst ’00, pages 11–16, Stroudsburg, PA, USA, 2000. [124] Stephanie Seneff, Ed Hurley, Raymond Lau, Christine Pao, Philipp Schmid, and Victor Zue. Galaxy-ii: A reference architecture for conversational system development. In in Proc. ICSLP, pages 931–934, 1998. [125] André Silva, Nuno J. Mamede, Alfredo Ferreira Jr., Jorge Baptista, and João Fernandes. Towards a serious game for portuguese learning. In SGDA, pages 83–94, 2011. [126] Gabriel Skantze, Jens Edlund, and Rolf Carlson. Talking with Higgins: Research challenges in a spoken dialogue system. In PIT, pages 193–196, 2006. [127] Amanda Stent, Rashmi Prasad, and Marilyn Walker. Trainable sentence planning for complex information presentation in spoken dialog systems. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL ’04, Stroudsburg, PA, USA, 2004. [128] Svetlana Stoyanchev and Amanda Stent. Lexical and syntactic priming and their impact in deployed spoken dialog systems. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09, pages 189–192, Stroudsburg, PA, USA, 2009. [129] William Swartout, David Traum, Ron Artstein, Dan Noren, Paul Debevec, Kerry Bronnenkant, Josh Williams, Anton Leuski, Shrikanth Narayanan, Diane Piepol, Chad Lane, Jacquelyn Morie, Priti Aggarwal, Matt Liewer, Jen-Yuan Chiang, Jillian Gerten, Selina Chu, and Kyle White. Ada and Grace: toward realistic and engaging virtual museum guides. In Proceedings of the 10th international conference on Intelligent virtual agents, IVA’10, pages 286–300, Berlin, Heidelberg, 2010. Springer-Verlag. [130] Marc Swerts, Diane J. Litman, and Julia Hirschberg. Corrections in spoken dialogue systems. In INTERSPEECH, pages 615–618, 2000. [131] Beng T. Tan, Yong Gu, and Trevor Thomas. Word confusability measures for vocabulary selection in speech recognition. In Proc. ASRU, 1999. [132] Joseph Tepperman and Shrikanth S. Narayanan. Better non-native intonation scores through prosodic theory. In INTERSPEECH, pages 1813–1816, 2008. [133] Blaise Thomson and Steve Young. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Comput. Speech Lang., 24(4):562–588, October 2010. [134] Blaise Thomson, Kai Yu, Simon Keizer, Milica Gasic, Filip Jurccek, Franois Mairesse, and Steve Young. Bayesian dialogue system for the Let’s Go spoken dialogue challenge. In SLT, pages 460–465. IEEE, 2010. BIBLIOGRAPHY 159 [135] T. Toda, A.W. Black, and K. Tokuda. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2222–2235, nov. 2007. [136] Pedro A. Torres-carrasquillo, Elliot Singer, Mary A. Kohler, and J. R. Deller. Approaches to language identification using gaussian mixture models and shifted delta cepstral features. In Proc. ICSLP 2002, pages 89–92, 2002. [137] Arthur R. Toth, Thomas K. Harris, James Sanders, Stefanie Shriver, and Roni Rosenfeld. Towards every-citizen’s speech interface: An application generator for speech interfaces to databases. In IN PROCEEDINGS OF ICSLP, pages 1497–1500, 2002. [138] Joost Van Doremalen, Catia Cucchiarini, and Helmer Strik. 
[139] Marilyn A. Walker, Jeanne Fromer, Giuseppe Di Fabbrizio, Craig Mestel, and Donald Hindle. What can I say? Evaluating a spoken language interface to email. In CHI, pages 582–589, 1998.
[140] Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. Evaluating spoken dialogue agents with PARADISE: Two case studies, 1998.
[141] Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman. Towards developing general models of usability with PARADISE. Natural Language Engineering, 6(3-4):363–377, 2000.
[142] Marilyn A. Walker, Owen Rambow, and Monica Rogati. SPoT: a trainable sentence planner. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL ’01, pages 1–8, Stroudsburg, PA, USA, 2001.
[143] A. Ward and D. Litman. Measuring convergence and priming in tutorial dialog. Technical Report TR-07-148, University of Pittsburgh, 2007.
[144] Arthur Ward and Diane Litman. Automatically measuring lexical and acoustic/prosodic convergence in tutorial dialog corpora. In Proceedings of SLaTE 2007, Farmington, Pennsylvania, USA, 2007.
[145] Wayne Ward and Sunil Issar. Recent improvements in the CMU spoken language understanding system. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 213–216, 1994.
[146] Jason D. Williams. The best of both worlds: unifying conventional dialog systems and POMDPs. In INTERSPEECH, pages 1173–1176, 2008.
[147] Jason D. Williams. Incremental partition recombination for efficient tracking of multiple dialog states. In ICASSP, pages 5382–5385, 2010.
[148] Jason D. Williams and Steve Young. Partially observable Markov decision processes for spoken dialog systems. Comput. Speech Lang., 21(2):393–422, 2007.
[149] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005. ISBN 0120884070.
[150] S. Young, J. Schatzmann, K. Weilhammer, and Hui Ye. The Hidden Information State approach to dialog management. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), volume 4, pages IV-149–IV-152, 2007.
[151] Steve Young. Probabilistic methods in spoken dialogue systems. Philosophical Transactions of the Royal Society (Series A), 358(1769):1389–1402, 2000.
[152] Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management. Comput. Speech Lang., 24(2):150–174, April 2010.
[153] Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013.
[154] Victor Zue, James Glass, David Goodine, Hong Leung, Michael Phillips, Joseph Polifroni, and Stephanie Seneff. The VOYAGER speech understanding system: a progress report. In Proceedings of the Workshop on Speech and Natural Language, HLT ’89, pages 51–59, Stroudsburg, PA, USA, 1989. ISBN 1-55860-112-0.
[155] Victor Zue, Stephanie Seneff, James R. Glass, Joseph Polifroni, Christine Pao, Timothy J. Hazen, and I. Lee Hetherington. JUPITER: a telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8(1):85–96, 2000.