Using subtitles to deal with Out-of-Domain interactions
Daniel Magarreiro, Luísa Coheur, Francisco S. Melo
INESC-ID / Instituto Superior Técnico, Lisbon, Portugal
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed
05-11-2015

Index
• Introduction
• Building the Subtle Corpus
• The Say Something Smart Engine
  – Corpora indexing and candidate extraction
  – Choosing the answer
• Evaluation
• Meet Filipe
• Conclusions and Future Work

Motivation
• Users often insist on confronting domain-specialized virtual assistants with Out-of-Domain (OOD) inputs.
• Considering that:
  – people become more engaged with these applications if OOD requests are addressed (Bickmore and Cassell, 2000; Patel et al., 2006)
  – system designers cannot successfully anticipate all possible OOD requests
• A possible approach:
  – explore the (semi-)automatic creation/enrichment of the knowledge base of virtual assistants/chatbots, taking advantage of the vast amount of dialogue available on the web.
• We will focus on movie subtitles (for now)

Motivation
• Movie subtitles
  – the web offers a vast number of repositories with comprehensive archives of subtitle files
    • this allows data redundancy
    • example:
      – How are you? / Fine
      – So, how are you? / Fine
      – How are you? / Fine
      – How are you? / I’m dying
  – subtitles are often available in multiple languages

Motivation
• Our approach
  – Build a corpus of interactions from the subtitles
    • the Subtle corpus
  – Test a set of techniques to select an adequate response (from Subtle) to a user request
    • deployed in the Say Something Smart engine
  – Evaluate the plausibility of the selected answers

Building the Subtle Corpus
• The Subtle corpus will be a set of interactions
  – like Edgar’s knowledge base
• Each interaction is a pair of turns (T, A):
  – T is the trigger
  – A is an answer (to the trigger)
• Example:
  – (T: So how old are you?, A: That’s none of your business)

Building the Subtle Corpus
• Problem:
  – Extracting interactions from subtitle files
  – Example:

      770
      01:01:05,537 --> 01:01:08,905
      And makes an offer so ridiculous,

      771
      01:01:09,082 --> 01:01:11,881
      the farmer is forced to say yes.

      772
      01:01:12,752 --> 01:01:15,494
      We gonna offer to buy Candyland?
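
A minimal Python sketch of how the slots in the example above could be read and the time between consecutive slots measured. The function names (`to_ms`, `parse_srt`, `gaps`) are hypothetical, and taking the gap as "start of the next slot minus end of the current one" is an assumption; the slides do not state how the time between slots is computed.

```python
import re

# Assumed slot layout: index line, "start --> end" line, text line(s), blank line.
TIME_RE = re.compile(r"(\d+):(\d+):(\d+),(\d+)")

def to_ms(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp to milliseconds."""
    h, m, s, ms = (int(x) for x in TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def parse_srt(text):
    """Return a list of (start_ms, end_ms, text) slots from subtitle content."""
    slots = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip blocks that do not follow the assumed layout
        start, end = lines[1].split("-->")
        slots.append((to_ms(start), to_ms(end),
                      " ".join(line.strip() for line in lines[2:])))
    return slots

def gaps(slots):
    """Gap in ms between the end of each slot and the start of the next (assumed definition)."""
    return [nxt[0] - cur[1] for cur, nxt in zip(slots, slots[1:])]

SAMPLE = """770
01:01:05,537 --> 01:01:08,905
And makes an offer so ridiculous,

771
01:01:09,082 --> 01:01:11,881
the farmer is forced to say yes.

772
01:01:12,752 --> 01:01:15,494
We gonna offer to buy Candyland?
"""

slots = parse_srt(SAMPLE)
slot_gaps = gaps(slots)
```

Under this definition, slots 770 and 771 are 177 ms apart and slots 771 and 772 are 871 ms apart; a configurable threshold on such gaps is one way to decide whether two slots belong to the same dialogue.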

Building the Subtle Corpus
• Starting point:
  – 2 GB of subtitles in Portuguese and English from OpenSubtitles
• Building Subtle:
  – Cleaning data
    • Example: [TIRES SCREECHING]
  – Finding real turns
    • Based on handcrafted rules (previous example)
    • The user can configure the maximum time allowed between two slots for them to be considered part of a dialogue

Building the Subtle Corpus
  SubId - 100000
  DialogId - 1
  Diff - 3715
  T - What a son!
  A - How about my mother?

  SubId - 100000
  DialogId - 2
  Diff - 80
  T - How about my mother?
  A - Tell me, did my mother fight you?

  SubId - 100000
  DialogId - 3
  Diff - 1678
  T - Tell me, did my mother fight you?
  A - Did she fight me?

Building the Subtle Corpus
                # Movies   # Movies Ok   # Interactions   # Average
  English       5,764      5,665         5,693,811        1,005
  Portuguese    3,701      3,598         3,322,683        923

The Say Something Smart engine
• The Say Something Smart engine (SSS) uses the Subtle corpus to get an answer to a given user request.
  User: Where do you live?
  SSS: Anywhere I feel like!
  Subtle:
    (T10: What was your mother’s name?, A10: The mother’s name isn’t important.)
    (T121: Where do you live?, A121: Beaver Creek, off the Route 10.)

The Say Something Smart engine
• Problem:
  – Since we compute the distance between the user request and the interactions from the Subtle corpus, we need to limit the number of interactions considered.

The Say Something Smart engine
• SSS main steps:
  – Corpora indexing
  – Candidate extraction
    • Tokenizers, stemmers, and stop-word filters
      – the default ones for English
      – snowball analyzer for the Portuguese language
    • The number of retrieved interactions can be configured
  – Choosing the answer

The Say Something Smart engine
  u: Do you have a brother?
  (T4: You don’t have to go brother., A4: I’m not your brother.)
  (T5: You have a brother?, A5: Yeah, I’ve got a brother, man. You know that.)
  (T6: Joe doesn’t have a brother?, A6: No brother.)
  (T7: Brother, do you have tooth paste?, A7: What brother?)
  (T8: Have you seen my brother?, A8: He’s not your brother anymore.)

The Say Something Smart engine
• Being given:
  – A user request u
  – The set of interactions, U, retrieved by Lucene
• For each (T_i, A_i) in U:
    $\mathrm{score}(A_i, u) = \sum_{j=1}^{4} w_j \, M_j(U, T_i, A_i, u)$
  where $w_j$ is the weight assigned to measure $M_j$
• Measures M1, M2 and M3 are based on Jaccard similarity:
    $J(A, B) = |A \cap B| \,/\, |A \cup B|$

The Say Something Smart engine
• M1: Jaccard similarity between the user request and the trigger
  u: What’s your mother’s name?
  (T9: How nice. What’s your mother’s name?, …)
  (T10: What was your mother’s name?, A10: The mother’s name isn’t important.)
  (T11: What’s your name?, …)
  (T12: What’s the name your mother and father gave you?, …)
  (T13: Your mother? how dare you to call my mother’s name?, …)

The Say Something Smart engine
• M2: a higher score is given to the most “frequent” answer (Jaccard)
  u: How are you?
  (T14: Where do you live?, A14: Right here.)
  (T15: Where are you living?, A15: Right here.)
  (T16: Where do you live?, A16: New York City.)
  (T17: Where do you live?, A17: Dune Road.)

The Say Something Smart engine
• M3: Jaccard similarity between the user request and the answer
  u: What’s your mother’s name?
  (T9: How nice. What’s your mother’s name?, A9: Vickie.)
  (T10: What was your mother’s name?, A10: The mother’s name isn’t important.)
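
A minimal Python sketch of the Jaccard-based measures and the weighted score, using the T9/T10 example above. Several parts are assumptions rather than the SSS implementation: `tokens` is a plain lowercase word split standing in for the real tokenizer/stemmer/stop-word pipeline, M2 is interpreted as the mean Jaccard similarity between an answer and the other candidate answers, and M4 is omitted because no formula is given for it. All function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercase word set; a stand-in for the real tokenizer/stemmer/stop-word filters."""
    return set(re.findall(r"[a-z0-9']+", text.lower().replace("’", "'")))

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def measures(candidates, i, u):
    """Return (M1, M2, M3) for candidate i; M2's definition is an assumption."""
    trigger, answer = candidates[i]
    others = [a for k, (_, a) in enumerate(candidates) if k != i]
    m1 = jaccard(tokens(u), tokens(trigger))  # request vs. trigger
    m2 = (sum(jaccard(tokens(answer), tokens(o)) for o in others) / len(others)
          if others else 0.0)                 # answer vs. other answers
    m3 = jaccard(tokens(u), tokens(answer))   # request vs. answer
    return (m1, m2, m3)

def score(candidates, i, u, weights):
    """Weighted sum of the measures, one weight per measure."""
    return sum(w * m for w, m in zip(weights, measures(candidates, i, u)))

def choose_answer(candidates, u, weights):
    """Return the answer of the highest-scoring candidate."""
    best = max(range(len(candidates)),
               key=lambda i: score(candidates, i, u, weights))
    return candidates[best][1]

CANDIDATES = [
    ("How nice. What's your mother's name?", "Vickie."),
    ("What was your mother's name?", "The mother's name isn't important."),
]
u = "What's your mother's name?"
equal_weights = (1 / 3, 1 / 3, 1 / 3)
best = choose_answer(CANDIDATES, u, equal_weights)
```

With this tokenization, T9 has the higher M1 (4/6 against 3/6), but A10 shares two words with the request, so with equal weights its M3 lifts it above A9.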

The Say Something Smart engine
• M4: Time difference between trigger and answer
  u: Are you joking?
  (T: You're a joke! You're a joke!, A: Linda Kasabian gives birth to a son. She names the child Angel.)

Evaluation
• Evaluation setup
  – Filipe, online since September 2013
  – 103, user requests
    • 20 were randomly selected

Evaluation
• Experiment 1: Are subtitles adequate?
  – Three human annotators
  – First 25 interactions returned by Lucene for each of the 20 requests
  – Question:
    • Is there at least one plausible answer among the 25 candidates?
• Results
  – Evaluator 1:
    • “What country do you live?” not ok;
      – Evaluator 3 considered “it depends” a plausible answer
  – Evaluator 2:
    • “What country do you live?” not ok;
    • “Are you a loser?” not ok;
      – Evaluators 2 and 3 considered that “So what? You want to hit me?” or “Shut up.” were plausible answers
  – Evaluator 3:
    • “Where is the capital of Japan?” not ok;
      – Evaluators 1 and 2 considered that “58% don’t know” was a plausible answer
• The three annotators agreed that 17 out of 20 turns had a plausible answer

Evaluation
• Experiment 2: Answer selection
  – Settings (S1,...,S5):
    • S1 – Only takes into account M1.
    • S2 – Only takes into account M2.
    • S3 – Takes into account M1 and M2.
    • S4 – Takes into account M1, M2 and M3.
    • S5 – Takes into account all four measures.
  – Weights:
    • S1–S4: the same weight was given to each measure.
    • S5:
      – 40% weight for M1
      – 30% weight for M2
      – 20% weight for M3
      – 10% weight for M4
  – 21 people evaluated the responses returned for the 20 requests
  – Results

      S1       S2       S3       S4       S5
      39.29%   45.24%   46.90%   61.67%   51.19%

    • Highest: S4 – Takes into account M1, M2 and M3.
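
The five settings of Experiment 2 can be written as weight vectors (w1, w2, w3, w4) over the measures M1 to M4. The sketch below is illustrative only: the names `SETTINGS` and `combine` are hypothetical, and normalizing the equal weights of S1 to S4 so that they sum to 1 is an assumption (the slides only say the measures were given the same weight).

```python
# Weight vectors (w1, w2, w3, w4) for measures M1..M4.
# S1-S4: equal weight on the measures used, zero elsewhere (normalization assumed).
# S5: the 40/30/20/10 split stated in the slides.
SETTINGS = {
    "S1": (1.0, 0.0, 0.0, 0.0),
    "S2": (0.0, 1.0, 0.0, 0.0),
    "S3": (0.5, 0.5, 0.0, 0.0),
    "S4": (1 / 3, 1 / 3, 1 / 3, 0.0),
    "S5": (0.4, 0.3, 0.2, 0.1),
}

def combine(measure_values, setting):
    """Weighted sum of four measure values under the named setting."""
    return sum(w * m for w, m in zip(SETTINGS[setting], measure_values))

# Made-up measure values for one candidate, for illustration only.
example_measures = (0.5, 0.2, 0.3, 1.0)
scores = {name: combine(example_measures, name) for name in SETTINGS}
```

Because a setting is only a weight vector, trying a new combination of measures amounts to adding an entry to `SETTINGS`.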

Meet Filipe (or “Filaipe”)

Conclusions and Future Work
• We have built the Subtle corpus (PT and EN)
• Tested several techniques to extract a plausible answer in the Say Something Smart engine
• Still much room for improvement
  – Organizing data
    • Detecting paraphrases
    • …
  – Text processing
    • Synonyms
    • Named entities
  – Combining the measures
  – Adding other corpora
  – Taking context into consideration
  – …