technology
from seed
Using subtitles to deal with Out-of-Domain
interactions
Daniel Magarreiro, Luísa Coheur, Francisco S. Melo
INESC-ID / Instituto Superior Técnico,
Lisbon, Portugal
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
05-11-2015
Index
• Introduction
• Building the Subtle Corpus
• The Say Something Smart Engine
– Corpora Indexing and candidate extraction
– Choosing the answer
• Evaluation
• Meet Filipe
• Conclusions and Future Work
Motivation
• Users often insist on confronting domain-specialized virtual
assistants with Out-of-Domain (OOD) inputs.
Motivation
• Considering that:
– people become more engaged with these applications if OOD
requests are addressed (Bickmore and Cassell, 2000; Patel et al.,
2006)
– system designers are not able to successfully anticipate all the
possible OOD requests
Motivation
• A possible approach:
– explore the (semi-)automatic creation/enrichment of the knowledge
base of virtual assistants/chatbots, taking advantage of the vast
number of dialogues available on the web.
Motivation
• We will focus on movie subtitles (for now)
Motivation
• Movie Subtitles
– the web offers a vast number of repositories with a comprehensive
archive of subtitle files
• this allows data redundancy
• example:
– How are you? Fine
– So, how are you? Fine
– How are you? Fine
– How are you? I’m dying
– subtitles are often available in multiple languages
Motivation
• Our approach
– Build a corpus of interactions from the subtitles
• the Subtle corpus
– Test a set of techniques to select an adequate response (from
Subtle) to a user request
• Deployed in the Say Something Smart engine
– Evaluate the plausibility of the selected answers
Building the Subtle Corpus
• The Subtle corpus will be a set of interactions
– Like Edgar’s knowledge base
• Each interaction is a pair of turns (T, A):
– T is the trigger
– A is an answer (to the trigger)
• Example:
– (T: So how old are you?,
A: That’s none of your business)
Building the Subtle Corpus
• Problem:
– Extracting interactions from subtitle files
– Example:
770
01:01:05,537 --> 01:01:08,905
And makes an offer so ridiculous,
771
01:01:09,082 --> 01:01:11,881
the farmer is forced to say yes.
772
01:01:12,752 --> 01:01:15,494
We gonna offer to buy Candyland?
Building the Subtle Corpus
• Starting point:
– 2 GB of subtitles in Portuguese and English from OpenSubtitles
• Building Subtle:
– Cleaning data
• Example: [TIRES SCREECHING]
– Finding real turns
• Based on handcrafted rules (previous example)
• The user can configure the maximum time allowed between two slots
for them to be considered part of a dialogue
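The cleaning and turn-finding steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the handcrafted rules are not given in the slides, so the single rule used here (a slot that does not end in sentence-final punctuation is merged with the next one) is an assumption, as are the names `Slot`, `parse_srt`, `merge_continuations`, `extract_interactions` and the `max_gap_ms` parameter.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "01:01:05,537 --> 01:01:08,905".
TIMING = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")


@dataclass
class Slot:
    start_ms: int
    end_ms: int
    text: str


def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(srt_text):
    """Parse SRT blocks (separated by blank lines) into cleaned slots."""
    slots = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        match = TIMING.search(lines[1])
        if not match:
            continue
        g = match.groups()
        text = " ".join(lines[2:])
        # Cleaning step: drop annotations such as [TIRES SCREECHING].
        text = re.sub(r"\[[^\]]*\]", "", text).strip()
        if text:
            slots.append(Slot(to_ms(*g[:4]), to_ms(*g[4:]), text))
    return slots


def merge_continuations(slots):
    """Assumed rule: a slot without final punctuation continues in the next slot."""
    merged = []
    for slot in slots:
        if merged and not merged[-1].text.endswith((".", "?", "!")):
            prev = merged[-1]
            merged[-1] = Slot(prev.start_ms, slot.end_ms, prev.text + " " + slot.text)
        else:
            merged.append(slot)
    return merged


def extract_interactions(slots, max_gap_ms=1000):
    """Pair consecutive slots whose time gap is at most max_gap_ms."""
    pairs = []
    for prev, cur in zip(slots, slots[1:]):
        diff = cur.start_ms - prev.end_ms
        if 0 <= diff <= max_gap_ms:
            pairs.append((prev.text, cur.text, diff))
    return pairs
```

Applied to the three subtitle slots shown earlier, this sketch would merge slots 770 and 771 into one turn and pair it with slot 772 only if `max_gap_ms` is large enough to cover the gap between them.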
Building the Subtle Corpus
SubId - 100000
DialogId - 1
Diff - 3715
T - What a son!
A - How about my mother?
SubId - 100000
DialogId - 2
Diff - 80
T - How about my mother?
A - Tell me, did my mother fight you?
SubId - 100000
DialogId - 3
Diff - 1678
T - Tell me, did my mother fight you?
A - Did she fight me?
Building the Subtle Corpus
Language     # Movies   # Movies OK   # Interactions   Average interactions per movie
English      5,764      5,665         5,693,811        1,005
Portuguese   3,701      3,598         3,322,683        923
The Say Something Smart engine
• The Say Something Smart Engine (SSS) will use the
Subtle corpus to get an answer to a given user request.
User: Where do you live?
[Say Something Smart engine]
SSS: Anywhere I feel like!
Subtle:
(T10: What was your mother’s name?,
A10: The mother’s name isn’t important.)
(T121: Where do you live?,
A121: Beaver Creek, off the Route 10.)
The Say Something Smart engine
• Problem:
– Because we compute the distance between the user request and
the interactions from the Subtle corpus, we need to limit the
number of interactions considered.
The Say Something Smart engine
• SSS main steps:
– Corpora Indexing
– Candidate extraction
– Choosing the answer
The Say Something Smart engine
• SSS main steps:
– Corpora indexing
– Candidate extraction
• Tokenizers, stemmers, and stop-word filters
– the default ones for English
– snowball analyzer for the Portuguese language
• The number of retrieved interactions can be configured
– Choosing the answer
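As a rough illustration of the indexing and candidate-extraction steps, the sketch below builds a tiny inverted index over triggers in plain Python. It is only an approximation of the idea: the engine described here relies on Lucene with its own analyzers, whereas this sketch uses a hypothetical stop-word list, no stemming, and a simple shared-term count for ranking. The names `analyze`, `build_index`, `retrieve` and `max_candidates` are assumptions introduced for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (not Lucene's default).
STOP_WORDS = {"a", "an", "the", "do", "does", "you", "your", "to", "my", "is", "of"}


def analyze(text):
    """Lower-case, tokenize, and drop stop words (no stemming in this sketch)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(interactions):
    """Map each trigger term to the positions of the interactions containing it."""
    index = defaultdict(set)
    for i, (trigger, _answer) in enumerate(interactions):
        for term in analyze(trigger):
            index[term].add(i)
    return index


def retrieve(index, interactions, request, max_candidates=25):
    """Return up to max_candidates interactions, ranked by shared terms."""
    hits = Counter()
    for term in set(analyze(request)):
        for i in index.get(term, ()):
            hits[i] += 1
    ranked = sorted(hits, key=lambda i: (-hits[i], i))
    return [interactions[i] for i in ranked[:max_candidates]]
```

The `max_candidates` parameter plays the role of the configurable number of retrieved interactions; only the retrieved subset is then passed to the answer-selection step.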
The Say Something Smart engine
u: Do you have a brother?
(T4: You don’t have to go brother.,
A4: I’m not your brother.)
(T5: You have a brother?,
A5: Yeah, I’ve got a brother, man. You know that.)
(T6: Joe doesn’t have a brother?,
A6: No brother.)
(T7: Brother, do you have tooth paste?,
A7: What brother?)
(T8: Have you seen my brother?,
A8: He’s not your brother anymore.)
The Say Something Smart engine
• Given:
– A user request u
– The set of interactions, U, retrieved by Lucene
• For each (Ti, Ai) in U:
$$\mathrm{score}(A_i, u) = \sum_{j=1}^{4} w_j \, M_j(U, T_i, A_i, u)$$
where $w_j$ is the weight assigned to measure $M_j$
• Measures M1, M2 and M3 are based on Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
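A small Python sketch may make the two formulas concrete. It assumes sentences are compared as sets of lower-cased word tokens, which the slides do not specify; the helper names `tokens`, `jaccard` and `score` are introduced only for this example.

```python
import re


def tokens(text):
    """Lower-cased word set; a simple stand-in for the real text analysis."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(measures, weights):
    """Weighted sum of the measure values M_1..M_4 for one candidate answer."""
    return sum(w * m for w, m in zip(weights, measures))


request = tokens("What's your mother's name?")
m1_t10 = jaccard(request, tokens("What was your mother's name?"))  # 3 shared / 6 total
m1_t11 = jaccard(request, tokens("What's your name?"))             # 3 shared / 4 total
```

Under this tokenization, the request shares three of six distinct tokens with the first trigger and three of four with the second, so the shorter trigger gets the higher Jaccard value.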
The Say Something Smart engine
• M1: Jaccard similarity between user request and trigger
u: What’s your mother’s name?
(T9: How nice. What’s your mother’s name?, …)
(T10: What was your mother’s name?,
A10: The mother’s name isn’t important.)
(T11: What’s your name?, …)
(T12: What’s the name your mother and father gave you?, …)
(T13: Your mother? how dare you to call my mother’s name?, …)
The Say Something Smart engine
• M2: a higher score is given to the most “frequent” answer
(answers are compared using Jaccard similarity)
u: How are you?
(T14: Where do you live?,
A14: Right here.)
(T15: Where are you living?,
A15: Right here.)
(T16: Where do you live?,
A16: New York City.)
(T17: Where do you live?,
A17: Dune Road.)
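One possible reading of M2 is sketched below in Python: each candidate answer is scored by its average Jaccard similarity to the other candidate answers, so answers that recur among the candidates score higher. The slides do not spell out the exact computation, so this averaging scheme and the names `tokens`, `jaccard` and `answer_frequency` are assumptions.

```python
import re


def tokens(text):
    """Lower-cased word set; a simple stand-in for the real text analysis."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def answer_frequency(answers):
    """Assumed M2: mean Jaccard similarity of each answer to all other answers."""
    token_sets = [tokens(a) for a in answers]
    scores = []
    for i, current in enumerate(token_sets):
        others = [jaccard(current, other)
                  for j, other in enumerate(token_sets) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


m2 = answer_frequency(["Right here.", "Right here.", "New York City.", "Dune Road."])
```

With the four answers from the example, the two occurrences of "Right here." support each other and receive a non-zero value, while the two answers that appear only once receive zero.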
The Say Something Smart engine
• M3: Jaccard similarity between the user request and the
answer
u: What’s your mother’s name?
(T9: How nice. What’s your mother’s name?,
A9: Vickie.)
(T10: What was your mother’s name?,
A10: The mother’s name isn’t important.)
The Say Something Smart engine
• M4: Time difference between trigger and answer
u: Are you joking?
(T: You're a joke! You're a joke!,
A: Linda Kasabian gives birth to a son. She names the child Angel.)
Evaluation
• Evaluation Setup
– Filipe, online since September 2013
– 103, user requests
• 20 were randomly selected
Evaluation
• Experiment 1: Are subtitles adequate?
– Three human annotators
– First 25 interactions returned by Lucene for each of the 20 requests
– Question:
• Is there at least one plausible answer in the 25 candidates?
Evaluation
• Experiment 1: Are subtitles adequate?
• Results
– Evaluator 1:
• “What country do you live?” not ok;
– Evaluator 3 considered “it depends” a plausible answer
– Evaluator 2:
• “What country do you live?” not ok;
• “Are you a loser?” not ok;
– Evaluators 2 and 3 considered that “So what? You want to hit me?”, or
“Shut up.” were plausible answers
– Evaluator 3:
• “Where is the capital of Japan?” not ok;
– Evaluators 1 and 2 considered that “58% don’t know” was a plausible
answer
Evaluation
• Experiment 1: Are subtitles adequate?
The three annotators agreed that 17 out of 20 turns
had a plausible answer
Evaluation
• Experiment 2: Answer selection
– Settings (S1,...,S5) :
• S1 – Only takes into account M1.
• S2 – Only takes into account M2.
• S3 – Takes into account M1 and M2.
• S4 – Takes into account M1, M2 and M3.
• S5 – Takes into account all four measures.
– Weights:
• S1–S4: the same weight was given to each measure.
• S5:
– 40% weight for M1
– 30% weight for M2
– 20% weight for M3
– 10% weight for M4
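The five settings can be written as weight vectors over (M1, M2, M3, M4), as in the Python sketch below. Normalizing the equal weights of S1 to S4 so that they sum to 1 is an assumption (the ranking does not depend on the scale), and the measure values in the usage example are made up purely for illustration. `SETTINGS` and `best_answer` are hypothetical names.

```python
# Weight vectors (w1, w2, w3, w4) for measures M1..M4 under each setting.
SETTINGS = {
    "S1": (1.0, 0.0, 0.0, 0.0),
    "S2": (0.0, 1.0, 0.0, 0.0),
    "S3": (0.5, 0.5, 0.0, 0.0),
    "S4": (1 / 3, 1 / 3, 1 / 3, 0.0),
    "S5": (0.4, 0.3, 0.2, 0.1),
}


def best_answer(candidates, setting):
    """Pick the answer with the highest weighted sum of its measure values.

    candidates: list of (answer, (m1, m2, m3, m4)) tuples.
    """
    weights = SETTINGS[setting]

    def total(candidate):
        _answer, measures = candidate
        return sum(w * m for w, m in zip(weights, measures))

    return max(candidates, key=total)[0]


# Illustrative measure values only; they are not taken from the evaluation.
candidates = [
    ("Right here.", (0.5, 0.9, 0.0, 0.2)),
    ("New York City.", (1.0, 0.1, 0.0, 0.9)),
]
choice_s3 = best_answer(candidates, "S3")
choice_s5 = best_answer(candidates, "S5")
```

Even in this toy example the selected answer changes with the setting, which is why the weights are evaluated as separate configurations.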
Evaluation
• Experiment 2: Answer selection
– 21 people evaluated the response returned for each of the 20 requests
Evaluation
• Experiment 2: Answer selection
– Results
• S1: 39.29%
• S2: 45.24%
• S3: 46.90%
• S4: 61.67%
• S5: 51.19%
S4 – Takes into account M1, M2 and M3.
Meet Filipe (or “Filaipe”)
Conclusions and Future Work
• We have built the Subtle corpus (PT and EN)
• Tested several techniques to extract a plausible answer in the
Say Something Smart engine
• Still much room for improvement
– Organizing data
• Detecting paraphrases
• …
– Text processing
• Synonyms
• Named entities
– Combining the measures
– Adding other corpora
– Taking context into consideration
– …