Priberam's question answering system for Portuguese

Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
CLEF Workshop, Vienna, 21–23 September 2005

Priberam Informática, Av. Defensores de Chaves, 32 – 3º Esq., 1000-119 Lisboa, Portugal
Tel.: +351 21 781 72 60 / Fax: +351 21 781 72 79

Summary
• Introduction
• A workbench for NLP:
  • Lexical resources
  • Software tools
  • Question categorization
• System description:
  • Indexing process
  • Question analysis
  • Document retrieval
  • Sentence retrieval
  • Answer extraction
• Evaluation & results
• Conclusions

Introduction
• Goal: to build a question answering (QA) engine that finds a unique exact answer to natural language questions.
• Evaluation: the QA@CLEF Portuguese monolingual task.
• Previous work by Priberam on this subject:
  • LegiX – a legal information system;
  • SintaGest – a workbench for NLP;
  • TRUST project (Text Retrieval Using Semantic Technologies) – development of the Portuguese module in a cross-language environment.

Lexical resources
• Lexicon:
  • Lemmas, inflections and POS;
  • Sense definitions (*);
  • Semantic features, subcategorization and selection restrictions;
  • Ontological and terminological domains;
  • English and French equivalents (*);
  • Lexical-semantic relations (e.g. derivations).
  (*) Not used in the QA system.
• Thesaurus
• Ontology:
  • Multilingual (**) (English, French, Portuguese) – enables translations;
  • Designed by Synapse Développement for TRUST.
  (**) Only Portuguese information is used in the QA system.

Software tools
• Priberam's SintaGest – an NLP application that allows:
  • Building & testing a context-free grammar (CFG);
  • Building & testing contextual rules for:
    • Morphological disambiguation;
    • Named entity & fixed expression recognition;
  • Building & testing patterns for question categorization and answer extraction;
  • Compressing & compiling all data into binary files.
• Statistical POS tagger:
  • Used together with contextual rules for morphological disambiguation;
  • HMM-based (2nd order), trained on the CETEMPúblico corpus;
  • Fast & efficient decoding through the Viterbi algorithm.

Question categorization (I)
• 86 question categories, in a flat structure:
  <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, …
• Categorization is performed through "rich" patterns (more powerful than regular expressions):
  • More than one category is allowed (avoiding hard decisions);
  • "Rich" patterns are conditional expressions over words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent) and constant phrases;
  • Everything is built & tested through SintaGest.

Question categorization (II)
• There are 3 kinds of patterns, each carrying heuristic scores:
  • Question patterns (QPs): for question categorization;
  • Answer patterns (APs): for sentence categorization (during indexing);
  • Question answering patterns (QAPs): for answer extraction.
• Examples:

  Question (FUNCTION)
    : Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
      // e.g. "Quem é Jorge Sampaio?" ["Who is Jorge Sampaio?"]
    : Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15
      // e.g. "Que cargo desempenha Jorge Sampaio?" ["What position does Jorge Sampaio hold?"]
  Answer
    : Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20
      // e.g. "Jorge Sampaio é o {Presidente da República}..." ["Jorge Sampaio is the {President of the Republic}..."]
    : {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15
      // e.g. "O {presidente da República}, Jorge Sampaio..." ["The {President of the Republic}, Jorge Sampaio..."]
  ;
  Answer (FUNCTION)
    : QuestIdent(FUNCTION_N) = 10
    : Ergonym = 10
  ;
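To make the pattern mechanics concrete, below is a minimal Python sketch of how a QP such as Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15 might be evaluated against an analysed question. The Token fields and the backtracking matcher are illustrative assumptions, not SintaGest's actual engine, which also supports Ont, QuestIdent, constant phrases and full conditional expressions.

    from dataclasses import dataclass, field

    @dataclass
    class Token:
        word: str                                 # surface form
        root: str                                 # lemma
        cats: set = field(default_factory=set)    # POS / entity tags, e.g. {"Nprop", "ENT"}

    # Predicates mirroring the slide's pattern primitives (names are assumptions):
    def Word(w):          return lambda t: t.word.lower() == w
    def Root(r):          return lambda t: t.root == r
    def AnyCat(*cats):    return lambda t: bool(t.cats & set(cats))
    def Distance(lo, hi): return ("gap", lo, hi)  # allow a gap of lo..hi tokens

    def match(pattern, tokens, score):
        """Return `score` if the predicate sequence matches anywhere in `tokens`."""
        def ok(p, i):
            if p == len(pattern):
                return True
            item = pattern[p]
            if isinstance(item, tuple):           # a Distance(lo, hi) gap
                _, lo, hi = item
                return any(ok(p + 1, i + g) for g in range(lo, hi + 1)
                           if i + g <= len(tokens))
            return i < len(tokens) and item(tokens[i]) and ok(p + 1, i + 1)
        return score if any(ok(0, i) for i in range(len(tokens))) else 0

    # QP for <FUNCTION>: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
    qp = [Word("quem"), Distance(0, 3), Root("ser"), AnyCat("Nprop", "ENT")]

    # "Quem é Jorge Sampaio?" after morphological analysis (hand-built here):
    question = [Token("Quem", "quem"), Token("é", "ser"),
                Token("Jorge Sampaio", "Jorge Sampaio", {"Nprop", "ENT"})]
    print(match(qp, question, 15))                # -> 15, i.e. <FUNCTION> is activated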
QA system overview
• The system architecture is composed of 5 major modules:
  [Architecture diagram: indexing, question analysis, document retrieval, sentence retrieval and answer extraction]

Indexing process
• The collection of target documents is analysed (off-line) and the information is stored in an index database:
  • Each document first feeds the sentence analyser;
  • Sentence categorization: each sentence is classified with one or more question categories through the APs.
• We build indices for:
  • Lemmas
  • Heads of derivation
  • NEs and fixed expressions
  • Question categories
  • Ontology domains (at document level)

Question analysis
• Input:
  • A natural language question (e.g. "Quem é o presidente da Albânia?" ["Who is the president of Albania?"])
• Procedure:
  • Sentence analysis;
  • Question categorization & activation of QAPs (through the QPs);
  • Extraction of pivots (words, NEs, phrases, dates, abbreviations, …);
  • "Query expansion" (heads of derivation & synonyms).
• Output:
  • Pivots' lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado ["president, Albania, to preside, Albanian, head of state"])
  • Question categories (e.g. <FUNCTION>, <DENOMINATION>)
  • Relevant ontological domains
  • Active QAPs

Document retrieval
• Input:
  • Pivots' lemmas (wLi), heads (wHi) & synonyms (wSij)
  • Question categories (ck) & ontological domains (ol)
• Procedure:
  • Word weighting ω(w) according to: POS; ilf (inverse lexical frequency); idf (inverse document frequency).
  • Each document d is given a score ρd (see the Python sketch after the Answer extraction slide):

    ρd := 0
    For each pivot i:
      If d contains lemma wLi Then ρd += KL · ω(wLi)
      Else If d contains head wHi Then ρd += KH · ω(wHi)
      Else If d contains any synonym wSij Then ρd += maxj(KS(wSij, wLi) · ω(wSij))
    If d contains any question category ck Then ρd += KC
    If d contains any ontology domain ol Then ρd += KO
    ρd := RewardPivotProximity(d, ρd)

• Output:
  • The 30 top-scored documents.

Sentence retrieval
• Input:
  • Scored documents {(d, ρd)} with relevant sentences marked.
• Procedure:
  • Sentence analysis;
  • Sentence scoring – each sentence s is given a score ρs according to:
    • the number of pivots' lemmas, heads & synonyms matching s;
    • the number of partial matches (e.g. Fidel ↔ Fidel Castro);
    • the order & proximity of the pivots in s;
    • the existence of question categories shared by the question q and s;
    • the score ρd of the document d containing s.
• Output:
  • Scored sentences {(s, ρs)} above a fixed threshold.

Answer extraction
• Input:
  • Scored sentences {(s, ρs)}
  • Active QAPs (from the question analysis module)
• Procedure:
  • Answer extraction & scoring through the QAPs;
  • Answer coherence: each answer a is rescored to σa, taking into account its coherence with the whole collection of candidate answers (e.g. "Sali Berisha", "Ramiz Alia", "Berisha");
  • Selection of the final answer, e.g. from "O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque." ["The President of Albania, Sali Berisha, tried to avoid the worst, stating that it has not been proven that the Greek government is involved in the attack."]
• Output:
  • The answer a with the highest σa, or 'NIL' if no answer was extracted.
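The document-scoring loop from the Document retrieval slide, rendered as runnable Python. Only the control flow comes from the slide (lemma match, else head match, else best synonym, plus category/domain bonuses and a proximity reward); the set-based document representation, the weight function and the constant values are placeholders.

    # Heuristic weights named after the slide's KL, KH, KS, KC, KO;
    # the numeric values are placeholders, not Priberam's actual tuning.
    KL, KH, KS, KC, KO = 1.0, 0.8, 0.6, 0.5, 0.3

    def reward_pivot_proximity(doc, rho):
        # The slide only names this step: it boosts rho when pivots occur
        # close together in the document. Left as an identity stub here.
        return rho

    def score_document(doc, pivots, categories, domains, weight):
        """doc: dict with sets 'lemmas', 'heads', 'categories', 'domains'.
        pivots: one (lemma, head, synonyms) triple per question pivot.
        weight: word-weighting function combining POS, ilf and idf."""
        rho = 0.0
        for lemma, head, synonyms in pivots:
            if lemma in doc["lemmas"]:            # exact lemma match
                rho += KL * weight(lemma)
            elif head in doc["heads"]:            # head-of-derivation match
                rho += KH * weight(head)
            else:
                # Best synonym match; the slide's KS(wSij, wLi) may depend on
                # the (synonym, lemma) pair, simplified to a constant here.
                hits = [s for s in synonyms if s in doc["lemmas"]]
                if hits:
                    rho += max(KS * weight(s) for s in hits)
        if categories & doc["categories"]:        # shared question category
            rho += KC
        if domains & doc["domains"]:              # shared ontology domain
            rho += KO
        return reward_pivot_proximity(doc, rho)

    # Hypothetical example for "Quem é o presidente da Albânia?":
    doc = {"lemmas": {"presidente", "Albânia"}, "heads": set(),
           "categories": {"FUNCTION"}, "domains": {"politics"}}
    pivots = [("presidente", "presidir", {"chefe de estado"}),
              ("Albânia", "Albânia", {"albanês"})]
    print(score_document(doc, pivots, {"FUNCTION"}, {"politics"}, lambda w: 1.0))  # -> 2.8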
Results & evaluation (I)
• QA@CLEF evaluation:
  • Portuguese monolingual task;
  • 210,734 target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995;
  • Test set of 200 questions (in Brazilian and European Portuguese).
• Results:
  • 64.5% right answers (R).

Results & evaluation (II)
• Reasons for bad answers (W = wrong, X = inexact, U = unsupported):
  • 16.5% – extraction of candidate answers:
    "Como se chama a Organização para a Alimentação e Agricultura das Nações Unidas?" ["What is the name of the United Nations' Food and Agriculture Organization?"]
    Overextraction: "(...) que viria a estar na origem da FAO (a Organização para a Alimentação e a Agricultura das Nações Unidas)" ["(...) which would be at the origin of the FAO (the Food and Agriculture Organization of the United Nations)"]
  • 8.0% – NIL validation:
    "Que partido foi fundado por Andrei Brejnev?" ["Which party was founded by Andrei Brejnev?"] – should return NIL.
  • 6.5% – choice of the final answer:
    "O que é a Sabena?" ["What is Sabena?"]
    1st answer: "No caso da Sabena, a Swissair (…) terá de pronunciar-se" ["In the case of Sabena, Swissair (…) will have to take a position"];
    2nd answer: "(...) o acordo de união entre a companhia aérea belga Sabena" ["(...) the merger agreement involving the Belgian airline Sabena"].
  • 4.5% – document retrieval:
    "Diga o nome de um assassino em série americano." ["Name an American serial killer."]
    The right document was missed: no match between americano and EUA in "(...) John Wayne Gacy, maior assassino em série da história dos EUA (…)" ["(...) John Wayne Gacy, the biggest serial killer in the history of the USA (…)"].

Conclusions
• Priberam's QA system exhibited encouraging results:
  • State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation.
• Possible advantages over other systems:
  • Adjustable & powerful patterns for categorization & extraction (SintaGest);
  • Query expansion through heads of derivation & synonyms;
  • Use of the ontology to introduce semantic knowledge.
• Future work:
  • Confidence measure for final answer validation;
  • Handling of list, how and temporally restricted questions;
  • Semantic disambiguation & further exploitation of the ontology;
  • Syntactic parsing & anaphora resolution;
  • Refinement for Web & book searching.

Ontology
• Concept-based;
• Tree-structured, 4 levels;
• Nodes are concepts;
• Leaves are senses of words;
• Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech);
• There are 3,387 terminal nodes (the most specific concepts).
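For illustration, a minimal Python sketch of one way such a concept tree could be represented and queried: nodes are concepts, word senses hang off the terminal nodes, and walking from a terminal node up to the root yields the chain of ontological domains used during retrieval. The concept names below are invented; only the structure (4 levels, senses at the leaves, 3,387 terminal nodes) comes from the slide.

    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        name: str
        parent: "Concept | None" = None
        children: list = field(default_factory=list)
        senses: list = field(default_factory=list)   # word senses attach to leaves

        def child(self, name):
            node = Concept(name, parent=self)
            self.children.append(node)
            return node

    def domains(node):
        """Chain of concepts from a node up to the root (the tree's 4 levels)."""
        path = []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    # Invented fragment; the real tree has 3,387 terminal nodes:
    root = Concept("root")
    society = root.child("society")
    politics = society.child("politics")
    office = politics.child("political office")          # a terminal node
    office.senses.append("presidente (head of state)")   # a word sense as a leaf

    print(domains(office))  # ['political office', 'politics', 'society', 'root']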