Priberam's question answering system for Portuguese

Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
CLEF Workshop, Vienna, 21–23 September 2005

Priberam Informática, Av. Defensores de Chaves, 32 – 3º Esq., 1000-119 Lisboa, Portugal
Tel.: +351 21 781 72 60 / Fax: +351 21 781 72 79

Summary
• Introduction
• A workbench for NLP:
  • Lexical resources
  • Software tools
  • Question categorization
• System description:
  • Indexing process
  • Question analysis
  • Document retrieval
  • Sentence retrieval
  • Answer extraction
• Evaluation & results
• Conclusions

Introduction
• Goal: to build a question answering (QA) engine that finds a unique exact answer to natural language questions.
• Evaluation: the QA@CLEF Portuguese monolingual task.
• Previous work by Priberam on this subject:
  • LegiX – a legal information system;
  • SintaGest – a workbench for NLP;
  • TRUST project (Text Retrieval Using Semantic Technologies) – development of the Portuguese module in a cross-language environment.

Lexical resources
• Lexicon:
  • Lemmas, inflections and POS;
  • Sense definitions (*);
  • Semantic features, subcategorization and selection restrictions;
  • Ontological and terminological domains;
  • English and French equivalents (*);
  • Lexical-semantic relations (e.g. derivations).
  (*) Not used in the QA system.
• Thesaurus
• Ontology:
  • Multilingual (**) (English, French, Portuguese) – enables translations;
  • Designed by Synapse Développement for TRUST.
  (**) Only Portuguese information is used in the QA system.

Software tools
• Priberam's SintaGest – an NLP application that allows:
  • Building & testing a context-free grammar (CFG);
  • Building & testing contextual rules for:
    • Morphological disambiguation;
    • Named entity & fixed expression recognition;
  • Building & testing patterns for question categorization and answer extraction;
  • Compressing & compiling all data into binary files.
• Statistical POS tagger:
  • Used together with contextual rules for morphological disambiguation;
  • HMM-based (2nd order), trained on the CETEMPúblico corpus;
  • Fast & efficient decoding through the Viterbi algorithm.

Question categorization (I)
• 86 question categories, in a flat structure:
  <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, …
• Categorization is performed through "rich" patterns (more powerful than regular expressions):
  • More than one category is allowed (avoiding hard decisions);
  • "Rich" patterns are conditional expressions over words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent) and constant phrases;
  • Everything is built & tested through SintaGest.

Question categorization (II)
• There are 3 kinds of patterns, each carrying heuristic scores:
  • Question patterns (QPs): for question categorization;
  • Answer patterns (APs): for sentence categorization (during indexing);
  • Question answering patterns (QAPs): for answer extraction.
• Examples:

  Question (FUNCTION)
    : Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
      // e.g. "Quem é Jorge Sampaio?" ["Who is Jorge Sampaio?"]
    : Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15
      // e.g. "Que cargo desempenha Jorge Sampaio?" ["What position does Jorge Sampaio hold?"]
  Answer
    : Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20
      // e.g. "Jorge Sampaio é o {Presidente da República}..." ["Jorge Sampaio is the {President of the Republic}..."]
    : {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15
      // e.g. "O {presidente da República}, Jorge Sampaio..." ["The {President of the Republic}, Jorge Sampaio..."]
  ;
  Answer (FUNCTION)
    : QuestIdent(FUNCTION_N) = 10
    : Ergonym = 10
  ;
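To make the pattern mechanics concrete, below is a minimal Python sketch of how a QP such as Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15 might be evaluated against an analysed question. The Token fields and the backtracking matcher are illustrative assumptions, not SintaGest's actual engine, which also supports Ont, QuestIdent, constant phrases and full conditional expressions.

    from dataclasses import dataclass, field

    @dataclass
    class Token:
        word: str                                 # surface form
        root: str                                 # lemma
        cats: set = field(default_factory=set)    # POS / entity tags, e.g. {"Nprop", "ENT"}

    # Predicates mirroring the slide's pattern primitives (names are assumptions):
    def Word(w):          return lambda t: t.word.lower() == w
    def Root(r):          return lambda t: t.root == r
    def AnyCat(*cats):    return lambda t: bool(t.cats & set(cats))
    def Distance(lo, hi): return ("gap", lo, hi)  # allow a gap of lo..hi tokens

    def match(pattern, tokens, score):
        """Return `score` if the predicate sequence matches anywhere in `tokens`."""
        def ok(p, i):
            if p == len(pattern):
                return True
            item = pattern[p]
            if isinstance(item, tuple):           # a Distance(lo, hi) gap
                _, lo, hi = item
                return any(ok(p + 1, i + g) for g in range(lo, hi + 1)
                           if i + g <= len(tokens))
            return i < len(tokens) and item(tokens[i]) and ok(p + 1, i + 1)
        return score if any(ok(0, i) for i in range(len(tokens))) else 0

    # QP for <FUNCTION>: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
    qp = [Word("quem"), Distance(0, 3), Root("ser"), AnyCat("Nprop", "ENT")]

    # "Quem é Jorge Sampaio?" after morphological analysis (hand-built here):
    question = [Token("Quem", "quem"), Token("é", "ser"),
                Token("Jorge Sampaio", "Jorge Sampaio", {"Nprop", "ENT"})]
    print(match(qp, question, 15))                # -> 15, i.e. <FUNCTION> is activated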
QA system overview
• The system architecture is composed of 5 major modules:
  [Architecture diagram: indexing, question analysis, document retrieval, sentence retrieval and answer extraction]

Indexing process
• The collection of target documents is analysed (off-line) and the information is stored in an index database:
  • Each document first feeds the sentence analyser;
  • Sentence categorization: each sentence is classified with one or more question categories through the APs.
• We build indices for:
  • Lemmas
  • Heads of derivation
  • NEs and fixed expressions
  • Question categories
  • Ontology domains (at document level)

Question analysis
• Input:
  • A natural language question (e.g. "Quem é o presidente da Albânia?" ["Who is the president of Albania?"])
• Procedure:
  • Sentence analysis;
  • Question categorization & activation of QAPs (through the QPs);
  • Extraction of pivots (words, NEs, phrases, dates, abbreviations, …);
  • "Query expansion" (heads of derivation & synonyms).
• Output:
  • Pivots' lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado ["president, Albania, to preside, Albanian, head of state"])
  • Question categories (e.g. <FUNCTION>, <DENOMINATION>)
  • Relevant ontological domains
  • Active QAPs

Document retrieval
• Input:
  • Pivots' lemmas (wLi), heads (wHi) & synonyms (wSij)
  • Question categories (ck) & ontological domains (ol)
• Procedure:
  • Word weighting ω(w) according to: POS; ilf (inverse lexical frequency); idf (inverse document frequency).
  • Each document d is given a score ρd (see the Python sketch after the Answer extraction slide):

    ρd := 0
    For each pivot i:
      If d contains lemma wLi Then ρd += KL · ω(wLi)
      Else If d contains head wHi Then ρd += KH · ω(wHi)
      Else If d contains any synonym wSij Then ρd += maxj(KS(wSij, wLi) · ω(wSij))
    If d contains any question category ck Then ρd += KC
    If d contains any ontology domain ol Then ρd += KO
    ρd := RewardPivotProximity(d, ρd)

• Output:
  • The 30 top-scored documents.

Sentence retrieval
• Input:
  • Scored documents {(d, ρd)} with relevant sentences marked.
• Procedure:
  • Sentence analysis;
  • Sentence scoring – each sentence s is given a score ρs according to:
    • the number of pivots' lemmas, heads & synonyms matching s;
    • the number of partial matches (e.g. Fidel ↔ Fidel Castro);
    • the order & proximity of the pivots in s;
    • the existence of question categories shared by the question q and s;
    • the score ρd of the document d containing s.
• Output:
  • Scored sentences {(s, ρs)} above a fixed threshold.

Answer extraction
• Input:
  • Scored sentences {(s, ρs)}
  • Active QAPs (from the question analysis module)
• Procedure:
  • Answer extraction & scoring through the QAPs;
  • Answer coherence: each answer a is rescored to σa, taking into account its coherence with the whole collection of candidate answers (e.g. "Sali Berisha", "Ramiz Alia", "Berisha");
  • Selection of the final answer, e.g. from "O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque." ["The President of Albania, Sali Berisha, tried to avoid the worst, stating that it has not been proven that the Greek government is involved in the attack."]
• Output:
  • The answer a with the highest σa, or 'NIL' if no answer was extracted.
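The document-scoring loop from the Document retrieval slide, rendered as runnable Python. Only the control flow comes from the slide (lemma match, else head match, else best synonym, plus category/domain bonuses and a proximity reward); the set-based document representation, the weight function and the constant values are placeholders.

    # Heuristic weights named after the slide's KL, KH, KS, KC, KO;
    # the numeric values are placeholders, not Priberam's actual tuning.
    KL, KH, KS, KC, KO = 1.0, 0.8, 0.6, 0.5, 0.3

    def reward_pivot_proximity(doc, rho):
        # The slide only names this step: it boosts rho when pivots occur
        # close together in the document. Left as an identity stub here.
        return rho

    def score_document(doc, pivots, categories, domains, weight):
        """doc: dict with sets 'lemmas', 'heads', 'categories', 'domains'.
        pivots: one (lemma, head, synonyms) triple per question pivot.
        weight: word-weighting function combining POS, ilf and idf."""
        rho = 0.0
        for lemma, head, synonyms in pivots:
            if lemma in doc["lemmas"]:            # exact lemma match
                rho += KL * weight(lemma)
            elif head in doc["heads"]:            # head-of-derivation match
                rho += KH * weight(head)
            else:
                # Best synonym match; the slide's KS(wSij, wLi) may depend on
                # the (synonym, lemma) pair, simplified to a constant here.
                hits = [s for s in synonyms if s in doc["lemmas"]]
                if hits:
                    rho += max(KS * weight(s) for s in hits)
        if categories & doc["categories"]:        # shared question category
            rho += KC
        if domains & doc["domains"]:              # shared ontology domain
            rho += KO
        return reward_pivot_proximity(doc, rho)

    # Hypothetical example for "Quem é o presidente da Albânia?":
    doc = {"lemmas": {"presidente", "Albânia"}, "heads": set(),
           "categories": {"FUNCTION"}, "domains": {"politics"}}
    pivots = [("presidente", "presidir", {"chefe de estado"}),
              ("Albânia", "Albânia", {"albanês"})]
    print(score_document(doc, pivots, {"FUNCTION"}, {"politics"}, lambda w: 1.0))  # -> 2.8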
Results & evaluation (I)
• QA@CLEF evaluation:
  • Portuguese monolingual task;
  • 210,734 target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995;
  • Test set of 200 questions (in Brazilian and European Portuguese).
• Results:
  • 64.5% right answers (R).

Results & evaluation (II)
• Reasons for bad answers (W = wrong, X = inexact, U = unsupported):
  • 16.5% – extraction of candidate answers:
    "Como se chama a Organização para a Alimentação e Agricultura das Nações Unidas?" ["What is the name of the United Nations' Food and Agriculture Organization?"]
    Overextraction: "(...) que viria a estar na origem da FAO (a Organização para a Alimentação e a Agricultura das Nações Unidas)" ["(...) which would be at the origin of the FAO (the Food and Agriculture Organization of the United Nations)"]
  • 8.0% – NIL validation:
    "Que partido foi fundado por Andrei Brejnev?" ["Which party was founded by Andrei Brejnev?"] – should return NIL.
  • 6.5% – choice of the final answer:
    "O que é a Sabena?" ["What is Sabena?"]
    1st answer: "No caso da Sabena, a Swissair (…) terá de pronunciar-se" ["In the case of Sabena, Swissair (…) will have to take a position"];
    2nd answer: "(...) o acordo de união entre a companhia aérea belga Sabena" ["(...) the merger agreement involving the Belgian airline Sabena"].
  • 4.5% – document retrieval:
    "Diga o nome de um assassino em série americano." ["Name an American serial killer."]
    The right document was missed: no match between americano and EUA in "(...) John Wayne Gacy, maior assassino em série da história dos EUA (…)" ["(...) John Wayne Gacy, the biggest serial killer in the history of the USA (…)"].

Conclusions
• Priberam's QA system exhibited encouraging results:
  • State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation.
• Possible advantages over other systems:
  • Adjustable & powerful patterns for categorization & extraction (SintaGest);
  • Query expansion through heads of derivation & synonyms;
  • Use of the ontology to introduce semantic knowledge.
• Future work:
  • Confidence measure for final answer validation;
  • Handling of list, how and temporally restricted questions;
  • Semantic disambiguation & further exploitation of the ontology;
  • Syntactic parsing & anaphora resolution;
  • Refinement for Web & book searching.

Ontology
• Concept-based;
• Tree-structured, 4 levels;
• Nodes are concepts;
• Leaves are senses of words;
• Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech);
• There are 3,387 terminal nodes (the most specific concepts).
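For illustration, a minimal Python sketch of one way such a concept tree could be represented and queried: nodes are concepts, word senses hang off the terminal nodes, and walking from a terminal node up to the root yields the chain of ontological domains used during retrieval. The concept names below are invented; only the structure (4 levels, senses at the leaves, 3,387 terminal nodes) comes from the slide.

    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        name: str
        parent: "Concept | None" = None
        children: list = field(default_factory=list)
        senses: list = field(default_factory=list)   # word senses attach to leaves

        def child(self, name):
            node = Concept(name, parent=self)
            self.children.append(node)
            return node

    def domains(node):
        """Chain of concepts from a node up to the root (the tree's 4 levels)."""
        path = []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    # Invented fragment; the real tree has 3,387 terminal nodes:
    root = Concept("root")
    society = root.child("society")
    politics = society.child("politics")
    office = politics.child("political office")          # a terminal node
    office.senses.append("presidente (head of state)")   # a word sense as a leaf

    print(domains(office))  # ['political office', 'politics', 'society', 'root']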