Automatically Enriching a Thesaurus with Information from Dictionaries Hugo Gonçalo Oliveira1 Paulo Gomes {hroliv,pgomes}@dei.uc.pt Cognitive & Media Systems Group CISUC, Universidade de Coimbra October 11, 2011 1 supported by FCT scholarship grant SFRH/BD/44955/2008 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 1 / 18 Index 1 Introduction 2 Proposed approach 3 Enriching TeP with synonymy in PAPEL 4 Evaluation 5 Concluding remarks Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 2 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Structured on words and their meanings Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Structured on words and their meanings Try to cover the whole language Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Structured on words and their meanings Try to cover the whole language No specific domain Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Structured on words and their meanings Try to cover the whole language No specific domain Essential for developing NLP tools for a language Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Structured on words and their meanings Try to cover the whole language No specific domain Essential for developing NLP tools for a language I Useful for NLP tasks (eg. word-sense disambiguation, question-answering, determining similarities, ...) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies, ... Structured on words and their meanings Try to cover the whole language No specific domain Essential for developing NLP tools for a language I I Useful for NLP tasks (eg. word-sense disambiguation, question-answering, determining similarities, ...) See Princeton WordNet [Fellbaum, 1998] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18 Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: I I 2 3 TeP [Maziero et al., 2008] OpenThesaurus.PT2 http://openthesaurus.caixamagica.pt/ http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18 Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: I I TeP [Maziero et al., 2008] OpenThesaurus.PT2 Collaborative dictionary I 2 3 Portuguese Wiktionary3 http://openthesaurus.caixamagica.pt/ http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18 Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: I I TeP [Maziero et al., 2008] OpenThesaurus.PT2 Collaborative dictionary I Portuguese Wiktionary3 Public domain lexical network I 2 3 PAPEL [Gonçalo Oliveira et al., 2010] http://openthesaurus.caixamagica.pt/ http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18 Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: I I TeP [Maziero et al., 2008] OpenThesaurus.PT2 Collaborative dictionary I Portuguese Wiktionary3 Public domain lexical network I PAPEL [Gonçalo Oliveira et al., 2010] Lexical ontology [coming soon] I 2 3 Onto.PT [Gonçalo Oliveira and Gomes, 2010] http://openthesaurus.caixamagica.pt/ http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18 Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: I I TeP [Maziero et al., 2008] OpenThesaurus.PT2 Collaborative dictionary I Portuguese Wiktionary3 Public domain lexical network I PAPEL [Gonçalo Oliveira et al., 2010] Lexical ontology [coming soon] I Onto.PT [Gonçalo Oliveira and Gomes, 2010] More complementary than overlapping 2 3 http://openthesaurus.caixamagica.pt/ http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18 Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: I I TeP [Maziero et al., 2008] OpenThesaurus.PT2 Collaborative dictionary I Portuguese Wiktionary3 Public domain lexical network I PAPEL [Gonçalo Oliveira et al., 2010] Lexical ontology [coming soon] I Onto.PT [Gonçalo Oliveira and Gomes, 2010] More complementary than overlapping Fruitful to merge some of them in a unique broader resource 2 3 http://openthesaurus.caixamagica.pt/ http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18 Introduction This work Integrate synonymy information from dictionaries in a thesaurus Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18 Introduction This work Integrate synonymy information from dictionaries in a thesaurus 1 Extraction of synpairs from dictionaries Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18 Introduction This work Integrate synonymy information from dictionaries in a thesaurus 1 2 Extraction of synpairs from dictionaries Assigning synpairs to synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18 Introduction This work Integrate synonymy information from dictionaries in a thesaurus 1 2 3 Extraction of synpairs from dictionaries Assigning synpairs to synsets Clustering remaining pairs Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18 Introduction This work Integrate synonymy information from dictionaries in a thesaurus 1 2 3 Extraction of synpairs from dictionaries Assigning synpairs to synsets Clustering remaining pairs Apply the procedure in the enrichment of TeP with PAPEL Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18 Proposed approach Extracting synpairs from dictionaries mente, n: cérebro, cabeça, intelecto [mind, n: brain, head, intellect] máquina, n: o mesmo que computador [machine, n: the same as computer ] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 6 / 18 Proposed approach Extracting synpairs from dictionaries mente, n: cérebro, cabeça, intelecto [mind, n: brain, head, intellect] I (cérebro, mente) (cabeça, mente) (intelecto, mente) [(brain, mind) (head, mind) (intellect, mind)] máquina, n: o mesmo que computador [machine, n: the same as computer ] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 6 / 18 Proposed approach Extracting synpairs from dictionaries mente, n: cérebro, cabeça, intelecto [mind, n: brain, head, intellect] I (cérebro, mente) (cabeça, mente) (intelecto, mente) [(brain, mind) (head, mind) (intellect, mind)] máquina, n: o mesmo que computador [machine, n: the same as computer ] I (computador, máquina) [(computer, machine)] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 6 / 18 Proposed approach Assigning synpairs to synsets p = (wx , wy ) + Sa = (w1 , w2 , ..., wn ) → Sa = (w1 , w2 , ..., wn , wx , wy ) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 7 / 18 Proposed approach Assigning synpairs to synsets p = (wx , wy ) + Sa = (w1 , w2 , ..., wn ) → Sa = (w1 , w2 , ..., wn , wx , wy ) Synonymy graph G I I I All the extracted synpairs Nodes represent words (eg. wx , wy ) p = (wx , wy ) establishes an edge between wx and wy Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 7 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 a If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. 2 Select all synsets Cj ∈ C : C ⊂ T , C = {C1 , C2 , ..., Cn } ∀(Cj ∈ C ) : wx ∈ Cj ∨ wy ∈ Cj . a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. 2 Select all synsets Cj ∈ C : C ⊂ T , C = {C1 , C2 , ..., Cn } ∀(Cj ∈ C ) : wx ∈ Cj ∨ wy ∈ Cj . 3 If |C | = 1, p + C1 . a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. 2 Select all synsets Cj ∈ C : C ⊂ T , C = {C1 , C2 , ..., Cn } ∀(Cj ∈ C ) : wx ∈ Cj ∨ wy ∈ Cj . 3 If |C | = 1, p + C1 . 4 Compute the adjacency vector [p] = [wx ] + [wy ]. The adjacency vector of a word is a column of the matrix M, [wj ] = [Mj ]; a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. 2 Select all synsets Cj ∈ C : C ⊂ T , C = {C1 , C2 , ..., Cn } ∀(Cj ∈ C ) : wx ∈ Cj ∨ wy ∈ Cj . 3 If |C | = 1, p + C1 . 4 Compute the adjacency vector [p] = [wx ] + [wy ]. The adjacency vector of a word is a column of the matrix M, [wj ] = [Mj ]; 5 Compute the adjacency vector of each Cj ∈ C P|Cj | [Cj ] = k=1 [wk ] : wk ∈ Cj ; a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. 2 Select all synsets Cj ∈ C : C ⊂ T , C = {C1 , C2 , ..., Cn } ∀(Cj ∈ C ) : wx ∈ Cj ∨ wy ∈ Cj . 3 If |C | = 1, p + C1 . 4 Compute the adjacency vector [p] = [wx ] + [wy ]. The adjacency vector of a word is a column of the matrix M, [wj ] = [Mj ]; 5 Compute the adjacency vector of each Cj ∈ C P|Cj | [Cj ] = k=1 [wk ] : wk ∈ Cj ; 6 Select the most similar synset Cbest : sim(p, Cbest )a = max(sim(p, Cj )); a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Assigning synpairs to synsets For each synpair p = (wx , wy ) 1 If Si ∈ T : wx ∈ Si ∧ wy ∈ Si , nothing is done. 2 Select all synsets Cj ∈ C : C ⊂ T , C = {C1 , C2 , ..., Cn } ∀(Cj ∈ C ) : wx ∈ Cj ∨ wy ∈ Cj . 3 If |C | = 1, p + C1 . 4 Compute the adjacency vector [p] = [wx ] + [wy ]. The adjacency vector of a word is a column of the matrix M, [wj ] = [Mj ]; 5 Compute the adjacency vector of each Cj ∈ C P|Cj | [Cj ] = k=1 [wk ] : wk ∈ Cj ; 6 Select the most similar synset Cbest : sim(p, Cbest )a = max(sim(p, Cj )); 7 p + Cbest . a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18 Proposed approach Clustering remaining pairs G 0 is established by the remaining pairs 1 Sparse matrix M 0 (|N| × |N|) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18 Proposed approach Clustering remaining pairs G 0 is established by the remaining pairs 1 Sparse matrix M 0 (|N| × |N|) 2 Mij0 = sim([wi ], [wj ]) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18 Proposed approach Clustering remaining pairs G 0 is established by the remaining pairs 1 Sparse matrix M 0 (|N| × |N|) 2 Mij0 = sim([wi ], [wj ]) 3 Normalise the columns of M, so that Gonçalo Oliveira & Gomes (CISUC) P|Mj | k=1 KDBI, EPIA 2011 Mjk = 1 October 11, 2011 9 / 18 Proposed approach Clustering remaining pairs G 0 is established by the remaining pairs 1 Sparse matrix M 0 (|N| × |N|) 2 Mij0 = sim([wi ], [wj ]) 3 Normalise the columns of M, so that 4 Extract cluster Si from each row Mi0 , with the words wj where Mij0 > θ Gonçalo Oliveira & Gomes (CISUC) P|Mj | k=1 KDBI, EPIA 2011 Mjk = 1 October 11, 2011 9 / 18 Proposed approach Clustering remaining pairs G 0 is established by the remaining pairs 1 Sparse matrix M 0 (|N| × |N|) 2 Mij0 = sim([wi ], [wj ]) 3 Normalise the columns of M, so that 4 Extract cluster Si from each row Mi0 , with the words wj where Mij0 > θ 5 For each Si : Si ∪ Sj = Sj and Si ∩ Sj = Si , Si is discarded. Gonçalo Oliveira & Gomes (CISUC) P|Mj | k=1 KDBI, EPIA 2011 Mjk = 1 October 11, 2011 9 / 18 Enriching TeP with synonymy in PAPEL Coverage of the synpairs by TeP POS Nouns Verbs Adjectives 4 Synpairs 37,452 21,465 19,073 In TeP 27.38% 43.01% 37.60% |C |4 = 0 14.98% 1.34% 5.58% |C | = 1 12.01% 4.04% 8.22% |C | > 1 45.63% 51.66% 48.60% |C | 3.86 6.64 4.26 Number of candidate synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 10 / 18 Enriching TeP with synonymy in PAPEL Coverage of the synpairs by TeP POS Nouns Verbs Adjectives Synpairs 37,452 21,465 19,073 In TeP 27.38% 43.01% 37.60% |C |4 = 0 14.98% 1.34% 5.58% |C | = 1 12.01% 4.04% 8.22% |C | > 1 45.63% 51.66% 48.60% |C | 3.86 6.64 4.26 Experimentation was performed using the cosine similarity 4 Number of candidate synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 10 / 18 Enriching TeP with synonymy in PAPEL Results – words Thesaurus TeP 2.0 After assignments Clusters Final thesaurus POS Nouns Verbs Adjectives Nouns Verbs Adjectives Nouns Verbs Adjectives Nouns Verbs Adjectives Gonçalo Oliveira & Gomes (CISUC) Total 17,158 10,827 14,586 23,775 12,818 17,158 8,546 502 1,858 30,369 13,090 18,525 Ambiguous 5,805 4,905 3,735 10,418 7,094 6,294 701 8 39 12,045 7,221 6,550 KDBI, EPIA 2011 Words Avg(senses) 1.71 2.08 1.46 2.09 2.64 1.83 1.15 1.02 1.03 1.96 2.62 1.80 Most ambig. 20 41 19 37 42 22 8 3 4 38 42 23 October 11, 2011 11 / 18 Enriching TeP with synonymy in PAPEL Results – synsets Thesaurus TeP 2.0 After assignments Clusters Final thesaurus POS Nouns Verbs Adjectives Nouns Verbs Adjectives Nouns Verbs Adjectives Nouns Verbs Adjectives Gonçalo Oliveira & Gomes (CISUC) Total 8,254 3,978 6,066 8,254 3,978 6,066 3,524 220 820 11,778 4,198 6,886 Avg(size) 3.56 5.67 3.50 6.01 8.50 5.17 2.78 2.34 2.33 5.05 8.18 4.84 KDBI, EPIA 2011 Synsets size = 2 size > 25 3,079 0 939 48 3,033 19 1,930 179 702 217 2,369 120 2,247 0 174 0 656 0 4,177 179 876 217 3,025 120 max(size) 21 53 43 150 148 110 13 6 10 150 148 110 October 11, 2011 12 / 18 Evaluation Assignments evaluation Manual evaluation of sample assignments Two judges for each assignment Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 13 / 18 Evaluation Assignments evaluation Manual evaluation of sample assignments Two judges for each assignment POS Nouns Verbs Adjectives Sample 100 assigns. × 2 100 assigns. × 2 100 assigns. × 2 Gonçalo Oliveira & Gomes (CISUC) 153 142 151 Correct (76.50%) (71.00%) (75.50%) KDBI, EPIA 2011 47 58 49 Incorrect (23.50%) (29.00%) (24.50%) Agreement 77.00% 74.00% 75.00% October 11, 2011 13 / 18 Evaluation Assignments evaluation Manual evaluation of sample assignments Two judges for each assignment POS Nouns Verbs Adjectives Synpair Synset Judge 1 Judge 2 (escrutı́nio,votação) (decisão,desempate) (plano,gizamento) (venerar,homenagear) (atacar,combater) (obter,rapar) (grandioso,épico) (delicado,requintado) (falido,queimado) votação;voto;sufrágio resolução;objetivação;tenção;intenção planı́cie;chã;chanura;plaino;plano;planura venerar;cultuar;adorar;idolatrar atacar;inciar depilar;despelar;pelar;raspar;rapar;rascar admirável;fabuloso;grandioso difı́cil;complicado;delicado queimado;incendiado 1 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 13 / 18 Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18 Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Cluster is correct if, in some context, all its words might have the same meaning Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18 Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Cluster is correct if, in some context, all its words might have the same meaning Table: Evaluation of clustering POS Nouns Verbs Adjectives Sample 105 × 2 105 × 2 105 × 2 Gonçalo Oliveira & Gomes (CISUC) Correct 179 (85.24%) 193 (91.90%) 189 (90.00%) KDBI, EPIA 2011 Incorrect 31 (14.76%) 17 (8.10%) 21 (10.00%) Agreement 91.43% 87.62% 85.71% October 11, 2011 14 / 18 Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Cluster is correct if, in some context, all its words might have the same meaning Figure: Examples of connected subgraphs and resulting clusters. Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18 Concluding remarks Update: computing similarity Sum the adjacencies I One vector per synset: [Cj ] = I sim(p, Cj ) = sim([p], [Cj ]) Gonçalo Oliveira & Gomes (CISUC) P|Cj | k=1 [wk ] KDBI, EPIA 2011 : wk ∈ Cj ; October 11, 2011 15 / 18 Concluding remarks Update: computing similarity Sum the adjacencies I One vector per synset: [Cj ] = I sim(p, Cj ) = sim([p], [Cj ]) P|Cj | k=1 [wk ] : wk ∈ Cj ; Average similarity of the pair with each synset element I One vector per synset element: [Cj ] = ([w1 ], ..., [wn ]), n = |Cj | |Cj | P I sim(p, Cj ) = cos([p],[Mwk ]) k=1 Gonçalo Oliveira & Gomes (CISUC) |Cj | , wk ∈ C j KDBI, EPIA 2011 October 11, 2011 15 / 18 Concluding remarks Update: computing similarity Sum the adjacencies I One vector per synset: [Cj ] = I sim(p, Cj ) = sim([p], [Cj ]) P|Cj | k=1 [wk ] : wk ∈ Cj ; Average similarity of the pair with each synset element I One vector per synset element: [Cj ] = ([w1 ], ..., [wn ]), n = |Cj | |Cj | P I sim(p, Cj ) = cos([p],[Mwk ]) k=1 |Cj | , wk ∈ C j Gold resource of 220 synpairs and possible assignments Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18 Concluding remarks Update: computing similarity Sum the adjacencies I One vector per synset: [Cj ] = I sim(p, Cj ) = sim([p], [Cj ]) P|Cj | k=1 [wk ] : wk ∈ Cj ; Average similarity of the pair with each synset element I One vector per synset element: [Cj ] = ([w1 ], ..., [wn ]), n = |Cj | |Cj | P I sim(p, Cj ) = cos([p],[Mwk ]) k=1 |Cj | , wk ∈ C j Gold resource of 220 synpairs and possible assignments I Variable cut point θ on similarity Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18 Concluding remarks Update: computing similarity Sum the adjacencies I One vector per synset: [Cj ] = I sim(p, Cj ) = sim([p], [Cj ]) P|Cj | k=1 [wk ] : wk ∈ Cj ; Average similarity of the pair with each synset element I One vector per synset element: [Cj ] = ([w1 ], ..., [wn ]), n = |Cj | |Cj | P I sim(p, Cj ) = cos([p],[Mwk ]) k=1 |Cj | , wk ∈ C j Gold resource of 220 synpairs and possible assignments I I Variable cut point θ on similarity Possible to assign the same synpair to 0 ≤ n ≤ |C | synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18 Concluding remarks Update: computing similarity Sum the adjacencies I One vector per synset: [Cj ] = I sim(p, Cj ) = sim([p], [Cj ]) P|Cj | k=1 [wk ] : wk ∈ Cj ; Average similarity of the pair with each synset element I One vector per synset element: [Cj ] = ([w1 ], ..., [wn ]), n = |Cj | |Cj | P I sim(p, Cj ) = cos([p],[Mwk ]) k=1 |Cj | , wk ∈ C j Gold resource of 220 synpairs and possible assignments I I Variable cut point θ on similarity Possible to assign the same synpair to 0 ≤ n ≤ |C | synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18 Concluding remarks Final remarks Flexible method for enriching thesaurus with synonymy in dictionaries Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18 Concluding remarks Final remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18 Concluding remarks Final remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus This work was made in the scope of Onto.PT I Automatic creation of a lexical ontology for Portuguese Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18 Concluding remarks Final remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus This work was made in the scope of Onto.PT I I Automatic creation of a lexical ontology for Portuguese Extraction + integration of lexical information from textual sources Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18 Concluding remarks Final remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus This work was made in the scope of Onto.PT I I I I Automatic creation of a lexical ontology for Portuguese Extraction + integration of lexical information from textual sources Soon freely available! Check http://ontopt.dei.uc.pt Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18 Concluding remarks Thank you! Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 17 / 18 References References I [Fellbaum, 1998] Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press. [Gonçalo Oliveira and Gomes, 2010] Gonçalo Oliveira, H. and Gomes, P. (2010). Onto.PT: Automatic Construction of a Lexical Ontology for Portuguese. In Proc. 5th European Starting AI Researcher Symposium (STAIRS 2010). IOS Press. [Gonçalo Oliveira et al., 2010] Gonçalo Oliveira, H., Santos, D., and Gomes, P. (2010). Extracção de relações semânticas entre palavras a partir de um dicionário: o PAPEL e sua avaliação. Linguamática, 2(1):77–93. [Maziero et al., 2008] Maziero, E. G., Pardo, T. A. S., Felippo, A. D., and Dias-da-Silva, B. C. (2008). A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletrônico para o Português do Brasil. In VI Workshop em Tecnologia da Informação e da Linguagem Humana (TIL), pages 390–392. Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 18 / 18