FC / 02 Unsupervised Document Classification and Automatic Topic Extraction Joaquim Silva João Mexia Universidade Nova de Lisboa Portugal Universidade Nova de Lisboa Portugal Agra Coelho Gabriel Lopes Universidade Técnica de Lisboa Portugal Universidade Nova de Lisboa Portugal FC / 02 .Extracting Relevant Expressions (REs) from Documents .Application to Document Clustering and Automatic Ontology Extraction .Other applications FC / 02 Extracting REs from Documents REs: Common agriculture policy Common Customs Produits agricoles Economia de energia Rational use of energy Energy saving in the public sector Publication au journal officiell des Communautés Primeiras experiências: Frequências dos bigramas e tetragramas por ordem decrescente Freq. Bigrama Freq. Tetragrama 1528 - O 891 - A 348 Estados Unidos 203 05 Jan 195 De acordo 188 Agência Lusa 179 Banco de 165 Conselho de 75 - Notícias breves da 74 Notícias breves da actualidade 64 - A bolsa de 60 do Banco de Portugal 59 ministro dos negócios estrangeiros 58 - Notícias breves da 57 Notícias breves da actualidade 54 De acordo com o 51 De acordo com a 49 por cento do que 49 disse à Agência Lusa 46 na africa do Sul 45 com o objectivo de 40 Libertação Nacional 40 Irlanda do 40 Câmara de 40 13 39 Nacional de 39 Na sua 39 Geral de 39 Campeonato Nacional 20 na abertura do mercado 20 na Assembleia da Republica 20 em conferência de imprensa 15 15 15 15 15 15 15 15 4 4 4 4 4 4 4 4 Câmara dos Comissão Nacional Com o Carvalho da Cabo Verde Bósnia e Associação 25 As conversações Mês Cultural México e Mário Tomé Municipalizados de Municipal e Mundo dos Ministério de Minas Gerais 20 do que no fecho 20 do campeonato português de 20 Ministro dos Negócios Estrangeiros 20 - A Camara Municipal 19 presidente de Camara Municipal 19 por cento para o 19 face às principais divisas 19 disse hoje à Agência 19 de final da Taça 19 da Santa Casa da 4 visita oficial de dois 4 visa protestar contra a 4 vila franca do campo 4 vice-ministro dos negócios estrangeiros 4 verde deverá continuar a 4 venda e do transkei 4 valores estavam hoje a Este critério penaliza o comprimento da sequência Colocações propostas após os filtros de Justeson e Katz f(w1 w2) w1 w2 Padrão 11487 New York A N 7261 United States A N 5412 Los Angeles N N 3301 last year AN 3191 Saudi Arabia N N 2699 last week AN 2514 vice president A N 2378 Persian Gulf A N 2161 San Francisco N N 2106 President Bush N N f(w1 w2) w1 w2 2001 Padrão Middle East AN 1942 Saddam Hussein 1867 Soviet Union 1850 White House 1633 United Nations 1337 York City 1328 oil prices 1210 next year 1074 chief executive 1073 real estate É necessária informação morfo-sintáctica. As longas sequências continuam a ser penalizadas pelo critério da frequência. NN AN AN AN NN NN AN AN AN FC / 02 Tools used for Extrating REs: [Silva and Lopes 99] .LocalMaxs algorithm .Fair Dispersion Point Normalization (FDPN) .Symmetric Conditional Probability (SCP) FC / 02 Measuring 2-grams cohesion p( x, y) 2 SCP((x, y)) p( x | y). p( y | x) p( x). p( y) Applying FDPN to measure n-grams cohesion (n>2). Building pseudo-2-grams p((w1 wn ))2 SCP _ f ((w1 wn )) F where 1 i n1 F p(w1 wi ).p(wi 1 wn ) n 1 i 1 FC / 02 Ex: SCP _ f ((energy, saving, in, the, public, sec tor)) p((energy, saving, in, the, public, sec tor))2 F 1 p(energy ). p(( saving , in, the, public , sec tor )) F 6 1 p((energy, saving)).p((in, the, public, sec tor)) p((energy, saving, in)).p((the, public, sec tor)) p((energy, saving, in, the)).p(( public, sec tor)) p((energy, saving, in, the, public)).p(sector) FC / 02 LocalMaxs Algorithm W is a RE if, for x n1 (W ),y n1 (W ) [length(W)=2 and g(W)>y] or [length(W)>2 and g(W)x and g(W)>y] n1 (W ) is the set of the g(.) values of all the (n-1)gram contained in the n-gram W n1 (W ) is the set of the g(.) values of all the (n+1)-gram containing the n-gram W LocalMaxs Algorithm (improved) W is a RE if, for x n1 (W ),y n1 (W ) [length(W)=2 and g(W)>y] or [length(W)>2 and g(W)>(x+y)/2] is the set of the g(.) values of all the (n-1)gram contained in the n-gram W is the set of the g(.) values of all the (n+1)-gram containing the n-gram W FC / 02 LocalMaxs Algorithm The cohesion values of the n-grams and the election of REs g(.)=SCP_f(.) in energy saving energy saving energy saving in energy saving in the energy saving in the public energy saving in the public sector energy saving in the public sector has SCP_f(.) 0.0009276 Universidade Nova 0.0001322 Universidade Nova de 0.0004058 da Universidade Nova 0.00005399 na Universidade Nova 0.0002555 Nova de Lisboa 0.0053873 Universidade Nova de Lisboa 0.0001187 Universidade Nova de Lisboa ( 0.00006521 Universidade Nova de Lisboa , 0.00002609 Universidade Nova de Lisboa . 0.0001675 na Universidade Nova de Lisboa 0.0005022 da Universidade Nova de Lisboa 0.02768 Faculdade de Economia da Universidade 0.0001675 de Economia da Universidade Nova 0.004839 reitor da Universidade Nova de Lisboa 0.03134 Faculdade de Economia da Universidade Nova 0.00004907 , reitor da Universidade Nova de Lisboa 0.0001744 o reitor da Universidade Nova de Lisboa 0.00004893 reitor da Universidade Nova de Lisboa , 0.00007832 reitor da Universidade Nova de Lisboa . 0.0001992 Faculdade de Economia da Universidade Nova , 0.0007259 da Faculdade de Economia da Universidade Nova Universidade Autodidacta Universidade Nova Universidade Tecnica Universidade Técnica Universidades Portuguesas Associacao de Estudantes da Universidade do Algarve cento dos estudantes da Universidade de Coimbra reitor da Universidade Nova de Lisboa Faculdade de Economia da Universidade Nova académica da Universidade da Beira Interior criação de uma Universidade de Bragança dirigente da associação académica da Universidade reitor da Universidade de Aveiro Associacao de Estudantes da Universidade Associação de Estudantes da Universidade Estudantes da Universidade do Algarve Hospitais da Universidade de Coimbra Reitoria da Universidade de Lisboa cento dos estudantes da Universidade uma Universidade de Bragança Economia da Universidade Nova Universidade Clássica de Lisboa Universidade Nova de Lisboa Universidade da Beira Interior associação académica da Universidade criação de uma Universidade Estudantes da Universidade Hospitais da Universidade Reitores de Universidades Universidade Católica Portuguesa Universidade de Aveiro Universidade de Coimbra Universidade de Edimburgo Universidade de Evora Universidade do Algarve reitor da Universidade Universidade Clássica de Lisboa Universidade Nova de Lisboa Universidade da Beira Interior associação académica da Universidade criação de uma Universidade Estudantes da Universidade Hospitais da Universidade Reitores de Universidades Universidade Católica Portuguesa Universidade de Aveiro Universidade de Coimbra Universidade de Edimburgo Universidade de Evora Universidade do Algarve reitor da Universidade FC / 02 25,838 REs (Base Features) extrated from the multilingual corpus: 872,795 words in 324 documents .Reducing the Number of Base Features Using Principal Components Analysis to reduce 25,838 Base Features? RE1,1 RE 1, 2 RE= RE1, p RE1, 2 RE2, 2 RE2, p RE1, p RE2, p RE p , p REi,k Cov( REi , REk ) No!!! FC / 02 Matrix of Document Similarity A smaller (324 324) covariance matrix: S1,1 S 1, 2 S= S1,n S j ,l S1, 2 S 2, 2 S 2,n S1,n S 2,n S n,n 1 i p * * * * ( z z ).( z z i, j ., j i ,l .,l ) p 1 i 1 With n documents and p REs FC / 02 z * i, j is a Transformed Occurrence zi*, j xi*, j x.,* j * j i 1,, p; j 1,, n V (D ) i p 1 * * 2 V ( D*j ) ( x x i, j ., j ) p 1 i 1 * i, j is a Weighted Occurrence xi*, j xi, j .V (REi ).AL(REi ) x xi , j is the number of occurrences of the ith RE in the jth document AL( REi ) is the Average Length of words in the ith RE FC / 02 The “Variance” of RE. Normalizing the documents size 1 j n 2 V ( REi ) ( z z ) i, j i,. n 1 j 1 zi , j xi , j x., j V (D j ) 1 i p x., j xi , j p i 1 1 i p 2 V (D j ) ( x x ) i, j ., j p 1 i 1 i 1,, p; j 1,, n FC / 02 S is a covariance matrix T T S=PP =QQ 1/2 with Q=P [Escoufier and L’Hermier 78] Q is a matrix of documents characterized by uncorrelated features (components) The first k columns of Q concentrate PTK(k)100% of the total information contained in the original Base Features PTV (k ) j k j 1 j n j j 1 j 1 n are the eigenvalues of S FC / 02 How many clusters? Model Based-Cluster Analysis [Fraley and Raftery 98] Parameterizations of the covariance matrix in the Gaussian model and their geometric interpretation Σc λI λcI T λDAD T λcDcAcDc T λDcADc T λcDcADc Distrib. Spher. Spher. Ellips. Ellips. Ellips. Ellips. Vol. Equal Vari. Equal Vari. Equal Vari. Shape Equal Equal Equal Vari. Equal Equal Orient. Equal Vari. Vari. Vari. Bayesian Information Criterion. Evidence of Clustering FC / 02 Clusters Topics Topics correspond to the most important REs Score( REi ) V ( REi ).AL( REi ).Thr( REi ) Thr( REi ) 1 Thr( REi ) 0 if SCP _ f ( REi ) threshold otherwise The 15 most important REs of the cluster occurring in at least 50% of its documents and having a score(.) > max(score(.))/50 are considered as topics FC / 02 Results First level of clusters: 3 components PTV(3)=.848; PTV(5)=.932 and PTV(8)=.955 Second level (sub-clusters): 11 components PTV(11)=.82 FC / 02 Clust. Main Topic Prc. % Rec. % 1 2 3 1.1 1.2 1.3 2.1 2.2 2.3 3.1 3.2 3.3 100 100 99.1 86.9 77.8 87.9 80.8 84 92.9 84.6 77.8 94.6 100 99.1 100 86.9 77.8 87.9 91.3 77.8 89.7 95.7 77.8 91.4 Corr. Tot. Act. Corr. # # # 108 108 108 European communities 108 107 107 Comunidade Europeia 108 109 108 Communauté européene 23 20 23 Rational use of energy 27 21 27 Agricultural products 58 51 58 Combined nomenclature 26 21 23 Economia de energia 25 21 27 Produtos agrícolas 56 52 Nomenclatura Combinada 58 26 22 23 Politique énergétique 27 21 27 Produits agricoles 56 53 58 Nomenclature combinée FC / 02 Cluster 1 European Communities, Member States, EUROPEAN COMMUNITIES, Council Regulation, Having regard to Council Regulation and Official Journal Cluster 2 Comunidade Europeia (European Community), Nomenclatura Combinada (Combined Nomenclature), COMUNIDADES EUROPEIAS and directamente aplicável (directly applicable) Cluster 3 Communauté européenne, nomenclature combinée, États membres, COMMUNAUTÉS EUROPÉENNES and directement applicable FC / 02 Cluster 1.1 Cluster 1.2 Cluster 1.3 Rational use of energy, energy consumption and rational use Agricultural products, Official Journal, detailed rules, Official Journal of the European Communities, proposal from the Commission, publication in the Official Journal and entirely and directly Combined nomenclature, Common Customs, customs authorities, No 2658/87, goods described, general rules, appropriate CN, Having regard to Council Regulation, tariff and statistical and Customs Code FC / 02 Cluster 2.1 economia de energia (energy saving), utilização racional (rational use), racional da energia (rational of energy) and consumo de energia (energy consuming) Cluster 2.2 produtos agrícolas, Comunidades Europeias, Jornal Oficial (Official Journal), directamente aplicável (directly apllicable), COMUNIDADES EUROPEIAS, Jornal Oficial das Comunidades (Official Journal of the Communities), directamente aplicável em todos os Estadosmembros (directly apllicable to all Member States), publicação no Jornal Oficial (publication in the Official Journal), publicação no Jornal Official das Comunidades and Parlamento Europeu (European Parliament) Cluster 2.3 Nomenclatura Cominada, autoridades aduaneiras (customs authorities), indicados na coluna (indicated in the column), mercadorias descritas (goods described), informações pautais vinculativas (binding tariff informations), Aduaneira Comum (Common Customs) regras gerais (general rules), códigos NC (NC codes) and COMUNIDADES EUROPEIAS FC / 02 Cluster 3.1 politique énergétique (energy policy), rationnelle de énergie and l’utilization rationnelle Cluster 3.2 produits agricoles, organisation commune (common organization), organisation commune des marchés (common organization of the markets), directment applicable, Journal officiel, Journal officiel des Communautés and COMMUNAUTÉS EUROPÉENNES Cluster 3.3 Nomenclature combinée, autorités douanières (customs authorities), nomenclature tarifaire, No 2658/87, marchandises décrites (goods described), tarifaire et statistique (tariff and statistical) and COMMUNAUTÉS EUROPÉENES Ordered observations FC / 02 2.5 2 1.5 1 0.5 0 -0.5 -1 -1.5 -2 -2.5 -2.5 Assessing Normality of Data -2 -1.5 -1 -0.5 0 0.5 1 Standard normal quantiles 1.5 2 2.5 FC / 02 Ordered distances 25 20 15 10 Assessing Normality of Data 5 5 10 15 Chi-square percentiles 20 25 FC / 02 Data Transformations to approximate to Normality For each column of Q, each observation can be modified from y to y( ) [Box and Cox, 64] y ( ) y ( ) y 1 ln y for for 0 0 is chosen such that l() is maximized j m m 1 j m ( ) l ( ) ln[ ( y j y ( ) ) 2 ] ( 1) ln y j 2 m j 1 j 1 m is the number of elements of the cluster y ( ) 1 j m ( ) yj m j 1 Document Classification • Representing the new Document with k components T We need: x j [v1, j , v2, j ,...vk , j ] Where vi , j is the value of the document j for the component i But we have: y [ z , z ,...,z ] T j * 1, j where * i ,l * 2, j * p, j z is the weighted occurrence of the document l for Relevant Expression (RE) i, and p the number of REs Translate y T to j x T j x [v1, j , v2, j ,...vk , j ] T j So, let 1/ 2 v s PΛ T j T j with S PΛ PT S The similarity document matrix, P the eigenvectors matrix, Λ the eigenvalues diagonal matrix, and sTj is a vector of similarities 1 sTj p 1 y Tj Z 1/ 2 So, v s PΛ is a vector with n (the number of documents) elements, i.e. vT [v , v ,...v ] T j T j j 1, j 2, j Vector x [v1, j , v2, j ,...vk , j ] has the first k elements (components) of the vector T j v [v1, j , v2, j ,...vn, j ] T j n, j Quadratic Discrimination Score 1 1 d (x ) ln(| Σi |) (x μi )T i1 (x μi ) ln( pi ) 2 2 Q i .Document class i is represented by cluster i and is estimated by SC (cov matrix between components) SCl ,m 1 i n (cl ,i cl ,. )(cm,i cm,. ) n 1 i 1 SC1,1 SC 1, 2 SC SC1,k 1 i n cl ,. cl ,i n i 1 SC1, 2 SC1,k SC2, 2 SC2,k SC2,k SCk ,k is estimated by the vector of means c T [c1,., c2,.,...,ck ,. ] A Criterion for Classifying Documents Be x the vector of components for a document to classify and r a class represented by cluster r x belongs to class r if d (x) maxi (d (x)) d (x) minj (d (c j )) Q r Q i Q r Q r i = 1, 2, … g ; j=1,2, …, n Q r d (x) is the quadratic score for x g is the number of classes; n is the number of documents of the cluster r; is a j document of cluster r c Results • • • • Average Precision: 93% Average Recall: 93% Average Precision on rejection: 91.5% Average Recall on rejection: 100% FC / 02 Conclusions .Unsupervised statistics-based and language independent approach .No pre-defined topics, features or document descriptors . Number of clusters not left to user choice .About 80% of the clusters REs can be taken as Topics FC / 02 Other Applications NLP Lexicon enrichment Parsing precision Attachment decision Rewriting Grammar Rules Sequences of Strongly connected Characters .Information Retrieval FC / 02 Applying the Fair Dispersion to the 2(.): (f(x, y) N f(x) f(y)) 2 Φ ((x, y)) f(x) f(y) (N f(x)) (N f(y)) 2 f (w ...w ) N Avp 2 2 _ f (( w1...wn)) 1 n Avp ( N Avx ) ( N Avy ) 1 i n1 Avp p(w1 ...wi ) p(wi 1 ...wn ) n 1 i 1 1 i n1 Avx f(w1 ...wi ) n 1 i 1 1 i n Avy f(wi ...wn ) n 1 i 2 FC / 02 Results for Contiguous MWUs The scores for the several statistics-based measures Statisticsbased measure: g(.)= SCP_f(.) SI_f(.) _f(.) Dice_f(.) 81.00% 75.00% 76.00% 58.00% Extracted MWUs (count) 24476 20906 24711 32381 LogLike_f(.) 53.00% 40602 2 Precision (average) LUSA Corpus containing 919,253 words. Evaluation Criterion: Compound nouns, locutions, frozen forms and Relevant Expressions are correct (MWUs). FC / 02 Multilingual Corpora ( Eupopean Parliament debates) No morpho-syntactic filters used Evaluation Criterion: MWU / Relevant Expressions are correct extractions LocalMaxs and SCP_f scores for Different Languages Language English French German Medieval Portuguese 77.00% 76.00% 75.00% Extracted MWUs (count) 8017 8980 5190 73.00% 5451 Precision Corpus size 493191 512031 454750 377724 FC / 02 Evaluation Criterion: From aTagged Corpus [Marques&Lopes] (1,194,206 words) verb forms changed to infinitive forms Ex: - arredar pé (to leave) - estar para chegar (to be about arriving) - ter pela frente (to face) The Scores for the Contiguous Compound Verb Extractions Form 2-gram 3-gram Precision Extracted compound verbs 81.00% 108 73.00% 492 FC / 02 MWUs Samples Universidade Autodidacta Universidade Nova Universidade Tecnica Universidade Técnica Universidades Portuguesas Associacao de Estudantes da Universidade do Algarve * cento dos estudantes da Universidade de Coimbra reitor da Universidade Nova de Lisboa Faculdade de Economia da Universidade Nova FC / 02 English sub-Thatcherite theology sine qua non deformed by the removal of a tumour Vocational Training Reform of the common agricultural Council of Agriculture Ministers Common agricultural policy Spread of Organized Crime Sanz Fernández SIR JACK STEWART-CLARK Royal Society Richard Attenborough Red Cross LUCAS PIRES Henry the Navigator FC / 02 French Contrôle de la croissance démographique Infrastructures nécessaires Résolutions adoptées Président du tribunal Mise en marché commune Protection de la petite enfance Miskito Tawahka Pech Protection du touriste Drame algérien Commission a données aux amendements adoptés Sécurité de nos approvisionnements énergétiques Directive relative à la sécurité FC / 02 German Annahme einer Richtlinie über die Werbung (consideration of a directrix about publicity) Rechte beim Gerichtshof der Europäischen Gemeinschaft geltend (European Community Parliament's Rights in force) Algerischen Volkes (Algerian people) Gefahr für die Volksgesundheit (Danger for public health) Zusammensetzung der Ausschüsse und Delegationen (composition of the ? and delegation) Schaffung des EWR (creation of the EWR) Währung und Industriepolitik über den Vorschlag (metal money and industrial policy about a proposal) Gemischten Parlamentarischen (mixed Parliament) FC / 02 Using Tags as “Words” in LocalMaxs to obtain Preference Selection for relative clauses or other clauses attachments. _PR _ADV _ADV _V que mais tipicamente corresponde freq=2 _PR _ART _N _V que os mesmos derem freq=6 _PR _N _PPOA cujo reexame se freq=4 _PR _N _V _ART cuja realização impliquem a freq=2 _PR _PPOA _V _VIRG _PREP _N que se desdobra , com normalidade freq=2 _PR _V quem vier freq=92 _PR _V _ART _PAR _N _PAR que precedeu a " deliberação " freq=3 _PR _V _CONTR _N _ADJ _PREP _PIND que resultaram da análise sistemática de todo freq=2 _PREP _ADJ _N _PTO _V de particular importncia . Vejamos freq=2 _PREP _ART _N _ADJ _ADJ _PREP de os serviços públicos municipais de freq=4 _PREP _ART _N _CONJCOORD _N _ADJ por o município ou municípios concedentes freq=5 FC / 02 Using Characters as “Words” in LocalMaxs to obtain Strongly connected Sequences of Characters (within words and between words) Ex: trateg trutur xtracç éctric tratam struíd tradiç trangeir ntratuai ncentraç xtraí xtrem tritiva trocas* tráfico trânsit utras*in (utrazin) FC / 02 . Information Retrieval REs points to relevant information: Topics and Subtopics Ex. about “Human Rights” Extracting its related Subject Matters (Topics and Subtopics) … : European Convention on Human Rights European Court of Human Rights Universal Declaration of Human Rights European Commision of the Human Rights Etc. Selection by Topic / Subtopic