FC / 02
Unsupervised Document Classification and
Automatic Topic Extraction
Joaquim Silva
João Mexia
Universidade Nova de Lisboa
Portugal
Universidade Nova de Lisboa
Portugal
Agra Coelho
Gabriel Lopes
Universidade Técnica de Lisboa
Portugal
Universidade Nova de Lisboa
Portugal
FC / 02
.Extracting Relevant Expressions (REs) from
Documents
.Application to Document Clustering and
Automatic Ontology Extraction
.Other applications
FC / 02
Extracting REs from Documents
REs:
Common agriculture policy
Common Customs
Produits agricoles
Economia de energia
Rational use of energy
Energy saving in the public sector
Publication au journal officiell des Communautés
Primeiras experiências: Frequências dos bigramas e tetragramas por
ordem decrescente
Freq. Bigrama
Freq. Tetragrama
1528 - O
891 - A
348 Estados Unidos
203 05 Jan
195 De acordo
188 Agência Lusa
179 Banco de
165 Conselho de
75 - Notícias breves da
74 Notícias breves da actualidade
64 - A bolsa de
60 do Banco de Portugal
59 ministro dos negócios estrangeiros
58 - Notícias breves da
57 Notícias breves da actualidade
54 De acordo com o
51 De acordo com a
49 por cento do que
49 disse à Agência Lusa
46 na africa do Sul
45 com o objectivo de
40 Libertação Nacional
40 Irlanda do
40 Câmara de
40 13 39 Nacional de
39 Na sua
39 Geral de
39 Campeonato Nacional
20 na abertura do mercado
20 na Assembleia da Republica
20 em conferência de imprensa
15
15
15
15
15
15
15
15
4
4
4
4
4
4
4
4
Câmara dos
Comissão Nacional
Com o
Carvalho da
Cabo Verde
Bósnia e
Associação 25
As conversações
Mês Cultural
México e
Mário Tomé
Municipalizados de
Municipal e
Mundo dos
Ministério de
Minas Gerais
20 do que no fecho
20 do campeonato português de
20 Ministro dos Negócios Estrangeiros
20 - A Camara Municipal
19 presidente de Camara Municipal
19 por cento para o
19 face às principais divisas
19 disse hoje à Agência
19 de final da Taça
19 da Santa Casa da
4 visita oficial de dois
4
visa protestar contra a
4
vila franca do campo
4 vice-ministro dos negócios estrangeiros
4 verde deverá continuar a
4 venda e do transkei
4 valores estavam hoje a
Este critério penaliza o comprimento da sequência
Colocações propostas após os filtros de Justeson e Katz
f(w1 w2) w1 w2
Padrão
11487 New York A N
7261 United States A N
5412 Los Angeles N N
3301 last year
AN
3191 Saudi Arabia N N
2699 last week
AN
2514 vice president A N
2378 Persian Gulf A N
2161 San Francisco N N
2106 President Bush N N
f(w1 w2) w1 w2
2001
Padrão
Middle East
AN
1942 Saddam Hussein
1867 Soviet Union
1850 White House
1633 United Nations
1337 York City
1328 oil prices
1210 next year
1074 chief executive
1073 real estate
É necessária informação morfo-sintáctica. As longas sequências
continuam a ser penalizadas pelo critério da frequência.
NN
AN
AN
AN
NN
NN
AN
AN
AN
FC / 02
Tools used for Extrating REs:
[Silva and Lopes 99]
.LocalMaxs algorithm
.Fair Dispersion Point Normalization (FDPN)
.Symmetric Conditional Probability (SCP)
FC / 02
Measuring 2-grams cohesion
p( x, y) 2
SCP((x, y))  p( x | y). p( y | x) 
p( x). p( y)
Applying FDPN to measure n-grams cohesion
(n>2). Building pseudo-2-grams
p((w1  wn ))2
SCP _ f ((w1  wn )) 
F
where
1 i n1
F
p(w1  wi ).p(wi 1  wn )

n  1 i 1
FC / 02
Ex:
SCP _ f ((energy, saving, in, the, public, sec tor))
p((energy, saving, in, the, public, sec tor))2
F
1
 p(energy ). p(( saving , in, the, public , sec tor )) 
F
6 1
p((energy, saving)).p((in, the, public, sec tor)) 
p((energy, saving, in)).p((the, public, sec tor)) 
p((energy, saving, in, the)).p(( public, sec tor)) 
p((energy, saving, in, the, public)).p(sector)
FC / 02
LocalMaxs Algorithm
W is a RE if, for
x  n1 (W ),y  n1 (W )
[length(W)=2 and g(W)>y] or
[length(W)>2 and g(W)x and g(W)>y]
n1 (W ) is the set of the g(.) values of all the (n-1)gram contained in the n-gram W
n1 (W ) is the set of the g(.) values of all the (n+1)-gram
containing the n-gram W
LocalMaxs Algorithm (improved)
W is a RE if, for
x  n1 (W ),y  n1 (W )
[length(W)=2 and g(W)>y] or
[length(W)>2 and g(W)>(x+y)/2]
is the set of the g(.) values of all the (n-1)gram contained in the n-gram W
is the set of the g(.) values of all the (n+1)-gram
containing the n-gram W
FC / 02
LocalMaxs Algorithm
The cohesion values of the n-grams and the
election of REs
g(.)=SCP_f(.)
in energy
saving
energy
saving
energy
saving in
energy
saving in the
energy saving
in the public
energy saving
in the public
sector
energy
saving in the
public sector
has
SCP_f(.)
0.0009276 Universidade Nova
0.0001322 Universidade Nova de
0.0004058 da Universidade Nova
0.00005399 na Universidade Nova
0.0002555 Nova de Lisboa
0.0053873 Universidade Nova de Lisboa
0.0001187 Universidade Nova de Lisboa (
0.00006521 Universidade Nova de Lisboa ,
0.00002609 Universidade Nova de Lisboa .
0.0001675 na Universidade Nova de Lisboa
0.0005022 da Universidade Nova de Lisboa
0.02768 Faculdade de Economia da Universidade
0.0001675 de Economia da Universidade Nova
0.004839 reitor da Universidade Nova de Lisboa
0.03134 Faculdade de Economia da Universidade Nova
0.00004907 , reitor da Universidade Nova de Lisboa
0.0001744 o reitor da Universidade Nova de Lisboa
0.00004893 reitor da Universidade Nova de Lisboa ,
0.00007832 reitor da Universidade Nova de Lisboa .
0.0001992 Faculdade de Economia da Universidade Nova ,
0.0007259 da Faculdade de Economia da Universidade Nova
Universidade Autodidacta
Universidade Nova
Universidade Tecnica
Universidade Técnica
Universidades Portuguesas
Associacao de Estudantes da Universidade do Algarve
cento dos estudantes da Universidade de Coimbra
reitor da Universidade Nova de Lisboa
Faculdade de Economia da Universidade Nova
académica da Universidade da Beira Interior
criação de uma Universidade de Bragança
dirigente da associação académica da Universidade
reitor da Universidade de Aveiro
Associacao de Estudantes da Universidade
Associação de Estudantes da Universidade
Estudantes da Universidade do Algarve
Hospitais da Universidade de Coimbra
Reitoria da Universidade de Lisboa
cento dos estudantes da Universidade
uma Universidade de Bragança
Economia da Universidade Nova
Universidade Clássica de Lisboa
Universidade Nova de Lisboa
Universidade da Beira Interior
associação académica da Universidade
criação de uma Universidade
Estudantes da Universidade
Hospitais da Universidade
Reitores de Universidades
Universidade Católica Portuguesa
Universidade de Aveiro
Universidade de Coimbra
Universidade de Edimburgo
Universidade de Evora
Universidade do Algarve
reitor da Universidade
Universidade Clássica de Lisboa
Universidade Nova de Lisboa
Universidade da Beira Interior
associação académica da Universidade
criação de uma Universidade
Estudantes da Universidade
Hospitais da Universidade
Reitores de Universidades
Universidade Católica Portuguesa
Universidade de Aveiro
Universidade de Coimbra
Universidade de Edimburgo
Universidade de Evora
Universidade do Algarve
reitor da Universidade
FC / 02
25,838 REs (Base Features) extrated from the multilingual
corpus: 872,795 words in 324 documents
.Reducing the Number of Base Features
Using Principal Components Analysis to reduce 25,838
Base Features?
 RE1,1
 RE
 1, 2
RE=  

 RE1, p
RE1, 2
RE2, 2

RE2, p
RE1, p 
 RE2, p 

 

 RE p , p 

REi,k  Cov( REi , REk )
No!!!
FC / 02
Matrix of Document Similarity
A smaller (324  324) covariance matrix:
 S1,1
S
 1, 2
S=  

 S1,n
S j ,l
S1, 2
S 2, 2

S 2,n
 S1,n 
 S 2,n 
  

 S n,n 
1 i p *
*
*
*

(
z

z
).(
z

z

i, j
., j
i ,l
.,l )
p  1 i 1
With n documents and p REs
FC / 02
z
*
i, j
is a Transformed Occurrence
zi*, j 
xi*, j  x.,* j
*
j
i  1,, p;
j  1,, n
V (D )
i p
1
*
* 2
V ( D*j ) 
(
x

x

i, j
., j )
p  1 i 1
*
i, j
is a Weighted Occurrence
xi*, j  xi, j .V (REi ).AL(REi )
x
xi , j is the number of occurrences of the ith RE in the jth document
AL( REi ) is the Average Length of words in the ith RE
FC / 02
The “Variance” of RE. Normalizing the
documents size
1 j n
2
V ( REi ) 
(
z

z
)
 i, j i,.
n  1 j 1
zi , j 
xi , j  x., j
V (D j )
1 i p
x., j   xi , j
p i 1
1 i p
2
V (D j ) 
(
x

x
)
 i, j ., j
p  1 i 1
i  1,, p;
j  1,, n
FC / 02
S is a covariance matrix
T
T
S=PP =QQ
1/2
with Q=P
[Escoufier and L’Hermier 78]
Q is a matrix of documents characterized by uncorrelated
features (components)
The first k columns of Q concentrate PTK(k)100% of the total
information contained in the original Base Features

PTV (k ) 

j k
j 1
j n
j

j 1 j
1 n
are the eigenvalues of S
FC / 02
How many clusters?
Model Based-Cluster Analysis [Fraley and
Raftery 98]
Parameterizations of the covariance matrix in
the Gaussian model and their geometric
interpretation
Σc
λI
λcI
T
λDAD
T
λcDcAcDc
T
λDcADc
T
λcDcADc
Distrib.
Spher.
Spher.
Ellips.
Ellips.
Ellips.
Ellips.
Vol.
Equal
Vari.
Equal
Vari.
Equal
Vari.
Shape
Equal
Equal
Equal
Vari.
Equal
Equal
Orient.
Equal
Vari.
Vari.
Vari.
Bayesian Information Criterion. Evidence of Clustering
FC / 02
Clusters Topics
Topics correspond to the most important REs
Score( REi )  V ( REi ).AL( REi ).Thr( REi )
Thr( REi )  1
Thr( REi )  0
if
SCP _ f ( REi )  threshold
otherwise
The 15 most important REs of the cluster occurring in at
least 50% of its documents and having a score(.) >
max(score(.))/50 are considered as topics
FC / 02
Results
First level of clusters: 3 components PTV(3)=.848; PTV(5)=.932 and PTV(8)=.955
Second level (sub-clusters): 11 components PTV(11)=.82
FC / 02
Clust. Main Topic
Prc.
%
Rec.
%
1
2
3
1.1
1.2
1.3
2.1
2.2
2.3
3.1
3.2
3.3
100
100
99.1
86.9
77.8
87.9
80.8
84
92.9
84.6
77.8
94.6
100
99.1
100
86.9
77.8
87.9
91.3
77.8
89.7
95.7
77.8
91.4
Corr. Tot. Act.
Corr.
#
#
#
108 108 108
European communities
108 107 107
Comunidade Europeia
108 109 108
Communauté européene
23 20
23
Rational use of energy
27 21
27
Agricultural products
58 51
58
Combined nomenclature
26 21
23
Economia de energia
25 21
27
Produtos agrícolas
56 52
Nomenclatura Combinada 58
26 22
23
Politique énergétique
27 21
27
Produits agricoles
56 53
58
Nomenclature combinée
FC / 02
Cluster 1 European Communities, Member
States, EUROPEAN COMMUNITIES,
Council Regulation, Having regard to
Council Regulation and Official
Journal
Cluster 2 Comunidade Europeia (European
Community), Nomenclatura
Combinada (Combined Nomenclature),
COMUNIDADES EUROPEIAS and
directamente aplicável (directly
applicable)
Cluster 3 Communauté européenne,
nomenclature combinée, États
membres, COMMUNAUTÉS
EUROPÉENNES and directement
applicable
FC / 02
Cluster 1.1
Cluster 1.2
Cluster 1.3
Rational use of energy, energy
consumption and rational use
Agricultural products, Official Journal,
detailed rules, Official Journal of the
European Communities, proposal from the
Commission, publication in the Official
Journal and entirely and directly
Combined nomenclature, Common
Customs, customs authorities, No 2658/87,
goods described, general rules, appropriate
CN, Having regard to Council Regulation,
tariff and statistical and Customs Code
FC / 02
Cluster 2.1 economia de energia (energy saving), utilização
racional (rational use), racional da energia
(rational of energy) and consumo de energia
(energy consuming)
Cluster 2.2 produtos agrícolas, Comunidades Europeias,
Jornal Oficial (Official Journal), directamente
aplicável (directly apllicable), COMUNIDADES
EUROPEIAS, Jornal Oficial das Comunidades
(Official Journal of the Communities),
directamente aplicável em todos os Estadosmembros (directly apllicable to all Member States),
publicação no Jornal Oficial (publication in the
Official Journal), publicação no Jornal Official das
Comunidades and Parlamento Europeu (European
Parliament)
Cluster 2.3 Nomenclatura Cominada, autoridades aduaneiras
(customs authorities), indicados na coluna
(indicated in the column), mercadorias descritas
(goods described), informações pautais vinculativas
(binding tariff informations), Aduaneira Comum
(Common Customs) regras gerais (general rules),
códigos NC (NC codes) and COMUNIDADES
EUROPEIAS
FC / 02
Cluster 3.1 politique énergétique (energy policy), rationnelle de
énergie and l’utilization rationnelle
Cluster 3.2 produits agricoles, organisation commune
(common organization), organisation commune des
marchés (common organization of the markets),
directment applicable, Journal officiel, Journal
officiel des Communautés and COMMUNAUTÉS
EUROPÉENNES
Cluster 3.3 Nomenclature combinée, autorités douanières
(customs authorities), nomenclature tarifaire, No
2658/87, marchandises décrites (goods described),
tarifaire et statistique (tariff and statistical) and
COMMUNAUTÉS EUROPÉENES
Ordered observations
FC / 02
2.5
2
1.5
1
0.5
0
-0.5
-1
-1.5
-2
-2.5
-2.5
Assessing Normality of Data
-2
-1.5
-1
-0.5
0
0.5
1
Standard normal quantiles
1.5
2
2.5
FC / 02
Ordered distances
25
20
15
10
Assessing Normality of Data
5
5
10
15
Chi-square percentiles
20
25
FC / 02
Data Transformations to approximate to Normality
For each column of Q, each observation can be modified
from y to y( ) [Box and Cox, 64]
y ( ) 
y ( )
y  1

 ln y
for
for
0
 0
 is chosen such that l() is maximized
j m
m 1 j m (  )
l ( )   ln[  ( y j  y ( ) ) 2 ]  (  1)  ln y j
2 m j 1
j 1
m is the number of elements of the cluster
y
( )
1 j m ( )
  yj
m j 1
Document Classification
• Representing the new Document with k
components
T
We need: x j  [v1, j , v2, j ,...vk , j ]
Where
vi , j
is the value of the document j for the component i
But we have:
y  [ z , z ,...,z ]
T
j
*
1, j
where
*
i ,l
*
2, j
*
p, j
z
is the weighted occurrence of the document l for Relevant
Expression (RE) i, and p the number of REs
Translate
y
T to
j
x
T
j
x  [v1, j , v2, j ,...vk , j ]
T
j
So, let
1/ 2
v  s PΛ
T
j
T
j
with S  PΛ PT
S The similarity document matrix, P
the eigenvectors matrix, Λ the
eigenvalues diagonal matrix, and sTj
is a vector of similarities
1
sTj 
p 1
y Tj Z
1/ 2
So, v  s PΛ is a vector with n (the number
of documents) elements, i.e. vT  [v , v ,...v ]
T
j
T
j
j
1, j
2, j
Vector x  [v1, j , v2, j ,...vk , j ] has the first k
elements (components) of the vector
T
j
v  [v1, j , v2, j ,...vn, j ]
T
j
n, j
Quadratic Discrimination Score
1
1
d (x )   ln(| Σi |)  (x  μi )T i1 (x  μi )  ln( pi )
2
2
Q
i
.Document class i is represented by cluster i
and
is estimated by SC (cov matrix
between components)

SCl ,m
1 i n

(cl ,i  cl ,. )(cm,i  cm,. )

n  1 i 1
 SC1,1
 SC
1, 2
SC 
 

 SC1,k
1 i n
cl ,.   cl ,i
n i 1
SC1, 2  SC1,k 
SC2, 2  SC2,k 



 

SC2,k  SCk ,k 
 is estimated by the vector of means
c T  [c1,., c2,.,...,ck ,. ]
A Criterion for Classifying
Documents
Be x the vector of components for a
document to classify and  r a class
represented by cluster r
x belongs to class
r
if
d (x)  maxi (d (x))  d (x)  minj (d (c j ))
Q
r
Q
i
Q
r
Q
r
i = 1, 2, … g ; j=1,2, …, n
Q
r
d (x) is the quadratic score for x
g is the number of classes; n is the number
of documents of the cluster r;
is a
j
document of cluster r
c
Results
•
•
•
•
Average Precision: 93%
Average Recall: 93%
Average Precision on rejection: 91.5%
Average Recall on rejection: 100%
FC / 02
Conclusions
.Unsupervised statistics-based and
language independent approach
.No pre-defined topics, features or
document descriptors
. Number of clusters not left to user
choice
.About 80% of the clusters REs can
be taken as Topics
FC / 02
Other Applications
NLP Lexicon enrichment
Parsing precision
Attachment decision
 Rewriting Grammar Rules
 Sequences of Strongly connected Characters
.Information Retrieval

FC / 02
Applying the Fair Dispersion to the 2(.):
(f(x, y)  N  f(x)  f(y)) 2
Φ ((x, y)) 
f(x)  f(y)  (N  f(x))  (N  f(y))
2
 f (w ...w )  N  Avp 
2
 2 _ f (( w1...wn)) 
1
n
Avp  ( N  Avx )  ( N  Avy )
1 i  n1
Avp 
 p(w1 ...wi )  p(wi 1 ...wn )
n  1 i 1
1 i  n1
Avx 
 f(w1 ...wi )
n  1 i 1
1 i n
Avy 
 f(wi ...wn )
n  1 i 2
FC / 02
Results for Contiguous MWUs
The scores for the several statistics-based
measures
Statisticsbased
measure: g(.)=
SCP_f(.)
SI_f(.)
 _f(.)
Dice_f(.)
81.00%
75.00%
76.00%
58.00%
Extracted
MWUs
(count)
24476
20906
24711
32381
LogLike_f(.)
53.00%
40602
2
Precision
(average)
LUSA Corpus containing 919,253 words.
Evaluation Criterion: Compound nouns,
locutions, frozen forms and Relevant
Expressions are correct (MWUs).
FC / 02
Multilingual Corpora ( Eupopean Parliament debates)
No morpho-syntactic filters used
Evaluation Criterion: MWU / Relevant Expressions are
correct extractions
LocalMaxs and SCP_f scores for Different Languages
Language
English
French
German
Medieval
Portuguese
77.00%
76.00%
75.00%
Extracted
MWUs
(count)
8017
8980
5190
73.00%
5451
Precision
Corpus size
493191
512031
454750
377724
FC / 02
Evaluation Criterion:
From aTagged Corpus [Marques&Lopes] (1,194,206 words)
 verb forms changed to infinitive forms
Ex: - arredar pé (to leave)
- estar para chegar (to be about arriving)
- ter pela frente (to face)
The Scores for the Contiguous Compound
Verb Extractions
Form
2-gram
3-gram
Precision Extracted
compound verbs
81.00%
108
73.00%
492
FC / 02
MWUs Samples
Universidade Autodidacta
Universidade Nova
Universidade Tecnica
Universidade Técnica
Universidades Portuguesas
Associacao de Estudantes da Universidade do Algarve
* cento dos estudantes da Universidade de Coimbra
reitor da Universidade Nova de Lisboa
Faculdade de Economia da Universidade Nova
FC / 02
English
sub-Thatcherite theology
sine qua non
deformed by the removal of a tumour
Vocational Training
Reform of the common agricultural
Council of Agriculture Ministers
Common agricultural policy
Spread of Organized Crime
Sanz Fernández
SIR JACK STEWART-CLARK
Royal Society
Richard Attenborough
Red Cross
LUCAS PIRES
Henry the Navigator
FC / 02
French
Contrôle de la croissance démographique
Infrastructures nécessaires
Résolutions adoptées
Président du tribunal
Mise en marché commune
Protection de la petite enfance
Miskito Tawahka Pech
Protection du touriste
Drame algérien
Commission a données aux amendements adoptés
Sécurité de nos approvisionnements énergétiques
Directive relative à la sécurité
FC / 02
German
Annahme einer Richtlinie über die Werbung (consideration
of a directrix about publicity)
Rechte beim Gerichtshof der Europäischen Gemeinschaft
geltend (European Community Parliament's Rights in force)
Algerischen Volkes (Algerian people)
Gefahr für die Volksgesundheit (Danger for public health)
Zusammensetzung der Ausschüsse und Delegationen
(composition of the ? and delegation)
Schaffung des EWR (creation of the EWR)
Währung und Industriepolitik über den Vorschlag (metal
money and industrial policy about a proposal)
Gemischten Parlamentarischen (mixed Parliament)
FC / 02
Using Tags as “Words” in LocalMaxs to obtain
Preference Selection for relative clauses or other clauses attachments.
_PR _ADV _ADV _V
que mais tipicamente corresponde
freq=2
_PR _ART _N _V que os mesmos derem
freq=6
_PR _N _PPOA cujo reexame se
freq=4
_PR _N _V _ART cuja realização impliquem a
freq=2
_PR _PPOA _V _VIRG _PREP _N que se desdobra , com normalidade
freq=2
_PR _V quem vier
freq=92
_PR _V _ART _PAR _N _PAR que precedeu a " deliberação "
freq=3
_PR _V _CONTR _N _ADJ _PREP _PIND que resultaram da análise sistemática
de todo
freq=2
_PREP _ADJ _N _PTO _V de particular importncia . Vejamos
freq=2
_PREP _ART _N _ADJ _ADJ _PREP de os serviços públicos municipais de freq=4
_PREP _ART _N _CONJCOORD _N _ADJ por o município ou municípios
concedentes
freq=5
FC / 02
Using Characters as “Words” in LocalMaxs to obtain
Strongly connected Sequences of Characters
(within words and between words) Ex:
trateg
trutur
xtracç
éctric
tratam
struíd
tradiç
trangeir
ntratuai
ncentraç
xtraí
xtrem
tritiva
trocas*
tráfico
trânsit
utras*in (utrazin)
FC / 02
. Information Retrieval
REs points to relevant information: Topics
and Subtopics
Ex. about “Human Rights”
Extracting its related Subject Matters (Topics
and Subtopics) … :
European Convention on Human Rights
European Court of Human Rights
Universal Declaration of Human Rights
European Commision of the Human Rights
Etc.
 Selection by Topic / Subtopic
Download

Acetatos de Text Mining - SSDI