Clustering: Introduction
Adriano Joaquim de O Cruz, ©2002 NCE/UFRJ
[email protected]

Introduction

What is cluster analysis?
- The process of grouping a set of physical or abstract objects into classes of similar objects.
- The class label of each class is unknown.
- Classification, in contrast, separates objects into classes whose labels are known.

What is cluster analysis? (cont.)
- Clustering is a form of learning by observation; neural networks, by comparison, learn by examples.
- It is a form of unsupervised learning.

Applications
- In business, it helps to discover distinct groups of customers.
- In data mining, it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.
- It serves as a pre-processing step for classification.
- Pattern recognition.

Requirements
- Scalability: work with large databases.
- Ability to deal with different types of attributes (not only interval-scaled data).
- Clusters of arbitrary shape, not only spherical.
- Minimal requirements for domain knowledge.
- Ability to deal with noisy data.

Requirements (cont.)
- Insensitivity to the order of input records.
- Ability to work with samples of high dimensionality.
- Constraint-based clustering.
- Interpretability and usability: results should be easily interpretable.

Sensitivity to Input Order
- Some algorithms are sensitive to the order of the input data; the Leader algorithm is an example.
- [Figure: the same six points grouped differently depending on presentation order. Ellipse: 2 1 3 5 4 6; Triangle: 1 2 6 4 5 3]

Clustering Techniques

Heuristic Clustering Techniques
- Incomplete or heuristic clustering: geometrical methods or projection techniques.
- Dimension-reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
- Heuristic methods based on visualisation are then used to determine the clusters.

Deterministic Crisp Clustering
- Each datum is assigned to exactly one cluster.
- Each cluster partition defines an ordinary partition of the data set.

Overlapping Crisp Clustering
- Each datum is assigned to at least one cluster; elements may belong to more than one cluster.

Probabilistic Clustering
- For each element, a probability distribution over the clusters is determined; the distribution specifies the probability with which a datum is assigned to a cluster.
- If the probabilities are interpreted as degrees of membership, these become fuzzy clustering techniques.

Possibilistic Clustering
- Degrees of membership or possibility indicate to what extent a datum belongs to the clusters.
- Possibilistic cluster analysis drops the constraint that the sum of the memberships of each datum over all clusters equals one.

Hierarchical Clustering
- Descending (divisive) techniques divide the data into ever finer-grained classes.
- Ascending (agglomerative) techniques combine small classes into coarser-grained ones.

Objective Function Clustering
- An objective function assigns to each cluster partition a value that has to be optimised; this is strictly an optimisation problem.
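As an illustration, here is a minimal sketch in Python of one common objective of this kind, the within-cluster sum of squared errors; the function name `sse_objective` and the toy data are ours, not from the slides.

```python
import numpy as np

def sse_objective(data, labels, centroids):
    """Within-cluster sum of squared errors: one common objective
    function assigned to a cluster partition."""
    total = 0.0
    for k, c in enumerate(centroids):
        members = data[labels == k]          # points assigned to cluster k
        total += np.sum((members - c) ** 2)  # squared distances to centroid
    return total

# Toy usage: two obvious clusters in the plane.
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([data[labels == k].mean(axis=0) for k in (0, 1)])
print(sse_objective(data, labels, centroids))  # small value = good partition
```

A lower value indicates a tighter partition, which is why partition methods treat clustering as the optimisation problem described above.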
Data Types

Data Types
- Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
- Binary variables have only two states. Ex.: smoker, fever, client, owner.
- Nominal variables are a generalisation of binary variables with m states. Ex.: map colour, marital status.

Data Types (cont.)
- Ordinal variables are ordered nominal variables. Ex.: Olympic medals, professional ranks.
- Ratio-scaled variables have a non-linear scale. Ex.: the growth of a bacterial population.

Interval-scaled Variables
- Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
- Interval-scaled variables depend on the units used; the measurement unit can affect the analysis, so standardisation should be used.

Problems

Person  Age (yr)  Height (cm)
A       35        190
B       40        190
C       35        160
D       40        160

With height measured in centimetres, the distances between these people are dominated by height; expressed in metres, age would dominate instead. The measurement unit changes the clustering.

Standardisation
- Converts original measurements to unitless values.
- Attempts to give all variables equal weight.
- Useful when there is no prior knowledge of the data.

Standardisation Algorithm
- Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
- The transformed scores have a mean of zero and a standard deviation of one.
- Useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.

Standardisation Algorithm (cont.)
- Consider n values $x_1, \dots, x_n$ of a variable x.
- Calculate the mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Calculate the standard deviation: $\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
- Calculate the z-score: $z_i = \frac{x_i - \bar{x}}{\sigma_x}$

Z-scores Example

Sample   Heights  Ages  Z-heights  Z-ages
1        137.16   10    -0.45      -0.61
2        195.58   25     1.58       0.39
3        170.18   55     0.70       2.39
4        172.73   32     0.79       0.86
5        116.84    8    -1.16      -0.74
6        162.56   11     0.43      -0.54
7        157.48    9     0.26      -0.67
8        142.24   15    -0.28      -0.27
9         96.52    7    -1.87      -0.81
Mean     150.14   19.11  0          0
Std dev   28.67   15.01  1          1

Real Heights and Ages Charts
[Chart: heights and ages per sample, raw values]

Z-scores for Heights and Ages
[Chart: z-scores of heights and ages per sample]

Data Charts
[Chart: scatter plot of ages versus heights, raw data]
[Chart: scatter plot of ages versus heights, z-scored data]

Similarities

Data Matrices
- Data matrix: represents n objects with p characteristics. Ex.: person = {age, sex, income, ...}.
- Dissimilarity matrix: represents a collection of dissimilarities between all pairs of objects.

Dissimilarities
- Dissimilarity measures some form of distance between objects.
- Clustering algorithms use dissimilarities to cluster data.
- How can dissimilarities be measured?
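Before turning to concrete measures, the standardisation algorithm above is short enough to sketch directly. This minimal sketch reproduces the z-score table, assuming NumPy and the population standard deviation (which is what the table's values match).

```python
import numpy as np

def z_scores(x):
    """Standardise a variable: subtract the mean and divide by the
    standard deviation, giving mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()  # np.std defaults to population std

# Heights (cm) and ages (yr) from the example table above.
heights = np.array([137.16, 195.58, 170.18, 172.73, 116.84,
                    162.56, 157.48, 142.24, 96.52])
ages = np.array([10, 25, 55, 32, 8, 11, 9, 15, 7])
print(np.round(z_scores(heights), 2))  # -0.45, 1.58, 0.70, ...
print(np.round(z_scores(ages), 2))     # -0.61, 0.39, 2.39, ...
```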
How to Calculate Dissimilarities?
- The most popular methods are based on the distance between pairs of objects.
- Minkowski distance: $d(x_i, x_k) = \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^q \right)^{1/q}$
- p is the number of characteristics and q is the distance type: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.

Similarities
- It is also possible to work with similarities $s(x_i, x_j)$, where $0 \le s(x_i, x_j) \le 1$, $s(x_i, x_i) = 1$ and $s(x_i, x_j) = s(x_j, x_i)$.
- One may then take $d(x_i, x_j) = 1 - s(x_i, x_j)$.

Distances
Euclidean (q = 2) and Manhattan (q = 1) distances of each sample from the sample mean, on the raw data and on the z-scores:

Sample   Heights  Ages  Z-heights  Z-ages  Euclid.(raw)  Manh.(raw)  Euclid.(z)  Manh.(z)
1        137.16   10    -0.45      -0.61   15.8613       22.0944     0.7574      1.0599
2        195.58   25     1.58       0.39   45.8167       51.3256     1.6325      1.9770
3        170.18   55     0.70       2.39   41.1033       55.9256     2.4915      3.0903
4        172.73   32     0.79       0.86   26.0054       35.4756     1.1654      1.6466
5        116.84    8    -1.16      -0.74   35.1080       44.4144     1.3774      1.9018
6        162.56   11     0.43      -0.54   14.8312       20.5278     0.6926      0.9735
7        157.48    9     0.26      -0.67   12.4924       17.4478     0.7207      0.9296
8        142.24   15    -0.28      -0.27    8.9086       12.0144     0.3886      0.5496
9         96.52    7    -1.87      -0.81   54.9740       65.7344     2.0368      2.6771
Mean     150.14   19.11  0          0
Std dev   28.67   15.01  1          1

Dissimilarities
- There are other ways to obtain dissimilarities, so we no longer speak only of distances.
- Basically, dissimilarities are non-negative numbers d(i, j) that are small (close to 0) when i and j are similar.

Pearson
- Pearson product-moment correlation between variables f and g; the coefficients lie between -1 and +1:
  $R(f,g) = \frac{\sum_{i=1}^{n}(x_{if} - m_f)(x_{ig} - m_g)}{\sqrt{\sum_{i=1}^{n}(x_{if} - m_f)^2}\,\sqrt{\sum_{i=1}^{n}(x_{ig} - m_g)^2}}$

Pearson (cont.)
- A correlation of +1 means that there is a perfect positive linear relationship between the variables.
- A correlation of -1 means that there is a perfect negative linear relationship between the variables.
- A correlation of 0 means there is no linear relationship between the two variables.

Pearson Example
- Example values (they refer to variables in a chart omitted here): r_yz = 0.9861; r_yw = -0.9551; r_yr = 0.2770.

Correlation and Dissimilarities (1)
- d(f,g) = (1 - R(f,g)) / 2
- Variables with a high positive correlation (+1) receive a dissimilarity close to 0.
- Variables with a strongly negative correlation will be considered very dissimilar.

Correlation and Dissimilarities (2)
- d(f,g) = 1 - |R(f,g)|
- Variables with a high positive correlation (+1) or a high negative correlation (-1) receive a dissimilarity close to 0.

Numerical Example

Name   Weight  Height  Month  Year
Ilan   15      95      1      82
Jack   49      156     5      55
Kim    13      95      11     81
Lieve  45      160     7      56
Leon   85      178     6      48
Peter  66      176     6      56
Talia  12      90      12     83
Tina   10      78      1      84

Numerical Example (cont.)

Correlations:
         Weight  Height  Month  Year
Weight    1
Height    0.957   1
Month    -0.036   0.021   1
Year     -0.953  -0.985   0.013  1

Dissimilarities by formula (1), d = (1 - R)/2:
         Weight  Height  Month  Year
Weight    0
Height    0.021   0
Month     0.518   0.489   0
Year      0.977   0.992   0.493  0

Dissimilarities by formula (2), d = 1 - |R|:
         Weight  Height  Month  Year
Weight    0
Height    0.043   0
Month     0.964   0.979   0
Year      0.047   0.015   0.987  0
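Both families of measures are easy to sketch. The helpers below are our own illustrative names; they reproduce values from the tables above (Minkowski distances, and the two correlation-based dissimilarities).

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two objects:
    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q)

def corr_dissimilarity(f, g, method=1):
    """Pearson-based dissimilarity between two variables:
    method 1 -> (1 - R)/2, method 2 -> 1 - |R|."""
    r = np.corrcoef(f, g)[0, 1]
    return (1 - r) / 2 if method == 1 else 1 - abs(r)

# Variables from the numerical example above.
weight = [15, 49, 13, 45, 85, 66, 12, 10]
height = [95, 156, 95, 160, 178, 176, 90, 78]
year = [82, 55, 81, 56, 48, 56, 83, 84]
print(round(corr_dissimilarity(weight, height), 3))   # ~0.021, formula (1)
print(round(corr_dissimilarity(height, year, 2), 3))  # ~0.015, formula (2)
print(minkowski([0, 0], [3, 4]))                      # 5.0 (Euclidean)
```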
Binary Variables
- Binary variables have only two states; the states can be symmetric or asymmetric.
- Binary variables are symmetric if both states are equally valuable. Ex.: gender.
- When the states are not equally important, the variable is asymmetric. Ex.: disease tests (1 = positive, 0 = negative).

Contingency Tables
- Consider objects described by p binary variables:
  q variables are equal to 1 on both objects i and j;
  r variables are 1 on object i and 0 on object j;
  s variables are 0 on object i and 1 on object j;
  t variables are 0 on both.

                Object j = 1   Object j = 0   Sum
Object i = 1    q              r              q + r
Object i = 0    s              t              s + t
Sum             q + s          r + t          p

Symmetric Variables
- A dissimilarity based on symmetric variables is invariant: the result does not change when the two states are interchanged.
- Simple matching dissimilarity coefficient: $d(x_i, x_j) = \frac{r + s}{q + r + s + t}$
- The corresponding similarity: $s(x_i, x_j) = \frac{q + t}{q + r + s + t}$

Asymmetric Variables
- A similarity based on asymmetric variables is not invariant: two 1s are more important than two 0s.
- Jaccard coefficient: $d(x_i, x_j) = \frac{r + s}{q + r + s}$, with similarity $s(x_i, x_j) = \frac{q}{q + r + s}$

Computing Dissimilarities

Name   Fever  Cough  Test1  Test2  Test3  Test4
Jack   Y      N      P      N      N      N
Mary   Y      N      P      N      P      N
Jim    Y      Y      N      N      N      N

Counting q, r, s, t for Jack versus Mary (Y/P encoded as 1, N as 0):

          Fever  Cough  Test1  Test2  Test3  Test4  Total
q (1,1)   1      0      1      0      0      0      2
r (1,0)   0      0      0      0      0      0      0
s (0,1)   0      0      0      0      1      0      1
t (0,0)   0      1      0      1      0      1      3

Computing Dissimilarities (cont.)
- d(Jack, Mary) = (r + s)/(q + r + s) = (0 + 1)/(2 + 0 + 1) = 0.33
- d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67
- d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75
- Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.

Nominal Variables
- A nominal variable is a generalisation of the binary variable: it can take more than two states. Ex.: marital status: married, single, divorced.
- Each state can be represented by a number or a letter; there is no specific ordering.

Computing Dissimilarities for Nominal Variables
- Consider two objects i and j, each described by p nominal characteristics:
  $d(i, j) = \frac{p - m}{p}$, where m is the number of matches.

Binarising Nominal Variables
- A nominal variable can be encoded by creating a new binary variable for each state.
- Example: marital status = {married, single, divorced} gives three binary variables, married (1 = yes, 0 = no), single and divorced; marital status = {married} is encoded as married = 1, single = 0, divorced = 0.
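Returning to the binary case, here is a sketch that reproduces the Jack/Mary/Jim dissimilarities computed above. The function name `binary_dissimilarity` and the 0/1 encoding (Y/P as 1, N as 0, as in the counting table) are our illustrative choices.

```python
def binary_dissimilarity(a, b, asymmetric=True):
    """Dissimilarity between two binary vectors from the q/r/s/t
    contingency counts; the Jaccard (asymmetric) form ignores
    the 0-0 matches."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    if asymmetric:
        return (r + s) / (q + r + s)      # Jaccard coefficient
    return (r + s) / (q + r + s + t)      # simple matching coefficient

# Attributes: fever, cough, Test1..Test4 (Y/P = 1, N = 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(binary_dissimilarity(jim, mary), 2))   # 0.75
```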
Ordinal Variables
- A discrete ordinal variable is similar to a nominal variable, except that its states are ordered in a meaningful sequence.
- Ex.: bronze, silver and gold medals; assistant, associate, full member.

Computing Dissimilarities for Ordinal Variables
- Consider n objects defined by a set of ordinal variables; f is one of these variables and has $M_f$ states, which define the ranking $r_f \in \{1, \dots, M_f\}$.

Steps to Calculate Dissimilarities
- Assume that the value of f for the i-th object is $x_{if}$.
- Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \dots, M_f\}$.
- Since the number of states of each variable differs, it is often necessary to map the ranks onto [0.0, 1.0] using $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
- Dissimilarity can then be computed using the distance measures for interval-scaled variables.

Ratio-scaled Variables
- Variables on a non-linear scale, such as an exponential one.
- There are three methods to compute dissimilarities:
  - Treat them as interval-scaled (not always good).
  - Apply a transformation such as y = log(x) and treat the result as interval-scaled.
  - Treat them as ordinal data and use the ranks as interval-scaled values.

Variables of Mixed Types
- One technique is to bring all variables onto a common scale on the interval [0.0, 1.0].
- Suppose the data set contains p variables of mixed type. The dissimilarity between i and j is
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
- The indicator is defined by
  $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ does not exist, or if $x_{if} = x_{jf} = 0$ and f is asymmetric; $\delta_{ij}^{(f)} = 1$ otherwise.

Variables of Mixed Types (cont.)
- The contribution of each variable depends on its type:
  - f binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise 1.
  - f interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max(x_f) - \min(x_f)}$
  - f ordinal or ratio-scaled: compute ranks and treat them as interval-based.
- A sketch of this computation follows.
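This minimal sketch implements the mixed-type recipe above; the record layout, the type labels ('interval', 'nominal', 'asym') and the `ranges` parameter are illustrative assumptions, not part of the slides.

```python
def mixed_dissimilarity(x, y, types, ranges):
    """Mixed-type dissimilarity: the delta-weighted average of the
    per-variable contributions d_ij^(f). `types` holds 'interval',
    'nominal' or 'asym' (asymmetric binary); `ranges` holds
    (min, max) for interval variables, None otherwise."""
    num = den = 0.0
    for xf, yf, t, rng in zip(x, y, types, ranges):
        if xf is None or yf is None:        # delta = 0: missing value
            continue
        if t == "asym" and xf == yf == 0:   # delta = 0: 0-0 asymmetric match
            continue
        if t == "interval":
            lo, hi = rng
            d = abs(xf - yf) / (hi - lo)    # normalised absolute difference
        else:                               # nominal or binary variable
            d = 0.0 if xf == yf else 1.0
        num += d
        den += 1.0
    return num / den if den else 0.0

# Hypothetical records: (height in cm [interval], marital status
# [nominal], test result [asymmetric binary]).
types = ("interval", "nominal", "asym")
ranges = ((78, 196), None, None)
print(round(mixed_dissimilarity((170, "single", 1),
                                (160, "married", 0), types, ranges), 3))
```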
Clustering Methods
- Clustering is an unsupervised method.
- The main families of methods are: partitioning, hierarchical, density-based, grid-based and model-based.

Partitioning Methods
- Given n objects, k partitions are created; each partition must contain at least one element.
- An iterative relocation technique is used to improve the partitioning; distance is the usual criterion.

Partitioning Methods (cont.)
- They work well for finding spherical-shaped clusters, but they are not efficient on very large databases.
- K-means: each cluster is represented by the mean value of the objects in the cluster.
- K-medoids: each cluster is represented by an object near the centre of the cluster.

Hierarchical Methods
- Create a hierarchical decomposition of the data set.
- Agglomerative approaches start with each object forming a separate group, then merge objects or groups until all objects belong to one group or a termination condition occurs.
- Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.

Hierarchical Clustering (cont.)
- The definition of cluster proximity matters:
  - Min (most similar pair): sensitive to noise.
  - Max (most dissimilar pair): tends to break large clusters.

Density-based Methods
- These methods grow clusters as long as the density in the neighbourhood exceeds some threshold.
- They are able to find clusters of arbitrary shapes.

Grid-based Methods
- Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
- Cells that contain more than a certain number of elements are treated as dense, and dense cells are connected to form clusters.
- Fast processing time, independent of the number of objects.
- STING and CLIQUE are examples.

Model-based Methods
- Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.
- Examples: statistical models, SOM networks.

Partition Methods
- Given a database of n objects, a partition method organises them into k clusters (k <= n).
- The methods try to minimise an objective function, such as distance, so that similar objects are "close" to each other.
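Since k-means is the canonical partition method named above, here is a minimal sketch of its relocation loop; the initialisation and the fixed iteration count are simplifications (a real implementation would track convergence), and the toy data are ours.

```python
import numpy as np

def k_means(data, k, iters=20, seed=0):
    """A minimal k-means: alternate between assigning each object to
    its nearest centroid and relocating each centroid to the mean of
    its cluster, minimising the within-cluster squared distance."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocate each non-empty cluster's centroid to its members' mean.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# Toy usage: two well-separated groups in the plane.
data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
labels, centroids = k_means(data, k=2)
print(labels)     # e.g. [0 0 1 1]
print(centroids)  # means of the two groups
```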