Clustering: Introduction
Adriano Joaquim de O Cruz, ©2002 NCE/UFRJ
[email protected]

Introduction

What is cluster analysis?
- The process of grouping a set of physical or abstract objects into classes of similar objects.
- The class label of each class is unknown.
- Classification, in contrast, separates objects into classes whose labels are known.

What is cluster analysis? (cont.)
- Clustering is a form of learning by observation; neural networks, by comparison, learn by examples.
- It is a form of unsupervised learning.

Applications
- In business, it helps to discover distinct groups of customers.
- In data mining, it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.
- It serves as a pre-processing step for classification.
- Pattern recognition.

Requirements
- Scalability: work with large databases.
- Ability to deal with different types of attributes (not only interval-scaled data).
- Clusters of arbitrary shape, not only spherical.
- Minimal requirements for domain knowledge.
- Ability to deal with noisy data.

Requirements (cont.)
- Insensitivity to the order of input records.
- Ability to work with samples of high dimensionality.
- Constraint-based clustering.
- Interpretability and usability: results should be easily interpretable.

Sensitivity to Input Order
- Some algorithms are sensitive to the order of the input data; the Leader algorithm is an example.
- [Figure: the same six points grouped differently depending on presentation order. Ellipse: 2 1 3 5 4 6; Triangle: 1 2 6 4 5 3]

Clustering Techniques

Heuristic Clustering Techniques
- Incomplete or heuristic clustering: geometrical methods or projection techniques.
- Dimension-reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
- Heuristic methods based on visualisation are then used to determine the clusters.

Deterministic Crisp Clustering
- Each datum is assigned to exactly one cluster.
- Each cluster partition defines an ordinary partition of the data set.

Overlapping Crisp Clustering
- Each datum is assigned to at least one cluster; elements may belong to more than one cluster.

Probabilistic Clustering
- For each element, a probability distribution over the clusters is determined; the distribution specifies the probability with which a datum is assigned to a cluster.
- If the probabilities are interpreted as degrees of membership, these become fuzzy clustering techniques.

Possibilistic Clustering
- Degrees of membership or possibility indicate to what extent a datum belongs to the clusters.
- Possibilistic cluster analysis drops the constraint that the sum of the memberships of each datum over all clusters equals one.

Hierarchical Clustering
- Descending (divisive) techniques divide the data into ever finer-grained classes.
- Ascending (agglomerative) techniques combine small classes into coarser-grained ones.

Objective Function Clustering
- An objective function assigns to each cluster partition a value that has to be optimised; this is strictly an optimisation problem.
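As an illustration, here is a minimal sketch in Python of one common objective of this kind, the within-cluster sum of squared errors; the function name `sse_objective` and the toy data are ours, not from the slides.

```python
import numpy as np

def sse_objective(data, labels, centroids):
    """Within-cluster sum of squared errors: one common objective
    function assigned to a cluster partition."""
    total = 0.0
    for k, c in enumerate(centroids):
        members = data[labels == k]          # points assigned to cluster k
        total += np.sum((members - c) ** 2)  # squared distances to centroid
    return total

# Toy usage: two obvious clusters in the plane.
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([data[labels == k].mean(axis=0) for k in (0, 1)])
print(sse_objective(data, labels, centroids))  # small value = good partition
```

A lower value indicates a tighter partition, which is why partition methods treat clustering as the optimisation problem described above.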
Data Types

Data Types
- Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
- Binary variables have only two states. Ex.: smoker, fever, client, owner.
- Nominal variables are a generalisation of binary variables with m states. Ex.: map colour, marital status.

Data Types (cont.)
- Ordinal variables are ordered nominal variables. Ex.: Olympic medals, professional ranks.
- Ratio-scaled variables have a non-linear scale. Ex.: the growth of a bacterial population.

Interval-scaled Variables
- Interval-scaled variables are continuous measurements on a linear scale. Ex.: height, weight, temperature.
- Interval-scaled variables depend on the units used; the measurement unit can affect the analysis, so standardisation should be used.

Problems

Person  Age (yr)  Height (cm)
A       35        190
B       40        190
C       35        160
D       40        160

With height measured in centimetres, the distances between these people are dominated by height; expressed in metres, age would dominate instead. The measurement unit changes the clustering.

Standardisation
- Converts original measurements to unitless values.
- Attempts to give all variables equal weight.
- Useful when there is no prior knowledge of the data.

Standardisation Algorithm
- Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
- The transformed scores have a mean of zero and a standard deviation of one.
- Useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.

Standardisation Algorithm (cont.)
- Consider n values $x_1, \dots, x_n$ of a variable x.
- Calculate the mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Calculate the standard deviation: $\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
- Calculate the z-score: $z_i = \frac{x_i - \bar{x}}{\sigma_x}$

Z-scores Example

Sample   Heights  Ages  Z-heights  Z-ages
1        137.16   10    -0.45      -0.61
2        195.58   25     1.58       0.39
3        170.18   55     0.70       2.39
4        172.73   32     0.79       0.86
5        116.84    8    -1.16      -0.74
6        162.56   11     0.43      -0.54
7        157.48    9     0.26      -0.67
8        142.24   15    -0.28      -0.27
9         96.52    7    -1.87      -0.81
Mean     150.14   19.11  0          0
Std dev   28.67   15.01  1          1

Real Heights and Ages Charts
[Chart: heights and ages per sample, raw values]

Z-scores for Heights and Ages
[Chart: z-scores of heights and ages per sample]

Data Charts
[Chart: scatter plot of ages versus heights, raw data]
[Chart: scatter plot of ages versus heights, z-scored data]

Similarities

Data Matrices
- Data matrix: represents n objects with p characteristics. Ex.: person = {age, sex, income, ...}.
- Dissimilarity matrix: represents a collection of dissimilarities between all pairs of objects.

Dissimilarities
- Dissimilarity measures some form of distance between objects.
- Clustering algorithms use dissimilarities to cluster data.
- How can dissimilarities be measured?
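Before turning to concrete measures, the standardisation algorithm above is short enough to sketch directly. This minimal sketch reproduces the z-score table, assuming NumPy and the population standard deviation (which is what the table's values match).

```python
import numpy as np

def z_scores(x):
    """Standardise a variable: subtract the mean and divide by the
    standard deviation, giving mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()  # np.std defaults to population std

# Heights (cm) and ages (yr) from the example table above.
heights = np.array([137.16, 195.58, 170.18, 172.73, 116.84,
                    162.56, 157.48, 142.24, 96.52])
ages = np.array([10, 25, 55, 32, 8, 11, 9, 15, 7])
print(np.round(z_scores(heights), 2))  # -0.45, 1.58, 0.70, ...
print(np.round(z_scores(ages), 2))     # -0.61, 0.39, 2.39, ...
```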
How to Calculate Dissimilarities?
- The most popular methods are based on the distance between pairs of objects.
- Minkowski distance: $d(x_i, x_k) = \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^q \right)^{1/q}$
- p is the number of characteristics and q is the distance type: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.

Similarities
- It is also possible to work with similarities $s(x_i, x_j)$, where $0 \le s(x_i, x_j) \le 1$, $s(x_i, x_i) = 1$ and $s(x_i, x_j) = s(x_j, x_i)$.
- One may then take $d(x_i, x_j) = 1 - s(x_i, x_j)$.

Distances
Euclidean (q = 2) and Manhattan (q = 1) distances of each sample from the sample mean, on the raw data and on the z-scores:

Sample   Heights  Ages  Z-heights  Z-ages  Euclid.(raw)  Manh.(raw)  Euclid.(z)  Manh.(z)
1        137.16   10    -0.45      -0.61   15.8613       22.0944     0.7574      1.0599
2        195.58   25     1.58       0.39   45.8167       51.3256     1.6325      1.9770
3        170.18   55     0.70       2.39   41.1033       55.9256     2.4915      3.0903
4        172.73   32     0.79       0.86   26.0054       35.4756     1.1654      1.6466
5        116.84    8    -1.16      -0.74   35.1080       44.4144     1.3774      1.9018
6        162.56   11     0.43      -0.54   14.8312       20.5278     0.6926      0.9735
7        157.48    9     0.26      -0.67   12.4924       17.4478     0.7207      0.9296
8        142.24   15    -0.28      -0.27    8.9086       12.0144     0.3886      0.5496
9         96.52    7    -1.87      -0.81   54.9740       65.7344     2.0368      2.6771
Mean     150.14   19.11  0          0
Std dev   28.67   15.01  1          1

Dissimilarities
- There are other ways to obtain dissimilarities, so we no longer speak only of distances.
- Basically, dissimilarities are non-negative numbers d(i, j) that are small (close to 0) when i and j are similar.

Pearson
- Pearson product-moment correlation between variables f and g; the coefficients lie between -1 and +1:
  $R(f,g) = \frac{\sum_{i=1}^{n}(x_{if} - m_f)(x_{ig} - m_g)}{\sqrt{\sum_{i=1}^{n}(x_{if} - m_f)^2}\,\sqrt{\sum_{i=1}^{n}(x_{ig} - m_g)^2}}$

Pearson (cont.)
- A correlation of +1 means that there is a perfect positive linear relationship between the variables.
- A correlation of -1 means that there is a perfect negative linear relationship between the variables.
- A correlation of 0 means there is no linear relationship between the two variables.

Pearson Example
- Example values (they refer to variables in a chart omitted here): r_yz = 0.9861; r_yw = -0.9551; r_yr = 0.2770.

Correlation and Dissimilarities (1)
- d(f,g) = (1 - R(f,g)) / 2
- Variables with a high positive correlation (+1) receive a dissimilarity close to 0.
- Variables with a strongly negative correlation will be considered very dissimilar.

Correlation and Dissimilarities (2)
- d(f,g) = 1 - |R(f,g)|
- Variables with a high positive correlation (+1) or a high negative correlation (-1) receive a dissimilarity close to 0.

Numerical Example

Name   Weight  Height  Month  Year
Ilan   15      95      1      82
Jack   49      156     5      55
Kim    13      95      11     81
Lieve  45      160     7      56
Leon   85      178     6      48
Peter  66      176     6      56
Talia  12      90      12     83
Tina   10      78      1      84

Numerical Example (cont.)

Correlations:
         Weight  Height  Month  Year
Weight    1
Height    0.957   1
Month    -0.036   0.021   1
Year     -0.953  -0.985   0.013  1

Dissimilarities by formula (1), d = (1 - R)/2:
         Weight  Height  Month  Year
Weight    0
Height    0.021   0
Month     0.518   0.489   0
Year      0.977   0.992   0.493  0

Dissimilarities by formula (2), d = 1 - |R|:
         Weight  Height  Month  Year
Weight    0
Height    0.043   0
Month     0.964   0.979   0
Year      0.047   0.015   0.987  0
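Both families of measures are easy to sketch. The helpers below are our own illustrative names; they reproduce values from the tables above (Minkowski distances, and the two correlation-based dissimilarities).

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two objects:
    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q)

def corr_dissimilarity(f, g, method=1):
    """Pearson-based dissimilarity between two variables:
    method 1 -> (1 - R)/2, method 2 -> 1 - |R|."""
    r = np.corrcoef(f, g)[0, 1]
    return (1 - r) / 2 if method == 1 else 1 - abs(r)

# Variables from the numerical example above.
weight = [15, 49, 13, 45, 85, 66, 12, 10]
height = [95, 156, 95, 160, 178, 176, 90, 78]
year = [82, 55, 81, 56, 48, 56, 83, 84]
print(round(corr_dissimilarity(weight, height), 3))   # ~0.021, formula (1)
print(round(corr_dissimilarity(height, year, 2), 3))  # ~0.015, formula (2)
print(minkowski([0, 0], [3, 4]))                      # 5.0 (Euclidean)
```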
Binary Variables
- Binary variables have only two states; the states can be symmetric or asymmetric.
- Binary variables are symmetric if both states are equally valuable. Ex.: gender.
- When the states are not equally important, the variable is asymmetric. Ex.: disease tests (1 = positive, 0 = negative).

Contingency Tables
- Consider objects described by p binary variables:
  q variables are equal to 1 on both objects i and j;
  r variables are 1 on object i and 0 on object j;
  s variables are 0 on object i and 1 on object j;
  t variables are 0 on both.

                Object j = 1   Object j = 0   Sum
Object i = 1    q              r              q + r
Object i = 0    s              t              s + t
Sum             q + s          r + t          p

Symmetric Variables
- A dissimilarity based on symmetric variables is invariant: the result does not change when the two states are interchanged.
- Simple matching dissimilarity coefficient: $d(x_i, x_j) = \frac{r + s}{q + r + s + t}$
- The corresponding similarity: $s(x_i, x_j) = \frac{q + t}{q + r + s + t}$

Asymmetric Variables
- A similarity based on asymmetric variables is not invariant: two 1s are more important than two 0s.
- Jaccard coefficient: $d(x_i, x_j) = \frac{r + s}{q + r + s}$, with similarity $s(x_i, x_j) = \frac{q}{q + r + s}$

Computing Dissimilarities

Name   Fever  Cough  Test1  Test2  Test3  Test4
Jack   Y      N      P      N      N      N
Mary   Y      N      P      N      P      N
Jim    Y      Y      N      N      N      N

Counting q, r, s, t for Jack versus Mary (Y/P encoded as 1, N as 0):

          Fever  Cough  Test1  Test2  Test3  Test4  Total
q (1,1)   1      0      1      0      0      0      2
r (1,0)   0      0      0      0      0      0      0
s (0,1)   0      0      0      0      1      0      1
t (0,0)   0      1      0      1      0      1      3

Computing Dissimilarities (cont.)
- d(Jack, Mary) = (r + s)/(q + r + s) = (0 + 1)/(2 + 0 + 1) = 0.33
- d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67
- d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75
- Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.

Nominal Variables
- A nominal variable is a generalisation of the binary variable: it can take more than two states. Ex.: marital status: married, single, divorced.
- Each state can be represented by a number or a letter; there is no specific ordering.

Computing Dissimilarities for Nominal Variables
- Consider two objects i and j, each described by p nominal characteristics:
  $d(i, j) = \frac{p - m}{p}$, where m is the number of matches.

Binarising Nominal Variables
- A nominal variable can be encoded by creating a new binary variable for each state.
- Example: marital status = {married, single, divorced} gives three binary variables, married (1 = yes, 0 = no), single and divorced; marital status = {married} is encoded as married = 1, single = 0, divorced = 0.
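Returning to the binary case, here is a sketch that reproduces the Jack/Mary/Jim dissimilarities computed above. The function name `binary_dissimilarity` and the 0/1 encoding (Y/P as 1, N as 0, as in the counting table) are our illustrative choices.

```python
def binary_dissimilarity(a, b, asymmetric=True):
    """Dissimilarity between two binary vectors from the q/r/s/t
    contingency counts; the Jaccard (asymmetric) form ignores
    the 0-0 matches."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    if asymmetric:
        return (r + s) / (q + r + s)      # Jaccard coefficient
    return (r + s) / (q + r + s + t)      # simple matching coefficient

# Attributes: fever, cough, Test1..Test4 (Y/P = 1, N = 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(binary_dissimilarity(jim, mary), 2))   # 0.75
```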
Ordinal Variables
- A discrete ordinal variable is similar to a nominal variable, except that its states are ordered in a meaningful sequence.
- Ex.: bronze, silver and gold medals; assistant, associate, full member.

Computing Dissimilarities for Ordinal Variables
- Consider n objects defined by a set of ordinal variables; f is one of these variables and has $M_f$ states, which define the ranking $r_f \in \{1, \dots, M_f\}$.

Steps to Calculate Dissimilarities
- Assume that the value of f for the i-th object is $x_{if}$.
- Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \dots, M_f\}$.
- Since the number of states of each variable differs, it is often necessary to map the ranks onto [0.0, 1.0] using $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
- Dissimilarity can then be computed using the distance measures for interval-scaled variables.

Ratio-scaled Variables
- Variables on a non-linear scale, such as an exponential one.
- There are three methods to compute dissimilarities:
  - Treat them as interval-scaled (not always good).
  - Apply a transformation such as y = log(x) and treat the result as interval-scaled.
  - Treat them as ordinal data and use the ranks as interval-scaled values.

Variables of Mixed Types
- One technique is to bring all variables onto a common scale on the interval [0.0, 1.0].
- Suppose the data set contains p variables of mixed type. The dissimilarity between i and j is
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
- The indicator is defined by
  $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ does not exist, or if $x_{if} = x_{jf} = 0$ and f is asymmetric; $\delta_{ij}^{(f)} = 1$ otherwise.

Variables of Mixed Types (cont.)
- The contribution of each variable depends on its type:
  - f binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise 1.
  - f interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max(x_f) - \min(x_f)}$
  - f ordinal or ratio-scaled: compute ranks and treat them as interval-based.
- A sketch of this computation follows.
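This minimal sketch implements the mixed-type recipe above; the record layout, the type labels ('interval', 'nominal', 'asym') and the `ranges` parameter are illustrative assumptions, not part of the slides.

```python
def mixed_dissimilarity(x, y, types, ranges):
    """Mixed-type dissimilarity: the delta-weighted average of the
    per-variable contributions d_ij^(f). `types` holds 'interval',
    'nominal' or 'asym' (asymmetric binary); `ranges` holds
    (min, max) for interval variables, None otherwise."""
    num = den = 0.0
    for xf, yf, t, rng in zip(x, y, types, ranges):
        if xf is None or yf is None:        # delta = 0: missing value
            continue
        if t == "asym" and xf == yf == 0:   # delta = 0: 0-0 asymmetric match
            continue
        if t == "interval":
            lo, hi = rng
            d = abs(xf - yf) / (hi - lo)    # normalised absolute difference
        else:                               # nominal or binary variable
            d = 0.0 if xf == yf else 1.0
        num += d
        den += 1.0
    return num / den if den else 0.0

# Hypothetical records: (height in cm [interval], marital status
# [nominal], test result [asymmetric binary]).
types = ("interval", "nominal", "asym")
ranges = ((78, 196), None, None)
print(round(mixed_dissimilarity((170, "single", 1),
                                (160, "married", 0), types, ranges), 3))
```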
Clustering Methods
- Clustering is an unsupervised method.
- The main families of methods are: partitioning, hierarchical, density-based, grid-based and model-based.

Partitioning Methods
- Given n objects, k partitions are created; each partition must contain at least one element.
- An iterative relocation technique is used to improve the partitioning; distance is the usual criterion.

Partitioning Methods (cont.)
- They work well for finding spherical-shaped clusters, but they are not efficient on very large databases.
- K-means: each cluster is represented by the mean value of the objects in the cluster.
- K-medoids: each cluster is represented by an object near the centre of the cluster.

Hierarchical Methods
- Create a hierarchical decomposition of the data set.
- Agglomerative approaches start with each object forming a separate group, then merge objects or groups until all objects belong to one group or a termination condition occurs.
- Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.

Hierarchical Clustering (cont.)
- The definition of cluster proximity matters:
  - Min (most similar pair): sensitive to noise.
  - Max (most dissimilar pair): tends to break large clusters.

Density-based Methods
- These methods grow clusters as long as the density in the neighbourhood exceeds some threshold.
- They are able to find clusters of arbitrary shapes.

Grid-based Methods
- Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
- Cells that contain more than a certain number of elements are treated as dense, and dense cells are connected to form clusters.
- Fast processing time, independent of the number of objects.
- STING and CLIQUE are examples.

Model-based Methods
- Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.
- Examples: statistical models, SOM networks.

Partition Methods
- Given a database of n objects, a partition method organises them into k clusters (k <= n).
- The methods try to minimise an objective function, such as distance, so that similar objects are "close" to each other.
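Since k-means is the canonical partition method named above, here is a minimal sketch of its relocation loop; the initialisation and the fixed iteration count are simplifications (a real implementation would track convergence), and the toy data are ours.

```python
import numpy as np

def k_means(data, k, iters=20, seed=0):
    """A minimal k-means: alternate between assigning each object to
    its nearest centroid and relocating each centroid to the mean of
    its cluster, minimising the within-cluster squared distance."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocate each non-empty cluster's centroid to its members' mean.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# Toy usage: two well-separated groups in the plane.
data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
labels, centroids = k_means(data, k=2)
print(labels)     # e.g. [0 0 1 1]
print(centroids)  # means of the two groups
```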