Clustering
The goal of clustering is to segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.
Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine, and geospatial analysis. There are different types of clustering methods, including:
- Partitioning methods
- Hierarchical clustering
- Fuzzy clustering
- Density-based clustering
- Model-based clustering
- Validation
Partitioning Methods
The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. To formalize this process, we therefore need explicit measurements of similarity or, conversely, dissimilarity. Some metrics track similarity between observations, and a clustering method using such a metric would seek to maximize the similarity between observations in a cluster. Other metrics measure dissimilarity, or distance, between observations, and a clustering method using one of these metrics would seek to minimize the distance between observations in a cluster.
When observations include numerical variables, Euclidean distance is the most common measure of dissimilarity between observations. Let observations
\[
{\bf u} = ( u_1 , u_2 , \ldots , u_n ) \quad \mbox{and} \quad {\bf v} = ( v_1 , v_2 , \ldots , v_n )
\]
each comprise measurements of n variables. The Euclidean distance between u and v is
\[
d_2 ({\bf u}, {\bf v}) = \| {\bf u} - {\bf v} \|_2 = \sqrt{(u_1 - v_1 )^2 + (u_2 - v_2 )^2 + \cdots + (u_n - v_n )^2} .
\]
The distance between two points in a grid measured along a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" Euclidean distance, is called the Manhattan distance or the taxicab distance:
\[
d_1 ({\bf u}, {\bf v}) = \| {\bf u} - {\bf v} \|_1 = \sum_{k=1}^n |u_k - v_k | .
\]
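As a minimal sketch (the two observations u and v below are illustrative), both metrics can be computed with the base R dist() function:
# Two illustrative observations with n = 4 variables
u <- c(1, 2, 3, 4)
v <- c(2, 4, 1, 3)
dist(rbind(u, v), method = "euclidean")  # sqrt((1-2)^2 + (2-4)^2 + (3-1)^2 + (4-3)^2) = sqrt(10)
dist(rbind(u, v), method = "manhattan")  # |1-2| + |2-4| + |3-1| + |4-3| = 6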
Other dissimilarity measures exist, such as correlation-based distances, which are widely used for gene expression data analysis. A correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used (a small computational sketch follows this list):
- Pearson correlation distance: \[ d_{\scriptstyle cor} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)\left( y_i - \overline{\bf y} \right)}{\sqrt{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)^2 \sum_{i=1}^n \left( y_i - \overline{\bf y} \right)^2} } . \] Pearson correlation measures the degree of a linear relationship between two profiles.
- Eisen cosine correlation distance (Eisen et al., 1998): \[ d_{\scriptstyle eisen} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\left\vert \sum_{i=1}^n x_i y_i \right\vert}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2}} . \] It is a special case of Pearson's correlation with the means of x and y both replaced by zero.
- Spearman correlation distance: \[ d_{\scriptstyle spear} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x'_i - \overline{\bf x}' \right)\left( y'_i - \overline{\bf y}' \right)}{\sqrt{\sum_{i=1}^n \left( x'_i - \overline{\bf x}' \right)^2 \sum_{i=1}^n \left( y'_i - \overline{\bf y}' \right)^2} } , \] where x'_i = rank(x_i), y'_i = rank(y_i), and \overline{\bf x}', \overline{\bf y}' are the means of these ranks. The Spearman correlation method computes the correlation between the ranks of the x and y variables.
- Kendall correlation distance: \[ d_{\scriptstyle kend} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{n_c - n_d}{n(n-1)/2} , \] where n_c is the total number of concordant pairs, n_d is the total number of discordant pairs, and n is the size of x and y.
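As a minimal sketch (the vectors x and y below are illustrative), these correlation-based distances can be computed directly from base R's cor() function:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
# Correlation-based distance: 1 minus the correlation coefficient
d.pearson  <- 1 - cor(x, y, method = "pearson")
d.spearman <- 1 - cor(x, y, method = "spearman")
d.kendall  <- 1 - cor(x, y, method = "kendall")
# Eisen cosine correlation distance (uncentered correlation)
d.eisen <- 1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))
c(pearson = d.pearson, spearman = d.spearman, kendall = d.kendall, eisen = d.eisen)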
Several R functions are available for computing distance matrices:
- dist() R base function [stats package]: Accepts only numeric data as an input.
- get_dist() function [factoextra package]: Accepts only numeric data as an input. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
- daisy() function [cluster package]: Able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, the Gower’s coefficient will be automatically used as the metric. It’s one of the most popular measures of proximity for mixed data types. For more details, read the R documentation of the daisy() function (?daisy).
- get_dist() & fviz_dist() for computing and visualizing distance matrix between rows of a data matrix. Compared to the standard dist() function, get_dist() supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
- eclust(): enhanced cluster analysis. It has several advantages (illustrated in the sketch after this list):
- It simplifies the workflow of clustering analysis
- It can be used to compute hierarchical clustering and partitioning clustering in a single function call
- Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the optimal number of clusters, eclust() automatically computes the gap statistic for estimating the right number of clusters.
- For hierarchical clustering, correlation-based metrics are allowed
- It provides silhouette information for all partitioning methods and hierarchical clustering
- It creates beautiful graphs using ggplot2
- fviz_dist(): for visualizing a distance matrix
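A minimal sketch tying these factoextra helpers together on the built-in USArrests data, scaled; the distance method and clustering choices here are illustrative, not prescriptive:
# Distance computation, visualization, and enhanced clustering with factoextra
library(factoextra)
df <- scale(USArrests)                        # standardize the variables
# Correlation-based distance matrix and its visualization
res.dist <- get_dist(df, method = "pearson")
fviz_dist(res.dist)
# Enhanced k-means clustering; when k is not supplied, eclust()
# estimates the number of clusters via the gap statistic
res.km <- eclust(df, "kmeans", nstart = 25)
fviz_silhouette(res.km)                       # silhouette information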
Hierarchical Clustering
Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in a dataset. It does not require pre-specifying the number of clusters to be generated.
The result of hierarchical clustering is a tree-based representation of the objects, known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.
R code to compute and visualize hierarchical clustering:
# Compute hierarchical clustering
library(factoextra)   # for fviz_dend() and eclust()
library(magrittr)     # for the %>% pipe
res.hc <- USArrests %>%
  scale() %>%                       # Scale the data
  dist(method = "euclidean") %>%    # Compute dissimilarity matrix
  hclust(method = "ward.D2")        # Compute hierarchical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
# Enhanced hierarchical clustering
df <- scale(USArrests)          # df: the scaled numeric data matrix
res.hc <- eclust(df, "hclust")  # compute hclust
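To extract the cluster assignment of each observation, the base R cutree() function can be applied to the hclust result; a minimal sketch, where the choice of k = 4 mirrors the dendrogram cut above:
# Cut the tree into 4 groups and inspect the assignments
grp <- cutree(res.hc, k = 4)
head(grp, 10)   # cluster membership of the first 10 observations
table(grp)      # number of observations per cluster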
Hierarchical clustering, also known as hierarchical cluster analysis (HCA), groups objects based on their similarity. It can be subdivided into two types:
- Agglomerative clustering, in which each observation is initially considered a cluster of its own (leaf). The most similar clusters are then successively merged until only one big cluster remains (root).
- Divisive clustering, the inverse of agglomerative clustering, begins with the root, in which all objects are included in one cluster. The most heterogeneous clusters are then successively divided until all observations are in their own cluster.
The input data should be a numeric matrix or data frame with:
- rows representing observations (individuals);
- columns representing variables.
The hclust() function takes two main arguments:
- d: a dissimilarity structure as produced by the dist() function.
- method: the agglomeration (linkage) method to be used for computing the distance between clusters. Allowed values are one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.
- Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
- Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters.
- Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.
- Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
- Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.
- McQuitty method: When considering merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as \[ \frac{1}{2} \left[ (\mbox{dissimilarity between A and C}) + (\mbox{dissimilarity between B and C}) \right] . \] At each step, this method then merges the pair of clusters that results in the minimal increase in total dissimilarity.
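As a hedged illustration of how the linkage choice affects the result, the sketch below runs hclust() with two different linkage methods on the scaled USArrests distances, and also shows agnes() and diana() from the cluster package as one way to perform agglomerative and divisive clustering explicitly (the data set, linkage choices and the k = 4 cut are illustrative):
library(cluster)                      # for agnes() and diana()
d <- dist(scale(USArrests))           # dissimilarity matrix, as in the example above
hc.complete <- hclust(d, method = "complete")  # complete linkage: compact clusters
hc.single   <- hclust(d, method = "single")    # single linkage: long, "loose" clusters
# Agglomerative (bottom-up) and divisive (top-down) clustering
res.agnes <- agnes(d, method = "ward")         # agglomerative, Ward's method
res.diana <- diana(d)                          # divisive
# Compare 4-group memberships obtained from the two linkage choices
table(cutree(hc.complete, k = 4), cutree(hc.single, k = 4))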
Fuzzy Clustering
R is a statistical analysis language, and that means it is very good at data manipulation and data analysis. One key way to analyze data is through plotting, and R excels in this field.
Given two vectors x and y of the same length, R will plot them against each other. For example, one might plot y as a function of x, where
\[
y(x) = \int_0^x f(t)\, e^{-t^2}\, {\rm d}t .
\]
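A minimal sketch, taking f(t) = 1 for illustration and approximating the integral numerically with base R's integrate():
# Approximate y(x) = integral from 0 to x of f(t) * exp(-t^2) dt, with f(t) = 1
f <- function(t) 1
x <- seq(0, 3, by = 0.05)
y <- sapply(x, function(upper) {
  integrate(function(t) f(t) * exp(-t^2), lower = 0, upper = upper)$value
})
plot(x, y, type = "l")   # plot y against x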
Density-based Clustering
Model-based Clustering
Validation