Clustering
The goal of clustering is to segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.
- Partitioning methods
- Hierarchical clustering
- Fuzzy clustering
- Density-based clustering
- Model-based clustering
- Validation
Similarity, Dissimilarity and Distance
Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to compare two probability density functions, pdf in short, are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among numerous distance/similarity measures.
From the scientific and mathematical point of view, distance is defined as a quantitative degree of how far apart two objects are. Synonyms for distance include dissimilarity. Those distance measures satisfying the metric properties are simply called metric while other non-metric distance measures are occasionally called divergence. Synonyms for similarity include proximity and similarity measures are often called similarity coefficients. The choice of distance/similarity measures depends on the measurement type or representation of objects. The probability density function, or pdf in short, is one of the most popular pattern representations.
Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0. A large number of similarity indices have been proposed and employed, but the underlying concepts are common to all.
Dissimilarity is the complement of similarity, and is a characterization of the number of attributes two objects have uniquely compared to the total list of attributes between them. In general, dissimilarity can be calculated as 1 - similarity.
Distance is a geometric conception of the proximity of objects in a high-dimensional space defined by measurements on the attributes.

Remember that R calculates distances with the dist() function, which provides a fairly narrow range of distances ("euclidean", "manhattan", "binary", "canberra", and "maximum", passed as the method argument). One can employ a broader range of distances or dissimilarity coefficients, including ones which ignore joint absences: the vegan package provides the vegdist() function for this purpose. Just as these indices and metrics influence ordination results, they similarly influence cluster analyses.
Table 1. Distances and Dissimilarities
| Distance, Dissimilarity, or Index | dist (method) | vegan (method) | Formula |
|---|---|---|---|
| Euclidean | X | X | \( \displaystyle \left( \sum_i \vert x_i - y_i \vert^2 \right)^{1/2} \) |
| Manhattan | X | X | \( \displaystyle \sum_i \vert x_i - y_i \vert \) |
| Lp | | | \( \displaystyle \left( \sum_i \vert x_i - y_i \vert^p \right)^{1/p} \) |
| Chebyshev | | | \( \displaystyle \max_i \vert x_i - y_i \vert \) |
| Sorensen | | | \( \displaystyle \dfrac{\sum_i \vert x_i - y_i \vert}{\sum_i ( x_i + y_i )} \) |
| Canberra | X | | \( \displaystyle \sum_i \,\dfrac{ \vert x_i - y_i \vert}{x_i + y_i } \) |
| Bray-Curtis | | bray | |
| Gower | | X | \( \displaystyle \frac{1}{N}\,\sum_i \vert x_i - y_i \vert \) |
| Kulczynski | | X | \( \displaystyle \dfrac{\sum_i \vert x_i - y_i \vert}{\sum_i \min ( x_i , y_i )} \) |
| Soergel | | | \( \displaystyle \dfrac{\sum_i \vert x_i - y_i \vert}{\sum_i \max ( x_i , y_i )} \) |
| Lorentzian | | | \( \displaystyle \sum_i \ln \left( 1 + \vert x_i - y_i \vert \right) \) |
| Ochiai | | | \( \displaystyle \sqrt{\frac{a}{a+b} \, \frac{a}{a+c}} \) |
| Jaccard | | | \( \displaystyle \frac{a}{a+b+c} = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i} \) |
| Russell-Rao | | | \( \displaystyle \frac{a}{N} \) |
| Kulczynski-2 | | | \( \displaystyle \frac{1}{2} \left( \frac{a}{a+b} + \frac{a}{a+c} \right) \) |
| cosine | | | \( \displaystyle \sum_i x_i y_i \left( \sum_i x_i^2 \,\sum_i y_i^2 \right)^{-1/2} \) |
| Pseudo cosine | | | \( \displaystyle \sum_i x_i y_i \left( \sum_i x_i \,\sum_i y_i \right)^{-1} \) |
| Dice | | | \( \displaystyle \frac{2\,\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2} \) |
| Sorensen-Dice | | | \( \displaystyle \frac{2a}{2a+b+c} \) |
| simple matching | | | \( \displaystyle \frac{a+d}{a+b+c+d} \) |
| Yule | | | \( \displaystyle \frac{ad-bc}{ad+bc} \) |
| Ruzicka | | | \( \displaystyle \frac{\sum_i \min (x_i , y_i )}{\sum_i \max (x_i , y_i )} \) |
| roberts | | | |
| Chi-Square | | | |
| morisita | | X | |
| mountford | | X | |
| horn | | X | |
| minkowski | X | | |
a = number of variables on which both objects i and j are 1; b = number of variables where object i is 1 and j is 0; c = number of variables where object i is 0 and j is 1; d = number of variables where both i and j are 0; a + b + c + d = N, the number of variables.
This table specifies the distance, dissimilarity, or index functions available in various R functions and packages. The dist and vegan columns indicate availability via the method argument of dist() and vegdist(), respectively. The first column gives the name of the index as commonly known; X indicates the index is available under that name, while another name indicates that the index is present in the function but is called something slightly different. In all cases, the first argument to the function is the vegetation matrix (transformed appropriately beforehand if desired), and the second argument is the distance, dissimilarity, or index. Since R allows function arguments to be specified by position as well as by name, you don't have to remember whether a specific function uses "metric", "method", or "index" as long as it is the second argument supplied.
For example:
sampleA <- c(1,1,0,1,1,1,1)
sampleB <- c(0,1,1,0,0,1,1)
sorenson(sampleA, sampleB)
One of the oldest and best known occurrence measures is the Jaccard measure, also known as the Coefficient of Community (Jaccard 1901). The measure has seen extensive use, largely due to its simplicity and intuitiveness.
result <- dist(veg, method="canberra")
result <- dist(veg, "canberra")
result <- vegdist(veg, method="bray")
result <- vegdist(veg, "bray")
result <- dsvdis(veg, index="ruzicka")
result <- dsvdis(veg, "ruzicka")
dis.euc <- dist(veg, 'euclidean')
dis.bray <- vegdist(veg, method="bray")
dis.ruz <- dsvdis(veg, index="ruzicka")
R also provides cmdscale() to calculate what it calls "classical multi-dimensional scaling", a synonym for principal coordinates analysis.
Although the concept of Euclidean distance has prevailed in different cultures and regions for millennia, it is not a panacea for all types of data or patterns to be compared. For instance, it often does not work well with vegetation data. Many ecologists suggest Manhattan distance as an alternative. There have been considerable efforts to find appropriate measures among such a plethora of choices, because the choice is of fundamental importance to pattern classification, clustering, and information retrieval problems.
dis.mht <- dist(veg, "manhattan")
mht.pco <- pco(dis.mht, k=10)   # pco() (labdsv package) performs principal coordinates analysis
plot(mht.pco)
- Pearson correlation distance: \[ d_{\scriptstyle cor} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)\left( y_i - \overline{\bf y} \right)}{\sqrt{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)^2 \sum_{k=1}^n \left( y_k - \overline{\bf y} \right)^2} } . \] Pearson correlation measures the degree of a linear relationship between two profiles.
- Eisen cosine correlation distance (Eisen et al., 1998): \[ d_{\scriptstyle eisen} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\left\vert \sum_{i=1}^n x_i y_i \right\vert}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{k=1}^n y_k^2}} . \] It’s a special case of Pearson’s correlation with \( \overline{\bf x} \) and \( \overline{\bf y} \) both replaced by zero.
- Spearman correlation distance: \[ d_{\scriptstyle spear} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x_i - \overline{\bf x}' \right)\left( y_i - \overline{\bf y}' \right)}{\sqrt{\sum_{i=1}^n \left( x_i - \overline{\bf x}' \right)^2 \sum_{k=1}^n \left( y_k - \overline{\bf y}' \right)^2} } , \] where \( \overline{\bf x}' = \mbox{rank}({\bf x}) \) and \( \overline{\bf y}' = \mbox{rank}({\bf y}) . \) The spearman correlation method computes the correlation between the rank of x and the rank of y variables.
- Kendall correlation distance: \[ d_{\scriptstyle kendall} \left( {\bf x} , {\bf y} \right) = 1 - \frac{n_c - n_d}{n(n-1)/2} , \] where \( n_c \) is the total number of concordant pairs, \( n_d \) is the total number of discordant pairs, and \( n \) is the size of x and y.
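As a quick illustration, the sketch below computes these correlation-based distance matrices in base R as 1 minus the correlation between rows; the scaled USArrests data are used purely as placeholder input.
df <- scale(USArrests)   # example data, any numeric matrix would do
d.pearson  <- as.dist(1 - cor(t(df), method = "pearson"))
d.spearman <- as.dist(1 - cor(t(df), method = "spearman"))
d.kendall  <- as.dist(1 - cor(t(df), method = "kendall"))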
Data Standardization
The value of distance measures is intimately related to the scale on which measurements are made. Therefore, variables are often scaled (i.e., standardized) before measuring the inter-observation dissimilarities. This is particularly recommended when variables are measured in different scales (e.g., kilograms, kilometers, centimeters, ...); otherwise, the dissimilarity measures obtained will be severely affected. The goal is to make the variables comparable. Generally variables are scaled to have (i) standard deviation one and (ii) mean zero. The standardization of data is an approach widely used in the context of gene expression data analysis before clustering. We might also want to scale the data when the mean and/or the standard deviation of variables are largely different. When scaling variables, the data can be transformed as follows: \[ \frac{x_i - \mbox{center}({\bf x})}{\mbox{scale} ({\bf x})} \] where center(x) can be the mean or the median of the x values, and scale(x) can be the standard deviation (SD), the interquartile range, or the MAD (median absolute deviation).
The R base function scale()
can be used to standardize the data.
It takes a numeric matrix as an input and performs the scaling on the columns.
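A minimal sketch of column-wise standardization, using the built-in USArrests data as a stand-in; the robust variant with median and MAD is shown as an alternative, not a requirement.
df <- scale(USArrests)        # center to mean 0, scale to SD 1
round(head(df, 3), 2)
# alternative: use the median and MAD as center and scale
df.robust <- scale(USArrests,
                   center = apply(USArrests, 2, median),
                   scale  = apply(USArrests, 2, mad))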
Standardization makes the four distance measure methods (Euclidean, Manhattan, Correlation, and Eisen) more similar than they would be with non-transformed data.
Note that, when the data are standardized, there is a functional relationship
between the Pearson correlation coefficient r(x, y) and the Euclidean distance.
With some mathematics, the relationship can be written as follows:
\[
d_2 ({\bf x}, {\bf y}) = \sqrt{2m \left[ 1 - r({\bf x}, {\bf y}) \right]}
\]
where x and y are two standardized m-vectors with zero mean and unit variance
(so that \( \sum_i x_i^2 = m \)). Therefore, the results obtained with Pearson
correlation measures and standardized Euclidean distances are comparable.
The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean or taxicab distance between observations. After converting to z-scores, unequal weighting of variables can also be considered by multiplying the variables of each observation by a selected set of weights. For instance, after standardizing the units on customer observations so that income and age are expressed as their respective z-scores (instead of in dollars and years), we can multiply the income z-scores by 2 if we wish to treat income with twice the importance of age. In other words, standardizing removes bias due to the difference in measurement units, and variable weighting allows the analyst to introduce appropriate bias based on the business context.
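A small sketch of this idea, using hypothetical customer data (the variable names and values are made up for illustration):
customers <- data.frame(income = c(45000, 72000, 38000, 90000),
                        age    = c(34, 51, 29, 47))
z <- scale(customers)                # convert to z-scores
z[, "income"] <- 2 * z[, "income"]   # weight income twice as heavily as age
dist(z)                              # Euclidean distances on weighted z-scores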
When clustering of observations is done solely on the basis of categorical
variables encoded as 0-1 (dummy) variables, a better measure of similarity
between two observations can be obtained by counting the number of variables
with matching values. The simplest overlap measure is called the matching
coefficient and is computed by
\[
\frac{\mbox{number of variables with matching value for observations } {\bf u}
\mbox{ and } {\bf v}}{\mbox{total number of variables}}
\]
One weakness of the matching coefficient is that if two observations both have
a 0 entry for a categorical variable, this is counted as a sign of similarity
between the two observations. However, matching 0 entries do not necessarily
imply similarity. For example, if the categorical variable is Own A Minivan,
then a 0 entry in two different observations does not mean that these two
people own the same type of car; it means only that neither owns a minivan.
To avoid overstating similarity due to the shared absence of a feature, a similarity
measure called Jaccard's coefficient does not count matching zero entries and
is computed by
\[
\frac{\mbox{number of variables with matching nonzero value for observations }
{\bf u} \mbox{ and } {\bf v}}{\mbox{total number of variables} - (\mbox{number of variables with matching zero values for observations }{\bf u} \mbox{ and } {\bf v})}
\]
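A minimal sketch of both coefficients for two hypothetical 0-1 observations u and v; note that dist() with method = "binary" returns 1 minus Jaccard's coefficient for such data.
u <- c(1, 0, 1, 1, 0, 0)
v <- c(1, 1, 1, 0, 0, 0)
matching <- mean(u == v)               # matching coefficient
jaccard  <- sum(u & v) / sum(u | v)    # Jaccard's coefficient
as.numeric(1 - dist(rbind(u, v), method = "binary"))   # same as jaccard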
Distance matrix computation
Data preparation. We start with the USArrests data as the demo data set; later we will use randomly generated data. For the USArrests data, we use only a subset of the data by taking 15 random rows among the 50 rows in the data set, using the function sample(). Next, we standardize the data using the function scale() (a sketch of these steps follows the list below). There are several functions for computing a distance matrix:
- dist() R base function [stats package]: accepts only numeric data as input.
- get_dist() function [factoextra package]: accepts only numeric data as input. Compared to the standard dist() function, it supports correlation-based distance measures including "pearson", "kendall", and "spearman" methods.
- daisy() function [cluster package]: able to handle other variable types (e.g., nominal, ordinal, (a)symmetric binary). In that case, Gower's coefficient is automatically used as the metric. It is one of the most popular measures of proximity for mixed data types. For more details, read the R documentation of the daisy() function (?daisy).
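A minimal sketch of the data preparation just described (the seed value is arbitrary):
set.seed(123)
ss <- sample(1:50, 15)    # take 15 random rows
df <- USArrests[ss, ]     # subset the data
df.scaled <- scale(df)    # standardize the variables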
To compute a Euclidean distance matrix on the scaled data, use the dist() function as follows:
dist.eucl <- dist(df.scaled, method = "euclidean")
Note that allowed values for the option method include one of: "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski". (The "maximum" method corresponds to the Chebyshev distance \( \| x-y \|_{\infty} = \max_i |x_i - y_i | \).) To make it easier to see the distance information generated by the dist() function, you can reformat the distance vector into a matrix using the as.matrix() function.
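For instance (a short sketch using dist.eucl from above; rounding is just for display):
dist.eucl.mat <- as.matrix(dist.eucl)
round(dist.eucl.mat[1:3, 1:3], 1)   # distances among the first three observations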
The function get_dist() [factoextra package] can be used to compute correlation-based distances. The correlation method can be either "pearson", "spearman", or "kendall".
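For example (a sketch reusing the scaled subset df.scaled from above):
library(factoextra)
dist.cor <- get_dist(df.scaled, method = "pearson")
round(as.matrix(dist.cor)[1:3, 1:3], 2)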
The function daisy() [cluster package] provides a solution (Gower's metric) for
computing the distance matrix in the situation where the data contain
non-numeric columns.
The R code below applies the daisy()
function on flower data
which contains factor, ordered and numeric variables:
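A sketch of that call (the flower data set ships with the cluster package):
library(cluster)
data(flower)
str(flower)                         # factor, ordered, and numeric columns
dd <- daisy(flower)
round(as.matrix(dd)[1:3, 1:3], 2)   # Gower dissimilarities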
Visualizing distance matrices
A simple solution for visualizing the distance matrices is to use the function fviz_dist() [factoextra package]. Other specialized methods, such as agglomerative hierarchical clustering or heatmaps, will be described comprehensively in the dedicated chapters.
To use fviz_dist(), type this:
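For example, to display the Euclidean distance matrix computed earlier (a sketch; dist.eucl is assumed from above):
library(factoextra)
fviz_dist(dist.eucl)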
In the resulting plot, pure red corresponds to dist(xi, xj) = 0 and pure blue
corresponds to the highest value of Euclidean distance computed. Objects
belonging to the same cluster are displayed in consecutive order.
One can practice with many R commands using randomly generated data with the
aid of rnorm(n, mean, sd), which generates n normal random variates with the
given mean and standard deviation sd. We give an example of two sets of random
data with different means.
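A sketch of two bivariate groups with different means, built by stacking rnorm() draws (the sample sizes and means are arbitrary choices):
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),   # group 1
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))   # group 2
colnames(x) <- c("x", "y")
plot(x)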
Next we illustrate the clara command and the silhouette plot. The calling sequence is silhouette(x, dist, dmatrix, ...). Here x is an object of the appropriate class; for the default method, an integer vector with k different integer cluster codes, or a list with such an x$clustering component. Note that silhouette statistics are only defined if \( 2 \le k \le n-1 . \)
For each observation i, the silhouette width s(i) is defined as follows. Put a(i) = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, s(i) := 0 without further calculations). For all other clusters C, put d(i, C) = average dissimilarity of i to all observations of C. The smallest of these, \( b(i) := \min_C d(i, C) \), can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong. Finally,
\[ s(i) := \frac{b(i) -a(i)}{\max \left\{ a(i) , b(i) \right\}} . \]
silhouette.default() is now based on C code donated by Romain Francois (the R version being still available as cluster:::silhouette.default.R). Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.
The function silhouette in the package cluster can do the plots for you. It just needs a vector of cluster membership (produced by whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used in producing the clusters). For example:
library (cluster)
library (vegan)
data(xclara)
dis = vegdist(xclara)
res = pam(dis,3) # or whatever your choice of clustering algorithm is
sil = silhouette (res$clustering,dis) # or use your cluster vector
windows() # RStudio sometimes does not display silhouette plots correctly
plot(sil)
For k-means (which uses squared Euclidean distance):
library (vegan)
library (cluster)
data(xclara)
dis = dist(xclara)^2
res = kmeans(xclara,3)
sil = silhouette (res$cluster, dis)
windows()
plot(sil)
Partitioning methods
In R, standard clustering methods (partitioning and hierarchical clustering) can be computed using the R packages stats and cluster. However, the workflow generally requires multiple steps and multiple lines of R code. Partitioning methods like pam, clara, and fanny require that the number of clusters be given by the user. It is recommended to use easy-to-use wrapper functions, such as those in the factoextra package, for enhanced cluster analysis and visualization.
These functions include:
- get_dist() & fviz_dist() for computing and visualizing the distance matrix between rows of a data matrix. Compared to the standard dist() function, get_dist() supports correlation-based distance measures including "pearson", "kendall", and "spearman" methods.
- eclust(): enhanced cluster analysis. It has several advantages:
  - It simplifies the workflow of clustering analysis.
  - It can be used to compute hierarchical clustering and partitioning clustering in a single line function call.
  - Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the optimal number of clusters, the function eclust() automatically computes the gap statistic for estimating the right number of clusters.
  - For hierarchical clustering, correlation-based metrics are allowed.
  - It provides silhouette information for all partitioning methods and hierarchical clustering.
  - It creates beautiful graphs using ggplot2.
install.packages("factoextra") install.packages("cluster") install.packages("magrittr")To load packages:
library("cluster") library("factoextra")Data preparation
The built-in R dataset USArrests is used:
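A sketch of the preparation (removing missing values and standardizing; the pipe operator comes from magrittr):
library("magrittr")
my_data <- USArrests %>%
  na.omit() %>%   # remove missing values
  scale()         # standardize variables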
Enhanced clustering analysis
The standard R code for computing hierarchical clustering requires computing a distance matrix with dist() and then calling hclust(); using the eclust() function [factoextra package] to simplify the workflow is as follows:
eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)
In the following R code, we'll show some examples of enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.
eclust(df, "kmeans", k = 4)
Distance measures
It's simple to compute and visualize a distance matrix using the functions get_dist() and fviz_dist() [factoextra R package]:
- get_dist(): for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it supports correlation-based distance measures including "pearson", "kendall" and "spearman" methods.
- fviz_dist(): for visualizing a distance matrix
Read more: Partitioning Clustering methods. The following R code shows how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R. To determine the optimal number of clusters, use factoextra::fviz_nbclust():
library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")
Suggested number of clusters: 3.
Compute and visualize k-means clustering:
K Means clustering
Example from https://www.r-bloggers.com/k-means-clustering-in-r/ # Implement K means clustering using kmeans() (stats package).
According to the kmeans() help page, K means clustering
"aims to partition the points into k groups such that the sum of squares from
points to the assigned cluster centres is minimized." centers refers to the
number of k cluster centres (chosen randomly); nstart
refers to the number of random sets chosen to select the k cluster centers.
We're just going to use two traits (Petal Length and Width) for easier
visualisation.
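A sketch along the lines of that r-bloggers example (the seed and k = 3 are choices made for illustration):
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(20)
irisCluster <- kmeans(iris2, centers = 3, nstart = 20)
table(irisCluster$cluster, iris$Species)   # compare clusters with species
plot(iris2, col = irisCluster$cluster)     # quick base-graphics view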
agnes
, diana
, and mona
construct a
hierarchy of clusterings, with the number of clusters ranging from one to the
number of observations.
The result of hierarchical clustering is a tree-based representation of the
objects, which is also known as a dendrogram. Observations can be subdivided
into groups by cutting the dendrogram at a desired similarity level.
R code to compute and visualize hierarchical clustering:
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust")   # compute hclust
Hierarchical clustering [or hierarchical cluster analysis (HCA)] is an alternative approach to partitioning clustering for grouping objects based on their similarity. In contrast to partitioning clustering, hierarchical clustering does not require pre-specifying the number of clusters to be produced. Hierarchical clustering can be subdivided into two types:
- Agglomerative clustering, in which each observation is initially considered as a cluster of its own (leaf). Then, the most similar clusters are successively merged until there is just one single big cluster (root).
- Divisive clustering, an inverse of agglomerative clustering, begins with the root, in which all objects are included in one cluster. Then the most heterogeneous clusters are successively divided until all observations are in their own cluster.
In agglomerative hierarchical cluster analysis, sample plots all start out as individuals, and the two plots most similar (or least dissimilar) are fused to form the first cluster. Subsequently, plots are continually fused one by one, in order of highest similarity (or, equivalently, lowest dissimilarity), to the plot or cluster to which they are most similar. Each cluster enters the hierarchy at a height given by the similarity at which its plots fused to form it. Eventually, all plots are contained in the final cluster at similarity 0.0.
The agglomerative clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It’s also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, named dendrogram.
The dendrogram is a multilevel hierarchy where clusters at one level are joined together to form the clusters at the next level. This makes it possible to decide the level at which to cut the tree for generating suitable groups of data objects.
Algorithm
Cluster algorithms are classified by two characteristics: (1) hierarchical vs. fixed-cluster, and (2) if hierarchical, agglomerative vs. divisive. Agglomerative cluster algorithms differ in the calculation of similarity when more than one plot is involved, i.e., when a plot is considered for merger with a cluster containing more than one plot. Of all the algorithms invented, we will limit our consideration to those available in R, which are also those most commonly used. In R there are multiple methods:
- single --- nearest neighbor
- complete --- furthest neighbor or compact
- ward --- Ward's minimum variance method
- mcquitty --- McQuitty's method
- average --- average similarity
- median --- median (as opposed to average) similarity
- centroid --- geometric centroid
- flexible --- flexible Beta
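A sketch comparing a few of these linkage methods on the same dissimilarity matrix (USArrests is used here purely as a stand-in data set):
d <- dist(scale(USArrests))
hc.single   <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.ward     <- hclust(d, method = "ward.D2")
par(mfrow = c(1, 3))
plot(hc.single,   main = "Single linkage")
plot(hc.complete, main = "Complete linkage")
plot(hc.ward,     main = "Ward's method")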
In the average linkage, the similarity of a plot to a cluster is defined by the mean similarity of the plot to all the members of the cluster. In contrast to single linkage, a plot needs to be relatively similar to all the members of the cluster to join, rather than just one. Average linkage clusters tend to be relatively round or ellipsoid. Median or centroid approaches tend to give similar results.
In the "complete" linkage, or "compact" algorithm, the similarity of a plot to a cluster is calculated as the minimum similarity of the plot to any member of the cluster. Similar to the single linkage algorithm, the probability of a plot joining a cluster is determined by a single other member of a cluster, but now it is the least similar, not the most similar. Complete linkage clusters tend to be very tight and spherical, thus the alternative name "compact." The inverse of agglomerative clustering is divisive clustering, which is also known as DIANA (Divise Analysis) and it works in a “top-down” manner. It begins with the root, in which all objects are included in a single cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below). Note that, agglomerative clustering is good at identifying small clusters. Divisive clustering is good at identifying large clusters. In this article, we’ll focus mainly on agglomerative hierarchical clustering.
R provides the hclust() function to perform hierarchical cluster analysis. hclust() will calculate a cluster analysis from either a similarity or dissimilarity matrix, but plots better when working from a dissimilarity matrix. We can use any dissimilarity object from dist() or vegdist().
The hclust()
function treats the dissimilarity object as the
first argument, and the method or metric as the second explicit argument.
For instance,
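A sketch of such a call (assuming the vegetation matrix veg from the earlier examples):
dis.bray   <- vegdist(veg, method = "bray")
clust.bray <- hclust(dis.bray, method = "ward.D2")
plot(clust.bray)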
Data structure and preparation
The data should be a numeric matrix with:- rows representing observations (individuals);
- and columns representing variables.
The R base function scale() can be used for standardization; see the ?scale documentation.
There are many methods to calculate the (dis)similarity information, including
Euclidean and Manhattan distances. In R software, you can use the function
dist() to compute the distance between every pair of objects in a
data set. The result of this computation is known as a distance or
dissimilarity matrix.
By default, the function dist()
computes the Euclidean distance
between objects; however, it’s possible to indicate other metrics using the
argument method. See ?dist
for more information.
For example, for the R base data set USArrests, you can compute the distance matrix as follows:
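A sketch of that computation (scaling first so the variables are comparable):
df <- scale(USArrests)                       # standardize the variables
res.dist <- dist(df, method = "euclidean")   # Euclidean distance matrix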
To see the distance information between objects more easily, we reformat the
results of the function dist() into a matrix using the as.matrix() function.
In this matrix, the value in the cell formed by row i and column j represents
the distance between object i and object j in the original data set. For
instance, element 1,1 represents the distance between object 1 and itself
(which is zero), element 1,2 represents the distance between object 1 and
object 2, and so on.
The R code below displays the first 6 rows and columns of the distance matrix:
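A sketch of that display (res.dist as computed above; rounding is just for readability):
res.mat <- as.matrix(res.dist)
round(res.mat[1:6, 1:6], 1)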
Linkage
The linkage function takes the distance information returned by the function
dist() and groups pairs of objects into clusters based on their similarity.
Next, these newly formed clusters are linked to each other to create bigger
clusters. This process is iterated until all the objects in the original data
set are linked together in a hierarchical tree.
For example, given a distance matrix “res.dist” generated by the function
dist()
, the R base function hclust()
can be used
to create the hierarchical tree.
hclust() can be used as follows:
res.hc <- hclust(d = res.dist, method = "ward.D2")
- d: a dissimilarity structure as produced by the dist() function.
- method: the agglomeration (linkage) method to be used for computing the distance between clusters. Allowed values are one of "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid".
- Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
- Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters.
- Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.
- Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
- Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.
- McQuitty method: When considering merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as \[ \frac{1}{2} \left[ (\mbox{dissimilarity between A and C}) + (\mbox{dissimilarity between B and C}) \right] . \] At each step, this method then merges the pair of clusters that results in the minimal increase in total dissimilarity.
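A sketch of plotting the resulting tree and cutting it into groups (k = 4 is an arbitrary choice; res.hc is the object created above):
plot(res.hc, cex = 0.6, hang = -1)        # draw the dendrogram
rect.hclust(res.hc, k = 4, border = 2:5)  # outline four clusters
grp <- cutree(res.hc, k = 4)              # cluster membership vector
table(grp)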
Examples: Clustering in R
We start with the example of the iris data to perform a few different types of clustering analysis. You can also use the data sets that are included in R. We are using Fisher's iris data set, which gives measurements (cm) of 4 variables for 150 flowers across 3 species of iris.
Hierarchical Cluster Analysis
The dendextend cluster analysis vignette can be found on the official R web site. First, create a distance matrix representing distances between individual flower observations using the dist function. The default distance measure is Euclidean; check the arguments if you want to use something else.
?dist
dist <- dist(iris2)   # note: this name shadows the dist() function
Next, perform hierarchical cluster analysis using the
hclust
function (base R). According to the help page for hclust
:
"Initially, each object is assigned to its own cluster and then the algorithm
proceeds iteratively, at each stage joining the two most similar clusters,
continuing until there is just a single cluster." A number of different
clustering methods are available; the default is complete linkage. The output
is an object that describes the tree produced by this process.
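A sketch completing that workflow, using the four numeric iris measurements directly (cutting into three groups is a choice made to match the three species):
clusters <- hclust(dist(iris[, 1:4]))   # complete linkage by default
plot(clusters)                          # draw the dendrogram
clusterCut <- cutree(clusters, 3)       # cut into three groups
table(clusterCut, iris$Species)         # compare with the known species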