Clustering
The goal of clustering is to segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.
Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine, and geospatial analysis. There are different types of clustering methods, including:
- Partitioning methods
- Hierarchical clustering
- Fuzzy clustering
- Density-based clustering
- Model-based clustering
- Validation
Partitioning Methods
The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. To formalize this process, we therefore need explicit measurements of similarity or, conversely, dissimilarity. Some metrics track similarity between observations, and a clustering method using such a metric would seek to maximize the similarity between observations in a cluster. Other metrics measure dissimilarity, or distance, between observations, and a clustering method using one of these metrics would seek to minimize the distance between observations in a cluster.
When observations include numerical variables, Euclidean distance is the most common measure of dissimilarity between observations. Let observations
\[
{\bf u} = ( u_1 , u_2 , \ldots , u_n ) \quad \mbox{and} \quad {\bf v} = ( v_1 , v_2 , \ldots , v_n )
\]
each comprise measurements of n variables. The Euclidean distance between u and v is
\[
d_2 ({\bf u}, {\bf v}) = \| {\bf u} - {\bf v} \|_2 = \sqrt{(u_1 - v_1 )^2 + (u_2 - v_2 )^2 + \cdots + (u_n - v_n )^2} .
\]
The distance between two points in a grid measured along a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" Euclidean distance, is called the Manhattan distance or the taxicab distance:
\[
d_1 ({\bf u}, {\bf v}) = \| {\bf u} - {\bf v} \|_1 = \sum_{k=1}^n |u_k - v_k | .
\]
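As a minimal sketch (the two observations u and v below are illustrative), both metrics can be computed with the base R dist() function:
# Two illustrative observations with n = 4 variables
u <- c(1, 2, 3, 4)
v <- c(2, 4, 1, 3)
dist(rbind(u, v), method = "euclidean")  # sqrt((1-2)^2 + (2-4)^2 + (3-1)^2 + (4-3)^2) = sqrt(10)
dist(rbind(u, v), method = "manhattan")  # |1-2| + |2-4| + |3-1| + |4-3| = 6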
Other dissimilarity measures exist, such as correlation-based distances, which are widely used for gene expression data analysis. A correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used (a small computational sketch follows this list):
- Pearson correlation distance: \[ d_{\scriptstyle cor} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)\left( y_i - \overline{\bf y} \right)}{\sqrt{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)^2 \sum_{i=1}^n \left( y_i - \overline{\bf y} \right)^2} } . \] Pearson correlation measures the degree of a linear relationship between two profiles.
- Eisen cosine correlation distance (Eisen et al., 1998): \[ d_{\scriptstyle eisen} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\left\vert \sum_{i=1}^n x_i y_i \right\vert}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2}} . \] It is a special case of Pearson's correlation with the means of x and y both replaced by zero.
- Spearman correlation distance: \[ d_{\scriptstyle spear} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x'_i - \overline{\bf x}' \right)\left( y'_i - \overline{\bf y}' \right)}{\sqrt{\sum_{i=1}^n \left( x'_i - \overline{\bf x}' \right)^2 \sum_{i=1}^n \left( y'_i - \overline{\bf y}' \right)^2} } , \] where x'_i = rank(x_i), y'_i = rank(y_i), and \overline{\bf x}', \overline{\bf y}' are the means of these ranks. The Spearman correlation method computes the correlation between the ranks of the x and y variables.
- Kendall correlation distance: \[ d_{\scriptstyle kend} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{n_c - n_d}{n(n-1)/2} , \] where n_c is the total number of concordant pairs, n_d is the total number of discordant pairs, and n is the size of x and y.
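As a minimal sketch (the vectors x and y below are illustrative), these correlation-based distances can be computed directly from base R's cor() function:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
# Correlation-based distance: 1 minus the correlation coefficient
d.pearson  <- 1 - cor(x, y, method = "pearson")
d.spearman <- 1 - cor(x, y, method = "spearman")
d.kendall  <- 1 - cor(x, y, method = "kendall")
# Eisen cosine correlation distance (uncentered correlation)
d.eisen <- 1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))
c(pearson = d.pearson, spearman = d.spearman, kendall = d.kendall, eisen = d.eisen)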
Several R functions are available for computing distance matrices:
- dist() R base function [stats package]: Accepts only numeric data as an input.
- get_dist() function [factoextra package]: Accepts only numeric data as an input. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
- daisy() function [cluster package]: Able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, the Gower’s coefficient will be automatically used as the metric. It’s one of the most popular measures of proximity for mixed data types. For more details, read the R documentation of the daisy() function (?daisy).
- get_dist() & fviz_dist() for computing and visualizing distance matrix between rows of a data matrix. Compared to the standard dist() function, get_dist() supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
- eclust(): enhanced cluster analysis. It has several advantages (illustrated in the sketch after this list):
- It simplifies the workflow of clustering analysis
- It can be used to compute hierarchical clustering and partitioning clustering in a single function call
- Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the optimal number of clusters, eclust() automatically computes the gap statistic for estimating the right number of clusters.
- For hierarchical clustering, correlation-based metrics are allowed
- It provides silhouette information for all partitioning methods and hierarchical clustering
- It creates beautiful graphs using ggplot2
- fviz_dist(): for visualizing a distance matrix
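A minimal sketch tying these factoextra helpers together on the built-in USArrests data, scaled; the distance method and clustering choices here are illustrative, not prescriptive:
# Distance computation, visualization, and enhanced clustering with factoextra
library(factoextra)
df <- scale(USArrests)                        # standardize the variables
# Correlation-based distance matrix and its visualization
res.dist <- get_dist(df, method = "pearson")
fviz_dist(res.dist)
# Enhanced k-means clustering; when k is not supplied, eclust()
# estimates the number of clusters via the gap statistic
res.km <- eclust(df, "kmeans", nstart = 25)
fviz_silhouette(res.km)                       # silhouette information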
Hierarchical Clustering
Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in a dataset. It does not require pre-specifying the number of clusters to be generated.
The result of hierarchical clustering is a tree-based representation of the objects, known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.
R code to compute and visualize hierarchical clustering:
# Compute hierarchical clustering
library(factoextra)   # for fviz_dend() and eclust()
library(magrittr)     # for the %>% pipe
res.hc <- USArrests %>%
  scale() %>%                       # Scale the data
  dist(method = "euclidean") %>%    # Compute dissimilarity matrix
  hclust(method = "ward.D2")        # Compute hierarchical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
# Enhanced hierarchical clustering
df <- scale(USArrests)          # df: the scaled numeric data matrix
res.hc <- eclust(df, "hclust")  # compute hclust
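To extract the cluster assignment of each observation, the base R cutree() function can be applied to the hclust result; a minimal sketch, where the choice of k = 4 mirrors the dendrogram cut above:
# Cut the tree into 4 groups and inspect the assignments
grp <- cutree(res.hc, k = 4)
head(grp, 10)   # cluster membership of the first 10 observations
table(grp)      # number of observations per cluster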
Hierarchical clustering, also known as hierarchical cluster analysis (HCA), groups objects based on their similarity. It can be subdivided into two types:
- Agglomerative clustering, in which each observation is initially considered a cluster of its own (leaf). The most similar clusters are then successively merged until only one big cluster remains (root).
- Divisive clustering, the inverse of agglomerative clustering, begins with the root, in which all objects are included in one cluster. The most heterogeneous clusters are then successively divided until all observations are in their own cluster.
The input data should be a numeric matrix or data frame with:
- rows representing observations (individuals);
- columns representing variables.
The hclust() function takes two main arguments:
- d: a dissimilarity structure as produced by the dist() function.
- method: the agglomeration (linkage) method to be used for computing the distance between clusters. Allowed values are one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.
- Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
- Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters.
- Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.
- Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
- Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.
- McQuitty method: When considering merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as \[ \frac{1}{2} \left[ (\mbox{dissimilarity between A and C}) + (\mbox{dissimilarity between B and C}) \right] . \] At each step, this method then merges the pair of clusters that results in the minimal increase in total dissimilarity.
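As a hedged illustration of how the linkage choice affects the result, the sketch below runs hclust() with two different linkage methods on the scaled USArrests distances, and also shows agnes() and diana() from the cluster package as one way to perform agglomerative and divisive clustering explicitly (the data set, linkage choices and the k = 4 cut are illustrative):
library(cluster)                      # for agnes() and diana()
d <- dist(scale(USArrests))           # dissimilarity matrix, as in the example above
hc.complete <- hclust(d, method = "complete")  # complete linkage: compact clusters
hc.single   <- hclust(d, method = "single")    # single linkage: long, "loose" clusters
# Agglomerative (bottom-up) and divisive (top-down) clustering
res.agnes <- agnes(d, method = "ward")         # agglomerative, Ward's method
res.diana <- diana(d)                          # divisive
# Compare 4-group memberships obtained from the two linkage choices
table(cutree(hc.complete, k = 4), cutree(hc.single, k = 4))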
Fuzzy Clustering
R is a statistical analysis language, and that means it is very good at data manipulation and data analysis. One key way to analyze data is through plotting, and R excels in this field.
Given two vectors x and y of the same length, R will plot them against each other. For example, one might plot y as a function of x, where
\[
y(x) = \int_0^x f(t)\, e^{-t^2}\, {\rm d}t .
\]
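A minimal sketch, taking f(t) = 1 for illustration and approximating the integral numerically with base R's integrate():
# Approximate y(x) = integral from 0 to x of f(t) * exp(-t^2) dt, with f(t) = 1
f <- function(t) 1
x <- seq(0, 3, by = 0.05)
y <- sapply(x, function(upper) {
  integrate(function(t) f(t) * exp(-t^2), lower = 0, upper = upper)$value
})
plot(x, y, type = "l")   # plot y against x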
Density-based Clustering
Model-based Clustering
Validation