KMeans Cluster Analysis
Software 

Courses 
Overview
Cluster analysis is a set of data reduction techniques which are designed to group similar observations in a dataset, such that observations in the same group are as similar to each other as possible, and similarly, observations in different groups are as different to each other as possible. Compared to other data reduction techniques like factor analysis (FA) and principal components analysis (PCA), which aim to group by similarities across variables (columns) of a dataset, cluster analysis aims to group observations by similarities across rows.
Description
Kmeans is one method of cluster analysis that groups observations by minimizing Euclidean distances between them. Euclidean distances are analagous to measuring the hypotenuse of a triangle, where the differences between two observations on two variables (x and y) are plugged into the Pythagorean equation to solve for the shortest distance between the two points (length of the hypotenuse). Euclidean distances can be extended to ndimensions with any number n, and the distances refer to numerical differences on any measured continuous variable, not just spatial or geometric distances. This definition of Euclidean distance, therefore, requires that all variables used to determine clustering using kmeans must be continuous.
Procedure
In order to perform kmeans clustering, the algorithm randomly assigns k initial centers (k specified by the user), either by randomly choosing points in the “Euclidean space” defined by all n variables, or by sampling k points of all available observations to serve as initial centers. It then iteratively assigns each observation to the nearest center. Next, it calculates the new center for each cluster as the centroid mean of the clustering variables for each cluster’s new set of observations. Kmeans reiterates this process, assigning observations to the nearest center (some observations will change cluster). This process repeats until a new iteration no longer reassigns any observations to a new cluster. At this point, the algorithm is considered to have converged, and the final cluster assignments constitute the clustering solution.
There are several kmeans algorithms available. The standard algorithm is the HartiganWong algorithm, which aims to minimize the Euclidean distances of all points with their nearest cluster centers, by minimizing withincluster sum of squared errors (SSE).
Software
Kmeans is implemented in many statistical software programs:
In R, in the cluster package, use the function: kmeans(x, centers, iter.max=10, nstart=1). The data object on which to perform clustering is declared in x. The number of clusters k is specified by the user in centers=#. kmeans() will repeat with different initial centroids (sampled randomly from the entire dataset) nstart=# times and choose the best run (smallest SSE). iter.max=# sets a maximum number of iterations allowed (default is 10) per run.
In STATA, use the command: cluster kmeans [varlist], k(#) [options]. Use [varlist] to declare the clustering variables, k(#) to declare k. There are other options to specify similarity measures instead of Euclidean distances.
In SAS, use the command: PROC FASTCLUS maxclusters=k; var [varlist]. This requires specifying k and the clustering variables in [varlist].
In SPSS, use the function: Analyze > Classify > KMeans Cluster. Additional help files are available online.
Considerations
Kmeans clustering requires all variables to be continuous. Other methods that do not require all variables to be continuous, including some heirarchical clustering methods, have different assumptions and are discussed in the resources list below. Kmeans clustering also requires a priori specification of the number of clusters, k. Though this can be done empirically with the data (using a screeplot to graph withingroup SSE against each cluster solution), the decision should be driven by theory, and improper choices can lead to erroneous clusters. See Peeples’ online R walkthrough R script for Kmeans cluster analysis below for examples of choosing cluster solutions.
The choice of clustering variables is also of particular importance. Generally, cluster analysis methods require the assumption that the variables chosen to determine clusters are a comprehensive representation of the underlying construct of interest that groups similar observations. While variable choice remains a debated topic, the consensus in the field recommends clustering on as many variables as possible, as long as the set fits this description, and the variables that do not describe much of the variance in Euclidean distances between observations will contribute less to cluster assignment. Sensitivity analyses are recommended using different cluster solutions and sets of clustering variables to determine robustness of the clustering algorithm.
Kmeans by default aims to minimize withingroup sum of squared error as measured by Euclidean distances, but this is not always justified when data assumptions are not met. Consult textbooks and online guides in resources section below, especially Robinson’s Rblog: Kmeans clustering is not a free lunch for examples of the issues encountered with kmeans clustering when assumptions are violated.
Lastly, cluster analysis methods are similar to other data reduction techniques in that they are largely exploratory tools, thus results should be interpreted with caution. Many techniques exist for validating results from cluster analysis, including internally with crossvalidation or bootstrapping, validating on conceptual groups theorized a priori or with expert opinion, or external validation with separate datasets. A common application of cluster analysis is as a tool for predicting cluster membership on future observations using existing data, but it does not describe why the observations are grouped that way. As such, cluster analysis is often used in conjunction with factor analysis, where cluster analysis is used to describe how observations are similar, and factor analysis is used to describe why observations are similar. Ultimately, validity of cluster analysis results should be determined by theory and by utility of cluster descriptions.
Readings
Textbooks & Chapters

Aldenderfer MS and Blashfield RK (1984). Cluster Analysis. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07044. Newbury Park, California: Sage Publications. The cluster analysis “green book” is a classic reference text on theory and methods of cluster analysis, as well as guidelines for reporting results.

Everitt BS, Landau S, Leese M, Stahl D (2011). Cluster Analysis, 5th ed. Wiley Series. Indepth and contemporary descriptions of the various types of cluster analysis methods as the field has developed.

Lorr M (1983). Cluster Analysis for Social Scientists. JosseyBass Social and Behavioral Science Series. Lorr’s classic text details related methods with data typically encountered in social sciences – Kmeans data assumptions are often difficult to meet with data in social sciences, and alternatives are discussed.
Methodological Articles

Hauser J and Rybakowski J (1997). Three clusters of male alcoholics. Drug Alcohol Depend; 48(3):24350. An example of clustering behavior types in addiction research.

Breuhl S, et al. (1999). Use of clusters analysis to validate IHS diagnostic criteria for migraine and tensiontype headache. Headache; 39(3):1819. A study of validating diagnostic criteria using kmeans on symptom patterns.

Guthrie E, et al. (2003). Cluster analysis of symptoms and health seeking behaviour differ entiates subgroups of patients with severe irritable bowel syndrome. Gut; 52(11):161622. Care seeking behavior patterns are differentiated by cluster analysis.
Application Articles

MacQueen J (1967). Some methods for classification and analysis of multivariate observa tions. Proceedings of the 5th Berkeley Symposium on Math. Statist. and Prob., Vol. 1. Early statistical methods paper about kmeans the clustering algorithm from one of the early developers.

Salim SZ and Ismail MA. (1984). Kmeanstype algorithms: A generalized convergence theo rem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell; 6(1):817. Methodological considerations and recommendations for the use of kmeans clustering.

Saeed F, et al. (2012). Combining Kmeans clusterings of chemical structures using cluster based similarity partitioning algorithm. Communications in Computer and Information Science; 322:304312. A recent article on improving the performance of kmeans cluster solutions through multipleiteration and combination approaches.
Websites
Various walkthroughs for using R software to conduct kmeans cluster analysis, with applied examples and sample code.

statmethods.net: QuickR: Cluster Analysishttp://www.statmethods.net/advstats/cluster.html

2. Rstatistics blog: Kmeans clustering http://www.rstatistics.com/2013/08/kmeansclusteringfromrinaction/

3. Peeples MA (2011). R Script for Kmeans Cluster Analysishttp://www.mattpeeples.net/kmeans.html

4. Robinson D (2015). R bloggers: Kmeans clustering is not a free lunch http://www.rbloggers.com/kmeansclusteringisnotafreelunch/
Technical R resources

R kmeans() help file https://stat.ethz.ch/Rmanual/Rdevel/library/stats/html/kmeans.html
Related data reduction techniques

Exploratory factor analysis (EFA) on Advanced Epidemiology

Principal components analysis (PCA) on Advanced Epidemiology