Machine Learning With IBM SPSS – KMeans Clustering
Introduction
KMeans clustering (closely related to Two-Step Clustering) is an unsupervised machine learning technique, and it is often used when we have no prior idea about the categories or groups in the data. The aim of the algorithm is to help you find such groups, and the number of groups is given by the value K. Each group is summarized by the mean of its cases, hence the name K-Means.
To use the SPSS algorithm for KMeans clustering, it is advisable to standardize the variables first so that variables measured on larger scales do not bias the results. Standardizing places the variables on the same scale. We are using the variables “wheelbase”, “length”, “width” and “height”, and the label variable is “enginelocation_tr”. The variables are highlighted below.
To standardize, follow: Analyze >>> Descriptive Statistics >>> Descriptives as shown below.
Then select the variables to be standardized and move them into the box. The “Save standardized values as variables” option is often unchecked, whereas it is exactly what we need for this analysis. Therefore, ensure that it is checked and then click OK. This saves standardized versions of the variables, which we shall use later.
The standardized variables are shown below as “Zwheelbase” “Zlength” “Zwidth” “Zheight”.
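For readers who want to see what the standardization does, here is a minimal Python sketch of a z-score. It assumes the sample standard deviation (n - 1 in the denominator), and the wheelbase values are made up for illustration; they are not taken from the automobile data.

```python
from statistics import mean, stdev

def zscores(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1), an assumption
    return [(v - m) / s for v in values]

# Made-up wheelbase values, for illustration only.
wheelbase = [88.6, 94.5, 99.8, 105.8, 110.0]
z_wheelbase = zscores(wheelbase)
print([round(z, 3) for z in z_wheelbase])
```

After this step every variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculations.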
So, we start the KMeans clustering analysis by following: Analyze >>> Classify >>> K-Means Cluster… as shown below.
Then select the standardized variables, move them into the box and click OK as shown below.
You may click Iterate to set the maximum number of iterations and the convergence criterion. In this case we shall leave them at their defaults.
Regarding the output, I often request the ANOVA table (under Options) to see the statistics of the KMeans clustering such as the p-values. Then click Continue and OK.
Additionally, we keep the default of 2 as the number of clusters and use “enginelocation_tr” as the “Label Cases by:” variable, as shown below.
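To make the two dialog settings (number of clusters and maximum iterations) more concrete, the following is a rough Python sketch of the basic K-Means loop. It is not the SPSS implementation: the function name `kmeans`, the choice of the first k cases as starting centers and the six sample cases are all assumptions made for illustration.

```python
import math

def kmeans(points, k, max_iter=10):
    """Plain K-Means (Lloyd's algorithm) on a list of coordinate tuples.

    Assumption: the first k points serve as initial centers. SPSS picks
    its initial centers differently, so results need not match exactly.
    """
    centers = [tuple(p) for p in points[:k]]
    labels = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each case joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # no center moved, so we have converged
            break
        centers = new_centers
    return centers, labels, iteration

# Made-up standardized cases (two variables), for illustration only.
cases = [(-1.2, -1.0), (1.0, 0.9), (-0.9, -1.1),
         (1.1, 1.2), (-1.0, -0.8), (0.8, 1.0)]
centers, labels, n_iter = kmeans(cases, k=2)
print("cluster membership:", labels)
print("final centers:", centers)
print("iterations used:", n_iter)
```

The loop stops either when the centers no longer move or when `max_iter` is reached, which mirrors the role of the iteration and convergence settings in the dialog.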
When you click OK above, the output is displayed with the results of the KMeans clustering, such as the number of clusters, the cluster centers and the p-values. One feature I find very handy is the creation of plots directly from the SPSS output. I’m going to use that here!
Double-click the result table that you want to graph, then highlight the data that you want to plot. Right-click the selection, choose Create Graph and pick the type of graph you want to create.
The bar chart below shows the distribution of the variables for the first and second cluster.
The tables below show the cluster centers for each variable and the iteration history. The statistics show that the iterations converged at the fourth iteration for both cluster 1 and cluster 2. In addition, the ANOVA table reveals that the p-value for each variable is < 0.0001.
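The F statistic behind each row of an ANOVA table like this one can be sketched in a few lines of Python. This is a textbook one-way ANOVA under assumed, made-up values; the helper name `anova_f` is hypothetical, and the p-value itself is left out because it needs the F distribution, which the Python standard library does not provide.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic for one variable across cluster labels."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-cluster sum of squares: spread of cluster means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-cluster sum of squares: spread of cases around their own cluster mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)   # mean square with k - 1 degrees of freedom
    ms_within = ss_within / (n - k)     # mean square with n - k degrees of freedom
    return ms_between / ms_within

# Made-up standardized values and cluster memberships, for illustration only.
z_values = [-1.2, -0.9, -1.0, 1.0, 1.1, 0.8]
membership = [1, 1, 1, 2, 2, 2]
f_stat = anova_f(z_values, membership)
print("F =", round(f_stat, 2))
```

A large F means the cluster means differ a lot relative to the spread inside each cluster. Because the clusters were formed precisely to separate the cases, these F values are best read as descriptive rather than as a formal test.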
Conclusion
In this blog, KMeans clustering in IBM SPSS has been introduced using the automobile data. With the algorithm you can make choices that depend on your goal for the analysis. Often, I try several numbers of clusters to evaluate the model and select the solution that fits the data best. In many cases 2 to 5 clusters are enough; however, when dealing with highly variable data, it may be necessary to increase the number of clusters.
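One common way to compare different numbers of clusters is to look at the within-cluster sum of squares for each K. The Python sketch below illustrates the idea under the same simplifying assumptions as before: a basic K-Means loop started from the first k cases (not the SPSS procedure), hypothetical helper names `kmeans` and `wcss`, and made-up data containing three obvious groups.

```python
import math

def kmeans(points, k, max_iter=10):
    """Basic K-Means; the first k points are the starting centers (an assumption)."""
    centers = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centers[c])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels

def wcss(points, centers, labels):
    """Within-cluster sum of squared distances to the assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Made-up cases forming three visible groups, for illustration only.
cases = [(0.0, 0.1), (5.0, 5.1), (10.0, 0.2),
         (0.2, 0.0), (5.2, 4.9), (9.8, 0.0),
         (0.1, 0.3), (4.9, 5.0), (10.1, -0.1)]

results = {}
for k in range(1, 5):
    centers, labels = kmeans(cases, k)
    results[k] = wcss(cases, centers, labels)
    print(k, "clusters -> within-cluster sum of squares:", round(results[k], 3))
```

The value drops sharply up to the K that matches the structure in the data and only slowly afterwards, which is one way to judge when adding more clusters stops paying off.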
See you in the next technique - Hierarchical Clustering.