Machine Learning Using IBM SPSS – Two-Step Cluster
In this part of the machine learning series using SPSS, clustering techniques are introduced. SPSS offers three types of cluster analysis: Two-Step, K-Means and Hierarchical Cluster. This blog is on the Two-Step Cluster technique. Two-Step borrows from both of the others: a quick pre-clustering pass first groups the records into many small sub-clusters, and a hierarchical algorithm is then run on those sub-clusters. One advantage of Two-Step is that large datasets, which would involve many steps in a purely hierarchical method, can be handled easily because the quick pre-clustering pass shrinks the problem first. This is not the first blog on Machine Learning using SPSS, so be sure to check the previous blogs: Linear Regression and Logistic Regression.
The engine location of a vehicle will be used as the evaluation variable for clusters built from some variables in the automobile dataset that we have been using for this series.
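To get a feel for the two stages, here is a minimal Python sketch of the idea. It is not what SPSS runs internally: the function names, the distance threshold and the toy points are my own assumptions, and plain Euclidean distance between centroids stands in for whatever measure SPSS uses. It only illustrates a quick single pass that forms small sub-clusters, followed by hierarchical merging of those sub-clusters.

```python
import math

def pre_cluster(points, threshold):
    """Stage 1: a single pass over the data. Each point joins the nearest
    sub-cluster whose centroid is within `threshold`, else starts a new one."""
    subs = []  # each sub-cluster keeps a running coordinate sum and a count
    for p in points:
        best, best_d = None, threshold
        for s in subs:
            centroid = [v / s["n"] for v in s["sum"]]
            d = math.dist(p, centroid)
            if d <= best_d:
                best, best_d = s, d
        if best is None:
            subs.append({"sum": list(p), "n": 1})
        else:
            best["sum"] = [a + b for a, b in zip(best["sum"], p)]
            best["n"] += 1
    return subs

def merge_hierarchically(subs, k):
    """Stage 2: repeatedly merge the two closest sub-clusters until k remain."""
    clusters = [{"sum": list(s["sum"]), "n": s["n"]} for s in subs]

    def centroid(c):
        return [v / c["n"] for v in c["sum"]]

    while len(clusters) > max(k, 1):
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        merged = {
            "sum": [x + y for x, y in zip(clusters[i]["sum"], clusters[j]["sum"])],
            "n": clusters[i]["n"] + clusters[j]["n"],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

# Toy 2-D points, invented for illustration only.
toy_points = [(0, 0), (0.2, 0.1), (3, 3), (3.1, 2.9), (10, 10), (10.2, 10.1), (9.9, 10.0)]
sub_clusters = pre_cluster(toy_points, threshold=1.0)
final_clusters = merge_hierarchically(sub_clusters, k=2)
print(len(sub_clusters), sorted(c["n"] for c in final_clusters))
```

The point of the first stage is that the expensive pairwise merging in the second stage only ever sees the handful of sub-clusters, not every record.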
TwoStep Cluster is found under: Analyze>>>Classify>>>TwoStep Cluster…
When you click on TwoStep Cluster… the dialog below should appear, where you select your variables for clustering. Two types of variables are specified: Categorical Variables and Continuous Variables. This means you must place your categorical independent variables in the Categorical Variables box and your continuous independent variables in the Continuous Variables box, because SPSS treats them differently. If you want to see how SPSS does the magic, swap them and compare your results; the difference will be obvious!
In this case, we are putting “horsepowerbinned” into the categorical box and “numofcylinders_tr” and “enginesize” into the continuous variables box. The goal is to evaluate how these parameters relate to the position of the engine of a vehicle. We are leaving the other options at their defaults; for instance, SPSS decides the number of clusters (you may also choose your own by clicking Specify fixed in the Number of Clusters category).
Then go to the Output… option, where you select the evaluation variable. The evaluation variable in this case is “enginelocation_tr.” You can select more than one evaluation variable. In addition, you may ask SPSS to save the cluster membership of each sample by checking the box under Working Data File. In this case we are not checking the box. Then click Continue.
The (brief) result then shows up with the Model Summary, which comprises the number of inputs (the categorical and the continuous variables) and the number of clusters as decided by SPSS. In addition, the cluster quality (more or less like the accuracy of the model) is shown; in our case it falls in the Good region (luckily for us!). One may ask: what if the cluster quality is Poor? In that case, I suggest you change the number of clusters or the variables.
You can get more details about the clusters by double-clicking on the result shown above, after which the Figure below appears. By default there are two dropdown views: Model Summary and Cluster Sizes.
We are actually concerned with the Predictor Importance and the Clusters views. There are many options that you can explore, and I strongly encourage you to try each of them to learn what it does; that way you gain a much better understanding of how to use SPSS for TwoStep Cluster.
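The cluster quality bar in the Model Summary is labelled as a silhouette measure of cohesion and separation. Below is a rough Python sketch of the textbook silhouette score, just to show what such a number captures. The way SPSS computes its own figure may differ from this pairwise version, and the Poor/Fair/Good cut-offs of 0.2 and 0.5 in `quality_band` are my assumption about where the bands fall, so treat both as illustrative.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; values range from -1 to 1."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster.
        by_label = {}
        for j, q in enumerate(points):
            if i != j:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            scores.append(0.0)  # common convention for singletons / one cluster
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_label.values())      # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def quality_band(score):
    """Map a score to a label; the 0.2 and 0.5 cut-offs are assumed."""
    if score > 0.5:
        return "Good"
    if score >= 0.2:
        return "Fair"
    return "Poor"

# Toy points, invented for illustration only.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
well_separated = silhouette_score(pts, [0, 0, 1, 1])
mixed_up = silhouette_score(pts, [0, 1, 0, 1])
print(quality_band(well_separated), quality_band(mixed_up))
```

A score near 1 means points sit close to their own cluster and far from the others, which is why changing the number of clusters or the input variables can move a Poor result upward.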
From the Figure below, “horsepowerbinned” has the highest importance, followed by “enginesize.” One interesting thing about this SPSS algorithm is that it shows you the details of all the available clusters for you to select from. This result depends on the maximum number of clusters. By clicking on each cluster you can observe its distribution.
Currently, the evaluation variable is not part of the details shown, so we cannot yet study its distribution. To add it, click on Display, and the dialog in the Figure below appears. By default, Evaluation Fields is unchecked, so check it and then click OK.
When this is done you should see the Figure below containing the evaluation variable.
One of the features I find most handy in this technique is the Cells show relative distributions option, as indicated below. The second option shows absolute distributions, which I don’t really like because it does not let me see how each cluster compares with the original data.
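As I understand the two options, the difference can be sketched in a few lines of Python. The records below are hypothetical, invented only for illustration, and the variable names are my own: the absolute view corresponds to raw counts inside a cluster, while the relative view puts the cluster's shares next to the overall shares so the two can be compared.

```python
from collections import Counter

# Hypothetical (cluster, engine location) pairs, invented for illustration.
records = [
    (1, "front"), (1, "front"), (1, "front"), (1, "rear"),
    (2, "front"), (2, "front"), (2, "front"), (2, "front"),
    (2, "front"), (2, "front"),
]

def shares(values):
    """Return {category: proportion} for a list of category labels."""
    counts = Counter(values)
    return {cat: n / len(values) for cat, n in counts.items()}

overall = shares([loc for _, loc in records])
cluster_1 = [loc for c, loc in records if c == 1]

absolute = Counter(cluster_1)   # raw counts inside cluster 1
relative = shares(cluster_1)    # shares inside cluster 1, comparable to `overall`

print("absolute:", dict(absolute))
for cat in sorted(overall):
    print(cat, "cluster share:", relative.get(cat, 0.0), "overall share:", overall[cat])
```

In this made-up data, a single rear-engine car looks negligible as a count, but its share within cluster 1 is well above its share in the whole dataset, which is the kind of contrast the relative view makes visible.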
Conclusion
In this blog, the Two-Step Cluster of IBM SPSS has been introduced to study how one or more variables relate to another variable (or variables). The predictor importance and the clusters have been presented.