Unraveling R's Hierarchical Clustering: A Breezy Explanation!

In automotive data, it is not always clear where a given model fits. A recent cluster analysis turned up an unexpected result: the Honda Civic lands in a group of its own choosing, well apart from the Mercedes 450 group.

The Honda Civic, while sharing comparable parameters with the Mercedes 450, clusters with Toyota and Fiat because of its combination of rear axle ratio (drat) and miles per gallon (mpg). This observation challenges our preconceived notions about these vehicles and underscores the value of data analysis.

Cluster analysis is a powerful tool for understanding the underlying structure of data. The interest lies not just in determining which samples cluster together, but also in understanding why they do so. Done well, the analysis tells the story of the data without stealing the show.

In this article, we will demonstrate a hierarchical cluster analysis using the mtcars dataset in R. The dendrogram produced by this analysis is a visual representation of the relationships between the cars, with the leaf labels and lines coloured by engine type.
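As a starting point, here is a minimal sketch of the clustering step in base R. The article does not state which distance or linkage was used, so the choices below (standardised variables, Euclidean distance, complete linkage) are assumptions made for illustration.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs to be installed.

# Standardise each variable so that large-valued columns such as disp and hp
# do not dominate the distances (assumption: the source does not say whether
# the data were scaled).
cars_scaled <- scale(mtcars)

# Pairwise Euclidean distances between the 32 cars (assumed metric).
car_dist <- dist(cars_scaled, method = "euclidean")

# Agglomerative hierarchical clustering with complete linkage (assumed linkage).
hc <- hclust(car_dist, method = "complete")

# Standard base-R dendrogram.
plot(hc, cex = 0.7, main = "mtcars: hierarchical clustering", xlab = "", sub = "")
```

Swapping the method argument of hclust (for example "average" or "ward.D2") can change the grouping noticeably, so it is worth trying more than one before reading too much into any single cluster.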

The dendrogram is built with the dendextend package, developed by Tal Galili, which makes creating appealing hierarchical cluster plots in R a breeze. In the standard plot of the hierarchical cluster analysis, continuous variables are cut by their quartiles to assign discrete colours.
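The quartile-to-colour step can be sketched in base R as follows. The helper name quartile_colours and the four-colour palette are hypothetical, chosen only for this example; the article does not say which colours were used.

```r
# Hypothetical palette: one colour per quartile, from lowest to highest.
quartile_palette <- c("#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")

# Hypothetical helper: bin a continuous variable by its quartiles and
# return one colour per observation.
quartile_colours <- function(x, palette = quartile_palette) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
  # cut() requires unique breaks, so this assumes the quartiles are distinct
  # (true for continuous columns such as mpg; not suitable for cyl or gear).
  bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  palette[bins]
}

# One colour per car, in the row order of mtcars.
mpg_col <- quartile_colours(mtcars$mpg)
names(mpg_col) <- rownames(mtcars)
head(mpg_col)
```

The resulting vector is in the row order of the data, not in the left-to-right order of the dendrogram leaves, which matters when the colours are later attached to a plot.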

To enhance the visualisation, coloured bars representing the feature levels of the samples are added to the dendrogram. In this case, the size of the labels is set by the number of cylinders. Adjustments to nodes in the dendrogram are also possible.
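A rough sketch of these decorations with dendextend is given below. The colour choices, the label sizes per cylinder count, and the selection of vs and am for the bars are all illustrative assumptions, as are the distance and linkage defaults; only set() and colored_bars() come from the package itself.

```r
# Cluster as before (assumed: scaled data, Euclidean distance, complete linkage).
hc <- hclust(dist(scale(mtcars)))
dend <- as.dendrogram(hc)

# Row indices of mtcars in left-to-right leaf order of the dendrogram.
leaf_order <- order.dendrogram(dend)

# Illustrative mappings from features to colours and label sizes.
vs_col  <- ifelse(mtcars$vs == 1, "steelblue", "firebrick")   # engine type
am_col  <- ifelse(mtcars$am == 1, "darkorange", "grey40")     # transmission
cyl_cex <- unname(c("4" = 0.6, "6" = 0.8, "8" = 1.0)[as.character(mtcars$cyl)])

# The decoration step needs the dendextend package; skip it if not installed.
if (requireNamespace("dendextend", quietly = TRUE)) {
  # set() assigns values in leaf order, so reorder the per-car vectors first.
  dend <- dendextend::set(dend, "labels_col", vs_col[leaf_order])
  dend <- dendextend::set(dend, "labels_cex", cyl_cex[leaf_order])

  par(mar = c(12, 4, 1, 1))  # leave room for labels and bars below the tree
  plot(dend)

  # colored_bars() takes colours in the original data order and, by default,
  # reorders them to match the leaves.
  dendextend::colored_bars(
    colors = cbind(vs = vs_col, am = am_col),
    dend = dend,
    rowLabels = c("vs", "am")
  )
}
```

The vertical position of the bars usually needs some tuning; colored_bars() exposes y_shift and y_scale arguments for that.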

Interestingly, the group in the middle with the Mercedes 450 series shares a common weight class (wt) and rear axle ratio (drat), as well as the same number of gears and cylinders and the same type of transmission and engine (am, vs), yet the cars differ in displacement and horsepower (disp, hp).
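One way to check an observation like this is to cut the tree into groups and look at how much each feature varies within the group of interest. The sketch below assumes a cut into four groups and the same distance and linkage defaults as before; none of these settings are confirmed by the article, so the membership it produces may differ from the group described above.

```r
hc <- hclust(dist(scale(mtcars)))

# Cut the tree into k groups (k = 4 is an arbitrary choice for illustration).
groups <- cutree(hc, k = 4)

# Find the group that contains one of the Mercedes 450 cars.
merc_group <- groups[["Merc 450SE"]]
members <- names(groups)[groups == merc_group]
group_data <- mtcars[members, ]

# Number of distinct values per feature within the group: a 1 means every
# car in the group shares the same value for that feature.
n_distinct <- sapply(group_data, function(x) length(unique(x)))

# Minimum and maximum of each feature within the group.
value_range <- t(sapply(group_data, range))
colnames(value_range) <- c("min", "max")

print(members)
print(cbind(n_distinct, value_range))
```

Features with a single distinct value or a narrow range are the ones holding the group together, while features with a wide range are the ones on which its members differ.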

This analysis raises intriguing questions about the factors that influence a car's grouping. For instance, one might wonder where a Tesla would be grouped if the dataset were updated.

Understanding which sample features bring samples together and which drive them apart provides value in analysis. This knowledge can help us make more informed decisions and gain a deeper understanding of the data.

However, the question in cluster analysis often remains: "Now what?" Classical methods such as k-means, DBSCAN, and hierarchical clustering can provide answers, but the real value lies in the insights we gain from the story the data tells.
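As a small illustration of comparing methods, the sketch below cross-tabulates a hierarchical grouping against a k-means grouping of the same scaled data. The number of groups (four), the seed, and the clustering settings are assumptions for the example; DBSCAN is left out because it needs an additional package.

```r
set.seed(42)  # k-means uses random starts, so fix the seed for repeatability

cars_scaled <- scale(mtcars)

# Hierarchical clustering cut into four groups (assumed settings).
hc_groups <- cutree(hclust(dist(cars_scaled)), k = 4)

# k-means with the same number of groups; nstart tries several random starts.
km <- kmeans(cars_scaled, centers = 4, nstart = 25)

# Rows: hierarchical groups; columns: k-means groups. Cluster numbers are
# arbitrary labels, so look for one dominant cell per row, not a diagonal.
agreement <- table(hierarchical = hc_groups, kmeans = km$cluster)
print(agreement)
```

Cars that move between groups depending on the method are often the most interesting ones to examine feature by feature.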