Tuesday, March 05, 2019

Python - machine learning and clustering

Clustering is the task of dividing a population or set of data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them to clusters.

Within machine learning, clustering falls under unsupervised learning. It is used, for example, in recommendation systems, targeted marketing and customer segmentation.

The outline below is a simple starting point showing a basic form of clustering on a relatively small dataset. The dataset we will use is displayed in the scatter chart below. The objective is to determine 3 clusters in the data shown in the scatter chart. In this example the data is just random data; it can, however, represent virtually anything: customers, demographic data, sensor data points or anything else.
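For illustration, a dataset of this kind could be generated and plotted as follows. This is a minimal sketch; the use of NumPy and matplotlib and the chosen values are assumptions, not the exact code behind the chart.

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(42)

    # Random 2D points loosely grouped around three centers, mimicking a
    # small dataset with some hidden structure (values are illustrative).
    centers = np.array([[2.0, 2.0], [8.0, 3.0], [5.0, 8.0]])
    data = np.vstack([c + np.random.normal(scale=1.0, size=(50, 2)) for c in centers])

    plt.scatter(data[:, 0], data[:, 1])
    plt.title("Raw dataset")
    plt.show()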


When looking at this data, humans are by default driven by apophenia to try and see patterns. Apophenia has come to imply a universal human tendency to seek patterns in random information, as seen for example in gambling. However, even though the human mind will try to see a pattern, that pattern is far from correct in many cases. To make a truly valid clustering we need to base the clustering on math and not on the feeling of the human mind.

By leveraging Python code we can divide the data into 3 distinct clusters; the clusters found are shown below in different colors.
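A minimal sketch of how this can be done with scikit-learn's KMeans; it assumes the data array from the snippet above and is not necessarily the code used for the original chart.

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    # Fit K-means with 3 clusters and get a cluster label per data point.
    kmeans = KMeans(n_clusters=3, random_state=42)
    labels = kmeans.fit_predict(data)

    # Color each point by its assigned cluster.
    plt.scatter(data[:, 0], data[:, 1], c=labels)
    plt.title("K-means clustering, k=3")
    plt.show()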


We can now see the different clusters that are within the data. Finding the members of each cluster is done with K-means clustering. K-means is a clustering algorithm that aims to partition n observations into k clusters. The main steps are:


  • Initialisation – K initial “means” (centroids) are generated at random
  • Assignment – K clusters are created by associating each observation with the nearest centroid
  • Update – each centroid is recomputed as the mean of the observations assigned to it; the assignment and update steps repeat until the centroids no longer move

The result is that, after the updates, you will end up with (in our case) 3 centroids, with each data point associated with the centroid to which it has the smallest distance. A from-scratch sketch of these steps follows.
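To make the three steps concrete, here is an illustrative from-scratch version in plain NumPy. It is a sketch under simplifying assumptions (a fixed number of iterations, no handling of empty clusters); in practice a library implementation such as sklearn.cluster.KMeans is the more robust choice.

    import numpy as np

    def kmeans_steps(data, k=3, iterations=10, seed=42):
        rng = np.random.RandomState(seed)
        # Initialisation: pick k random observations as the initial centroids.
        centroids = data[rng.choice(len(data), k, replace=False)]
        for _ in range(iterations):
            # Assignment: associate each observation with its nearest centroid.
            distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Update: each centroid becomes the mean of its assigned observations.
            centroids = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        return labels, centroids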


The above scatter chart shows the centroids which form the backbone of the clustering. Normally they will be hidden, as they are not actual data points from the dataset. As you can see, we now have 3 clusters derived from the bigger dataset.
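As a rough sketch, the centroids can be made visible by overlaying them on the clustered data; this assumes the data, labels and kmeans variables from the scikit-learn snippet above.

    import matplotlib.pyplot as plt

    # Plot the clustered points, then mark each centroid with an "x".
    plt.scatter(data[:, 0], data[:, 1], c=labels)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker="x", s=200, c="black")
    plt.title("Clusters with their centroids")
    plt.show()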

Examples of clustering can be found in my GitHub project containing machine learning examples.
