
Unsupervised Learning and the Intuition Behind K-Means Clustering

Ali H Khanafer · Published in Geek Culture · 7 min read · Jun 22, 2021

This is part eight of a series I’m working on, in which we discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find links to all the previous pieces in the series. I suggest you read them in sequence, simply because I introduce concepts there that are key to understanding the notions discussed in this article, and I’ll be referring back to them on numerous occasions.

Today we look at our very first (and only) unsupervised learning algorithm. We’ll define what it means to be unsupervised, then look at the intuition behind K-Means as well as answer some frequently asked questions.

Let’s get right into it.

Unsupervised Learning Algorithms

In all our previous articles, we’ve dealt with training data that looked like the following:

Figure 1: Classification Problem with Two Features and Two Classes

where every training input in {x^(1), x^(2), ..., x^(m)} had a respective training label in {y^(1), y^(2), ..., y^(m)}, indicating the actual value we wanted to predict. With this information, we were able to train our model on the training input and compare our predicted values to the actual training labels. We called this supervised learning. In unsupervised learning, our data instead looks something like the following:

Figure 2: Unsupervised Algorithm Training Data Example

Here, all we have is the training input {x^(1), x^(2), ..., x^(m)} and no labels for these inputs, i.e. no {y^(1), y^(2), ..., y^(m)}. The goal of an unsupervised learning algorithm is to find some order in this data by creating groups based on the similarity between points. These groups are called clusters. Here’s an example of how an unsupervised learning algorithm might create different clusters from the training data shown in Figure 2:

Figure 3: Grouping Training Data Into Two Different Clusters

One of the most popular and widely used clustering algorithms is K-Means.

K-Means

Intuition

The best way to understand this algorithm is through an example. As such, consider the following graph:

Figure 4: Classification Problem with Two Features and Two Classes

We want to group similar data points into clusters. The first thing K-Means will do is pick k arbitrary points. We refer to these points as centroids. For our example, we’ll choose k = 3, marked as red, blue, and black crosses:

Figure 4: K-Means Selecting K Arbitrary Centroids

We then calculate the distance from each training point to each of these three centroids, using a distance metric of our choice, for example the Euclidean distance. We assign the training point to whichever centroid it’s closest to:

Figure 5: K-Means Assigning Training Points to the Closest Centroid
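To make the assignment step concrete, here’s a minimal NumPy sketch (the names `X`, `centroids`, and `assign_clusters` are my own, not from the original article):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Assign each training point to the index of its closest centroid.

    X         : (m, n) array of m training points with n features
    centroids : (k, n) array of k centroid positions
    Returns   : (m,) array of cluster indices in {0, ..., k-1}
    """
    # Euclidean distance from every point to every centroid: shape (m, k)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the nearest centroid for each point
    return np.argmin(distances, axis=1)
```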

Now we reposition each centroid by calculating the mean of all the training points assigned to it, using the following formula:

Equation 1: μ_K = (1 / m^K) Σ x^(i), summing over every training point x^(i) assigned to cluster K

Figure 6: K-Means Repositioning the Centroids

In Equation 1 above, m^K represents the number of training examples assigned to cluster K, and μ_K is the centroid’s new position. Finally, we repeat the assignment and repositioning steps until the centroids no longer change position (until we converge), i.e. until the mean distance from the training points to their assigned centroids stops decreasing:
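Here’s a matching sketch of the repositioning step from Equation 1, again with hypothetical names (`update_centroids`, `assignments`):

```python
import numpy as np

def update_centroids(X, assignments, k):
    """Move each centroid to the mean of the training points assigned to it (Equation 1)."""
    new_centroids = np.empty((k, X.shape[1]))
    for cluster in range(k):
        members = X[assignments == cluster]   # the m^K points assigned to cluster K
        # Assumes every cluster has at least one assigned point
        new_centroids[cluster] = members.mean(axis=0)
    return new_centroids
```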

Figure 7: K-Means Converging

And there you have it. Here’s the official pseudocode of the algorithm, taken from Andrew Ng’s introduction to machine learning course on Coursera:

Figure 8: K-Means Algorithm Pseudocode
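To tie the two steps together in code, here’s a minimal sketch of the full loop the pseudocode describes; it reuses the hypothetical `assign_clusters` and `update_centroids` helpers from above and initializes the centroids at K randomly chosen training points (one common choice, discussed further down):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Plain K-Means: initialize, then alternate the assignment and repositioning steps."""
    rng = np.random.default_rng(seed)
    # Start the centroids at k distinct, randomly chosen training points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        assignments = assign_clusters(X, centroids)           # step 1: assign points
        new_centroids = update_centroids(X, assignments, k)   # step 2: reposition centroids
        if np.allclose(new_centroids, centroids):             # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assign_clusters(X, centroids)
```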

If you’ve been following along with the series, then you might have noticed something odd. There’s one step we haven’t even considered mentioning yet. What’s the cost function of this algorithm? How do we know when we’ve found an optimal solution? Believe it or not, we’ve already hinted at K-Means’ cost function. Just as Equation 1 averages the positions of the points assigned to a single centroid, we can average the distance between every training point and its assigned centroid across the whole dataset, which gives us an overall indication of how well our model is performing:

Equation 2: J = (1 / m) Σ ||x^(i) − μ_c^(i)||² (the mean squared distance of all training points to their assigned centroids)

Here m is the total number of training points, the sum runs over every training point x^(i), and μ_c^(i) is the centroid of the cluster to which x^(i) is assigned. This is the equation we try to minimize: it is our cost function for K-Means.
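As a sketch, the cost from Equation 2 can be computed with the same hypothetical arrays used earlier:

```python
import numpy as np

def kmeans_cost(X, centroids, assignments):
    """Equation 2: the mean squared distance from each point to its assigned centroid."""
    # centroids[assignments] pairs every training point with the centroid of its cluster
    squared_distances = np.sum((X - centroids[assignments]) ** 2, axis=1)
    return squared_distances.mean()
```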

Things to Consider

Before concluding, there are a few common questions I want to address when it comes to this algorithm.

Will We Always Converge to a Global Optimum?

The answer is no. Consider the following graph:

Figure 9: Random Points on a Graph

We run K-Means on these training points. Depending on how you initialized your centroids, you can get any of the following results:

Figure 10: Optimal Solution
Figure 11: Non-Optimal Solution One
Figure 12: Non-Optimal Solution Two

While the graph in Figure 10 is obviously the right solution, it’s very possible that K-Means converges to a local optimum instead of a global one, leaving us with inaccurate results like the ones shown in Figures 11 and 12. Proving this mathematically is difficult, but there’s a great way of visualizing it. Consider four points arranged at the corners of a rectangle that is wider than it is tall:

.        .

.        .

A solution like the following, where each c marks a centroid, groups the top pair into one cluster and the bottom pair into another:

.   c    .

.   c    .

This is a valid resting point for the algorithm: once the centroids are here, no point changes cluster and the centroids stop moving. But a better solution groups the left pair and the right pair, so each centroid sits between two nearby points:

.        .
c        c
.        .

The second solution has a much lower cost, yet K-Means can happily converge to the first one. We can try to minimize the possibility of this occurring by choosing the right number of centroids K, as well as having a methodical way of randomly initializing them.
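You can see this sensitivity to initialization by running the `k_means` sketch from earlier on four such points with a few different seeds and comparing the resulting cost (the coordinates below are made up for illustration):

```python
import numpy as np

# Four points at the corners of a rectangle that is wider than it is tall
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 1.0], [4.0, 1.0]])

for seed in range(5):
    centroids, assignments = k_means(X, k=2, seed=seed)
    print(seed, assignments, kmeans_cost(X, centroids, assignments))
# Depending on which points the centroids start at, a run can end with the cheap
# left/right grouping or get stuck in the more expensive top/bottom grouping.
```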

How to Randomly Initialize the Centroids?

One way is to place the centroids at completely random positions in the feature space. Although valid, this is a highly volatile and risky way of choosing your centroids’ initial positions.

The most widely used method is to select K training points at random and use their coordinates as the centroids’ initial positions. For example, if we have:

Figure 13: Random Points on a Graph

Then we can pick any K of these training points and use their positions as the initial positions of our centroids, for example:

Figure 14: Training Points Positions Used as Centroid Initial Positions

Note that this still doesn’t guarantee that we won’t land on a local optimum, but it diminishes the chances.
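Because a single run can still get stuck, a common safeguard (not covered above, so take it as an aside) is to repeat the whole procedure several times with different random initializations of this kind and keep the run with the lowest cost; here’s a sketch reusing the hypothetical `k_means` and `kmeans_cost` helpers from earlier:

```python
import numpy as np

def k_means_best_of(X, k, n_runs=10, seed=0):
    """Run K-Means several times from different initializations and keep the cheapest result."""
    best_cost, best_centroids, best_assignments = np.inf, None, None
    for run in range(n_runs):
        # Each run picks k random training points as initial centroids (see the sketch above)
        centroids, assignments = k_means(X, k, seed=seed + run)
        cost = kmeans_cost(X, centroids, assignments)
        if cost < best_cost:
            best_cost, best_centroids, best_assignments = cost, centroids, assignments
    return best_centroids, best_assignments, best_cost
```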

How to Pick K?

The honest answer is that there’s simply no concrete way of picking the right K. There are some techniques, but none of them are guaranteed to work. At the end of the day, the best way to choose K is to go through your data and its visualizations manually and decide what looks most reasonable.

Of the few techniques available, the elbow method is the most widely used. It works by plotting the cost against the number of clusters K:

Figure 15: Elbow Method Example

In this case, we see a significant decrease in cost between K=1 and K=2 and between K=2 and K=3, but the cost decreases at a much slower pace afterward, so we can conclude that K=3 is the right number of centroids. Although it’s always worth giving this method a shot, more often than not your graph will come out looking something like this:

Figure 16: Elbow Method Counterexample

Here the decrease in cost from one value of K to the next is roughly the same everywhere, leaving us with no clear elbow and little information from which to draw a conclusion.
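If you want to try the elbow method anyway, a minimal sketch looks like the following; it reuses the hypothetical `k_means` and `kmeans_cost` helpers from earlier and uses matplotlib for the plot:

```python
import matplotlib.pyplot as plt

def elbow_plot(X, max_k=10):
    """Run K-Means for K = 1..max_k and plot the final cost against K."""
    ks = list(range(1, max_k + 1))
    costs = []
    for k in ks:
        centroids, assignments = k_means(X, k)
        costs.append(kmeans_cost(X, centroids, assignments))
    plt.plot(ks, costs, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("Cost (Equation 2)")
    plt.show()
```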

Again, the current best technique is simply looking at your data manually.

Conclusion

In this article, we defined the concept of an unsupervised learning algorithm. We then took this idea and used it to understand the theory behind one of the most popular unsupervised learning algorithms, K-Means. At the very end, we answered some of the algorithm’s most frequently asked questions and concluded that, although there are formal methods, the best way to tune your hyperparameters is to do so manually, based on visualizations.

In the next article, we’ll be discussing dimensionality reduction.

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
  4. Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition
  5. Part Four — 2: Neural Networks Part 2: Backpropagation and Gradient Checking
  6. Part Six: Evaluating Your Hypothesis and Understanding Bias vs Variance
  7. Part Seven: Support Vector Machines and Kernels


References

  1. Andrew Ng’s Machine Learning Coursera Course
  2. Stack Overflow: Why Doesn’t K-Means Give the Global Minima?
