Support Vector Machines and Kernels
This is part seven of a series I’m working on, in which we’ll discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read these in sequence. Simply because I introduce some concepts there that are key to understanding the notions discussed in this article, and I’ll be referring back to them on numerous occasions.
Today we look at our very last supervised learning algorithm, support vector machines (SVMs). We’ll study the intuition, as well as look at how we can use the concept of kernels to come up with complex decision boundaries.
Let’s get right into it.
Support Vector Machines
SVMs are widely used in both regression and classification problems. To understand the need for SVMs, consider the following classification problem with two features and two classes (Xs and Os):
What line best separates our two classes? There are many options:
Before concluding what line is best to use, we need to come up with some criteria that will help us differentiate between a good fit and a bad one. We’ve already seen some of these criteria, in the previous article. Here are a few:
- Isn’t overfitted
- Isn’t underfitted
- Generalizes well on new data
Consider now the two following examples:
And add a new data point to each:
Which graph is better? Definitely the one on the right. The one on the left falsely classifies an X as an O, while the one on the right has a distance large enough to the training points that it’s able to generalize to new data. That’s the goal of the SVM algorithm:
to find a hyperplane in an N-dimensional space that distinctly classifies the data points, where N is the number of features
Let’s look at how.
Intuition
As usual, our first step at understanding any machine learning algorithm will be to look at its cost function. In part three of the series, Logistic Regression Using Gradient Descent: Intuition and Implementation, we saw the cost function of the logistic regression algorithm:
This equation tells us that every training example contributes a cost of:
where h
is the logistic function:
When y^(i) = 1
, the right side of equation 2 will amount to zero, leaving us with:
We can draw the graph of equation 4 to get a better intuition as to how it behaves:
We see that Theta^T * X
must be much larger than zero for our model to produce a value of one. The cost function for SVMs will behave in the same way. However, rather than needing to be larger than zero, SVM will add harder constraints, requiring that Theta^T * X
be larger than one. Here’s an approximation of what that’ll look like:
We’ll refer to this as Cost_1(Theta^T * X)
. We can run the same exact analysis for when y=0
in equation 2:
We’ll refer to this case as Cost_0(Theta^T * X)
. Before we can put this all together, we need to make one final remark.
We saw when we were working with neural networks, that you can add a regularization parameter to your cost function to prevent overfitting. The reality is, this regularization parameter can and should be used with all cost functions. For example, here’s equation 1 with the regularization parameter added to it:
When working with SVMs, the notion of regularization is again used, but the convention is to use a slightly different notation. The model will aim to minimize the following function:
Notice that we still sum the square of the Theta
vector, except we no longer multiply by lambda/2m
. Instead, we multiply by a constant C
. We won’t go over why we do this but keep in mind that this is still achieving the same thing.
In all the examples we’ve seen so far, we were able to divide the different classes using a simple linear line. Of course, this won’t always be the case. Consider the following classification problem, for example:
How do we get our algorithm to come up with a decision boundary that’s complex enough to be able to separate the positive classes from the negative, like so:
The answer? Kernels.
Kernels
We’ve already looked into this problem before, when we studied neural networks in part 4.1 of the series. At the time, we wanted a solution that didn’t involve highly ordered equations.
SVM deals with complex decision boundaries in a different way. Normally, the functions under consideration in our models have the form of Thetas
being multiplied by Xs
to any order. For example:
This time around, our Xs
will be computed differently. Instead of x_1
and x_2
, we’ll have f_1
, f_2
, f_3
, etc. This change in notation is used to demonstrate that the input to our function will now have to go through a second function. To understand, consider the following graph:
l
here is reffered to as a landmark. These landmarks act as “thresholds”. The distance between each landmark and a training point will be calculated using a distance equation of our choise. We’ll refer to this distance equation as the similarity function. A similarity function is any function that calculates the similarity (distance) between any two points. We also refer to this as the kernel. For example, we can use the Gaussian kernel to calculate the similarity between our training points and the landmarks we have highlighted in figure 9:
Let’s see this in full action through an example. We’re still working with the graph in figure 9, but let’s add some training points to it:
Say now, that we want to use the following equation to fit this graph:
and that, after training this model, we come up with Theta_0 = -0.5
, Theta_1 = 1
, Theta_2 = 1
and Theta_3 = 0
.
Training point 1
is close to landmark one, so f_1
will approximetly be equal to one (since e^0 = 1
and we’re using the Gaussian kernel here). Similarly, since training point 1
is far away from landmarks two and three, f_2
and f_3
will be approximetaly equal to zero. Thus, equation 6 will be equal to -0.5 + 1 = 0.5 >= 0
so we predict one
for training point one. If you run the same calculations for points two and three, you’ll come to the conclusion that all the points close to landmarks one and two will yield a prediction of one, while all those close to landmark three will yield a prediction of zero. Thus, we can approximate the following decision boundary:
Conclusion
In this article, we introduced the support vector machine algorithm. In this algorithm, the goal is to choose a decision boundary that has a maximal distance to the training points. We showed why a large distance was necessary to prevent errors from occuring when we presented new data to our model.
We also explored the notion of kernels, and how they can be used to come up with more complex decision boundaries, for situations where our training data isn’t linearly seperable.
In the next article, we start our discussion on unsupervised learning algorithms. Until then, I leave you with the following points to ponder upon:
- We saw how landmarks are used to come up with more complex decision boundaries. How do we chose the position of our landmarks? This is a difficult concept to come up with on your own, so I refer you to Andrew Ng’s lesson on doing so.
- Why do we need SVMs if we already have neural networks?
- There are a few different ways of grasping the intuition behind SVMs. In this article, we derived an understanding by first looking at our logistic regression implementation, then manipulating it in a way that it can be used with SVMs. A different way is to look at it using vector manipulations. Here’s an awesome lecture describing SVMs from this lens.
Past Articles
- Part One: Data Pre-Processing
- Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
- Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
- Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition
- Part Four — 2: Neural Networks Part 2: Backpropagation and Gradient Checking
- Part Six: Evaluating Your Hypothesis and Understanding Bias vs Variance
Shameless Plug
- Twitter: twitter.com/ali_khanafer2
- LinkedIn: linkedin.com/in/ali-khanafer-319382152/