Support Vector Machines and Kernels

Ali H Khanafer
Published in Geek Culture · Jun 15, 2021

This is part seven of a series I’m working on, in which we’ll discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read them in sequence, simply because I introduce some concepts there that are key to understanding the notions discussed in this article, and I’ll be referring back to them on numerous occasions.

Today we look at our very last supervised learning algorithm, support vector machines (SVMs). We’ll study the intuition, as well as look at how we can use the concept of kernels to come up with complex decision boundaries.

Let’s get right into it.

Support Vector Machines

SVMs are widely used in both regression and classification problems. To understand the need for SVMs, consider the following classification problem with two features and two classes (Xs and Os):

Figure 1: Classification Problem with Two Features and Two Classes

What line best separates our two classes? There are many options:

Figure 2: Possible Lines Fitting the Classification Problem

Before deciding which line is best, we need to come up with some criteria that will help us differentiate between a good fit and a bad one. We’ve already seen some of these criteria in the previous article. Here are a few:

  • Isn’t overfitted
  • Isn’t underfitted
  • Generalizes well to new data

Consider now the two following examples:

Figure 3: Different Possibilities of Fitting a Line Through the Same Points

And add a new data point to each:

Figure 4: Demonstration of How Different Graphs Generalize To New Data

Which graph is better? Definitely the one on the right. The one on the left misclassifies an X as an O, while the one on the right keeps a large enough distance from the training points to generalize to new data. That’s the goal of the SVM algorithm:

to find a hyperplane in an N-dimensional space that distinctly classifies the data points, where N is the number of features
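Before digging into how the algorithm finds that hyperplane, here’s a rough preview of what this looks like in practice. This is only a sketch using scikit-learn’s SVC with a linear kernel; the toy data below is made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, well-separated clusters standing in for the Xs and Os
X = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)     # C is the regularization constant discussed later
clf.fit(X, y)

print(clf.coef_, clf.intercept_)      # parameters of the separating hyperplane
print(clf.predict([[2, 2], [6, 6]]))  # -> [0 1]
```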

Let’s look at how.

Intuition

As usual, our first step in understanding any machine learning algorithm will be to look at its cost function. In part three of the series, Logistic Regression Using Gradient Descent: Intuition and Implementation, we saw the cost function of the logistic regression algorithm:

Equation 1: Logistic Regression Cost Function

This equation tells us that every training example contributes a cost of:

Equation 2: Cost of Every Individual Training example

where h is the logistic function:

Equation 3: Logistic Function Used in Logistic Regression

When y^(i) = 1, the second term of equation 2 vanishes, leaving us with:

Equation 4: Cost When y=1
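For reference, here are the standard forms these captions refer to, written out in the notation used in the earlier logistic regression article:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1 - h_\theta(x^{(i)})\big)\Big] \quad \text{(Equation 1)}$$

$$\text{per-example cost} = -y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log\big(1 - h_\theta(x^{(i)})\big) \quad \text{(Equation 2)}$$

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \quad \text{(Equation 3)}$$

so that, when y^(i) = 1, the cost reduces to $-\log h_\theta(x^{(i)})$ (Equation 4).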

We can draw the graph of equation 4 to get a better intuition as to how it behaves:

Figure 5: Graph of Logistic Regression Cost Function When y=1

We see that Theta^T * X must be much larger than zero for the cost to approach zero, that is, for our model to confidently produce a value of one. The cost function for SVMs behaves in the same way. However, rather than only needing Theta^T * X to be larger than zero, the SVM adds a harder constraint, requiring that Theta^T * X be at least one. Here’s an approximation of what that’ll look like:

Figure 6: Graph of SVM Cost Function When y=1

We’ll refer to this as Cost_1(Theta^T * X). We can run the exact same analysis for when y=0 in equation 2:

Figure 7: Graph of SVM Cost Function When y=0

We’ll refer to this case as Cost_0(Theta^T * X). A rough sketch of both cost terms, in code, is given below; after that, we need to make one final remark before we can put everything together.
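To make the two cost terms concrete, here’s a minimal NumPy sketch of the hinge-style functions that the graphs above approximate. The exact slope isn’t important, only that the cost is zero past the margin and grows linearly otherwise:

```python
import numpy as np

def cost_1(z):
    """Cost term used when y = 1: zero once z = theta^T x >= 1, linear otherwise."""
    return np.maximum(0, 1 - z)

def cost_0(z):
    """Cost term used when y = 0: zero once z = theta^T x <= -1, linear otherwise."""
    return np.maximum(0, 1 + z)

z = np.linspace(-3, 3, 7)   # [-3, -2, -1, 0, 1, 2, 3]
print(cost_1(z))            # [4. 3. 2. 1. 0. 0. 0.]
print(cost_0(z))            # [0. 0. 0. 1. 2. 3. 4.]
```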

We saw, when we were working with neural networks, that you can add a regularization parameter to your cost function to prevent overfitting. In reality, this regularization parameter can and should be used with all cost functions. For example, here’s equation 1 with the regularization parameter added to it:

Equation 5: Logistic Regression Cost Function With Regularization Parameter
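Written out, this regularized version just appends a penalty on the parameters to Equation 1:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$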

When working with SVMs, the notion of regularization is again used, but the convention is to use a slightly different notation. The model will aim to minimize the following function:

Equation 6: SVM Cost Function With Regularization Parameter

Notice that we still sum the squares of the Theta values, but nothing is multiplied by lambda/2m anymore. Instead, the cost term is weighted by a constant C, which plays a role similar to 1/lambda. We won’t go over why this convention is used, but keep in mind that it achieves the same trade-off between fitting the data and keeping the parameters small.
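Written out in the same notation (following the convention used in Andrew Ng’s course, which this series is based on), the SVM objective is:

$$\min_\theta \; C\sum_{i=1}^{m}\Big[y^{(i)}\,\text{Cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{Cost}_0(\theta^T x^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

Here C weights the misclassification cost rather than the regularization term, so a large C behaves like a small lambda (less regularization) and a small C behaves like a large lambda.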

In all the examples we’ve seen so far, we were able to divide the different classes using a straight line. Of course, this won’t always be the case. Consider the following classification problem, for example:

Figure 8: Complex Arrangement of Training Points

How do we get our algorithm to come up with a decision boundary that’s complex enough to be able to separate the positive classes from the negative, like so:

Figure 9: Decision Boundary

The answer? Kernels.

Kernels

We’ve already looked into this problem before, when we studied neural networks in part 4.1 of the series. At the time, we wanted a solution that didn’t involve high-order polynomial terms.

SVMs deal with complex decision boundaries in a different way. Normally, the functions under consideration in our models consist of Thetas multiplied by Xs raised to various powers. For example:

Equation 7: Arbitrary Example of A Function To Be Optimized

This time around, our Xs will be computed differently. Instead of x_1 and x_2, we’ll have f_1, f_2, f_3, and so on. This change in notation indicates that each input to our function now has to pass through a second function first. To understand, consider the following graph:

Figure 10: Landmarks

l here is referred to as a landmark. These landmarks act as “thresholds”. The distance between each landmark and a training point is calculated using a distance equation of our choice. We’ll refer to this distance equation as the similarity function. A similarity function is any function that calculates the similarity (distance) between any two points. We also refer to this as the kernel. For example, we can use the Gaussian kernel to calculate the similarity between our training points and the landmarks we have highlighted in figure 10:

Figure 11: Example of Gaussian Kernel Used to Calculate Similarity
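As a minimal sketch, the Gaussian kernel can be written as a small function. The landmark coordinates and the bandwidth sigma below are made up purely for illustration:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity between a point x and a landmark: ~1 when close, ~0 when far."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x  = np.array([1.0, 2.0])      # a training point
l1 = np.array([1.0, 2.1])      # a nearby landmark
l2 = np.array([8.0, -3.0])     # a distant landmark

print(gaussian_kernel(x, l1))  # ~0.995 -> very similar
print(gaussian_kernel(x, l2))  # ~0.0   -> not similar
```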

Let’s see this in full action through an example. We’re still working with the graph in figure 10, but let’s add some training points to it:

Figure 12: Landmarks and Training points

Say now that we want to use the following equation to fit this graph:

Equation 8: Equation We Use to Fit Our Training Data

and that, after training this model, we come up with Theta_0 = -0.5, Theta_1 = 1, Theta_2 = 1 and Theta_3 = 0.

Training point 1 is close to landmark one, so f_1 will be approximately equal to one (since e^0 = 1 and we’re using the Gaussian kernel here). Similarly, since training point 1 is far away from landmarks two and three, f_2 and f_3 will be approximately equal to zero. Thus, equation 8 will be equal to -0.5 + 1 = 0.5 >= 0, so we predict one for training point one. If you run the same calculations for points two and three, you’ll conclude that all the points close to landmarks one and two yield a prediction of one, while all those close to landmark three yield a prediction of zero. Thus, we can approximate the following decision boundary:

Figure 13: Landmarks, Training points and Decision Boundary
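Here’s the same calculation sketched in code. The coordinates below are hypothetical stand-ins for the points in the figure; all that matters is that training point 1 sits close to landmark one and far from landmarks two and three:

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x  = np.array([3.0, 2.0])                 # training point 1 (hypothetical coordinates)
l1 = np.array([3.0, 2.0])                 # landmark 1 (close to x)
l2 = np.array([0.0, 9.0])                 # landmark 2 (far from x)
l3 = np.array([9.0, 0.0])                 # landmark 3 (far from x)

theta = np.array([-0.5, 1.0, 1.0, 0.0])   # [Theta_0, Theta_1, Theta_2, Theta_3]

f = np.array([1.0,                        # f_0 = 1 (bias term)
              gaussian_kernel(x, l1),     # f_1 ~ 1
              gaussian_kernel(x, l2),     # f_2 ~ 0
              gaussian_kernel(x, l3)])    # f_3 ~ 0

score = theta @ f                         # ~ -0.5 + 1 = 0.5
print(score >= 0)                         # True -> predict class one
```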

Conclusion

In this article, we introduced the support vector machine algorithm, whose goal is to choose a decision boundary that keeps a maximal distance to the training points. We showed why a large distance is necessary to prevent errors from occurring when new data is presented to the model.

We also explored the notion of kernels, and how they can be used to come up with more complex decision boundaries for situations where our training data isn’t linearly separable.

In the next article, we start our discussion on unsupervised learning algorithms. Until then, I leave you with the following points to ponder upon:

  • We saw how landmarks are used to come up with more complex decision boundaries. How do we choose the positions of our landmarks? This is a difficult concept to come up with on your own, so I refer you to Andrew Ng’s lesson on doing so.
  • Why do we need SVMs if we already have neural networks?
  • There are a few different ways of grasping the intuition behind SVMs. In this article, we built an understanding by starting from our logistic regression implementation and adapting it to SVMs. A different approach is to derive it geometrically, using vectors. Here’s an awesome lecture describing SVMs through that lens.

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
  4. Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition
  5. Part Four — 2: Neural Networks Part 2: Backpropagation and Gradient Checking
  6. Part Six: Evaluating Your Hypothesis and Understanding Bias vs Variance

References

  1. Andrew Ng’s Machine Learning Coursera Course
  2. Rohith Gandhi’s Support Vector Machine — Introduction to Machine Learning Algorithms
  3. MIT OpenCourseWare’s Learning: Support Vector Machines
