Deep Learning Part 2: Vanilla vs Stochastic Gradient Descent

How vanilla gradient descent works, and why it’s inefficient to the point where stochastic gradient descent is needed

13 min read · Aug 16, 2021


Welcome to part 2 of my introductory series on deep learning, where we aim to acquaint you with fundamental DL concepts. Refer to the Deep Learning Series section at the bottom for all previous articles. In this piece, we discuss vanilla and stochastic gradient descent (SGD).

Gradient descent (GD) is one of the most widely used learning algorithms when training machine learning systems and neural networks. It’s a fundamental concept in the world of computer intelligence, and as a result, is one of the first notions taught to new AI students.

Gradient descent is no new notion. In fact, it was first suggested by French mathematician Augustin-Louis Cauchy all the way back in 1847 [1]. As such, there is no shortage of work describing and analyzing the algorithm, and writing yet another piece on how GD works wouldn’t add much value, since it’d repeat what many other writers have already covered.

One topic that isn’t often discussed in much detail, however, is the inefficiency of the basic (vanilla) gradient descent algorithm, and how stochastic gradient descent helps overcome it. That’s what we’ll aim to do in this article.

In case you’re still not fully acquainted with the optimization algorithm, we’ll start off by giving a brief description. We’ll then describe the differences between vanilla and stochastic gradient descent, and finally, run an analysis on how one compares to the other by using them on a neural network and comparing different performance metrics.

Let’s get right into it.

Gradient Descent

Algorithm

Before understanding the difference between vanilla and stochastic gradient descent, let’s first look at how vanilla GD works, and from there see what changes we make for stochastic GD.

As aforementioned, gradient descent is a learning algorithm. But, what is it learning? Gradient descent is also referred to as an optimization algorithm. But, what is it optimizing? Let’s start by answering the first question.

The entire idea behind the field of machine learning is that our algorithms learn on their own, “from experience”. Take the following neural network for example. This network takes as input the 784 pixels of a 28x28 grayscale image, and outputs the digit between 0 and 9 that’s written on the image:

Figure 1: Example Neural Network for Digit Recognition

There are 3 * 784 = 2352 weights between the input layer and the hidden layer, and 3 * 15 = 45 weights between the hidden layer and the output layer for a total of 2397 weights. Not to mention the biases. Are we manually going to choose these weights and biases? Then manually update them if they don’t perform the way we want? Of course not. Instead, we’ll use gradient descent to learn these weights. But how? “How” is the answer to the second question we posed.

What we aim to optimize is the error produced by our neural network. In our case, optimization refers to minimizing the error.

Gradient descent will learn by example. We’ll feed our network images to which the number written is already known, and we’ll see if our NN is able to identify it correctly by comparing the output to the actual result. We then analyze how large this error is, and try to adjust our weights accordingly, in an attempt to minimize the error produced. The function we use to calculate the error of our learning algorithm is called the cost function.

There are many different cost functions available. In our case, we’ll use the mean squared error (MSE):

Equation 1: Mean Squared Error

Where J represents the cost function, w and b are vectors of the weights and biases of our model, n is the number of training examples we have, y_i is the actual result of our input and z is the weighted sum presented in equation 1 of part 1 i.e. the result predicted by our model.
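Written out with the symbols above (and following the convention in [3], which carries an extra factor of 1/2 for convenience; the constant doesn’t change anything that follows), the cost takes roughly the form:

J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - z_i \right)^2

where z_i is the prediction our model produces for the i-th training example.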

Notice that the error is based on an average of the results for all the training examples. Remember this point, as it will come back when discussing the differences between stochastic and vanilla gradient descent.

The goal then is to minimize J. That is, find w and b that will produce the smallest error:

Figure 2: Minimizing Cost Function

The best way to understand how we can minimize J is by looking at a case where the size of the vector w is one ([w_1]), and we have no biases. We can graph the cost J(w) as a function of w_1:

Figure 3: J as a Function of w_1

Visually, it’s easy to tell what w_1 will minimize J. But how do we do this algebraically? Some calculus students might think of solving for dJ/dw_1 = 0, but that’s not always feasible for more complicated functions [2], for example:

Figure 4: Example of More Complicated Cost Function

Things get especially complicated in the digit recognition problem above, where we had over 2,000 weights and a cost function defined over a space of more than 2,000 dimensions.

A better approach is to start with a random value for w_1, then determine the direction in which to step in order to get a lower result. We can do this by finding the slope of J(w) at w_1. Consider a random point in figure 3:

Figure 5: Random Point On J vs w_1 Graph

We can calculate the slope of the point:

Figure 6: Slope of Random Point On J vs w_1 Graph

and since it’s negative, we know that we need to shift w_1 to the right, i.e. increase w_1. If it were positive, we’d shift left. We can keep doing this until we reach a slope near zero, indicating we’ve reached a global or local minimum:

Figure 7: Different Slopes Before Reaching Minimum

We can extend this idea to a multivariable cost function J, as we had in equation 1, where J was a function of both w and b. Just as we calculated the slope in the univariate case, in the multivariate case we’ll calculate the gradient (∇J), a vector giving us the direction of steepest ascent. Intuitively, to find the direction of steepest descent, we take the negative of our gradient -∇J.

If you understood everything I’ve just described, then the rest of the algorithm is self-explanatory:

  1. Initialize the weights and biases randomly
  2. Calculate the cost caused by these weights and biases
  3. Update all weights w_k and biases b_l based on the update rule, where k indexes the weights and l the biases (written out in standard notation right after this list):
Figure 8: Update the Weights and Biases Rule [3]

  4. Restart from step 2 until the algorithm converges to a local or global minimum
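In standard notation (consistent with [3]), the update rule in step 3 reads roughly:

w_k \rightarrow w_k' = w_k - \eta \frac{\partial J}{\partial w_k}, \qquad b_l \rightarrow b_l' = b_l - \eta \frac{\partial J}{\partial b_l}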

η is referred to as the learning rate. I won’t be explaining it in this article and will leave it to your own research.
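To make the loop concrete, here’s a minimal sketch of those four steps for a one-dimensional cost like the one in figure 3. The cost function, starting range, and stopping threshold are all made up for illustration; this is not the network code used later in the article:

```python
import random

def cost(w):
    # A made-up one-dimensional cost J(w), standing in for figure 3
    return (w - 3) ** 2 + 1

def slope(w, h=1e-6):
    # Numerical estimate of dJ/dw; a real implementation would use the analytic gradient
    return (cost(w + h) - cost(w - h)) / (2 * h)

eta = 0.1                        # learning rate
w = random.uniform(-10, 10)      # step 1: start from a random value

for step in range(10_000):       # steps 2-4: keep updating until convergence
    grad = slope(w)              # step 2: how steep is J at the current w?
    w -= eta * grad              # step 3: step against the slope
    if abs(grad) < 1e-8:         # a near-zero slope means we've (locally) converged
        break

print(f"w = {w:.4f}, J(w) = {cost(w):.4f}")   # should land near the minimum at w = 3
```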

Problem: Inefficiency

So where’s the problem? Why is this inefficient? Why do we need a variant like stochastic gradient descent? To answer these questions, let’s step away from neural networks for a moment and consider a situation where we want to minimize the mean squared error of a linear regression model. Since we don’t need the chain rule to compute the partial derivatives for linear regression, this assumption makes it easier to continue with the explanation of GD. (Computing the partial derivatives of a neural network requires an algorithm called backpropagation, which is the topic of next week’s article.) The partial derivative of equation 1 is:

Equation 2: Gradient of the Mean Squared Error Function

Note that we expanded z to its full form in the equation above. The gradient for the biases will be the same except we replace all w_k in equation 2 with b_l.
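Assuming a linear model, z_i = \sum_j w_j x_{i,j} + b, and the same 1/(2n) convention as before, this partial derivative with respect to a weight w_k takes roughly the form:

\frac{\partial J}{\partial w_k} = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \Big( \sum_j w_j x_{i,j} + b \Big) \right) x_{i,k}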

Do you see a problem here? Imagine we had 20,000 weights and biases, and n = 10,000 training examples. For every update to some w_k we want to make, i.e. step 3 in the GD algorithm above, we have to sum over all n training examples, leaving us with 20,000 × 10,000 = 200,000,000 operations every time we run step 3. Multiply that by the number of times we run this update step before the algorithm converges, and that’s a lot of operations. This is why the basic (vanilla) GD algorithm is inefficient. Although the partial derivatives for neural networks will be different, this idea of summing over all training examples holds.
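To see where that cost comes from, here’s a hypothetical NumPy sketch of a single full-batch update for the linear-regression case (shapes are scaled down from the 20,000-weight, 10,000-example scenario so the arrays fit comfortably in memory, and the data is random): every single update touches all n training examples.

```python
import numpy as np

# Scaled-down stand-in for the scenario above; shapes and data are made up
n, d = 1_000, 2_000
X = np.random.randn(n, d)     # one row per training example
y = np.random.randn(n)        # targets
w = np.zeros(d)               # weights (bias omitted for brevity)
eta = 0.01                    # learning rate

# One vanilla GD step: the gradient sums over ALL n training examples,
# so each update costs on the order of n * d operations
z = X @ w                     # predictions for every training example
grad = -(X.T @ (y - z)) / n   # gradient of the MSE with respect to w
w -= eta * grad               # update every weight at once
```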

Solution: Stochastic Gradient Descent

We can think of the difference between vanilla and stochastic gradient descent as the difference between waterfall and agile development, respectively. In the waterfall process, we attempt to get everything right from the very first iteration. It’s a linear process involving requirements gathering, design, implementation, verification, and maintenance. Every step must be completed before moving on to the next, and we can never go back to a previous step. The end result? You get to the end of the process and notice that your project is full of bugs, your customer’s requirements have completely changed, and much of the work you put in over the last six months is wasted. On the other hand, the agile methodology involves smaller iterations, with the constant deployment of a product in order to get customer feedback, make appropriate changes, re-deploy, and repeat the process, all within a week or two.

In stochastic gradient descent, rather than calculating the error as an average of all the training examples, we select m random training examples from the entire dataset and use that in our cost function. We call this subset a mini-batch. Once our network has been trained on all the data points in our mini-batch, we select a new subset of random points and train our model with that. We continue this process until we’ve exhausted all training points, at which point we’ve completed an epoch. We then start with a new epoch and continue until convergence.
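A minimal sketch of that loop, assuming a generic gradient(w, X_batch, y_batch) helper (hypothetical, not the code used in the comparison below), might look like this:

```python
import numpy as np

def sgd(w, X, y, gradient, eta=0.1, m=10, epochs=30):
    # gradient(w, X_batch, y_batch) is any function returning the gradient of
    # the cost evaluated on just the given mini-batch (a hypothetical helper).
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)                    # shuffle the training set
        for start in range(0, n, m):                        # walk through it m points at a time
            batch = order[start:start + m]                  # one mini-batch of (at most) m indices
            w = w - eta * gradient(w, X[batch], y[batch])   # update using only these m points
        # finishing all mini-batches completes one epoch
    return w
```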

More formally, equation 2 becomes:

Equation 3: Stochastic Gradient Descent Gradient Estimate

Where m is the size of the mini-batch. If m is large enough, this works because [3]:

Equation 4: Mini-Batches Compared to Full Batch
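In the notation of [3], the justification is (roughly) that the average gradient over a random mini-batch approximates the average over the full training set:

\frac{1}{m} \sum_{j=1}^{m} \nabla J_{x_j} \approx \frac{1}{n} \sum_{i=1}^{n} \nabla J_{x_i} = \nabla J

where \nabla J_{x} denotes the gradient of the cost computed on the single training example x.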

Comparison

So how does SGD compare to basic GD? In this section, we’ll try to answer that question by running an analysis and looking at some metrics. More specifically we’ll look at how they compare in terms of time to converge, CPU utilization, and accuracy for different dataset sizes.

We’re going to use the digit recognition neural network created by Michael Nielsen in chapter 1 of his online introductory deep learning book. This network will have 28x28=784 inputs, one hidden layer with 30 neurons, and an output layer with 10 output neurons. The code can be found here, under src/network.py. The dataset we’ll be using is the MNIST Database of Handwritten Digits, which contains 60,000 handwritten digits.

As we’ll see, basic gradient descent can take a very long time to converge when used with neural networks, sometimes more than 24 hours. As such, we’ll simulate gradient descent by running SGD for 200 epochs with a mini-batch size equal to the full training set. We’ll then use the metrics we get to estimate the actual outcome of gradient descent if we were to run it to completion. For a full simulation of vanilla gradient descent, we’d have to run SGD for a minimum of (784*30) + (30*10) = 23820 epochs, but that would take far too long to finish. Doing things this way gives us less accurate results, but saves us a lot of time and still gives us a good idea of the differences between SGD and vanilla GD.

As for the epochs, mini-batch size, and learning rate used in the SGD runs, we’ll keep them constant at 30, 10, and 3 respectively.
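For reference, a run with these hyperparameters, using the interface of Nielsen’s src/network.py as described in chapter 1 of his book (treat the exact call as an approximation if the repository has changed since), looks roughly like this:

```python
import mnist_loader   # helper that ships alongside network.py in Nielsen's repository
import network        # src/network.py from the book's code

# Load MNIST and build the 784-30-10 network described above
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network.Network([784, 30, 10])

# 30 epochs, mini-batch size 10, learning rate eta = 3.0, evaluated on the test set after each epoch
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```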

Stochastic Gradient Descent

Let’s first look at how SGD performs.

Table 1: Stochastic Gradient Descent Results
Figure 9: Training Set Size vs Time to Completion (SGD)

As expected, the time increases linearly with the training set size. With a dataset size of 50,000, it takes approximately two and a half minutes (148 seconds) to complete. Keep in mind that this is with the constant hyperparameters (learning rate, mini-batch size, etc.) we chose; we could have tried tweaking these values and maybe gotten a slightly faster algorithm. The slope of this line is approximately 0.0031 seconds per training point, giving us an approximate 30-second increase for every additional 10,000 training points.

Let’s now look at its CPU utilization:

Figure 10: Training Set Size vs CPU Utilization (SGD)

The Mac we’re using usually sits at a CPU utilization of between two and ten percent. We see from figure 10 that CPU utilization is roughly constant at around 50%, with only two to three percentage points of change between dataset sizes. Do these results conform with what we expected to see? Not really. As we’ll see later, vanilla GD will have similar CPU utilization results. But the whole point of SGD is that it’s more efficient, so these results go against our assumptions. This could be happening for multiple reasons:

  1. I’m working on a relatively old computer (MacBook Pro 2016)
  2. I didn’t run vanilla GD to completion
  3. Human error when calculating CPU utilization

So, we’ve confirmed that the algorithm is fast and that its CPU utilization isn’t too high. But, most importantly, how accurate is it?

Figure 11: Training Set Size vs Accuracy (SGD)

Awesome! With a training set of just 10,000 data points, we’re able to get an accuracy of 92% in just over 30 seconds. On the other hand, if we wish to be even more accurate, we can increase our training set size to 50,000 and get an accuracy of 95% in two and a half minutes. All this without going above 50% CPU utilization.

Vanilla Gradient Descent

Let’s now see how vanilla gradient descent performs on the same exact training data and neural network.

Table 2: Vanilla Gradient Descent Results

To estimate total time, we calculated the time it took to complete one epoch, then multiplied that by the minimum number of epochs we would have to run if we were to run to completion (23820 epochs).

Figure 12: Training Size vs Estimated Time to Completion (GD)

Now, do you see why we need SGD for training neural networks? For a training set of only 10,000 points, we need approximately 3 hours to train our model. Compare this to a real-world scenario, where we normally have millions of data points to train our neural network on. Notice also the slope of approximately 1.19 seconds per training point, compared to the 0.0031 we had when working with SGD: nearly a 400-fold difference in how quickly the time to completion grows per 10,000 data points added.

Is this algorithm computationally expensive? Let’s see:

Figure 13: Training Set Size vs CPU Utilization (GD)

Although the utilization fluctuates a little more than it did for SGD, the results are very similar. Again, this could be due to a lot of different things, as aforementioned.

Finally, let’s see how well GD trains our model:

Figure 14: Training Set Size vs Accuracy (GD)

Do these results make complete sense? No. We should be getting accuracies close to those we saw with SGD. But we anticipated this gap, because we didn’t run the algorithm to completion. It’s important to recognize that, run to completion, gradient descent would eventually converge to a local minimum, yielding results as good as the ones we saw with SGD.

One interesting characteristic to notice, however, is that the accuracy gets lower as we increase the training set size. Why could this be? One plausible answer is that our learning rate is too high or too low. It could also be an artifact of the way we’re simulating gradient descent through SGD. Does it matter? Not really. All we’re trying to do is get an idea of how much better SGD performs. After 211 seconds of running basic gradient descent, the highest accuracy we were able to get with a training set of 10,000 was 61%. For the same amount of data, and less time, we were able to get 92% accuracy with SGD.

Conclusion

In this article, we aimed to answer a few important questions surrounding vanilla and stochastic gradient descent.

By first describing the basic functionality of gradient descent, we were able to point out a very big problem: In order to calculate the gradient of our cost function at some point w, one needs to compute a sum over all training points. This left us with an extremely computationally heavy algorithm. How heavy? The second part of this article explored this inefficiency and aimed to determine how much better stochastic gradient descent performs.

Although we took some shortcuts instead of running basic gradient descent to completion, we were able to conclude that stochastic gradient descent is not only faster, but also produces better results in a much shorter time period.

But is basic gradient descent never to be used? Not quite. There are other machine learning algorithms that don’t require as many summations as neural networks do, such as linear regression. In such cases, it might even be better to avoid stochastic gradient descent, as the code is more complicated and there are more hyperparameters (such as the mini-batch size) to deal with.

Deep Learning Series

References

[1] Wikipedia, Gradient Descent (2020), Wikipedia Piece on Gradient Descent

[2] Grant Sanderson, Gradient descent, how neural networks learn | Chapter 2, Deep learning (2017), 3Blue1Brown Youtube Channel

[3] Michael A. Nielsen, Neural Networks and Deep Learning (2015), Determination Press
