Neural Networks Part 2: Backpropagation and Gradient Checking

Ali H Khanafer · Published in Geek Culture · Jun 1, 2021

This is part five of a series I’m working on, in which we discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read them in sequence, since I introduce concepts in previous posts that are key to understanding neural networks, and I’ll be referring back to them on numerous occasions.

In the previous article, we went through the basics of neural networks. We looked at the theory, as well as some basic terminology.

In this article, we continue with the same topic, this time looking more closely at how gradient descent is used along with the backpropagation algorithm to find the right Theta vectors.

Let’s get right into it.

Backpropagation

In all the previous algorithms we’ve seen, gradient descent was used to find a Theta vector that minimizes our cost function. To achieve this, we used the following algorithm:

Figure 1: Gradient Descent Algorithm
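Concretely, the update rule shown in the figure is the standard gradient descent step from the earlier parts of this series, repeated for every parameter j until convergence, where alpha is the learning rate:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
```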

The bulk of the algorithm lies in finding the derivative of the cost function J. The difficulty of this task depends on how complicated our cost function is. We saw, for linear and logistic regression, that differentiating the cost function required no more than the basic rules of differentiation. When it comes to the cost function we’ll be using for neural networks, the task of partial differentiation becomes much more difficult. That’s where the backpropagation algorithm comes into play: it’s the algorithm used to differentiate the cost function of neural networks.

Before we present the cost function, let’s define a few variables:

  1. L is the total number of layers in the network
  2. s_l is the number of activation units in layer l. This doesn’t include the bias unit
  3. K is the number of output units

We can now present the cost function:

Equation 1: Neural Network Cost Function
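Written out with the variables defined above (this is the standard regularized form taught in Ng’s course), the cost function is:

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
\left[ y_k^{(i)} \log\!\left( h_\Theta(x^{(i)})_k \right)
+ \left( 1 - y_k^{(i)} \right) \log\!\left( 1 - h_\Theta(x^{(i)})_k \right) \right]
+ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{ji}^{(l)} \right)^2
```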

Looks scary, I know. But in reality, it isn’t so bad. Compare it to the cost function of the logistic regression model, and you’ll notice the similarities:

Equation 2: Logistic Regression Cost Function
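For reference, the regularized logistic regression cost function is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\left[ y^{(i)} \log h_\theta(x^{(i)})
+ \left( 1 - y^{(i)} \right) \log\!\left( 1 - h_\theta(x^{(i)}) \right) \right]
+ \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
```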

With one output, the two cost functions are equivalent. Only once we have more than one output in our neural network do we start seeing a difference between the functions. The summation from one to K is added to account for the multiple output nodes. Thus, h_Θ(x^(i))_k represents the h function used at the kth output node. In our case, the logistic function is used. The second term in the cost function (the one which starts with λ/2m) is referred to as the regularization term and is used in cost functions to prevent overfitting. I haven’t spoken about the regularization term before, and won’t go into detail in this article. What’s important to note in our case is that this term is applied to all Theta values. That is, all values in the Theta matrices defined at the different layers of our neural network will be squared, summed, and, at the very end, multiplied by a factor of λ/2m.

Now that that’s out of the way, let’s look at the backpropagation algorithm.

Intuition

The algorithm itself isn’t hard to understand. Understanding how and why it works to give us the partial derivative we’re looking for, however, is where things get a little more complicated. Our aim is to give a basic intuition; we won’t be looking at any complicated math or proofs. If you’re interested in an in-depth explanation of everything the algorithm is doing, I’ll refer you to chapter 2 of Michael Nielsen’s book on neural networks and deep learning. Also, keep in mind that you most likely won’t understand the whole gist of it from the get-go. Don’t let that demoralize you. It takes lots of time and hands-on work with the algorithm to start getting the slightest idea of why it works the way it does.

Consider a neural network with three inputs, one hidden layer and one output:

Figure 2: Neural Network with Three Inputs and One Output

Imagine we’re given three different neural networks, each with exactly the same structure as the network shown in Figure 2, and assume the dataset we’re examining is the same for all three. How would we compare the performance of one network against the other two? We can compute a value as simple as the difference between the expected value and the outputted value (the error).

What values can we alter to reduce the error in our network? Definitely not the inputs, since those are provided to us. We also can’t directly change the values computed by our activation nodes, but we can do so indirectly by changing our Thetas. Consider the red arrow in Figure 2. Any change made to the weight of that arrow (the theta associated with it) is propagated to a_11, then to the output node, and finally to the error. If we had more than one hidden layer, a change made to the weight of the red arrow would also propagate to the activation nodes in that layer. This means that every weight in the network has an effect on our error.

If we can find the impact that each theta has on our error, then what we’re really doing is finding the direction and magnitude with which these thetas manipulate our cost function. If you read part two of this series, Linear Regression Using Gradient Descent: Intuition and Implementation, you’ll recognize this as the definition of differentiating our cost function. To find the impact of our Thetas, we need to move backward from our output node, after having computed it along with our error. That’s where the term “backpropagation” comes from. Again, this can be hard to wrap your head around. Re-read this section and review your notes on partial differentiation to get a good feel for what this algorithm is doing.

Let’s look at the pseudocode of the full algorithm:

Figure 3: Back Propagation Algorithm

D^(l)_ij, calculated at the very end, is the final result, i.e. the partial derivative of our cost function. Delta^(l)_ij (the triangle on the second line) is an accumulator matrix that we initialize to 0; it is updated as we go and used at the very end to calculate the partial derivative of our cost function. a^(1) is set to x^(i) because it’s our input layer. Forward propagation is the act of actually computing the values of our activation and output nodes from our input variables. After we perform forward propagation, we calculate the error at our output using the equation described in line 6. The errors for the remaining layers (line 7) are calculated using the following equation:

Figure 4: Calculating Delta for l = L-1, L-2,…, 2
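Written out (a reconstruction of the standard formula; with the sigmoid activation, the factor a^(l) .* (1 − a^(l)) is the derivative g′(z^(l))):

```latex
\delta^{(l)} = \left( \Theta^{(l)} \right)^{T} \delta^{(l+1)} \mathrel{.*} a^{(l)} \mathrel{.*} \left( 1 - a^{(l)} \right)
```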

Where Theta^(l) is the Theta matrix at layer l, delta^(l+1) is the error vector of layer l+1 (the lowercase delta, not to be confused with the capital-Delta accumulator initialized on line 2; it was computed in the previous step, since we move backward), .* is an element-wise vector multiplication, and a^(l) is the vector of activation values at layer l. From there, we can update our Delta^(l)_ij accumulator matrix as we do in line 8, and finally compute our derivative in lines 9 and 10.
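To make the pseudocode concrete, here is a minimal NumPy sketch of the algorithm for a three-layer network like the one in Figure 2. The function name, the Theta1/Theta2 shapes, and the explicit per-example loop are my own illustrative choices, not something fixed by the figure; a production implementation would vectorize over the whole dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Theta1, Theta2, lam):
    """One pass of backpropagation for a three-layer network.

    X: (m, n) inputs; Y: (m, K) one-hot labels.
    Theta1: (s2, n + 1) and Theta2: (K, s2 + 1) weight matrices.
    Returns D1, D2: gradients of the regularized cost w.r.t. Theta1, Theta2.
    """
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)  # accumulators, initialized to 0 (line 2)
    Delta2 = np.zeros_like(Theta2)

    for i in range(m):
        # Forward propagation: compute every activation from the inputs.
        a1 = np.concatenate(([1.0], X[i]))   # input layer plus bias unit
        z2 = Theta1 @ a1
        h2 = sigmoid(z2)
        a2 = np.concatenate(([1.0], h2))     # hidden layer plus bias unit
        a3 = sigmoid(Theta2 @ a2)            # output layer, h_Theta(x)

        # Error at the output (line 6), then at the hidden layer (Figure 4).
        d3 = a3 - Y[i]
        d2 = (Theta2.T @ d3)[1:] * h2 * (1.0 - h2)  # drop the bias entry

        # Accumulate each example's gradient contribution (line 8).
        Delta1 += np.outer(d2, a1)
        Delta2 += np.outer(d3, a2)

    # Average and regularize (lines 9 and 10); bias columns are skipped.
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```

The returned D1 and D2 are exactly what gradient descent needs: they plug into the update rule from Figure 1 in place of the partial derivatives.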

Gradient Checking

As you probably noticed in the previous section, there’s a lot going on in the backpropagation algorithm. Due to its complex nature, there are many scenarios where something could go wrong. Sometimes it can look like your gradient descent is working fine, with your cost decreasing on every iteration, when realistically it isn’t. As such, we need a way to check that the derivatives backpropagation computes are right. The method we present in this section is referred to as gradient checking.

Consider the following graph of J(theta):

Figure 5: Random Cost Function Graph

Now suppose I wish to estimate the derivative at a random point theta, where theta is a real number, i.e. I want to calculate the slope of the following tangent line:

Figure 6: Random Cost Function Graph With Random Tangent Line

How can I make this approximation? Here’s a suggestion: we can pick two points, theta + epsilon and theta - epsilon, connect them with a line, and assume that this connecting line is a good approximation of the tangent at the point theta. Epsilon is a small offset chosen by us:

Figure 7: Random Cost Function Graph With Line Tangent Approximation

To calculate the slope of the line between the two points (the green hypotenuse), we simply divide the vertical distance by the horizontal distance:

Equation 3: Partial Derivative Approximation When Theta is A Real Number
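In symbols, with epsilon a small number:

```latex
\frac{\partial}{\partial \theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}
```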

Equation 3 applies when theta is a single real number. In most of the cases we’ve seen so far, Theta was a vector. The more general case of the equation then becomes:

Equation 4: Generalized Partial Derivative Approximation When Theta is A Vector of Thetas
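That is, we perturb one component of the vector at a time, holding the others fixed:

```latex
\frac{\partial}{\partial \theta_j} J(\Theta) \approx
\frac{J(\theta_1, \ldots, \theta_j + \epsilon, \ldots, \theta_n)
    - J(\theta_1, \ldots, \theta_j - \epsilon, \ldots, \theta_n)}{2\epsilon}
```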

It’s important to note here that this numerical method of approximating the derivative is used only to check that backpropagation is calculating the derivatives properly. You may be asking yourself: why don’t I just use this simple formula instead of the head-breaking backpropagation algorithm? The answer is simple: this numerical way of calculating derivatives is very computationally inefficient. As such, backpropagation has become the staple way of computing the gradients needed to minimize your cost function when working with neural networks.
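As a sketch, here’s what the check might look like in NumPy. The function name and the choice eps=1e-4 (the value commonly recommended in Ng’s course) are illustrative assumptions; J is any function mapping an unrolled parameter vector to the scalar cost:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate the gradient of J at theta with two-sided differences.

    J: callable mapping a flat (unrolled) parameter vector to a scalar cost.
    theta: flat vector containing all Theta values.
    """
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        bump = np.zeros_like(theta)
        bump[j] = eps  # perturb one component at a time (Equation 4)
        grad[j] = (J(theta + bump) - J(theta - bump)) / (2.0 * eps)
    return grad

# Usage: compare against the gradient from backpropagation, then turn
# the check off before training -- each call costs two full cost
# evaluations per parameter, far too slow to run on every iteration.
# assert np.allclose(numerical_gradient(cost, theta), backprop_gradient)
```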

Conclusion

In this article, we looked at an important algorithm used to minimize the cost function of neural networks: backpropagation. We also looked at a way to test whether our algorithm is correctly calculating the derivatives, called gradient checking.

If you didn’t understand much of what we spoke about, that’s fine. What’s important is that you continue working with the algorithm so that you get a better sense of what it’s doing. Of all the things we talked about, there’s one key point you must take away: backpropagation is not a replacement for gradient descent. It’s a method used to calculate the partial derivative of our cost function with respect to Theta. The value outputted by the algorithm is then used in gradient descent to minimize our cost function.

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
  4. Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition

References

  1. Andrew Ng’s Machine Learning Coursera Course
  2. Neural Networks and Deep Learning Chapter 2: How The Backpropagation Algorithm Works
