Deep Learning Part 3: Backpropagation; Nothing But a Game of Telephone

An intuitive way of understanding backpropagation by relating it to the game of “telephone”

10 min read · Aug 22, 2021


Welcome to part 3 of my introductory series on deep learning, where we aim to acquaint you with fundamental DL concepts. Refer to the Deep Learning Series section at the bottom for all previous articles. In this piece, we discuss backpropagation.

In last week’s article, we looked at gradient descent. We saw that in order for our neural network to learn, we have to continuously update our weights and biases until our cost function is minimized. The update rule was based on the gradient of our cost function. To simplify our explanation of the gradient descent algorithm, we set neural networks aside for a moment and worked instead with a linear regression model. This change in models raises a few questions: What’s so complicated about gradient descent performed on a neural network? Is it not applied in the same way for neural networks as it was for linear regression? It is. Things get a little more complicated, however, when calculating the gradient of the cost function for a neural network.
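As a quick refresher, for any parameter θ (a weight or a bias) the update rule takes the following general form, with α the learning rate (the notation here is assumed for illustration, not taken from last week’s article):

```latex
\theta \leftarrow \theta - \alpha \, \frac{\partial J}{\partial \theta}
```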

We’ll start off this week’s article by building an intuition for calculating gradients efficiently. We’ll do this by seeing how errors are propagated from one person to another in the game of “telephone”. We’ll then relate this idea back to the backpropagation algorithm, and formalize how gradients are calculated for gradient descent when working with neural networks.

Let’s get right into it.

Telephone Analogy

You’re on the Canadian National Telephone Team, and the Telephone World Cup is just around the corner. For now, we’ll assume that each team consists of only two types of players:

  • Speaker: The first person to speak. This is the person who has the original message to be whispered to the rest of the players.
  • One or more Players: Any person receiving the message who isn’t the speaker.

Your team doesn’t take the normal approach of simply listening, and hoping for the best. Rather, they understand that anxiety and nervousness will play a role and cause some teammates to hear the wrong thing. As such, each player will try to estimate the amount of error produced by the previous player, and try to make any corrections that make sense. For example, consider the following linear telephone line of two players:

Figure 1: Linear Telephone Line With Two Players

Player 1 will assume an error of 10% made by himself. Player 2 will assume an error of 20% made by player 1. We define the error as the percentage of letters that differ between two sentences. For example, if we have the following scenario:

Figure 2: Sentence Passed Through Linear Telephone Line With Two Players

Then player 1 got 10% of the original sentence wrong, since one out of ten letters in the original sentence was changed. Player 2 will try to account for player 1’s errors and change the letters as he sees fit.

Since this is a team sport, the team will have to train to build the right chemistry and understand what impact each player will have on the original sentence. Training involves going through a number of sentences, determining where the team went wrong, and modifying the error rates accordingly.

Our telephone line in figure 2 is clearly not trained properly. What can we do to get better results? Well, we need a way to describe to the players where they went wrong. The only person who knows the original answer is the speaker. Hence, he’s the only one who can compare the predicted results to the actual results. He has to first go to player 1, and ask him what he heard. From there he can calculate the error produced by player 1, tell him to adjust his predicted error rate accordingly, and move on to player 2. The speaker will have to help player 2 adjust his error rate based on what player 1’s error rate was. Since player 1’s error rate was 10%, and player 2 predicted 20%, player 2 will have to diminish his error rate if the team wishes to get better results.

Is this an efficient way of training? Maybe for our linear line of two players and one speaker, it isn’t the biggest of deals. But the Telephone World Cup consists of more than just one type of telephone line. Teams will have to compete in a “threaded” telephone line game. In a threaded telephone line, each team is separated into groups, with a speaker per group. The speaker will then have to pass a different part of the sentence to the first person in their group. The first person of each group will then whisper the sentence to the second person of each group, and so on. The last player will receive a message from each group and will have to make the right manipulations and decisions to output the message he thinks was originally sent. For example:

Figure 3: Sentence Passed Through Threaded Telephone Line

How well does our training method work now? One speaker will have to first go to the first player in the first group, determine the error he caused, notify the second player of each group of this error, and finally notify the last player. The second speaker then has to repeat the same process. Our speakers will get very tired very fast. Imagine how time-consuming this process would be if we had ten groups with ten speakers. And all this is for just one training session. Imagine what would happen if we wanted to train on millions of sentences. The inefficiencies of our training method are starting to show. So, is there a better solution?

Let’s introduce a new type of player to our team, a judge. A judge is the last person to receive the message. He knows what the original message is, and is in charge of comparing the message received to the original message. Don’t confuse the judge with the last players in figure 2 and figure 3. These players don’t know what the original message is, while the judge does:

Figure 4: Threaded Telephone Line With A Judge

Since the judge knows what the original message is, he can calculate the total error produced on the final sentence. With the error now known, the judge can broadcast the error received to the player who whispered the sentence to him. This player will evaluate his impact on this error, and broadcast that to all the previous players. With just one pass from speaker to judge, and another from judge to speaker, each player is able to understand how he contributed to the overall error of the team. Compared to our original method, where we had to pass through the entire phone line for every speaker we had, this is much faster.

We conclude that when the speakers were left in charge of synchronizing errors (forward propagation), our training process was very inefficient. The solution was to propagate the error backward (backward propagation) starting from the judge. This conclusion holds for neural networks and is why backpropagation is the most widely used algorithm for training them. Let’s look at how the algorithm works.

Backpropagation

Training in our telephone line consisted of understanding the changes needed by every player’s predicted error to minimize the team’s overall error. Similarly, backpropagation aims to understand how changes in the neural network’s weights and biases will impact our cost function. Formally, the ultimate goal is to compute:
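The equation image from the original post isn’t reproduced here; in the notation used throughout this series (following Nielsen [1]), the quantities we want, for every layer l, are:

```latex
\frac{\partial J}{\partial w^l_{jk}} \qquad \text{and} \qquad \frac{\partial J}{\partial b^l_j}
```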

Note: The subscripts j and k for w^l are not mistakenly reversed. We do it this way because it’ll make our calculations easier down the road.

J plays a similar role to the judge, while the weights and biases play similar roles to the speakers and players.

More often than not, backpropagation is explained by first showing the algorithm, then relating it back to the chain rule in calculus. We’re going to take the opposite approach. We’ll first provide a graphical way of understanding differentiation using the chain rule, then relate it back to the backpropagation algorithm.

Chain Rule

Consider the following equation:

Equation 1: A as a Function of B, C and D

We can split this one function into multiple:

Equation 2: A as a Function of F and G

This way, we can graphically represent the relationship between all inputs and A:

Figure 5: Graphical Representation of A

This graphical representation of A gives us an easy way to calculate the partial derivative of A with respect to any variable, using what is called the chain rule. For example, to find the partial derivative of A with respect to C (∂A/∂C), we follow the path from C to A in the above graph. There are two possible paths, one through F and one through G. Before we can find ∂A/∂C, we need to compute ∂F/∂C and ∂G/∂C. The rule is: all partial derivatives on the same path are multiplied, while all partial derivatives on different paths are summed. So the partial derivative of A with respect to C is calculated as follows:

Equation 3: Partial Derivative of A With Respect to C
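Since the equation images aren’t reproduced here, take a concrete example (an assumption for illustration, not the original equation 1): A = F · G, with F = B + C and G = C · D. Following the path rule, ∂A/∂C = (∂A/∂F)(∂F/∂C) + (∂A/∂G)(∂G/∂C) = G · 1 + F · D. A quick sketch checking this against a finite-difference estimate:

```python
# Assumed example functions: A = F * G, with F = B + C and G = C * D.
def F(B, C):
    return B + C

def G(C, D):
    return C * D

def A(B, C, D):
    return F(B, C) * G(C, D)

def dA_dC(B, C, D):
    # Chain rule over both paths, C -> F -> A and C -> G -> A:
    # multiply partials along a path, sum across paths.
    dA_dF = G(C, D)   # partial of F * G with respect to F
    dF_dC = 1.0
    dA_dG = F(B, C)   # partial of F * G with respect to G
    dG_dC = D
    return dA_dF * dF_dC + dA_dG * dG_dC

# Finite-difference check at an arbitrary point.
B, C, D, h = 2.0, 3.0, 4.0, 1e-6
numeric = (A(B, C + h, D) - A(B, C - h, D)) / (2 * h)
print(dA_dC(B, C, D), numeric)  # the two estimates should agree closely
```

The numeric estimate wiggles C by a tiny amount and measures how A responds; it agreeing with the path-rule result is exactly the chain rule at work.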

Let’s apply these same rules to a neural network.

Algorithm

Consider the following neural network with just two nodes (we omit the bias term to simplify our graph, but it’s there in theory nonetheless):

Figure 6: Simple Neural Network With Two Nodes

Where a^L is referred to as the activation of the output layer L (the predicted result) and is equal to:

Equation 4: Activation of Neuron j In Layer L

The weighted sum passed to the sigmoid function (σ) is going to be so common in our equations that we’ll assign it to a variable z^L:

Figure 7: Definition of z^L
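The figure image isn’t reproduced here; written out, the definition (and the resulting form of the activation) is:

```latex
z^L = w^L a^{L-1} + b^L, \qquad a^L = \sigma(z^L)
```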

Okay, we’ve dealt with all the jargon. From this point forward, we do no more than apply the chain rule we learnt a moment ago. Don’t forget our main goal: To find how a change in weights and biases impacts our cost function.

So, how do we draw a graph similar to figure 5 but for our cost function? Well, J is a function of a^L (the predicted result) and y (the actual value); a^L is a function of z^L, which in turn is a function of a^(L-1), w and b … Do you see the pattern? Here’s what a graphical representation of J looks like:

Figure 7: Graphical Representation of J

Can you find ∂J/∂w and ∂J/∂b using this graph? Yes you can! Just apply the same rules we did earlier:

Equation 5: ∂J/∂w and ∂J/∂b

We know what z^L and a^L are, so all that’s left to do is compute the partial derivatives defining ∂J/∂w and ∂J/∂b, and we can come up with an expression for them with regard to any cost function J:

Equations 6, 7 and 8: Relevant Derivatives

Overall, we get:

Equations 9 and 10: ∂J/∂w and ∂J/∂b
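The equation images aren’t reproduced here. Working the chain rule over the paths in the graph gives the intermediate derivatives (equations 6 to 8) and the final results (equations 9 and 10); this is a reconstruction in the notation above, with ∂J/∂a^L left symbolic so the result holds for any cost function:

```latex
\frac{\partial z^L}{\partial w} = a^{L-1}, \qquad
\frac{\partial z^L}{\partial b} = 1, \qquad
\frac{\partial a^L}{\partial z^L} = \sigma'(z^L)

\frac{\partial J}{\partial w}
  = \frac{\partial z^L}{\partial w}\,\frac{\partial a^L}{\partial z^L}\,\frac{\partial J}{\partial a^L}
  = a^{L-1}\,\sigma'(z^L)\,\frac{\partial J}{\partial a^L}

\frac{\partial J}{\partial b}
  = \frac{\partial z^L}{\partial b}\,\frac{\partial a^L}{\partial z^L}\,\frac{\partial J}{\partial a^L}
  = \sigma'(z^L)\,\frac{\partial J}{\partial a^L}
```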

That’s it! We found the general formulas for computing the change in J with respect to any weight or bias. Remember that we simplified things by looking at just one hidden layer with one neuron, and one output layer with one output. In reality, we’re going to have many different layers l = L, L-1, L-2, ..., 2, each containing many different neurons. Nonetheless, we can loop through all the different layers and neurons and use their values in equations 9 and 10. Also remember that this is going to be used during gradient descent to calculate partial derivatives, meaning we’ll be running this algorithm for every training example we have.

Overall, the backpropagation algorithm involves:

Figure 8: Full Backpropagation Algorithm [1]

The feedforward step in our algorithm is equivalent to the sentence being passed from speaker to judge in our telephone example. In our neural network, a training example will be passed from the input to the output layer. This will allow us to evaluate our total cost J and propagate it backwards.
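The steps above can be sketched in code. This is a minimal toy version, not the full algorithm from [1]: a chain of one-neuron layers with sigmoid activations and a squared-error cost, looping equations 9 and 10 backwards down the layers. All names here are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(x, y, weights, biases):
    """Feedforward only: J = 0.5 * (a^L - y)^2 for a chain of one-neuron layers."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w * a + b)
    return 0.5 * (a - y) ** 2

def backprop(x, y, weights, biases):
    """One feedforward pass, then one backward pass; returns dJ/dw and dJ/db per layer."""
    # Feedforward: store every z and activation for the backward pass.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Backward pass: start from the cost (the "judge") and propagate the error.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])  # dJ/dz at the output
    grads_w, grads_b = [0.0] * len(weights), [0.0] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grads_w[l] = delta * activations[l]  # dJ/dw = delta * previous activation
        grads_b[l] = delta                   # dJ/db = delta
        if l > 0:
            delta = delta * weights[l] * sigmoid_prime(zs[l - 1])
    return grads_w, grads_b

# Usage: a two-layer chain, sanity-checked against a finite difference.
weights, biases = [0.5, -0.3], [0.1, 0.2]
gw, gb = backprop(1.0, 0.0, weights, biases)
h = 1e-6
numeric = (cost(1.0, 0.0, [0.5 + h, -0.3], biases)
           - cost(1.0, 0.0, [0.5 - h, -0.3], biases)) / (2 * h)
print(gw[0], numeric)  # the two estimates should agree closely
```

The finite-difference check plays the role of our original slow training method: wiggle one weight and rerun the whole telephone line. Backpropagation gets every gradient in a single backward sweep instead.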

Conclusion

In this article we looked at the backpropagation algorithm.

By analysing how errors are propagated through a telephone line, we were able to understand the impact of one player’s mistakes on the whole team. We also determined the fastest way to diminish this error: running through the entire phone line once, then propagating the error calculated at the very end backwards.

With this idea of error propagation in mind, along with the chain rule for differentiating multivariate functions, we came up with a set of equations that can be used to calculate the gradient of our neural network’s cost function efficiently. We concluded by formalizing the algorithm, and saw how it is used in gradient descent.

Deep Learning Series

References

[1] Michael A. Nielsen, Neural Networks and Deep Learning (2015), Determination Press


Machine Learning Developer @ Kinaxis | I write about theoretical and practical computer science 🤖⚙️