A Different Way of Looking at Precision vs Recall
Going beyond the confusing definitions by looking at model evaluation through a different lens
If you’re anything like me, and you’re just getting started in your machine learning career, then you’ve probably stumbled upon the concepts of precision and recall. These terms often come up when studying the notion of model evaluation.
More often than not, beginners struggle to understand these concepts. Not because they’re difficult, but because the same confusing techniques and definitions are used time and time again. Normally, the concept will be explained by first defining the terms, then showing a real-world example of model evaluation. Let’s look at a few examples. The first one is taken from Google’s Machine Learning Crash Course:
Precision attempts to answer the following question: What proportion of positive identifications was actually correct? [1]
Recall attempts to answer the following question: What proportion of actual positives was identified correctly? [1]
In most of the literature, this is the definition you’ll find. I’m not quite sure why this play on words has become the standard way of describing the two terms, but it can be very confusing, especially for beginners: both definitions are phrased to look like they mean the same thing, even though they don’t. So maybe we can find a better explanation? Scikit-learn’s documentation defines them as follows:
Precision (P) is defined as the number of true positives (Tp) over the number of true positives plus the number of false positives (Fp) [2]
Recall (R) is defined as the number of true positives (Tp) over the number of true positives plus the number of false negatives (Fn) [2]
Not much is being said here, other than how to calculate the values of precision and recall. Let’s look at one last example, from the Wikipedia piece on precision and recall:
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved [3]
Apart from a slight change in phrasing and vocabulary, this definition is very similar to the first one we saw, from Google.
In this article, we’ll aim to explain these terms in a different way. Rather than looking at the definitions, we’ll first look at the equations of both precision and recall, and try to extract as much information from them as possible. The idea is that, once you understand how these terms are calculated, you should have a better foundation to understand the mainstream definitions. Note that this article is not an introduction to model evaluation. Instead, I’m assuming that the reader has basic knowledge of when, why, and how to apply these concepts, but is looking for a better way of understanding them.
If this is your first time being introduced to these terms, I suggest you read my piece on evaluating your hypothesis and understanding bias vs variance.
Let’s get right into it.
The Confusion Matrix
We won’t go into great detail on the differences between the classification outcomes, since that isn’t the point of this article. If you don’t know what a confusion matrix is, or don’t fully understand the differences between a true positive, a false positive, a true negative, and a false negative, then I suggest you go through the section on Type I and Type II errors in this article:
If, however, you understand these concepts and just need a quick refresher, then here’s a set of quick definitions of the four possible outcomes in a classification problem:
- True Positive (TP): The classification model labeled a data point as positive, and the data point is actually positive.
- False Positive (FP): The classification model labeled a data point as positive, and the data point is actually negative.
- True Negative (TN): The classification model labeled a data point as negative, and the data point is actually negative.
- False Negative (FN): The classification model labeled a data point as negative, and the data point is actually positive.
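To make the four outcomes concrete, here’s a minimal Python sketch that counts them by hand. The y_true and y_pred lists are made up purely for illustration:

```python
# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

# Count each of the four outcomes by comparing prediction with reality
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted positive, actually positive
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted positive, actually negative
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # predicted negative, actually negative
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted negative, actually positive

print(tp, fp, tn, fn)  # 3 1 4 2
```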
Precision, Recall, and Precision vs Recall
Precision and Recall
Understanding how precision and recall are calculated is no more difficult than understanding basic percentages.
Consider a scenario where you have a basket of 10 fruits, three of which are apples and the rest are oranges. How do you calculate what percentage of apples make up your basket? We use the basic percentage formula, (part / whole) * 100, where part is the number of items in the subset you’re interested in and whole is the total number of items in your set. So (3 / 10) * 100 = 30%. Simple, right?
Let’s look at how we calculate precision:
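Precision = TP / (TP + FP)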
What information can we extract from this equation? Notice first that the number of true positives appears in both the numerator and the denominator. In the denominator, however, we add an extra term: the number of false positives. This extra term guarantees that precision will always be less than or equal to one, since the denominator will always be greater than or equal to TP. More importantly, notice what the denominator represents: it counts every data point our model classified as positive, whether that classification was correct (a true positive) or not (a false positive). So the numerator is the number of true positives, while the denominator is the number of predicted positives. What does that leave us with? A number (a percentage, a ratio) describing how many of the data points our model classified as positive are actually positive.
If you’re still confused, think back to the basket: what percentage of the fruits were apples? 30%. Now replace the apples with the data points that are positive and were classified as positive, and replace the whole basket with every data point, positive or negative, that was classified as positive. It’s the same calculation, just communicating something different.
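As a tiny worked example (with counts picked purely for illustration), the calculation in code is nothing more than the percentage formula from the fruit basket:

```python
tp, fp = 3, 1  # hypothetical counts: 3 true positives, 1 false positive
precision = tp / (tp + fp)
print(f"Precision: {precision:.2f}")  # 0.75 -> 75% of the points we labeled positive are truly positive
```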
Next, recall. Here’s its equation:
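Recall = TP / (TP + FN)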
We can run the exact same analysis as we did for precision. Again, the only difference between the numerator and the denominator is the extra value added in the denominator; this time, instead of adding the number of false positives to TP, we add the number of false negatives. The numerator is still the number of true positives, but the denominator is now the total number of data points that should have been classified as positive, whether or not they actually were. This leaves us with a number describing how many of the truly positive data points our model was able to identify. Again, this number will always be less than or equal to one.
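Here’s the same idea in code for recall, with a cross-check against scikit-learn’s built-in metrics (assuming scikit-learn is installed). The labels are the same hypothetical ones used earlier:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 2

print(tp / (tp + fp), precision_score(y_true, y_pred))  # 0.75 0.75
print(tp / (tp + fn), recall_score(y_true, y_pred))     # 0.6 0.6
```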
Precision Versus Recall
When studying model evaluation, we’re almost always told that there’s a trade-off between precision and recall: as precision increases, recall decreases, and vice versa. Very rarely, however, are we taught why this is the case. Instead, we’re shown a graph of precision-recall curves similar to the one below:
The type of curve (red, blue, or black) you get depends on your model’s hyperparameters. Although this is a good way of visualizing things, I personally understood the relationship between these two variables much better when I compared them mathematically.
Before complicating things, compare the two equations. By now you’ve surely noticed the similarities: the only thing separating them is which value we add to TP in the denominator. What does that mean? It means that the control we have over these metrics lies in the number of false positives and false negatives. Increasing or decreasing those counts is what moves precision and recall up or down.
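A hands-on way to see this is to sweep the decision threshold of a probabilistic classifier: raising the threshold removes false positives (pushing precision up) but creates false negatives (pushing recall down). The labels and scores below are made up purely for illustration:

```python
# Hypothetical true labels and predicted probabilities, made up for illustration
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.65, 0.45, 0.30, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    print(f"threshold={threshold}  FP={fp}  FN={fn}  precision={precision:.2f}  recall={recall:.2f}")
```

Running this sketch shows precision climbing from roughly 0.62 to 1.00 as the threshold rises, while recall falls from 1.00 to 0.60.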
Consider the following comparison:
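Recall / Precision = [TP / (TP + FN)] / [TP / (TP + FP)] = (TP + FP) / (TP + FN)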
So much can be extracted from this result:
- If FP > FN, then our recall-to-precision ratio will be greater than one and our precision will be lower than our recall.
- If FP < FN, then our recall-to-precision ratio will be smaller than one and our precision will be greater than our recall.
- If FP ~= FN, then our recall-to-precision ratio will be approximately one and our precision will be approximately equal to our recall.
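As a quick numeric check (with counts picked purely for illustration), take TP = 6, FP = 3, and FN = 1. The ratio is (6 + 3) / (6 + 1) ≈ 1.29, and indeed recall (6/7 ≈ 0.86) is higher than precision (6/9 ≈ 0.67). Swap the two counts, so that FP = 1 and FN = 3, and the ratio becomes 7/9 ≈ 0.78, with precision now the larger of the two.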
From this, we can also present precision as a function of recall, and recall as a function of precision, to get an even better understanding of the relationship between the two:
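Precision = Recall * (TP + FN) / (TP + FP)
Recall = Precision * (TP + FP) / (TP + FN)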
Again, these equations tell us a lot about the relationship between the two metrics. Through them, you can begin to control your model’s performance by tweaking the relevant parameters.
Conclusion
The aim of this article was to provide a different perspective on the concepts of precision and recall. Instead of starting with the definitions, which is traditionally how the concept is explained, we first extracted the information communicated by their equations. From there, we were able to develop a good understanding, which led us back to the traditional definitions used in today’s literature.
To conclude, we looked at the relationship between precision and recall and saw why we’re so often told that there’s a trade-off between the two.
From here, I suggest you start looking into ROC curves. It’s a more involved topic, but what you’ve learned in this article equips you to tackle it.
References
[1] Classification: Precision and Recall (2020), Google’s Machine Learning Crash Course
[2] Precision-Recall (2020), Scikit-Learn
[3] Precision and recall (2021), Wikipedia
Shameless Plug
- Twitter: twitter.com/ali_khanafer2