Photo by Will Myers on Unsplash

Anomaly Detection

Algorithm for catching outliers using Gaussian distribution

This is part ten of a series I’m working on, in which we discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read them in sequence, because they introduce concepts that are key to understanding the notions discussed here, and I’ll be referring back to them on numerous occasions.

In this article, we’ll quickly review the Gaussian distribution, define what an anomaly is, and see how we detect one.

Let’s get right into it.

What is an Anomaly?

Formally, an anomaly (also referred to as an outlier) is defined as:

something different, abnormal, peculiar, or not easily classified. Deviation from the common rule [2]

This definition holds in the machine learning world. Whenever you have a data point that is significantly different from the rest of the set, we say that this data point is anomalous.

Consider, for example, the following dataset describing the height and weight of a person:

Figure 1: Height vs Weight

The red cross is so far from the rest of the dataset that we label it as an anomaly, i.e. the height and weight of this specific person do not conform to the height-to-weight ratio we normally see in people.

You might, however, have felt that the red cross is not that far off from the rest of the data points. Since we haven’t defined a threshold past which we label a point as anomalous, your feelings are completely valid.

Figure 2: Example of Different Possible Thresholds

In this article, we’re going to see how we come up with this threshold that’ll give us a concrete baseline as to what data points are anomalous or not.

Anomaly vs Erroneous Data Point

It’s important that we’re able to distinguish between an erroneous data point and an anomalous one.

When a data point does not comply with the general distribution of our training set because of some unanticipated error in the entry or collection of the data, we don’t classify it as an outlier, but rather as an error that needs to be fixed. For example, data is often entered manually by humans, so it’s very possible that we encounter some data entry errors. If we’re collecting data on people’s heights and we have a value of two centimeters, then clearly this is an error in data entry and not an anomaly.

Anomalies are possible outcomes that are unusual, like high traffic on a network, or high CPU usage levels.
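
One way to keep the two cases apart is to screen out physically implausible values before looking for anomalies. The snippet below is only a sketch: the bounds and the function name `split_entry_errors` are assumptions made for illustration, and the sample values are made up.

```python
# Hypothetical plausibility bounds for a person's height in centimeters (an assumption).
MIN_PLAUSIBLE_CM = 50.0
MAX_PLAUSIBLE_CM = 275.0


def split_entry_errors(heights_cm):
    """Separate physically implausible values from usable measurements."""
    valid, errors = [], []
    for h in heights_cm:
        if MIN_PLAUSIBLE_CM <= h <= MAX_PLAUSIBLE_CM:
            valid.append(h)
        else:
            errors.append(h)
    return valid, errors


# Made-up sample: 2.0 stands in for the data entry error described above.
valid, errors = split_entry_errors([172.0, 2.0, 168.5, 181.0])
print(valid)   # [172.0, 168.5, 181.0]
print(errors)  # [2.0]
```

Values in `errors` would be corrected or dropped; only the values in `valid` are candidates for anomaly detection.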

Gaussian (Normal) Distribution

This series isn’t about probability theory and statistics, so I won’t be going into great detail about probability density functions (PDFs) and continuous probability distributions. Instead, I’ll give a quick introduction, as well as all the information you need in order to understand the anomaly detection algorithm and leave the rest for your own research.

Consider the table below which defines the heights of a population consisting of 10 people. You’re asked to graphically represent these heights.

Figure 3: Table of Population Height in Centimeters

This is a one-dimensional dataset, so all we need is a one-dimensional Cartesian coordinate system (a number line), like so:

Figure 4: Height of a Population

What do we notice from figure 4? First of all, the majority of our population is of height between 165 and 180 centimeters. Second, we notice that there are some that are either shorter or taller than the majority.

Next, you’re asked to graphically represent the probability that a person in this population is of a certain height. More formally, for any height X, determine the probability of all possible outcomes of X.

Before we put much thought into it, think of it logically. If the majority of our population is between 165 and 180 centimeters tall, would the probability not be highest for the values in that range? Also, since so few people fall below or above this range, would the probability of a person being of those heights not be smaller? Once you think about it this way, drawing our graph becomes much simpler:

Figure 5: Density Curve of Our Population

In figure 5, the y-axis represents the values of this curve’s probability density function (PDF), and the x-axis the height of a population. The 0.5 on the y-axis was chosen arbitrarily.

We went through this entire process in order to come up with the distribution of our data. We got to a graph that describes the possible values of some variable X and how often they occur. Our variable can be distributed in many different ways. Imagine we had more people with heights between 140 and 150 centimeters. In that case, the bulk of our curve would shift to the left:

Figure 6: Density Curve Skewed to the Left

When our data is clustered symmetrically around the mean, forming a bell-shaped curve, we have a Gaussian (Normal) distribution. This form of distribution is fully characterized by its mean and its standard deviation (STD).
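
As a quick illustration, the mean and standard deviation of a sample can be computed with Python's standard library. The heights below are made up for this sketch; they are not the values from Figure 3.

```python
import statistics

# Hypothetical heights in centimeters (made up; not the values from Figure 3).
heights_cm = [150.0, 165.0, 168.0, 170.0, 172.0, 174.0, 175.0, 178.0, 180.0, 190.0]

mu = statistics.fmean(heights_cm)      # mean of the sample
sigma = statistics.pstdev(heights_cm)  # population standard deviation

print(f"mean = {mu:.1f} cm, std = {sigma:.1f} cm")
```

These two numbers are all we need to draw the corresponding Gaussian curve.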

A few points to note before moving on to the algorithm:

  • When the variable X is continuous (it can take infinitely many values), the probability of any single exact value is zero. For example, suppose you are asked to calculate the probability that a height is exactly 165 centimeters, that is, P(X = 165). A person might be 165.00001 centimeters or 164.9999 centimeters tall, but the probability of being exactly 165 is zero. Instead, we look for the probability of a range of values, for example P(164.9 <= X <= 165.1). This probability is calculated by finding the area under the density curve between those two points.
  • The area under the entire density curve always amounts to one.
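
Both points can be checked numerically. The sketch below assumes a Gaussian with a hypothetical mean of 172 cm and standard deviation of 10 cm, and computes areas through the Gaussian cumulative distribution function, written here with `math.erf`. The helper names are chosen for illustration.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_between(a, b, mu, sigma):
    """P(a <= X <= b): the area under the density curve between a and b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)


# Hypothetical parameters, chosen only for illustration.
mu, sigma = 172.0, 10.0

narrow = prob_between(164.9, 165.1, mu, sigma)        # small but non-zero
point = prob_between(165.0, 165.0, mu, sigma)         # a single point has zero width
total = prob_between(-math.inf, math.inf, mu, sigma)  # area under the whole curve

print(narrow, point, total)
```

Here `point` comes out as zero and `total` as one, while `narrow` is a small positive number.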

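That first point might frighten you a little bit, but don’t worry, you won’t have to pull out your integration skills every time you want to work with a distribution curve. Instead, the algorithm we’ll see works directly with the value of the PDF at a point. Strictly speaking, that value is a density rather than a probability, but it still tells us how likely values around that point are. The PDF of a Gaussian distribution is the following: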

Equation 1: PDF of Gaussian Distribution
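
For reference, the standard form of the Gaussian density is written out below. The symbols follow the usual convention and are an assumption about the notation: \mu is the mean and \sigma^2 is the variance (the square of the standard deviation).

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```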

Anomaly Detection Algorithm

Now that we understand what a Gaussian distribution is, all that’s left is to see how this information is useful for anomaly detection.

We’ll look first at the algorithm, and from there try to understand all the details. The following algorithm is taken from Andrew Ng’s introductory course on machine learning [1]:

Figure 7: Anomaly Detection Algorithm

  1. Choose the features you think are indicative of anomalies. Some features will have greater anomalous tendencies than others. A feature such as CPU usage, for example, is likely to present outlying results at some point, so it’s a good feature to include in an outlier detection system for computer systems.
  2. Calculate the mean and variance of each of your features. These will be used to calculate the probabilities through the PDF of the Gaussian distribution.
  3. Given a new data point x, identify whether or not it is anomalous by computing the product of the probabilities of its features. The easiest way to understand this is through an example. Say we’re given the height and weight of a person x=[x_1=165cm, x_2=140lb]. To calculate the probability of a person being 165 centimeters tall and weighing 140 pounds, we calculate p(x_1=165cm and x_2=140lb) = p(x_1=165cm)*p(x_2=140lb). Note that multiplying the two probabilities treats the features as independent of each other. We consider this data point to be anomalous if the product is smaller than some pre-defined threshold epsilon.
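
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the training data, the value of epsilon, and all function names are assumptions made up for this example.

```python
import math


def fit_gaussian(column):
    """Step 2: estimate the mean and variance of a single feature."""
    mu = sum(column) / len(column)
    var = sum((v - mu) ** 2 for v in column) / len(column)
    return mu, var


def gaussian_pdf(x, mu, var):
    """Value of the Gaussian PDF with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit(train):
    """Fit one Gaussian per feature; train is a list of equal-length rows."""
    return [fit_gaussian(col) for col in zip(*train)]


def p(x, params):
    """Step 3: product of the per-feature densities of data point x."""
    return math.prod(gaussian_pdf(xj, mu, var) for xj, (mu, var) in zip(x, params))


def is_anomalous(x, params, epsilon):
    """Flag x when its density product falls below the threshold epsilon."""
    return p(x, params) < epsilon


# Step 1 (feature choice) is assumed done: [height in cm, weight in lb].
# Made-up training data, for illustration only.
train = [[160, 120], [165, 140], [170, 150], [175, 160],
         [180, 175], [168, 145], [172, 155]]
params = fit(train)
epsilon = 1e-5  # hypothetical threshold

print(is_anomalous([165, 140], params, epsilon))  # typical point
print(is_anomalous([150, 300], params, epsilon))  # far from the training data
```

With this made-up data, the first point is close to the bulk of the training set and is not flagged, while the second is far from it and is flagged.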

A final word on step number three: how you choose epsilon depends on how strict you want to be. A larger epsilon is stricter: a data point needs a higher probability to be accepted as normal, so more points are flagged as anomalous. A smaller epsilon means you’re willing to accept lower-probability data points.
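
To see the effect of epsilon, the sketch below applies three thresholds to the same set of p(x) values. The values and the helper name `flagged` are hypothetical, chosen only for illustration.

```python
# Hypothetical p(x) values for five data points (made up for illustration).
scores = [0.0310, 0.0120, 0.0045, 0.0009, 0.00002]


def flagged(scores, epsilon):
    """Return the scores that fall below the threshold epsilon."""
    return [s for s in scores if s < epsilon]


for epsilon in (0.01, 0.001, 0.0001):
    print(epsilon, len(flagged(scores, epsilon)))
# 0.01 3
# 0.001 2
# 0.0001 1
```

The larger the threshold, the more points are labeled anomalous.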

Conclusion

In this article, we went through the concept of anomaly detection. We first described what an anomaly is and what it isn’t. From there, we reviewed the concept of Gaussian distribution and used it to develop an anomaly detection algorithm.

Anomalies are found in most data science and machine learning applications, so being equipped with this information will help you with future projects. Although we gave you a great overview, there are some points we didn’t consider. We leave you with the responsibility of researching the following points:

  • How do we test an anomaly detection system?
  • Do we have to use a Gaussian distribution? Can we use any other kind of distribution? If yes, what are the advantages and disadvantages of using one over the other? If no, why not?
  • Is there an algorithm for selecting features for our anomaly detection system, or do we always have to rely on our intuition?

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
  4. Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition
  5. Part Four — 2: Neural Networks Part 2: Backpropagation and Gradient Checking
  6. Part Six: Evaluating Your Hypothesis and Understanding Bias vs Variance
  7. Part Seven: Support Vector Machines and Kernels
  8. Part Eight: Unsupervised Learning and the Intuition Behind K-Means Clustering
  9. Part Nine: Dimensionality Reduction and Principal Component Analysis

References

  1. Andrew Ng’s Machine Learning Coursera Course
  2. Anomaly | Definition of Anomaly by Merriam-Webster


Ali H Khanafer
Geek Culture

Machine Learning Developer @ Kinaxis | I write about theoretical and practical computer science 🤖⚙️