Dimensionality Reduction and Principal Component Analysis

Ali H Khanafer
Published in Geek Culture
Jun 29, 2021

This is part nine of a series I’m working on, in which we’ll discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read them in sequence, simply because I introduce some concepts there that are key to understanding the notions discussed in this article, and I’ll be referring back to them on numerous occasions.

Today we look at the importance of dimensionality reduction, as well as the principal component analysis (PCA) algorithm, the most widely used algorithm for dimensionality reduction.

Let’s get right into it.

Dimensionality Reduction

Dimensionality reduction is the idea of going from a complex dataset, with multiple dimensions, to a simpler dataset with fewer dimensions. For example, going from a three-dimensional dataset to a two-dimensional dataset. What’s important is that we do this in a way such that we maintain the meaningful properties of the original data.

This can be useful for two reasons.

Reason Number One: Data Compression

Reason number one is data compression. A lot of the time, the data we’ll be dealing with has many redundant features. For example, if we wish to predict the price of a house in Boston using linear regression, we may be working with data that represents the size of the house in two different units of measure, say meters and feet. Do we really need all that information? The house’s project manager might, but your machine learning model will certainly be fine with one or the other. We can reduce the size of our dataset by compressing this information into one unit instead of two. This will not only lead to a smaller dataset, but also to faster learning for your algorithm. Consider the following two-dimensional graph, where x^(i) represents a training point defined by the features x_1 and x_2:

Figure 1: Random Training Points on a Graph

We wish to reduce this into a one-dimensional graph. To do so, we’ll fit a line through our points:

Figure 2: Line Fitted Through Random Training Points

Then calculate the orthogonal distance from our points to the fitted line:

Figure 3: Orthogonal Distance from Training Points to Fitted Line

You might be tempted to believe that the above graph is the same as what we see when we perform linear regression using ordinary least squares (refer to part two), but as we’ll explain a little later, the two have nothing to do with each other.

With the orthogonal distances calculated, we can project our points onto the fitted line. The value a training point takes on the line is the value it will hold in our one-dimensional space. Here’s what that will look like:

Figure 4: Training Points After Dimensionality Reduction

Where {z^(1),z^(2),z^(3),z^(4)} are the new values after projecting {x^(1),x^(2),x^(3),x^(4)} onto our fitted line, and z_1 is the new feature describing our training points.
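To make the projection concrete, here’s a minimal numpy sketch of this step. The four points and the direction of the fitted line are made-up values, and for simplicity the line is assumed to pass through the origin:

```python
import numpy as np

# Hypothetical 2-D training points x^(1)..x^(4) (made-up values).
X = np.array([[1.0, 1.1],
              [2.0, 1.9],
              [3.0, 3.2],
              [4.0, 3.8]])

# Assumed direction of the fitted line, normalized to a unit vector.
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)

# Projecting each point onto the line gives its single new feature z_1.
z = X @ u          # z^(1)..z^(4), one value per training point
print(z)           # approximately [1.48 2.76 4.38 5.52]
```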

Orthogonal vs Squared Distances

In Figure 3 we saw a graph that can be confused with the typical graph we see when calculating the distance between the training points and the fitted line in linear regression.

Although they are visually similar, the two are communicating completely different information. Let’s compare them side-by-side:

Figure 5: Squared vs Orthogonal Distance

We see that the orthogonal distances are normal to the fitted line. This amounts to assuming there is error in the independent variable (x-axis) as well as in the dependent variable (y-axis). We can see this from the fact that the distance between a training point and the line has both an x and a y component:

Figure 6: X and Y Components in Orthogonal Distances

With the squared vertical distances used in linear regression, the assumption is that there’s no error in our independent variables. As such, we calculate only the y distance from the training point to the fitted line.
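Here’s a small sketch of the difference, using a made-up line y = a + b*x and a made-up point. The vertical distance is what ordinary least squares works with, while the orthogonal (perpendicular) distance is what we use when projecting for dimensionality reduction:

```python
import numpy as np

# Made-up fitted line y = a + b*x and a single made-up training point.
a, b = 0.5, 1.0
x0, y0 = 2.0, 3.2

residual = y0 - (a + b * x0)              # signed vertical gap

vertical_distance = abs(residual)                         # used by least squares
orthogonal_distance = abs(residual) / np.sqrt(1 + b**2)   # used when projecting

print(vertical_distance)    # 0.7
print(orthogonal_distance)  # ~0.495
```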

Reason Number Two: Visualization

It’s much easier to visualize our dataset when it’s two- or three-dimensional. When dealing with a dataset with thousands of features, dimensionality reduction can help us visualize our data more easily by shrinking the space our data lives in down to two or three dimensions.

Linear Algebra Review

Before we can describe the PCA algorithm, there are a few linear algebra and statistics concepts that we need to review.

Variance, Covariance and the Covariance Matrix

  1. Variance: A measure describing how dispersed a dataset is. It describes the spread between the data points and the dataset’s mean:

Equation 1: Variance of a Population
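For reference, this is the standard formula for the variance of a population with mean μ:

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2
```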

  2. Covariance: A measure describing how two random variables react to each other’s changes. You might be tempted to believe that this definition is the same as that of the correlation between two variables. You’re not completely wrong. The difference between the two is that correlation is standardized while covariance isn’t. That is, the correlation between two variables is always a value between -1 and +1, while the covariance can take on any value. Note also that the covariance between a variable and itself is equal to the variance of said variable.

Equation 2: Covariance of a Population
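For reference, the covariance of a population, where μ_X and μ_Y are the means of X and Y:

```latex
\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu_X\right)\left(y_i - \mu_Y\right)
```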

  3. Covariance matrix: A square and symmetric matrix describing the covariance between every pair of features. Consider a simple 2x2 matrix that describes the covariance between two features A and B. The covariance matrix would look something like:

Figure 7: Covariance Matrix Example
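Written out, the 2x2 covariance matrix for features A and B looks like this:

```latex
\Sigma =
\begin{pmatrix}
\mathrm{Var}(A)    & \mathrm{Cov}(A, B)\\
\mathrm{Cov}(B, A) & \mathrm{Var}(B)
\end{pmatrix}
```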

Notice that the diagonal consists of all the variances of the feature set. Note also that Cov(A,B) = Cov(B,A), making our matrix symmetrical.

Principal Component Analysis

Principal component analysis (PCA) is the most widely used algorithm when dealing with dimensionality reduction.

In the example where we reduced the dimension of our training set from a two-dimensional problem to a one-dimensional problem, our aim was to find a line that minimizes the projection error. That is, we wanted to find a line where the average orthogonal distance between the training points and the line was minimal. More formally, we wanted to find a two-dimensional unit vector u^(1) (or -u^(1)) onto which we can project the training data so as to minimize the projection error:

Figure 8: Unit Vector That Minimizes Projection Error

But obviously, our data isn’t always going to be two-dimensional. So, a more generalized way of stating our problem is: Find k vectors {u^(1),u^(2),...,u^(k)} onto which to project the data, so as to minimize the projection error.
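One common way of writing this objective is as the average squared projection error, where x_approx^(i) denotes the projection of x^(i) onto the subspace spanned by the chosen vectors:

```latex
\frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - x_{\mathrm{approx}}^{(i)} \right\rVert^2
```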

Let’s look at how PCA finds these vectors.

Intuition

The first thing we’ll want to do is normalize our data using mean normalization. If you don’t know what that means, refer to part one of the series, Data Pre-Processing. The most important thing to keep in mind is that normalizing will change our data so that each feature has a mean of zero.

Step two is feature scaling, but only if your data requires it. Again, I refer you to part one if you don’t know what that means.
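As a rough sketch of these two pre-processing steps (assuming the training data is stored row-wise in a numpy array X, with made-up values):

```python
import numpy as np

# Made-up training data: m = 4 examples, n = 2 features, one example per row.
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 320.0],
              [4.0, 380.0]])

# Mean normalization: after this, every feature has a mean of zero.
X = X - X.mean(axis=0)

# Feature scaling (only if your features are on very different scales),
# here done by dividing each feature by its standard deviation.
X = X / X.std(axis=0)
```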

Once we’re done with all the data pre-processing, we’re left with a dataset whose mean is zero and that is ready to be used in our PCA algorithm. First things first, we calculate the covariance matrix of our training data. You might be wondering why. Think about it this way: we said that covariance is similar to correlation, in that it captures how two variables change together. Well, if the goal of PCA is to reduce the dimension of our training data by finding redundant features, what better way of spotting similar features than by checking whether they change in similar ways? Here’s how we construct our covariance matrix:

Equation 3: Covariance Matrix
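Written out, with x^(i) the i-th (already mean-normalized) training example as an n x 1 column vector, and m the number of training examples:

```latex
\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\left(x^{(i)}\right)^{T}
```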

This equation might cause some confusion:

  1. Don’t mistake the covariance matrix symbol sigma (Σ) for the summation symbol. Same symbol, different meanings.
  2. If you compare this with the covariance equation we saw in equation 2, you’ll notice that we’re not subtracting the mean from our training examples. Why? Well, since we normalized our data, the mean of our data is zero, so we omit that part of the equation.
  3. I advise you to run through an example to see how this successfully calculates the covariance matrix. Give yourself two 3x1 training examples, then apply the covariance matrix equation; a small numpy sketch of exactly that follows this list.
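Here’s a minimal numpy sketch of that exercise, with two made-up (and already mean-normalized) 3x1 training examples:

```python
import numpy as np

# Two made-up, mean-normalized 3-dimensional training examples (3 x 1 columns).
x1 = np.array([[1.0], [-2.0], [0.5]])
x2 = np.array([[-1.0], [2.0], [-0.5]])
m = 2

# Sigma = (1/m) * sum of x^(i) (x^(i))^T over the training set.
Sigma = (x1 @ x1.T + x2 @ x2.T) / m   # 3 x 3 covariance matrix

# The diagonal holds each feature's variance, and Sigma is symmetric.
print(Sigma)
```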

Next, we perform singular value decomposition (SVD) on our covariance matrix. The theory behind SVD is well beyond the scope of this article, so we’ll just quickly describe how it benefits us. SVD is a matrix decomposition algorithm. Given a matrix M, SVD will decompose it into three separate matrices U, S and V. All three of these matrices hold valuable meaning. In our case, the matrix we’re most concerned with is the U matrix. Its columns are our unit vectors {u^(1),u^(2),...,u^(n)}. Notice that U is an n x n matrix. All we need to do is select the first k columns. For example, if we wish to reduce our dataset from three dimensions to two, U would be a 3 x 3 matrix, and we’d select its first two columns.

Figure 9: Unit Vector Matrix

Finally, all that’s left to do is multiply our training examples by the selected unit vectors (the first k columns of U), and we end up with the projected values z^(i) of all our x^(i) training examples:

Figure 10: Projected Training Data
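Putting the whole pipeline together, here’s a minimal, self-contained numpy sketch of the steps described above (mean normalization, covariance matrix, SVD, projection). The data is made up, k is the number of dimensions to keep, and this is meant as an illustration of the idea rather than a production implementation:

```python
import numpy as np

def pca(X, k):
    """Reduce X (m examples x n features) to k dimensions."""
    m = X.shape[0]

    # Step 1: mean normalization (feature scaling omitted for brevity).
    X = X - X.mean(axis=0)

    # Step 2: covariance matrix, Sigma = (1/m) * sum of x^(i) (x^(i))^T.
    Sigma = (X.T @ X) / m               # n x n

    # Step 3: singular value decomposition of the covariance matrix.
    U, S, Vt = np.linalg.svd(Sigma)     # U is n x n

    # Step 4: keep the first k columns of U and project the data onto them.
    U_reduce = U[:, :k]                 # n x k
    Z = X @ U_reduce                    # m x k projected training examples
    return Z, U_reduce

# Made-up three-dimensional dataset reduced to two dimensions.
X = np.random.default_rng(0).normal(size=(100, 3))
Z, U_reduce = pca(X, k=2)
print(Z.shape)                          # (100, 2)
```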

Conclusion

In this article, we looked at the concept of dimensionality reduction and the most widely used reduction algorithm, principal component analysis (PCA). We also did a quick review of important linear algebra and statistics concepts.

In the next article, we’ll look at anomaly detection. Until then, we leave you to ponder the following points:

  • Is all data dimension-reducible? If not, in what cases is it not?
  • Extend what we did when we reduced our dataset from two dimensions to one, but this time go from a three-dimensional dataset to a two-dimensional one.

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
  4. Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition
  5. Part Four — 2: Neural Networks Part 2: Backpropagation and Gradient Checking
  6. Part Six: Evaluating Your Hypothesis and Understanding Bias vs Variance
  7. Part Seven: Support Vector Machines and Kernels
  8. Part Eight: Unsupervised Learning and the Intuition Behind K-Means Clustering

