
Why You Shouldn’t Use PCA in a Supervised Machine Learning Project

Some flaws of Principal Component Analysis that affect supervised machine learning projects

Gianluca Malato
5 min read · Jul 10, 2022

Principal Component Analysis (PCA) is a widely used dimensionality reduction tool that can reduce the number of features a model has to handle. Although it looks like a powerful tool for a data scientist, it has some drawbacks that, in my view, make it unsuitable for supervised machine learning projects.

What is PCA?

Principal Component Analysis is a tool introduced by Karl Pearson (yes, “that” Karl Pearson). It is a procedure that applies a linear transformation to the features in order to get new features whose covariance matrix is diagonal.

The variance of a sum of random variables is:
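
In standard notation, for random variables $x_1, \dots, x_n$ (the symbols $x_i$ and $n$ are notation assumed here), the textbook identity reads:

```latex
\operatorname{Var}\left(\sum_{i=1}^{n} x_i\right)
  = \sum_{i=1}^{n} \operatorname{Var}(x_i)
  + 2 \sum_{i<j} \operatorname{Cov}(x_i, x_j)
```

The cross terms vanish only when every pair of variables has zero covariance.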

Since the covariance of a variable with itself is its variance, we can define the covariance matrix as:
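
A common way to write this definition, for features $x_1, \dots, x_n$ (the symbols $C$, $x_i$ and $n$ are notation assumed here), is:

```latex
C_{ij} = \operatorname{Cov}(x_i, x_j), \qquad
C_{ii} = \operatorname{Cov}(x_i, x_i) = \operatorname{Var}(x_i)

C =
\begin{pmatrix}
\operatorname{Var}(x_1)      & \operatorname{Cov}(x_1, x_2) & \cdots & \operatorname{Cov}(x_1, x_n) \\
\operatorname{Cov}(x_2, x_1) & \operatorname{Var}(x_2)      & \cdots & \operatorname{Cov}(x_2, x_n) \\
\vdots                       & \vdots                       & \ddots & \vdots                       \\
\operatorname{Cov}(x_n, x_1) & \operatorname{Cov}(x_n, x_2) & \cdots & \operatorname{Var}(x_n)
\end{pmatrix}
```

The matrix is symmetric, with the variances on the diagonal and the pairwise covariances off the diagonal. The variance of the sum of the features is then simply the sum of all its entries, $\sum_{i,j} C_{ij}$.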

After PCA, the new features live in a rotated space and have zero covariance with each other.
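
This decorrelation can be checked numerically. The following is a minimal sketch, not the code behind this article: it assumes NumPy is available, builds a purely synthetic dataset, and uses a hypothetical helper `pca_rotate` that performs PCA through an eigendecomposition of the covariance matrix rather than through a library PCA class.

```python
import numpy as np


def pca_rotate(X):
    """Project X (n_samples, n_features) onto its principal axes.

    Hypothetical helper for illustration: PCA via eigendecomposition
    of the feature covariance matrix.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    return Xc @ eigvecs[:, order], eigvals[order]


# Synthetic data (an assumption for this sketch): two features driven by
# the same underlying signal, plus one independent feature.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2.0 * base + 0.5 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

Z, variances = pca_rotate(X)

cov_before = np.cov(X, rowvar=False)  # has non-zero off-diagonal entries
cov_after = np.cov(Z, rowvar=False)   # diagonal up to floating-point error
off_diag = cov_after - np.diag(np.diag(cov_after))

print(np.round(cov_before, 3))
print(np.round(cov_after, 3))
```

The second printed matrix has only its diagonal populated, and those diagonal entries are the eigenvalues of the original covariance matrix. The rotation preserves the total variance, so the traces of the two matrices match.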

Let’s consider a set of features.

This is their covariance matrix:

