Why You Shouldn’t Use PCA in a Supervised Machine Learning Project
Some flaws of Principal Component Analysis that affect supervised machine learning projects
Principal Component Analysis is a very useful dimensionality reduction tool: it can dramatically reduce the number of features of a model. Although it may seem a powerful instrument for a data scientist, it has some drawbacks that, in my opinion, make it unsuitable for supervised machine learning projects.
What is PCA?
Principal Component Analysis is a tool introduced by Karl Pearson (yes, “that” Karl Pearson). It is a procedure that applies a linear transformation to the features in order to obtain new features whose covariance matrix is diagonal.
The variance of a sum of random variables is:
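For two random variables X and Y, this is the familiar identity (spelled out here in LaTeX):

```
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\,\operatorname{Cov}(X, Y)
```

More generally, for a sum of n variables the extra terms involve every pair of covariances:

```
\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\operatorname{Var}(X_i) + 2\sum_{i<j}\operatorname{Cov}(X_i, X_j)
```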
Since the covariance of a variable with itself is its variance, we can define the covariance matrix as:
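With n features X_1, …, X_n, one common way to write this matrix is with the pairwise covariances off the diagonal and the variances on the diagonal:

```
\Sigma =
\begin{pmatrix}
\operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) & \cdots & \operatorname{Cov}(X_1, X_n) \\
\operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2) & \cdots & \operatorname{Cov}(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}(X_n, X_1) & \operatorname{Cov}(X_n, X_2) & \cdots & \operatorname{Var}(X_n)
\end{pmatrix}
```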
By performing PCA, the resulting new features in a rotated space don’t show any covariance with each other.
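As a quick sanity check, here is a minimal sketch (the synthetic dataset, the random seed and the use of scikit-learn's PCA are my own choices for illustration) showing that the covariance matrix of the PCA-transformed features is essentially diagonal:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: three correlated features from a multivariate normal
rng = np.random.default_rng(0)
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[1.0, 0.8, 0.3],
         [0.8, 1.0, 0.5],
         [0.3, 0.5, 1.0]],
    size=1000,
)

# Covariance of the original features: clear off-diagonal structure
print(np.round(np.cov(X, rowvar=False), 2))

# Covariance of the PCA-transformed features: off-diagonal terms ~ 0
Z = PCA().fit_transform(X)
print(np.round(np.cov(Z, rowvar=False), 2))
```

The second printed matrix has (near-)zero entries everywhere except on the diagonal, which is exactly the “no covariance between the new features” property described above.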
Let’s consider a set of features.
This is their covariance matrix: