Are you still using the mean and median values to clean the blanks?

Mathematical foundations of blank filling

Gianluca Malato

--

Image by author

Working with missing values is a common task in machine learning. We can say that it’s the very first task to accomplish before starting a pre-processing pipeline. The most common approach to blank filling is to use the mean and the median values. Although this is a very common practice, maybe there’s a more data-driven approach we can use.

The need for blank filling

Machine learning models are mathematical formulas (or rely on mathematical formulas), so they need to work with numbers. When our data is missing, there’s no number to learn from and the model can’t handle this situation by itself. It needs to see numbers everywhere in the dataset it works with, so we have to fill in the possible blanks in order to make the model work properly.

The origins of the blanks may be several. For example, data doesn’t exist in the database we extract it from. Or data may be corrupted or dirty or even manually created so that it may contain errors and missing values due to human mistakes.

All these issues rise the need to fill in the blanks and clean our data.

Mean or median? Maybe none of them

--

--

Gianluca Malato

Theoretical Physicists, Data Scientist and fiction author. I teach Data Science, statistics and SQL on YourDataTeacher.com. E-mail: gianluca@gianlucamalato.it