Are you still using the mean and median values to clean the blanks?

Mathematical foundations of blank filling

Gianluca Malato
4 min readMar 13

--

Image by author

Working with missing values is a common task in machine learning. We can say that it’s the very first task to accomplish before starting a pre-processing pipeline. The most common approach to blank filling is to use the mean and the median values. Although this is a very common practice, maybe there’s a more data-driven approach we can use.

The need for blank filling

Machine learning models are mathematical formulas (or rely on mathematical formulas), so they need to work with numbers. When our data is missing, there’s no number to learn from and the model can’t handle this situation by itself. It needs to see numbers everywhere in the dataset it works with, so we have to fill in the possible blanks in order to make the model work properly.

The origins of the blanks may be several. For example, data doesn’t exist in the database we extract it from. Or data may be corrupted or dirty or even manually created so that it may contain errors and missing values due to human mistakes.

All these issues rise the need to fill in the blanks and clean our data.

Mean or median? Maybe none of them

Numerical variables are often cleaned using the mean value or, if the variable is asymmetric, using the median value. It’s an empirical approach that is widely used in machine learning. But are we sure that a better value doesn’t exist?

Let’s create a univariate dataset made by the mixture of two lognormal distributions and some missing values.

import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd

np.random.seed(0)
col = pd.Series(np.concatenate((np.random.lognormal(size=500) ,np.random.normal(size=500,loc=10), [np.nan]*200)))

This is its histogram.

Image by author

Now, it’s time to try the blank filling using the mean value and the median value. How to choose between them? My personal experience and…

--

--

Gianluca Malato

Theoretical Physicists, Data Scientist and fiction author. I teach Data Science, statistics and SQL on YourDataTeacher.com. E-mail: gianluca@gianlucamalato.it