How to choose the bins of a histogram?
Histograms are a very useful tool when we want a quick look at the shape of our data. To get a faithful picture, however, we have to choose the number of bins carefully.
What is a histogram?
A histogram is an approximate representation of the probability distribution of a dataset. Given a bin width, the range of the variable is split into non-overlapping intervals (the bins) of that width and, for each bin, we count how many values fall inside it. This count determines the height of the corresponding bar.
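To make the counting step concrete, here is a minimal sketch using NumPy. The toy values and the bin width of 1.0 are assumptions chosen only for illustration; they do not come from the dataset used later in the article.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
values = np.array([0.2, 0.7, 1.1, 1.4, 1.8, 2.5, 2.9, 3.6])
bin_width = 1.0  # assumed bin width

# Build bin edges that cover the whole range of the data in steps of bin_width
start = np.floor(values.min() / bin_width) * bin_width
stop = np.ceil(values.max() / bin_width) * bin_width
edges = np.arange(start, stop + bin_width / 2, bin_width)

# Count how many values fall inside each bin
counts, _ = np.histogram(values, bins=edges)

print(edges)   # bin boundaries
print(counts)  # one count per bin, i.e. the bar heights
```

Each entry of counts is the height of one bar. Note that np.histogram treats every bin as half-open on the right, except the last one, which also includes its right edge.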
Histograms are very useful because they give us a clear overview of the shape of the distribution of a variable. We can easily see if a variable is skewed, if it’s multimodal, if it has fat tails and so on. That’s why mastering histograms is essential for any data scientist or analyst.
But there’s a problem: how do we choose the number of bins in a histogram?
The number of bins
Let’s work through a simple example in Python. We simulate 6000 random points drawn from a standard normal distribution.
import numpy as np
import matplotlib.pyplot as plt

# Fix the seed so the simulation is reproducible
np.random.seed(0)
x = np.random.normal(size=6000)
If we plot the histogram using the plt.hist function, the default number of bins is 10.
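The default can be checked directly from the values that plt.hist returns. The snippet below is a small sketch that regenerates the same sample so it runs on its own; the Agg backend line is an assumption added so that it also works without a display, and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Same simulated sample as in the article
np.random.seed(0)
x = np.random.normal(size=6000)

# With no bins argument, plt.hist falls back to its default number of bins
counts, edges, patches = plt.hist(x)

print(len(counts))          # number of bins
print(len(edges))           # number of bin edges, one more than the number of bins
print(np.diff(edges)[0])    # common width of the bins
plt.close()
```

With the default setting, the bins are equally wide and span the interval from the smallest to the largest value of x, so the bin width is simply the range of the data divided by the number of bins.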