Gianluca Malato
1 min read · Sep 30, 2020


1 - With big data, you should use a parallel computing framework such as Spark. The code will differ from what I've written in this article, depending on the framework you use.

2 - Yes, well-performed stratified sampling will preserve the interactions, as long as the destination dataset is large enough for its statistics to converge properly.
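
To make the second point concrete, here is a minimal sketch in plain Python. It is not the code from the article: the function names `stratified_sample` and `shares`, the made-up columns `a` and `b`, and the 10% fraction are all assumptions chosen for illustration. It stratifies on the joint value of the two columns and then compares how often each combination appears in the population and in the sample.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly the same fraction of rows from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare groups are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def shares(rows):
    """Relative frequency of each distinct row."""
    counts = Counter(rows)
    return {cell: n / len(rows) for cell, n in counts.items()}


# Made-up data: b is more likely to be True when a == "x" (an interaction).
gen = random.Random(42)
population = []
for _ in range(10_000):
    a = gen.choice("xy")
    b = gen.random() < (0.8 if a == "x" else 0.2)
    population.append((a, b))

# Stratify on the joint (a, b) cell, then compare cell shares.
sample = stratified_sample(population, key=lambda r: r, fraction=0.1)
pop_share, sample_share = shares(population), shares(sample)
max_gap = max(abs(pop_share[c] - sample_share.get(c, 0.0)) for c in pop_share)
print(f"sample size: {len(sample)}, largest share gap: {max_gap:.4f}")
```

With a much smaller population or fraction, the rounding inside each stratum weighs more heavily and the gap tends to grow, which is the "large enough" condition mentioned above.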


Written by Gianluca Malato

Theoretical physicist, data scientist and fiction author. I teach Data Science, statistics and SQL on YourDataTeacher.com. E-mail: gianluca@gianlucamalato.it
