Sep 30, 2020
1 - With big data you should use a parallel computing framework such as Spark. The code will differ from what I've written in this article, depending on the framework you use.
2 - Yes, a well-performed stratified sampling will preserve the interactions between variables, as long as the destination dataset is large enough for the statistics to converge properly.
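
To make the second point concrete, here is a minimal sketch in pandas (not Spark, and not the code from the article). It assumes a synthetic dataset and a hypothetical helper `stratified_sample`; the column names `group`, `x` and `y` are made up for illustration. The relationship between `x` and `y` depends on `group`, and the sketch checks that both the group proportions and the per-group correlation survive the sampling.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, strata_col, frac, seed=0):
    """Draw the same fraction of rows from every stratum (hypothetical helper)."""
    return df.groupby(strata_col).sample(frac=frac, random_state=seed)


def per_group_corr(df):
    """Correlation between x and y computed separately inside each group."""
    return df.groupby("group")[["x", "y"]].apply(lambda g: g["x"].corr(g["y"]))


# Synthetic data (assumption): the slope of y on x changes with the group,
# which is the kind of interaction the sample should preserve.
rng = np.random.default_rng(42)
n = 100_000
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n, p=[0.6, 0.3, 0.1]),
    "x": rng.normal(size=n),
})
slope = df["group"].map({"a": 1.0, "b": -1.0, "c": 3.0})
df["y"] = slope * df["x"] + rng.normal(scale=0.1, size=n)

sample = stratified_sample(df, "group", frac=0.1)

# Compare group proportions and per-group correlations, full data vs sample.
full_props = df["group"].value_counts(normalize=True).sort_index()
sample_props = sample["group"].value_counts(normalize=True).sort_index()
full_corr = per_group_corr(df).sort_index()
sample_corr = per_group_corr(sample).sort_index()

print(pd.DataFrame({"full": full_props, "sample": sample_props}))
print(pd.DataFrame({"full": full_corr, "sample": sample_corr}))
```

With a much smaller `frac`, the rarest group would contain only a handful of rows and its correlation estimate would become noisy, which is the "large enough" condition mentioned above.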