Wednesday, May 15, 2019

R.I.P. HDFS | The Cloud Wins!


HDFS is an evolutionary dead end in the tree of big data. Data lakes based on S3 object storage deliver on the promise of separating storage from compute and make it possible to scale your processing and downstream analytics/AI and data marts on top of a data lake in an agile and elastic fashion. The HDFS architecture always bugged me when it was first released (besides the fact it is written in Java). Moving the code to the Hadoop data node (usually only three replicas available by the way), seemed to be inherently limiting to me. It was not really better than using big unix SMP servers other than you got to use cheaper commodity hardware and grow incrementally. Good stuff, but not good enough - 1 step forward and a half step backwards.
While the idea of moving code to the data sounded cool at the time, it is fundamentally a bad data processing design for a truly scalable data lake that allows for rolling up an arbitrary number ephemeral compute clusters on top of your storage. There is a place for HDFS and traditional Hadoop clusters, if you have big fixed and slow evolving predictable cluster of compute/storage environment. For the rest of us, a cloud based data lake architecture will win in the end and allow for agile development to meet the fast paced needs of downsteam today's BI, analytics and AI/ML applications that need to sit on top of the mythical data lake.