Thursday, August 15, 2019

Data Lake vs Data Warehouse


Is a data lake part of your data warehouse platform or does the data lake sit beside it? There is a fair amount of ambiguity as to what a data lake is and how it should fit into your overall data strategy. 

I believe data lakes (coupled with elastic cloud storage and compute) are a game changer in both the DW and BI world. Your data warehousing strategy should be part of the data lake, not the other way around. While you don't have to throw away everything you have done or learned in your traditional ETL and DW world, the fundamentals have changed. 

To take advantage of your data and build better BI/analytics, you must build atop a solid data lake foundation. And this goes well beyond the many failed Big Data and Hadoop projects that so many enterprises have experienced in the recent past. 

While Hadoop was a necessary step forward at the time, it was and is an evolutionary dead end - RIP Hadoop. Cloud data lakes are the future, and they take more than dumping your data into S3 buckets. 

Well-architected data lakes are the culmination of a coherent data management strategy, one that leverages the strengths of cloud services along with many traditional DW best practices and data governance policies.
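
To make that concrete, here is a minimal sketch of one small piece of it - assuming PySpark, an S3-enabled Spark build, and hypothetical bucket/path names - landing raw files in one zone of the lake and curating them into typed, partitioned Parquet that downstream BI and analytics can query efficiently:

# Minimal PySpark sketch: raw zone -> curated zone.
# Bucket names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-orders").getOrCreate()

# Raw zone: data lands as-is (e.g. CSV dumped from a source system).
raw = spark.read.option("header", "true").csv("s3a://example-lake/raw/orders/")

# Curated zone: apply types and write partitioned, columnar Parquet
# so downstream consumers never have to touch the raw files.
curated = (
    raw.withColumn("order_date", F.to_date("order_date"))
       .withColumn("amount", F.col("amount").cast("double"))
)

(curated.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3a://example-lake/curated/orders/"))

The zoning and partitioning are the point: the buckets are just storage, while the layout, typing, and governance around them are what make it a data lake rather than a data swamp.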

Wednesday, May 15, 2019

R.I.P. HDFS | The Cloud Wins!


HDFS is an evolutionary dead end in the tree of big data. Data lakes based on S3 object storage deliver on the promise of separating storage from compute, and they make it possible to scale your processing and downstream analytics/AI and data marts on top of a data lake in an agile and elastic fashion. The HDFS architecture bugged me from the time it was first released (besides the fact that it is written in Java). Moving the code to the Hadoop data nodes (with usually only three replicas available, by the way) always seemed inherently limiting to me. It was not really better than using big Unix SMP servers, other than letting you use cheaper commodity hardware and grow incrementally. Good stuff, but not good enough - one step forward and half a step back.
While the idea of moving code to the data sounded cool at the time, it is fundamentally a bad data processing design for a truly scalable data lake that allows for spinning up an arbitrary number of ephemeral compute clusters on top of your storage. There is still a place for HDFS and traditional Hadoop clusters if you have a big, fixed, slowly evolving, predictable compute/storage environment. For the rest of us, a cloud based data lake architecture will win in the end and allow for agile development to meet the fast paced needs of today's downstream BI, analytics, and AI/ML applications that need to sit on top of the mythical data lake.
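
To make the separation of storage and compute concrete, here is a minimal PySpark sketch (bucket and path names are hypothetical, and it assumes a Spark build with the S3 connector): a short-lived cluster attaches to data already sitting in S3, runs its job, writes results back to the lake, and is torn down, while the data stays put for whatever cluster comes next.

# Sketch of "ephemeral compute over durable storage".
# Paths are hypothetical; any short-lived Spark cluster can run this.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Read curated Parquet straight from object storage - no HDFS, no data nodes.
orders = spark.read.parquet("s3a://example-lake/curated/orders/")

daily_revenue = (
    orders.groupBy("order_date")
          .agg(F.sum("amount").alias("revenue"))
          .orderBy("order_date")
)

# Write the result back to the lake for BI tools or a data mart to pick up.
daily_revenue.write.mode("overwrite").parquet("s3a://example-lake/marts/daily_revenue/")

spark.stop()  # the cluster can disappear; the data in S3 remains

Nothing in that job is tied to a particular cluster or set of data nodes, which is exactly the property a fixed HDFS cluster can't give you.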