Cloud Analytics & ML with Sam Taha

Thursday, November 26, 2020

Are Open Cloud Data Lakes the Future?

Building a cloud data platform? The first question is whether to go with an open data lake, a proprietary data warehouse (DW), or a mix of both. It is not a simple architecture decision, given the flood of solutions and players in the space, from the large cloud platforms to newer entrants such as Snowflake.

I see the argument from Fivetran's George Fraser that cloud MPP DW engines with decoupled storage and compute, such as Snowflake, are the way to go. On the flip side, I also see the argument from Dremio's Tomer Shiran that an open data lake built on open data standards (Apache Parquet and Arrow), along with data virtualization, is the way to go.

What is the right answer? Well, as with most things in engineering and technology, there is no one-size-fits-all answer. I do believe that data virtualization in the cloud, combined with cloud storage, has been a game changer. Presto paved the way by demonstrating that data and query federation is possible, especially in a cloud environment. While HDFS/Hadoop largely fizzled for reasons I won't get into here, Parquet, Arrow and other Apache projects have taken off and brought us the modern data lake. Big data, for both compute and storage, has proved its scale and manageability in the cloud.
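
To make "query federation" a bit more concrete, here is a toy sketch in Python. It is not Presto and not a real data lake: it uses SQLite's ATTACH DATABASE from the standard library as a stand-in for a federation engine, so that one SQL query joins tables living in two separate database files. The file names, table names, and sample rows are all made up for illustration.

```python
import os
import sqlite3
import tempfile


def build_source(path, ddl, insert_sql, rows):
    """Create one standalone SQLite file, standing in for an independent data store."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(ddl)
        conn.executemany(insert_sql, rows)
        conn.commit()
    finally:
        conn.close()


def federated_revenue_by_region(warehouse_path, lake_path):
    """Run a single SQL query that joins tables held in two separate stores."""
    conn = sqlite3.connect(warehouse_path)
    try:
        # ATTACH exposes the second file under its own schema name, loosely
        # analogous to a second catalog in a federation engine.
        conn.execute("ATTACH DATABASE ? AS lake", (lake_path,))
        return conn.execute(
            """
            SELECT c.region, SUM(o.amount) AS revenue
            FROM main.customers AS c
            JOIN lake.orders AS o ON o.customer_id = c.id
            GROUP BY c.region
            ORDER BY c.region
            """
        ).fetchall()
    finally:
        conn.close()


def demo():
    """Build two hypothetical stores in a temp directory and query across them."""
    with tempfile.TemporaryDirectory() as tmp:
        warehouse = os.path.join(tmp, "warehouse.db")  # hypothetical "DW" store
        lake = os.path.join(tmp, "lake.db")            # hypothetical "lake" store
        build_source(
            warehouse,
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)",
            "INSERT INTO customers VALUES (?, ?)",
            [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
        )
        build_source(
            lake,
            "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
            "INSERT INTO orders VALUES (?, ?)",
            [(1, 100.0), (2, 50.0), (3, 25.0), (1, 10.0)],
        )
        return federated_revenue_by_region(warehouse, lake)


if __name__ == "__main__":
    for region, revenue in demo():
        print(region, revenue)
```

The point of the sketch is only the shape of the idea: the query author writes one statement and does not care which store holds which table. A real federation engine does the same across very different systems, such as object storage and relational databases, at much larger scale.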

How much of your data to keep in a proprietary cloud DW versus an open cloud data lake is an important decision. There is a balance that avoids locking you in completely while still letting you use the best technology of the day and manage costs. Be wise.