Thursday, July 2, 2020

Tuning the Snowflake Data Cloud


To be clear, I do not classify Snowflake as an OLAP or MPP database. It has those capabilities for sure, but being born in the cloud and built only for the cloud, it has much more to offer. I consider it a "Data Fabric". Yeah, that is a broad term, but Snowflake is really what Big Data and Hadoop were aspiring to achieve but never did, for reasons I won't get into here.

What makes Snowflake a game changer for OLAP engines and data warehousing? The features shown in the diagram are all real and not just marketing spin. How does Snowflake accomplish this? Built in the cloud and only for the cloud means that Snowflake takes full advantage of two key superpowers available only in the cloud: 1) elastic, virtually limitless, highly durable immutable storage, and 2) the ability to spin up virtually limitless compute. It starts with those two things, and there is a lot more in Snowflake on top of them to deliver a full-package solution.
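To make the second superpower concrete, here is a minimal sketch of elastic compute using the snowflake-connector-python package. The account, user, warehouse, and table names are all hypothetical placeholders, and resizing around a heavy query is just one way to exploit on-demand scaling:

```python
# Minimal sketch: elastic compute via on-demand warehouse resizing.
# Assumes snowflake-connector-python; account/user/warehouse/table
# names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="my_user",
    password="...",            # prefer key-pair auth or SSO in practice
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Scale the warehouse up for a heavy workload...
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")

# ...run the expensive query while the extra horsepower is active...
cur.execute("SELECT COUNT(*) FROM my_db.my_schema.big_fact_table")
print(cur.fetchone())

# ...then scale back down (or let AUTO_SUSPEND park it) to stop paying.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL'")

cur.close()
conn.close()
```

Because storage and compute are decoupled, a resize like this takes effect almost immediately, and you are billed only while the warehouse is actually running.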



If Snowflake has no developer/DBA-configurable indexes, partitioning, distribution keys, vacuuming, stats tuning, storage tuning, etc. like other MPP/OLAP engines, is there anything you can really tune (be careful with automatic re-clustering)? With great power comes great responsibility: there is plenty you can do to tune and optimize for performance, but you also need to monitor and manage costs, because it is all too easy to scale up and out, and that will cost you. From a schema modeling/design perspective, there are optimizations that minimize how much compute you need to scale up/out (and thus your costs). One of them is clustering/sort keys, one of the few things you can tune at the DDL/metadata level; see the sketch below. How you use materialized views, and how you balance joins vs. de-normalization, are also important considerations. All of these choices depend heavily on downstream consumption/usage patterns. So yes, you still need good data engineers and architects :)
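For illustration, here is a minimal sketch of two of the schema-level levers mentioned above: a clustering key and a materialized view. The table, column, and connection names are hypothetical placeholders, and the actual key and view definitions would need to match your real downstream query patterns:

```python
# Minimal sketch of schema-level tuning: a clustering key and a
# materialized view. Table, column, and connection names are
# hypothetical placeholders. Assumes snowflake-connector-python.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ANALYTICS_WH", database="MY_DB", schema="MY_SCHEMA",
)
cur = conn.cursor()

# Clustering key: co-locate micro-partitions by the columns your
# downstream queries filter on most (e.g. date + customer).
cur.execute("""
    ALTER TABLE sales_fact CLUSTER BY (sale_date, customer_id)
""")

# Check how well-clustered the table is on those columns; heavy
# partition overlap means pruning (and thus compute) will suffer.
cur.execute("""
    SELECT SYSTEM$CLUSTERING_INFORMATION(
        'sales_fact', '(sale_date, customer_id)')
""")
print(cur.fetchone()[0])

# Materialized view (Enterprise Edition feature): pre-aggregate a hot
# access path so consumers don't re-scan the base table on every load.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales_mv AS
    SELECT sale_date, customer_id, SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY sale_date, customer_id
""")

cur.close()
conn.close()
```

Note that defining a clustering key enables Snowflake's automatic background re-clustering, which itself consumes credits; that is exactly the "be careful" caveat above.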

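On the cost-management point, resource monitors are one built-in guardrail Snowflake offers. Below is a minimal sketch, again with hypothetical names and quota values; creating a resource monitor requires the ACCOUNTADMIN role:

```python
# Minimal sketch: capping spend with a resource monitor so "easy to
# scale" doesn't become "easy to overspend". Names and the credit
# quota are hypothetical; requires the ACCOUNTADMIN role.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_admin_user", password="...",
    role="ACCOUNTADMIN",
)
cur = conn.cursor()

# Monthly credit budget: notify at 80%, suspend the warehouse at 100%.
cur.execute("""
    CREATE OR REPLACE RESOURCE MONITOR monthly_budget_rm
    WITH CREDIT_QUOTA = 100
         FREQUENCY = MONTHLY
         START_TIMESTAMP = IMMEDIATELY
    TRIGGERS ON 80 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND
""")

# Attach the monitor to the warehouse doing the scaling up/out.
cur.execute(
    "ALTER WAREHOUSE ANALYTICS_WH SET RESOURCE_MONITOR = monthly_budget_rm"
)

cur.close()
conn.close()
```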