Wednesday, August 12, 2020

The Lost Art of Data Lineage


 
Maybe it is more of mangled and ill defined art than a lost art. Data lineage is one of the aspects of data governance that gets lost in the shuffle of data analytics and data warehousing/lake projects. It is vital for many reasons, least of them compliance and auditing. Often time data lineage never makes on the train to the final destination when building large scale data warehousing and analytics solutions.

Part of the problem is that it gets turned into an all or nothing effort leading to very little getting delivered relating to data lineage or it gets turned into a mishmash of concepts and solution features. Often it gets dumped under auditing and logging, irrespective of business or technical metadata and real-time vs historical, and then forgotten.

Let's quickly breakdown what data lineage actually entails. There are three dimensions of data lineage to consider as part of a general data governance strategy. These include:

1) Logical Data Processing Flows (logical and/or visual DAG representation)
Defining the high-level visual graph and code module level relationships between data processes stagings and steps that produce and generate data models.

2) Metadata Relationship Management (low level data relationships and logic/code)
Tracking metadata relationships between data models (schemes/tables/columns) and the related source code used in the transformation. This includes showing the transformation logic/code used going from one or multiple source data models to a target data model.

3) Physical Data Processing History (what happened, is happening, and going to happen)
At a data set level (sets of records, rows and columns), this shows the record level and data set linkage between data sets that have happened in the past or will happen in the future. There is a temporal (real-time and historical) aspect to this showing transaction events from one or multiple data sets feeding a downstream data set and the transactional breadcrumbs and events involved.

Note, that the term data model denotes a static structure (more or less your schema/tables definitions) while data sets are live physical structures (the actual data at a record level) across one or multiple data models.

Before you get started on your data lineage journey you need to decide to what extent you will implement one or all of these dimensions of data lineage as part of your overall data governance strategy. There are varying degrees of exactness and completeness to each one of them as well. And make sure to keep them distinct.

No one commercial tool will do the complete job. It is usually a combination of multiple tools, hand stitched software services, and best practices/conventions that will be necessary to do the job well and depending on your criteria for success.

Thursday, July 2, 2020

Tuning the Snowflake Data Cloud


To be clear, I do not classify Snowflake as an OLAP or MPP database. It has these capabilities for sure, but being born in the cloud and only for the cloud, it has much more to offer. I consider it a "Data Fabric". Yea that is broad term, but Snowflake is really what Big Data and Hadoop were aspiring to achieve, but never did for reasons I won't get into here.

What makes Snowflake a game changer for OLAP engines and data warehousing? The below listed features, shown in the diagram, are all true and not just marketing spin. How can Snowflake accomplish this? Built in the cloud and only for cloud - what does that really mean - Snowflake takes full advantage of two key superpowers only available in the cloud. 1) elastic and virtually limitless highly durable immutable storage and 2) spinning up virtually limitless compute. Starts with these two things, and lot more in Snowflake to deliver full package solution.



If Snowflake has no developer/DBA configurable indexes, partitioning, distribution keys, vacuuming, stats tuning, storage tuning...etc, like other MPP/OLAP engines, is there anything I can really tune (be-careful with auto re-clustering)? With great power comes great responsibility. There are multiple things you can do to tune and optimize for performance. This means being careful to monitoring and manage costs because it can be too easy to scale up and out and this will cost you. From a schema modeling/design perspective there are some optimizations you can do to minimize compute scale up/out requirements (and thus costs). One of them is using cluster/sort keys, one of the few DDL things you can tune at metadata level. Also how you use materialized views and manage joins vs de-normalization are important considerations. All these things are highly dependent on downstream consumption/usage patterns. So yes you still need good data engineers and architects :)

Thursday, June 11, 2020

Is ML Curve Fitting The Best We Got?


Curve Fitting is for the most part what most machine learning boils down to, not that that is a bad thing. How do go be beyond the correlation of the black box? I see the rediscovery of symbolic AI and the introduction of casualty into purely probabilistic ML analogous to what happened in software decades ago when we evolved from assembler and procedural languages and we started to model software/data as richer abstractions with relationships. Not the same thing, but a similar evolution in engineering and computer science.

Causal relationships exist in the world and can influence how we collect our data and engineer the features that drive our ML model training. This includes everything from how we analyze covariance in the data and in how we manage and monitor data distributions. Collecting data and engineering features is not enough. Understanding causal relationships can sometimes be gleaned from the data we observer, but often times we must look at how we can develop experiments and interventions with A/B test strategies and multi-armed banded processes to uncover the causality in order to  better train our models.

Intervention and experiments can help us answer some "what if questions" and then you have counterfactual, which are beyond the reach of most experiments, yet understanding causal relationships have the potential to offer us insights and help business make better since of the world and their opportunities. We need better tools and engineering processes to incorporate these skills into our ML frameworks and ML processes.

This is starting to happen in AI and ML today across disciplines that are applying ML. This is a good article on the topic that I suggest all ML engineers and data scientists to read.


Friday, June 5, 2020

Choosing an ML Cloud Platform: GCP vs AWS

ML cloud services are evolving fast and furious. GCP and AWS are the leading players. Here is a quick visual peak at both ML tech stacks.

AWS has SageMaker as the centerpiece:


Then there is GCP with its Kubeflow angle and on-premises hybrid cloud options:


Tuesday, June 2, 2020

Cloud OLAP: Choosing between Redshift, Snowflake, BigQuery or other?


Which to choose for your cloud OLAP engine? There are a lot of choices when it comes to cloud based analytics engines. All the major clouds have their homemade solution (GCP/BigQuery, AWS/Redshift, Azure) and their are plenty of independent options from Snowflake to Databricks to mention a few.

Which is right for your business and in what situation? Needs can vary from internal data exploration to driving downstream analytics with tight SLA. I am a strong proponent of the approach that no matter what you do that you have start with a foundational data lake blueprint and you then choose to build that with either an open source analytics engine on top of your cloud data lake or license a commercial analytic engine such as Redshift, Snowflake or BigQuery.

There is no one answer without looking at your business needs, existing technical foundation and strategic direction, but I have to say have I am getting more impressed with Snowflake as the product matures. Without getting to deep into the details, Snowflake is sort of an in memory (backed by public cloud object storage) data lake with a highly elastic in-memory MPP layer. There are many pros and cons in selecting the best option for your business. The edge Snowflake has it is cloud agnostic (sort of the Anthos of the data cloud) and I really like their cross cloud and data center replication feature (recently released feature) and cross cloud management.

If you want to discuss how to approach making this decision process look me up!

Friday, January 24, 2020

Why Spark is the Wrong Abstraction


Is the sun setting on Spark? I don't want to knock Spark and frameworks like it, they have had their moment in the sun. Spark was a reasonable and important successor to Map/Reduce & HDFS/Hadoop, but its time has come to be exiled to the fridges of the big data ecosystem and used only when absolutely necessary. Spark is still has usefulness for some specialized ETL and data processing data applications, but overall Spark can be a massive overkill and burden to program and operate (expensive too). In many cases it is inefficient both in development, troubleshooting and the overhead of infrastructure management can be expensive relative to other options.

Not Everything is a Nail

I see many projects using Spark (and tools related to it such as aws emr for example) for transforming and moving data into and out of data lakes when often times simpler tools can be used. Spark is a pretty big and complex sledge hammer and many problems can often be solved with more effective tooling. The ever growing ubiquity of serverless technology, especially database and analytics capable services, have presented engineers with many more options and it is time to dial back when it is most appropriate to consider bringing out the Spark sledge hammer.

In a lot of cases, with Spark development, you end up writing a one-off database engine for your specific ETL scenario and using Spark's distributed compute and storage abstractions and DAG engine makes that convenient. While it is possible to use Spark as a database engine of sorts, the reality is databases are better at optimization and using the available compute/storage resources. And this is specially the case with serverless technologies that support SQL as a first class citizen. For Spark SQL is really does a bolted it on.

The only big challenge is that most database platforms are not designed for the cloud or for elastic compute/storage like Spark sort of is. I say sort of, because Spark leaves too much responsibility on the developer/DevOps to make data/compute optimization decisions and infrastructure decision which is something databases are intrinsically good at.

Declarative vs Imperative Analytics

Now, there are Spark serverless type of services as well (like aws glue and other managed Spark services), but given the general purpose nature of Spark, this still leaves optimizations and resource alloation a challenge for developers. What are the alternatives? I really like Presto and in particular the serverless aws flavor of it (Athena) as well as services like BigQuery. Tools like these are the future of big data ETL and analytics. Spark can still be useful in heavy data transformation scenarios and complex feature engineering, but not as a general data analytics engine and data movement engine. Streaming is one specialized area where solutions like Spark can still play, but there are many other solutions better designed out of the box for streaming and cloud scale-out. Spark has in many respects tried to be all things to all people. It has continuously expanding support for SQL semantics and has incorporating APIs for streaming...etc. This has made Spark a versatile framework and API for developers, but as a general purpose ETL and data analytics engine, I think there are now other options.

While the sun may not completely set on Spark, and tools like it, the declarative power of SQL will win in the end over the imperative programming model of Spark. This has been proven time and time again in the database and analytics tech space.

Tuesday, January 21, 2020

Data Lakes before AI/ML/Analytics (cart before horse thing)


Don't start or continue your AI and predictive analytics journey without building the necessary data infrastructure underpinnings. And that starts, first and foremost, with building a cloud data lake that is designed to meet the data and compute hungry needs of your . Why build a cloud data lake first?

1) Economics
2) Elastic compute
3) Elastic Storage
4) Storing (almost) everything
5) ML Model engineering
6) Feeding downstream analytics
7) Feeding downstream operational data stores
8) Data exploration, experimentation and discovery

A cloud data lake makes all the above possible at scale.

Building a cloud data lake securely and in an architecturally effective manner is achievable and will make your downstream AI/ML/Analytics journey attainable and long-term sustainable. Don't start your journey without this foundation.