What is often left out of many modern data lake and data warehousing solutions are the two lesser-known cornerstones I show in the diagram: Data Lineage and Data Provenance. Without these additional pieces, your data lakes (with their ever-increasing volume and variety of data) can quickly become an unmanageable data dumping ground.
Who needs Data Lineage and Data Provenance management tools, APIs and visualization services?
- Data Engineers (building and managing data pipelines)
- Data Scientists (discovering data & understanding relationships)
- Business/System Analysts (data stewardship)
- Data Lake / BI Executives (bird's eye view of health and sources/destinations)
- Data Ops (managing scale and infrastructure)
- Workflow/ETL Process Operators (monitoring & troubleshooting)
Let's spend a bit of time talking about data lineage and data provenance in particular, and why they are important parts of a modern, healthy overall data architecture. First, let's touch on the broader Data Governance ecosystem.
Data Governance
Data Governance can be an overused term in the industry, and is sometimes all-encompassing when describing the data strategy, best practices and services in an organization. You can read a lot of differing definitions of what data governance is and is not. I take a simple, broad definition of data governance which includes:
- Data Stewardship - org policies for access/permissions/rights
- Data Map - location of data
- MDM - identifying and curating key entities
- Common Definitions - tagging and common terminology
- Taxonomy/Ontology - the relationships between data elements
- Data Quality - accuracy of data
- Compliance - HIPAA, GDPR, PCI DSS...
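To make these pillars concrete, here is a minimal sketch of how they might show up as metadata attached to a single dataset in a catalog. All field names, paths and team names below are hypothetical, not any particular catalog product's schema:

```python
# Hypothetical catalog entry tying one dataset to the governance pillars above.
catalog_entry = {
    "dataset": "s3://corp-datalake/curated/customers/",      # Data Map: where the data lives
    "steward": "crm-data-team",                               # Data Stewardship: who owns access policy
    "master_entity": "Customer",                              # MDM: the key entity being curated
    "tags": ["pii", "customer", "gold"],                      # Common Definitions: shared terminology
    "upstream": ["s3://corp-datalake/raw/crm_exports/"],      # Taxonomy: related data elements
    "quality_checks": ["non_null:customer_id", "unique:email"],  # Data Quality rules
    "compliance": ["GDPR", "PCI DSS"],                        # Applicable regulations
}
```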
Data Control Plane == Data Lineage and Data Provenance
I view data lineage and data provenance as two sides of the same coin. You can't do one well without the other. Just like digital networks have control planes for visualizing data traffic, our data lakes need a Data Control Plane, and one that is independent of whatever ETL technology stack you are using.

In our modern technical age, data volumes are ever increasing and data is being mashed and integrated together from all corners of an organization. Managing this at both a business level and a technical level is a challenge. A lot of ETL tools exist today that let you model your data transformation processes and build data lakes and data warehouses, but they all consistently fall short of giving you a comprehensive, and I would say orthogonal and independent, view of how your data is interconnected: where your data came from, where it is right now, and where it is going (and where it is expected to go next) on its journey through your data fabric and destinations.
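As a rough illustration, here is a minimal Python sketch of what a tool-agnostic lineage/provenance record could capture: where data came from, where it went, and which run moved it. The class and field names are my assumptions for illustration, not any vendor's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    run_id: str        # the pipeline/job run that moved the data
    job_name: str      # the logical step, e.g. "load_orders_to_warehouse"
    inputs: list       # where the data came from
    outputs: list      # where it was written
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# One event per hop gives you both provenance (inputs) and lineage (outputs).
event = LineageEvent(
    run_id="2024-05-01T02:00:00Z-orders",
    job_name="load_orders_to_warehouse",
    inputs=["s3://corp-datalake/raw/orders/"],
    outputs=["s3://corp-datalake/curated/orders/"],
)
```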
Many modern ETL tools let you build visual models and provide some degree of monitoring and tracking, but that tracking is proprietary and can't be separated from the ETL tools themselves, which creates lock-in and prevents you from mixing and matching the best-of-breed ETL tools (Airflow, AWS Step Functions, Lambda, Glue, Spark, EMR, etc.) that are now prevalent in the cloud. If you are using multiple tools and cloud solutions, it gets ever more complicated to maintain a holistic view of your data and its journey through your platform.
This is why I strongly believe that data lineage and data provenance should be completely independent of whatever underlying ETL tooling and data processing technology you are using. If they are not, you are locking yourself in unnecessarily and greatly limiting the potential of your data ops teams and data engineers, as well as your overall management of the data and the processes carrying it through its journey in your data lake.
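What tool independence can look like in practice: every job, whatever stack it runs on, reports its hops to the same external endpoint. The sketch below assumes a hypothetical internal lineage API URL; the point is that the same call works from a Lambda handler, an Airflow task, a Glue job, or a plain Spark driver:

```python
import json
import urllib.request

# Hypothetical control-plane endpoint, independent of any ETL tool.
LINEAGE_API = "https://lineage.example.internal/api/v1/events"

def emit_lineage(job_name, inputs, outputs, run_id):
    """Report one data hop to the external lineage service (auth/retries omitted)."""
    payload = json.dumps({
        "job_name": job_name,
        "inputs": inputs,
        "outputs": outputs,
        "run_id": run_id,
    }).encode("utf-8")
    req = urllib.request.Request(
        LINEAGE_API, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

# For example, from an AWS Lambda handler:
def handler(event, context):
    emit_lineage(
        job_name="enrich_clickstream",
        inputs=["s3://corp-datalake/raw/clickstream/"],
        outputs=["s3://corp-datalake/enriched/clickstream/"],
        run_id=context.aws_request_id,
    )
```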
Data Provenance and Data Lineage are not just fancy words; they are a framework and tool set for managing your data, giving you a historical audit trail, real-time tracing/logging, and a control plane graph of where your data is going and how it is interconnected.
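The "control plane graph" part is just that: a directed graph of datasets built up from the recorded hops, which you can query in both directions. A small sketch using the networkx library, with hypothetical dataset names:

```python
import networkx as nx

# Each recorded hop becomes an edge from an input dataset to an output dataset.
G = nx.DiGraph()
G.add_edge("s3://raw/orders/", "s3://curated/orders/")
G.add_edge("s3://raw/customers/", "s3://curated/customers/")
G.add_edge("s3://curated/orders/", "warehouse.sales_mart")
G.add_edge("s3://curated/customers/", "warehouse.sales_mart")

# Provenance: everything upstream of the sales mart (where its data came from).
print(nx.ancestors(G, "warehouse.sales_mart"))

# Lineage: everything downstream of the raw orders (where that data ends up).
print(nx.descendants(G, "s3://raw/orders/"))
```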
So do not build your data lake without the benefits of a Data Control Plane. Set your data free on its journey while still maintaining visibility, traceability and control.
HDFS is an evolutionary dead end in the tree of big data. Data lakes based on S3 object storage deliver on the promise of separating storage from compute, and make it possible to scale your processing, downstream analytics/AI, and data marts on top of a data lake in an agile, elastic fashion. The HDFS architecture always bugged me when it was first released (besides the fact that it is written in Java). Moving the code to the Hadoop data nodes (with usually only three replicas available, by the way) seemed inherently limiting to me. It was not really better than using big Unix SMP servers, other than that you got to use cheaper commodity hardware and grow incrementally. Good stuff, but not good enough - one step forward and a half step backwards.
While the idea of moving code to the data sounded cool at the time, it is fundamentally a bad data processing design for a truly scalable data lake that allows for rolling up an arbitrary number of ephemeral compute clusters on top of your storage. There is a place for HDFS and traditional Hadoop clusters if you have a big, fixed, slowly evolving, predictable compute/storage environment. For the rest of us, a cloud-based data lake architecture will win in the end, allowing agile development to meet the fast-paced needs of today's downstream BI, analytics, and AI/ML applications that need to sit on top of the mythical data lake.
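A quick sketch of what that separation of storage and compute looks like in practice: an ephemeral Spark cluster (EMR, Glue, or even a laptop) reads straight from object storage, does its work, writes results back, and is torn down; the data never lives on the cluster. Bucket paths here are hypothetical, and the `s3://` scheme assumes EMR/Glue (open-source Spark typically uses `s3a://` with the Hadoop AWS connector):

```python
from pyspark.sql import SparkSession

# Ephemeral compute: spin up, read from object storage, aggregate, write back, tear down.
spark = SparkSession.builder.appName("ephemeral-orders-rollup").getOrCreate()

orders = spark.read.parquet("s3://corp-datalake/curated/orders/")
daily = orders.groupBy("order_date").sum("amount")
daily.write.mode("overwrite").parquet("s3://corp-datalake/marts/daily_orders/")

spark.stop()  # the cluster can now be terminated; the data stays in S3
```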