Of the terms in the diagram, Data Governance is the one discussed most often. It gets tossed around a lot when describing the organizational best practices and technical processes needed for a sound data practice. Data governance can be ambiguous and all-encompassing, and in many cases what exists in organizations falls short of what is needed in our modern big data and data lake oriented world, where ever-increasing volumes of data play an ever more critical role in everything a business does.
What is often missing from modern data lake and data warehousing solutions are the two lesser-known cornerstones I show in the diagram: Data Lineage and Data Provenance. Without these additional pieces, your data lakes (with ever-increasing volumes and variety of data) can quickly become an unmanageable data dumping ground.
Most of the ETL processing and dataflow orchestration tools on the market (open source and commercial), such as NiFi, Airflow, Informatica and Talend, do not directly address this gap. What is the gap? It is knowing where your data is coming from, where it has been and where it is going. Put another way, it is having visibility into the journey your data takes through your data lake and overall data fabric, and getting that visibility in a lightweight fashion, without a lot of complex and expensive commercial tools.
Let's spend a bit of time talking about data lineage and data provenance in particular, and why they are important parts of a modern and healthy overall data architecture. First, let's touch on the broader Data Governance ecosystem.
Data Governance
Data Governance can be an overused term in the industry, and it is sometimes stretched to cover an organization's entire data strategy, best practices and services. You can read a lot of differing definitions of what data governance is and is not. I take a simple, broad definition of data governance, which includes:
- Data Stewardship - org policies for access/permissions/rights
- Data Map - location of data
- MDM - identifying and curating key entities
- Common Definitions - tagging and common terminology
- Taxonomy/Ontology - the relationships between data elements
- Data Quality - accuracy of data
- Compliance - HIPAA, GDPR, PCI DSS...
Data Control Plane == Data Lineage and Data Provenance
I view data lineage and data provenance as two sides of the same coin. You can't do one well without the other. Just as digital networks have a control plane for visualizing data traffic, our data lakes need a Data Control Plane, and one that is independent of whatever ETL technology stack you are using.
In our modern technical age, data volumes are ever increasing, and data is being mashed and integrated together from all corners of an organization. Managing this at both the business level and the technical level is a challenge. A lot of ETL tools exist today that let you model your data transformation processes and build data lakes and data warehouses, but they consistently fall short of giving you a comprehensive (and, I would say, orthogonal and independent) view of how your data is interconnected: where it came from, where it is right now, and where it is expected to go next in its journey through your data fabric and destinations.
Many modern ETL tools let you build visual models and provide some degree of monitoring and tracking, but these features are proprietary and can't be separated from the ETL tools themselves. That creates lock-in and does not allow you to mix and match the best-of-breed ETL tools (Airflow, AWS Step Functions, Lambda, Glue, Spark, EMR, etc.) that are now prevalent in the cloud. If you are using multiple tools and cloud solutions, it gets ever more complicated to have a holistic view of your data and its journey through your platform.
This is why I strongly believe that data lineage and data provenance should be completely independent of the underlying ETL tooling and data processing technology you are using. If they are not, then you are locking yourself in unnecessarily, greatly limiting the potential of your data ops teams and data engineers, and limiting your overall management of the data and of the processes carrying it through its journey in your data lake.
Data Provenance and Data Lineage are not just fancy words; they are a framework and tool set for managing your data and for having a historical audit trail, real-time tracing/logging, and a control plane graph of where your data is going and how it is interconnected.
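To make the idea of an audit trail plus a control plane graph a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not a reference to any existing product: the names `ProvenanceEvent`, `ControlPlane`, `record`, `came_from` and `goes_to` are hypothetical, as are the dataset names. The point is that the record is kept outside the ETL tool, which simply reports each hop.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    """One hop in a dataset's journey (hypothetical schema)."""
    source: str
    target: str
    tool: str  # free-text label for whichever ETL tool reported the hop
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ControlPlane:
    """Tool-independent store: an append-only audit trail plus a lineage graph."""

    def __init__(self):
        self.events = []                      # provenance: historical audit trail
        self._downstream = defaultdict(set)   # lineage: source -> targets
        self._upstream = defaultdict(set)     # lineage: target -> sources

    def record(self, source, target, tool):
        """Log one movement of data and update the lineage graph."""
        event = ProvenanceEvent(source, target, tool)
        self.events.append(event)
        self._downstream[source].add(target)
        self._upstream[target].add(source)
        return event

    def _walk(self, start, edges):
        """Breadth-first traversal returning every node reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def came_from(self, dataset):
        """All datasets that feed into this one, directly or indirectly."""
        return self._walk(dataset, self._upstream)

    def goes_to(self, dataset):
        """All datasets derived from this one, directly or indirectly."""
        return self._walk(dataset, self._downstream)


# Two different tools report hops into the same control plane.
plane = ControlPlane()
plane.record("raw.orders", "clean.orders", tool="glue")
plane.record("clean.orders", "mart.orders_daily", tool="airflow")
print(sorted(plane.came_from("mart.orders_daily")))
print(sorted(plane.goes_to("raw.orders")))
```

Because each job only has to call something like `record`, the same graph can be fed by Airflow, Glue, Spark or anything else, which is the independence argued for above. A real implementation would of course need durable storage and richer metadata than this in-memory sketch.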
So do not build your data lake without a Data Control Plane. Set your data free on its journey while still maintaining visibility, traceability and control.