What is often left out of many modern data lake and data warehousing solutions are the two lesser-known cornerstones I show in the diagram: Data Lineage and Data Provenance. Without these additional pieces, your data lakes (with their ever-increasing volume and variety of data) can quickly become an unmanageable data dumping ground.
Who needs Data Lineage and Data Provenance management tools, APIs and visualization services?
- Data Engineers (building and managing data pipelines)
- Data Scientists (discovering data & understanding relationships)
- Business/System Analysts (data stewardship)
- Data Lake / BI Executives (bird's eye view of health and sources/destinations)
- Data Ops (managing scale and infrastructure)
- Workflow/ETL Process Operators (monitoring & troubleshooting)
Let's spend a bit of time talking about data lineage and data provenance in particular, and why they are important parts of a modern, healthy overall data architecture. First, let's touch on the broader Data Governance ecosystem.
Data Governance
Data Governance can be an overused term in the industry, and is sometimes all-encompassing when describing the data strategy, best practices and services in an organization. You can read a lot of differing definitions of what data governance is and is not. I take a simple, broad definition of data governance which includes:
- Data Stewardship - org policies for access/permissions/rights
- Data Map - location of data
- MDM - identifying and curating key entities
- Common Definitions - tagging and common terminology
- Taxonomy/Ontology - the relationships between data elements
- Data Quality - accuracy of data
- Compliance - HIPAA, GDPR, PCI DSS...
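To make these pillars concrete, here is a minimal sketch of how they might show up as metadata attached to a single dataset in a catalog. All field names, paths and team names below are hypothetical, not any particular catalog product's schema:

```python
# Hypothetical catalog entry tying one dataset to the governance pillars above.
catalog_entry = {
    "dataset": "s3://corp-datalake/curated/customers/",      # Data Map: where the data lives
    "steward": "crm-data-team",                               # Data Stewardship: who owns access policy
    "master_entity": "Customer",                              # MDM: the key entity being curated
    "tags": ["pii", "customer", "gold"],                      # Common Definitions: shared terminology
    "upstream": ["s3://corp-datalake/raw/crm_exports/"],      # Taxonomy: related data elements
    "quality_checks": ["non_null:customer_id", "unique:email"],  # Data Quality rules
    "compliance": ["GDPR", "PCI DSS"],                        # Applicable regulations
}
```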
Data Control Plane == Data Lineage and Data Provenance
I view data lineage and data provenance as two sides of the same coin. You can't do one well without the other. Just like digital networks have control planes for visualizing data traffic, our data lakes need a Data Control Plane, and one that is independent of whatever ETL technology stack you are using.

In our modern technical age, data volumes are ever increasing and data is being mashed and integrated together from all corners of an organization. Managing this at both a business level and a technical level is a challenge. A lot of ETL tools exist today that let you model your data transformation processes and build data lakes and data warehouses, but they all consistently fall short of giving you a comprehensive, and I would say orthogonal and independent, view of how your data is interconnected: where your data came from, where it is right now, and where it is going (and where it is expected to go next) on its journey through your data fabric and destinations.
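As a rough illustration, here is a minimal Python sketch of what a tool-agnostic lineage/provenance record could capture: where data came from, where it went, and which run moved it. The class and field names are my assumptions for illustration, not any vendor's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    run_id: str        # the pipeline/job run that moved the data
    job_name: str      # the logical step, e.g. "load_orders_to_warehouse"
    inputs: list       # where the data came from
    outputs: list      # where it was written
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# One event per hop gives you both provenance (inputs) and lineage (outputs).
event = LineageEvent(
    run_id="2024-05-01T02:00:00Z-orders",
    job_name="load_orders_to_warehouse",
    inputs=["s3://corp-datalake/raw/orders/"],
    outputs=["s3://corp-datalake/curated/orders/"],
)
```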
Many modern ETL tools let you build visual models and provide some degree of monitoring and tracking, but that tracking is proprietary and can't be separated from the ETL tools themselves, which creates lock-in and prevents you from mixing and matching the best-of-breed ETL tools (Airflow, AWS Step Functions, Lambda, Glue, Spark, EMR, etc.) that are now prevalent in the cloud. If you are using multiple tools and cloud solutions, it gets ever more complicated to maintain a holistic view of your data and its journey through your platform.
This is why I strongly believe that data lineage and data provenance should be completely independent of whatever underlying ETL tooling and data processing technology you are using. If they are not, you are locking yourself in unnecessarily and greatly limiting the potential of your data ops teams and data engineers, as well as your overall management of the data and the processes carrying it through its journey in your data lake.
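What tool independence can look like in practice: every job, whatever stack it runs on, reports its hops to the same external endpoint. The sketch below assumes a hypothetical internal lineage API URL; the point is that the same call works from a Lambda handler, an Airflow task, a Glue job, or a plain Spark driver:

```python
import json
import urllib.request

# Hypothetical control-plane endpoint, independent of any ETL tool.
LINEAGE_API = "https://lineage.example.internal/api/v1/events"

def emit_lineage(job_name, inputs, outputs, run_id):
    """Report one data hop to the external lineage service (auth/retries omitted)."""
    payload = json.dumps({
        "job_name": job_name,
        "inputs": inputs,
        "outputs": outputs,
        "run_id": run_id,
    }).encode("utf-8")
    req = urllib.request.Request(
        LINEAGE_API, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

# For example, from an AWS Lambda handler:
def handler(event, context):
    emit_lineage(
        job_name="enrich_clickstream",
        inputs=["s3://corp-datalake/raw/clickstream/"],
        outputs=["s3://corp-datalake/enriched/clickstream/"],
        run_id=context.aws_request_id,
    )
```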
Data Provenance and Data Lineage are not just fancy words; they are a framework and tool set for managing your data, giving you a historical audit trail, real-time tracing/logging, and a control plane graph of where your data is going and how it is interconnected.
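The "control plane graph" part is just that: a directed graph of datasets built up from the recorded hops, which you can query in both directions. A small sketch using the networkx library, with hypothetical dataset names:

```python
import networkx as nx

# Each recorded hop becomes an edge from an input dataset to an output dataset.
G = nx.DiGraph()
G.add_edge("s3://raw/orders/", "s3://curated/orders/")
G.add_edge("s3://raw/customers/", "s3://curated/customers/")
G.add_edge("s3://curated/orders/", "warehouse.sales_mart")
G.add_edge("s3://curated/customers/", "warehouse.sales_mart")

# Provenance: everything upstream of the sales mart (where its data came from).
print(nx.ancestors(G, "warehouse.sales_mart"))

# Lineage: everything downstream of the raw orders (where that data ends up).
print(nx.descendants(G, "s3://raw/orders/"))
```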
So do not build your data lake without the benefits of a Data Control Plane. Set your data free on its journey while still maintaining visibility, traceability and control.
HDFS is an evolutionary dead end in the tree of big data. Data lakes based on S3 object storage deliver on the promise of separating storage from compute, and make it possible to scale your processing, downstream analytics/AI, and data marts on top of a data lake in an agile, elastic fashion. The HDFS architecture always bugged me when it was first released (besides the fact that it is written in Java). Moving the code to the Hadoop data nodes (with usually only three replicas available, by the way) seemed inherently limiting to me. It was not really better than using big Unix SMP servers, other than that you got to use cheaper commodity hardware and grow incrementally. Good stuff, but not good enough - one step forward and a half step backwards.
While the idea of moving code to the data sounded cool at the time, it is fundamentally a bad data processing design for a truly scalable data lake that allows for rolling up an arbitrary number of ephemeral compute clusters on top of your storage. There is a place for HDFS and traditional Hadoop clusters if you have a big, fixed, slowly evolving, predictable compute/storage environment. For the rest of us, a cloud-based data lake architecture will win in the end, allowing agile development to meet the fast-paced needs of today's downstream BI, analytics, and AI/ML applications that need to sit on top of the mythical data lake.
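A quick sketch of what that separation of storage and compute looks like in practice: an ephemeral Spark cluster (EMR, Glue, or even a laptop) reads straight from object storage, does its work, writes results back, and is torn down; the data never lives on the cluster. Bucket paths here are hypothetical, and the `s3://` scheme assumes EMR/Glue (open-source Spark typically uses `s3a://` with the Hadoop AWS connector):

```python
from pyspark.sql import SparkSession

# Ephemeral compute: spin up, read from object storage, aggregate, write back, tear down.
spark = SparkSession.builder.appName("ephemeral-orders-rollup").getOrCreate()

orders = spark.read.parquet("s3://corp-datalake/curated/orders/")
daily = orders.groupBy("order_date").sum("amount")
daily.write.mode("overwrite").parquet("s3://corp-datalake/marts/daily_orders/")

spark.stop()  # the cluster can now be terminated; the data stays in S3
```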