The foundation of a good data management strategy rests on a number of pillars, including data policies, best practices, and technology/tools, as shown in the diagram above. The most commonly discussed term of the bunch is Data Governance. It gets tossed around a lot when describing the organizational best practices and technical processes needed for a sound data practice. Data governance can often be ambiguous and all-encompassing, and in many cases what exists in organizations falls short of what is needed in our modern big data and data lake oriented world, where ever increasing volumes of data play an ever more critical role in everything a business does.
What is often missing in many modern data lake and data warehousing solutions are the two lesser known cornerstones I show in the diagram: Data Lineage and Data Provenance. Without these additional pieces, your data lakes (with their ever increasing volume and variety of data) can quickly become unmanageable data dumping grounds.
Who needs Data Lineage and Data Provenance management tools, APIs and visualization services?
- Data Engineers (building and managing data pipelines)
- Data Scientists (discovering data & understanding relationships)
- Business/System Analysts (data stewardship)
- Data Lake / BI Executives (bird's eye view of health and sources/destinations)
- Data Ops (managing scale and infrastructure)
- Workflow/ETL Process Operators (monitoring & troubleshooting)
Most of the ETL processing and dataflow orchestration tools on the market (open source and commercial), such as NiFi, Airflow, Informatica, and Talend, do not directly address this gap. What is the gap? It is knowing where your data is coming from, where it has been, and where it is going. Put another way, it is having visibility into the journey your data takes through your data lake and overall data fabric, and doing so in a lightweight fashion without a lot of complex and expensive commercial tools.
Let's spend a bit of time talking about data lineage and data provenance in particular and why they are important parts of a modern and healthy overall data architecture. First, let's touch on the broader Data Governance ecosystem.
Data Governance
Data Governance can be an overused term in the industry and is sometimes all-encompassing when describing the data strategy, best practices, and services in an organization. You can read a lot of differing definitions of what data governance is and is not. I take a simple and broad definition of data governance, which includes:
- Data Stewardship - org policies for access/permissions/rights
- Data Map - location of data
- MDM - identifying and curating key entities
- Common Definitions - tagging and common terminology
- Taxonomy/Ontology - the relationships between data elements
- Data Quality - accuracy of data
- Compliance - HIPAA, GDPR, PCI DSS...
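To make a few of these concrete, here is a minimal sketch (in Python, with made-up field names not tied to any real catalog product) of what a single catalog entry might capture: the dataset's physical location (data map), its steward, its tags (common definitions), and its compliance classification:

```python
# Illustrative only: a tiny catalog entry covering a few governance concepts.
# Field names and values are hypothetical, not tied to any catalog product.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    dataset: str                                          # logical dataset name
    location: str                                         # where it lives (data map)
    steward: str                                          # owning team/person (stewardship)
    tags: List[str] = field(default_factory=list)         # common definitions / terminology
    compliance: List[str] = field(default_factory=list)   # e.g. HIPAA, GDPR, PCI DSS

orders = CatalogEntry(
    dataset="orders",
    location="s3://corp-data-lake/curated/orders/",
    steward="data-platform@example.com",
    tags=["sales", "pii:customer_id"],
    compliance=["GDPR"],
)
print(orders)
```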
These are all important concepts, yet they do not address the dynamic nature of data, the ever more complex journey data takes through modern data lakes, or the relationships between the data models residing in your data lake. It comes back to knowing where your data is going and where it came from. This is where data lineage and data provenance come into play: they complement data governance and allow data engineers and analysts to wrangle and track the run-time dynamics of data as it moves through your systems and gets combined and transformed on its journey to its many destinations.
Data Control Plane == Data Lineage and Data Provenance
I view data lineage and data provenance as two sides of the same coin; you can't do one well without the other. Just as digital networks have control planes for visualizing traffic, our data lakes need a Data Control Plane, and one that is independent of whatever ETL technology stack you are using.
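What might that look like in practice? As a rough sketch (the function, paths, and sink below are hypothetical, and a real deployment would publish to a lineage service or message bus rather than a local file; vendor-neutral specs such as OpenLineage formalize the same idea), a job can emit a small, tool-agnostic lineage event describing what it read and what it wrote, regardless of which engine ran it:

```python
# Sketch of a tool-agnostic lineage event; names and paths are illustrative.
import json
import time
import uuid

def emit_lineage_event(job, inputs, outputs, sink="lineage_events.jsonl"):
    """Append a small lineage event to a local sink.
    In practice the sink would be a lineage service or message bus."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_time": time.time(),
        "job": job,          # the process moving/transforming the data
        "inputs": inputs,    # where the data came from (provenance)
        "outputs": outputs,  # where the data is going (lineage)
    }
    with open(sink, "a") as f:
        f.write(json.dumps(event) + "\n")

# The same call works whether the job ran in Airflow, Glue, Spark, or a Lambda.
emit_lineage_event(
    job="load_orders",
    inputs=["s3://raw-zone/orders/2024-06-01/"],
    outputs=["s3://curated-zone/orders/"],
)
```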
In our modern technical age, data volumes are ever increasing and data is being mashed up and integrated from all corners of an organization. Managing this at both a business level and a technical level is a challenge. A lot of ETL tools exist today that let you model your data transformation processes and build data lakes and data warehouses, but they consistently fall short of giving you a comprehensive, and I would say orthogonal and independent, view of how your data is interconnected: where your data came from, where it is right now, and where it is going (and where it is expected to go next) on its journey through your data fabric and destinations.
Many modern ETL tools let you build visual models and provide some degree of monitoring and tracking, but these capabilities are proprietary and can't be separated from the ETL tools themselves. That creates lock-in and keeps you from mixing and matching the best-of-breed tools (Airflow, AWS Step Functions, Lambda, Glue, Spark, EMR, etc.) that are now prevalent in the cloud. If you are using multiple tools and cloud services, it gets ever more complicated to maintain a holistic view of your data and its journey through your platform.
This is why I strongly believe that data lineage and data provenance should be completely independent of the underlying ETL tooling and data processing technology you are using. If they are not, you are locking yourself in unnecessarily and greatly limiting the potential of your data ops teams and data engineers, as well as your overall management of the data and the processes carrying it through its journey in your data lake.
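One lightweight way to keep that capture orthogonal to the orchestrator, sketched here with illustrative names rather than any real library's API, is to wrap a task function in a decorator that records its declared inputs and outputs no matter which scheduler invokes it:

```python
# Illustrative decorator: records lineage around any task function,
# regardless of whether Airflow, Step Functions, or cron invoked it.
import functools
import json
import time

def track_lineage(inputs, outputs, sink="lineage_events.jsonl"):
    def decorator(task_fn):
        @functools.wraps(task_fn)
        def wrapper(*args, **kwargs):
            started = time.time()
            result = task_fn(*args, **kwargs)   # run the actual transformation
            record = {
                "job": task_fn.__name__,
                "inputs": inputs,
                "outputs": outputs,
                "started": started,
                "finished": time.time(),
            }
            with open(sink, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator

@track_lineage(inputs=["s3://raw-zone/clicks/"],
               outputs=["s3://curated-zone/clicks_daily/"])
def aggregate_clicks():
    ...  # the transformation itself, run by whatever engine you prefer

aggregate_clicks()
```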
Data Provenance and Data Lineage are not just fancy words; they are a framework and tool set for managing your data, giving you a historical audit trail, real-time tracing/logging, and a control plane graph of where your data is going and how it is interconnected.
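As a toy illustration of that control plane graph (the datasets and jobs below are made up), lineage events can be folded into upstream/downstream edges and walked in either direction: upstream for provenance (where did this come from?) and downstream for lineage (where does it go next?):

```python
# Toy lineage graph built from made-up events; walk it in either direction.
from collections import defaultdict

events = [
    {"job": "ingest_orders",  "inputs": ["crm.orders"],      "outputs": ["raw.orders"]},
    {"job": "clean_orders",   "inputs": ["raw.orders"],      "outputs": ["curated.orders"]},
    {"job": "build_sales_dm", "inputs": ["curated.orders"],  "outputs": ["mart.sales"]},
]

upstream, downstream = defaultdict(set), defaultdict(set)
for e in events:
    for src in e["inputs"]:
        for dst in e["outputs"]:
            downstream[src].add(dst)
            upstream[dst].add(src)

def trace(node, edges):
    """Transitively walk the graph from `node` along `edges`."""
    seen, stack = set(), [node]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(trace("mart.sales", upstream))    # provenance: raw.orders, curated.orders, crm.orders
print(trace("crm.orders", downstream))  # lineage: raw.orders, curated.orders, mart.sales
```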
So do not build your data lake without the benefits of a Data Control Plane. Set your data free on its journey while still maintaining visibility, traceability and control.
Is a data lake part of your data warehouse platform or does the data lake sit beside it? There is a fair amount of ambiguity as to what a data lake is and how it should fit into your overall data strategy.
I believe data lakes (coupled with elastic cloud storage and compute) are a game changer in both the DW and BI world. Your data warehousing strategy should be part of the data lake, not the other way around. While you don't have to throw away everything you have done or learned in your traditional ETL and DW world, the fundamentals have changed.
To take advantage of your data and build better BI/analytics, you must build atop a solid data lake foundation. And this goes well beyond the many failed Big Data and Hadoop projects of the recent past that many enterprises have experienced.
While Hadoop was a necessary step forward at the time, it was and is an evolutionary dead end - RIP Hadoop. Cloud data lakes are the future, and they are about more than putting your data into S3 buckets.
Well architected data lakes are the culmination of a succinct data management strategy that leverages the strengths of cloud services and many traditional DW best practices and data governance policies.
HDFS is an evolutionary dead end in the tree of big data. Data lakes based on S3 object storage deliver on the promise of separating storage from compute, making it possible to scale your processing, downstream analytics/AI, and data marts on top of a data lake in an agile and elastic fashion. The HDFS architecture always bugged me when it was first released (besides the fact that it is written in Java). Moving the code to the Hadoop data node (with usually only three replicas available, by the way) always seemed inherently limiting to me. It was not really better than using big Unix SMP servers, other than that you got to use cheaper commodity hardware and grow incrementally. Good stuff, but not good enough - one step forward and a half step backwards.
While the idea of moving code to the data sounded cool at the time, it is fundamentally a bad data processing design for a truly scalable data lake, one that allows for spinning up an arbitrary number of ephemeral compute clusters on top of your storage. There is still a place for HDFS and traditional Hadoop clusters if you have a big, fixed, slowly evolving, predictable compute/storage environment. For the rest of us, a cloud based data lake architecture will win in the end and allow for agile development to meet the fast paced needs of today's downstream BI, analytics and AI/ML applications that need to sit on top of the mythical data lake.
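To make the storage/compute separation concrete, here is a hedged sketch (the bucket and paths are placeholders, the S3A connector and credentials are assumed to be configured, and in practice the session would run on an ephemeral EMR or Kubernetes cluster rather than locally) of a transient Spark job that reads straight from S3, writes its results back to the lake, and then goes away:

```python
# Ephemeral compute over durable object storage; paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ephemeral-lake-job")   # transient cluster/session for one job
    .getOrCreate()
)

# The data stays put in S3; only the compute comes and goes.
orders = spark.read.parquet("s3a://corp-data-lake/curated/orders/")
daily = orders.groupBy("order_date").sum("amount")
daily.write.mode("overwrite").parquet("s3a://corp-data-lake/marts/daily_sales/")

spark.stop()  # tear the compute down; storage and lineage outlive it
```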