Wednesday, August 12, 2020

The Lost Art of Data Lineage


 
Maybe it is more of a mangled and ill-defined art than a lost art. Data lineage is one of the aspects of data governance that gets lost in the shuffle of data analytics and data warehousing/lake projects. It is vital for many reasons, the least of them being compliance and auditing. Oftentimes data lineage never makes it onto the train to the final destination when building large-scale data warehousing and analytics solutions.

Part of the problem is that it gets turned into an all-or-nothing effort, which leads to very little data lineage actually getting delivered, or it gets turned into a mishmash of concepts and solution features. Often it gets dumped under auditing and logging, with no distinction between business and technical metadata or between real-time and historical views, and then forgotten.

Let's quickly break down what data lineage actually entails. There are three dimensions of data lineage to consider as part of a general data governance strategy. These include:

1) Logical Data Processing Flows (logical and/or visual DAG representation)
Defining the high-level visual graph and the code-module-level relationships between the data processing stages and steps that produce data models.

2) Metadata Relationship Management (low level data relationships and logic/code)
Tracking metadata relationships between data models (schemas/tables/columns) and the related source code used in the transformation. This includes showing the transformation logic/code used in going from one or multiple source data models to a target data model.

3) Physical Data Processing History (what happened, is happening, and is going to happen)
At the data set level (sets of records, rows, and columns), this shows the record-level and data-set-level linkages between data sets: those that happened in the past, are happening now, or will happen in the future. There is a temporal (real-time and historical) aspect to this, since it shows the transaction events from one or multiple data sets feeding a downstream data set, along with the transactional breadcrumbs involved.
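
To make the first two dimensions a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a reference design: the step names, table/column names, and the helpers `execution_order`, `upstream_steps`, and `sources_of` are all hypothetical. It assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

# Dimension 1: the logical processing flow as a DAG.
# Each key is a processing step; its value is the set of steps it depends on.
# All step names here are hypothetical.
FLOW = {
    "stage_orders": set(),
    "stage_customers": set(),
    "build_order_facts": {"stage_orders", "stage_customers"},
    "build_revenue_report": {"build_order_facts"},
}

def execution_order(flow):
    """Return one valid upstream-to-downstream ordering of the steps."""
    return list(TopologicalSorter(flow).static_order())

def upstream_steps(flow, step):
    """Return every step that directly or indirectly feeds `step`."""
    seen = set()
    stack = list(flow[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(flow[current])
    return seen

# Dimension 2: metadata relationships between data models,
# including the transformation logic behind each target column.
@dataclass(frozen=True)
class ColumnMapping:
    target: str     # target column, written as "table.column"
    sources: tuple  # source columns the target is derived from
    logic: str      # transformation logic/code (here an SQL-like expression)

# Hypothetical mappings for a hypothetical order_facts table.
MAPPINGS = [
    ColumnMapping("order_facts.revenue",
                  ("orders.quantity", "orders.unit_price"),
                  "quantity * unit_price"),
    ColumnMapping("order_facts.customer_region",
                  ("customers.region",),
                  "UPPER(region)"),
]

def sources_of(mappings, target):
    """Return the mappings that describe how `target` is produced."""
    return [m for m in mappings if m.target == target]
```

Calling `upstream_steps(FLOW, "build_revenue_report")` walks the graph back to both staging steps, while `sources_of(MAPPINGS, "order_facts.revenue")` answers the lower-level question of which columns and which logic produced a given column. Keeping the two structures separate mirrors the advice to keep the dimensions distinct.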

Note that the term data model denotes a static structure (more or less your schema/table definitions), while data sets are live physical structures (the actual data at a record level) across one or multiple data models.
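
The data model vs. data set distinction, and the temporal nature of physical processing history, can be sketched roughly as follows. This is a simplified illustration under assumptions: the `DataSet` and `LineageEvent` classes, the status values, the batch identifiers, and the `trace_back` helper are hypothetical names invented for the example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class DataSet:
    model: str     # the static data model (table definition) the records conform to
    batch_id: str  # identifies one physical load of records for that model

@dataclass(frozen=True)
class LineageEvent:
    step: str      # processing step that ran (or will run)
    inputs: tuple  # upstream data sets consumed
    output: DataSet
    at: datetime   # when the event happened or is scheduled to happen
    status: str    # assumed values: "completed", "running", "scheduled"

# Two physical data sets that share one data model ("orders").
orders_0811 = DataSet("orders", "2020-08-11")
orders_0812 = DataSet("orders", "2020-08-12")
facts_0812 = DataSet("order_facts", "2020-08-12")
report_0813 = DataSet("revenue_report", "2020-08-13")

# Hypothetical history: one past event and one future event.
EVENTS = [
    LineageEvent("build_order_facts", (orders_0811, orders_0812), facts_0812,
                 datetime(2020, 8, 12, 2, 0), "completed"),
    LineageEvent("build_revenue_report", (facts_0812,), report_0813,
                 datetime(2020, 8, 13, 2, 0), "scheduled"),
]

def trace_back(events, dataset):
    """Return every upstream data set that feeds `dataset`, directly or indirectly."""
    producers = {event.output: event for event in events}
    found = set()
    stack = [dataset]
    while stack:
        event = producers.get(stack.pop())
        if event is None:
            continue  # a source data set with no recorded producer
        for source in event.inputs:
            if source not in found:
                found.add(source)
                stack.append(source)
    return found
```

Here `trace_back(EVENTS, report_0813)` returns the fact batch plus both order batches, which is a question about physical data sets rather than about the `orders` model itself. The `status` and `at` fields carry the "happened, is happening, is going to happen" aspect.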

Before you get started on your data lineage journey, you need to decide to what extent you will implement one or all of these dimensions of data lineage as part of your overall data governance strategy. Each of them can also be implemented to varying degrees of exactness and completeness. And make sure to keep them distinct.

No one commercial tool will do the complete job. Depending on your criteria for success, it usually takes a combination of multiple tools, hand-stitched software services, and best practices/conventions to do the job well.