The Ultimate Guide To Data Lineage
Table of Contents
What Is Metadata?
Introduction
New data flows throughout your organization every minute of every day. With the high volume of data you’re collecting, you need to be able to act and make informed decisions, and quickly.
But today’s data systems are deeply complex, and this complexity creates blind spots in the data environment. With limited visibility, you have limited control. How can you overcome these complexities and get the full picture of your data landscape?
Data lineage combats complexity in your data environment by providing a complete picture of your data landscape, allowing you to tame data chaos and optimize data for valuable insights.
In our ultimate guide to data lineage, we will discuss:
• What data lineage is and why it matters in a variety of industries
• How to activate metadata
• How to create data lineage
• What to look for in a data lineage solution
• Why Manta should be part of your data toolkit
www.manta.io 2
Data management has undergone a massive transformation in the past decade. Data infrastructure is constantly growing in complexity, evolving into data ecosystems with thousands of components all aimed at one goal: to derive more value from data.
These data blind spots and complexities can lead to the following challenges:
Slower Delivery of Predictive Insights
1 MIT Technology Review; Getting the most from your data-driven transformation: 10 key principles; Janice Zdankus;
Anthony Delli Colli; 14 Oct. 2021
2 Seagate; ‘Rethink Data’ Report Reveals That 68% Of Data Available To Businesses Goes Unleveraged; 15 July 2020
3 Statista; Number of data records exposed worldwide from 1st quarter 2020 to 3rd quarter 2022
Current data observability tools are still primarily reactive, meaning organizations find bugs after there has already been an incident, rather than preventing them. This is concerning because even a single data incident can cause severe damage to your organization. IBM reports a single data breach costs an average of $4.3 million.4
Automated lineage also puts an end to the costly, lengthy, manual processes of lineage collection and
Manta customers have saved between $5-15M in the initial phase of implementation alone.
Shortage of Data Engineering Talent
In April of 2022, the Digital Services Act threatened Big Tech with a 6% revenue penalty for having illegal
4 IBM; Cost of a data breach 2022: A million-dollar race to detect and respond
5 Gartner; The State of Data and Analytics Governance Is Worse Than You Think; Saul Judah, Andrew White; 19 June 2020
What Is Metadata?
You may have heard metadata described simply as data about data, essentially being information that describes or catalogs what the data is.
As defined by Dr. Irina Steenbeek: The terms “data” and “metadata” have a complicated relationship. Data is the physical or electronic representation of facts or signals “in a manner suitable for communication, interpretation, or processing by human beings or by automatic means.”
Metadata puts data in a particular context. A data model is an example of metadata. For example, if you had a dataset with customer financial information, you couldn’t use the dataset without a detailed description. The description of a dataset is metadata that defines data and explains the context in which data can be used.
The challenging aspect of defining metadata is that the same data can be recognized as either data or metadata, depending on the context. For example, data models are metadata for business users. For data modelers, on the other hand, data models can be considered data that will in turn require other metadata to describe data models. Different sources contain different approaches to classifying metadata.
Data lineage is crucial for understanding and utilizing dynamic metadata—how it moves across the systems,
where it originated, how it’s interconnected, and how it transforms.
• We query data to get answers to our questions. That is the very basic use case
we usually think of first. It can be good old pre-built reports, ad-hoc queries, or
smart AI/ML algorithms digging insights from the data we have. By performing
queries, we turn data into information and eventually knowledge.
• We embed data into places where people or machines naturally need them.
We do not force a sales representative to log into our reporting platform and
write ad-hoc SQL queries or use pre-built reports. No. Rather, we prepare all the
data they may need about a prospect or a customer, turn it into information, and
deliver it to their workspace (as a dashboard in an application like the CRM they
use daily). On top of that, we also enrich internal data with valuable external data
to provide an even more complex view of the customer.
• We also use data to automate tasks and processes. Instead of waiting for the
sales representative to open their workspace and search for customers who may
be a good fit for a new product offering, we have an algorithm running in the
background that scores existing customers and sends proactive notifications
that suggest who to call and what (or even how) to offer. Or, for an even simpler
example, we automatically send a reminder to a sales representative in case they
take no action (even if they should).
Obviously, there are more ways that we interact with data; the above are the most traditional examples. The first “query” case represents a very “static” experience. Everything is sitting in a silo (e.g., a data warehouse or data lake), and we expect people to come, find what they need, and ask questions they need to ask. Do not get me wrong - it is awesome for some use cases and, when compared to a case with no data available, a huge jump forward.
However, we see that data is put to much better use in the other two examples, actively supporting users with limited data engineering skills and dramatically increasing their productivity. Compared to the first example, the latter are more “active” and thus more useful and accessible to a broader audience. And that is what we want to achieve with active metadata too.
Gartner’s definition of metadata in their most recent Market Guide for Active Metadata Management touches on several key aspects of metadata:
• Continuous access - metadata is continuously collected. It is not something you do once per month or once per year, as we want to collect every change and every signal and respond to it.
• Connecting dots - metadata is not just collected; it is constantly processed to distill information (and knowledge) from all the signals and noise. And with the right feedback loop, your system gets smarter over time, collecting and learning.
• Actionable - all the intelligence and insights derived from metadata are not locked into a silo, but rather delivered in the form of recommendations, warnings, and notifications to humans and systems/applications that may need it.
• Embedded - actionable information/knowledge is integrated into processes humans and machines perform, embedded into their workspace. People are not forced to go in and look for the insights. Instead, active metadata comes to them, when and where they need it.
Merely collecting and cataloging metadata fails to maximize its potential. This is one reason why metadata has not historically been practical. We’ve spent decades focused on metadata collection, but data sitting in a repository is not useful and data without understanding can be a liability. This is changing now with the focus on active metadata.
According to Gartner, active metadata capabilities will expand to include monitoring, evaluating, recommending design changes, and orchestrating processes in third-party data management solutions.6
Gartner also predicts that organizations that adopt aggressive metadata analysis across their complete data management environment will decrease the time to delivery of new data assets to users by as much as 70%.7
By activating existing metadata with automation and intelligence, data lineage can provide the needed visibility and control to help you become more aware of your data management and proactive in your data usage.
This approach can empower you with:
• Continuous detection of “dead tables” where potentially sensitive information is stored but not accessed or used
• Instant alerts if a change negatively impacts tactical management reports or key data features used by the data science team
• Notifications about overly complex parts of data pipelines where refactoring or redesign would help to reduce the risk of failure
• Warnings if the design of a data pipeline moves data between locations where no data should ever be
As you can see with these examples, data lineage allows you to leverage metadata to better manage, utilize, and optimize your data, even when working with overly complex data environments. That can lead to immediate and long-term business benefits.
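As an illustration of the “dead tables” case, the sketch below flags sensitive tables that have no downstream consumers in the lineage and have not been accessed for a long time. The table names, the `last_accessed` and `sensitive` fields, and the one-year threshold are assumptions made for this example, not part of any particular product.

```python
from datetime import date, timedelta

# Hypothetical metadata: last access date and a sensitivity flag per table.
tables = {
    "crm.customers": {"last_accessed": date(2024, 5, 30), "sensitive": True},
    "legacy.cards_backup": {"last_accessed": date(2022, 1, 10), "sensitive": True},
    "tmp.scratch": {"last_accessed": date(2021, 7, 1), "sensitive": False},
}

# Hypothetical lineage edges (source, target).
edges = [("crm.customers", "reports.churn")]

def dead_tables(tables, edges, today, max_idle_days=365):
    """Return sensitive tables with no downstream consumers and no recent access."""
    has_consumers = {src for src, _ in edges}
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, info in tables.items()
        if info["sensitive"]
        and name not in has_consumers
        and info["last_accessed"] < cutoff
    )

alerts = dead_tables(tables, edges, today=date(2024, 6, 1))
print(alerts)
```

Run on a schedule, a check like this turns passively stored metadata into a notification, which is the “actionable” quality described above.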
6 Gartner, Market Guide for Active Metadata Management; Guido De Simoni, Mark Beyer; 14 Nov 2022
7 Ibid
Automated Impact Analysis for Improved Incident Prevention
In business, every decision contributes to the bottom line. That’s why impact analysis is crucial: it predicts the consequences of a decision. How will one decision affect customers? Stakeholders? Sales?
Data lineage helps during these investigations. Because lineage creates an environment where reports and data can be trusted, teams can make more informed decisions. Data lineage provides that reliability, and more.
One often-overlooked area of impact analysis is IT resilience. This blind spot became apparent in March of 2021 when CNA Financial was hit by a ransomware attack that caused widespread network disruption. The company’s email was hacked, consumers panicked, and CNA Financial was forced to pay a record-breaking $40 million in ransom.8 This is where lineage-supported
Greater Data Pipeline Observability for Faster Incident Resolution
As discussed above, there are countless threats to your organization’s bottom line. Whether it is a successful ransomware attack or a poorly planned cloud migration, catching the problem before it can wreak havoc is always less expensive.
That’s why data pipeline observability is so important. It not only protects your organization but also your customers.
Data lineage expands the scope of your data observability to include data processing infrastructure or data pipelines, in addition to the data itself. With this expanded observability, incidents can be prevented in the design phase or identified in the implementation and testing phase to reduce maintenance costs and achieve higher productivity.
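Impact analysis over lineage can be pictured as a graph traversal: starting from the object you plan to change, follow the edges downstream to find everything that depends on it. The sketch below is a minimal illustration under that assumption; the object names and the edge list are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source object, object that reads from it).
edges = [
    ("src.orders", "stg.orders"),
    ("stg.orders", "dwh.sales"),
    ("dwh.sales", "report.revenue"),
    ("dwh.sales", "ml.churn_features"),
    ("src.customers", "dwh.customers"),
]

def reachable(edges, start, reverse=False):
    """Breadth-first walk of the lineage graph from `start`.

    With reverse=False it returns everything downstream (impact);
    with reverse=True it returns everything upstream (origin).
    """
    graph = defaultdict(list)
    for src, dst in edges:
        if reverse:
            src, dst = dst, src
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

impacted = reachable(edges, "stg.orders")
origins = reachable(edges, "report.revenue", reverse=True)
print(sorted(impacted))
print(sorted(origins))
```

The same walk over reversed edges gives the upstream view, which is useful during incident resolution when you need to trace an erroneous report back to its sources.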
8 Bloomberg; CNA Financial Paid $40 Million in Ransom After March Cyberattack; Kartikay Mehrotra and William Turton; 20
May 2021
9 McKinsey Digital; IT resilience for the digital age; Arun Gundurao, Jorge Machado, Rut Patel, and Yanwing Wong; 11 May
2021
Improved Regulatory Compliance
Depending on your industry, you have to ensure you’re in compliance with a host of regulatory bodies and policies: BASEL, HIPAA, GDPR, CCPA/CPRA, and CCAR, just to name a few.
All of these regulations require accurate tracking of data. Your organization must be able to answer:
1. Where does the data come from?
2. How did the data get there?
3. Are we capable of proving it with up-to-date evidence whenever necessary?
4. Do we need weeks or months to complete a report?
5. Is that report even entirely reliable?
Data lineage helps you answer these questions by creating highly detailed visualizations of your data flows. You can use these reports to accurately track and report your data to ensure regulatory compliance.
Faster and More Efficient Migrations
McKinsey predicts that $8 out of every $10 for IT hosting will go toward the cloud by 2024.10 However, if you have ever been involved in the migration of a data system, you know how complex the process is.
Approximately $100 billion of cloud funding is expected to be wasted over the next three years, and most enterprises cite the costs around migration as a major inhibitor to adopting the cloud.11 The process is so complex (and expensive) because every system consists of thousands or millions of interconnected parts, and it is impossible to migrate everything in a single step.
Dividing the system into smaller chunks of objects (reports, tables, workflows, etc.) can make it more manageable, but poses another challenge: how to migrate one part without breaking another. How do you know what pieces can be grouped to minimize the number of external dependencies?
With data lineage, every object in the migrated system is mapped and dependencies are documented.
Manta customers have used data lineage to complete their migration projects 40% faster with 30% fewer resources.
Retention of Data Engineering Talent
Data engineers, developers, and data scientists continue to be fast-growing and hard-to-fill roles in tech. The shortage of data engineering talent has ballooned from a problem to a crisis, made worse by the increasing complexity of data systems. The last thing you want is to continually overstretch your valuable data engineers with routine, manual (and frustrating) tasks like chasing data incidents, assessing the impacts of planned changes, or answering the same questions about the origins of data records again and again.
Data lineage can help to automate routine tasks and enable self-service wherever possible, allowing data scientists and other stakeholders to retrieve up-to-date lineage and data origin information on their own, whenever they need it. A detailed data lineage map also enables faster onboarding of data engineers to integrate new or less-experienced engineers into the role without impacting the stability and reliability of the data environment.
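The migration question raised above (which pieces can be grouped to minimize external dependencies?) can be sketched with a documented dependency list. The example below is one simple way to approach it, not a description of any vendor’s method: it treats lineage as an undirected graph, takes each connected component as a candidate migration group, and counts the edges that cross the boundary of any proposed group. All object names are hypothetical.

```python
from collections import defaultdict

# Hypothetical dependencies between tables, workflows, and reports.
edges = [
    ("tbl.orders", "wf.load_sales"),
    ("wf.load_sales", "rpt.revenue"),
    ("tbl.hr_staff", "rpt.headcount"),
]

def migration_groups(edges):
    """Group objects into weakly connected components (no edges between groups)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    groups, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

def external_dependencies(edges, group):
    """Edges that cross the boundary of a candidate group."""
    group = set(group)
    return [(a, b) for a, b in edges if (a in group) != (b in group)]

groups = migration_groups(edges)
crossing = external_dependencies(edges, ["tbl.orders", "wf.load_sales"])
print(groups)
print(crossing)
```

In a real environment the components are rarely this clean, so the crossing-edge count is the more practical measure: the fewer edges that cross a group boundary, the less likely migrating that group is to break something outside it.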
10 McKinsey; Cloud-migration opportunity: Business value grows, but missteps abound; Tara Balakrishnan, Chandra
Gnanasambandam, Leandro Santos, and Bhargs Srivathsan; 12 Oct 2021
11 Ibid
Data governance is a clear priority in almost every organization, regardless of industry. In one survey, 60% of respondents planned to spend more than $49K on data governance technology and tools in the next one to two years.12
One of the most critical processes for every business, regardless of size, is change management. Organizations face a variety of change management challenges or obstacles, including:
12 Zaloni & Dataversity; The 2022 State of Cloud Data Governance; Michelle P. Knight, Annie Bishop
In activating this metadata, data lineage creates a map to understand its movements, connections, and dependencies.
1. Data as a source for pattern-based lineage
2. Logs as a source for run-time lineage
3. Code as a source for design lineage
Advantages
• It’s the best approach for identifying manual data flows happening outside of the system, like copying data to a flash drive, modifying it on another computer, or storing it on a different part of the system.
• You don’t have to worry about the integration of different system technologies because you’re watching the data as the source, rather than algorithms.
Disadvantages
• You may miss important details. Because you’re only watching data, this lineage is limited to the database. You’re not seeing the application side of your environment or the so-called “transformation logic” of how and where data is being modified.
• The approach is not always accurate. The impact on performance can be significant, and data privacy is at risk.
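One way to picture lineage that uses data as the source is to compare the values stored in different columns and treat strong overlap as a hint that data flows between them. The sketch below assumes that interpretation; the column names, the sample values, and the 0.7 threshold are hypothetical, and Jaccard similarity is just one possible overlap measure.

```python
# Hypothetical column samples profiled from two systems; no code or logs are inspected.
columns = {
    "crm.customer.email": {"a@x.com", "b@y.org", "c@z.net"},
    "dwh.dim_customer.email_addr": {"a@x.com", "b@y.org", "c@z.net", "d@w.io"},
    "dwh.dim_product.sku": {"P-1", "P-2"},
}

def jaccard(a, b):
    """Share of distinct values the two columns have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_links(columns, threshold=0.7):
    """Suggest column pairs whose values overlap enough to hint at a data flow."""
    names = sorted(columns)
    return [
        (left, right, round(jaccard(columns[left], columns[right]), 2))
        for i, left in enumerate(names)
        for right in names[i + 1:]
        if jaccard(columns[left], columns[right]) >= threshold
    ]

links = candidate_links(columns)
print(links)
```

The result says nothing about direction or transformation logic, and it requires reading actual values, which illustrates the accuracy, performance, and privacy drawbacks listed above.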
This technique relies on run-time information extracted from the data environment—log files, execution workflows
exported by ETL/ELT tools, or any other source with sufficient run-time details. Some data processing engines
use a trick called data tagging, where each piece of data being moved or transformed is tagged or labeled by a
transformation engine, which then tracks that label all the way from start to finish.
Advantages
• It has an operational nature, which is valuable for incident resolution because it provides accurate information about the flow of a specific data element that has been identified as erroneous.
• It considers different technologies in the data stack (unlike pattern-based lineage), as the format and structure of the logging information vary significantly.
Disadvantages
• Inaccurate data lineage. Run-time lineage only captures information about recently executed data flows and may fail to capture data calculations and scenarios that are not executed equally or with the same frequency. This can lead to inaccurate or inconsistent lineage, as some parts are either missing or are no longer valid.
• The absence of transformation details. Not everything is or can be logged, especially in the case of more complex algorithms or processing done outside the database/ETL/ELT world. As a result, run-time lineage can often capture only very high-level and generic table-to-table mappings.
Blindly using such metadata poses a big risk for an organization. If used by a data engineer to run impact
analysis, it leads to a high probability of incidents when designing and implementing changes in the system and
new requirements. If used by a risk analyst to prepare a regulatory report, it leads to inaccuracies in the report
and increased risk of (public) incidents and penalties. If used by a data scientist to analyze and prepare data to
train a new model, it leads to inherent inequality encoded into the AI/ML algorithm.
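A small sketch can show both what run-time lineage captures and what it misses. The log format below is invented for illustration (real ETL/ELT tools each have their own structure); the function extracts table-to-table edges only from jobs that actually ran successfully.

```python
import re

# Hypothetical log format; field names and layout are assumptions for this example.
log_lines = [
    "2024-06-01T02:00:05 job=load_sales status=OK read=stg.orders write=dwh.sales",
    "2024-06-01T02:05:10 job=build_report status=OK read=dwh.sales write=rpt.revenue",
    "2024-06-01T02:07:00 job=cleanup status=FAILED read=tmp.x write=tmp.y",
]

LINE = re.compile(
    r"job=(?P<job>\S+) status=(?P<status>\S+) read=(?P<src>\S+) write=(?P<dst>\S+)"
)

def runtime_lineage(lines):
    """Extract (source, target, job) edges from successfully executed jobs only."""
    edges = []
    for line in lines:
        match = LINE.search(line)
        if match and match["status"] == "OK":
            edges.append((match["src"], match["dst"], match["job"]))
    return edges

edges = runtime_lineage(log_lines)
print(edges)
```

The output contains only table-level mappings for flows present in the log window: a quarterly job that did not run, or the calculation applied inside `load_sales`, would not appear, which is the gap described in the disadvantages above.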
This technique looks directly into the code that processes and transforms data records to identify data flows.
This is “code” in the broadest sense—such as an SQL script, a PL/SQL stored procedure, an ETL/ELT workflow
encoded in a proprietary XML format, a macro in an Excel spreadsheet, a mapping between a field in a report
and a database column or table, a Java API, a Kafka stream definition, an XSLT transformation, or a Python
algorithm in a Jupyter notebook.
Advantages
• The variety of code. The functionality to work with this variety gives design lineage the advantage as the best approach for gaining detailed visibility into your data environment to identify and eliminate data blind spots.
Disadvantages
• The variety of code. It’s a challenge because parsing and reverse engineering the code is much tougher than parsing log files, and it requires specialized scanners for all supported technologies.
These advantages make design lineage the preferred approach for the most successful vendors and
organizations.
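To give a feel for what reading lineage out of code means, here is a deliberately simplified sketch that extracts table-level lineage from a single `INSERT ... SELECT` statement using regular expressions. The SQL and table names are hypothetical, and a regular expression is a stand-in chosen for brevity: a real scanner would need a full parser for each SQL dialect and technology.

```python
import re

# Illustrative SQL script (hypothetical tables and columns).
sql = """
INSERT INTO dwh.sales (order_id, amount)
SELECT o.id, o.amount * r.rate
FROM stg.orders o
JOIN ref.fx_rates r ON o.currency = r.currency
WHERE o.status = 'PAID';
"""

TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def design_lineage(statement):
    """Return (source, target) table pairs for one INSERT ... SELECT statement."""
    target = TARGET.search(statement)
    if target is None:
        return []
    sources = sorted(set(SOURCE.findall(statement)))
    return [(source, target.group(1)) for source in sources]

edges = design_lineage(sql)
print(edges)
```

Even this toy version sees things a log would not, such as the exchange-rate table feeding the amount, but it breaks on subqueries, CTEs, quoted identifiers, and every non-SQL technology, which is why specialized scanners are needed.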
Now that you know the potential sources of your information and lineage techniques, let’s look back at question
number two: What is the process for building your data lineage map?
Manually resolving lineage usually starts at the top with your people, by mapping and documenting the
knowledge in their heads. This process involves interviewing application owners, data stewards, and data
integration specialists for information about data movement within your organization. Then, you must begin
inputting that information into spreadsheets or other mapping mechanisms so the lineage can be defined.
Advantages
• It’s the starting point. Manual data lineage analysis is where a lineage project needs to start to be able to gain insight into what is going on across the entire environment. It may be that there isn’t any code at all or any permissions to access and profile data directly (especially with legacy systems). In these cases, domain experts (your people) are your only source of lineage.
Disadvantages
• The lineage cannot be trusted. You’re relying on what people are telling you. Their information may be contradictory, missing important details, or simply wrong. This can lead to a situation where you have lineage, but you’re unable to use it because it cannot be trusted.
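The spreadsheet step can be sketched as follows. The example assumes a hypothetical three-column export (`source`, `target`, `reported_by`) of interview notes and flags one simple kind of contradiction: two objects reported as flowing in both directions.

```python
import csv
import io

# Hypothetical spreadsheet export of interview notes (inlined here as CSV text).
sheet = io.StringIO(
    "source,target,reported_by\n"
    "billing.invoices,dwh.revenue,alice\n"
    "dwh.revenue,billing.invoices,bob\n"
    "crm.accounts,dwh.customers,alice\n"
)

def load_manual_lineage(handle):
    """Read the mapping rows into (source, target, reported_by) tuples."""
    return [(r["source"], r["target"], r["reported_by"]) for r in csv.DictReader(handle)]

def contradictions(rows):
    """Find pairs of objects that were reported as flowing in both directions."""
    directed = {(src, dst) for src, dst, _ in rows}
    return sorted(
        {tuple(sorted(pair)) for pair in directed if (pair[1], pair[0]) in directed}
    )

rows = load_manual_lineage(sheet)
conflicts = contradictions(rows)
print(conflicts)
```

A two-way flow is not always an error, so a flagged pair is best treated as a prompt for a follow-up conversation rather than as proof that someone was wrong.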
This approach uses logs as a source and relies on a tool that fully controls your data’s movement, its changes, and the entire data processing workflow to give you full insight. It’s the preferred choice of ETL/ELT vendors.
Advantages
• It’s fully automated, so no tedious manual analysis is needed.
• Complete lineage of the entire data processing platform. It provides full insight, control, unlimited access to internal logs, details about executed workflows, and processing instructions.
Disadvantages
• The data lineage is limited to the controlling platform; it’s self-contained. Anything that happens outside the controlled environment is invisible. More complex components within the environment can be missed. The result is incomplete lineage.
• It is limiting for the majority of data engineering tasks. Organizations using this approach enforce a single data processing platform or prohibit the use of its more complex components, as they’ll likely be missed. However, this slows down new development and is limiting and frustrating for data engineers.
External automated data lineage analysis is designed with the diversity of the data system environment in mind.
It does not require all data processing to happen in one tool or platform. As the name indicates, this approach
also offers fully automated data lineage analysis.
Advantages
• It doesn’t require all the data processing to be on one platform. Unlike self-contained data lineage analysis, external automated data lineage analysis can be done across system platforms, components, and tools.
• It can use any of the three sources. Using either logs or code as a source for data lineage discovery is most common, but data as a source can be used too. It’s also versatile enough to combine sources and approaches.
• Its versatility allows for flexibility. It can be adjusted based on the user’s level of understanding and needs.
Disadvantages
• None
External automated data lineage is a powerful tool for gaining full visibility of the data environment, overcoming
data blind spots, and taking informed, timely action from your data.
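Combining sources across platforms can be pictured as merging edge lists. The sketch below is an assumption about how such a merge might look, not a description of a specific tool: each hypothetical scanner contributes its own edges, and the merged result records which sources reported each edge.

```python
from collections import defaultdict

# Hypothetical edge lists, each produced by a different scanner or platform.
etl_tool_edges = [("stg.orders", "dwh.sales")]
sql_scanner_edges = [("dwh.sales", "rpt.revenue"), ("stg.orders", "dwh.sales")]
log_edges = [("src.orders", "stg.orders")]

def merge_lineage(**sources):
    """Union edges from all sources and remember which source reported each edge."""
    merged = defaultdict(set)
    for name, edges in sources.items():
        for edge in edges:
            merged[edge].add(name)
    return {edge: sorted(names) for edge, names in merged.items()}

lineage = merge_lineage(etl=etl_tool_edges, code=sql_scanner_edges, logs=log_edges)
for edge, reported_by in sorted(lineage.items()):
    print(edge, reported_by)
```

Keeping the provenance of each edge makes it possible to see where sources agree and where only one of them observed a flow, such as the `src.orders` hop that here appears only in the logs.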
Tapping into the true potential of data lineage means automating manual processes, enabling trust in data, and increasing the productivity of your organization for better business outcomes. But in order to do this, you need the right solution with the right tools.
To achieve your goals, the following key data lineage elements must be present:
1. Accurate and Detailed Metadata
2. Semantics and AI
3. Activating Integrations
1. Accurate and Detailed Metadata
We’ve emphasized the importance of recognizing and capturing the dynamic aspects of data: the transformations, calculations, and movements, all of which represent a type of dependency. These are best represented by data lineage, but without understanding and controlling data lineage, your data management will remain inaccessible.
Dependencies are everywhere and are usually well hidden. There are even indirect dependencies like filtering conditions. Automated discovery is non-negotiable; it’s the only thing that can uncover these hidden dependencies.
2. Semantics and AI
Just mapping dependencies is not enough. To get the most out of your data and maximize insights, you need AI.
Core information about the flow of data and the data journey has to be enriched by its meaning: what does a specific transformation mean, and how does it affect the data?
The ability to answer such questions provides more power and control over dependencies and allows for the deployment of more advanced techniques for automation. To fully deploy AI and other advanced techniques, semantics is key.
The semantic layer of data lineage provides various capabilities:
• The ability to differentiate between different types of dependencies (direct and indirect)
• The ability to understand the evolution of data lineage over a period of time (time slicing and revisions)
• The ability to translate the real data processing code into more high-level, user-friendly expressions
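Two of the ideas above, direct versus indirect dependencies and comparing lineage across revisions, can be sketched with a labeled edge set. The column names, the `kind` labels, and the two revisions below are hypothetical and serve only to illustrate the distinction.

```python
# Hypothetical lineage revisions: each edge is (source, target, kind).
# "direct":   the target value is copied or computed from the source column.
# "indirect": the source only influences which rows arrive (e.g. a filter condition).
revision_1 = {
    ("orders.amount", "sales.amount", "direct"),
    ("orders.status", "sales.amount", "indirect"),
}
revision_2 = {
    ("orders.amount", "sales.amount", "direct"),
    ("orders.status", "sales.amount", "indirect"),
    ("fx.rate", "sales.amount", "direct"),
}

def by_kind(edges, kind):
    """Return the (source, target) pairs of one dependency type."""
    return sorted((src, dst) for src, dst, k in edges if k == kind)

def diff(old, new):
    """Compare two lineage revisions and report added and removed edges."""
    return {"added": sorted(new - old), "removed": sorted(old - new)}

indirect = by_kind(revision_2, "indirect")
changes = diff(revision_1, revision_2)
print(indirect)
print(changes)
```

Here `orders.status` never contributes a value to `sales.amount`, yet changing how it is filtered would change the result, which is why indirect dependencies are easy to miss without automated discovery.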
As a modern organization, you process high volumes of data. Your IT environment will only increase in complexity, and your IT team is struggling to keep up. You need to get your data systems under control and find a way to stay efficient despite this skyrocketing complexity. You need data lineage, and the bare minimum ‘good enough’ data lineage that comes with your data catalog can lead to costly updates later.
The Manta platform includes unique features to get the most value out of your lineage, with more than 50 out-of-the-box, fully automated scanners. In addition, Manta works alongside the most popular data catalogs; our platform integrates with catalogs like Collibra, Informatica, Alation, and more.
Don’t wait. Realize the benefits of automated data lineage today.