The Ultimate Guide To Data Lineage

Data lineage provides a complete picture of a company's data landscape and flows to help overcome complexity and visibility issues. This helps speed insights, reduce incidents, and build trust. Data lineage automates metadata tracking and dependency mapping to help data teams govern data more efficiently and focus on higher value work.

Table of Contents

Why Are We Talking About Data Lineage?
What Is Data Lineage?
What Is Metadata?
How Data Lineage Activates Metadata
Business Benefits of Data Lineage
How to Create Data Lineage and Keep It Up to Date
What to Look for in a Data Lineage Solution
How Manta Can Help with Data Lineage

Introduction
New data flows throughout your organization every minute of every day. With the high volume of data you’re collecting, you need to be able to act and make informed decisions, and quickly.

But today’s data systems are deeply complex, and this complexity creates blind spots in the data environment. With limited visibility, you have limited control. How can you overcome these complexities and get the full picture of your data landscape?

Data lineage combats complexity in your data environment by providing a complete picture of your data landscape, allowing you to tame data chaos and optimize data for valuable insights.

In our ultimate guide to data lineage, we will discuss:
• What data lineage is and why it matters in a variety of industries
• How to activate metadata
• How to create data lineage
• What to look for in a data lineage solution
• Why Manta should be part of your data toolkit

www.manta.io 2

Why Are We Talking About Data Lineage?

Data management has undergone a massive transformation in the past decade. Data infrastructure is constantly growing in complexity, evolving into data ecosystems with thousands of components all aimed at one goal: to derive more value from data.

Today’s data ecosystems are too much for the human brain to handle. They are too diverse and interconnected, rife with directly and indirectly connected applications, microservices, and infrastructures, with countless dependencies defining how all these touchpoints interact with one another.

When you have data from this many sources and systems, you can’t extract meaningful insights. It can be difficult for any one person or team to have complete visibility over your entire environment.

In complex environments, it’s not a question of if your governance team will miss something, but when.

These data blind spots and complexities can lead to the following challenges:

Slower Delivery of Predictive Insights

MIT and Hewlett Packard Enterprise reported that data-driven companies are 58% more likely to beat revenue goals than non-data-driven companies, and 162% more likely to significantly outperform them.[1] This is the power of predictive insight. But due to an overabundance of data, much of it goes to waste. One report found that as much as 68% of data goes unleveraged.[2]

In most cases this data remains inaccessible because:

1. It is stored in an unusable format or inaccessible location.
2. It cannot be traced and therefore cannot be trusted.
3. It’s difficult to determine what data is important.
4. The data is sensitive and needs to be protected.

When data engineering resources are spent on unproductive impact analysis (such as assessing the impact of new development requirements), it distracts developers and slows down the delivery of new features. Data lineage helps speed up the delivery of predictive insights by automatically generating comprehensive data flow visualizations.

Growing Number of Data Incidents

During the third quarter of 2022, nearly 15 million data records were exposed worldwide through data breaches, a 37% increase compared to the previous quarter.[3] Due to the limited visibility of complex data systems, assessing the end-to-end impact of data dependencies and changes is demanding and sometimes impossible.

1 MIT Technology Review; Getting the most from your data-driven transformation: 10 key principles; Janice Zdankus; Anthony Delli Colli; 14 Oct. 2021
2 Seagate; ‘Rethink Data’ Report Reveals That 68% Of Data Available To Businesses Goes Unleveraged; 15 July 2020
3 Statista; Number of data records exposed worldwide from 1st quarter 2020 to 3rd quarter 2022


Current data observability tools are still primarily reactive, meaning organizations find bugs after there has already been an incident, rather than preventing them. This is concerning because even a single data incident can cause severe damage to your organization. IBM reports a single data breach costs an average of $4.3 million.[4]

Data lineage grants IT teams high levels of observability without large amounts of manual intervention. This allows teams to be truly proactive: to catch incidents before they happen.

Decreased Trust in Reports & Insights

In 2020, 90% of companies reported that their data governance projects had failed.[5] Years later, companies are still struggling to build trust.

If you can’t fully explain how data was collected or verify its origins, you can’t answer basic questions or leverage data for better customer outcomes, and you’re going to experience severe business impacts and frustration.

Data lineage is a powerful tool in the fight for building trust in data. Detailed lineage creates an added semantic layer for more accurate and timely reports that lay the foundation for more informed decisions and better forecasting without second-guessing.

Automated lineage also puts an end to the costly, lengthy, manual processes of lineage collection and updating.

Manta customers have saved between $5-15M in the initial phase of implementation alone.

Shortage of Data Engineering Talent

Engineering talent is hard to find, especially in the competitive post-COVID-19 environment. The last thing an organization should do is waste their team’s time on manual, routine tasks. This can increase their frustration and likelihood of leaving.

Due to the growing complexity of the data stack, data engineers have become more critical than any other role, as they oversee data pipelines and integrated data structures. This requires a larger skill set, which makes good data engineers harder to find and even harder to keep.

Organizations that invest in data lineage remove this burden from their IT team, allowing them to refocus their efforts on tasks that can’t be automated.

Increased Risk of Non-Compliance

Regulators are cracking down on organizations in every industry. In the United States, we’ve already seen a rapid expansion of data regulations, like PCI, FERPA, and FISMA.

The EU is experiencing similar regulatory challenges. In April of 2022, the Digital Services Act threatened Big Tech with a 6% revenue penalty for having illegal content live on their sites.

4 IBM; Cost of a data breach 2022: A million-dollar race to detect and respond
5 Gartner; The State of Data and Analytics Governance Is Worse Than You Think; Saul Judah, Andrew White; 19 June 2020


Whether you need to comply with the GDPR, HIPAA, the Sarbanes-Oxley Act, or the FDA, your organization is at risk without lineage. Data lineage provides a complete overview of all regulated data that is processed by your organization. This helps you prepare for audits and avoid hefty penalties for non-compliance.

You need to get your data systems under control and find a way to stay efficient despite this skyrocketing complexity. You need data lineage.

What Is Data Lineage?


Traditionally, data lineage has been seen as a way of understanding how your data flows through all your processing systems: where the data comes from, where it’s flowing to, and what happens to it along the way.

In reality, data lineage is so much more.

Data lineage represents a detailed map of all direct and indirect dependencies between the data entities in your environment.

Why is this so important?

A detailed dependency map is the core component of a modern data stack. It allows you to gain complete visibility and a clear line of sight to uncover data blind spots throughout your data systems, while also helping ensure ethical, compliant, and efficient data management processes.

A detailed dependency map can tell you:

• How changing a bonus calculation algorithm in the sales data mart will affect your weekly financial forecast report
• Where data that is heavily regulated is being used and for what purpose
• What is the best subset of test cases that will cover the majority of data flow scenarios for your newly released pricing database app
• How to divide a data system into smaller chunks that can be migrated to the cloud independently without breaking other parts of the system

There are endless opportunities when you tap into the full potential of your data. To do that, activating metadata is key.
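To make the idea of a dependency map concrete, here is a minimal sketch that models lineage as a directed graph and walks it to find everything downstream of a changed entity. The entity names and edges are invented for illustration, and a real lineage tool would collect them automatically rather than from a hand-written list.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream entity, downstream entity).
EDGES = [
    ("crm.orders", "sales_mart.bonus_calc"),
    ("sales_mart.bonus_calc", "finance.weekly_forecast"),
    ("crm.customers", "sales_mart.customer_dim"),
    ("sales_mart.customer_dim", "finance.weekly_forecast"),
    ("finance.weekly_forecast", "reports.board_summary"),
]


def build_graph(edges):
    """Map each entity to the set of entities that read from it directly."""
    graph = defaultdict(set)
    for upstream, downstream in edges:
        graph[upstream].add(downstream)
    return graph


def downstream_impact(graph, changed):
    """Return every entity reachable from `changed`, directly or indirectly."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


graph = build_graph(EDGES)
impacted = downstream_impact(graph, "sales_mart.bonus_calc")
print(sorted(impacted))
```

In this toy graph, changing the bonus calculation reaches the weekly forecast and, indirectly, the board summary. The same breadth-first walk over reversed edges would answer the opposite question of where a report's data originates.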


What Is Metadata?

You may have heard metadata described simply as data about data, essentially being information that describes or catalogs what the data is.

As defined by Dr. Irina Steenbeek: The terms “data” and “metadata” have a complicated relationship. Data is the physical or electronic representation of facts or signals “in a manner suitable for communication, interpretation, or processing by human beings or by automatic means.”

Metadata puts data in a particular context. A data model is an example of metadata. For example, if you had a dataset with customer financial information, you couldn’t use the dataset without a detailed description. The description of a dataset is metadata that defines data and explains the context in which data can be used.

The challenging aspect of defining metadata is that the same data can be recognized as either data or metadata, depending on the context. For example, data models are metadata for business users. For data modelers, on the other hand, data models can be considered data that will in turn require other metadata to describe data models. Different sources contain different approaches to classifying metadata.

Business metadata
Business metadata focuses largely on the content and condition of the data and includes details related to data governance. Note: “data governance” has different meanings in different contexts.

Technical metadata
Technical metadata provides information about technical details of data, the systems that store data, and the processes that move it within and between systems.

Operational metadata
Operational metadata describes details of the processing and accessing of data.

Data lineage is crucial for understanding and utilizing dynamic metadata—how it moves across the systems,
where it originated, how it’s interconnected, and how it transforms.


How Data Lineage Activates Metadata

We typically use data in the following ways.

• We query data to get answers to our questions. That is the very basic use case
we usually think of first. It can be good old pre-built reports, ad-hoc queries, or
smart AI/ML algorithms digging insights from the data we have. By performing
queries, we turn data into information and eventually knowledge.

• We embed data into places where people or machines naturally need them.
We do not force a sales representative to log into our reporting platform and
write ad-hoc SQL queries or use pre-built reports. No. Rather, we prepare all the
data they may need about a prospect or a customer, turn it into information, and
deliver it to their workspace (as a dashboard in an application like the CRM they
use daily). On top of that, we also enrich internal data with valuable external data
to provide an even more complex view of the customer.

• We also use data to automate tasks and processes. Instead of waiting for the
sales representative to open their workspace and search for customers who may
be a good fit for a new product offering, we have an algorithm running in the
background that scores existing customers and sends proactive notifications
that suggest who to call and what (or even how) to offer. Or, for an even simpler
example, we automatically send a reminder to a sales representative in case they
take no action (even if they should).

Obviously, there are more ways that we interact with data; the above are the most traditional examples. The first “query” case represents a very “static” experience. Everything is sitting in a silo (e.g., a data warehouse or data lake), and we expect people to come, find what they need, and ask questions they need to ask. Do not get me wrong: it is awesome for some use cases and, when compared to a case with no data available, a huge jump forward.

However, we see that data is put to much better use in the other two examples, actively supporting users with limited data engineering skills and dramatically increasing their productivity. Compared to the first example, the latter are more “active” and thus more useful and accessible to a broader audience. And that is what we want to achieve with active metadata too.

Gartner’s definition of active metadata in their most recent Market Guide for Active Metadata Management touches on several key aspects of metadata:

• Continuous access - metadata is continuously collected. It is not something you do once per month or once per year, as we want to collect every change and every signal and respond to it.
• Connecting dots - metadata is not just collected; it is constantly processed to distill information (and knowledge) from all the signals and noise. And with the right feedback loop, your system gets smarter over time, collecting and learning.
• Actionable - all the intelligence and insights derived from metadata are not locked into a silo, but rather delivered in the form of recommendations, warnings, and notifications to humans and systems/applications that may need it.
• Embedded - actionable information / knowledge is integrated into processes humans and machines perform, embedded into their workspace. People are not forced to go in and look for the insights. Instead, active metadata comes to them, when and where they need it.


Merely collecting and cataloging metadata fails to maximize its potential. This is one reason why metadata has not historically been practical. We’ve spent decades focused on metadata collection, but data sitting in a repository is not useful and data without understanding can be a liability. This is changing now with the focus on active metadata.

According to Gartner, active metadata capabilities will expand to include monitoring, evaluating, recommending design changes, and orchestrating processes in third-party data management solutions.[6]

Gartner also predicts that organizations that adopt aggressive metadata analysis across their complete data management environment will decrease the time to delivery of new data assets to users by as much as 70%.[7]

By activating existing metadata with automation and intelligence, data lineage can provide the needed visibility and control to help you become more aware of your data management and proactive in your data usage.

This approach can empower you with:

• Continuous detection of “dead tables” where potentially sensitive information is stored but not accessed or used
• Instant alerts if a change negatively impacts tactical management reports or key data features used by the data science team
• Notifications about overly complex parts of data pipelines where refactoring or redesign would help to reduce the risk of failure
• Warnings if the design of a data pipeline moves data between locations where no data should ever be moved

As you can see with these examples, data lineage allows you to leverage metadata to better manage, utilize, and optimize your data, even when working with overly complex data environments. That can lead to immediate and long-term business benefits.
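As an illustration of the “dead tables” case, the sketch below combines lineage edges with access metadata to flag tables that nothing reads from and that have not been queried recently. The table names, the `LAST_READ` dictionary, and the 180-day idle threshold are all assumptions made for the example, not part of any particular product.

```python
from datetime import date, timedelta

# Hypothetical metadata: lineage edges and the last recorded read per table.
LINEAGE_EDGES = [
    ("staging.payments", "dw.payments"),
    ("dw.payments", "reports.revenue"),
    ("staging.legacy_contacts", "dw.legacy_contacts"),
]
LAST_READ = {
    "dw.payments": date(2023, 3, 1),
    "reports.revenue": date(2023, 3, 2),
    "dw.legacy_contacts": date(2021, 6, 15),
}


def find_dead_tables(edges, last_read, today, max_idle_days=180):
    """Return tables with no downstream consumer and no recent read."""
    all_tables = {table for edge in edges for table in edge}
    has_consumer = {upstream for upstream, _ in edges}
    cutoff = today - timedelta(days=max_idle_days)
    dead = []
    for table in sorted(all_tables - has_consumer):
        read_at = last_read.get(table)
        if read_at is None or read_at < cutoff:
            dead.append(table)
    return dead


dead_tables = find_dead_tables(LINEAGE_EDGES, LAST_READ, today=date(2023, 3, 10))
print(dead_tables)
```

Here only `dw.legacy_contacts` is flagged: it feeds nothing downstream and was last read long before the cutoff. Running a check like this continuously, rather than once a year, is what makes the metadata “active” in the sense described above.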

6 Gartner, Market Guide for Active Metadata Management; Guido De Simoni, Mark Beyer; 14 Nov 2022
7 Ibid


Business Benefits of Data Lineage

Automated Impact Analysis for Improved Incident Prevention

In business, every decision contributes to the bottom line. That’s why impact analysis is crucial: it predicts the consequences of a decision. How will one decision affect customers? Stakeholders? Sales?

Data lineage helps during these investigations. Because lineage creates an environment where reports and data can be trusted, teams can make more informed decisions. Data lineage provides that reliability, and more.

One often-overlooked area of impact analysis is IT resilience. This blind spot became apparent in March of 2021 when CNA Financial was hit by a ransomware attack that caused widespread network disruption. The company’s email was hacked, consumers panicked, and CNA Financial was forced to pay a record-breaking $40 million in ransom.[8] This is where lineage-supported impact analysis is needed. If you experience a threat, you will want to be prepared to combat it, and know exactly how much of your business will be affected.

IT resilience is also threatened by natural disasters, user error, infrastructure failure, cloud transitions, and more. In fact, according to McKinsey, 76% of organizations experienced an incident during the past two years that required an IT disaster-recovery plan.[9]

Most organizations struggle with impact analysis as it requires significant resources when done manually. But with automated lineage from Manta, customers have seen as much as a 40% increase in engineering teams’ productivity after adopting lineage.

Greater Data Pipeline Observability for Faster Incident Resolution

As discussed above, there are countless threats to your organization’s bottom line. Whether it is a successful ransomware attack or a poorly planned cloud migration, catching the problem before it can wreak havoc is always less expensive.

That’s why data pipeline observability is so important. It not only protects your organization but also your customers.

Data lineage expands the scope of your data observability to include data processing infrastructure or data pipelines, in addition to the data itself. With this expanded observability, incidents can be prevented in the design phase or identified in the implementation and testing phase to reduce maintenance costs and achieve higher productivity.

Manta customers who have created complete lineage have been able to trace data-related issues back to the source 90% faster compared to their previous manual approach.

This means the teams responsible for particular systems can fix any issue in a matter of minutes, according to Manta research.

8 Bloomberg; CNA Financial Paid $40 Million in Ransom After March Cyberattack; Kartikay Mehrotra and William Turton; 20
May 2021
9 McKinsey Digital; IT resilience for the digital age; Arun Gundurao, Jorge Machado, Rut Patel, and Yanwing Wong; 11 May
2021


Improved Regulatory Compliance

Depending on your industry, you have to ensure you’re in compliance with a host of regulatory bodies and policies: BASEL, HIPAA, GDPR, CCPA/CPRA, and CCAR, just to name a few. All of these regulations require accurate tracking of data. Your organization must be able to answer:

1. Where does the data come from?
2. How did the data get there?
3. Are we capable of proving it with up-to-date evidence whenever necessary?
4. Do we need weeks or months to complete a report?
5. Is that report even entirely reliable?

Data lineage helps you answer these questions by creating highly detailed visualizations of your data flows. You can use these reports to accurately track and report your data to ensure regulatory compliance.

Faster and More Efficient Migrations

McKinsey predicts that $8 out of every $10 for IT hosting will go toward the cloud by 2024.[10] However, if you have ever been involved in the migration of a data system, you know how complex the process is.

Approximately $100 billion of cloud funding is expected to be wasted over the next three years, and most enterprises cite the costs around migration as a major inhibitor to adopting the cloud.[11] The process is so complex (and expensive) because every system consists of thousands or millions of interconnected parts, and it is impossible to migrate everything in a single step.

Dividing the system into smaller chunks of objects (reports, tables, workflows, etc.) can make it more manageable, but poses another challenge: how to migrate one part without breaking another. How do you know what pieces can be grouped to minimize the number of external dependencies?

With data lineage, every object in the migrated system is mapped and dependencies are documented.

Manta customers have used data lineage to complete their migration projects 40% faster with 30% fewer resources.

Retention of Data Engineering Talent

Data engineers, developers, and data scientists continue to be fast-growing and hard-to-fill roles in tech. The shortage of data engineering talent has ballooned from a problem to a crisis, made worse by the increasing complexity of data systems. The last thing you want is to continually overstretch your valuable data engineers with routine, manual (and frustrating) tasks like chasing data incidents, assessing the impacts of planned changes, or answering the same questions about the origins of data records again and again.

Data lineage can help to automate routine tasks and enable self-service wherever possible, allowing data scientists and other stakeholders to retrieve up-to-date lineage and data origin information on their own, whenever they need it. A detailed data lineage map also enables faster onboarding of data engineers to integrate new or less-experienced engineers into the role without impacting the stability and reliability of the data environment.

10 McKinsey; Cloud-migration opportunity: Business value grows, but missteps abound; Tara Balakrishnan, Chandra
Gnanasambandam, Leandro Santos, and Bhargs Srivathsan; 12 Oct 2021
11 Ibid


Established Trust in Data

Data governance is a clear priority in almost every organization, regardless of industry. In one survey, 60% of respondents planned to spend more than $49K on data governance technology and tools in the next one to two years.[12]

Report developers, data scientists, and data citizens need data they can trust for accurate, timely, and confident decision-making. But in today’s complex data environment, you must contend with dispersed servers and infrastructure, resulting in disparate sources of data and countless data dependencies. You need a complete overview of all your data sources to see how data moves through your organization and to understand all touchpoints and how they interact with one another. You can only completely trust your data when you have a complete understanding of it.

Data lineage provides a comprehensive overview of all your data flows, sources, transformations, and dependencies. With data lineage, you will ensure accurate reporting, see how crucial calculations were derived, and gain confidence in your data management framework and strategy.

Improved Change Management

One of the most critical processes for every business, regardless of size, is change management. Organizations face a variety of change management challenges or obstacles, including:

• A lack of executive support or buy-in
• Misalignment due to miscommunication
• Juggling multiple simultaneous changes
• Lack of overall visibility

Nearly all of the benefits mentioned above address these challenges directly. With data lineage, leaders will gain greater visibility into the impact of proposed changes, greater pipeline visibility, and faster migrations (to name a few). With greater trust in the data, getting executive support and communication alignment is easier, and through greater visibility you’ll be better equipped to manage multiple simultaneous changes without the pressure of detangling interconnected data dependencies.

12 Zaloni & Dataversity; The 2022 State of Cloud Data Governance; Michelle P. Knight, Annie Bishop


How to Create Data Lineage and Keep It Up to Date

Now that you know what data lineage is and the business benefits it provides, it is important to understand how data lineage is created and delivered.

We’ve defined static metadata (the information about tables, fields, columns, business terms, and their data types, locations, quality attributes, tags, etc.) and dynamic metadata (the information about the data’s journey from source to target, and all the changes, transformations, and calculations that happen along the way).

In activating this metadata, data lineage creates a map to understand its movements, connections, and dependencies.

Lineage metadata is about logic: instructions, stored procedures, or code in any form. It can be an SQL script, a database stored procedure, a job in a transformation tool, a Java API call, or a complex macro in an Excel spreadsheet. It’s essentially anything that moves your data from one place to another, transforms it, or modifies it.

To understand this logic and then build your data lineage map, you need to be able to answer two questions:

1. What is the source of information for building the data lineage map?
2. What is the process for building the data lineage map?

Question 1: Determining the Source of Data Lineage

Data lineage information can be derived from three major sources for three types of data lineage:

1. Data as a source for pattern-based lineage
2. Logs as a source for run-time lineage
3. Code as a source for design lineage

1. Data: Pattern-Based Lineage

This technique reads metadata about tables and columns and uses information about data profiles to create links representing possible data flows based on common patterns or similarities. This could be something like a table or column with similar names and data values. When these similarities are found between columns, they can be linked together in the data lineage diagram.

Advantages

• It’s the best approach for identifying manual data flows happening outside of the system, like copying data to a flash drive, modifying it on another computer, or storing it on a different part of the system.
• You don’t have to worry about the integration of different system technologies because you’re watching the data as the source, rather than algorithms.
• It’s the best approach in cases when it is impossible to read the logic hidden in your programming code because the code is unavailable or proprietary and cannot be accessed.

Disadvantages

• You may miss important details. Because you’re only watching data, this lineage is limited to the database. You’re not seeing the application side of your environment or the so-called “transformation logic” of how and where data is being modified.
• The approach is not always accurate. The impact on performance can be significant, and data privacy is at risk.
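A rough sketch of the pattern-based idea follows: it compares samples of column values and proposes a link when two columns in different tables overlap heavily. The column profiles, the Jaccard similarity measure, and the 0.8 threshold are assumptions chosen for illustration, and name similarity, which the technique may also use, is left out to keep the example short.

```python
from itertools import combinations

# Hypothetical column profiles: (table, column) -> sample of distinct values.
PROFILES = {
    ("crm.customers", "customer_id"): {"C1", "C2", "C3", "C4"},
    ("dw.customer_dim", "cust_id"): {"C1", "C2", "C3", "C4"},
    ("dw.orders", "order_id"): {"O1", "O2", "O3"},
    ("crm.customers", "country"): {"DE", "US", "CZ"},
}


def jaccard(a, b):
    """Share of values two sets have in common, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_links(profiles, threshold=0.8):
    """Pair up columns in different tables whose value samples overlap."""
    ordered = sorted(profiles.items(), key=lambda item: item[0])
    links = []
    for (col_a, vals_a), (col_b, vals_b) in combinations(ordered, 2):
        if col_a[0] == col_b[0]:
            continue  # same table, so not a flow between tables
        score = jaccard(vals_a, vals_b)
        if score >= threshold:
            links.append((col_a, col_b, score))
    return links


links = candidate_links(PROFILES)
print(links)
```

Note that the result is only a candidate link with no direction and no transformation logic, and that the function has to read actual data values to produce it. Both points line up with the disadvantages listed above.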


2. Logs: Run-Time Lineage

This technique relies on run-time information extracted from the data environment—log files, execution workflows
exported by ETL/ELT tools, or any other source with sufficient run-time details. Some data processing engines
use a trick called data tagging, where each piece of data being moved or transformed is tagged or labeled by a
transformation engine, which then tracks that label all the way from start to finish.

Advantages

• It has an operational nature, which is valuable for incident resolution because it provides accurate information about the flow of a specific data element that has been identified as erroneous.
• It considers different technologies in the data stack (unlike pattern-based lineage), as the format and structure of the logging information vary significantly.
• Regular expressions, rules, or AI/ML can be deployed to identify relevant parts of log files and derive data flow information.

Disadvantages

• Inaccurate data lineage. Run-time lineage only captures information about recently executed data flows and may fail to capture data calculations and scenarios that are not executed equally or with the same frequency. This can lead to inaccurate or inconsistent lineage, as some parts are either missing or are no longer valid.
• The absence of transformation details. Not everything is or can be logged, especially in the case of more complex algorithms or processing done outside the database/ETL/ELT world. As a result, run-time lineage can often capture only very high-level and generic table-to-table mappings.

Blindly using such metadata poses a big risk for an organization. If used by a data engineer to run impact
analysis, it leads to a high probability of incidents when designing and implementing changes in the system and
new requirements. If used by a risk analyst to prepare a regulatory report, it leads to inaccuracies in the report
and increased risk of (public) incidents and penalties. If used by a data scientist to analyze and prepare data to
train a new model, it leads to inherent inequality encoded into the AI/ML algorithm.
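To show what deriving run-time lineage from logs can look like, the sketch below applies a regular expression to a handful of log lines and extracts table-to-table flows. The log format (`job=... read=... write=...`) is invented for this example; real ETL/ELT tools each log differently, so the pattern would need to be written per tool.

```python
import re

# Hypothetical log format; real ETL/ELT tools each log differently.
LOG_LINES = [
    "2023-03-01 02:00:01 INFO job=load_orders status=started",
    "2023-03-01 02:00:09 INFO job=load_orders read=staging.orders write=dw.orders rows=1204",
    "2023-03-01 02:05:44 INFO job=build_revenue read=dw.orders write=reports.revenue rows=87",
    "2023-03-01 02:05:45 WARN job=build_revenue slow query detected",
]

FLOW_PATTERN = re.compile(r"job=(?P<job>\S+) read=(?P<src>\S+) write=(?P<dst>\S+)")


def extract_flows(lines):
    """Return (source, target, job) for each log line that records a data movement."""
    flows = []
    for line in lines:
        match = FLOW_PATTERN.search(line)
        if match:
            flows.append((match["src"], match["dst"], match["job"]))
    return flows


flows = extract_flows(LOG_LINES)
for source, target, job in flows:
    print(f"{source} -> {target} (via {job})")
```

The output contains only flows that actually ran and were logged, at table-to-table granularity. A job that runs once a quarter, or a transformation that is never logged, would simply be absent, which is the limitation described above.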


3. Code: Design Lineage

This technique looks directly into the code that processes and transforms data records to identify data flows.
This is “code” in the broadest sense—such as an SQL script, a PL/SQL stored procedure, an ETL/ELT workflow
encoded in a proprietary XML format, a macro in an Excel spreadsheet, a mapping between a field in a report
and a database column or table, a Java API, a Kafka stream definition, an XSLT transformation, or a Python
algorithm in a Jupyter notebook.

Advantages

• The variety of code. The functionality to work with this variety gives design lineage the advantage as the best approach for gaining detailed visibility into your data environment to identify and eliminate data blind spots.
• It is the most accurate approach to lineage, with very few false positives. This is critical for incident management as it narrows down the scope of the investigation and makes change management and impact analysis more efficient.
• It accurately detects all data flows, with a close to zero chance of missing any, even those rarely used or not used at all. This is critical for change management processes and impact analysis, as well as for migration projects, privacy programs, and regulatory reporting.
• It can reliably detect indirect data flows, where one data element influences another, even without a direct data lineage connection. This is essential for change management, impact analysis, incident management, migration projects, and regulatory reporting.
• It records details about transformations and calculations used to process data, which is especially important for compliance and regulatory reporting.

Disadvantages

• The variety of code. It’s a challenge because parsing and reverse engineering the code is much tougher than parsing log files, and it requires specialized scanners for all supported technologies.

These advantages make design lineage the preferred approach for the most successful vendors and
organizations.
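
As a rough illustration of what reading lineage from code involves, the sketch below extracts column-level data flows from one very simple `INSERT INTO ... SELECT` statement using only Python's standard library. The table and column names are made up, and the regular expression is an assumption that handles only this single statement shape (one aliased source table, no function calls containing commas). Real scanners rely on a full parser for each supported technology.

```python
import re

# Hypothetical statement; table and column names are illustrative only.
SQL = """
INSERT INTO report.revenue (customer_id, total)
SELECT o.customer_id, o.amount * o.fx_rate
FROM sales.orders o
WHERE o.status = 'PAID'
"""

# Assumed statement shape: INSERT INTO t (cols) SELECT exprs FROM s alias [WHERE ...]
PATTERN = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<exprs>.+?)\s+FROM\s+(?P<source>[\w.]+)\s+(?P<alias>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+))?",
    re.IGNORECASE | re.DOTALL,
)


def extract_lineage(sql):
    """Return a set of (source_column, target_column, kind) edges."""
    m = PATTERN.search(sql)
    if m is None:
        raise ValueError("statement shape not supported by this sketch")
    target, source, alias = m["target"], m["source"], m["alias"]
    targets = [c.strip() for c in m["cols"].split(",")]
    exprs = [e.strip() for e in m["exprs"].split(",")]
    col_ref = re.compile(rf"\b{re.escape(alias)}\.(\w+)")
    edges = set()
    # Direct flows: columns read by the expression that feeds a target column.
    for tgt, expr in zip(targets, exprs):
        for src in col_ref.findall(expr):
            edges.add((f"{source}.{src}", f"{target}.{tgt}", "direct"))
    # Indirect flows: filter columns influence every target column.
    for src in col_ref.findall(m["where"] or ""):
        for tgt in targets:
            edges.add((f"{source}.{src}", f"{target}.{tgt}", "indirect"))
    return edges


lineage = extract_lineage(SQL)
for edge in sorted(lineage):
    print(edge)
```

Note that the `status` column never lands in the target table, yet it still shapes the result through the filter. That is the kind of indirect flow that is only visible when the code itself is analyzed.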


Question 2: Understanding the Process

Now that you know the potential sources of your information and lineage techniques, let’s look back at question
number two: What is the process for building your data lineage map?

There are three major process approaches:

1. Manual Data Lineage Analysis
2. Self-Contained Data Lineage Analysis
3. External Automated Data Lineage Analysis

1. Manual Data Lineage Analysis

Manually resolving lineage usually starts at the top with your people, by mapping and documenting the
knowledge in their heads. This process involves interviewing application owners, data stewards, and data
integration specialists for information about data movement within your organization. Then, you must begin
inputting that information into spreadsheets or other mapping mechanisms so the lineage can be defined.

Advantages

• It’s the starting point. Manual data lineage analysis is where a lineage project needs to start to be able to gain insight into what is going on across the entire environment. It may be that there isn’t any code at all or any permissions to access and profile data directly (especially with legacy systems). In these cases, domain experts (your people) are your only source of lineage.

Disadvantages

• The lineage cannot be trusted. You’re relying on what people are telling you. Their information may be contradictory, missing important details, or simply wrong. This can lead to a situation where you have lineage, but you’re unable to use it because it cannot be trusted.
• It’s tedious. Manual data lineage analysis uses code as a source, where the code is analyzed by its authors or external resources. This means manually examining the code, comparing column names and reviewing tables and file extracts by hand. Unless you have team members with the requisite skills and expertise in the programs and modules you need to map, manual data lineage analysis may not even be worth attempting.
• It’s unsustainable. Due to code volumes, complexity, and the rate of change, manually managed lineage will fall out of sync with the actual data transfers in the environment, and you’re back to having data lineage that cannot be trusted.
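
To make the trust problem concrete, the sketch below compares a hand-maintained mapping, as it might be exported from a spreadsheet, with the flows that actually exist in the environment. Both edge lists and every name in them are hypothetical. The only point is that drift between documentation and reality shows up as a simple set difference.

```python
import csv
import io

# Hypothetical spreadsheet export, maintained by hand after interviews.
DOCUMENTED_CSV = """source,target
crm.customers.email,dwh.dim_customer.email
crm.customers.name,dwh.dim_customer.full_name
erp.invoices.amount,dwh.fact_revenue.amount
"""

# Hypothetical flows that actually exist in the environment today.
ACTUAL_EDGES = {
    ("crm.customers.email", "dwh.dim_customer.email"),
    ("crm.customers.first_name", "dwh.dim_customer.full_name"),
    ("crm.customers.last_name", "dwh.dim_customer.full_name"),
    ("erp.invoices.amount", "dwh.fact_revenue.amount"),
}


def load_documented(text):
    """Parse the spreadsheet export into a set of (source, target) edges."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["source"], row["target"]) for row in reader}


documented = load_documented(DOCUMENTED_CSV)
stale = documented - ACTUAL_EDGES    # documented, but no longer true
missing = ACTUAL_EDGES - documented  # true, but never documented

print("stale:", sorted(stale))
print("missing:", sorted(missing))
```

A single renamed column is enough to produce one stale entry and two undocumented flows, which is why manual maps tend to lose credibility quickly.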


2. Self-Contained Data Lineage Analysis

This approach uses logs as a source. It relies on a tool that fully controls your data’s movement, its changes, and the entire data processing workflow, which gives that tool full insight into everything it processes. It’s the preferred choice of ETL/ELT vendors.

Advantages

• It’s fully automated, so no tedious manual analysis is needed.
• Complete lineage of the entire data processing platform. It provides full insight, control, unlimited access to internal logs, details about executed workflows, and processing instructions.

Disadvantages

• The data lineage is limited to the controlling platform; it’s self-contained. Anything that happens outside the controlled environment is invisible. More complex components within the environment can be missed. The result is incomplete lineage.
• It is limiting for the majority of data engineering tasks. Organizations using this approach enforce a single data processing platform or prohibit the use of its more complex components, as they’ll likely be missed. However, this slows down new development and is limiting and frustrating for data engineers.
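
The following sketch shows the general idea of log-based lineage and its boundary. The log format is invented for illustration and does not correspond to any particular ETL/ELT product. Each log line records what a job read and wrote, and any input that no logged job produced marks the edge of what the platform can see.

```python
import json

# Invented log format; real platforms each have their own.
RUN_LOG = """
{"job": "load_orders", "reads": ["src.orders"], "writes": ["stg.orders"]}
{"job": "build_revenue", "reads": ["stg.orders", "stg.fx"], "writes": ["dwh.revenue"]}
"""


def edges_from_log(log_text):
    """Turn each logged job run into (source, target, job) lineage edges."""
    edges = set()
    for line in log_text.strip().splitlines():
        event = json.loads(line)
        for src in event["reads"]:
            for dst in event["writes"]:
                edges.add((src, dst, event["job"]))
    return edges


log_edges = edges_from_log(RUN_LOG)
produced = {dst for _, dst, _ in log_edges}
consumed = {src for src, _, _ in log_edges}

# Inputs that no logged job produced: their origin is outside the platform.
unexplained_inputs = consumed - produced
print(sorted(unexplained_inputs))
```

In this made-up run, `stg.fx` is used by a job but was loaded by something the platform never logged, so its upstream lineage is simply absent.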

3. External Automated Data Lineage Analysis

External automated data lineage analysis is designed with the diversity of the data system environment in mind.
It does not require all data processing to happen in one tool or platform. As the name indicates, this approach
also offers fully automated data lineage analysis.

Advantages

• It doesn’t require all the data processing to be on one platform. Unlike self-contained data lineage analysis, external automated data lineage analysis can be done across system platforms, components, and tools.
• It can use any of the three sources. Using either logs or code as a source for data lineage discovery is most common, but data as a source can be used too. It’s also versatile enough to combine sources and approaches.
• Its versatility allows for flexibility. It can be adjusted based on the user’s level of understanding and needs.

Disadvantages

• None

External automated data lineage is a powerful tool for gaining full visibility of the data environment, overcoming
data blind spots, and taking informed, timely action from your data.
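
A minimal sketch of combining sources, assuming each scanner simply emits a list of source-to-target edges: the edges are merged into one graph, and a breadth-first traversal answers an impact analysis question across platform boundaries. The scanner names and all dataset names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical edge lists, one per scanner or source type.
SCANNER_OUTPUT = {
    "sql_code": [("src.orders", "stg.orders"), ("stg.orders", "dwh.revenue")],
    "etl_logs": [("ftp.fx_rates", "stg.fx"), ("stg.fx", "dwh.revenue")],
    "bi_report": [("dwh.revenue", "report.quarterly_revenue")],
}


def build_graph(scanner_output):
    """Merge the edges from every scanner into one adjacency map."""
    graph = defaultdict(set)
    for edges in scanner_output.values():
        for src, dst in edges:
            graph[src].add(dst)
    return graph


def downstream(graph, start):
    """Return every node reachable from start (breadth-first search)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


graph = build_graph(SCANNER_OUTPUT)
impacted = downstream(graph, "ftp.fx_rates")
print(sorted(impacted))
```

No single scanner in this example sees the whole path from the FTP file to the quarterly report. The path only appears once the edges are merged.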


What to Look for in a Data Lineage Solution

Tapping into the true potential of data lineage means automating manual processes, enabling trust in data, and increasing the productivity of your organization for better business outcomes. But in order to do this, you need the right solution with the right tools.

To achieve your goals, the following key data lineage elements must be present:

1. Accurate and Detailed Metadata
2. Semantics and AI
3. Activating Integrations

1. Accurate and Detailed Metadata

We’ve emphasized the importance of recognizing and capturing the dynamic aspects of data: the transformations, calculations, and movements, all of which represent a type of dependency. These are best represented by data lineage, but without understanding and controlling data lineage, your data management will remain inaccessible.

Dependencies are everywhere and are usually well hidden. There are even indirect dependencies like filtering conditions. Automated discovery is non-negotiable: it’s the only thing that can uncover these hidden dependencies.

Another challenge is that dependencies must be mapped very accurately, in detail. Otherwise, the resulting map will contain too many false positives or will miss several critical relations among the data. Without detail and accuracy, any attempt to control dependencies is destined to fail.

2. Semantics and AI

Just mapping dependencies is not enough. To get the most out of your data and maximize insights, you need AI.

Core information about the flow of data and the data journey has to be enriched by its meaning: what does a specific transformation mean, and how does it affect the data?

The ability to answer such questions provides more power and control over dependencies and allows for the deployment of more advanced techniques for automation. To fully deploy AI and other advanced techniques, semantics is key.

The semantic layer of data lineage provides various capabilities:

• The ability to differentiate between different types of dependencies (direct and indirect)
• The ability to understand the evolution of data lineage over a period of time (time slicing and revisions)
• The ability to translate the real data processing code into more high-level, user-friendly expressions
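
Two of the semantic-layer capabilities described in this section, typed dependencies and revisions over time, can be modeled in a few lines. The sketch below is one possible representation, not a description of any product's data model. The `Edge` class, its fields, and the sample column names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str                        # "direct" or "indirect"
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still in effect


# Hypothetical lineage history for one target column.
EDGES = [
    Edge("orders.amount", "revenue.total", "direct", date(2022, 1, 1)),
    Edge("orders.status", "revenue.total", "indirect",
         date(2022, 1, 1), date(2023, 6, 30)),
    Edge("orders.region", "revenue.total", "indirect", date(2023, 7, 1)),
]


def lineage_as_of(edges, day, kind=None):
    """Time-slice the lineage: edges valid on `day`, optionally of one kind."""
    return [
        e for e in edges
        if e.valid_from <= day
        and (e.valid_to is None or day <= e.valid_to)
        and (kind is None or e.kind == kind)
    ]


early = lineage_as_of(EDGES, date(2023, 1, 15), kind="indirect")
late = lineage_as_of(EDGES, date(2024, 1, 15), kind="indirect")
print([e.source for e in early], [e.source for e in late])
```

Here the filter feeding `revenue.total` changed in mid-2023, so the same question about indirect dependencies gets a different answer depending on the date it is asked for.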


3. Activating Integrations

Historically, metadata catalogs have focused on passively storing static metadata, overlooking its dynamic properties.

In activating metadata, the ultimate task is integrating it into all data management processes, so you can proactively use this knowledge to speed up processes and reduce manual tasks. A data catalog, data privacy, or ETL/ELT tool that has access to detailed, accurate, semantically rich data lineage opens new doors for activating additional metadata.

Activating integrations saves time. You won’t have to spend hours manually analyzing and extracting data. This ability to automate is why so many successful organizations deploy enterprise-wide data lineage platforms and integrate them with other parts of their data infrastructure.

Strategies for activating data lineage metadata can differ based on the domain it’s being integrated into, but for every domain, you want to ask the same set of questions.

• What processes and tools are currently in use?
• What is still being done manually? Why hasn’t it been automated yet?
• How can accurate, detailed, semantically rich metadata help with automation?
• Is there anything that would have a major impact that we are not doing today but we could do if it were automated?
• Is there a way to use automation to redesign and improve an existing process?
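
As one hedged example of what activation could look like in a data privacy process, the sketch below uses lineage edges to propagate a sensitivity tag from a source column to everything derived from it, instead of classifying each downstream column by hand. The lineage map, the column names, and the `PII` tag are all hypothetical.

```python
# Hypothetical column-level lineage (source column -> target columns).
LINEAGE = {
    "crm.customers.email": ["stg.customers.email"],
    "stg.customers.email": ["dwh.dim_customer.email", "mart.campaign.contact"],
    "erp.invoices.amount": ["dwh.fact_revenue.amount"],
}

# Tags assigned by hand, at the source only.
SOURCE_TAGS = {"crm.customers.email": {"PII"}}


def propagate_tags(lineage, source_tags):
    """Push each tag along the lineage edges to all downstream columns."""
    tags = {col: set(t) for col, t in source_tags.items()}
    stack = list(source_tags)
    while stack:
        col = stack.pop()
        for target in lineage.get(col, []):
            existing = tags.setdefault(target, set())
            new = tags[col] - existing
            if new:  # revisit a column only when it gained a tag
                existing |= new
                stack.append(target)
    return tags


tags = propagate_tags(LINEAGE, SOURCE_TAGS)
for column in sorted(tags):
    print(column, sorted(tags[column]))
```

The quality of the result depends entirely on the lineage being accurate and complete: a missing edge means an untagged sensitive column, and a false edge means an unnecessary restriction.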


How Manta Can Help with Data Lineage

As a modern organization, you process high volumes of data. Your IT environment will only increase in complexity, and your IT team is struggling to keep up. You need to get your data systems under control and find a way to stay efficient despite this skyrocketing complexity. You need data lineage, and the bare minimum ‘good enough’ data lineage that comes with your data catalog can lead to costly updates later.

Manta has helped nearly one hundred organizations realize the benefits of data lineage. We bring intelligence to metadata management by providing an automated solution that helps you drive productivity, gain trust in your data, and accelerate digital transformation.

The Manta platform includes unique features to get the most value out of your lineage, with more than 50 out-of-the-box, fully automated scanners. In addition, Manta works alongside the most popular data catalogs; our platform integrates with catalogs like Collibra, Informatica, Alation, and more.

Don’t wait. Realize the benefits of automated data lineage today.

Schedule a demo with a Manta engineer to learn more.
