Big Book of Data Engineering 2nd Edition Final
With all-new content
Contents

4.5 Rivian
4.6 AT&T
Section 01: Introduction to Data Engineering on Databricks
Organizations realize the value data plays as a strategic asset for various business-related initiatives, such as growing revenues, improving the customer experience, operating efficiently or improving a product or service. However, accessing and managing data for these initiatives has become increasingly complex. Most of the complexity has arisen with the explosion of data volumes and data types, with organizations amassing an estimated 80% of data in unstructured and semi-structured format. As the collection of data continues to increase, 73% of the data goes unused for analytics or decision-making. In order to try and decrease this percentage and make more data usable, data engineering teams are responsible for building data pipelines to efficiently and reliably deliver data. But the process of building these complex data pipelines comes with a number of difficulties:

• In order to get data into a data lake, data engineers are required to spend immense time hand-coding repetitive data ingestion tasks

How can Databricks help?

With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting, transforming, processing, scheduling and delivering data. The Lakehouse Platform automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake so data engineers can focus on quality and reliability to drive valuable insights.

[Figure 1: The Databricks Lakehouse Platform unifies your data, analytics and AI on one common platform for all your data use cases. The diagram shows one platform supporting multiple personas: BI & Data Warehousing, Data Engineering, Data Streaming, and Data Science & ML.]
• Easily create and manage multiple tasks with dependencies via UI, API or from your IDE
• Have full observability to all workflow runs and get alerted when issues arise
• Leverage high reliability of 99.95% uptime
• Use performance optimization clusters that parallelize jobs and minimize data movement with cluster reuse

[Figure 2: A unified set of tools for real-time data processing. The diagram shows machine and application logs, application events and operational apps flowing through Spark Structured Streaming into Delta Lake for open and reliable data storage, feeding real-time applications such as alerts, fraud detection and dynamic pricing. ©2023 Databricks Inc. — All rights reserved]
Data quality validation and monitoring

Improve data reliability throughout the data lakehouse so data teams can confidently trust the information for downstream initiatives by:

• Fault tolerant mechanisms to consistently recover the state of data
• The ability to automatically track progress from the source with checkpointing
• The ability to automatically recover and restore the data pipeline state
• Complete, parameterized and automated deployment for the continuous delivery of data
• End-to-end orchestration, testing and monitoring of data pipeline deployment across all major cloud providers
Unified governance

With Unity Catalog, data engineering and governance teams benefit from an enterprise-wide data catalog with a single interface to manage permissions, centralize auditing, automatically track data lineage down to the column level, and share data across platforms, clouds and regions.
Section 2.1
As solutions architects, we work closely with customers every day to help them get the best performance out of their jobs on Databricks — and we often end up giving the same advice. It’s not uncommon to have a conversation with a customer and get double, triple, or even more performance with just a few tweaks. So what’s the secret? How are we doing this? Here are the top 5 things we see that can make a huge impact on the performance customers get from Databricks.

Here’s a TLDR:

• Use larger clusters. It may sound obvious, but this is the number one problem we see. It’s actually not any more expensive to use a large cluster for a workload than it is to use a smaller one. It’s just faster. If there’s anything you should take away from this article, it’s this. Read section 1. Really.
• Use Photon, Databricks’ new, super-fast execution engine. Read section 2 to learn more. You won’t regret it.
• Clean out your configurations. Configurations carried from one Apache Spark™ version to the next can cause massive problems. Clean up! Read section 3 to learn more.
• Use Delta Caching. There’s a good chance you’re not using caching correctly, if at all. See Section 4 to learn more.
• Be aware of lazy evaluation. If this doesn’t mean anything to you and you’re writing Spark code, jump to section 5.
• Bonus tip! Table design is super important. We’ll go into this in a future blog, but for now, check out the guide on Delta Lake best practices.

1. Give your clusters horsepower!

This is the number one mistake customers make. Many customers create tiny clusters of two workers with four cores each, and it takes forever to do anything. The concern is always the same: they don’t want to spend too much money on larger clusters. Here’s the thing: it’s actually not any more expensive to use a large cluster for a workload than it is to use a smaller one. It’s just faster.
The key is that you’re renting the cluster for the length of the workload. So, if
you spin up that two worker cluster and it takes an hour, you’re paying for those
workers for the full hour. However, if you spin up a four worker cluster and it takes
only half an hour, the cost is actually the same! And that trend continues as long
as there’s enough work for the cluster to do.
Number of Workers | Cost Per Hour | Length of Workload (hours) | Cost of Workload
        1         |      $1       |             2              |        $2
        2         |      $2       |             1              |        $2
        4         |      $4       |            0.5             |        $2
        8         |      $8       |           0.25             |        $2
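The arithmetic behind the table is worth making explicit: total cost is workers × cost per worker-hour × hours, and doubling the workers halves the hours (assuming the workload parallelizes perfectly), so the cost stays constant. A quick sketch of that model:

```python
# Cost model from the table above: doubling workers halves runtime,
# so total workload cost stays constant (assuming perfect scaling).
COST_PER_WORKER_HOUR = 1.00  # $1 per worker per hour, as in the table

def workload_cost(workers: int, baseline_hours: float = 2.0) -> float:
    # baseline_hours: time the job takes on a single worker
    hours = baseline_hours / workers  # perfect parallel speedup
    return workers * COST_PER_WORKER_HOUR * hours

for w in (1, 2, 4, 8):
    print(w, workload_cost(w))  # cost is $2 for every cluster size
```

Real workloads don’t scale perfectly forever, which is why the trend only continues “as long as there’s enough work for the cluster to do.”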
Notice that the total cost of the workload stays the same while the real-world time it takes for the job to run drops significantly. So, bump up your Databricks cluster specs and speed up your workloads without spending any more money. It can’t really get any simpler than that.

2. Use Photon

Our colleagues in engineering have rewritten the Spark execution engine in C++ and dubbed it Photon. The results are impressive! Beyond the obvious improvements due to running the engine in native code, they’ve also made use of CPU-level performance features and better memory management. On top of this, they’ve rewritten the Parquet writer in C++. So this makes writing to Parquet and Delta (based on Parquet) super fast as well!

But let’s also be clear about what Photon is speeding up. It improves computation speed for any built-in functions or operations, as well as writes to Parquet or Delta. So joins? Yep! Aggregations? Sure! ETL? Absolutely! That UDF (user-defined function) you wrote? Sorry, but it won’t help there. The job that’s spending most of its time reading from an ancient on-prem database? Won’t help there either, unfortunately.
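Photon is enabled per cluster rather than per query. As a sketch (field values are illustrative; check the Databricks Clusters API documentation for your cloud and runtime), a cluster created through the API opts in with the runtime_engine field:

```json
{
  "cluster_name": "photon-etl",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "runtime_engine": "PHOTON"
}
```

The same choice is available as a toggle when creating a cluster in the UI.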
The good news is that it helps where it can. So even if part of your job can’t be sped up, it will speed up the other parts. Also, most jobs are written with the native operations and spend a lot of time writing to Delta, and Photon helps a lot there. So give it a try. You may be amazed by the results!

3. Clean out old configurations

You know those Spark configurations you’ve been carrying along from version to version and no one knows what they do anymore? They may not be harmless. We’ve seen jobs go from running for hours down to minutes simply by cleaning out old configurations. There may have been a quirk in a particular version of Spark, a performance tweak that has not aged well, or something pulled off some blog somewhere that never really made sense. At the very least, it’s worth revisiting your Spark configurations if you’re in this situation. Often the default configurations are the best, and they’re only getting better. Your configurations may be holding you back.

4. Use Delta Caching

If you’re using Databricks SQL Endpoints you’re in luck. Those have caching on by default. In fact, we recommend using CACHE SELECT * FROM table to preload your “hot” tables when you’re starting an endpoint. This will ensure blazing fast speeds for any queries on those tables.

If you’re using regular clusters, be sure to use the i3 series on Amazon Web Services (AWS), L series or E series on Azure Databricks, or n2 in GCP. These will all have fast SSDs and caching enabled by default.

Of course, your mileage may vary. If you’re doing BI, which involves reading the same tables over and over again, caching gives an amazing boost. However, if you’re simply reading a table once and writing out the results as in some ETL jobs, you may not get much benefit. You know your jobs better than anyone. Go forth and conquer.
5. Be aware of lazy evaluation

When you’re writing Spark code like select, groupBy, filter, etc., you’re really building an execution plan. You’ll notice the code returns almost immediately when you run these functions. That’s because it’s not actually doing any computation. So even if you have petabytes of data, it will return in less than a second.

However, once you go to write your results out you’ll notice it takes longer. This is due to lazy evaluation. It’s not until you try to display or write results that your execution plan is actually run.

# Build an execution plan.
# This returns in less than a second but does no work
df2 = (df
    .join(...)
    .select(...)
    .filter(...)
)

# Now run the execution plan to get results
df2.display()

# Unfortunately this will run the plan again, including filtering, joining, etc.
df2.display()

# So will this…
df2.count()
The developer of this code may very well be thinking that they’re just printing out results three times, but what they’re really doing is kicking off the same processing three times. Oops. That’s a lot of extra work. This is a very common mistake we run into. So why is there lazy evaluation, and what do we do about it?

In short, processing with lazy evaluation is way faster than without it. Databricks/Spark looks at the full execution plan and finds opportunities for optimization that can reduce processing time by orders of magnitude. So that’s great, but how do we avoid the extra computation? The answer is pretty straightforward: save computed results you will reuse.

Let’s look at the same block of code again, but this time let’s avoid the recomputation:

# save it
df2.write.save(path)

# load it back in
df3 = spark.read.load(path)

# now use it; this does not rerun the join/filter plan
df3.display()

# nor does this
df3.count()

This works especially well when Delta Caching is turned on. In short, you benefit greatly from lazy evaluation, but it’s something a lot of customers trip over. So be aware of its existence and save results you reuse in order to avoid unnecessary computation.

Start experimenting with these free Databricks notebooks.
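The underlying idea can be illustrated in plain Python (a toy analogy, not Spark itself): a “plan” object records transformations cheaply, and every action replays the whole plan from scratch, so reusing a saved result avoids the replays.

```python
# Toy illustration of lazy evaluation (not Spark): transformations build
# a plan cheaply; each action replays the whole plan over the data.
class LazyPlan:
    def __init__(self, data, ops=None):
        self.data = list(data)
        self.ops = ops if ops is not None else []
        self.executions = 0  # how many times this plan has actually run

    def filter(self, predicate):
        # Cheap: just records the step, no data is touched
        return LazyPlan(self.data, self.ops + [predicate])

    def collect(self):
        # Expensive: replays every recorded step over the data
        self.executions += 1
        out = self.data
        for predicate in self.ops:
            out = [x for x in out if predicate(x)]
        return out

plan = LazyPlan(range(10)).filter(lambda x: x % 2 == 0)
plan.collect()
plan.collect()          # replays the whole plan again
saved = plan.collect()  # a third full execution
# plan.executions is now 3; reusing `saved` instead would have cost one
```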
Section 2.2

October 6, 2022
In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore, PySpark UDFs offer more flexibility since they enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state “what to do”; PySpark, as a sandbox, encapsulates “how to do it.” That makes PySpark easier to use, but it can be difficult to identify performance bottlenecks and apply custom optimizations.

To address the difficulty mentioned above, PySpark supports various profiling tools, which are all based on cProfile, one of the standard Python profiler implementations. PySpark profilers provide information such as the number of function calls, total time spent in the given function, and filename, as well as line number to help navigation. That information is essential to exposing tight loops in your PySpark programs, and allowing you to make performance improvement decisions.

Driver profiling

PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark is a regular Python process; thus, we can profile it as a normal Python program using cProfile as illustrated below:

import cProfile

with cProfile.Profile() as pr:
    # Your code
    ...

pr.print_stats()

Workers profiling

Executors are distributed on worker nodes in the cluster, which introduces complexity because we need to aggregate profiles. Furthermore, a Python worker process is spawned per executor for PySpark UDF execution, which makes the profiling more intricate.
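To make the driver-side pattern concrete, here is a self-contained run (the profiled function is a stand-in for whatever your driver code does; cProfile.Profile as a context manager requires Python 3.8+):

```python
import cProfile
import io
import pstats

def build_lookup(n: int) -> dict:
    # Stand-in for driver-side work you want to profile
    return {i: str(i) for i in range(n)}

with cProfile.Profile() as pr:
    lookup = build_lookup(100_000)

# Print the hottest functions by cumulative time
stream = io.StringIO()
pstats.Stats(pr, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```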
The UDF profiler, which is introduced in Spark 3.3, overcomes all those obstacles and becomes a major tool to profile workers for PySpark applications. We’ll illustrate how to use the UDF profiler with a simple Pandas UDF example.

Later, we will group by the id column, which results in 8 groups with 1,000 rows per group. The Pandas UDF plus_one is then created and applied as shown below:

import pandas as pd

Note that plus_one takes a pandas DataFrame and returns another pandas DataFrame. For each group, all columns are passed together as a pandas DataFrame to the plus_one UDF, and the returned pandas DataFrames are combined into a PySpark DataFrame.

Executing the example above and running sc.show_profiles() prints the following profile. The profile below can also be dumped to disk by sc.dump_profiles(path).

The UDF id in the profile (271, highlighted above) matches that in the Spark plan for res. The Spark plan can be shown by calling res.explain().
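The full Spark-side code is not reproduced in this excerpt, but the behavior of the row-by-row UDF body described below can be checked locally with plain pandas (this sketch assumes pandas is installed; the DataFrame shape mirrors one 1,000-row group, and the column names are illustrative):

```python
import pandas as pd

def plus_one(pdf: pd.DataFrame) -> pd.DataFrame:
    # Row-by-row: one Python-level call per row, high invocation overhead
    return pdf.apply(lambda x: x + 1, axis=1)

def plus_one_optimized(pdf: pd.DataFrame) -> pd.DataFrame:
    # Vectorized: a single pandas operation over the whole frame
    return pdf + 1

group = pd.DataFrame({"id": [3] * 1000, "v": range(1000)})
slow = plus_one(group)
fast = plus_one_optimized(group)
assert slow.equals(fast)  # same result, far fewer Python calls
```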
The first line in the profile’s body indicates the total number of calls that were monitored. The column heading includes ncalls for the number of calls, tottime for the time spent in the given function itself, cumtime for the cumulative time including sub-functions, the per-call averages, and filename:lineno(function) to help navigation.

Digging into the column details: plus_one is triggered once per group, 8 times in total; _arith_method of pandas Series is called once per row, 8,000 times in total. pandas.DataFrame.apply applies the function lambda x: x + 1 row by row, thus suffering from high invocation overhead.

We can reduce such overhead by substituting the pandas.DataFrame.apply with pdf + 1, which is vectorized in pandas. The optimized Pandas UDF looks as follows:

import pandas as pd

def plus_one_optimized(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf + 1

The updated profile is as shown below. Compared with the original profile:

• Arithmetic operations went from 8,000 calls to 8 calls
• Total function calls went from 2,898,160 calls to 2,384 calls
• Total execution time went from 2.300 seconds to 0.004 seconds

The short example above demonstrates how the UDF profiler helps us deeply understand the execution, identify the performance bottleneck and enhance the overall performance of the user-defined function.

The UDF profiler was implemented based on the executor-side profiler, which is designed for the PySpark RDD API. The executor-side profiler is available in all active Databricks Runtime versions.

Both the UDF profiler and the executor-side profiler run on Python workers. They are controlled by the spark.python.profile Spark configuration, which is false by default. We can enable it on a Databricks Runtime cluster by setting spark.python.profile to true in the cluster’s Spark config.

Conclusion

PySpark profilers are implemented based on cProfile; thus, the profile reporting relies on the Stats class. Spark Accumulators also play an important role when collecting profile reports from Python workers.

Powerful profilers are provided by PySpark in order to identify hot loops and suggest potential improvements. They are easy to use and critical to enhance the performance of PySpark programs. The UDF profiler, which is available starting from Databricks Runtime 11.0 (Spark 3.3), overcomes all the technical challenges and brings insights to user-defined functions.
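Since the profile reporting relies on the pstats.Stats class, dumped profiles can be reloaded and re-sorted offline. A local stdlib sketch of that round trip (the profiled function and path are illustrative):

```python
import cProfile
import os
import pstats
import tempfile

def hot_loop() -> int:
    # Stand-in for a tight loop worth profiling
    total = 0
    for i in range(200_000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

# Dump the raw stats to disk, then reload with the Stats class
path = os.path.join(tempfile.mkdtemp(), "profile.stats")
profiler.dump_stats(path)

stats = pstats.Stats(path)
stats.sort_stats("tottime")  # re-sort offline, e.g., by own time
```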
Section 2.3: Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka

August 9, 2022
Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach for creating reliable data pipelines and fully manages the underlying infrastructure at scale for batch and streaming data. Many use cases require actionable insights derived from near real-time data. Delta Live Tables enables low-latency streaming data pipelines to support such use cases by directly ingesting data from event buses like Apache Kafka, AWS Kinesis, Confluent Cloud, Amazon MSK, or Azure Event Hubs.

This article will walk through using DLT with Apache Kafka while providing the required Python code to ingest streams. The recommended system architecture will be explained, and related DLT settings worth considering will be explored along the way.

Streaming platforms

Event buses or message buses decouple message producers from consumers. A popular streaming use case is the collection of click-through data from users navigating a website where every user interaction is stored as an event in Apache Kafka. The event stream from Kafka is then used for real-time streaming data analytics. Multiple message consumers can read the same data from Kafka and use the data to learn about audience interests, conversion rates, and bounce reasons. The real-time, streaming event data from the user interactions often also needs to be correlated with actual purchases stored in a billing database.

Apache Kafka

Apache Kafka is a popular open source event bus. Kafka uses the concept of a topic, an append-only distributed log of events where messages are buffered for a certain amount of time. Although messages in Kafka are not deleted once they are consumed, they are also not stored indefinitely. The message retention for Kafka can be configured per topic and defaults to 7 days. Expired messages will be deleted eventually.

This article is centered around Apache Kafka; however, the concepts discussed also apply to many other event buses or messaging systems.
When developing DLT with Python, the @dlt.table decorator is used to create a Delta Live Table. To ensure the data quality in a pipeline, DLT uses Expectations, which are simple SQL constraints clauses that define the pipeline’s behavior with invalid records.

Since streaming workloads often come with unpredictable data volumes, Databricks employs enhanced autoscaling for data flow pipelines to minimize the overall end-to-end latency while reducing cost by shutting down unnecessary infrastructure.

Delta Live Tables are fully recomputed, in the right order, exactly once for each pipeline run.

Direct ingestion from streaming engines

Delta Live Tables written in Python can directly ingest data from an event bus like Kafka using Spark Structured Streaming. You can set a short retention period for the Kafka topic to avoid compliance issues, reduce costs and then benefit from the cheap, elastic and governable storage that Delta provides.

As a first step in the pipeline, we recommend ingesting the data as is to a Bronze (raw) table and avoiding complex transformations that could drop important data. Like any Delta table, the Bronze table will retain the history and allow you to perform GDPR and other compliance tasks.

When writing DLT pipelines in Python, you use the @dlt.table annotation to create a DLT table. There is no special attribute to mark streaming DLTs in Python; simply use spark.readStream() to access the stream. Example code for creating a DLT table with the name kafka_bronze that is consuming data from a Kafka topic looks as follows:

import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

TOPIC = "tracker-events"
KAFKA_BROKER = spark.conf.get("KAFKA_SERVER")

# subscribe to TOPIC at KAFKA_BROKER
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .load()
)

@dlt.table(table_properties={"pipelines.reset.allowed": "false"})
def kafka_bronze():
    return raw_kafka_events

Note that event buses typically expire messages after a certain period of time, whereas Delta is designed for infinite retention. This might lead to the effect that source data on Kafka has already been deleted when running a full refresh for a DLT pipeline. In this case, not all historic data could be backfilled from the messaging platform, and data would be missing in DLT tables. To prevent dropping data, use the following DLT table property:

pipelines.reset.allowed=false

Setting pipelines.reset.allowed to false prevents refreshes to the table but does not prevent incremental writes to the tables or new data from flowing into the table.

Checkpointing

If you are an experienced Spark Structured Streaming developer, you will notice the absence of checkpointing in the above code. In Spark Structured Streaming, checkpointing is required to persist progress information about what data has been successfully processed and, upon failure, this metadata is used to restart a failed query exactly where it left off.

Whereas checkpoints are necessary for failure recovery with exactly-once guarantees in Spark Structured Streaming, DLT handles state automatically without any manual configuration or explicit checkpointing required.

A DLT pipeline can consist of multiple notebooks, but one DLT notebook is required to be written entirely in either SQL or Python (unlike other Databricks notebooks where you can have cells of different languages in a single notebook).

Now, if your preference is SQL, you can code the data ingestion from Apache Kafka in one notebook in Python and then implement the transformation logic of your data pipelines in another notebook in SQL.
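The kafka_silver example in the next section parses the Kafka value column with an event_schema that is not reproduced in this excerpt. Since from_json() also accepts a DDL-formatted string, a hedged sketch could look like the following (the field names are illustrative, not the original schema):

```python
# Hypothetical fitness-tracker event schema as a DDL string; from_json()
# accepts this form directly. Field names are illustrative only.
event_schema = (
    "time TIMESTAMP, "
    "device_id STRING, "
    "heart_bpm INT, "
    "kcal DOUBLE"
)

# The silver table would then parse the Kafka value column with:
#   from_json(col("value").cast("string"), event_schema)
```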
When reading data from a messaging platform, the data stream is opaque and a schema has to be provided. The Python example below shows the schema definition of events from a fitness tracker, and how the value part of the Kafka message is mapped to that schema.

def kafka_silver():
    return (
        # kafka streams are (timestamp, value)
        # value contains the kafka payload
        dlt.read_stream("kafka_bronze")
        .select(col("timestamp"), from_json(col("value").cast("string"),
                event_schema).alias("event"))
        .select("timestamp", "event.*")
    )

Reading streaming data in DLT directly from a message broker minimizes the architectural complexity and provides lower end-to-end latency since data is directly streamed from the messaging broker and no intermediary step is involved.

Auto Loader can ingest data with a single line of SQL code. The syntax to ingest JSON files into a DLT table is shown below (it is wrapped across two lines for readability).
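The SQL snippet itself did not survive in this excerpt. A sketch of the DLT Auto Loader syntax (the table name and path are illustrative; see the Databricks Auto Loader documentation for the exact form) would look like:

```sql
-- Illustrative: ingest JSON files incrementally with Auto Loader
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
AS SELECT * FROM cloud_files("/data/events", "json")
```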
Note that Auto Loader itself is a streaming data source and all newly arrived files will be processed exactly once, hence the streaming keyword for the raw table that indicates data is ingested incrementally to that table.

Azure Event Hubs

For Azure Event Hubs settings, check the official documentation at Microsoft and the article Delta Live Tables recipes: Consuming from Azure Event Hubs.
Section 2.4
Releasing any data pipeline or application into a production state requires planning, testing, monitoring, and maintenance. Streaming pipelines are no different in this regard; in this blog we present some of the most important considerations for deploying streaming pipelines and applications to a production environment.

At Databricks, we offer two different ways of building and running streaming pipelines and applications — Delta Live Tables (DLT) and Databricks Workflows. DLT is our flagship, fully managed ETL product that supports both batch and streaming pipelines. It offers declarative development, automated operations, data quality, advanced observability capabilities, and more. Workflows enable customers to run Apache Spark™ workloads in Databricks’ optimized runtime environment (i.e., Photon) with access to unified governance (Unity Catalog) and storage (Delta Lake). Regarding streaming workloads, both DLT and Workflows share the same core streaming engine — Spark Structured Streaming. In the case of DLT, customers program against the DLT API and DLT uses the Structured Streaming engine under the hood. In the case of Jobs, customers program against the Spark API directly.

The recommendations in this blog post are written from the Structured Streaming engine perspective, most of which apply to both DLT and Workflows (although DLT does take care of some of these automatically, like Triggers and Checkpoints). We group the recommendations under the headings “Before Deployment” and “After Deployment” to highlight when these concepts will need to be applied, and we are releasing this series with that split between the two. There will be additional deep-dive content for some of the sections beyond as well. We recommend reading all sections before beginning work to productionalize a streaming pipeline or application, and revisiting these recommendations as you promote it from dev to QA and eventually production.

Before deployment

There are many things you need to consider when creating your streaming application to improve the production experience. Some of these topics, like unit testing, checkpoints, triggers, and state management, will determine how your streaming application performs. Others, like naming conventions and how many streams to run on which clusters, have more to do with managing multiple streaming applications in the same environment.
Unit testing

The cost associated with finding and fixing a bug goes up exponentially the farther along you get in the SDLC process, and a Structured Streaming application is no different. When you’re turning that prototype into a hardened production pipeline you need a CI/CD process with built-in tests. So how do you create those tests?

At first you might think that unit testing a streaming pipeline requires something special, but that isn’t the case. The general guidance for streaming pipelines is no different than guidance you may have heard for Spark batch jobs. It starts by organizing your code so that it can be unit tested effectively:

• Divide your code into testable chunks
• Organize your business logic into functions calling other functions. If you have a lot of logic in a foreachBatch or you’ve implemented mapGroupsWithState or flatMapGroupsWithState, organize that code into multiple functions that can be individually tested.
• Do not code in dependencies on the global state or external systems
• Any function manipulating a DataFrame or data set should be organized to take the DataFrame/data set/configuration as input and output the DataFrame/data set

Once your code is separated out in a logical manner you can implement unit tests for each of your functions. Spark-agnostic functions can be tested like any other function in that language. For testing UDFs and functions with DataFrames and data sets, there are multiple Spark testing frameworks available. These frameworks support all of the DataFrame/data set APIs so that you can easily create input, and they have specialized assertions that allow you to compare DataFrame content and schemas. Some examples are:

• The built-in Spark test suite, designed to test all parts of Spark
• spark-testing-base, which has support for both Scala and Python
• spark-fast-tests, for testing Scala Spark 2 & 3
• chispa, a Python version of spark-fast-tests

Code examples for each of these libraries can be found here.

But wait! I’m testing a streaming application here — don’t I need to make streaming DataFrames for my unit tests? The answer is no; you do not! Even though a streaming DataFrame represents a data set with no defined ending, when functions are executed on it they are executed on a microbatch — a discrete set of data. You can use the same unit tests that you would use for a batch application, for both stateless and stateful streams. One of the advantages of Structured Streaming over other frameworks is the ability to use the same transformation code for both streaming and with other batch operations for the same sink. This allows you to simplify some operations, like backfilling data, for example, where rather than trying to sync the logic between two different applications, you can just modify the input sources and write to the same destination. If the sink is a Delta table, you can even do these operations concurrently if both processes are append-only operations.
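As a minimal illustration of the “Spark-agnostic functions can be tested like any other function” point, business logic factored out of a stream can be exercised with plain asserts (the function and field names here are hypothetical, not from the article):

```python
# Hypothetical business-logic helper factored out of a streaming job so it
# can be unit tested without Spark: it works on plain dictionaries.
def enrich_event(event: dict) -> dict:
    # Pure function: input in, output out, no global state, no external systems
    enriched = dict(event)
    enriched["is_error"] = event.get("status", 200) >= 400
    return enriched

# Unit test: the same logic a foreachBatch or transformation would call
sample = {"status": 500, "path": "/checkout"}
result = enrich_event(sample)
assert result["is_error"] is True
assert sample == {"status": 500, "path": "/checkout"}  # input not mutated
```

The same function could then be applied inside a UDF or a mapped transformation, keeping the Spark wiring thin and the logic fully testable.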
Triggers
Now that you know your code works, you need to determine how often your process a microbatch in order to maximize resource utilization, but setting the
stream will look for new data. This is where triggers come in. Setting a trigger is interval longer would make sense if your stream is running on a shared cluster
one of the options for the writeStream command, and it looks like this: and you don’t want it to constantly take the cluster resources.
// Scala/Java If you do not need your stream to run continuously, either because data doesn’t
.trigger(Trigger.ProcessingTime("30 seconds"))
come that often or your SLA is 10 minutes or greater, then you can use the
# Python Trigger.Once option. This option will start up the stream, check for anything new
.trigger(processingTime='30 seconds')
since the last time it ran, process it all in one big batch, and then shut down.

In the above example, if a microbatch completes in less than 30 seconds, then the engine will wait for the rest of the time before kicking off the next microbatch. If a microbatch takes longer than 30 seconds to complete, then the engine will start the next microbatch immediately after the previous one finishes.

The two factors you should consider when setting your trigger interval are how long you expect your stream to take to process a microbatch and how often you want the system to check for new data. You can lower the overall processing latency by using a shorter trigger interval and increasing the resources available for the streaming query by adding more workers or using compute- or memory-optimized instances tailored to your application's performance. These increased resources come with increased costs, so if your goal is to minimize costs, then a longer trigger interval with less compute can work. Normally you would not set a trigger interval longer than what it would typically take for your stream to process a microbatch.

Just like with a continuously running stream, when using Trigger.Once the checkpoint that guarantees fault tolerance (see below) will guarantee exactly-once processing.

Spark has a new version of Trigger.Once called Trigger.AvailableNow. While Trigger.Once will process everything in one big batch, which depending on your data size may not be ideal, Trigger.AvailableNow will split up the data based on the maxFilesPerTrigger and maxBytesPerTrigger settings, allowing the data to be processed in multiple batches. Those settings are ignored with Trigger.Once. You can see examples for setting triggers here.

Pop quiz — how do you turn your streaming process into a batch process that automatically keeps track of where it left off with just one line of code?

Answer — change your processing time trigger to Trigger.Once/Trigger.AvailableNow! Exact same code, running on a schedule, that will neither miss nor reprocess any records.
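The fixed-interval scheduling rule described above can be illustrated with a small plain-Python sketch (this simulates the behavior of a processing-time trigger, it is not Spark code):

```python
def next_batch_delay(trigger_interval_s: float, batch_duration_s: float) -> float:
    """How long the engine idles after a microbatch finishes before
    starting the next one, under a fixed processing-time trigger."""
    if batch_duration_s < trigger_interval_s:
        # The microbatch finished early: wait out the rest of the interval.
        return trigger_interval_s - batch_duration_s
    # The microbatch overran the interval: start the next one immediately.
    return 0.0

# With a 30-second trigger, a 12s microbatch is followed by an 18s wait,
# while a 45s microbatch is followed immediately by the next one.
print(next_batch_delay(30, 12))  # 18.0
print(next_batch_delay(30, 45))  # 0.0
```

The same trade-off drives the interval choice: a shorter interval lowers latency but leaves less headroom, while a longer interval saves compute at the cost of freshness.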
You name your children, you name your pets, now it's time to name your streams. There's a writeStream option called .queryName that allows you to provide a friendly name for your stream. Why bother? Well, suppose you don't name it. In that case, all you'll have to go on in the Structured Streaming tab in the Spark UI is the string <no name> and the unintelligible guid that is automatically generated as the stream's unique identifier. If you have more than one stream running on a cluster, and all of them have <no name> and unintelligible strings as identifiers, how do you find the one you want? If you're exporting metrics, how do you tell which is which?

Make it easy on yourself, and name your streams. When you're managing them in production you'll be glad you did, and while you're at it, go and name your batch queries in any foreachBatch() code you have.

Fault tolerance

How does your stream recover from being shut down? There are a few different cases where this can come into play, like cluster node failures or intentional halts, but the solution is to set up checkpointing. Checkpoints with write-ahead logs provide a degree of protection from your streaming application being interrupted, ensuring it will be able to pick up again where it last left off.

Checkpoints store the current offsets and state values (e.g., aggregate values) for your stream. Checkpoints are stream specific, so each should be set to its own location. Doing this will let you recover more gracefully from shutdowns, failures from your application code, or unexpected cloud provider failures or limitations.

To configure checkpoints, add the checkpointLocation option to your stream definition:

// Scala/Java/Python
streamingDataFrame.writeStream
  .format("delta")
  .option("path", "")
  .queryName("TestStream")
  .option("checkpointLocation", "")
  .start()

To keep it simple — every time you call .writeStream, you must specify the checkpoint option with a unique checkpoint location. Even if you're using foreachBatch and the writeStream itself doesn't specify a path or table option, you must still specify that checkpoint. It's how Spark Structured Streaming gives you hassle-free fault tolerance.

Efforts to manage the checkpointing in your stream should be of little concern in general. As Tathagata Das has said, "The simplest way to perform streaming analytics is not having to reason about streaming at all." That said, one setting deserves mention, as questions around the maintenance of checkpoint files come up occasionally. Though it is an internal setting that doesn't require direct configuration, the setting spark.sql.streaming.minBatchesToRetain (default 100) controls the number of checkpoint files that get created. Basically, the number of files will be roughly this number times two, as there is a file created noting the offsets at the beginning of the batch (offsets, a.k.a. write-ahead logs) and another on completing the batch (commits). The number of files is checked periodically for cleanup as part of the internal processes. This simplifies at least one aspect of long-term streaming application maintenance for you.
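As a back-of-the-envelope illustration of those file counts (a plain-Python sketch, not Spark internals; the real cleanup runs periodically, so the actual count fluctuates around this figure):

```python
def approx_checkpoint_files(batches_completed: int,
                            min_batches_to_retain: int = 100) -> int:
    """Rough count of offset/commit files in a checkpoint: each retained
    microbatch contributes one 'offsets' file (the write-ahead log) and
    one 'commits' file, and older pairs are eventually cleaned up."""
    retained = min(batches_completed, min_batches_to_retain)
    return 2 * retained

print(approx_checkpoint_files(40))    # 80  -- nothing old enough to clean yet
print(approx_checkpoint_files(5000))  # 200 -- ~100 retained batches x 2 files
```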
It is also important to note that some changes to your application code can invalidate the checkpoint. Checking for any of these changes during code reviews before deployment is recommended. You can find examples of changes where this can happen in Recovery Semantics after Changes in a Streaming Query. Suppose you want to look at checkpointing in more detail or consider whether asynchronous checkpointing might improve the latency in your streaming application. In that case, these are covered in greater depth in Speed Up Streaming Queries With Asynchronous State Checkpointing.

State management and RocksDB

Stateful streaming applications are those where current records may depend on previous events, so Spark has to retain data in between microbatches. The data it retains is called state, and Spark will store it in a state store and read, update and delete it during each microbatch. Typical stateful operations are streaming aggregations, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, or flatMapGroupsWithState. Some common types of examples where you'll need to think about your application state could be sessionization or hourly aggregation using group by methods to calculate business metrics. Each record in the state store is identified by a key that is used as part of the stateful computation, and the more unique keys that are required, the larger the amount of state data that will be stored.

When the amount of state data needed to enable these stateful operations grows large and complex, it can degrade your workloads' performance, leading to increased latency or even failures. A typical indicator of the state store being the culprit of added latency is large amounts of time spent in garbage collection (GC) pauses in the JVM. If you are monitoring the microbatch processing time, this could look like a continual increase or wildly varying processing time across microbatches.

The default configuration for a state store, which is sufficient for most general streaming workloads, is to store the state data in the executors' JVM memory. Large numbers of keys (typically millions, see the Monitoring & Instrumentation section in part 2 of this blog) can add excessive memory pressure on the machine memory and increase the frequency of hitting these GC pauses as it tries to free up resources.

On the Databricks Runtime (now also supported in Apache Spark 3.2+) you can use RocksDB as an alternative state store provider to alleviate this source of memory pressure. RocksDB is an embeddable persistent key-value store for fast storage. It features high performance through a log-structured database engine written entirely in C++ and optimized for fast, low-latency storage.

Leveraging RocksDB as the state store provider still uses machine memory but no longer occupies space in the JVM, and makes for a more efficient state management system for large amounts of keys. This doesn't come for free, however, as it introduces an extra step in processing every microbatch. Introducing RocksDB shouldn't be expected to reduce latency except when it is related to memory pressure from state data storage in the JVM. The RocksDB-backed state store still provides the same degree of fault tolerance as the regular state storage, as it is included in the stream checkpointing.
RocksDB configuration, like checkpoint configuration, is minimal by design and so you only need to declare it in your overall Spark configuration:

spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")

If you are monitoring your stream using the StreamingQueryListener class, then you will also notice that RocksDB metrics will be included in the stateOperators field. For more detailed information on this, see the RocksDB State Store Metrics section of "Structured Streaming in Production."

It's worth noting that large numbers of keys can have other adverse impacts in addition to raising memory consumption, especially with unbounded or non-expiring state keys. With or without RocksDB, the state from the application also gets backed up in checkpoints for fault tolerance. So it makes sense that if you have state files being created so that they will not expire, you will keep accumulating files in the checkpoint, increasing the amount of storage required and potentially the time to write it or recover from failures as well. For the data in memory (see the Monitoring & Instrumentation section in part 2 of this blog) this situation can lead to somewhat vague out-of-memory errors, and for the checkpointed data written to cloud storage you might observe unexpected and unreasonable growth. Unless you have a business need to retain streaming state for all the data that has been processed (and that is rare), read the Spark Structured Streaming documentation and make sure to implement your stateful operations so that the system can drop state records that are no longer needed (pay close attention to dropDuplicates and stream-stream joins).

Running multiple streams on a cluster

Once your streams are fully tested and configured, it's time to figure out how to organize them in production. It's a common pattern to stack multiple streams on the same Spark cluster to maximize resource utilization and save cost. This is fine to a point, but there are limits to how much you can add to one cluster before performance is affected. The driver has to manage all of the streams running on the cluster, and all streams will compete for the same cores across the workers. You need to understand what your streams are doing and plan your capacity appropriately to stack effectively.

Here is what you should take into account when you're planning on stacking multiple streams on the same cluster:

• Make sure your driver is big enough to manage all of your streams. Is your driver struggling with high CPU utilization and garbage collection? That means it's struggling to manage all of your streams. Either reduce the number of streams or increase the size of your driver.

• Consider the amount of data each stream is processing. The more data you are ingesting and writing to a sink, the more cores you will need in order to maximize your throughput for each stream. You'll need to reduce the number of streams or increase the number of workers depending on how much data is being processed. For sources like Kafka you will need to configure how many cores are being used to ingest with the minPartitions option if you don't have enough cores for all of the partitions across all of your streams.
• Consider the complexity and data volume of your streams. If all of the streams are doing minimal manipulation and just appending to a sink, then each stream will need fewer resources per microbatch and you'll be able to stack more. If the streams are doing stateful processing or computation- or memory-intensive operations, that will require more resources for good performance and you'll want to stack fewer streams.

• Consider scheduler pools. When stacking streams they will all be contending for the same workers and cores, and one stream that needs a lot of cores will cause the other streams to wait. Scheduler pools enable you to have different streams execute on different parts of the cluster. This will enable streams to execute in parallel with a subset of the available resources.

• Consider your SLA. If you have mission-critical streams, isolate them as a best practice so lower-criticality streams do not affect them.

On Databricks we typically see customers stack between 10-30 streams on a cluster, but this varies depending on the use case. Consider the factors above so that you can have a good experience with performance, cost and maintainability.

Conclusion

Some of the ideas we've addressed here certainly deserve their own time and special treatment with a more in-depth discussion, which you can look forward to in later deep dives. However, we hope these recommendations are useful as you begin your journey or seek to enhance your production streaming experience. Be sure to continue with the next post, "Streaming in Production: Collected Best Practices, Part 2."

Review Databricks' Structured Streaming Getting Started Guide

Start experimenting with these free Databricks notebooks.
SECTION 2.5
In our two-part blog series titled "Streaming in Production: Collected Best Practices," this is the second article. Here we discuss the "After Deployment" considerations for a Structured Streaming pipeline. The majority of the suggestions in this post are relevant to both Structured Streaming jobs and Delta Live Tables (our flagship and fully managed ETL product that supports both batch and streaming pipelines).

We'll start with an introduction to these topics, followed by a deeper dive later in this blog series.

Monitoring and instrumentation (How is my application running?)

Streaming workloads should be pretty much hands-off once deployed to production. However, one thing that may sometimes come to mind is: "how is my application running?" Monitoring applications can take on different levels and forms depending on:

If you want more fine-grained metrics or to create custom actions based on these metrics as part of your code base, then the StreamingQueryListener is better aligned with what you're looking for.
If you want the Spark metrics to be reported (including machine-level traces for drivers or workers) you should use the platform's metrics sink.

Another point to consider is where you want to surface these metrics for observability. There is a Ganglia dashboard at the cluster level, integrated partner
applications like Datadog for monitoring streaming workloads, or even more open
source options you can build using tools like Prometheus and Grafana. Each
has advantages and disadvantages to consider around cost, performance, and
maintenance requirements.
Whether you have low volumes of streaming workloads where interactions in the
UI are sufficient or have decided to invest in a more robust monitoring platform,
you should know how to observe your production streaming workloads. Further
“Monitoring and Alerting” posts later in this series will contain a more thorough
discussion. In particular, we’ll see different measures on which to monitor
streaming applications and then later take a deeper look at some of the tools
you can leverage for observability.
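One simple custom action a listener-driven monitor might perform is flagging steadily growing microbatch durations, an early hint of problems like state growth or memory pressure. A hand-rolled sketch (the duration values are illustrative, not pulled from a real query):

```python
def is_trending_up(batch_durations_ms: list[int], window: int = 3) -> bool:
    """Flag a stream whose last `window` microbatch durations are each
    strictly longer than the one before -- a crude signal of growing
    latency that is worth investigating."""
    recent = batch_durations_ms[-window:]
    return len(recent) == window and all(a < b for a, b in zip(recent, recent[1:]))

print(is_trending_up([900, 950, 910, 980, 1050, 1200]))  # True
print(is_trending_up([900, 950, 910, 930, 900, 920]))    # False
```

A real implementation would pull the durations from the streaming progress events your monitoring tool of choice collects, and would likely smooth over noise rather than compare raw consecutive batches.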
frequently requested, a separate post in this series will address it. The key points that we'll focus on will be efficiency of usage and sizing.

Getting the cluster sizing right is one of the most significant differences between efficiency and wastefulness in streaming applications. This can be particularly tricky because in some cases it's difficult to estimate the full load conditions of the application in production before it's actually there. In other cases, it may be difficult due to natural variations in volume handled throughout the day, week, or year. When first deploying, it can be beneficial to oversize slightly, incurring the extra expense to avoid inducing performance bottlenecks. Utilize the monitoring tools you chose to employ after the cluster has been running for a few weeks to ensure proper cluster utilization. For example, are CPU and memory levels being used at a high level during peak load, or is the load generally small and the cluster may be downsized? Maintain regular monitoring of this and keep an eye out for changes in data volume over time; if either occurs, a cluster resize may be required to maintain cost-effective operation.

As a general guideline, you should avoid excessive shuffle operations, joins, or an excessive or extreme watermark threshold (don't exceed your needs), as each can increase the number of resources you need to run your application. A large watermark threshold will cause Structured Streaming to keep more data in the state store between batches, leading to an increase in memory requirements across the cluster. Also, pay attention to the type of VM configured — are you using memory-optimized instances for your memory-intensive stream? Compute-optimized for your computationally intensive stream? If not, look at the utilization levels for each and consider trying a machine type that could be a better fit. Newer families of servers from cloud providers with more optimal CPUs often lead to faster execution, meaning you might need fewer of them to meet your SLA.

Troubleshooting (How do I manage any problems that come up?)

The last question we ask ourselves after deployment is "how do I manage any problems that come up?" As with cost optimization, troubleshooting streaming applications in Spark often looks the same as other applications, since most of the mechanics remain the same under the hood. For streaming applications, issues usually fall into two categories — failure scenarios and latency scenarios.

Failure scenarios

Failure scenarios typically manifest with the stream stopping with an error, executors failing or a driver failure causing the whole cluster to fail. Common causes for this are:

• Too many streams running on the same cluster, causing the driver to be overwhelmed. On Databricks, this can be seen in Ganglia, where the driver node will show up as overloaded before the cluster fails.

• Too few workers in a cluster or a worker size with too small of a core-to-memory ratio, causing executors to fail with an Out Of Memory error. This can also be seen on Databricks in Ganglia before an executor fails, or in the Spark UI under the executors tab.

• Using a collect to send too much data to the driver, causing it to fail with an Out Of Memory error.
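The core-to-memory concern in the second bullet can be made concrete with a quick sanity check. The 4 GB-per-core floor below is an assumed illustrative threshold for this sketch, not an official recommendation; the right number depends on your workload:

```python
def memory_per_core_gb(worker_memory_gb: float, worker_cores: int) -> float:
    """Memory each concurrently running task can roughly draw on."""
    return worker_memory_gb / worker_cores

def risky_for_oom(worker_memory_gb: float, worker_cores: int,
                  floor_gb: float = 4.0) -> bool:
    # A worker with many cores but little memory leaves each task starved,
    # making executor Out Of Memory failures more likely under load.
    return memory_per_core_gb(worker_memory_gb, worker_cores) < floor_gb

print(risky_for_oom(16, 8))  # True  -- only 2 GB per core
print(risky_for_oom(64, 8))  # False -- 8 GB per core
```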
Latency scenarios

For latency scenarios, your stream will not execute as fast as you want or expect. A latency issue can be intermittent or constant. Too many streams or too small of a cluster can be the cause of this as well. Some other common causes are:

• Data skew — when a few tasks end up with much more data than the rest of the tasks. With skewed data, these tasks take longer to execute than the others, often spilling to disk. Your stream can only run as fast as its slowest task.

• Executing a stateful query without defining a watermark or defining a very long one will cause your state to grow very large, slowing down your stream over time and potentially leading to failure.

• Poorly optimized sink. For example, performing a merge into an over-partitioned Delta table as part of your stream.

• Stable but high latency (batch execution time). Depending on the cause, adding more workers to increase the number of cores concurrently available for Spark tasks can help. Increasing the number of input partitions and/or decreasing the load per core through batch size settings can also reduce the latency.

Just like troubleshooting a batch job, you'll use Ganglia to check cluster utilization and the Spark UI to find performance bottlenecks. There is a specific Structured Streaming tab in the Spark UI created to help monitor and troubleshoot streaming applications. On that tab each stream that is running will be listed, and you'll see either your stream name if you named your stream or <no name> if you didn't. You'll also see a stream ID that will be visible on the Jobs tab of the Spark UI so that you can tell which jobs are for a given stream.

You'll notice above we said which jobs are for a given stream. It's a common misconception that if you were to look at a streaming application in the Spark UI you would just see one job in the Jobs tab running continuously. Instead, depending on your code, you will see one or more jobs that start and complete for each microbatch. Each job will have the stream ID from the Structured Streaming tab and a microbatch number in the description, so you'll be able to tell which jobs go with which stream. You can click into those jobs to find the longest running stages and tasks, check for disk spills, and search by Job ID in the SQL tab to find the slowest queries and check their explain plans.

If you click on your stream in the Structured Streaming tab you'll see how much time the different streaming operations are taking for each microbatch, such as adding a batch, query planning and committing (see earlier screenshot of the Apache Spark Structured Streaming UI). You can also see how many rows are being processed as well as the size of your state store for a stateful stream. This can give insights into where potential latency issues are.

We will go more in-depth with troubleshooting later in this blog series, where we'll look at some of the causes and remedies for both failure scenarios and latency scenarios as we outlined above.

As we're sure you've noticed by now, the topics we discussed should be included in most production deployments.

Across the majority of industries in the world today information is needed faster than ever, but that won't be a problem for you. With Spark Structured Streaming you're set to make it happen at scale in production. Be on the lookout for more in-depth discussions on some of the topics we've covered in this blog, and in the meantime keep streaming!
SECTION 2.6
January 6, 2023
Geospatial data has been driving innovation for centuries, through use of maps, cartography and more recently through digital content. For example, the oldest map has been found etched in a piece of mammoth tusk and dates to approximately 25,000 BC. This makes geospatial data one of the oldest data sources used by society to make decisions. A more recent example, labeled as the birth of spatial analysis, is that of Charles Picquet in 1832, who used geospatial data to analyze cholera outbreaks in Paris; a couple of decades later, John Snow in 1854 followed the same approach for cholera outbreaks in London. These two individuals used geospatial data to solve one of the toughest problems of their times and in effect save countless lives. Fast-forwarding to the 20th century, the concept of Geographic Information Systems (GIS) was first introduced in 1967 in Ottawa, Canada, by the Department of Forestry and Rural Development.

Today we are in the midst of the cloud computing industry revolution — supercomputing scale available to any organization, virtually infinitely scalable for both storage and compute. Concepts like data mesh and data marketplace are emerging within the data community to address questions like platform federation and interoperability. How can we adopt these concepts for geospatial data, spatial analysis and GIS systems? By adopting the concept of data products and approaching the design of geospatial data as a product.

In this blog we will provide a point of view on how to design scalable geospatial data products that are modern and robust. We will discuss how the Databricks Lakehouse Platform can be used to unlock the full potential of geospatial data products that are one of the most valuable assets in solving the toughest problems of today and the future.

What is a data product? And how to design one?

The most broad and the most concise definition of a "data product" was coined by DJ Patil (the first U.S. Chief Data Scientist) in Data Jujitsu: The Art of Turning Data into Product: "a product that facilitates an end goal through the use of data." The complexity of this definition (as admitted by Patil himself) is needed to encapsulate the breadth of possible products, to include dashboards, reports, Excel spreadsheets, and even CSV extracts shared via emails. You might notice that the examples provided deteriorate rapidly in quality, robustness and governance.

What are the concepts that differentiate a successful product versus an unsuccessful one? Is it the packaging? Is it the content? Is it the quality of the content? Or is it only the product adoption in the market? Forbes defines the 10 must-haves of a successful product. A good framework to summarize this is through the value pyramid.
We share the belief that the FAIR principles are crucial for the design of scalable data products we can trust. To be fair, FAIR is based on common sense, so why is it key to our considerations? "What I see in FAIR is not new in itself, but what it does well is to articulate, in an accessible way, the need for a holistic approach to data improvement. This ease in communication is why FAIR is being used increasingly widely as an umbrella for data improvement — and not just in the geospatial community." — A FAIR wind sets our course for data improvement.

The FAIR concepts weave really well together with the data product design. In fact FAIR is traversing the whole product value pyramid and forms a value cycle. By adopting both the value pyramid and FAIR principles we design data products with both internal and external outlook. This promotes data reuse as opposed to data accumulation.
To further support this approach, the Federal Geographic Data Committee has
developed the National Spatial Data Infrastructure (NSDI) Strategic Plan that
covers the years 2021-2024 and was approved in November 2020. The goals
of NSDI are in essence FAIR principles and convey the same message of designing
systems that promote the circular economy of data — data products that flow
between organizations following common standards and in each step through the
data supply chain unlock new value and new opportunities. The fact that these
principles are permeating different jurisdictions and are adopted across different
regulators is a testament to the robustness and soundness of the approach.
Why do FAIR principles matter for geospatial data and geospatial data
products? FAIR is transcendent to geospatial data, it is actually transcendent
to data, it is a simple yet coherent system of guiding principles for good design
— and that good design can be applied to anything including geospatial data
and geospatial systems.
Figure 2: NSDI Strategic Goals
Grid index systems are built from the start with the scalability aspects of the
geospatial data in mind. Rather than building the trees, they define a series of
grids that cover the area of interest. In the case of H3 (pioneered by Uber),
the grid covers the area of the Earth; in the case of local grid index systems
(e.g., British National Grid) they may only cover the specific area of interest.
These grids are composed of cells that have unique identifiers. There is a mathematical relationship between location and the cell in the grid. This makes the grid index systems very scalable and parallel in nature.

Figure 4: Grid Index Systems (H3, British National Grid)
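That "mathematical relationship between location and cell" can be sketched with a toy rectangular grid. This is purely illustrative: real systems like H3 use hexagonal hierarchical cells and far more careful math, but the key property, cell assignment by arithmetic rather than tree traversal, is the same:

```python
def cell_id(lat: float, lon: float, cell_size_deg: float = 1.0) -> tuple[int, int]:
    """Map a coordinate to a grid cell identifier by pure arithmetic.
    No spatial tree needs to be built or traversed, which is what makes
    grid indexes cheap to compute in parallel across billions of rows."""
    return (int(lat // cell_size_deg), int(lon // cell_size_deg))

# Two nearby points land in the same cell; matching on cell_id then
# becomes a cheap equality comparison instead of a geometric test.
print(cell_id(51.5074, -0.1278))  # (51, -1)
print(cell_id(51.9, -0.9))        # (51, -1)
```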
Another important aspect of grid index systems is that they are open source, allowing index values to be universally leveraged by data producers and consumers alike. Data can be enriched with the grid index information at any step of its journey through the data supply chain. This makes the grid index systems an example of community driven data standards. Community driven data standards by nature do not require enforcement, which fully adheres to the habit generation aspect of the value pyramid and meaningfully addresses interoperability and accessibility principles of FAIR.

Databricks has recently announced native support for the H3 grid index system following the same value proposition. Adopting common industry standards driven by the community is the only way to properly drive habit generation and interoperability. To strengthen this statement, organizations like CARTO, ESRI and Google have been promoting the usage of grid index systems for scalable GIS system design. In addition, Databricks Labs project Mosaic supports the British National Grid as the standard grid index system that is widely used in the UK government. Grid index systems are key for the scalability of geospatial data processing and for properly designing solutions for complex problems (e.g., figure 5 — flight holding patterns using H3).
The geospatial data domain is so diverse and has organically grown over the years around the use cases it was addressing. Unification of such a diverse ecosystem is a massive challenge. A recent effort by the Open Geospatial Consortium (OGC) to standardize to Apache Parquet and its geospatial schema specification GeoParquet is a step in the right direction. Simplicity is one of the key aspects of designing a good scalable and robust product — unification leads to simplicity and addresses one of the main sources of friction in the ecosystem — the data ingestion. Standardizing to GeoParquet brings a lot of value that addresses all of the aspects of FAIR data and the value pyramid.

Why introduce another format into an already complex ecosystem? GeoParquet isn't a new format — it is a schema specification for the Apache Parquet format that is already widely adopted and used by the industry and the community. Parquet as the base format supports binary columns and allows for storage of arbitrary data payload. At the same time the format supports structured data columns that can store metadata together with the data payload. This makes it a choice that promotes interoperability and reproducibility. Finally, the Delta Lake format has been built on top of Parquet and brings ACID properties to the table. ACID properties of a format are crucial for reproducibility and for trusted outputs. In addition, Delta is the format used by the scalable data sharing solution Delta Sharing.

Delta Sharing enables enterprise scale data sharing between any public cloud using Databricks (DIY options for private cloud are available using open source building blocks). Delta Sharing completely abstracts the need for custom built REST APIs for exposing data to other third parties. Any data asset stored in Delta (using the GeoParquet schema) automatically becomes a data product that can be exposed to external parties in a controlled and governed manner. Delta Sharing has been built from the ground up with security best practices in mind.
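The "metadata together with the data payload" point is exactly what GeoParquet leans on: geometry lives in a binary column while a "geo" entry in the Parquet file metadata describes how to interpret it. A minimal sketch of such a metadata document follows; the field names reflect the GeoParquet specification at a high level, but consult the spec itself for the authoritative schema:

```python
import json

# Illustrative GeoParquet-style file metadata: the geometry payload is a
# WKB-encoded binary column, and this document rides along in the Parquet
# footer so any conforming reader can interpret that column.
geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
        }
    },
}

def primary_encoding(meta: dict) -> str:
    """Look up how the primary geometry column is encoded."""
    return meta["columns"][meta["primary_column"]]["encoding"]

# The document is plain JSON, which is what makes it easy for any tool
# in the supply chain to read or validate.
print(primary_encoding(json.loads(json.dumps(geo_metadata))))  # WKB
```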
Interoperability is a key component of FAIR data principles, and from interoperability a question of circularity comes to mind. How can we design an ecosystem that maximizes data utilization and data reuse? Once again, FAIR together with the value pyramid holds answers. Findability of the data is key to data reuse and to solving for data pollution. With data assets that can be discovered easily we can avoid the recreation of the same data assets in multiple places with just slight alteration. Instead we gain a coherent data ecosystem with data that can be easily combined and reused. Databricks has recently announced the Databricks Marketplace. The idea behind the marketplace is in line with the original definition of data product by DJ Patil. The marketplace will support sharing of data sets, notebooks, dashboards, and machine learning models. The critical building block for such a marketplace is the concept of Delta Sharing — the scalable, flexible and robust channel for sharing any data — geospatial data included.

Designing scalable data products that will live in the marketplace is crucial. In order to maximize the value add of each data product one should strongly consider FAIR principles and the product value pyramid. Without these guiding principles we will only increase the issues that are already present in the current systems. Each data product should solve a unique problem and should solve it in a simple, reproducible and robust way.

You can read more on how Databricks Lakehouse Platform can help you accelerate time to value from your data products in the eBook: A New Approach to Data Sharing.
SECTION 2.7
June 8, 2022
This blog will discuss the importance of data lineage, some of the common use cases, and our vision for better data transparency and data understanding with data lineage.

What is data lineage and why is it important?

Data lineage describes the transformations and refinements of data from source to insight. Lineage includes capturing all the relevant metadata and events associated with the data in its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, what other data sets leverage it, and many other events and attributes. With a data lineage solution, data teams get an end-to-end view of how data is transformed and how it flows across their data estate.

As more and more organizations embrace a data-driven culture and set up processes and tools to democratize and scale data and AI, data lineage is becoming an essential pillar of a pragmatic data management and governance strategy.

To understand the importance of data lineage, we have highlighted some of the common use cases we have heard from our customers below.

Impact analysis

Data goes through multiple updates or revisions over its lifecycle, and understanding the potential impact of any data changes on downstream consumers becomes important from a risk management standpoint. With data lineage, data teams can see all the downstream consumers — applications, dashboards, machine learning models or data sets, etc. — impacted by data changes, understand the severity of the impact, and notify the relevant stakeholders. Lineage also helps IT teams proactively communicate data migrations to the appropriate teams, ensuring business continuity.

Data understanding and transparency

Organizations deal with an influx of data from multiple sources, and building a better understanding of the context around data is paramount to ensure the trustworthiness of the data. Data lineage is a powerful tool that enables data leaders to drive better transparency and understanding of data in their organizations. Data lineage also empowers data consumers such as data scientists, data engineers and data analysts to be context-aware as they perform analyses, resulting in better quality outcomes. Finally, data stewards can see which data sets are no longer accessed or have become obsolete to retire unnecessary data and ensure data quality for end business users.
The Big Book of Data Engineering – 2nd Edition 48
All of these capabilities rely upon the automatic collection of data lineage across
all use cases and personas — which is why the lakehouse and data lineage are a
powerful combination.
Data lineage for tables

Data lineage for notebooks, workflows, dashboards
Built-in security: Lineage graphs in Unity Catalog are privilege-aware and share
the same permission model as Unity Catalog. If users do not have access to
a table, they will not be able to explore the lineage associated with the table,
adding an additional layer of security for privacy considerations.
Easily exportable via REST API: Lineage can be visualized in the Data Explorer
in near real-time, and retrieved via REST API to support integrations with our
catalog partners.
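As a sketch of what such a REST integration might look like, the snippet below only assembles a table-lineage request. The endpoint path, payload field names, and workspace URL are assumptions based on the Databricks lineage API and should be checked against the current REST API reference:

```python
import json

# Hypothetical sketch of a Unity Catalog table-lineage request.
# The endpoint path and payload field names are assumptions; verify
# them against the current Databricks REST API reference before use.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder

def build_lineage_request(table_name: str) -> dict:
    """Assemble (but do not send) a lineage query for one table."""
    return {
        "method": "GET",
        "url": f"{WORKSPACE_URL}/api/2.0/lineage-tracking/table-lineage",
        "headers": {"Authorization": "Bearer <personal-access-token>"},  # placeholder
        "body": json.dumps({
            "table_name": table_name,        # fully qualified: catalog.schema.table
            "include_entity_lineage": True,  # also return notebook/job/dashboard lineage
        }),
    }

request = build_lineage_request("main.sales.orders")
```

The response can then be fed to a catalog partner integration, subject to the same Unity Catalog permission model described above.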
SECTION 2.8
A new data management architecture known as the data lakehouse emerged independently across many organizations and use cases to support AI and BI directly on vast amounts of data. One of the key success factors for using the data lakehouse for analytics and machine learning is the ability to quickly and easily ingest data of various types, including data from on-premises storage platforms (data warehouses, mainframes), real-time streaming data, and bulk data assets.

As data ingestion into the lakehouse is an ongoing process that feeds the proverbial ETL pipeline, you will need multiple options to ingest various formats, types and latencies of data. For data stored in cloud object stores such as AWS S3, Google Cloud Storage and Azure Data Lake Storage, Databricks offers Auto Loader, a natively integrated feature that allows data engineers to ingest millions of files from cloud storage continuously. In other streaming cases (e.g., IoT sensor or clickstream data), Databricks provides native connectors for Apache Spark Structured Streaming to quickly ingest data from popular message queues, such as Apache Kafka, Azure Event Hubs or AWS Kinesis, at low latencies. Furthermore, many customers can leverage popular ingestion tools that integrate with Databricks, such as Fivetran, to easily ingest data from enterprise applications, databases, mainframes and more into the lakehouse. Finally, analysts can use the simple "COPY INTO" command to pull new data into the lakehouse automatically, without the need to keep track of which files have already been processed.

This blog focuses on COPY INTO, a simple yet powerful SQL command that allows you to perform batch file ingestion into Delta Lake from cloud object stores. It's idempotent, guaranteeing that files are ingested with exactly-once semantics when executed multiple times, and it supports incremental appends and simple transformations. It can be run once, in an ad hoc manner, and can be scheduled through Databricks Workflows. In recent Databricks Runtime releases, COPY INTO introduced new functionalities for data preview, validation, enhanced error handling, and a new way to copy into a schemaless Delta Lake table so that users can get started quickly, completing the end-to-end user journey to ingest from cloud object stores. Let's take a look at the popular COPY INTO use cases.
4. Secure data access with COPY INTO

COPY INTO supports secure access in several ways. In this section, we want to highlight two new options you can use in both Databricks SQL and notebooks from recent releases:

Unity Catalog

With the general availability of Databricks Unity Catalog, you can use COPY INTO to ingest data to Unity Catalog managed or external tables from any source and file format supported by COPY INTO. Unity Catalog also adds new options for configuring secure access to raw data, allowing you to use Unity Catalog external locations or storage credentials to access data in cloud object storage. Learn more about how to use COPY INTO with Unity Catalog.

Temporary Credentials

What if you have not configured Unity Catalog or an instance profile? How about data from a trusted third-party bucket? Here is a convenient COPY INTO feature that allows you to ingest data with inline temporary credentials to handle the ad hoc bulk ingestion use case.

5. Filtering files for ingestion

What about ingesting a subset of files where the filenames match a pattern? You can apply glob patterns to identify the files to load from the source directory. For example, let's filter and ingest files which contain the word `raw_data` in the filename below.

COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath'
FILEFORMAT = CSV
PATTERN = '*raw_data*.csv'
FORMAT_OPTIONS ('header' = 'true')

6. Ingest files in a time period

In data engineering, it is frequently necessary to ingest files that have been modified before or after a specific timestamp. Data between two timestamps may also be of interest. The 'modifiedAfter' and 'modifiedBefore' format options offered by COPY INTO allow users to ingest data from a chosen time window into a Delta table.
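As an illustrative sketch only (the bucket name, credential values and timestamp are placeholders, and the exact option names should be checked against the COPY INTO reference), inline temporary credentials and a time window might be combined like this:

```sql
-- Sketch only: placeholder bucket, credentials and timestamp
COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath' WITH (
  CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'modifiedAfter' = '2022-01-01T00:00:00.000000Z')
```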
Error handling:

How about ingesting data with file corruption issues? Common examples of file corruption are:

COPY INTO's format option ignoreCorruptFiles helps skip those files while processing. The result of the COPY INTO command returns the number of files skipped in the num_skipped_corrupt_files column. In addition, these corrupt files aren't tracked by the ingestion state in COPY INTO, therefore they can be reloaded in a subsequent execution once the corruption is fixed. This option is available in Databricks Runtime 11.0+.

You can see which files have been detected as corrupt by running COPY INTO in VALIDATE mode.

COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath'
FILEFORMAT = CSV
VALIDATE ALL
FORMAT_OPTIONS('ignoreCorruptFiles' = 'true')

Observability:

In Databricks Runtime 10.5, the file metadata column was introduced to provide input file metadata information, which allows users to monitor and get key properties of the ingested files, like path, name, size and modification time, by querying a hidden STRUCT column called _metadata. To include this information in the destination, you must explicitly reference the _metadata column in your COPY INTO query.

COPY INTO my_example_data
FROM (
  SELECT *, _metadata source_metadata FROM 's3://my-bucket/exampleDataPath'
)
FILEFORMAT = CSV

How does it compare to Auto Loader?

COPY INTO is a simple and powerful command to use when your source directory contains a small number of files (i.e., thousands of files or less), and if you prefer SQL. In addition, COPY INTO can be used over JDBC to push data into Delta Lake at your convenience, a common pattern by many ingestion partners. To ingest a larger number of files both in streaming and batch, we recommend using Auto Loader. In addition, for a modern data pipeline based on the medallion architecture, we recommend using Auto Loader in Delta Live Tables pipelines, leveraging advanced capabilities of automatic error handling, quality control, data lineage and setting expectations in a declarative approach.

How to get started?

To get started, you can go to the Databricks SQL query editor, update and run the example SQL commands to ingest from your cloud object stores. Check out the options in No. 4 to establish secure access to your data for querying it in Databricks SQL. To get familiar with COPY INTO in Databricks SQL, you can also follow this quickstart tutorial.

As an alternative, you can use this notebook in Data Science & Engineering and Machine Learning workspaces to learn most of the COPY INTO features in this blog, where source data and target Delta tables are generated in DBFS.

More tutorials for COPY INTO can be found here.
SECTION 2.9
This guide will demonstrate how you can leverage change data capture in Delta Live Tables pipelines to identify new records and capture changes made to the data set in your data lake. Delta Live Tables pipelines enable you to develop scalable, reliable and low latency data pipelines, while performing change data capture in your data lake with minimum required computation resources and seamless out-of-order data handling.

Note: We recommend following Getting Started with Delta Live Tables, which explains creating scalable and reliable pipelines using Delta Live Tables (DLT) and its declarative ETL definitions.

Background on change data capture

Change data capture (CDC) is a process that identifies and captures incremental changes (data deletes, inserts and updates) in databases, like tracking customer, order or product status for near-real-time data applications. CDC provides real-time data evolution by processing data in a continuous incremental fashion as new events occur.

Since over 80% of organizations plan on implementing multicloud strategies by 2025, choosing the right approach for your business that allows seamless real-time centralization of all data changes in your ETL pipeline across multiple environments is critical.

By capturing CDC events, Databricks users can re-materialize the source table as a Delta table in the Lakehouse and run their analysis on top of it, while being able to combine data with external systems. The MERGE INTO command in Delta Lake on Databricks enables customers to efficiently upsert and delete records in their data lakes — you can check out our previous deep dive on the topic here. This is a common use case that we observe many Databricks customers leveraging Delta Lake to perform, keeping their data lakes up to date with real-time business data.

While Delta Lake provides a complete solution for real-time CDC synchronization in a data lake, we are now excited to announce the change data capture feature in Delta Live Tables that makes your architecture even simpler, more efficient and scalable. DLT allows users to ingest CDC data seamlessly using SQL and Python.

Earlier CDC solutions with Delta tables were using the MERGE INTO operation, which requires manually ordering the data to avoid failure when multiple rows of the source data set match while attempting to update the same rows of the target Delta table. To handle the out-of-order data, there was an extra step required to preprocess the source table using a foreachBatch implementation to eliminate the possibility of multiple matches, retaining only the latest change for each key (see the change data capture example). The new APPLY CHANGES INTO operation in DLT pipelines automatically and seamlessly handles out-of-order data without any need for manual data engineering intervention.

CDC with Databricks Delta Live Tables

In this blog, we will demonstrate how to use the APPLY CHANGES INTO command in Delta Live Tables pipelines for a common CDC use case where the CDC data is coming from an external system. A variety of CDC tools are available such as Debezium, Fivetran, Qlik Replicate, Talend, and StreamSets. While specific implementations differ, these tools generally capture and record the history of data changes in logs; downstream applications consume these CDC logs. In our example, data is landed in cloud object storage from a CDC tool such as Debezium, Fivetran, etc.

We have data from various CDC tools landing in a cloud object storage or a message queue like Apache Kafka. Typically we see CDC used in an ingestion to what we refer to as the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a Lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture. Delta Live Tables allows you to seamlessly apply changes from CDC feeds to tables in your Lakehouse; combining this functionality with the medallion architecture allows for incremental changes to easily flow through analytical workloads at scale. Using CDC together with the medallion architecture provides multiple benefits to users since only changed or added data needs to be processed. Thus, it enables users to cost-effectively keep Gold tables up-to-date with the latest business data.

NOTE: The example here applies to both SQL and Python versions of CDC and is based on a specific way to use the operations; to evaluate variations, please see the official documentation here.

Prerequisites

To get the most out of this guide, you should have a basic familiarity with:

• SQL or Python
• Delta Live Tables
• Developing ETL pipelines and/or working with Big Data systems
• Databricks interactive notebooks and clusters
• You must have access to a Databricks Workspace with permissions to create new clusters, run jobs, and save data to a location on external cloud object storage or DBFS
• For the pipeline we are creating in this blog, the "Advanced" product edition, which supports enforcement of data quality constraints, needs to be selected
The data set

Here we are consuming realistic looking CDC data from an external database. In this pipeline, we will use the Faker library to generate the data set that a CDC tool like Debezium can produce and bring into cloud storage for the initial ingest in Databricks. Using Auto Loader we incrementally load the messages from cloud object storage, and store them in the Bronze table as it stores the raw messages. The Bronze tables are intended for data ingestion which enables quick access to a single source of truth. Next we perform APPLY CHANGES INTO from the cleaned Bronze layer table to propagate the updates downstream to the Silver table. As data flows to Silver tables, generally it becomes more refined and optimized ("just-enough") to provide an enterprise a view of all its key business entities. See the diagram below.

This blog focuses on a simple example that requires a JSON message with four fields of customer's name, email, address and id, along with two fields, operation (which stores the operation code: DELETE, APPEND, UPDATE, CREATE) and operation_date (which stores the date and timestamp at which the record arrived for each operation action), to describe the changed data.

To generate a sample data set with the above fields, we are using a Python package that generates fake data, Faker. You can find the notebook related to this data generation section here. In this notebook we provide the name and storage location to write the generated data there. We are using the DBFS functionality of Databricks; see the DBFS documentation to learn more about how it works. Then, we use a PySpark user-defined function to generate the synthetic data set for each field, and write the data back to the defined storage location, which we will refer to in other notebooks for accessing the synthetic data set.
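The blog's data generator runs Faker inside a Databricks notebook; as a rough stand-alone illustration of the message shape described above (the field names follow the text, while the helper function and sample values are made up), plain Python can produce similar records:

```python
import json
import random
from datetime import datetime, timezone

# Stand-in for the Faker-based notebook: emit one CDC-style JSON message
# with the fields described in the text. Values are invented for illustration.
OPERATIONS = ["DELETE", "APPEND", "UPDATE", "CREATE"]

def make_cdc_message(customer_id: int) -> str:
    record = {
        "id": customer_id,
        "firstname": random.choice(["Ada", "Grace", "Alan"]),
        "lastname": random.choice(["Lovelace", "Hopper", "Turing"]),
        "email": f"user{customer_id}@example.com",
        "address": f"{random.randint(1, 999)} Main St",
        "operation": random.choice(OPERATIONS),
        "operation_date": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)

message = make_cdc_message(42)
```

A file of such messages landed in cloud storage is what Auto Loader picks up for the Bronze table in the next step.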
According to the medallion architecture paradigm, the Bronze layer holds the
most raw data quality. At this stage we can incrementally read new data using
Auto Loader from a location in cloud storage. Here we are adding the path to our
generated data set to the configuration section under pipeline settings, which
allows us to load the source path as a variable. So now our configuration under
pipeline settings looks like below:
"configuration": {
"source": "/tmp/demo/cdc_raw"
}
Let's take a look at the Bronze table we will ingest: a. in SQL, and b. using Python.

A. SQL

B. PYTHON

import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

source = spark.conf.get("source")
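The Bronze table definitions themselves appear as notebook screenshots in the original blog; a hedged sketch of what the SQL version might look like with Auto Loader (the table name and source path follow the surrounding text, while the properties and options are assumptions) is:

```sql
-- Sketch only: incrementally ingest raw CDC messages with Auto Loader.
-- Table name and "${source}" path follow the surrounding text; the rest is assumed.
CREATE STREAMING LIVE TABLE customer_bronze
TBLPROPERTIES ("quality" = "bronze")
COMMENT "New customer data incrementally ingested from the cloud storage landing zone"
AS SELECT *
FROM cloud_files("${source}/customers", "json");
```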
Expectations and high-quality data

In the next step to create a high-quality, diverse, and accessible data set, we impose quality check expectation criteria using Constraints. Currently, a constraint can be either retain, drop, or fail. For more detail see here. All constraints are logged to enable streamlined quality monitoring.

A. SQL

CREATE TEMPORARY STREAMING LIVE TABLE customer_bronze_clean_v(
  CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_address EXPECT (address IS NOT NULL),
  CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
)
TBLPROPERTIES ("quality" = "silver")
COMMENT "Cleansed bronze customer view (i.e. what will become Silver)"
AS SELECT *
FROM STREAM(LIVE.customer_bronze);

B. PYTHON

@dlt.view(name="customer_bronze_clean_v",
  comment="Cleansed bronze customer view (i.e. what will become Silver)")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
@dlt.expect("valid_address", "address IS NOT NULL")
@dlt.expect_or_drop("valid_operation", "operation IS NOT NULL")
def customer_bronze_clean_v():
  return dlt.read_stream("customer_bronze") \
    .select("address", "email", "id", "firstname", "lastname",
            "operation", "operation_date", "_rescued_data")

Using APPLY CHANGES INTO statement to propagate changes to downstream target table

Prior to executing the Apply Changes Into query, we must ensure that a target streaming table which we want to hold the most up-to-date data exists. If it does not exist, we need to create one. The cells below are examples of creating a target streaming table. Note that at the time of publishing this blog, the target streaming table creation statement is required along with the Apply Changes Into query, and both need to be present in the pipeline — otherwise your table creation query will fail.

A. SQL

CREATE STREAMING LIVE TABLE customer_silver
TBLPROPERTIES ("quality" = "silver")
COMMENT "Clean, merged customers";

B. PYTHON

dlt.create_target_table(name="customer_silver",
  comment="Clean, merged customers",
  table_properties={
    "quality": "silver"
  }
)
Now that we have a target streaming table, we can propagate changes to the downstream target table using the Apply Changes Into query. While a CDC feed comes with INSERT, UPDATE and DELETE events, the DLT default behavior is to apply INSERT and UPDATE events from any record in the source data set matching on primary keys, and sequenced by a field which identifies the order of events. More specifically, it updates any row in the existing target table that matches the primary key(s), or inserts a new row when a matching record does not exist in the target streaming table. We can use APPLY AS DELETE WHEN in SQL, or its equivalent apply_as_deletes argument in Python, to handle DELETE events.

In this example we used "id" as the primary key, which uniquely identifies the customers and allows CDC events to apply to those identified customer records in the target streaming table. Since "operation_date" keeps the logical order of CDC events in the source data set, we use "SEQUENCE BY operation_date" in SQL, or its equivalent "sequence_by = col("operation_date")" in Python, to handle change events that arrive out of order. Keep in mind that the field value we use with SEQUENCE BY (or sequence_by) should be unique among all updates to the same key. In most cases, the sequence by column will be a column with timestamp information.

Finally we used "COLUMNS * EXCEPT (operation, operation_date, _rescued_data)" in SQL, or its equivalent "except_column_list" = ["operation", "operation_date", "_rescued_data"] in Python, to exclude the three columns "operation", "operation_date" and "_rescued_data" from the target streaming table. By default all the columns are included in the target streaming table when we do not specify the "COLUMNS" clause.

A. SQL

APPLY CHANGES INTO LIVE.customer_silver
FROM stream(LIVE.customer_bronze_clean_v)
KEYS (id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY operation_date
COLUMNS * EXCEPT (operation, operation_date, _rescued_data);

B. PYTHON

dlt.apply_changes(
  target = "customer_silver",
  source = "customer_bronze_clean_v",
  keys = ["id"],
  sequence_by = col("operation_date"),
  apply_as_deletes = expr("operation = 'DELETE'"),
  except_column_list = ["operation", "operation_date", "_rescued_data"])

To check out the full list of available clauses see here.

Please note that, at the time of publishing this blog, a table that reads from the target of an APPLY CHANGES INTO query or apply_changes function must be a live table, and cannot be a streaming live table.

A SQL and Python notebook is available for reference for this section. Now that we have all the cells ready, let's create a pipeline to ingest data from cloud object storage. Open Jobs in a new tab or window in your workspace, and select "Delta Live Tables."
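To build intuition for these semantics, here is a small pure-Python sketch (not DLT code; the event data is invented for illustration) of keeping only the latest change per key, sequenced by operation_date, and treating DELETE events as removals:

```python
# Toy model of APPLY CHANGES INTO semantics: for each key keep the event
# with the highest sequence value, then apply it as an upsert or a delete.
# Illustrative only; real DLT does this incrementally and at scale.
def apply_changes_toy(target, events):
    latest = {}
    for e in events:  # events may arrive out of order
        k = e["id"]
        if k not in latest or e["operation_date"] > latest[k]["operation_date"]:
            latest[k] = e
    for k, e in latest.items():
        if e["operation"] == "DELETE":
            target.pop(k, None)
        else:  # INSERT / UPDATE both become an upsert,
               # excluding the operation bookkeeping columns
            target[k] = {f: v for f, v in e.items()
                         if f not in ("operation", "operation_date")}
    return target

events = [
    {"id": 1, "email": "old@x.com", "operation": "UPDATE", "operation_date": "2022-01-02"},
    {"id": 1, "email": "new@x.com", "operation": "UPDATE", "operation_date": "2022-01-03"},
    {"id": 2, "email": "b@x.com",   "operation": "DELETE", "operation_date": "2022-01-02"},
]
state = apply_changes_toy({2: {"id": 2, "email": "b@x.com"}}, events)
```

Even though the second update for id 1 could arrive before the first, sequencing by operation_date makes the outcome deterministic, which is exactly the problem SEQUENCE BY solves.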
The pipeline associated with this blog has the following DLT pipeline settings:

{
  "clusters": [
    {
      "label": "default",
      "num_workers": 1
    }
  ],
  "development": true,
  "continuous": false,
  "edition": "advanced",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/[email protected]/Delta-Live-Tables/notebooks/1-CDC_DataGenerator"
      }
    },
    {
      "notebook": {
        "path": "/Repos/[email protected]/Delta-Live-Tables/notebooks/2-Retail_DLT_CDC_sql"
      }
    }
  ],
  "name": "CDC_blog",
  "storage": "dbfs:/home/mydir/myDB/dlt_storage",
  "configuration": {
    "source": "/tmp/demo/cdc_raw",
    "pipelines.applyChangesPreviewEnabled": "true"
  },
  "target": "my_database"
}

1. Select "Create Pipeline" to create a new pipeline
2. Specify a name such as "Retail CDC Pipeline"
3. Specify the Notebook Paths that you already created earlier, one for the generated data set using the Faker package, and another path for the ingestion of the generated data in DLT. The second notebook path can refer to the notebook written in SQL, or Python, depending on your language of choice.
4. To access the data generated in the first notebook, add the data set path in configuration. Here we stored data in "/tmp/demo/cdc_raw/customers", so we set "source" to "/tmp/demo/cdc_raw/" to reference "source/customers" in our second notebook.
5. Specify the Target (which is optional and refers to the target database), where you can query the resulting tables from your pipeline
6. Specify the Storage Location in your object storage (which is optional), to access your DLT produced data sets and metadata logs for your pipeline
7. Set Pipeline Mode to Triggered. In Triggered mode, the DLT pipeline will consume new data in the source all at once, and once the processing is done it will terminate the compute resource automatically. You can toggle between Triggered and Continuous modes when editing your pipeline settings. Setting "continuous": false in the JSON is equivalent to setting the pipeline to Triggered mode.
8. For this workload you can disable the autoscaling under Autopilot Options, and use only one worker cluster. For production workloads, we recommend enabling autoscaling and setting the maximum number of workers needed for cluster size.
9. Select "Start"
10. Your pipeline is created and running now!
DLT pipeline lineage observability and data quality monitoring

All DLT pipeline logs are stored in the pipeline's storage location. You can specify your storage location only when you are creating your pipeline. Note that once the pipeline is created, you can no longer modify the storage location.

You can check out our previous deep dive on the topic here. Try this notebook to see pipeline observability and data quality monitoring on the example DLT pipeline associated with this blog.

Conclusion

In this blog, we showed how we made it seamless for users to efficiently implement change data capture (CDC) into their lakehouse platform with Delta Live Tables (DLT). DLT provides built-in quality controls with deep visibility into pipeline operations, observing pipeline lineage, monitoring schema, and quality checks at each step in the pipeline. DLT supports automatic error handling and best-in-class auto-scaling capability for streaming workloads, which enables users to have quality data with the optimum resources required for their workload.

Data engineers can now easily implement CDC with a new declarative APPLY CHANGES INTO API with DLT, in either SQL or Python. This new capability lets your ETL pipelines easily identify changes and apply those changes across tens of thousands of tables with low-latency support.

Ready to get started and try out CDC in Delta Live Tables for yourself?

Please watch this webinar to learn how Delta Live Tables simplifies the complexity of data transformation and ETL, see our Change data capture with Delta Live Tables document and official GitHub, and follow the steps in this video to create your pipeline!
SECTION 2.10
MARCELL FERENCZ and EDWARD KELLY
Government data exchange is the practice of sharing data between different government agencies and often partners in commercial sectors. Government can share data for various reasons, such as to improve government operations' efficiency, provide better services to the public, or support research and policy-making. In addition, data exchange in the public sector can involve sharing with the private sector or receiving data from the private sector. The considerations span multiple jurisdictions and over almost all industries. In this blog, we will address the needs disclosed as part of national data strategies and how modern technologies, particularly Delta Sharing, Unity Catalog, and clean rooms, can help you design, implement and manage a future-proof and sustainable data ecosystem.

Probably the quote about sharing that applies the most profoundly to the topic of data sharing. To the extent that the purpose of sharing the data is to create new information, new insights, and new data. The importance of data sharing is even more amplified in the government context, where federation between departments allows for increased focus. Still, the very same federation introduces challenges around data completeness, data quality, data access, security and control, FAIR-ness of data, etc. These challenges are far from trivial and require a strategic, multifaceted approach to be addressed appropriately. Technology, people, process, legal frameworks, etc., require dedicated consideration when designing a robust data sharing ecosystem.

The National Data Strategy (NDS) by the UK government outlines five actionable missions through which we can materialize the value of data for the citizen and society-wide benefits.
It comes as no surprise that each and every one of the missions is strongly related to the concept of data sharing, or more broadly, data access both within and outside of government departments:

1. Unlocking the value of the data across the economy — Mission 1 of the NDS aims to assert government and the regulators as enablers of the value extraction from data through the adoption of best practices. The UK data economy was estimated to be near £125 billion in 2021 with an upwards trend. In this context, it is essential to understand that the government-collected and provided open data can be crucial for addressing many of the challenges across all industries.

For example, insurance providers can better assess the risk of insuring properties by ingesting and integrating Flood areas provided by DEFRA. On the other hand, capital market investors could better understand the risk of their investments by ingesting and integrating the Inflation Rate Index by ONS. Reversely, it is crucial for regulators to have well-defined data access and data sharing patterns for conducting their regulatory activities. This clarity truly enables the economic actors that interact with government data.

2. Securing a pro-growth and trusted data regime — The key aspect of Mission 2 is data trust, or more broadly, adherence to data quality norms. Data quality considerations become further amplified for data sharing and data exchange use cases where we are considering the whole ecosystem at once, and quality implications transcend the boundaries of our own platform. This is precisely why we have to adopt "data sustainability." What we mean by sustainable data products are data products that harness the existing sources over reinvention of the same/similar assets, accumulation of unnecessary data (data pollutants) and that anticipate future uses.

Ungoverned and unbounded data sharing could negatively impact data quality and hinder the growth and value of data. The quality of how the data is shared should be a key consideration of data quality frameworks. For this reason, we require a solid set of standards and best practices for data sharing with governance and quality assurance built into the process and technologies. Only this way can we ensure the sustainability of our data and secure a pro-growth trusted data regime.
3. Transforming government's use of data to drive efficiency and improve public services — "By 2025 data assets are organized and supported as products, regardless of whether they're used by internal teams or external customers… Data products continuously evolve in an agile manner to meet the needs of consumers… these products provide data solutions that can more easily and repeatedly be used to meet various business challenges and reduce the time and cost of delivering new AI-driven capabilities." — The data-driven enterprise of 2025 by McKinsey. AI and ML can be powerful enablers of digital transformation for both the public and private sectors.

AI, ML, reports, and dashboards are just a few examples of data products and services that extract value from data. The quality of these solutions is directly reflected in the quality of data used for building them and our ability to access and leverage available data assets both internally and externally. Whilst there is a vast amount of data available for us to build new intelligent solutions for driving efficiency, for better processes, better decision-making, and better policies, there are numerous barriers that can trap the data, such as legacy systems, data silos, fragmented standards, proprietary formats, etc. Modeling data solutions as data products and standardizing them to a unified format allows us to abstract such barriers and truly leverage the data ecosystem.

4. Ensuring the security and resilience of the infrastructure on which data relies — Reflecting on the vision of the year 2025: this isn't that far from now, and even in a not so distant future we will be required to rethink our approach to data, more specifically, what is our digital supply chain infrastructure/data sharing infrastructure? Data and data assets are products and should be managed as products. If data is a product, we need a coherent and unified way of providing those products.

If data is to be used across industries and across both private and public sectors, we need an open protocol that drives adoption and habit generation. To drive adoption, the technologies we use must be resilient, robust, trusted and usable by/for all. Vendor lock-in, platform lock-in or cloud lock-in are all barriers to achieving this vision.

5. Championing the international flow of data — Data exchange between jurisdictions and across governments will likely be one of the most transformative applications of data at scale. Some of the world's toughest challenges depend on the efficient exchange of data between governments: prevention of criminal activities, counterterrorism activities, net-zero emission goals, international trade, and the list goes on. Some steps in this direction are already materializing: the U.S. federal government and UK government have agreed on data exchange for countering serious crime activities. This is a true example of championing the international flow of data and using data for good. It is imperative that for these use cases, we approach data sharing from a security-first angle. Data sharing standards and protocols need to adhere to security and privacy best practices.
While originally built with a focus on the UK government and how to better integrate data as a key asset of a modern government, these concepts apply in a much wider global public sector context. In the same spirit, the U.S. Federal Government proposed the Federal Data Strategy as a collection of principles, practices, action steps and a timeline through which government can leverage the full value of Federal data for mission, service and the public good.

• Conscious design — These principles are strongly aligned with the idea of data sustainability. The guidelines promote forward thinking around usability and interoperability of the data and user-centric design principles of sustainable data products.

• Learning culture — Data sharing, or alternatively knowledge sharing, has an important role in building a scalable learning ecosystem and learning culture. Data is front and center of knowledge synthesis, and from a scientific angle, data proves factual knowledge. Another critical component of knowledge is the “Why?”, and data is what we need to address the “Why?” component of any decision we make: which policy to enforce, whom to sanction, whom to support with grants, how to improve the efficiency of government services, how to better serve citizens and society.
Delta Sharing is a protocol based on a lean set of REST APIs that manages sharing, permissions and access to any data asset stored in Delta or Parquet format. The protocol defines two main actors: the data provider (data supplier, data owner) and the data recipient (data consumer). The recipient, by definition, is agnostic to the data format at the source. Delta Sharing provides the necessary abstractions for governed data access in many different languages and tools.

Delta Sharing is uniquely positioned to answer many of the challenges of data sharing in a scalable manner within the context of highly regulated domains like the public sector:

• Privacy and security concerns — Personally identifiable data or otherwise sensitive or restricted data is a major part of the data exchange needs of a data-driven and modernized government. Given the sensitive nature of such data, it is paramount that the governance of data sharing is maintained in a coherent and unified manner. Any unnecessary process and technological complexity increases the risk of over-sharing data. With this in mind, Delta Sharing has been designed with security best practices from its very inception. The protocol provides end-to-end encryption, short-lived credentials, and accessible and intuitive audit and governance features. All of these capabilities are available in a centralized way across all your Delta tables, across all clouds.

• Quality and accuracy — Another challenge of data sharing is ensuring that the data being shared is of high quality and accuracy. Given that the underlying data is stored as Delta tables, we can guarantee that the transactional nature of data is respected; Delta ensures the ACID properties of data. Furthermore, Delta supports data constraints to guarantee data quality requirements at storage. Unfortunately, other formats such as CSV, CSVW, ORC, Avro, XML, etc., do not have such properties without significant additional effort. The issue is made even more pronounced by the fact that data quality cannot be ensured in the same way on both the data provider and data recipient side without an exact reimplementation of the source systems. It is critical to embed quality and metadata together with data to ensure quality travels together with data. Any decoupled approach that manages data, metadata and quality separately increases the risk of sharing and can lead to undesirable outcomes.

• Lack of standardization — Another challenge of data sharing is the lack of standardization in how data is collected, organized and stored. This is particularly pronounced in the context of governmental activities. While governments have proposed standard formats (e.g., the Office for National Statistics promotes usage of CSVW), aligning all private and public sector companies to standards proposed by such initiatives is a massive challenge. Other industries may have different requirements for scalability, interoperability, format complexity, lack of structure in data, etc. Most of the currently advocated standards are lacking in multiple such aspects. Delta is the most mature candidate for assuming the central role in the standardization of the data exchange format. It has been built as a transactional and scalable data format, it supports structured, semi-structured and unstructured data, it stores data schema and metadata together with the data, and it provides a scalable enterprise-grade sharing protocol through Delta Sharing. Finally, Delta is one of the most popular open source projects in the ecosystem and, since May 2022, has surpassed 7 million monthly downloads.
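The recipient-side experience described above can be illustrated with the open source `delta-sharing` Python client. A provider hands the recipient a profile file — a small JSON document containing the sharing server endpoint and a bearer token — and the recipient addresses a table as `<profile>#<share>.<schema>.<table>`. The sketch below parses a profile and builds that coordinate using only the standard library; the share name, endpoint and token are hypothetical:

```python
import json

def load_profile(profile_json: str) -> dict:
    """Parse a Delta Sharing profile (normally stored as a .share file)."""
    profile = json.loads(profile_json)
    # The protocol versions the credential format; version 1 uses a bearer token.
    assert profile["shareCredentialsVersion"] == 1
    return profile

def table_coordinate(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' coordinate accepted by
    delta_sharing.load_as_pandas() and the Spark 'deltaSharing' reader."""
    return f"{profile_path}#{share}.{schema}.{table}"

# Hypothetical profile a data provider might issue to a recipient agency.
PROFILE = """{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.gov/delta-sharing/",
  "bearerToken": "<redacted>"
}"""

profile = load_profile(PROFILE)
coord = table_coordinate("gov.share", "census", "population", "by_region")
print(profile["endpoint"])
print(coord)
```

With the `delta-sharing` package installed, `delta_sharing.load_as_pandas(coord)` would fetch the shared table as a pandas DataFrame regardless of how the provider stores it — which is precisely what makes the recipient format-agnostic.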
• Cultural and organizational barriers — These challenges can be summarized by one word: friction. Unfortunately, it’s a common problem for civil servants to struggle to obtain access to both internal and external data due to over-cumbersome processes, policies and outdated standards. The principles we use to build our data platforms and our data sharing platforms have to be self-promoting, have to drive adoption and have to generate habits that adhere to best practices.

If there is friction with standard adoption, the only way to ensure standards are respected is by enforcement, and that itself is yet another barrier to achieving data sustainability. Organizations have already adopted Delta Sharing in both the private and public sectors. For example, U.S. Citizenship and Immigration Services (USCIS) uses Delta Sharing to satisfy several interagency data-sharing requirements. Similarly, Nasdaq describes Delta Sharing as the “future of financial data sharing,” and that future is open and governed.

• Technical challenges — Federation at the government scale, or even further across multiple industries and geographies, poses technical challenges. Each organization within this federation owns its platform and drives technological, architectural, platform and tooling choices.

How can we promote interoperability and data exchange in this vast, diverse technological ecosystem? The data is the only viable integration vehicle. As long as the data formats we utilize are scalable, open and governed, we can use them to abstract from individual platforms and their intrinsic complexities.

The Delta format and Delta Sharing solve this wide array of requirements and challenges in a scalable, robust and open way. This positions Delta Sharing as the strongest choice for unifying and simplifying the protocol and mechanism through which we share data across both the private and public sectors.
For more details on the industry use cases that Delta Sharing is addressing, please consult the A New Approach to Data Sharing eBook.

Ready-to-Use Notebooks and Data Sets

This section includes several Solution Accelerators — free, ready-to-use examples of data solutions from different industries ranging from retail to manufacturing and healthcare. Each of the following scenarios includes notebooks with code and step-by-step instructions to help you get started. Get hands-on experience with the Databricks Lakehouse Platform by trying the following for yourself:

Digital Twins
Leverage digital twins — virtual representations of devices and objects — to optimize operations and gain insights
Explore the Solution

Case Studies
4.1 Akamai
4.2 Grammarly
4.3 Honeywell
4.4 Wood Mackenzie
4.5 Rivian
4.6 AT&T
SECTION 4.1

<1 min ingestion time, reduced from 15 min
<85% of queries have a response time of 7 seconds or less

INDUSTRY
Technology and Software

SOLUTION
Threat Detection

PLATFORM USE CASE
Delta Lake, Data Streaming, Photon, Databricks SQL

CLOUD
Azure

Akamai runs a pervasive, highly distributed content delivery network (CDN). Its CDN uses approximately 345,000 servers in more than 135 countries and over 1,300 networks worldwide to route internet traffic for some of the largest enterprises in media, commerce, finance, retail and many other industries. About 30% of the internet’s traffic flows through Akamai servers. Akamai also provides cloud security solutions.

In 2018, the company launched a web security analytics tool that offers Akamai customers a single, unified interface for assessing a wide range of streaming security events and performing analysis of those events. The web analytics tool helps Akamai customers take informed actions in relation to security events in real time. Akamai is able to stream massive amounts of data and meet the strict SLAs it provides to customers by leveraging Delta Lake and the Databricks Lakehouse Platform for the web analytics tool.
Ingesting and streaming enormous amounts of data

Akamai’s web security analytics tool ingests approximately 10GB of data related to security events per second. Data volume can increase significantly when retail customers conduct a large number of sales — or on big shopping days like Black Friday or Cyber Monday. The web security analytics tool stores several petabytes of data for analysis purposes. Those analyses are performed to protect Akamai’s customers and provide them with the ability to explore and query security events on their own.

The web security analytics tool initially relied on an on-premises architecture running Apache Spark™ on Hadoop. Akamai offers strict service level agreements (SLAs) to its customers of 5 to 7 minutes from when an attack occurs until it is displayed in the tool. The company sought to improve ingestion and query speed to meet those SLAs. “Data needs to be as real-time as possible so customers can see what is attacking them,” says Tomer Patel, Engineering Manager at Akamai. “Providing queryable data to customers quickly is critical. We wanted to move away from on-prem to improve performance and our SLAs so the latency would be seconds rather than minutes.”

After conducting proofs of concept with several companies, Akamai chose to base its streaming analytics architecture on Spark and the Databricks Lakehouse Platform. “Because of our scale and the demands of our SLA, we determined that Databricks was the right solution for us,” says Patel. “When we consider storage optimization and data caching, if we went with another solution, we couldn’t achieve the same level of performance.”

Improving speed and reducing costs

Today, the web security analytics tool ingests and transforms data, stores it in cloud storage, and sends the location of the file via Kafka. It then uses a Databricks Job as the ingest application. Delta Lake, the open source storage format at the base of the Databricks Lakehouse Platform, supports real-time querying on the web security analytics data. Delta Lake also enables Akamai to scale quickly. “Delta Lake allows us to not only query the data better but to also acquire an increase in the data volume,” says Patel. “We’ve seen an 80% increase in traffic and data in the last year, so being able to scale fast is critical.”

Akamai also uses Databricks SQL (DBSQL) and Photon, which provide extremely fast query performance. Patel added that Photon provided a significant boost to query performance. Overall, Databricks’ streaming architecture combined with DBSQL and Photon enables Akamai to achieve real-time analytics, which translates to real-time business benefits.

“Delta Lake allows us to not only query the data better but to also acquire an increase in the data volume. We’ve seen an 80% increase in traffic and data in the last year, so being able to scale fast is critical.”
Tomer Patel
Engineering Manager, Akamai
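The ingest pattern Patel describes — landing event files in cloud storage and publishing only each file’s location to Kafka, with a Databricks Job consuming those pointers — can be sketched in plain Python. This is an illustrative simulation of the pattern, not Akamai’s code; the message fields and storage paths are hypothetical:

```python
import json

def parse_pointer(message: str) -> str:
    """Extract the cloud-storage path from a Kafka message that carries a
    pointer to a newly landed file rather than the event payload itself."""
    return json.loads(message)["path"]

def plan_ingest(messages, already_loaded):
    """Return distinct, not-yet-loaded file paths in arrival order.
    Deduplication keeps the ingest idempotent if Kafka redelivers."""
    plan, seen = [], set(already_loaded)
    for msg in messages:
        path = parse_pointer(msg)
        if path not in seen:
            seen.add(path)
            plan.append(path)
    return plan

# Hypothetical pointer messages as the ingest job might consume them.
msgs = [
    '{"path": "abfss://events/2023/01/file-001.json.gz"}',
    '{"path": "abfss://events/2023/01/file-002.json.gz"}',
    '{"path": "abfss://events/2023/01/file-001.json.gz"}',  # redelivered duplicate
]
batch = plan_ingest(msgs, already_loaded=set())
print(batch)
```

In the real pipeline the resulting batch would be read by Spark and appended to a Delta table; shipping pointers instead of payloads through Kafka is one way a pipeline can absorb on the order of 10GB of events per second.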
Patel says he likes that Delta Lake is open source, as the company has benefitted from a community of users working to improve the product. “The fact that Delta Lake is open source and there’s a big community behind it means we don’t need to implement everything ourselves,” says Patel. “We benefit from fixed bugs that others have encountered and from optimizations that are contributed to the project.” Akamai worked closely with Databricks to ensure Delta Lake can meet the scale and performance requirements Akamai defined. These improvements have been contributed back to the project (many of which were made available as part of Delta Lake 2.0), so any user running Delta Lake now benefits from the technology being tested at such a large scale in a real-world production scenario.

Using the Databricks Lakehouse Platform, it now takes under 1 minute to ingest the security event data. “Reducing ingestion time from 15 minutes to under 1 minute is a huge improvement,” says Patel. “It benefits our customers because they can see the security event data faster and they have a view of what exactly is happening as well as the capability to filter all of it.”

Meeting aggressive requirements for scale, reliability and performance

Using Spark Structured Streaming on the Databricks Lakehouse Platform enables the web security analytics tool to stream vast volumes of data and provide low-latency, real-time analytics-as-a-service to Akamai’s customers. That way, Akamai is able to make security event data available to customers within the SLA of 5 to 7 minutes from when an attack occurs. “Our focus is performance, performance, performance,” says Patel. “The platform’s performance and scalability are what drives us.”
SECTION 4.2

110% faster querying, at 10% of the cost to ingest, than a data warehouse
5 billion daily events available for analytics in under 15 minutes

CLOUD
AWS

…50,000 teams worldwide write more effectively every day.
Historically, Grammarly relied on a homegrown legacy analytics platform and leveraged an in-house SQL-like language that was time-intensive to learn and made it challenging to onboard new hires. As the company grew, Grammarly data analysts found that the platform did not sufficiently meet the needs of its essential business functions, especially marketing, sales and customer success. Analysts found themselves copying and pasting data from spreadsheets because the existing system couldn’t effectively ingest the external data needed to answer questions such as, “Which marketing channel delivers the highest ROI?” Reporting proved challenging because the existing system didn’t support Tableau dashboards, and company leaders and analysts needed to ensure they could make decisions quickly and confidently.

Grammarly also sought to unify its data warehouses in order to scale and improve data storage and query capabilities. As it stood, large Amazon EMR clusters ran 24/7 and drove up costs. With the various data sources, the team also needed to maintain access control. “Access control in a distributed file system is difficult, and it only gets more complicated as you ingest more data sources,” says Chris Locklin, Engineering Manager, Data Platforms at Grammarly. Meanwhile, reliance on a single streaming workflow made collaboration among teams challenging. Data silos emerged as different business areas implemented analytics tools individually. “Every team decided to solve their analytics needs in the best way they saw fit,” says Locklin. “That created challenges in consistency and knowing which data set was correct.”
As its data strategy was evolving, Grammarly’s priority was to get the most out of analytical data while keeping it secure. This was crucial because security is Grammarly’s number-one priority and most important feature, both in how it protects its users’ data and how it ensures its own company data remains secure. To accomplish that, Grammarly’s data platform team sought to consolidate data and unify the company on a single platform. That meant sustaining a highly secure infrastructure that could scale alongside the company’s growth, improving ingestion flexibility, reducing costs and fueling collaboration.

“Security is of utmost importance at Grammarly, and our team’s number-one objective is to own and protect our analytical data,” says Locklin. “Other companies ask for your data, hold it for you, and then let you perform analytics on it. Just as Grammarly ensures our users’ data always remains theirs, we wanted to ensure our company data remained ours. Grammarly’s data stays inside of Grammarly.”

Improving analytics, visualization and decision-making with the lakehouse

After conducting several proofs of concept to enhance its infrastructure, Grammarly migrated to the Databricks Lakehouse Platform. Bringing all the analytical data into the lakehouse created a central hub for all data producers and consumers across Grammarly, with Delta Lake at the core.

Using the lakehouse architecture, data analysts within Grammarly now have a consolidated interface for analytics, which leads to a single source of truth and confidence in the accuracy and availability of all data managed by the data platform team. Across the organization, teams are using Databricks SQL to conduct queries within the platform on both internally generated product data and external data from digital advertising platform partners. Now, they can easily connect to Tableau and create dashboards and visualizations to present to executives and key stakeholders.

With its data consolidated in the lakehouse, different areas of Grammarly’s business can now analyze data more thoroughly and effectively. For example, Grammarly’s marketing team uses advertising to attract new business. Using Databricks, the team can consolidate data from various sources to extrapolate a user’s lifetime value, compare it with customer acquisition costs and get rapid feedback on campaigns. Elsewhere, data captured from user interactions flows into a set of tables used by analysts for ad hoc analysis to inform and improve the user experience.

By consolidating data onto one unified platform, Grammarly has eliminated data silos. “The ability to bring all these capabilities, data processing and analysis under the same platform using Databricks is extremely valuable,” says Sergey Blanket, Head of Business Intelligence at Grammarly. “Doing everything from ETL and engineering to analytics and ML under the same umbrella removes barriers and makes it easy for everyone to work with the data and each other.”
To manage access control, enable end-to-end observability and monitor data quality, Grammarly relies on the data lineage capabilities within Unity Catalog. “Data lineage allows us to effectively monitor usage of our data and ensure it upholds the standards we set as a data platform team,” says Locklin. “Lineage is the last crucial piece for access control. It allows analysts to leverage data to do their jobs while adhering to all usage standards and access controls, even when recreating tables and data sets in another environment.”

Moving to the lakehouse architecture also solved the challenge of access control over distributed file systems, while Unity Catalog enabled fine-grained, role-based access controls and real-time data lineage. “Unity Catalog gives us the ability to manage file permissions with more flexibility than a database would allow,” says Locklin. “It solved a problem my team couldn’t solve at scale. While using Databricks allows us to keep analytical data in-house, Unity Catalog helps us continue to uphold the highest standards of data protection by controlling access paradigms inside our data. That opens a whole new world of things that we can do.”

Faster time to insight drives more intelligent business decisions

Using the Databricks Lakehouse Platform, Grammarly’s engineering teams now have a tailored, centralized platform and a consistent data source across the company, resulting in greater speed and efficiency and reduced costs. The lakehouse architecture has led to 110% faster querying, at 10% of the cost to ingest, than a data warehouse. Grammarly can now make its 5 billion daily events available for analytics in under 15 minutes rather than 4 hours, enabling low-latency data aggregation and query optimization. This allows the team to quickly receive feedback about new features being rolled out and understand if they are being adopted as expected. Ultimately, it helps them understand how groups of users engage with the UX, improving the experience and ensuring features and product releases bring the most value to users. “Everything my team does is focused on creating a rich, personalized experience that empowers people to communicate more effectively and achieve their potential,” says Locklin.

Ultimately, migrating to the Databricks Lakehouse Platform has helped Grammarly to foster a data-driven culture where employees get fast access to analytics without having to write complex queries, all while maintaining Grammarly’s enterprise-grade security practices. “Our team’s mission is to help Grammarly make better, faster business decisions,” adds Blanket. “My team would not be able to effectively execute on that mission if we did not have a platform like Databricks available to us.” Perhaps most critically, migrating off its rigid legacy infrastructure gives Grammarly the adaptability to do more while knowing the platform will evolve as its needs evolve. “Databricks has given us the flexibility to unleash our data without compromise,” says Locklin. “That flexibility has allowed us to speed up analytics to a pace we’ve never achieved before.”
SECTION 4.3

INDUSTRY
Manufacturing

PLATFORM USE CASE
Lakehouse, Delta Lake, Delta Live Tables

CLOUD
Azure

Companies are under growing pressure to reduce energy use, while at the same time they are looking to lower costs and improve efficiency. Honeywell delivers industry-specific solutions that include aerospace products and services, control technologies for buildings and industry, and performance materials globally. Honeywell’s Energy and Environmental Solutions division uses IoT sensors and other technologies to help businesses worldwide manage energy demand, reduce energy consumption and carbon emissions, optimize indoor air quality, and improve occupant well-being.

Accomplishing this requires Honeywell to collect vast amounts of data. Using Delta Live Tables on the Databricks Lakehouse Platform, Honeywell’s data team can now ingest billions of rows of sensor data into Delta Lake and automatically build SQL endpoints for real-time queries and multilayer insights into data at scale — helping Honeywell improve how it manages data and extract more value from it, both for itself and for its customers.

“Databricks helps us pull together many different data sources, do aggregations, and bring the significant amount of data we collect from our buildings under control so we can provide customers value.”
Processing billions of IoT data points per day

Honeywell’s solutions and services are used in millions of buildings around the world. Helping its customers create buildings that are safe, more sustainable and productive can require thousands of sensors per building. Those sensors monitor key factors such as temperature, pressure, humidity and air quality. In addition to the data collected by sensors inside a building, data is also collected from outside, such as weather and pollution data. Another data set consists of information about the buildings themselves — such as building type, ownership, floor plan, square footage of each floor and square footage of each room. That data set is combined with the two disparate data streams, adding up to a lot of data across multiple structured and unstructured formats, including images and video streams, telemetry data, event data, etc. At peak, Honeywell ingests anywhere between 200 and 1,000 events per second for any building, which equates to billions of data points per day. Honeywell’s existing data infrastructure was challenged to meet such demand. It also made it difficult for Honeywell’s data team to query and visualize its disparate data so it could provide customers with fast, high-quality information and analysis.

ETL simplified: high-quality, reusable data pipelines

With Delta Live Tables (DLT) on the Databricks Lakehouse Platform, Honeywell’s data team can now ingest billions of rows of sensor data into Delta Lake and automatically build SQL endpoints for real-time queries and multilayer insights into data at scale. “We didn’t have to do anything to get DLT to scale,” says Dr. Chris Inkpen, Global Solutions Architect at Honeywell Energy and Environmental Solutions. “We give the system more data, and it copes. Out of the box, it’s given us the confidence that it will handle whatever we throw at it.”

Honeywell credits the Databricks Lakehouse Platform for helping it to unify its vast and varied data — batch, streaming, structured and unstructured — into one platform. “We have many different data types. The Databricks Lakehouse Platform allows us to use things like Apache Kafka and Auto Loader to load and process multiple types of data and treat everything as a stream of data, which is awesome. Once we’ve got structured data from unstructured data, we can write standardized pipelines.”

Honeywell data engineers can now build and leverage their own ETL pipelines with Delta Live Tables and gain insights and analytics quickly. ETL pipelines can be reused regardless of environment, and data can run in batches or streams. It’s also helped Honeywell’s data team transition from a small team to a larger one. “When we wrote our first few pipelines before DLT existed, only one person could work in one part of the functionality. Now that we’ve got DLT and the ability to have folders with common functionality, we’ve got a really good platform where we can easily spin off different pipelines.”

DLT also helped Honeywell establish standard log files to monitor and cost-justify its product pipelines. “Utilizing DLT, we can analyze which parts of our pipeline need optimization,” says Inkpen. “With standard pipelines, that was much more chaotic.”
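In a DLT pipeline, each table is declared as a decorated function, and data quality rules (“expectations”) drop or quarantine rows that fail them while recording per-rule metrics. The toy sketch below reimplements that idea in plain Python so it runs anywhere; the decorator, rule name and sample sensor readings are hypothetical stand-ins for DLT’s real `@dlt.table` and `@dlt.expect_or_drop` decorators:

```python
def expect_or_drop(rule_name, predicate):
    """Decorator mimicking a DLT expectation: rows failing the predicate are
    dropped, and a count of dropped rows is recorded for monitoring."""
    def wrap(table_fn):
        def inner():
            rows = table_fn()
            kept = [r for r in rows if predicate(r)]
            inner.metrics = {rule_name: len(rows) - len(kept)}  # dropped-row count
            return kept
        inner.metrics = {}
        return inner
    return wrap

@expect_or_drop("valid_temperature", lambda r: -40 <= r["temp_c"] <= 85)
def sensor_readings():
    # Hypothetical raw readings as they might arrive from building sensors.
    return [
        {"sensor": "s1", "temp_c": 21.5},
        {"sensor": "s2", "temp_c": 999.0},  # faulty reading, will be dropped
        {"sensor": "s3", "temp_c": 19.0},
    ]

clean = sensor_readings()
print(len(clean), sensor_readings.metrics)
```

In an actual pipeline, the equivalent declaration on a Spark DataFrame gives the per-rule drop counts that feed the standard monitoring logs Inkpen describes.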
Enabling ease, simplicity and scalability across the infrastructure

Delta Live Tables has helped Honeywell’s data team consistently query complex data while offering simplicity of scale. It also enables end-to-end data visualization of Honeywell’s data streams as they flow into its infrastructure, are transformed, and then flow out. “Ninety percent of our ETL is now captured in diagrams, so that’s helped considerably and improves data governance. DLT encourages — and almost enforces — good design,” says Inkpen.

Using the lakehouse as a shared workspace has helped promote teamwork and collaboration at Honeywell. “The team collaborates beautifully now, working together every day to divvy up the pipeline into their own stories and workloads,” says Inkpen.

Meanwhile, the ability to manage streaming data with low latency and better throughput has improved accuracy and reduced costs. “Once we’ve designed something using DLT, we’re pretty safe from scalability issues — certainly a hundred times better than if we hadn’t written it in DLT,” says Inkpen. “We can then go back and look at how we can take a traditional job and make it more performant and less costly. We’re in a much better position to try and do that from DLT.”

Using Databricks and DLT also helps the Honeywell team perform with greater agility, which allows them to innovate faster while empowering developers to respond to user requirements almost immediately. “Our previous architecture made it impossible to know what bottlenecks we had and what we needed to scale. Now we can do data science in near real-time.”

Ultimately, Honeywell can now more quickly provide its customers with the data and analysis they need to make their buildings more efficient, healthier and safer for occupants. “I’m continuously looking for ways to improve our lifecycles, time to market, and data quality,” says Inkpen. “Databricks helps us pull together many different data sources, do aggregations, and bring the significant amount of data we collect from our buildings under control so we can provide customers value.”

Ready to get started? Learn more about Delta Live Tables here.
SECTION 4.4

INDUSTRY
Energy and Utilities

PLATFORM USE CASE
Lakehouse, Workflows

CLOUD
AWS

Wood Mackenzie offers customized consulting and analysis for a wide range of clients in the energy and natural resources sectors. Founded in Edinburgh, the company first cultivated deep expertise in upstream oil and gas, then broadened its focus to deliver detailed insight for every interconnected sector of the energy, chemicals, metals and mining industries.

Today it sees itself playing an important role in the transition to a more sustainable future. Using Databricks Workflows to automate ETL pipelines helps Wood Mackenzie ingest and process massive amounts of data. Using a common workflow provided higher visibility to engineering team members, encouraging better collaboration. With an automated, transparent workflow in place, the team saw improved productivity and data quality and an easier path to fix pipeline issues when they arise.
Delivering insights to the energy industry

Fulfilling Wood Mackenzie’s mission, the Lens product is a data analytics platform
built to deliver insights at key decision points for customers in the energy sector.
Feeding into Lens are vast amounts of data collected from various data sources
and sensors used to monitor energy creation, oil and gas production, and more.
Those data sources update about 12 billion data points every week that must
be ingested, cleaned and processed as part of the input for the Lens platform.
Yanyan Wu, Vice President of Data at Wood Mackenzie, manages a team of big
data professionals that builds and maintains the ETL pipeline that provides input
data for Lens. The team leverages the Databricks Lakehouse Platform and uses
Apache Spark™ for parallel processing, which provides greater performance and
scalability compared to the earlier single-node system that worked sequentially.
“We saw a reduction of 80-90% in data processing time, which results in us
providing our clients with more up-to-date, more complete and more accurate
data,” says Wu.

Our mission is to transform the way we power the planet.
Our clients in the energy sector need data, consulting services
and research to achieve that transformation. Databricks
Workflows gives us the speed and flexibility to deliver the
insights our clients need.

Yanyan Wu
Vice President of Data, Wood Mackenzie

Improved collaboration and transparency with a common workflow

The data pipeline managed by the team includes several stages for standardizing
and cleaning raw data, which can be structured or unstructured and may be in
the form of PDFs or even handwritten notes.

Different members of the data team are responsible for different parts of
the pipeline, and there are dependencies between the processing stages each
team member owns. Using Databricks Workflows, the team defined a common
workstream that the entire team uses. Each stage of the pipeline is implemented
in a Python notebook, which is run as a job in the main workflow.

Each team member can now see exactly what code is running at each stage,
making it easy to find the cause of any issue. Knowing who owns the part of the
pipeline that originated the problem makes fixing issues much faster. “Without
a common workflow, different members of the team would run their notebooks
independently, not knowing that a failure in their run affected stages downstream,”
says Meng Zhang, Principal Data Analyst at Wood Mackenzie. “When trying to
rerun notebooks, it was hard to tell which notebook version was initially run and
which was the latest version to use.”
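A pipeline like the one described, with each stage owned by a different team member and run as a notebook task with explicit dependencies, can be sketched as a multi-task job in the Databricks Jobs API 2.1 payload shape. This is a minimal illustration, not Wood Mackenzie's actual configuration; the job name, task keys and notebook paths are hypothetical placeholders.

```python
# Sketch of a multi-task Databricks Workflows job in which each pipeline
# stage is a notebook task that depends on the previous stage.
# All names and paths below are illustrative placeholders.

def make_notebook_task(task_key, notebook_path, depends_on=None):
    """Build one task entry in the Jobs API 2.1 payload shape."""
    task = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path},
    }
    if depends_on:
        # depends_on makes the stage ordering explicit, so a downstream
        # failure can be traced back to the owning upstream stage.
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    return task

job_payload = {
    "name": "lens-etl-pipeline",
    "tasks": [
        make_notebook_task("ingest", "/pipelines/01_ingest"),
        make_notebook_task("standardize", "/pipelines/02_standardize",
                           depends_on=["ingest"]),
        make_notebook_task("clean", "/pipelines/03_clean",
                           depends_on=["standardize"]),
    ],
}
```

Because every stage appears as a named task in one shared job definition, each team member can see which notebook runs at which stage and which stages consume their output.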
Using Workflows’ alerting capabilities to notify the team when a workflow task
fails ensures everyone knows a failure occurred and allows the team to work
together to resolve the issue quickly. The definition of a common workflow
created consistency and transparency that made collaboration easier. “Using
Databricks Workflows allowed us to encourage collaboration and break up the
walls between different stages of the process,” explains Wu. “It allowed us all to
speak the same language.”

Creating transparency and consistency is not the only advantage the team saw.
Using Workflows to automate notebook runs also led to cost savings compared
to running interactive notebooks manually. The team can keep developing
notebooks with the interactive notebook UI while leveraging the power of
automation, which reduces the potential issues that may happen when running
notebooks manually.

The team has gone even further in increasing productivity by developing a
CI/CD process. “By connecting our source control code repository, we know
the workflow always runs the latest code version we committed to the repo,”
explains Zhang. “It’s also easy to switch to a development branch to develop a
new feature, fix a bug and run a development workflow. When the code passes
all tests, it is merged back to the main branch and the production workflow is
automatically updated with the latest code.”
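The failure alerting and repo-backed workflow described above map onto two job-level settings in the Databricks Jobs API 2.1 payload shape: `email_notifications` for on-failure alerts and `git_source` for pinning the job to a branch of a source control repository. The sketch below is illustrative only; the email address, repository URL and branch names are hypothetical.

```python
# Sketch: job-level settings for failure alerts and source-controlled code,
# in the Databricks Jobs API 2.1 payload shape. All values are placeholders.

job_settings = {
    "name": "lens-etl-pipeline",
    # Notify the whole team whenever a run fails, so any member can
    # start investigating the failed stage immediately.
    "email_notifications": {
        "on_failure": ["data-team@example.com"],
    },
    # Run notebooks from the repository rather than the workspace, so the
    # job always executes the latest committed code on the given branch.
    "git_source": {
        "git_url": "https://example.com/org/lens-pipeline.git",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
}

def branch_for(environment):
    """Pick the branch a workflow should run from (illustrative helper):
    production tracks main; anything else uses a development branch."""
    return "main" if environment == "production" else "dev"

job_settings["git_source"]["git_branch"] = branch_for("production")
```

Switching the development workflow to a feature branch is then a one-field change, and merging to main updates what the production workflow runs with no manual redeployment.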
SECTION 4.5
Rivian
INDUSTRY
Manufacturing

SOLUTION
Predictive Maintenance, Scaling ML Models for IoT, Data-Driven ESG

PLATFORM
Lakehouse, Delta Lake, Unity Catalog

CLOUD
AWS

Rivian is preserving the natural world for future generations with revolutionary Electric
Adventure Vehicles (EAVs). With over 25,000 EAVs on the road generating multiple
terabytes of IoT data per day, the company is using data insights and machine
learning to improve vehicle health and performance. However, with legacy cloud
tooling, it struggled to scale pipelines cost-effectively and spent significant resources
on maintenance — slowing its ability to be truly data driven.

Since moving to the Databricks Lakehouse Platform, Rivian can now understand how
a vehicle is performing and how this impacts the driver using it. Equipped with these
insights, Rivian is innovating faster, reducing costs, and ultimately, delivering a better
driving experience to customers.
Predicting maintenance issues with Databricks Lakehouse

Rivian chose to modernize its data infrastructure on the Databricks Lakehouse
Platform, giving it the ability to unify all of its data into a common view for
downstream analytics and machine learning. Now, different data teams have
a range of accessible tools to deliver actionable insights for different use
cases, from predictive maintenance to smarter product development. Venkat
Sivasubramanian, Senior Director of Big Data at Rivian, says, “We were able
to build a culture around an open data platform that provided a system for
really democratizing data and analysis in an efficient way.” Databricks’ flexible
support of all programming languages and seamless integration with a variety of
toolsets eliminated access roadblocks and unlocked new opportunities. Wassym
Bensaid, Vice President of Software Development at Rivian, explains, “Today we
have various teams, both technical and business, using Databricks Lakehouse
to explore our data, build performant data pipelines, and extract actionable
business and product insights via visual dashboards.”

Rivian’s ADAS (advanced driver-assistance systems) team can now easily
prepare telemetric accelerometer data to understand all EAV motions. This core
recording data includes information about pitch, roll, speed, suspension and
airbag activity, to help Rivian understand vehicle performance, driving patterns
and connected car system predictability. Based on these key performance
metrics, Rivian can improve the accuracy of smart features and the control
that drivers have over them. Designed to take the stress out of long drives and
driving in heavy traffic, features like adaptive cruise control, lane change assist,
automatic emergency braking, and forward collision warning can be honed over
time to continuously optimize the driving experience for customers.

Secure data sharing and collaboration were also facilitated with Databricks
Unity Catalog. Shiverick describes how unified governance for the lakehouse
benefits Rivian’s productivity. “Unity Catalog gives us a truly centralized data
catalog across all of our different teams,” he said. “Now we have proper access
management and controls.” Venkat adds, “With Unity Catalog, we are centralizing
the data catalog and access management across various teams and workspaces,
which has simplified governance.” End-to-end version-controlled governance
and auditability of sensitive data sources, like the ones used for autonomous
driving systems, produce a simple but secure solution for feature engineering.
This gives Rivian a competitive advantage in the race to capture the autonomous
driving market.
SECTION 4.6
AT&T

300%
ROI from OpEx savings and cost avoidance

3X
Faster delivery of ML/data science use cases

CLOUD
Azure
Hadoop technology adds operational complexity and unnecessary costs

AT&T is a technology giant with hundreds of millions of subscribers and ingests
10+ petabytes of data across the entire data platform each day. To harness
this data, it has a team of 2,500+ data users across 60+ business units to ensure
the business is data powered — from building analytics that ensure decisions are
based on the best data-driven situational awareness to building ML models that
bring new innovations to its customers. To support these requirements, AT&T
needed to democratize its data and establish a single version of truth (SVOT) while
simplifying infrastructure management to increase agility and lower overall costs.

However, the physical infrastructure was too resource intensive. The combination
of a highly complex hardware setup (12,500 data sources and 1,500+ servers)
coupled with an on-premises Hadoop architecture proved complex to
maintain and expensive to manage. Not only were the operational costs to
support workloads high, but there were also additional capital costs around
data centers, licensing and more. Up to 70% of the on-prem platform had to
be prioritized to ensure 50K data pipeline jobs succeeded and met SLAs and
data quality objectives. Engineers’ time was focused on managing updates.

With these deeply rooted technology issues, AT&T was not in the best position
to achieve its goals of increasing its use of insights for improving its customer
experience and operating more efficiently. “To truly democratize data across
the business, we needed to pivot to a cloud-native technology environment,”
said Mark Holcomb, Distinguished Solution Architect at AT&T. “This has freed
up resources that had been focused on managing our infrastructure and move
them up the value chain, as well as freeing up capital for investing in growth-
oriented initiatives.”

A seamless migration journey to Databricks

As part of its due diligence, AT&T ran a comprehensive cost analysis and
concluded that Databricks was the fastest option and delivered the best price/
performance for data pipelines and machine learning workloads. AT&T knew the
migration would be a massive undertaking. As such, the team did a lot of upfront
planning — they prioritized migrating their largest workloads first to immediately
reduce their infrastructure footprint. They also decided to migrate their data
before migrating users to ensure a smooth transition and experience for their
thousands of data practitioners.
They spent a year deduplicating and synchronizing data to the cloud before
migrating any users. This was a critical step in ensuring the successful migration
of such a large, complex multi-tenant environment of 2,500+ users from 60+
business units and their workloads. The user migration process occurred over
nine months and enabled AT&T to retire on-premises hardware in parallel with
the migration to accelerate savings as early as possible. Plus, due to the
horizontally scalable nature of Databricks, AT&T didn’t need to have everything
in one contiguous environment. Separating data and compute across multiple
accounts and workspaces ensured analytics worked seamlessly without any API
call limits or bandwidth issues, with consumption clearly attributed to the 60+
business units.

All in all, AT&T migrated over 1,500 servers, more than 50,000 production CPUs,
12,500 data sources and 300 schemas. The entire process took about two and a
half years. And it was able to manage the entire migration with the equivalent of
15 full-time internal resources. “Databricks was a valuable collaborator throughout
the process,” said Holcomb. “The team worked closely with us to resolve product
features and security concerns to support our migration timeline.”

Databricks reduces TCO and opens new paths to innovation

One of the immediate benefits of moving to Databricks was huge cost savings.
AT&T was able to rationalize about 30% of its data by identifying and not
migrating underutilized and duplicate data. And prioritizing the migration of
the largest workloads allowed half the on-prem equipment to be rationalized
during the course of the migration. “By prioritizing the migration of our most
compute-intensive workloads to Databricks, we were able to significantly drive
down costs while putting us in position to scale more efficiently moving forward,”
explained Holcomb. The result is an anticipated 300% five-year migration ROI
from OpEx savings and cost avoidance (e.g., not needing to refresh data center
hardware).

With data readily available and the means to analyze data at any scale, teams
of citizen data scientists and analysts can now spend more time innovating,
instead of serializing analytics efforts or waiting on engineering to provide the
necessary resources — or having data scientists spend their valuable time
on less complex or less insightful analyses. Data scientists are now able to
collaborate more effectively and speed up machine learning workflows so that
teams can deliver value more quickly, with a 3x faster time to delivery for new
data science use cases.

“Historically you would have had operations in one system and analytics in a
separate one,” said Holcomb. “Now we can do more use cases like operational
analytics in a platform that fosters cross-team collaboration, reduces cost and
improves the consistency of answers.” Since migrating to Databricks, AT&T now
has a single version of truth to create new data-driven opportunities, including
a self-serve AI-as-a-Service analytics platform that will enable new revenue
streams and help it continue delivering exceptional innovations to its millions
of customers.
About Databricks
Databricks is the data and AI company. More than 9,000
organizations worldwide — including Comcast, Condé Nast and
over 50% of the Fortune 500 — rely on the Databricks Lakehouse
Platform to unify their data, analytics and AI. Databricks is
headquartered in San Francisco, with offices around the globe.
Founded by the original creators of Apache Spark™, Delta Lake
and MLflow, Databricks is on a mission to help data teams solve
the world’s toughest problems. To learn more, follow Databricks on
Twitter, LinkedIn and Facebook.
START YOUR FREE TRIAL
© Databricks 2023. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use