Transform Data with Spark
Module 02
1. Extract data from a variety of file formats and data sources using Spark
2. Apply a number of common transformations to clean data using Spark
3. Reshape and manipulate complex data using advanced built-in functions in Spark
4. Leverage UDFs for reusable code and apply best practices for performance in Spark
©2023 Databricks Inc. — All rights reserved
Data objects in the Lakehouse
Metastore
  Catalog
    Schema (Database)
      Table (managed table or external table)
      View
      Function
Managed Tables
Metastore
  Catalog
    Schema
      Managed table (data stored in metastore storage)
External Tables
Metastore
Data objects in the Lakehouse
View types: View, Temporary View, Global Temporary View
Extracting Data
SELECT
user_id, event_timestamp, event_name,
explode(items) AS item
FROM events
explode() puts each element of the items array into its own row. In the slide's example, an items array with 3 elements produces 3 rows, with the other selected columns repeated on each row.
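The row-per-element behavior of explode() can be modeled in plain Python, with no Spark installation needed. This is only a sketch of the semantics, not Spark's implementation: the helper `explode_rows` and the sample event are hypothetical, and the PySpark call shown in the comment is the real counterpart.

```python
# Plain-Python model of explode(); it mirrors the semantics only.
# PySpark counterpart (real API):
#   df.select("user_id", "event_name", explode("items").alias("item"))

def explode_rows(rows, array_col, out_col):
    """Return one output row per element of row[array_col].

    Rows whose array is None or empty produce no output rows,
    which matches explode() (explode_outer() would keep them).
    """
    exploded = []
    for row in rows:
        # Every column except the array is repeated on each output row.
        base = {k: v for k, v in row.items() if k != array_col}
        for element in row[array_col] or []:
            exploded.append({**base, out_col: element})
    return exploded


# Hypothetical sample data: one event carrying a 3-element items array,
# and one event with an empty array.
events = [
    {"user_id": "u1", "event_name": "cart",
     "items": [{"item_id": "A"}, {"item_id": "B"}, {"item_id": "C"}]},
    {"user_id": "u2", "event_name": "main", "items": []},
]

exploded = explode_rows(events, "items", "item")
for row in exploded:
    print(row)
```

The event for `u2` disappears from the result because its array is empty, which is worth keeping in mind when row counts matter.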
SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
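To see what each function in the query above contributes, the aggregation can be traced in plain Python. This is a hedged sketch: the helpers `flatten`, `array_distinct`, and `history_by_user` and the sample rows are hypothetical stand-ins, and `item_ids` stands for the array that `items.item_id` would produce. Spark does not guarantee the element order of collect_set, whereas this model keeps first-seen order for readability.

```python
from collections import defaultdict


def flatten(nested):
    """Turn an array of arrays into a single array (one level only)."""
    return [x for inner in nested for x in inner]


def array_distinct(values):
    """Remove duplicates, keeping the first occurrence of each value."""
    return list(dict.fromkeys(values))


def history_by_user(events):
    names = defaultdict(list)
    carts = defaultdict(list)
    for e in events:
        names[e["user_id"]].append(e["event_name"])
        # Tuples are hashable, so whole arrays can be de-duplicated.
        carts[e["user_id"]].append(tuple(e["item_ids"]))

    result = {}
    for user in names:
        event_history = array_distinct(names[user])   # collect_set(event_name)
        cart_arrays = array_distinct(carts[user])     # collect_set(items.item_id)
        # Distinct arrays can still share elements, hence the second pass.
        cart_history = array_distinct(flatten(cart_arrays))
        result[user] = {"event_history": event_history,
                        "cart_history": cart_history}
    return result


# Hypothetical sample rows
events = [
    {"user_id": "u1", "event_name": "cart", "item_ids": ["A", "B"]},
    {"user_id": "u1", "event_name": "cart", "item_ids": ["A", "B"]},
    {"user_id": "u1", "event_name": "purchase", "item_ids": ["B", "C"]},
    {"user_id": "u2", "event_name": "main", "item_ids": []},
]

result = history_by_user(events)
print(result)
```

The point of the sketch is the layering: collect_set de-duplicates whole arrays, flatten merges them, and array_distinct is still needed afterwards because different arrays can contain the same item.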
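collect_set vs. collect_list: df.agg(collect_set('age')) returns the distinct values of age, while df.agg(collect_list('age')) returns every value of age, duplicates included.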
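The difference between the two aggregates on an age column can be checked with a small plain-Python model. The functions below are hypothetical stand-ins that only imitate the Spark functions of the same name, and the `ages` values are made up. Both Spark aggregates skip nulls, which the model imitates by dropping `None`.

```python
# Hypothetical values of an age column, including a duplicate and a null.
ages = [25, 30, 25, None, 41]


def collect_list(values):
    """Keep every non-null value, duplicates included."""
    return [v for v in values if v is not None]


def collect_set(values):
    """Keep each non-null value once (Spark does not guarantee the order)."""
    return list(dict.fromkeys(collect_list(values)))


all_ages = collect_list(ages)
distinct_ages = collect_set(ages)
print(all_ages)
print(distinct_ages)
```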
from_json uses the schema returned by schema_of_json to convert a column of JSON strings into structs. schema_of_json infers that schema from one sample JSON string, here taken from the value field of a single row of data.
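The two-step pattern (infer a schema from one sample string, then apply it to every string in the column) can be sketched with Python's standard json module. This is a simplified model under stated assumptions, not Spark's behavior in detail: the functions below are hypothetical stand-ins that share names with the Spark functions, the type names are only illustrative, the sample values are made up, and returning `None` for unparseable input is a choice of this sketch.

```python
import json


def schema_of_json(sample):
    """Infer a nested type description from one sample JSON string."""
    def infer(value):
        if isinstance(value, dict):
            return {k: infer(v) for k, v in value.items()}
        if isinstance(value, list):
            return [infer(value[0])] if value else ["string"]
        if isinstance(value, bool):      # check bool before int
            return "boolean"
        if isinstance(value, int):
            return "bigint"
        if isinstance(value, float):
            return "double"
        return "string"
    return infer(json.loads(sample))


def from_json(text, schema):
    """Parse text into the shape given by schema; absent fields become None."""
    def shape(value, sch):
        if isinstance(sch, dict):
            if not isinstance(value, dict):
                return None
            return {k: shape(value.get(k), s) for k, s in sch.items()}
        if isinstance(sch, list):
            if not isinstance(value, list):
                return None
            return [shape(v, sch[0]) for v in value]
        return value
    try:
        return shape(json.loads(text), schema)
    except json.JSONDecodeError:
        return None  # assumption of this sketch


# Hypothetical contents of a value column holding JSON strings
values = [
    '{"device": "Linux", "geo": {"city": "Houston", "state": "TX"}, '
    '"items": [{"item_id": "A", "quantity": 1}]}',
    '{"device": "macOS", "geo": {"city": "Boston"}}',
]

schema = schema_of_json(values[0])              # infer from a single row
parsed = [from_json(v, schema) for v in values]  # apply to the whole column
print(schema)
print(parsed[1])
```

Because the schema comes from a single row, fields that appear only in other rows are not captured, and fields missing from a given row come back empty. Choosing a representative sample string therefore matters.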