
Transform Data with Spark
Module 02


Module Objectives
Transform Data with Spark

1. Extract data from a variety of file formats and data sources using Spark
2. Apply common transformations to clean data using Spark
3. Reshape and manipulate complex data using advanced built-in functions in Spark
4. Leverage UDFs for reusable code and apply best practices for performance in Spark


Module Agenda
Transform Data with Spark

Data Objects in the Lakehouse
DE 2.1 - Querying Files Directly
DE 2.2 - Options for External Sources
DE 2.3L - Extract Data Lab
DE 2.4 - Cleaning Data
DE 2.5 - Complex Transformations
DE 2.6L - Reshape Data Lab
DE 2.7A - SQL UDFs and Control Flow
DE 2.7B - Python UDFs
Data Objects in the Lakehouse


Data objects in the Lakehouse

Metastore → Catalog → Schema (Database) → Table | View | Function

• The metastore is the top-level container; it holds catalogs
• A catalog contains schemas (databases)
• A schema contains tables, views, and functions
• Tables are either managed or external (detailed on the next slides)
Managed Tables

• A managed table is registered under Metastore → Catalog → Schema
• Its data files are stored in metastore-managed storage
External Tables

• An external table is registered under Metastore → Catalog → Schema, but its data files live in external storage
• Access to that storage is configured through a storage credential and an external location (see the sketch below)
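A minimal sketch of the difference (table names and the LOCATION path are illustrative):

-- Managed: data files go to metastore-managed storage;
-- DROP TABLE deletes the data
CREATE TABLE managed_tbl (id INT, name STRING);

-- External: data files stay at the given path;
-- DROP TABLE leaves them in place
CREATE TABLE external_tbl (id INT, name STRING)
LOCATION '/mnt/external/tbl';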
Data objects in the Lakehouse: Views

A view can be one of three types (see the sketch below):
• View: persisted in a schema, like a table
• Temporary view: scoped to the current Spark session
• Global temporary view: scoped to the cluster, registered in the global_temp schema
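A minimal sketch of the three view types (the sales table is hypothetical):

-- Persisted view, stored in the schema like a table
CREATE VIEW sales_v AS SELECT * FROM sales;

-- Session-scoped; dropped when the Spark session ends
CREATE TEMP VIEW sales_tmp AS SELECT * FROM sales;

-- Cluster-scoped; queried through the global_temp schema
CREATE GLOBAL TEMP VIEW sales_gtmp AS SELECT * FROM sales;
SELECT * FROM global_temp.sales_gtmp;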
Data objects in the Lakehouse: Functions

• Function: reusable logic registered in a schema alongside tables and views (SQL and Python UDFs are covered in DE 2.7A/B)
Extracting Data



Query files directly

SELECT * FROM file_format.`path/to/file`

Files can be queried directly using SQL (see the sketch below):
• SELECT * FROM json.`path/to/files/`
• SELECT * FROM text.`path/to/files/`

Processing depends on the specified file format:
• json infers the schema from the underlying data
• binaryFile and text have fixed schemas
• text → a single string value column (one row per line)
• binaryFile → path, modificationTime, length, content columns (one row per file)
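A minimal sketch (the directory path is hypothetical):

-- Schema inferred from the JSON files themselves
SELECT * FROM json.`/mnt/raw/events/`;

-- Same directory read as raw text: a single value column, one row per line
SELECT * FROM text.`/mnt/raw/events/`;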



Configure external tables with read options

CREATE TABLE USING data_source OPTIONS (...)

Many data sources require a schema declaration and other options to read data correctly (see the sketches below):
• CSV: options for delimiter, header, etc.
• JDBC: options for url, user, password, etc.
• Note: the JDBC driver pulls RDBMS tables dynamically for Spark processing
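Two minimal sketches (all table names, paths, and connection values are illustrative):

-- CSV with an explicit schema and read options
CREATE TABLE sales_csv (order_id INT, amount DOUBLE, order_date DATE)
USING CSV
OPTIONS (header = "true", delimiter = "|")
LOCATION '/mnt/raw/sales/';

-- JDBC table pulled dynamically from an external RDBMS
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
  url = "jdbc:postgresql://hostname:5432/db",
  dbtable = "public.users",
  user = "reader",
  password = "..."
);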



DE 2.1: Querying Files Directly

Use Spark SQL to directly query JSON data files
Leverage text and binaryFile methods to review raw file contents


DE 2.2: Providing Options for External Sources

Use Spark SQL to configure options for extracting data from external sources
Create tables against external data sources for various file formats
Describe behavior when querying tables defined against external RDBMS sources


DE 2.3L: Extract Data Lab



DE 2.4: Cleaning Data

Summarize datasets and describe NULL behaviors
Retrieve and remove duplicate records
Validate datasets for expected counts, missing values, and duplicate records
Apply date_format and regexp_extract to clean and transform data (see the sketch below)
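A minimal sketch of the two functions named above (the users table and its columns are hypothetical):

SELECT date_format(first_touch_ts, 'MMM d, yyyy') AS first_touch_date,
       regexp_extract(email, '(?<=@).+', 0) AS email_domain
FROM users;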



Complex Transformations


Interact with Nested Data

Use built-in syntax to traverse nested data with Spark SQL (see the sketch below)

Use ":" (colon) syntax in queries to access subfields in JSON strings

SELECT value:device, value:geo ...

Use "." (dot) syntax in queries to access subfields in STRUCT types

SELECT value.device, value.geo ...
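A short sketch (the events_raw table and its value column are hypothetical):

-- value holds a JSON string, so ":" extracts fields, with "." for nesting
SELECT value:device AS device,
       value:geo.city AS city
FROM events_raw;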



Complex Types

Nested data types storing multiple values:
• Array: an arbitrary number of elements of the same data type
• Map: a set of key-value pairs
• Struct: an ordered (fixed) collection of columns of any data type

Example table with complex types (a field-access sketch follows):

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
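A minimal sketch of reading each complex type from the table above (the map key is illustrative):

SELECT name,
       subordinates[0] AS first_subordinate,    -- array element by index
       deductions['insurance'] AS insurance,    -- map lookup by key
       address.city AS city                     -- struct field by name
FROM employees;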



DE 2.5: Complex Transformations

Use : and . syntax to traverse nested data in strings and structs
Use .* syntax to flatten and query struct types
Parse JSON string fields
Flatten/unpack arrays and structs


explode lab

explode outputs the elements of an array field into a separate row for each element

SELECT
  user_id, event_timestamp, event_name,
  explode(items) AS item
FROM events

[Figure: a single input row whose items array holds three elements is exploded into three output rows, one per element]



flatten lab

collect_set returns an array of unique values from a field for each group of rows
flatten returns an array that flattens multiple arrays into one

SELECT user_id,
  collect_set(event_name) AS event_history,
  array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id

Because items.item_id yields an array of IDs per row, collect_set produces an array of arrays; flatten merges them into one array and array_distinct removes repeated IDs.



Collection example

collect_set returns an array with duplicate elements eliminated
collect_list returns an array with duplicate elements intact

df.agg(collect_set('age'))
df.agg(collect_list('age'))

[Figure: the input DataFrame df and the two aggregation results, shown side by side]
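The same aggregations expressed in SQL (the people table is hypothetical):

SELECT collect_set(age) AS unique_ages,
       collect_list(age) AS all_ages
FROM people;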



Parse JSON strings into structs

Create the schema to parse the JSON strings by providing an example JSON string from a row that has no nulls

from_json uses the JSON schema returned by schema_of_json to convert a column of JSON strings into structs (see the sketch below)

[In the lesson, the example JSON string is taken from the value field of a single row of data]

The result is a STRUCT column containing an ARRAY of nested STRUCTs
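A minimal sketch (the sample string and the events_raw table are illustrative; schema_of_json needs a literal string):

SELECT from_json(
         value,
         schema_of_json('{"device":"macOS","geo":{"city":"SF","state":"CA"}}')
       ) AS parsed
FROM events_raw;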



DE 2.6L: Reshape Data Lab (Optional)


DE 2.7A: SQL UDFs and Control Flow (Optional)


DE 2.7B: Python UDFs (Optional)


