
Scenario Series 19: Working With JSON Files in PySpark Across Different Scenarios
1. Reading Basic JSON Files

Challenge: Loading a single JSON file is straightforward, but you need to ensure the path is
correct.

Solution:

# Load a single JSON file into a DataFrame
df = spark.read.json("path/to/file.json")

# Show the DataFrame
df.show(truncate=False)
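Note: by default, spark.read.json expects JSON Lines input (one JSON object per line). If the file is a single pretty-printed object spanning several lines, enable the multiline option. A minimal sketch, reusing the same placeholder path:

# Read a pretty-printed (multi-line) JSON file
df_multiline = spark.read.option("multiline", "true").json("path/to/file.json")
df_multiline.printSchema()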

2. Handling Multiple JSON Files

Challenge: Reading multiple JSON files from a directory at once.

Solution: Point to the directory with a wildcard path:

# Load multiple JSON files from a directory into a DataFrame
df = spark.read.json("path/to/directory/*.json")

# Show the DataFrame
df.show(truncate=False)
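You can also pass an explicit list of paths instead of a wildcard; a small sketch with placeholder file names:

# Load an explicit list of JSON files into one DataFrame
df = spark.read.json(["path/to/file1.json", "path/to/file2.json"])
df.show(truncate=False)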

3. Handling Deeply Nested JSON

Challenge: Extracting and manipulating deeply nested JSON data can be complex.

Example JSON Data:

{
  "id": 1,
  "name": "John",
  "address": {
    "street": "123 Elm St",
    "city": "Springfield"
  },
  "orders": [
    {"order_id": 1001, "amount": 250.0},
    {"order_id": 1002, "amount": 150.0}
  ]
}

Solution: Flatten the nested structure.

# Load JSON file
df = spark.read.format("json") \
    .option("multiline", "true") \
    .load("dbfs:/FileStore/shared_uploads/Deeply_Nested_JSON-1.json")
df.show(truncate=False)

from pyspark.sql.functions import explode, col

# Flatten nested fields
flattened_df = df.select(
    col("id"),
    col("name"),
    col("address.street"),
    col("address.city"),
    explode(col("orders")).alias("order")
)

flattened_df.show(truncate=False)



# Further flatten order details
flattened_df = flattened_df.select(
"id", "name", "street", "city",
col("order.order_id").alias("order_id"),
col("order.amount").alias("order_amount")
)

flattened_df.show(truncate=False)
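If some records may have a null or empty orders array, explode() silently drops those rows; explode_outer() keeps them with null order fields instead. A hedged variant of the same flattening step:

from pyspark.sql.functions import explode_outer

# Keep records even when "orders" is null or empty
flattened_outer_df = df.select(
    col("id"),
    col("name"),
    explode_outer(col("orders")).alias("order")
)
flattened_outer_df.show(truncate=False)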

4. Handling Arrays and Maps in JSON

Challenge: Working with arrays and maps in JSON data.

Example JSON Data:

{
  "id": 1,
  "tags": ["spark", "big data"],
  "info": {"key": "value", "count": 10}
}



Solution: Explode the array and access the nested fields with dot notation.

# Load JSON file
df1 = spark.read.format("json").option("multiline", "true") \
    .load("dbfs:/FileStore/shared_uploads/[email protected]/complexjson2.json")
df1.display()

from pyspark.sql.functions import explode, col

# Handle arrays
tags_df = df1.withColumn("tag", explode(col("tags")))
tags_df.show(truncate=False)

# Handle maps
info_df = tags_df.select(
col("id"),
col("tag"),
col("info.key").alias("info_key"),
col("info.count").alias("info_count")
)
info_df.show(truncate=False)
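Note that, with schema inference, Spark reads the info object as a struct, which is why dot notation works above. To treat it as a true MapType (for example, when its keys vary between records), you can supply an explicit schema. A sketch assuming the same file, with string values for simplicity:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, MapType

# Explicit schema: "info" as a map instead of an inferred struct
map_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("tags", ArrayType(StringType()), True),
    StructField("info", MapType(StringType(), StringType()), True)
])

df_map = spark.read.schema(map_schema).option("multiline", "true") \
    .json("dbfs:/FileStore/shared_uploads/[email protected]/complexjson2.json")
df_map.select(col("id"), col("info")["key"].alias("info_key")).show(truncate=False)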



5. Handling Malformed JSON in PySpark

To handle malformed JSON and capture corrupt records in PySpark, you can use the
PERMISSIVE mode when reading the JSON file. In this mode, Spark stores any corrupt or
invalid records in a special column called _corrupt_record. This is especially useful for
detecting and handling malformed JSON entries.

Here's an example of how to handle malformed JSON and capture corrupt records. Sample input file (note that the first record is missing its closing brace, so it is malformed):

{"id": 1, "name": "Alice", "age": 30


{"id": 2, "name": "Bob", "age": 25}
{"id": 3, "name": "Charlie", "age": "thirty"}

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Path to the malformed JSON file
json_file_path = "dbfs:/FileStore/shared_uploads/corruptjson-1.json"

# Read the JSON file with PERMISSIVE mode to handle malformed data
df_with_corrupt = spark.read.format("json") \
    .option("mode", "PERMISSIVE") \
    .load(json_file_path)

# Show the DataFrame including the corrupt record column
df_with_corrupt.show(truncate=False)

# Filter the rows where _corrupt_record is not null (i.e., the corrupt records)
corrupt_records_df = df_with_corrupt.filter(col("_corrupt_record").isNotNull())

# Show only the corrupt records
corrupt_records_df.show(truncate=False)

# Filter the rows where _corrupt_record is null (i.e., the valid records)
# and drop the _corrupt_record column
correct_records_df = df_with_corrupt.filter(col("_corrupt_record").isNull()) \
    .drop("_corrupt_record")

# Show only the valid records
correct_records_df.show(truncate=False)
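Two related notes. First, PERMISSIVE is not the only mode: DROPMALFORMED silently drops bad rows, and FAILFAST raises an error on the first malformed record. Second, on recent Spark versions a query that references only the internal _corrupt_record column of a raw JSON source is disallowed; caching the DataFrame first is the documented workaround. A minimal sketch:

# Cache before aggregating on _corrupt_record alone (e.g., counting bad rows)
df_with_corrupt.cache()
corrupt_count = df_with_corrupt.filter(col("_corrupt_record").isNotNull()).count()
print(f"Corrupt records: {corrupt_count}")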

6. Handling Incorrect Schema Inference

Challenge: PySpark may infer incorrect data types.


Example JSON Data:
{
  "id": "1",        # should be an Integer
  "value": "100.5"  # should be a Double
}



from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# Define schema
schema = StructType([
StructField("id", IntegerType(), True),
StructField("value", DoubleType(), True)
])

# Load JSON file with the defined schema
df = spark.read.schema(schema).json("path/to/file.json")

df.show(truncate=False)
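If you prefer not to hand-write a schema, another common approach is to read with the inferred schema and cast the columns afterwards; a small sketch using cast():

from pyspark.sql.functions import col

# Read with inferred types, then cast to the intended ones
df_cast = spark.read.json("path/to/file.json") \
    .withColumn("id", col("id").cast("int")) \
    .withColumn("value", col("value").cast("double"))
df_cast.printSchema()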

7. Handling JSON with Missing or Null Fields


Challenge: Missing or null fields can cause issues in downstream processing.

Example JSON Data:


{
  "id": 1,
  "name": null,
  "age": 28
}

# Load JSON file
df = spark.read.json("path/to/file.json")

# Fill null values
df_filled = df.fillna({"name": "Unknown"})

# Drop rows with null values in specific fields
df_cleaned = df.dropna(subset=["name"])

df_filled.show(truncate=False)
df_cleaned.show(truncate=False)
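A per-column default can also be applied with coalesce(), which keeps the original value when present and substitutes a literal otherwise; a short sketch:

from pyspark.sql.functions import coalesce, col, lit

# Replace null names with a default while leaving other columns untouched
df_defaults = df.withColumn("name", coalesce(col("name"), lit("Unknown")))
df_defaults.show(truncate=False)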



8. Complex Nested Structures Impacting Performance

Challenge: Deeply nested structures can degrade parsing and query performance.


Example JSON Data:
{
  "id": 1,
  "items": [
    {"item_id": 100, "details": {"price": 10.5}},
    {"item_id": 101, "details": {"price": 20.75}}
  ]
}

Solution: Flatten the nested structure early and consider converting to Parquet.

from pyspark.sql.functions import explode, col

# Load JSON file
df = spark.read.json("path/to/file.json")

# Flatten nested structure
flattened_df = df.select(
    col("id"),
    explode(col("items")).alias("item")
).select(
    "id",
    col("item.item_id").alias("item_id"),
    col("item.details.price").alias("price")
)

# Convert to Parquet for better performance
flattened_df.write.mode("overwrite").parquet("path/to/output_directory")

flattened_df.show(truncate=False)
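Downstream jobs can then read the flattened Parquet output directly, which avoids re-parsing the JSON on every run; a minimal sketch with the same placeholder path:

# Read the flattened data back from Parquet for subsequent processing
parquet_df = spark.read.parquet("path/to/output_directory")
parquet_df.show(truncate=False)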

Follow me on LinkedIn – Shivakiran kotur
