Scenario Series 19 - Handling JSON in PySpark
1. Reading Basic JSON Files
Challenge: Loading a single JSON file is straightforward, but you need to ensure the path is correct and that the file layout matches what Spark expects (newline-delimited JSON by default; multi-line documents need the multiLine option).
Solution:
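A minimal sketch, assuming a newline-delimited file at a hypothetical path /data/people.json:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-scenarios").getOrCreate()

# Read a newline-delimited JSON file (the path is illustrative)
df = spark.read.json("/data/people.json")
df.printSchema()
df.show(truncate=False)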
2. Flattening Nested JSON
Challenge: Extracting and manipulating deeply nested JSON data can be complex.
Solution: Select nested fields with dot notation and alias them into flat, top-level columns. Consider a sample record like this:
{
  "id": 1,
  "name": "John",
  "address": {
    "street": "123 Elm St",
    "city": "Springfield"
  }
}
flattened_df.show(truncate=False)
3. Handling Arrays and Maps
Challenge: JSON records often mix array fields with map- or struct-like objects, and each needs different handling.
Solution: Explode arrays into one row per element, and reach into struct fields with dot notation. Sample record:
{
"id": 1,
"tags": ["spark", "big data"],
"info": {"key": "value", "count": 10}
}
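A sketch of loading this record into the DataFrame the snippets below call df1, assuming a hypothetical path /data/tags.json:
from pyspark.sql.functions import col, explode

# multiLine because the sample spans several lines
df1 = spark.read.option("multiLine", True).json("/data/tags.json")
df1.printSchema()

Note that with an inferred schema, Spark reads info as a struct, which is why the dot-notation access below works as written.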
# Handle arrays
tags_df = df1.withColumn("tag", explode(col("tags")))
tags_df.show(truncate=False)
# Handle maps
info_df = tags_df.select(
    col("id"),
    col("tag"),
    col("info.key").alias("info_key"),
    col("info.count").alias("info_count")
)
info_df.show(truncate=False)
4. Handling Malformed JSON
Challenge: A single malformed record can otherwise break an entire read.
Solution: To handle malformed JSON and capture corrupt records in PySpark, read the file in PERMISSIVE mode. In this mode, Spark stores any corrupt or invalid records in a special column called _corrupt_record, which makes malformed entries easy to detect and handle.
Here's an example of reading JSON in permissive mode so corrupt records are kept rather than failing the job.
# Read the JSON file with permissive mode to handle malformed data
df_with_corrupt = spark.read.format("json") \
    .option("mode", "PERMISSIVE") \
    .load(json_file_path)
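Once loaded, the corrupt rows can be separated from the clean ones. A small sketch, assuming the default corrupt-record column name:
from pyspark.sql.functions import col

# Spark disallows queries that touch only _corrupt_record on raw JSON,
# so cache the DataFrame before filtering on that column
df_with_corrupt.cache()

bad_rows = df_with_corrupt.filter(col("_corrupt_record").isNotNull())
good_rows = df_with_corrupt.filter(col("_corrupt_record").isNull())
bad_rows.show(truncate=False)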
5. Enforcing a Schema and Handling Nulls
Defining the schema up front keeps column types predictable; nulls can then be filled or dropped with a clear policy.
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", DoubleType(), True)
])
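A minimal sketch of how the three DataFrames shown below might be produced, assuming a hypothetical path /data/values.json:
# Read with the explicit schema; non-conforming values become null
df = spark.read.schema(schema).json("/data/values.json")

# Replace nulls with defaults
df_filled = df.fillna({"id": 0, "value": 0.0})

# Or drop any row that still contains a null
df_cleaned = df.dropna()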
df.show(truncate=False)
df_filled.show(truncate=False)
df_cleaned.show(truncate=False)
6. Improving Performance on Nested JSON
Challenge: Repeated queries over deeply nested JSON re-parse the raw text on every read.
Solution: Flatten the nested structure early and consider converting to Parquet.
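A sketch of the flatten-then-convert step, reusing the nested sample from scenario 2 (paths are illustrative):
from pyspark.sql.functions import col

# Flatten once, up front
flattened_df = spark.read.option("multiLine", True).json("/data/nested.json").select(
    col("id"),
    col("name"),
    col("address.street").alias("street"),
    col("address.city").alias("city")
)

# Persist the flat shape as Parquet so later reads skip JSON parsing entirely
flattened_df.write.mode("overwrite").parquet("/data/people_flat.parquet")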
flattened_df.show(truncate=False)