PySpark Comprehensive Notes ⚡
INTERVIEW QUESTIONS
ABHINANDAN PATRA
DATA ENGINEER
1. What is Apache Spark?
Apache Spark is an open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It is designed
to process large-scale data efficiently.
Apache Spark is used because it is faster than traditional big data tools like Hadoop
MapReduce due to its in-memory processing capabilities, supports multiple languages (Scala,
Python, R, Java), provides libraries for various tasks (SQL, machine learning, graph
processing, etc.), and has robust fault tolerance.
Spark Core: The foundational engine for large-scale parallel and distributed data
processing.
Spark SQL: For structured data processing.
Spark Streaming: For real-time data processing.
MLlib: A library for scalable machine learning.
GraphX: For graph and graph-parallel computation.
Spark Core is the general execution engine for the Spark platform, responsible for tasks such
as scheduling, distributing, and monitoring applications.
Spark applications can be written in the following languages:
Scala
Python
Java
R
SQL
Spark is better in several ways, including faster processing due to in-memory computation,
ease of use with APIs for various programming languages, flexibility with built-in libraries
for diverse tasks, and a rich set of APIs for transformations and actions.
7. What are the different methods to run Spark over Apache Hadoop?
Spark can run over Hadoop in standalone mode (using HDFS for storage with Spark's own cluster manager), on YARN (in client or cluster deploy mode), or via SIMR (Spark in MapReduce).
SparkContext is the entry point for any Spark application. It acts as a connection to the
Spark cluster, allowing Spark jobs to be executed.
SparkSession is the unified entry point to work with DataFrames, Datasets, and SQL in
Apache Spark. It replaces SQLContext and HiveContext.
RDDs are immutable to provide fault tolerance and support functional programming
principles, allowing Spark to rebuild lost data from the lineage information.
Paired RDDs are RDDs where each element is a pair (key-value). They are used for
operations like aggregation, grouping, and joins.
RDD is an in-memory data structure optimized for processing, while Distributed Storage
(like HDFS) focuses on data storage and retrieval.
16. Explain transformation and action in RDD in Apache Spark.
Transformation: Lazy operations that define a new RDD without executing until an
action is called (e.g., map, filter).
Action: Triggers the execution of transformations (e.g., count, collect).
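A minimal runnable sketch of this distinction (the local SparkSession and the numbers are purely illustrative): the map and filter calls only build up the lineage, and nothing executes until count() is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)             # transformation: lazy, nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)   # transformation: still lazy
print(evens.count())                           # action: triggers execution of the whole lineage -> 2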
A lineage graph tracks the sequence of transformations that created an RDD, used for
recomputing lost data due to node failures.
21. By default, how many partitions are created in RDD in Apache Spark?
By default, Spark creates partitions based on the number of cores available or the input file's
HDFS block size.
DataFrames are distributed collections of data organized into named columns, similar to
tables in a relational database.
Benefits include optimizations (Catalyst query optimizer), improved performance, and easier
manipulation using SQL-like syntax.
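For example (a small sketch with made-up data), the same DataFrame can be queried through the DataFrame API and with SQL:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 3000), ("Bob", 4000)], ["name", "salary"])
df.filter(df.salary > 3500).show()                                    # DataFrame API
df.createOrReplaceTempView("employees")
spark.sql("SELECT name FROM employees WHERE salary > 3500").show()    # SQL on the same data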
Advantages include compile-time type safety, optimizations through Tungsten, and efficient
encoders that avoid generic JVM object serialization.
A DAG in Spark represents a sequence of computations performed on data, where each node
is an RDD and edges represent transformations. It's used to optimize execution plans.
The DAG allows Spark to optimize execution by scheduling tasks efficiently, minimizing
data shuffling, and managing dependencies.
29. What is the difference between Caching and Persistence in Apache Spark?
cache() stores data with the default storage level, whereas persist() lets you choose an explicit StorageLevel (memory only, memory and disk, disk only, serialized, replicated). Apart from that they behave the same.
Write-Ahead Log is a fault-tolerance mechanism where every received data is first written to
a log file (disk) before processing, ensuring no data loss.
Catalyst is Spark SQL's query optimizer that uses rule-based and cost-based optimization
techniques to generate efficient execution plans.
Shared variables are variables that can be used by tasks running on different nodes: broadcast
variables (read-only data shipped to every executor) and accumulators (values that executors
can only add to and the driver reads).
Spark stores metadata like lineage information, partition data, and task details in the driver
and worker nodes, managing it using its DAG scheduler.
MLlib is Spark's scalable machine learning library, which provides algorithms and utilities
for classification, regression, clustering, collaborative filtering, and more.
Linear Regression
Logistic Regression
Decision Trees
Random Forests
Gradient-Boosted Trees
K-Means Clustering
Lazy evaluation defers execution until an action is performed, optimizing the execution plan
by reducing redundant computations.
Benefits include an optimized execution plan (DAG), pipelining of transformations, and the
avoidance of unnecessary computation.
Apache Spark is generally up to 100x faster than Hadoop for in-memory processing and up to
10x faster for on-disk data.
44. What are the ways to launch Apache Spark over YARN?
Spark supports two deploy modes on YARN: client mode, where the driver runs on the
submitting machine, and cluster mode, where the driver runs inside the YARN Application Master.
47. How can data transfer be minimized when working with Apache Spark?
By using broadcast variables for lookup data, preferring reduceByKey over groupByKey, and
filtering or aggregating data before it is shuffled.
48. What are the cases where Apache Spark surpasses Hadoop?
Iterative workloads such as machine learning and graph algorithms, interactive/ad-hoc queries,
and near-real-time stream processing, where in-memory computation avoids repeated disk I/O.
49. What is an action, and how does it process data in Apache Spark?
An action (e.g., count, collect, saveAsTextFile) triggers execution of the DAG built from the
preceding transformations and either returns a result to the driver or writes it to external storage.
The Spark Driver is responsible for converting the user's code into tasks, scheduling them on
executors, and collecting the results.
A worker node is a machine in a Spark cluster where the actual data processing tasks are
executed.
Transformations are lazy to build an optimized execution plan (DAG) and to avoid
unnecessary computation.
Yes, Spark can run independently using its built-in cluster manager or other managers like
Mesos and Kubernetes.
An accumulator is a variable used for aggregating information across executors, like counters
in MapReduce.
The Driver program coordinates the execution of tasks, maintains the SparkContext, and
communicates with the cluster manager.
57. How to identify that a given operation is a Transformation or Action in
your program?
Transformations return RDDs (e.g., map, filter), while actions return non-RDD values (e.g.,
collect, count).
58. Name the two types of shared variables available in Apache Spark.
Broadcast Variables
Accumulators
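A small sketch showing both kinds of shared variables (the lookup data and country codes are made up):
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").appName("shared-vars").getOrCreate().sparkContext

countries = sc.broadcast({"IN": "India", "US": "United States"})   # broadcast variable: read-only copy on every executor
misses = sc.accumulator(0)                                          # accumulator: executors add to it, the driver reads it

def lookup(code):
    if code not in countries.value:
        misses.add(1)
    return countries.value.get(code, "unknown")

print(sc.parallelize(["IN", "US", "XX"]).map(lookup).collect())     # ['India', 'United States', 'unknown']
print(misses.value)                                                 # 1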
59. What are the common faults of developers while using Apache Spark?
Calling collect() on large datasets (driver out-of-memory), overusing groupByKey where
reduceByKey would do, not caching data that is reused, leaving the shuffle partition count
untuned, and writing many small output files.
60. By Default, how many partitions are created in RDD in Apache Spark?
The default number of partitions is based on the number of cores available in the cluster or
the HDFS block size.
61. Why do we need compression, and what are the different compression
formats supported?
Compression reduces the storage size of data and speeds up data transfer. Spark supports
several compression formats:
Snappy
Gzip
Bzip2
LZ4
Zstandard (Zstd)
The filter transformation creates a new RDD by selecting only elements that satisfy a given
predicate function.
sortByKey() sorts an RDD of key-value pairs by the key in ascending or descending order.
foreach() applies a function to each element in the RDD, typically used for side effects like
updating an external data store.
groupByKey: Groups values by key and shuffles all data across the network, which
can be less efficient.
reduceByKey: Combines values for each key locally before shuffling, reducing
network traffic.
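For example, both lines below compute per-key sums, but reduceByKey pre-aggregates within each partition before shuffling (the data is made up):
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").appName("rbk-demo").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.groupByKey().mapValues(sum).collect())        # shuffles every value, then sums
print(pairs.reduceByKey(lambda x, y: x + y).collect())    # local partial sums, then shuffles -> [('a', 4), ('b', 2)]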
map is a transformation that applies a function to each element in the RDD, resulting in a new
RDD.
fold() aggregates the elements of an RDD using an associative function and a "zero value"
(an initial value).
textFile(): Reads a text file and creates an RDD of strings, each representing a line.
wholeTextFiles(): Reads entire files and creates an RDD of (filename, content) pairs.
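A quick sketch of the difference (the directory path is a placeholder):
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").appName("read-demo").getOrCreate().sparkContext

lines = sc.textFile("data/logs/")          # RDD of strings: one element per line, across all files
files = sc.wholeTextFiles("data/logs/")    # RDD of (file path, whole file content) pairs, one per file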
cogroup() groups data from two or more RDDs sharing the same key.
pipe() passes each partition of an RDD to an external script or program and returns the
output as an RDD.
coalesce() reduces the number of partitions in an RDD, useful for minimizing shuffling
when reducing the data size.
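For instance, after a selective filter the partition count can be shrunk without a full shuffle (all numbers here are illustrative):
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").appName("coalesce-demo").getOrCreate().sparkContext

rdd = sc.parallelize(range(1000000), 200)       # 200 partitions of raw data
small = rdd.filter(lambda x: x % 1000 == 0)     # far fewer rows, still 200 partitions
compact = small.coalesce(10)                    # merge down to 10 partitions without a full shuffle
print(compact.getNumPartitions())               # 10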
fullOuterJoin() returns an RDD with all pairs of elements for matching keys and null for
non-matching keys from both RDDs.
leftOuterJoin(): Returns all key-value pairs from the left RDD and matching pairs
from the right, filling with null where no match is found.
rightOuterJoin(): Returns all key-value pairs from the right RDD and matching pairs
from the left, filling with null where no match is found.
These operations compute the sum, maximum, and minimum of elements in an RDD,
respectively.
countByValue() returns a map of the counts of each unique value in the RDD.
lookup() returns the list of values associated with a given key in a paired RDD.
saveAsTextFile() saves the RDD content as a text file or set of text files.
reduceByKey() applies a reducing function to the elements with the same key, reducing
them to a single element per key.
flatMap() applies a function that returns an iterable to each element and flattens the results
into a single RDD.
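For example, when splitting lines into words (made-up input), map returns one list per line while flatMap flattens everything into a single RDD of words:
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").appName("flatmap-demo").getOrCreate().sparkContext

lines = sc.parallelize(["hello world", "apache spark"])
print(lines.map(lambda l: l.split(" ")).collect())      # [['hello', 'world'], ['apache', 'spark']]
print(lines.flatMap(lambda l: l.split(" ")).collect())  # ['hello', 'world', 'apache', 'spark']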
Limitations include high memory consumption, not ideal for OLTP (transactional
processing), lack of a mature security framework, and dependency on cluster resources.
Spark SQL is a Spark module for structured data processing, providing a DataFrame API and
allowing SQL queries to be executed.
Transformations include: map, filter, flatMap, distinct, union, reduceByKey, groupByKey, join,
and sortByKey.
Starvation occurs when all tasks are waiting for resources that are occupied by other long-
running tasks, leading to delays or deadlocks.
103. What are the different input sources for Spark Streaming?
Kafka
Flume
Kinesis
Socket
HDFS or S3
Spark Streaming can receive real-time data streams over a socket using
socketTextStream().
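A minimal sketch using the legacy DStream API (the host, port, and 5-second batch interval are arbitrary choices for illustration):
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

sc = SparkSession.builder.master("local[2]").appName("socket-demo").getOrCreate().sparkContext
ssc = StreamingContext(sc, 5)                      # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)    # read text lines from a TCP socket
lines.pprint()                                     # print a sample of each batch
ssc.start()
ssc.awaitTermination()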
The file system manages data storage, access, and security, ensuring data integrity and
availability.
106. How do you parse data in XML? Which kind of class do you use with
Java to parse data?
To parse XML data in Java, you can use classes from the javax.xml.parsers package, such
as:
DocumentBuilder: Used with the Document Object Model (DOM) for in-memory
tree representation.
SAXParser: Used with the Simple API for XML (SAX) for event-driven parsing.
PageRank is an algorithm used to rank web pages in search engine results, based on the
number and quality of links to a page. In Spark, it can be implemented using RDDs or
DataFrames to compute the rank of nodes in a graph.
108. What are the roles and responsibilities of worker nodes in the Apache
Spark cluster? Is the Worker Node in Spark the same as the Slave Node?
Worker Nodes: Execute tasks assigned by the Spark Driver, manage executors, and
store data in memory or disk as required.
Slave Nodes: Worker nodes in Spark are commonly referred to as slave nodes. Both
terms are used interchangeably.
110. On what basis can you differentiate RDD, DataFrame, and DataSet?
RDD: a low-level, untyped distributed collection with no schema and no Catalyst optimization.
DataFrame: data organized into named columns with a schema, optimized by Catalyst and Tungsten.
Dataset: adds compile-time type safety on top of DataFrames (available in Scala/Java; in PySpark you work with DataFrames).
Spark Intro:
1. Spark : In-memory processing engine
2. Why Spark is fast: in-memory processing with far fewer disk I/O reads and writes
3. RDD: It is a data structure to store data in spark
4. When an RDD fails: the lineage graph is used to track which RDD/partition was lost and recompute it
5. Why RDDs are immutable: so lost data can be recomputed deterministically from its lineage after a failure
6. Operations in spark: Transformation and Action
7. Transformations: change data from one form to another; they are lazy.
8. Actions: operations that trigger the pending transformations; not lazy. Spark builds a DAG to remember the sequence of steps and executes it when an action is called.
9. Broadcast Variables: Data which is distributed to all the
systems. Similar to map side join in hive
10. Accumulators: Shared copy in driver, executors can
update but not read. Similar to counters in MR
11. MR before YARN: Job tracker (scheduling & monitoring), task tracker (manages tasks on its node)
12. Limitations of MR1: limited scalability when adding nodes, resource under-utilization, and only MapReduce jobs could be handled
13.YARN: Resource manager(scheduling), application
master(monitoring & resource negotiation), node
manager (manages tasks in its node)
14.Uberization: Tasks run on AM itself if they are very small
15. Spark components: Driver (coordinates the job and schedules tasks on executors) and executors (process data in memory)
16.Client Mode: Driver is at client side
17.Cluster Mode: Driver is inside AM in the cluster
18.Types of transformation: Narrow and Wide
19.Narrow: Data shuffling doesn’t happen (map, flatMap,
filter)
20.Wide: Data shuffling happens (reduceByKey, groupByKey)
21.reduceByKey() is a transformation and reduce() is an
action
22. reduceByKey(): data is partially aggregated within each partition before the shuffle; groupByKey(): data is only grouped, and the full aggregation happens after the shuffle at the reducer.
23. Repartition: can increase or decrease the number of partitions (full shuffle); use it when INCREASING.
Coalesce: only decreases the number of partitions and is more optimized because it shuffles less data.
SPARK DATAFRAMES:
SPARK OPTIMIZATIONS
1. Spark optimization:
a. Cluster Configuration : To configure resources to the cluster so
that spark jobs can process well.
b. Code configuration: To apply optimization techniques at code
level so that processing will be fast.
2. Thin executors: many executors, each with very few resources (e.g., executors with 1-2 CPU cores and 1 GB RAM each). Multithreading within an executor is barely possible and broadcast variables have to be copied to many executors.
3. Fat executors: few executors, each with a large share of resources (e.g., one executor with 16 CPU cores and 32 GB RAM). Overall throughput drops and garbage collection takes longer.
4. Garbage collection: To remove unused objects from memory.
5. Off-heap memory: memory allocated outside the JVM/executor heap. Cleaning it is cheaper than JVM garbage collection, and it is used for overheads (extra memory that does not directly add to performance but is needed by the system to carry out its operations).
6. Static allocation: Resources are fixed at first and will remain the
same till the job ends.
7. Dynamic Allocation: Resources are allocated dynamically based on
the job requirement and released during job stages if they are no
longer required.
8. Edge node: also called a gateway node; clients log in to it to enter the Hadoop cluster and reach the name node.
9. How to increase parallelism :
a. Salting : To increase no. of distinct keys so that work can be
distributed across many tasks which in turn increase parallelism.
b. Increase no. of shuffle partitions
c. Increase the resources of the cluster (more cpu cores)
10. Execution memory : To perform computations like shuffle,
sort, join
11. Storage memory : To store the cache
12. User memory : To store user’s data structures, meta data
etc.
13. Reserved memory : a fixed amount set aside for Spark's own internal objects
14. Kryo serializer: serializes data more compactly and faster than Java serialization, so cached or spilled data occupies less space.
15. Broadcast join: copies the smaller table to every executor so the large table does not have to be shuffled. Used when only one of the tables is big and the other fits in memory.
16. Prefer coalesce() over repartition() when reducing the number of partitions, as it avoids a full shuffle.
17. Join optimizations:
a. Avoid or minimize shuffling of data
b. Increase parallelism
How to avoid/minimize shuffling?
a. Filter and aggregate data before shuffling
b. Use methods that require less shuffling (e.g., coalesce(), broadcast joins)
18. How to increase parallelism ?
a. Min (total cpu cores, total shuffle partitions, total distinct
keys)
b. Use salting to increase no. of distinct keys
c. Increase default no. of shuffle partitions
d. Increase resources to inc total cpu cores
19. Skew partitions : Partitions in which data is unevenly
distributed. Bucketing, partitioning, salting can be used to
handle it.
20. Sort aggregate: Data is sorted based on keys and then
aggregated. More processing time
21. Hash aggregate: Hash table is created and similar keys
are added to the same hash value. Less processing time.
22. Stages of execution plan :
a. Parsed logical plan (unresolved logical plan) : To find out
syntax errors
b. Analyzed logical plan (resolved logical plan) : checks column and table names against the catalog.
c. Optimized logical plan (Catalyst optimization) : Optimization
done based on built in rules.
d. Physical plan : Actual execution plan is selected based on cost
effective model.
e. Conversion into Rdd : Converted into rdd and sent to
executors for processing.
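These plans can be printed with explain(); a small sketch with made-up data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("plan-demo").getOrCreate()
df = spark.createDataFrame([("tv", 100), ("phone", 300)], ["product", "sales"])
df.filter(df.sales > 150).groupBy("product").count().explain(True)
# prints the parsed, analyzed, and optimized logical plans plus the selected physical plan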
**Note:
1 hdfs block = 1 rdd partition = 128mb
1 hdfs block in local=1 rdd partition in local spark cluster= 32mb
1 rdd ~ can have n partitions in it
1 node = 1 machine in the cluster
N cores on a machine = N blocks/partitions can be processed in parallel on that machine
Number of stages = number of wide transformations + 1
Number of tasks in each stage = number of partitions of the RDD/data
frame in that stage
Must Know Pyspark Coding Before Your Next
Databricks Interview
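Note: the snippets below are condensed sketches, not complete programs. They assume an active SparkSession named spark, input DataFrames (df, df_customers, df_orders, ...) whose column names match each scenario, and the shared imports below; any column or path names added in the answers are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import (col, when, split, avg, count, sum,   # note: this shadows Python's built-in sum
                                   dense_rank, row_number, from_json)

spark = SparkSession.builder.master("local[*]").appName("pyspark-interview").getOrCreate()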
Scenario: You have a DataFrame of employee salaries and want to find the second highest salary.
windowSpec = Window.orderBy(col("salary").desc())
second_highest_salary = df.withColumn("rank", dense_rank().over(windowSpec)).filter(col("rank") == 2)
second_highest_salary.show()
Scenario: Given a DataFrame, identify how many null values each column contains.
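One common way to do this (a sketch using the shared imports above):
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()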
Scenario: For a stock price dataset, calculate a moving average over the last 3 days.
windowSpec = Window.orderBy("date").rowsBetween(-2, 0)
df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))  # "price" is an assumed column name
df_with_moving_avg.show()
Scenario: You need to remove duplicates from a DataFrame based on certain columns.
df = df.dropDuplicates(["column1", "column2"])
df.show()
5. Split a single column with comma-separated values into multiple columns.
Scenario: Your DataFrame contains a column with comma-separated values. You want to split this into multiple
columns.
parts = split(col("values"), ",")  # "values" is an assumed source column name
df_split = df.withColumn("value1", parts.getItem(0)).withColumn("value2", parts.getItem(1))
df_split.show()
6. Group data by a specific column and calculate the sum of another column.
Scenario: Group sales data by "product" and calculate the total sales.
df.groupBy("product").sum("sales").show()
Scenario: You have two DataFrames: one for customer data and one for orders. Join these DataFrames on the
customer ID.
df_joined = df_customers.join(df_orders, on="customer_id", how="inner")  # DataFrame and key names assumed
df_joined.show()
Scenario: Add a new column "category" that assigns "high", "medium", or "low" based on the value of the "sales"
column.
df = df.withColumn("category", when(col("sales") > 1000, "high")  # thresholds are assumed for illustration
                   .when(col("sales") > 500, "medium")
                   .otherwise("low"))
df.show()
9. Calculate the percentage contribution of each value in a column to the total.
Scenario: For a sales dataset, calculate the percentage contribution of each product's sales to the total sales.
total_sales = df.agg(sum("sales").alias("total_sales")).collect()[0]["total_sales"]
df = df.withColumn("pct_of_total", col("sales") / total_sales * 100)
df.show()
Scenario: Retrieve the top 5 rows by sales.
df.orderBy(col("sales").desc()).limit(5).show()
Scenario: You have sales data by "year" and "product", and you want to pivot the table to show "product" sales by
year.
df_pivot = df.groupBy("product").pivot("year").sum("sales")
df_pivot.show()
Scenario: Assign a row number to each record ordered by descending sales.
windowSpec = Window.orderBy(col("sales").desc())
df_with_row_number = df.withColumn("row_number", row_number().over(windowSpec))
df_with_row_number.show()
13. Filter rows based on a condition.
Scenario: You want to filter only those customers who made purchases over ₹1000.
df_filtered = df.filter(col("purchase_amount") > 1000)  # "purchase_amount" is an assumed column name
df_filtered.show()
Scenario: Your DataFrame contains a JSON column, and you want to extract specific fields from it.
schema = StructType([
    StructField("name", StringType()),   # fields assumed to exist in the JSON
    StructField("age", IntegerType())
])
df = df.withColumn("json_data", from_json(col("json_string"), schema))  # "json_string" is an assumed column name
df.select("json_data.name", "json_data.age").show()
Scenario: Convert a column from your DataFrame into a list for further processing.
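A minimal sketch, assuming the column of interest is called "sales":
sales_list = [row["sales"] for row in df.select("sales").collect()]
# alternative: df.select("sales").rdd.flatMap(lambda r: r).collect()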
Scenario: Replace null values in the "sales" column with 0.
df = df.na.fill({"sales": 0})
df.show()
Scenario: You have a hierarchy of employees and want to find each employee's manager.
df_self_join = (df.alias("e1")
    .join(df.alias("e2"), col("e1.manager_id") == col("e2.employee_id"), "left")  # id column names are assumed
    .select(col("e1.employee_name"), col("e2.employee_name").alias("manager_name")))
df_self_join.show()
18. Write PySpark code to unpivot a DataFrame.
Scenario: You have a DataFrame with "year" columns and want to convert them to rows.
df_unpivot = df.selectExpr("product", "stack(2, '2022', `2022`, '2023', `2023`) as (year, sales)")  # year columns assumed
df_unpivot.show()
19. Write a PySpark code to group data based on multiple columns and calculate aggregate functions.
Scenario: Group data by "product" and "region" and calculate the average sales for each group.
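A sketch using the shared imports above:
df.groupBy("product", "region").agg(avg("sales").alias("avg_sales")).show()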
20. Write PySpark code to remove rows with duplicate values in any column.
Scenario: You want to remove rows where any column has duplicate values.
df_cleaned = df.dropDuplicates()
df_cleaned.show()
21. Write PySpark code to read a CSV file and infer its schema.
Scenario: You need to load a CSV file into a DataFrame, ensuring the schema is inferred.
df = spark.read.csv("input_path/file.csv", header=True, inferSchema=True)  # path is a placeholder
df.show()
22. Write PySpark code to merge multiple small files into a single file.
Scenario: You have multiple small files in HDFS, and you want to consolidate them into one large file.
df.coalesce(1).write.mode("overwrite").csv("output_path")
23. Write PySpark code to calculate the cumulative sum of a column.
windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
df_with_cumsum = df.withColumn("cumulative_sum", sum("sales").over(windowSpec))  # "sales" is an assumed column name
df_with_cumsum.show()
Scenario: Detect outliers in the "sales" column based on the 1.5 * IQR rule.
q1, q3 = df.approxQuantile("sales", [0.25, 0.75], 0.0)
iqr = q3 - q1
df_outliers = df.filter((col("sales") < q1 - 1.5 * iqr) | (col("sales") > q3 + 1.5 * iqr))
df_outliers.show()
Scenario: Convert your PySpark DataFrame into a Pandas DataFrame for local processing.
pandas_df = df.toPandas()
Key Differences in Apache Spark Components and Concepts
GroupByKey vs ReduceByKey
Operation: GroupByKey groups all values by key; ReduceByKey aggregates values with the same key.
Efficiency: GroupByKey can lead to high shuffling; ReduceByKey is more efficient due to partial aggregation.
Data Movement: GroupByKey requires shuffling all values; ReduceByKey minimizes data movement through local aggregation.
Use Case: GroupByKey is useful for simple grouping; ReduceByKey is preferred for aggregations and reductions.
Performance: GroupByKey is less efficient with large datasets; ReduceByKey performs better on large datasets.
Cache vs Persist
Storage level: cache() always uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames); persist() accepts an explicit StorageLevel.
Flexibility: cache() is a shorthand; persist() lets you choose memory-only, memory-and-disk, disk-only, serialized, or replicated storage.
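A short sketch of the difference (assuming an existing DataFrame df; the selected columns are illustrative):
from pyspark import StorageLevel

cached = df.select("product", "sales").cache()                            # default storage level
on_disk = df.select("product", "region").persist(StorageLevel.DISK_ONLY)  # explicitly chosen storage level
cached.unpersist(); on_disk.unpersist()                                   # release when no longer needed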
Collect vs Take
Output: Collect retrieves all data from the RDD/DataFrame; Take retrieves a specified number of elements.
Memory Usage: Collect can be expensive and use a lot of memory; Take is more memory-efficient.
Use Case: Collect is used when you need the entire dataset; Take is useful for sampling or debugging.
Performance: Collect can cause performance issues with large data; Take is faster and more controlled.
Action Type: Collect triggers full data retrieval; Take triggers partial data retrieval.
Shuffle vs MapReduce
Operation: Shuffle is data reorganization across partitions; MapReduce is a data processing model for distributed computing.
Efficiency: Shuffle can be costly due to data movement; MapReduce is designed for batch processing with high I/O.
Performance: Shuffle affects performance based on the amount of data moved; MapReduce is optimized for large-scale batch processing but less efficient for iterative tasks.
Use Case: Shuffle is used in Spark for data redistribution; MapReduce is used in Hadoop for data processing tasks.
Implementation: Shuffle is integrated into Spark operations; MapReduce is a core component of the Hadoop ecosystem.
Executor vs Driver
Role: The executor executes tasks and processes data; the driver coordinates and manages the Spark application.
Memory: Executor memory is allocated per executor for data processing; driver memory is used for managing application execution.
Lifecycle: Executors exist throughout the application execution; the driver starts and stops the Spark application.
Tasks: Executors run the tasks assigned by the driver; the driver schedules and coordinates tasks and jobs.
Parallelism: Multiple executors run in parallel; a single driver coordinates the executors.
Checkpointing vs Caching
Purpose: Checkpointing provides fault tolerance and reliability; caching improves performance by storing intermediate data.
Storage: Checkpointing writes data to stable storage (e.g., HDFS); caching stores data in memory or on disk (depending on the storage level).
Use Case: Checkpointing is used for recovery in case of failures; caching is used for optimizing repeated operations.
Impact: Checkpointing can be more costly and slow; caching is generally faster but not suitable for fault tolerance.
Data: Checkpointed data is written to external storage; cached data is kept in memory or on disk for quick access.
Partitioning vs Bucketing
Purpose: Partitioning divides data into multiple partitions based on a key; bucketing divides data into buckets based on a hash function.
Usage: Partitioning optimizes queries by reducing the data scanned; bucketing improves join performance and maintains sorted data.
Shuffling: Partitioning reduces shuffling by placing related data together; bucketing reduces shuffle during joins and aggregations.
Data Layout: Partitioned data is physically separated based on the partition key; bucketed data is organized into a fixed number of buckets.
Performance: Partitioning improves performance for queries involving partition keys; bucketing enhances performance for join operations.
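A write-side sketch of both (the paths, table name, column names, and bucket count are illustrative, assuming an existing DataFrame df):
# Partitioning: one sub-directory per distinct value of the partition column
df.write.mode("overwrite").partitionBy("year").parquet("/tmp/sales_partitioned")

# Bucketing: rows are hashed into a fixed number of buckets; requires saveAsTable
df.write.mode("overwrite").bucketBy(8, "customer_id").sortBy("customer_id").saveAsTable("sales_bucketed")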