PySpark Comprehensive Notes⚡

APACHE SPARK

INTERVIEW QUESTIONS

ABHINANDAN PATRA

DATA ENGINEER
1. What is Apache Spark?

Apache Spark is an open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It is designed
to process large-scale data efficiently.

2. Why Apache Spark?

Apache Spark is used because it is faster than traditional big data tools like Hadoop
MapReduce due to its in-memory processing capabilities, supports multiple languages (Scala,
Python, R, Java), provides libraries for various tasks (SQL, machine learning, graph
processing, etc.), and has robust fault tolerance.

3. What are the components of the Apache Spark Ecosystem?

The main components are:

 Spark Core: The foundational engine for large-scale parallel and distributed data
processing.
 Spark SQL: For structured data processing.
 Spark Streaming: For real-time data processing.
 MLlib: A library for scalable machine learning.
 GraphX: For graph and graph-parallel computation.

4. What is Spark Core?

Spark Core is the general execution engine for the Spark platform, responsible for tasks such
as scheduling, distributing, and monitoring applications.

5. Which languages does Apache Spark support?

Apache Spark supports:

 Scala
 Python
 Java
 R
 SQL

6. How is Apache Spark better than Hadoop?

Spark is better in several ways, including faster processing due to in-memory computation,
ease of use with APIs for various programming languages, flexibility with built-in libraries
for diverse tasks, and a rich set of APIs for transformations and actions.

7. What are the different methods to run Spark over Apache Hadoop?

Spark can run on Hadoop in the following modes:


 Standalone: Using its cluster manager.
 YARN: Hadoop’s cluster manager.
 Mesos: Another cluster manager.

8. What is SparkContext in Apache Spark?

SparkContext is the entry point for any Spark application. It acts as a connection to the
Spark cluster, allowing Spark jobs to be executed.

9. What is SparkSession in Apache Spark?

SparkSession is the unified entry point to work with DataFrames, Datasets, and SQL in
Apache Spark. It replaces SQLContext and HiveContext.

10. SparkSession vs. SparkContext in Apache Spark

SparkSession is a combination of SQLContext, HiveContext, and SparkContext, providing a
single point of entry to interact with Spark.

11. What are the abstractions of Apache Spark?

The primary abstractions are:

 RDD (Resilient Distributed Dataset)


 DataFrames
 Datasets

12. How can we create RDD in Apache Spark?

RDDs can be created in three ways (see the sketch below):

 Parallelizing a collection in your program.
 Referencing a dataset in an external storage system (e.g., HDFS, S3).
 Transforming an existing RDD.
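For illustration, a minimal PySpark sketch of the three approaches (assuming an existing
SparkSession named spark; the HDFS path is hypothetical):

sc = spark.sparkContext

# 1. Parallelize an in-memory collection
rdd_from_collection = sc.parallelize([1, 2, 3, 4, 5])

# 2. Reference a dataset in external storage (illustrative path)
rdd_from_file = sc.textFile("hdfs:///data/input.txt")

# 3. Transform an existing RDD
rdd_transformed = rdd_from_collection.map(lambda x: x * 2)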

13. Why is Spark RDD immutable?

RDDs are immutable to provide fault tolerance and support functional programming
principles, allowing Spark to rebuild lost data from the lineage information.

14. Explain the term paired RDD in Apache Spark.

Paired RDDs are RDDs where each element is a pair (key-value). They are used for
operations like aggregation, grouping, and joins.

15. How is RDD in Spark different from Distributed Storage Management?

RDD is an in-memory data structure optimized for processing, while Distributed Storage
(like HDFS) focuses on data storage and retrieval.
16. Explain transformation and action in RDD in Apache Spark.

 Transformation: Lazy operations that define a new RDD without executing until an
action is called (e.g., map, filter).
 Action: Triggers the execution of transformations (e.g., count, collect); see the sketch below.
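A quick illustration of laziness (assuming an existing SparkSession named spark): the filter
and map lines only build the plan; the job runs when count() is called.

rdd = spark.sparkContext.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing executes yet
doubled = evens.map(lambda x: x * 2)       # still lazy
print(doubled.count())                     # action: triggers the job, prints 5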

17. What are the types of Apache Spark transformations?

Transformations can be narrow (e.g., map, filter) or wide (e.g., groupByKey, reduceByKey).

18. Explain the RDD properties.

RDD properties include:

 Immutability: Once created, RDDs cannot be changed.


 Partitioned: Distributed across various nodes in the cluster.
 Lazy evaluation: Operations are computed when an action is called.
 Fault tolerance: Recomputed using lineage information.

19. What is a lineage graph in Apache Spark?

A lineage graph tracks the sequence of transformations that created an RDD, used for
recomputing lost data due to node failures.

20. Explain the terms Spark Partitions and Partitioners.

 Partitions: Logical division of data in RDDs, physically stored across nodes.


 Partitioner: Determines how data is distributed across partitions (e.g.,
HashPartitioner, RangePartitioner).

21. By default, how many partitions are created in RDD in Apache Spark?

By default, Spark creates partitions based on the number of cores available or the input file's
HDFS block size.

22. What are Spark DataFrames?

DataFrames are distributed collections of data organized into named columns, similar to
tables in a relational database.

23. What are the benefits of DataFrame in Spark?

Benefits include optimizations (Catalyst query optimizer), improved performance, and easier
manipulation using SQL-like syntax.

24. What is Spark Dataset?


A Dataset is a distributed collection of data that provides type safety and object-oriented
programming interfaces.

25. What are the advantages of datasets in Spark?

Advantages include compile-time type safety, optimizations through Tungsten, and efficient
serialization via encoders.

26. What is Directed Acyclic Graph (DAG) in Apache Spark?

A DAG in Spark represents a sequence of computations performed on data, where each node
is an RDD and edges represent transformations. It's used to optimize execution plans.

27. What is the need for Spark DAG?

The DAG allows Spark to optimize execution by scheduling tasks efficiently, minimizing
data shuffling, and managing dependencies.

28. What is the difference between DAG and Lineage?

 DAG: Represents the entire execution plan of a Spark application.


 Lineage: Tracks transformations on a particular RDD, useful for fault recovery.

29. What is the difference between Caching and Persistence in Apache Spark?

 Caching: Default storage level is in-memory (MEMORY_ONLY).
 Persistence: Allows choosing different storage levels (disk, memory, etc.) for storing
RDDs (see the sketch below).
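A minimal sketch of both options (assuming an existing SparkSession named spark):

from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(1000))
rdd.cache()                                    # shorthand for persist(MEMORY_ONLY)

mapped = rdd.map(lambda x: x * 2)
mapped.persist(StorageLevel.MEMORY_AND_DISK)   # explicitly chosen storage level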

30. What are the limitations of Apache Spark?

Limitations include high memory consumption, a smaller set of mature ecosystem tools than
Hadoop, and limited suitability for very small datasets or true real-time streaming without
additional tooling.

31. Different Running Modes of Apache Spark

Spark can run in:

 Local mode: Single machine.


 Standalone mode: Using its cluster manager.
 YARN mode: On Hadoop’s cluster manager.
 Mesos mode: On Mesos cluster manager.
 Kubernetes mode: On Kubernetes.

32. What are the different ways of representing data in Spark?

Data can be represented as:

 RDDs (Resilient Distributed Datasets)


 DataFrames
 Datasets

33. What is Write-Ahead Log (WAL) in Spark?

Write-Ahead Log is a fault-tolerance mechanism where every received data is first written to
a log file (disk) before processing, ensuring no data loss.

34. Explain Catalyst Query Optimizer in Apache Spark.

Catalyst is Spark SQL's query optimizer that uses rule-based and cost-based optimization
techniques to generate efficient execution plans.

35. What are shared variables in Apache Spark?

Shared variables are variables that can be used by tasks running on different nodes:

 Broadcast variables: Efficiently share read-only data across nodes.
 Accumulators: Used for aggregating information (e.g., sums) across tasks (see the sketch below).
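A small sketch of both shared-variable types (assuming an existing SparkSession named spark;
the lookup table is illustrative):

sc = spark.sparkContext

lookup = sc.broadcast({"IN": "India", "US": "United States"})   # read-only on executors
bad_records = sc.accumulator(0)                                 # tasks can add, driver reads

def resolve(code):
    if code not in lookup.value:
        bad_records.add(1)
    return lookup.value.get(code, "unknown")

countries = sc.parallelize(["IN", "US", "XX"]).map(resolve)
print(countries.collect())        # ['India', 'United States', 'unknown']
print(bad_records.value)          # 1, read back on the driver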

36. How does Apache Spark handle accumulated metadata?

Spark stores metadata like lineage information, partition data, and task details in the driver
and worker nodes, managing it using its DAG scheduler.

37. What is Apache Spark's Machine Learning Library?

MLlib is Spark's scalable machine learning library, which provides algorithms and utilities
for classification, regression, clustering, collaborative filtering, and more.

38. List commonly used Machine Learning Algorithms.

Common algorithms in Spark MLlib include:

 Linear Regression
 Logistic Regression
 Decision Trees
 Random Forests
 Gradient-Boosted Trees
 K-Means Clustering

39. What is the difference between DSM and RDD?

 DSM (Distributed Storage Management): Focuses on data storage across clusters.


 RDD (Resilient Distributed Dataset): Focuses on distributed data processing with
fault tolerance.

40. List the advantages of Parquet files in Apache Spark.


Advantages of Parquet files:

 Columnar storage format, optimized for read-heavy workloads.


 Efficient compression and encoding schemes.
 Schema evolution support.

41. What is lazy evaluation in Spark?

Lazy evaluation defers execution until an action is performed, optimizing the execution plan
by reducing redundant computations.

42. What are the benefits of Spark lazy evaluation?

Benefits include:

 Reducing the number of passes over data.


 Optimizing the computation process.
 Decreasing execution time.

43. How much faster is Apache Spark than Hadoop?

Apache Spark is generally up to 100x faster than Hadoop for in-memory processing and up to
10x faster for on-disk data.

44. What are the ways to launch Apache Spark over YARN?

Spark can be launched over YARN in:

 Client mode: Driver runs on the client machine.


 Cluster mode: Driver runs inside YARN cluster.

45. Explain various cluster managers in Apache Spark.

Spark supports:

 Standalone Cluster Manager: Default cluster manager.


 Apache Mesos: A general-purpose cluster manager.
 Hadoop YARN: A resource manager for Hadoop clusters.
 Kubernetes: For container orchestration.

46. What is Speculative Execution in Apache Spark?

Speculative execution is a mechanism that detects slow-running tasks and runs duplicates on
other nodes to speed up the job.

47. How can data transfer be minimized when working with Apache Spark?

Data transfer can be minimized by:


 Reducing shuffling and repartitioning.
 Using broadcast variables.
 Efficient data partitioning.

48. What are the cases where Apache Spark surpasses Hadoop?

Apache Spark outperforms Hadoop in scenarios involving iterative algorithms, in-memory
computations, real-time analytics, and complex data processing workflows.

49. What is an action, and how does it process data in Apache Spark?

An action is an operation that triggers the execution of transformations (e.g., count,
collect), performing computations and returning a result.

50. How is fault tolerance achieved in Apache Spark?

Fault tolerance is achieved through lineage information, allowing RDDs to be recomputed
from scratch if a partition is lost.

51. What is the role of the Spark Driver in Spark applications?

The Spark Driver is responsible for converting the user's code into tasks, scheduling them on
executors, and collecting the results.

52. What is a worker node in an Apache Spark cluster?

A worker node is a machine in a Spark cluster where the actual data processing tasks are
executed.

53. Why is Transformation lazy in Spark?

Transformations are lazy to build an optimized execution plan (DAG) and to avoid
unnecessary computation.

54. Can I run Apache Spark without Hadoop?

Yes, Spark can run independently using its built-in cluster manager or other managers like
Mesos and Kubernetes.

55. Explain Accumulator in Spark.

An accumulator is a variable used for aggregating information across executors, like counters
in MapReduce.

56. What is the role of the Driver program in a Spark Application?

The Driver program coordinates the execution of tasks, maintains the SparkContext, and
communicates with the cluster manager.
57. How to identify that a given operation is a Transformation or Action in
your program?

Transformations return RDDs (e.g., map, filter), while actions return non-RDD values (e.g.,
collect, count).

58. Name the two types of shared variables available in Apache Spark.

 Broadcast Variables
 Accumulators

59. What are the common faults of developers while using Apache Spark?

Common faults include:

 Inefficient data partitioning.


 Excessive shuffling and data movement.
 Inappropriate use of transformations and actions.
 Not leveraging caching and persistence properly.

60. By Default, how many partitions are created in RDD in Apache Spark?

The default number of partitions is based on the number of cores available in the cluster or
the HDFS block size.

61. Why do we need compression, and what are the different compression
formats supported?

Compression reduces the storage size of data and speeds up data transfer. Spark supports
several compression formats:

 Snappy
 Gzip
 Bzip2
 LZ4
 Zstandard (Zstd)

62. Explain the filter transformation.

The filter transformation creates a new RDD by selecting only elements that satisfy a given
predicate function.

63. How to start and stop Spark in the interactive shell?

 To start: run spark-shell for Scala or pyspark for Python.
 To stop: use :quit or Ctrl + D in the shell.
64. Explain the sortByKey() operation.

sortByKey() sorts an RDD of key-value pairs by the key in ascending or descending order.

65. Explain distinct(), union(), intersection(), and subtract() transformations in Spark.

 distinct(): Returns an RDD with duplicate elements removed.
 union(): Combines two RDDs into one.
 intersection(): Returns an RDD with elements common to both RDDs.
 subtract(): Returns an RDD with elements in one RDD but not in another (see the sketch below).
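A small sketch of these four transformations (assuming an existing SparkSession named spark;
element order in the results may vary):

sc = spark.sparkContext
a = sc.parallelize([1, 2, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

print(a.distinct().collect())        # duplicates removed
print(a.union(b).collect())          # all elements of both RDDs
print(a.intersection(b).collect())   # elements common to a and b
print(a.subtract(b).collect())       # elements of a that are not in b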

66. Explain foreach() operation in Apache Spark.

foreach() applies a function to each element in the RDD, typically used for side effects like
updating an external data store.

67. groupByKey vs reduceByKey in Apache Spark.

 groupByKey: Groups values by key and shuffles all data across the network, which
can be less efficient.
 reduceByKey: Combines values for each key locally before shuffling, reducing
network traffic (see the sketch below).
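A word-count style sketch of the difference (assuming an existing SparkSession named spark):

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

by_reduce = pairs.reduceByKey(lambda x, y: x + y)             # combines locally, then shuffles
by_group = pairs.groupByKey().mapValues(lambda vs: sum(vs))   # shuffles every pair, then sums

print(by_reduce.collect())   # e.g. [('a', 3), ('b', 1)]
print(by_group.collect())    # same result, more network traffic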

68. Explain mapPartitions() and mapPartitionsWithIndex().

 mapPartitions(): Applies a function to each partition of the RDD.


 mapPartitionsWithIndex(): Applies a function to each partition, providing the
partition index.

69. What is map in Apache Spark?

map is a transformation that applies a function to each element in the RDD, resulting in a new
RDD.

70. What is flatMap in Apache Spark?

flatMap is a transformation that applies a function to each element and flattens the results,
so each input element can produce zero or more output elements.
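A side-by-side sketch of map vs. flatMap (assuming an existing SparkSession named spark):

lines = spark.sparkContext.parallelize(["hello world", "spark is fast"])

print(lines.map(lambda s: s.split(" ")).collect())
# [['hello', 'world'], ['spark', 'is', 'fast']]  -> one list per input line

print(lines.flatMap(lambda s: s.split(" ")).collect())
# ['hello', 'world', 'spark', 'is', 'fast']      -> results flattened into one RDD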

71. Explain fold() operation in Spark.

fold() aggregates the elements of an RDD using an associative function and a "zero value"
(an initial value).

72. Explain createOrReplaceTempView() API.


createOrReplaceTempView() registers a DataFrame as a temporary table in Spark SQL,
allowing it to be queried using SQL.
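A minimal sketch (assuming an existing SparkSession named spark and a DataFrame df with
hypothetical product and sales columns):

df.createOrReplaceTempView("sales_view")

spark.sql("""
    SELECT product, SUM(sales) AS total_sales
    FROM sales_view
    GROUP BY product
""").show()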

73. Explain values() operation in Apache Spark.

values() returns an RDD containing only the values of key-value pairs.

74. Explain keys() operation in Apache Spark.

keys() returns an RDD containing only the keys of key-value pairs.

75. Explain textFile vs wholeTextFiles in Spark.

 textFile(): Reads a text file and creates an RDD of strings, each representing a line.
 wholeTextFiles(): Reads entire files and creates an RDD of (filename, content) pairs.

76. Explain cogroup() operation in Spark.

cogroup() groups data from two or more RDDs sharing the same key.

77. Explain pipe() operation in Apache Spark.

pipe() passes each partition of an RDD to an external script or program and returns the
output as an RDD.

78. Explain Spark coalesce() operation.

coalesce() reduces the number of partitions in an RDD, useful for minimizing shuffling
when reducing the data size.

79. Explain the repartition() operation in Spark.

repartition() reshuffles data across partitions, increasing or decreasing the number of
partitions; it involves a full shuffle of data.
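A short sketch contrasting coalesce() and repartition() (assuming an existing SparkSession
named spark):

rdd = spark.sparkContext.parallelize(range(100), 8)
print(rdd.getNumPartitions())      # 8

more = rdd.repartition(16)         # full shuffle; can increase partitions
fewer = rdd.coalesce(2)            # merges partitions without a full shuffle
print(more.getNumPartitions(), fewer.getNumPartitions())   # 16 2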

80. Explain fullOuterJoin() operation in Apache Spark.

fullOuterJoin() returns an RDD with all pairs of elements for matching keys and null for
non-matching keys from both RDDs.

81. Explain Spark leftOuterJoin() and rightOuterJoin() operations.

 leftOuterJoin(): Returns all key-value pairs from the left RDD and matching pairs
from the right, filling with null where no match is found.
 rightOuterJoin(): Returns all key-value pairs from the right RDD and matching pairs
from the left, filling with null where no match is found.

82. Explain Spark join() operation.


join() returns an RDD with all pairs of elements with matching keys from both RDDs.
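A small sketch covering join(), leftOuterJoin(), rightOuterJoin(), and fullOuterJoin() on
paired RDDs (assuming an existing SparkSession named spark; result order may vary):

left = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
right = spark.sparkContext.parallelize([("a", "x"), ("c", "y")])

print(left.join(right).collect())            # [('a', (1, 'x'))]
print(left.leftOuterJoin(right).collect())   # adds ('b', (2, None))
print(left.rightOuterJoin(right).collect())  # adds ('c', (None, 'y'))
print(left.fullOuterJoin(right).collect())   # keys from both sides, None where unmatched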

83. Explain top() and takeOrdered() operations.

 top(): Returns the n largest elements of an RDD (descending order).
 takeOrdered(): Returns the n smallest elements of an RDD (ascending order, or by a custom key).

84. Explain first() operation in Spark.

first() returns the first element of an RDD.

85. Explain sum(), max(), min() operations in Apache Spark.

These operations compute the sum, maximum, and minimum of elements in an RDD,
respectively.

86. Explain countByValue() operation in Apache Spark RDD.

countByValue() returns a map of the counts of each unique value in the RDD.

87. Explain the lookup() operation in Spark.

lookup() returns the list of values associated with a given key in a paired RDD.

88. Explain Spark countByKey() operation.

countByKey() returns a map of the counts of each key in a paired RDD.

89. Explain Spark saveAsTextFile() operation.

saveAsTextFile() saves the RDD content as a text file or set of text files.

90. Explain reduceByKey() Spark operation.

reduceByKey() applies a reducing function to the elements with the same key, reducing
them to a single element per key.

91. Explain the operation reduce() in Spark.

reduce() aggregates the elements of an RDD using an associative and commutative function.

92. Explain the action count() in Spark RDD.

count() returns the number of elements in an RDD.

93. Explain Spark map() transformation.


map() applies a function to each element of an RDD, creating a new RDD with the results.

94. Explain the flatMap() transformation in Apache Spark.

flatMap() applies a function that returns an iterable to each element and flattens the results
into a single RDD.

95. What are the limitations of Apache Spark?

Limitations include high memory consumption, not ideal for OLTP (transactional
processing), lack of a mature security framework, and dependency on cluster resources.

96. What is Spark SQL?

Spark SQL is a Spark module for structured data processing, providing a DataFrame API and
allowing SQL queries to be executed.

97. Explain Spark SQL caching and uncaching.

 Caching: Storing DataFrames in memory for faster access.


 Uncaching: Removing cached DataFrames to free memory.

98. Explain Spark Streaming.

Spark Streaming is an extension of Spark for processing real-time data streams.

99. What is DStream in Apache Spark Streaming?

DStream (Discretized Stream) is a sequence of RDDs representing a continuous stream of data.

100. Explain different transformations in DStream in Apache Spark Streaming.

Transformations include:

 map(), flatMap(), filter()


 reduceByKeyAndWindow()
 window(), countByWindow()
 updateStateByKey()

101. What is the Starvation scenario in Spark Streaming?

Starvation occurs when all tasks are waiting for resources that are occupied by other long-
running tasks, leading to delays or deadlocks.

102. Explain the level of parallelism in Spark Streaming.


Parallelism is controlled by the number of partitions in RDDs; increasing partitions increases
the level of parallelism.

103. What are the different input sources for Spark Streaming?

Input sources include:

 Kafka
 Flume
 Kinesis
 Socket
 HDFS or S3

104. Explain Spark Streaming with Socket.

Spark Streaming can receive real-time data streams over a socket using
socketTextStream().
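A minimal DStream sketch of a socket word count using the legacy pyspark.streaming API (the
host and port are illustrative, e.g. a source started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SocketWordCount")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()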

105. Define the roles of the file system in any framework.

The file system manages data storage, access, and security, ensuring data integrity and
availability.

106. How do you parse data in XML? Which kind of class do you use with
Java to parse data?

To parse XML data in Java, you can use classes from the javax.xml.parsers package, such
as:

 DocumentBuilder: Used with the Document Object Model (DOM) for in-memory
tree representation.
 SAXParser: Used with the Simple API for XML (SAX) for event-driven parsing.

107. What is PageRank in Spark?

PageRank is an algorithm used to rank web pages in search engine results, based on the
number and quality of links to a page. In Spark, it can be implemented using RDDs or
DataFrames to compute the rank of nodes in a graph.

108. What are the roles and responsibilities of worker nodes in the Apache
Spark cluster? Is the Worker Node in Spark the same as the Slave Node?

 Worker Nodes: Execute tasks assigned by the Spark Driver, manage executors, and
store data in memory or disk as required.
 Slave Nodes: Worker nodes in Spark are commonly referred to as slave nodes. Both
terms are used interchangeably.

109. How to split a single HDFS block into partitions in an RDD?


When reading from HDFS, each block normally becomes one RDD partition, but you can ask
Spark for more partitions via the minPartitions argument of textFile(), or call the
repartition() method to explicitly set the number of partitions.

110. On what basis can you differentiate RDD, DataFrame, and DataSet?

 RDD: Low-level, unstructured data; provides functional programming APIs.


 DataFrame: Higher-level abstraction with schema; optimized for SQL queries and
transformations.
 Dataset: Combines features of RDDs and DataFrames; offers type safety and object-
oriented programming.
SPARK BASED TOPICS KEYWORDS:

Spark Intro:
1. Spark : In-memory processing engine
2. Why spark is fast: Due to less I/O disc reads and writes
3. RDD: It is a data structure to store data in spark
4. When RDD fails: Using lineage graph we track which RDD
failed and reprocess it
5. Why RDD immutable : As it has to be recovered after its
failure and to track which RDD failed
6. Operations in spark: Transformation and Action
7. Transformation: Change data from one form to another,
are lazy.
8. Action: Operations that execute the transformations; not
lazy. A DAG is created to remember the sequence of steps.
9. Broadcast Variables: Data which is distributed to all the
systems. Similar to map side join in hive
10. Accumulators: Shared copy in driver, executors can
update but not read. Similar to counters in MR
11. MR before YARN: JobTracker (scheduling & monitoring),
TaskTracker (manages tasks on its node)
12. Limitations of MR1: limited scalability (hard to add new nodes),
resource under-utilization, only MR jobs handled
13.YARN: Resource manager(scheduling), application
master(monitoring & resource negotiation), node
manager (manages tasks in its node)
14.Uberization: Tasks run on AM itself if they are very small
15.Spark components: Driver (gives location of executors)
and executors(process data in memory)
16.Client Mode: Driver is at client side
17.Cluster Mode: Driver is inside AM in the cluster
18.Types of transformation: Narrow and Wide
19.Narrow: Data shuffling doesn’t happen (map, flatMap,
filter)
20.Wide: Data shuffling happens (reduceByKey, groupByKey)
21.reduceByKey() is a transformation and reduce() is an
action
22.reduceByKey(): Data is processed at each partition,
groupByKey() : Data is grouped at each partition and
complete processing is done at reducer.
23.Repartition: used to increase/decrease partitions. Use it
for INCREASE
Coalesce: used to decrease partitions and optimized as
data shuffling is less

SPARK DATAFRAMES:

1. Cache() : It is used to cache the data only in memory.


Rdd.cache()
2. Persist() : it is used to cache the data in different storage
levels (memory, disc, memory & disc, off heap).
Rdd.persist(StorageLevel.____)
3. Serialization: Process of converting data in object form
into bytes, occupies less space
4. De-Serialization: Process of converting data in bytes back
to objects, occupies more space.
5. DAG : Created when an action is called, represents tasks,
stages of a job
6. Map : performs one-to-one mapping on each line of input
7. mapPartitions: performs map function only once on each
partition
8. Driver: converts high level programming constructs to low
level to be fed to executors (dataframe to rdd)
9. Executors: Present in memory to process the rdd
10. Spark context: creates entry point into spark cluster
for spark appl
11. Spark session: creates unified entry point into spark
cluster
12. Data frame: it is a dataset[row] where type error
caught only at run time
13. Data set: it is a dataset[object] where type error
caught at compile time
14. Modes of dealing with corrupted records: PERMISSIVE,
DROPMALFORMED, FAILFAST
15. Schema types: implicit, inferred, explicit (case class,
StructType, DDL string); see the read sketch below
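A short read sketch combining an explicit schema with a corrupt-record mode (the file path and
column names are illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = (spark.read
      .schema(schema)                      # explicit schema: no inference pass over the data
      .option("mode", "DROPMALFORMED")     # alternatives: PERMISSIVE (default), FAILFAST
      .option("header", "true")
      .csv("/path/to/data.csv"))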

SPARK OPTIMIZATIONS

1. Spark optimization:
a. Cluster Configuration : To configure resources to the cluster so
that spark jobs can process well.
b. Code configuration: To apply optimization techniques at code
level so that processing will be fast.
2. Thin executor: More no. of executors with less no. of resources.
Multithreading not possible, too many broadcast variables
required. Ex. 1 executor with each 2 cpu cores, 1 GB ram.
3. Fat executor: Less no. of executors with more amount of
resources. System performance drops down, garbage collection
takes time. Ex 1 executor 16 cpu cores, 32 GB ram.
4. Garbage collection: To remove unused objects from memory.
5. Off heap memory: Memory stored outside of executors/ jvm. It
takes less time to clean objects than garbage collector, used for
java overheads (extra memory which directly doesn’t add to
performance but required by system to carry out its operation)
6. Static allocation: Resources are fixed at first and will remain the
same till the job ends.
7. Dynamic Allocation: Resources are allocated dynamically based on
the job requirement and released during job stages if they are no
longer required.
8. Edge node: Also called a gateway node; clients use it to enter
the Hadoop cluster and reach the NameNode.
9. How to increase parallelism :
a. Salting : To increase no. of distinct keys so that work can be
distributed across many tasks which in turn increase parallelism.
b. Increase no. of shuffle partitions
c. Increase the resources of the cluster (more cpu cores)
10. Execution memory : To perform computations like shuffle,
sort, join
11. Storage memory : To store the cache
12. User memory : To store user’s data structures, meta data
etc.
13. Reserved memory : Memory set aside for Spark's own internal/system objects
14. Kryo serializer: Stores data on disk (or in memory) in a compact
serialized form that occupies less space than default Java serialization.
15. Broadcast join: Sends copies of the small table to all
executors. Used when only one of the tables is big and the other fits in memory.
16. Prefer coalesce() over repartition() when reducing the
number of partitions
17. Join optimizations:
a. Avoid or minimize shuffling of data
b. Increase parallelism
How to avoid/minimize shuffling?
a. Filter and aggregate data before shuffling
b. Use methods that require less shuffling (e.g., coalesce())
18. How to increase parallelism ?
a. Min (total cpu cores, total shuffle partitions, total distinct
keys)
b. Use salting to increase no. of distinct keys
c. Increase default no. of shuffle partitions
d. Increase resources to inc total cpu cores
19. Skew partitions : Partitions in which data is unevenly
distributed. Bucketing, partitioning, salting can be used to
handle it.
20. Sort aggregate: Data is sorted based on keys and then
aggregated. More processing time
21. Hash aggregate: Hash table is created and similar keys
are added to the same hash value. Less processing time.
22. Stages of the execution plan (printable with explain(); see the sketch below):
a. Parsed logical plan (unresolved logical plan): catches syntax errors
b. Analyzed logical plan (resolved logical plan): checks column and
table names against the catalog
c. Optimized logical plan (Catalyst optimization): optimization
based on built-in rules
d. Physical plan: the actual execution plan, selected by a cost-based model
e. Conversion into RDDs: converted into RDDs and sent to
executors for processing
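A quick way to see these plans for a query (assuming an existing SparkSession named spark);
explain(extended=True) prints the parsed, analyzed, and optimized logical plans plus the
physical plan:

df = spark.range(10).withColumnRenamed("id", "n")
agg = df.groupBy((df.n % 2).alias("parity")).count()
agg.explain(extended=True)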

**Note:
1 HDFS block = 1 RDD partition = 128 MB (default)
1 block in local mode = 1 RDD partition in a local Spark setup = 32 MB
1 RDD can have n partitions
1 worker node = 1 machine
N cores = N tasks (blocks) can run in parallel on each machine
N wide transformations = N + 1 stages
N tasks in each stage = N partitions of the RDD/DataFrame in that stage
Must Know Pyspark Coding Before Your Next
Databricks Interview

Document by – Siddhartha Subudhi


1. Find the second highest salary in a DataFrame using PySpark.

Scenario: You have a DataFrame of employee salaries and want to find the second highest salary.

from pyspark.sql import Window

from pyspark.sql.functions import col, dense_rank

windowSpec = Window.orderBy(col("salary").desc())

df_with_rank = df.withColumn("rank", dense_rank().over(windowSpec))

second_highest_salary = df_with_rank.filter(col("rank") == 2).select("salary")

second_highest_salary.show()

2. Count the number of null values in each column of a PySpark DataFrame.

Scenario: Given a DataFrame, identify how many null values each column contains.

from pyspark.sql.functions import col, isnan, when, count

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

3. Calculate the moving average over a window of 3 rows.

Scenario: For a stock price dataset, calculate a moving average over the last 3 days.

from pyspark.sql import Window

from pyspark.sql.functions import avg

windowSpec = Window.orderBy("date").rowsBetween(-2, 0)

df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))

df_with_moving_avg.show()

4. Remove duplicate rows based on a subset of columns in a PySpark DataFrame.

Scenario: You need to remove duplicates from a DataFrame based on certain columns.

df = df.dropDuplicates(["column1", "column2"])

df.show()
5. Split a single column with comma-separated values into multiple columns.

Scenario: Your DataFrame contains a column with comma-separated values. You want to split this into multiple
columns.

from pyspark.sql.functions import split

df_split = df.withColumn("new_column1", split(df["column"], ",").getItem(0)) \
             .withColumn("new_column2", split(df["column"], ",").getItem(1))

df_split.show()

6. Group data by a specific column and calculate the sum of another column.

Scenario: Group sales data by "product" and calculate the total sales.

df.groupBy("product").sum("sales").show()

7. Join two DataFrames on a specific condition.

Scenario: You have two DataFrames: one for customer data and one for orders. Join these DataFrames on the
customer ID.

df_joined = df_customers.join(df_orders, df_customers.customer_id == df_orders.customer_id, "inner")

df_joined.show()

8. Create a new column based on conditions from existing columns.

Scenario: Add a new column "category" that assigns "high", "medium", or "low" based on the value of the "sales"
column.

from pyspark.sql.functions import when

df = df.withColumn("category", when(df.sales > 500, "high")
                               .when((df.sales <= 500) & (df.sales > 200), "medium")
                               .otherwise("low"))

df.show()
9. Calculate the percentage contribution of each value in a column to the total.

Scenario: For a sales dataset, calculate the percentage contribution of each product's sales to the total sales.

from pyspark.sql.functions import sum, col

total_sales = df.agg(sum("sales").alias("total_sales")).collect()[0]["total_sales"]

df = df.withColumn("percentage", (col("sales") / total_sales) * 100)

df.show()

10. Find the top N records from a DataFrame based on a column.

Scenario: You need to find the top 5 highest-selling products.

from pyspark.sql.functions import col

df.orderBy(col("sales").desc()).limit(5).show()

11. Write PySpark code to pivot a DataFrame.

Scenario: You have sales data by "year" and "product", and you want to pivot the table to show "product" sales by
year.

df_pivot = df.groupBy("product").pivot("year").sum("sales")

df_pivot.show()

12. Add row numbers to a PySpark DataFrame based on a specific ordering.

Scenario: Add row numbers to a DataFrame ordered by "sales" in descending order.

from pyspark.sql.window import Window

from pyspark.sql.functions import row_number, col

windowSpec = Window.orderBy(col("sales").desc())

df_with_row_number = df.withColumn("row_number", row_number().over(windowSpec))

df_with_row_number.show()
13. Filter rows based on a condition.

Scenario: You want to filter only those customers who made purchases over ₹1000.

df_filtered = df.filter(df.purchase_amount > 1000)

df_filtered.show()

14. Flatten a JSON column in PySpark.

Scenario: Your DataFrame contains a JSON column, and you want to extract specific fields from it.

from pyspark.sql.functions import from_json, col

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([

StructField("name", StringType(), True),

StructField("age", StringType(), True)

])

df = df.withColumn("json_data", from_json(col("json_column"), schema))

df.select("json_data.name", "json_data.age").show()

15. Convert a PySpark DataFrame column to a list.

Scenario: Convert a column from your DataFrame into a list for further processing.

column_list = df.select("column_name").rdd.flatMap(lambda x: x).collect()

16. Handle NULL values by replacing them with a default value.

Scenario: Replace all NULL values in the "sales" column with 0.

df = df.na.fill({"sales": 0})

df.show()

17. Perform a self-join on a PySpark DataFrame.

Scenario: You have a hierarchy of employees and want to find each employee's manager.

from pyspark.sql.functions import col

df_self_join = df.alias("e1").join(df.alias("e2"), col("e1.manager_id") == col("e2.employee_id"), "inner") \
                 .select(col("e1.employee_name"), col("e2.employee_name").alias("manager_name"))

df_self_join.show()
18. Write PySpark code to unpivot a DataFrame.

Scenario: You have a DataFrame with "year" columns and want to convert them to rows.

from pyspark.sql.functions import expr

df_unpivot = df.selectExpr("id", "stack(2, '2021', sales_2021, '2022', sales_2022) as (year, sales)")

df_unpivot.show()

19. Write a PySpark code to group data based on multiple columns and calculate aggregate functions.

Scenario: Group data by "product" and "region" and calculate the average sales for each group.

df.groupBy("product", "region").agg({"sales": "avg"}).show()

20. Write PySpark code to remove rows with duplicate values in any column.

Scenario: You want to remove rows where any column has duplicate values.

df_cleaned = df.dropDuplicates()

df_cleaned.show()

21. Write PySpark code to read a CSV file and infer its schema.

Scenario: You need to load a CSV file into a DataFrame, ensuring the schema is inferred.

df = spark.read.option("header", "true").option("inferSchema", "true").csv("path_to_csv")

df.show()

22. Write PySpark code to merge multiple small files into a single file.

Scenario: You have multiple small files in HDFS, and you want to consolidate them into one large file.

df.coalesce(1).write.mode("overwrite").csv("output_path")
23. Write PySpark code to calculate the cumulative sum of a column.

Scenario: You want to calculate a cumulative sum of sales in your DataFrame.

from pyspark.sql.window import Window

from pyspark.sql.functions import sum

windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)

df_with_cumsum = df.withColumn("cumulative_sum", sum("sales").over(windowSpec))

df_with_cumsum.show()

24. Write PySpark code to find outliers in a dataset.

Scenario: Detect outliers in the "sales" column based on the 1.5 * IQR rule.

from pyspark.sql.functions import col

q1 = df.approxQuantile("sales", [0.25], 0.01)[0]

q3 = df.approxQuantile("sales", [0.75], 0.01)[0]

iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr

upper_bound = q3 + 1.5 * iqr

df_outliers = df.filter((col("sales") < lower_bound) | (col("sales") > upper_bound))

df_outliers.show()

25. Write PySpark code to convert a DataFrame to a Pandas DataFrame.

Scenario: Convert your PySpark DataFrame into a Pandas DataFrame for local processing.

pandas_df = df.toPandas()
Reference: Data Cleaning with Apache Spark (https://louisazhou.gitbook.io/notes/spark/data-cleaning-with-apache-spark); the page screenshots are omitted here.
Key Differences in Apache Spark Components and Concepts

Hadoop vs. Spark Architecture

Aspect | Hadoop | Spark
Storage | Uses HDFS for storage | Uses in-memory processing for speed
Processing | MapReduce is disk-based | In-memory processing improves performance
Integration | Runs independently or with the Hadoop ecosystem | Can run on top of Hadoop; more flexible
Complexity | More complex setup and deployment | Simpler to deploy and configure
Performance | Slower for iterative tasks due to disk I/O | Better performance for iterative tasks

RDD vs. DataFrame vs. Dataset


Aspect | RDD | DataFrame | Dataset
API Level | Low-level, more control | High-level, optimized with Catalyst | High-level, type-safe
Schema | No schema, unstructured | Uses schema for structured data | Strongly typed, compile-time type safety
Optimization | No built-in optimization | Optimized using Catalyst | Optimized using Catalyst, with type safety
Type Safety | No type safety | No compile-time type safety | Provides compile-time type safety
Performance | Less optimized for performance | Better performance due to optimizations | Combines type safety with optimization

Follow me on LinkedIn – Shivakiran kotur


Action vs. Transformation
Aspect | Action | Transformation
Execution | Triggers execution of the Spark job | Builds up a logical plan of data operations
Return Type | Returns results or output | Returns a new RDD/DataFrame
Evaluation | Eager evaluation; executes immediately | Lazy evaluation; executed when an action is triggered
Computation | Involves actual computation (e.g., collect()) | Defines data transformations (e.g., map())
Performance | Causes data processing; affects performance | Does not affect performance until an action is called

Map vs. FlatMap


Aspect | Map | FlatMap
Output | Returns one output element per input element | Can return zero or more output elements per input
Flattening | Does not flatten output | Flattens the output into a single level
Use Case | Suitable for one-to-one transformations | Suitable for one-to-many transformations
Complexity | Simpler, straightforward | More complex due to variable number of outputs
Examples | map(x => x * 2) | flatMap(x => x.split(" "))

GroupByKey vs ReduceByKey
Aspect | GroupByKey | ReduceByKey
Operation | Groups all values by key | Aggregates values with the same key
Efficiency | Can lead to high shuffling | More efficient due to partial aggregation
Data Movement | Requires shuffling of all values | Minimizes data movement through local aggregation
Use Case | Useful for simple grouping | Preferred for aggregations and reductions
Performance | Less efficient with large datasets | Better performance for large datasets

Repartition Vs Coalesce
Aspect | Repartition | Coalesce
Partitioning | Can increase or decrease the number of partitions | Only decreases the number of partitions
Shuffling | Involves a full shuffle | Avoids a full shuffle, more efficient
Efficiency | More expensive due to shuffling | More efficient for reducing partitions
Use Case | Used for increasing partitions or balancing load | Used for reducing partitions, typically after filtering
Performance | Can be costly for large datasets | More cost-effective for reducing partitions

Cache vs Persist

Aspect | Cache | Persist
Storage Level | Defaults to MEMORY_ONLY | Can use various storage levels (e.g., MEMORY_AND_DISK)
Flexibility | Simplified, with default storage level | Offers more options for storage levels
Use Case | Suitable for simple caching scenarios | Suitable for complex caching scenarios requiring different storage levels
Implementation | Easier to use, shorthand for MEMORY_ONLY | More flexible, allows custom storage options
Performance | Suitable when memory suffices | More efficient when dealing with larger datasets and limited memory

Narrow Vs Wide Transformation
Aspect | Narrow Transformation | Wide Transformation
Partitioning | Each parent partition is used by one child partition | Requires data from multiple partitions
Shuffling | No shuffling required | Involves shuffling of data
Performance | More efficient and less costly | Less efficient due to data movement
Examples | map(), filter() | groupByKey(), join()
Complexity | Simpler and faster | More complex and slower due to data movement

Collect vs Take
Aspect | Collect | Take
Output | Retrieves all data from the RDD/DataFrame | Retrieves a specified number of elements
Memory Usage | Can be expensive and use a lot of memory | More memory-efficient
Use Case | Used when you need the entire dataset | Useful for sampling or debugging
Performance | Can cause performance issues with large data | Faster and more controlled
Action Type | Triggers full data retrieval | Triggers partial data retrieval

Broadcast Variable vs Accumulator


Aspect | Broadcast Variable | Accumulator
Purpose | Efficiently shares read-only data across tasks | Tracks metrics and aggregates values
Data Type | Data that is shared and read-only | Counters and sums, often numerical
Use Case | Useful for large lookup tables or configurations | Useful for aggregating metrics like counts
Efficiency | Reduces data transfer by broadcasting data once | Efficient for aggregating values across tasks
Mutability | Immutable, read-only | Mutable, can be updated during computation

Spark SQL vs DataFrame API
Aspect | Spark SQL | DataFrame API
Interface | Executes SQL queries | Provides a programmatic interface
Syntax | Uses SQL-like syntax | Uses function-based syntax
Optimization | Optimized with Catalyst | Optimized with Catalyst
Use Case | Preferred for complex queries and legacy SQL code | Preferred for programmatic data manipulations
Integration | Can integrate with Hive and other SQL databases | Provides a unified interface for different data sources

Spark Streaming Vs Structured Streaming


Aspect | Spark Streaming | Structured Streaming
Processing | Micro-batch processing | Micro-batch and continuous processing
API | RDD-based API | SQL-based API with DataFrame/Dataset support
Complexity | More complex and lower-level | Simplified with high-level APIs
Consistency | Can be less consistent due to micro-batches | Provides stronger consistency guarantees
Performance | Can be slower for complex queries | Better performance with optimizations

Shuffle vs MapReduce
Aspect | Shuffle | MapReduce
Operation | Data reorganization across partitions | Data processing model for distributed computing
Efficiency | Can be costly due to data movement | Designed for batch processing with high I/O
Performance | Affects performance based on the amount of data movement | Optimized for large-scale data processing but less efficient for iterative tasks
Use Case | Used in Spark for data redistribution | Used in Hadoop for data processing tasks
Implementation | Integrated into Spark operations | Core component of the Hadoop ecosystem

Union vs Join
Aspect | Union | Join
Operation | Combines two DataFrames/RDDs into one | Combines rows from two DataFrames/RDDs based on a key
Data Requirements | Requires the same schema for both DataFrames/RDDs | Requires a common key for joining
Performance | Generally faster as it does not require key matching | Can be slower due to key matching and shuffling
Output | Stacks data vertically | Merges data horizontally based on keys
Use Case | Appending data or combining datasets | Merging related data based on keys

Executor vs Driver
Aspect | Executor | Driver
Role | Executes tasks and processes data | Coordinates and manages the Spark application
Memory | Memory allocated per executor for data processing | Memory used for managing application execution
Lifecycle | Exists throughout the application execution | Starts and stops the Spark application
Tasks | Runs the tasks assigned by the driver | Schedules and coordinates tasks and jobs
Parallelism | Multiple executors run in parallel | Single driver coordinates multiple executors
Checkpointing vs Caching
Aspect | Checkpointing | Caching
Purpose | Provides fault tolerance and reliability | Improves performance by storing intermediate data
Storage | Writes data to stable storage (e.g., HDFS) | Stores data in memory or on disk (depends on storage level)
Use Case | Used for recovery in case of failures | Used for optimizing repeated operations
Impact | Can be more costly and slow | Generally faster but not suitable for fault tolerance
Data | Data is written to external storage | Data is kept in memory or on disk for quick access

ReducebyKey vs AggregateByKey
Aspect | ReduceByKey | AggregateByKey
Operation | Combines values with the same key using a function | Performs custom aggregation and combinatory operations
Efficiency | More efficient for simple aggregations | Flexible for complex aggregation scenarios
Shuffling | Involves shuffling but can be optimized | Can be more complex due to custom aggregation
Use Case | Suitable for straightforward aggregations | Ideal for advanced and custom aggregations
Performance | Generally faster for simple operations | Performance varies with complexity

SQL Context vs Hive Context vs Spark Session


Aspect | SQL Context | Hive Context | Spark Session
Purpose | Provides SQL query capabilities | Provides integration with Hive for SQL queries | Unified entry point for Spark functionality
Integration | Basic SQL capabilities | Integrates with the Hive Metastore | Combines SQL, DataFrame, and Streaming APIs
Usage | Legacy, less functionality | Supports HiveQL and Hive UDFs | Supports all Spark functionality including Hive
Configuration | Less flexible and older | Requires Hive setup and configuration | Modern and flexible, manages configurations
Capabilities | Limited to SQL queries | Extends SQL capabilities with Hive integration | Comprehensive access to all Spark features

Broadcast Join Vs Shuffle Join
Aspect | Broadcast Join | Shuffle Join
Operation | Broadcasts a small dataset to all nodes | Shuffles data across nodes for joining
Data Size | Suitable for small datasets | Suitable for larger datasets
Efficiency | More efficient for small tables | Better suited for large datasets
Performance | Faster due to reduced shuffling | Can be slower due to extensive shuffling
Use Case | Use when one dataset is small relative to the other | Use when both datasets are large
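For example, a broadcast join can be requested explicitly with a hint; here large_df and
small_df are assumed DataFrames sharing a customer_id column, and without the hint Spark
broadcasts automatically only below spark.sql.autoBroadcastJoinThreshold:

from pyspark.sql.functions import broadcast

joined = large_df.join(broadcast(small_df), "customer_id")
joined.explain()   # the physical plan should show a BroadcastHashJoin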

Spark Context vs Spark Session


Aspect | Spark Context | Spark Session
Purpose | Entry point for Spark functionality | Unified entry point for Spark functionalities
Lifecycle | Created before Spark jobs start | Manages the Spark application lifecycle
Functionality | Provides access to RDDs and basic Spark functionality | Provides access to RDD, DataFrame, SQL, and Streaming APIs
Configuration | Configuration is less flexible | More flexible and easier to configure
Usage | Older, used for legacy applications | Modern and recommended for new applications

Partitioning vs Bucketing
Aspect | Partitioning | Bucketing
Purpose | Divides data into multiple partitions based on a key | Divides data into buckets based on a hash function
Usage | Used to optimize queries by reducing data scanned | Used to improve join performance and maintain sorted data
Shuffling | Reduces shuffling by placing related data together | Reduces shuffle during joins and aggregations
Data Layout | Data is physically separated based on the partition key | Data is organized into fixed-size buckets
Performance | Improves performance for queries involving partition keys | Enhances performance for join operations
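A write-side sketch of the two layouts (the table name, output path, and columns are
illustrative, and bucketBy requires saving as a table):

# Partitioned layout: one directory per country value
(df.write
   .partitionBy("country")
   .mode("overwrite")
   .parquet("/warehouse/sales_by_country"))

# Bucketed layout: rows hashed into 8 buckets by customer_id
(df.write
   .bucketBy(8, "customer_id")
   .sortBy("customer_id")
   .mode("overwrite")
   .saveAsTable("sales_bucketed"))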
