Pyspark Interview Code

1. What is Apache Spark?

Apache Spark is an open-source cluster-computing framework for fast, flexible data processing. It
has an advanced execution engine that supports cyclic data flow and in-memory computing. Apache
Spark can run on Hadoop, as a standalone system, or in the cloud, and it is capable of accessing
diverse data sources including HDFS, HBase and Cassandra, among others.
2. Explain the key features of Spark
Spark integrates with Hadoop and with files stored in HDFS.
It has its own Scala interpreter and hence comes with an interactive language shell.
It consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes
of a cluster.
It supports multiple analytic tools for interactive query analysis, real-time analysis and graph
processing. Additionally, some of the salient features of Spark include:
Lightning-fast processing: When it comes to Big Data processing, speed always matters, and Spark
runs workloads far faster than Hadoop MapReduce. Spark makes this possible by reducing the number
of read/write operations to disk and storing intermediate processing data in memory.
Support for sophisticated analytics: In addition to simple ‘map’ and ‘reduce’ operations, Spark
supports SQL queries, streaming data, and complex analytics such as machine learning and graph
algorithms. This allows users to combine all these capabilities in a single workflow.
Real-time stream processing: Spark can handle real-time streaming. MapReduce primarily processes
previously stored data; although other frameworks also offer real-time streaming, Spark handles it
particularly well.
3. What is ‘RDD’?
RDD stands for Resilient Distributed Dataset: RDD is the fundamental data structure of Spark. It is
an immutable distributed collection of objects, made up of fault-tolerant elements that can be
operated on in parallel.
elements that run in parallel. The partitioned data in RDD is immutable and is distributed in
nature. Each dataset in RDD is divided into logical partitions, which may be computed on
different nodes of the cluster.
4. How does one create RDDs in Spark?
In Spark, parallelized collections are created by calling the SparkContext ‘parallelize’ method on
an existing collection in your driver program.
val data = Array(4,6,7,8)
val distData = sc.parallelize(data)
Text file RDDs can be created using SparkContext’s ‘textFile’ method. Spark has the ability to
create distributed datasets from any storage source supported by Hadoop, including your local
file system, HDFS, Cassandra, HBase, Amazon S3, among others. Spark supports text files,
‘SequenceFiles’, and any other Hadoop ‘InputFormat’ components.
val inputfile = sc.textFile("input.txt")
5. What does the Spark Engine do?
Spark Engine is responsible for scheduling, distributing and monitoring the data application
across the cluster.
6. Define ‘Partitions’.
A ‘Partition’ is a smaller, logical division of data, similar to a ‘split’ in MapReduce.
Partitioning is the process of deriving logical units of data in order to speed up data
processing.
Here’s an example: val someRDD = sc.parallelize(1 to 100, 4)
Here an RDD of 100 elements is created in four partitions; a map task can then run on each
partition in parallel before the elements are collected back to the driver program.
7. What operations does the ‘RDD’ support?
Transformations
Actions
When should you not use Apache Spark? Explain with reasons.

There are a few situations where it may not be appropriate to use Apache Spark, which is a
powerful open-source big data processing framework:
1. Small Data: If the data you are working with is small enough that it can be easily processed
by a single machine, then using Spark may be overkill. Spark is designed to handle large
amounts of data, and its overhead may not be worth the investment for small data sets.
2. Low Latency: Spark is designed for batch processing and does not provide low latency for
real-time data processing. If you need to process data in real-time, other technologies such
as Apache Storm or Apache Flink may be more appropriate.
3. Limited Resources: Spark requires significant resources, including memory and CPU, to run
effectively. If your cluster does not have enough resources to run Spark, it may not be able
to perform as expected.
4. Complexity: Spark is a complex system with many components and configuration options.
If your team is not familiar with Spark or big data processing in general, it may take a
significant amount of time and resources to get up to speed and effectively use the
framework.
5. Lack of Support: Spark is written in Scala and runs on the Java Virtual Machine (JVM). If
you’re working with a non-JVM language like Python, you might need to do extra work to
get the same functionality in Spark.
6. Limited scalability: Spark is designed to run on a cluster of machines, but it is not as well-
suited for extremely large data sets as other technologies such as Apache Hadoop.
Spark is a powerful big data processing framework, but it may not be the best choice for every
situation. It is important to evaluate the specific requirements of your project at an early stage,
along with the resources available, to determine whether Spark is the right choice for your use
case.
Learn how to connect Hive with Apache Spark.
 
HiveContext is a Spark SQL module that allows you to work with Hive data in Spark. It provides a
way to access the Hive metastore, which stores metadata about Hive tables, partitions, and other
objects. With HiveContext, you can use the same SQL-like syntax that you would use in Hive to
query and manipulate data stored in Hive tables.
Here’s an example of how to use HiveContext in Spark:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Create the Spark configuration and Spark context
conf = SparkConf().setAppName("HiveContextExample")
sc = SparkContext(conf=conf)

# Create the HiveContext
hc = HiveContext(sc)

# Load data from a Hive table
data = hc.sql("SELECT * FROM mydatabase.mytable")

# Show the data
data.show()

In this example, we first import the necessary modules (SparkConf, SparkContext, and
HiveContext) from the pyspark library. Next, we create a SparkConf and SparkContext, which are
used to configure and start the Spark application. Then, we create a HiveContext using the
SparkContext.
After that, we use the HiveContext to execute an SQL-like query “SELECT * FROM
mydatabase.mytable” to load data from a Hive table, and then use the show() method to display
the data.
Please note that, for this example to work, you need to have Hive installed and configured
properly in your environment, and your Spark should be configured to use Hive. Also the table
“mytable” should already exist in Hive.
Keep in mind that HiveContext has been deprecated since Spark 2.0; instead you should use
SparkSession, which is a unified entry point for reading structured data and can be used to
create DataFrames, create Hive tables, cache tables, and read Parquet files as well.
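For newer Spark versions, a minimal sketch of the SparkSession-based equivalent is shown below; it
assumes the same environment as above (Hive configured for Spark, and the table
mydatabase.mytable already existing).

from pyspark.sql import SparkSession

# Build a SparkSession with Hive support enabled (replaces HiveContext on Spark 2.x+)
spark = SparkSession.builder \
    .appName("HiveWithSparkSession") \
    .enableHiveSupport() \
    .getOrCreate()

# Same SQL query as in the HiveContext example above
data = spark.sql("SELECT * FROM mydatabase.mytable")
data.show()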
How do you break a lineage in Apache Spark? Why do we need to break a lineage in Apache
Spark?

In Apache Spark, a lineage refers to the series of RDD (Resilient Distributed Dataset) operations
that produced a specific dataset. Breaking a lineage means preventing the system from having to
track and replay the full chain of transformations for an RDD, which can reduce the amount of
memory used and improve performance.
There are a few ways to break a lineage in Apache Spark:
1. Persistence: By calling persist() or cache() on an RDD, you make it memory resident, which
allows Spark to reuse it in later stages without having to recompute it.
2. Checkpointing: By checkpoint() an RDD, you can save it to a reliable storage system, such
as HDFS, which allows Spark to recover the RDD in case of failure. This also breaks the
lineage of the RDD, as it is no longer stored in memory.
3. Materializing: By calling an action on an RDD, such as count() or collect(), you force it to be
computed; combined with caching, this materializes the RDD so that later stages read the stored data.
4. Using the unpersist() method: this removes a persisted RDD from memory once it is no longer
needed.
It’s important to note that breaking the lineage of an RDD can have a positive impact on
performance, but persisting or checkpointing also consumes memory or storage, so these techniques
should be used judiciously.
Real example on breaking the lineage
One scenario where breaking a lineage in Apache Spark may be necessary is when working with
large datasets that are transformed multiple times before being used for analysis or storage. In
this case, the lineage of the RDD can become quite complex, and the system may need to keep
track of all the intermediate transformations, which can consume a significant amount of
memory.
For example, you may have a large dataset that you need to filter, group, and aggregate multiple
times before it is ready for analysis. Each of these operations would create a new lineage, and the
system would need to keep track of all the previous transformations in order to recompute the
final result if necessary. Breaking the lineage by caching the RDD after one or more of the
transformations could help to reduce the memory usage, while also improving performance.
Another scenario can be when working with iterative algorithms like MLlib algorithms. These
algorithms need to iterate over the same dataset multiple times and each iteration creates a new
lineage, which can take a lot of memory. By breaking the lineage after each iteration, you can
reduce the memory usage of the system and improve performance.
You can also break the lineage to improve performance when an RDD is not used again later in the
pipeline and does not need to be kept in memory.
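As a rough sketch of the iterative case described above (assuming an existing SparkContext sc; the
checkpoint directory path is illustrative):

# Illustrative HDFS path; any reliable storage location works
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

rdd = sc.parallelize(range(1000))
for i in range(10):
    rdd = rdd.map(lambda x: x + 1)   # each iteration extends the lineage
    if i % 5 == 4:
        rdd.cache()                  # keep the data in memory for reuse
        rdd.checkpoint()             # write to reliable storage, truncating the lineage
        rdd.count()                  # an action forces the checkpoint to materialize
print(rdd.count())                   # 1000 elements, each incremented 10 times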
PySpark : Inserting row in Apache Spark Dataframe.

In PySpark, you can insert a row into a DataFrame by first converting the DataFrame to a RDD
(Resilient Distributed Dataset), then adding the new row to the RDD, and finally converting the
RDD back to a DataFrame.
Here is an example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Insert Row").getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([
    (1, "Michele John", 25),
    (2, "Barry Berner", 30),
    (3, "Jacob Jim", 35)],
    ["id", "name", "age"])
df.show()

Input data

+---+------------+---+
| id|        name|age|
+---+------------+---+
|  1|Michele John| 25|
|  2|Barry Berner| 30|
|  3|   Jacob Jim| 35|
+---+------------+---+


# Convert the DataFrame to an RDD


rdd = df.rdd
# Add the new row to the RDD
new_row = (4, "Elaine Berer Lee", 22)
rdd = rdd.union(spark.sparkContext.parallelize([new_row]))
# Convert the RDD back to a DataFrame
df = spark.createDataFrame(rdd, df.schema)
# Show the DataFrame
df.show()

Result

+---+----------------+---+
| id|            name|age|
+---+----------------+---+
|  1|    Michele John| 25|
|  2|    Barry Berner| 30|
|  3|       Jacob Jim| 35|
|  4|Elaine Berer Lee| 22|
+---+----------------+---+

This code creates a DataFrame with three rows and three columns, then converts it to an RDD.
It then creates a tuple with the values for the new row and adds it to the RDD using the union()
method. Finally, it converts the RDD back to a DataFrame using the same schema as the original
DataFrame, and shows the resulting DataFrame, which now has the new row appended at the bottom.
It’s worth noting that this method of inserting a row is not efficient for large DataFrames; if you
need to insert a large number of rows, it’s better to use Spark SQL or the DataFrame API to add the
new rows.
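For example, the DataFrame-only alternative hinted at above can look like the following sketch
(reusing the df built earlier; the new values are just sample data):

# Build a one-row DataFrame with the same schema and append it with union()
new_row_df = spark.createDataFrame([(4, "Elaine Berer Lee", 22)], df.schema)
df = df.union(new_row_df)
df.show()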
PySpark : How to decode in PySpark ?
 

pyspark.sql.functions.decode
The pyspark.sql.functions.decode Function in PySpark
PySpark is a popular library for processing big data using Apache Spark. One of its many functions
is the pyspark.sql.functions.decode function, which is used to convert binary data into a string
using a specified character set. The pyspark.sql.functions.decode function takes two arguments:
the first argument is the binary data to be decoded, and the second argument is the character set
to use for decoding the binary data.
The pyspark.sql.functions.decode function in PySpark supports the following character sets: US-
ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16. The character set specified in the
second argument must match one of these supported character sets in order to perform the
decoding successfully.
Here’s a simple example to demonstrate the use of the pyspark.sql.functions.decode function in
PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, decode

# Initializing the Spark session
spark = SparkSession.builder.appName("DecodeFunction").getOrCreate()

# Creating a DataFrame with sample data
data = [("Team",), ("Freshers.in",)]
df = spark.createDataFrame(data, ["binary_data"])

# Decoding the binary data as UTF-8
df = df.withColumn("string_data", decode(col("binary_data"), "UTF-8"))

# Showing the result
df.show()

Output

+-----------+-----------+
|binary_data|string_data|
+-----------+-----------+
|       Team|       Team|
|Freshers.in|Freshers.in|
+-----------+-----------+

In the above example, the pyspark.sql.functions.decode function is used to decode binary data
into a string. The first argument to the pyspark.sql.functions.decode function is the binary data
to be decoded, which is stored in the “binary_data” column. The second argument is the
character set to use for decoding the binary data, which is “UTF-8”. The function returns a new
column “string_data” that contains the decoded string data.
The pyspark.sql.functions.decode function is a useful tool for converting binary data into a string
format that can be more easily analyzed and processed. It is important to specify the correct
character set for the binary data, as incorrect character sets can result in incorrect decoded data.
In conclusion, the pyspark.sql.functions.decode function in PySpark is a valuable tool for
converting binary data into a string format. It supports a variety of character sets and is an
important tool for processing binary data in PySpark.

PySpark : Explain in detail whether Apache Spark SQL is lazy or not ?

Yes, Apache Spark SQL is lazy.


In Spark, the concept of “laziness” refers to the fact that computations are not executed
immediately when they are invoked, but rather are recorded and executed only when an action is
called. This means that Spark will not execute any transformations or computations on the data
until an action is called, such as count(), first(), or write().
For example, when you write a query using Spark SQL, the query is not executed immediately.
Instead, it is recorded and analyzed, and a logical execution plan is constructed. This logical plan
is then optimized and executed only when an action is called. This allows Spark to optimize the
execution plan by taking into account the entire data flow, rather than executing each query or
transformation as it is encountered.
When it comes to Spark SQL, the execution of a SQL query is also recorded, and executed only
when an action is called. This allows Spark to optimize the query for the specific data source it is
reading from. For example, if the data is stored in Parquet or ORC files, Spark can use specific
readers for those file formats to optimize the query execution.
In summary, Spark SQL is lazy, which means that it does not execute the query immediately but
records it and waits for the action to be called. This allows Spark to optimize the execution plan
and execute the query efficiently.
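A small illustration of this behaviour, assuming an existing SparkSession named spark:

# These transformations only build a logical plan; nothing runs yet
df = spark.range(1000000)
filtered = df.filter(df.id % 2 == 0)
doubled = filtered.withColumn("double_id", filtered.id * 2)

# Only the action below triggers the optimized plan to execute
print(doubled.count())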
In PySpark, what is the difference between spark.table() and spark.read.table()?

In PySpark, spark.table() is used to read a table from the Spark catalog, whereas
spark.read.table() is used to read a table from a structured data source, such as a data lake or a
database.
The spark.table() method requires that a table has previously been created in the Spark catalog
and registered, for example with saveAsTable(), spark.catalog.createTable(), or a CREATE TABLE SQL
statement. Once a table has been registered in the catalog, you can use the spark.table() method to
access it.
On the other hand, the spark.read entry point (of which spark.read.table() is part) reads a table
from a structured data source and returns a DataFrame. It takes a configuration specifying the data
source format and the options needed to read the table.
Here is an example of using spark.read with the JDBC format to read a table from a database:

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost/mydatabase") \
    .option("dbtable", "mytable") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
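For comparison, reading a catalog table with spark.table() is a one-liner; the sketch below assumes
the table has already been registered in the metastore (the names are illustrative):

# Resolve a table that already exists in the Spark catalog
df = spark.table("mydatabase.mytable")
df.show()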

PySpark : Combine two or more arrays into a single array of tuples

pyspark.sql.functions.arrays_zip
In PySpark, the arrays_zip function can be used to combine two or more arrays into a single array
of tuples. Each tuple in the resulting array contains the elements from the corresponding position
in the input arrays. In other words, it returns a merged array of structs in which the N-th struct
contains the N-th value of each input array.

from pyspark.sql.functions import arrays_zip


df = spark.createDataFrame([([1, 2, 3], ['Sam John', 'Perter Walter', 'Johns Mike'])], ['si_no', 'name'])
df.show(20, False)


+---------+-------------------------------------+
|si_no    |name                                 |
+---------+-------------------------------------+
|[1, 2, 3]|[Sam John, Perter Walter, Johns Mike]|
+---------+-------------------------------------+


zipped_array = df.select(arrays_zip(df.si_no,df.name))
zipped_array.show(20,False)

Result

+----------------------------------------------------+
|arrays_zip(si_no, name)                             |
+----------------------------------------------------+
|[[1, Sam John], [2, Perter Walter], [3, Johns Mike]]|
+----------------------------------------------------+
You can also use arrays_zip with more than two arrays as input. For example:

from pyspark.sql.functions import arrays_zip


df = spark.createDataFrame([([1, 2, 3], ['Sam John', 'Perter Walter', 'Johns Mike'], [23, 43, 41])], ['si_no', 'name', 'age'])
zipped_array = df.select(arrays_zip(df.si_no, df.name, df.age))
zipped_array.show(20, False)

Result

+----------------------------------------------------------------+
|arrays_zip(si_no, name, age)                                    |
+----------------------------------------------------------------+
|[[1, Sam John, 23], [2, Perter Walter, 43], [3, Johns Mike, 41]]|
+----------------------------------------------------------------+

Apache Spark interview questions

8. Define ‘Transformations’ in Spark.


‘Transformations’ are functions applied to an RDD that result in a new RDD. They do not execute
until an action occurs. map() and filter() are examples of ‘transformations’: map() applies the
function assigned to it to each element of the RDD and produces another RDD, while filter() creates
a new RDD by selecting only those elements of the current RDD that satisfy a given condition.
9. Define ‘Action’ in Spark.
An ‘action’ brings the data from the RDDs back to the local machine (the driver). Executing an
‘action’ triggers all of the previously defined transformations. reduce() is an action that applies
the function passed to it again and again until only one value is left, while the take(n) action
fetches the first n values from the RDD to the local node.
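A short PySpark sketch of both concepts, assuming an existing SparkContext sc:

rdd = sc.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)             # transformation: lazily defines a new RDD
evens = squared.filter(lambda x: x % 2 == 0)   # transformation: still nothing is computed

total = evens.reduce(lambda a, b: a + b)       # action: triggers execution, returns 4 + 16 = 20
first_two = squared.take(2)                    # action: returns the first 2 elements, [1, 4]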
10. What are the functions of ‘Spark Core’?
The ‘SparkCore’ performs an array of critical functions like memory management, monitoring
jobs, fault tolerance, job scheduling and interaction with storage systems.
It is the foundation of the overall project. It provides distributed task dispatching, scheduling,
and basic input and output functionalities. RDDs in Spark Core make it fault tolerant. An RDD is a
collection of items distributed across many nodes that can be manipulated in parallel. Spark Core
provides many APIs for building and manipulating these collections.
11. What is an ‘RDD Lineage’?
Spark does not support data replication in memory. In the event of any data loss, the data is
rebuilt using the ‘RDD Lineage’, the process that reconstructs lost data partitions. The lineage is
the graph of all the parent RDDs of an RDD; toDebugString() can be used to print it.
12. What is a ‘Spark Driver’?
‘Spark Driver’ is the program that runs on the master node of the machine and declares
transformations and actions on data RDDs. The driver also delivers RDD graphs to the ‘Master’,
where the standalone cluster manager runs. From a technical perspective, it is the process that
creates the SparkContext.
13. What is SparkContext?
‘SparkContext’ is the main entry point for Spark functionality. A ‘SparkContext’ represents the
connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast
variables on that cluster.
14. What is Hive on Spark?
Hive is a component of Hortonworks’ Data Platform (HDP). Hive provides an SQL-like interface to
data stored in the HDP. Spark users will automatically get the complete set of Hive’s rich
features, including any new features that Hive might introduce in the future.
The main task in implementing the Spark execution engine for Hive lies in query planning, where
the Hive operator plans produced by the semantic analyzer are translated into a task plan that
Spark can execute. It also includes query execution, where the generated Spark plan actually gets
executed on the Spark cluster.

Installing Apache Spark standalone on Linux

Installing Spark on a Linux machine can be done in a few steps. The following is a detailed guide
on how to install Spark in standalone mode on a Linux machine.
1. Install Java: Spark requires Java to be installed on the machine. You can check if Java is
already installed by running the command java -version. If Java is not installed, you can
install it by running sudo apt-get install openjdk-8-jdk or sudo yum install java-1.8.0-
openjdk-devel depending on your Linux distribution.
2. Download Spark: Go to the Spark website (https://spark.apache.org/downloads.html) and
download the latest version of Spark in the pre-built package for Hadoop. You can
download the package in tar format or in binary format.
3. Extract the package: Extract the package you downloaded in the previous step. You can
use the tar command to extract the package: tar -xvf spark-x.y.z-bin-
hadoopx.y.z.tar (replace x.y.z with the version number you downloaded). This will create a
directory called spark-x.y.z-bin-hadoopx.y.z.
4. Set environment variables: You need to set some environment variables to make Spark
work. You can do this by adding the following lines to your .bashrc file:

export SPARK_HOME=/path/to/spark-x.y.z-bin-hadoopx.y.z
export PATH=$PATH:$SPARK_HOME/bin

(replace the /path/to/ with the path to the directory where you extracted the Spark package)
5. Start the Spark Master: You can start the Spark Master by running the command start-
master.sh from the sbin directory of your Spark installation. You can access the Spark
Master web UI by going to http://<master-url>:8080 in your web browser.
6. Start the Spark Worker: You can start the Spark Worker by running the command start-
worker.sh <master-url> from the sbin directory of your Spark installation. Replace <master-
url> with the URL of the master node.
7. Verify the installation: You can verify the installation by running the pyspark command in
your terminal. This will start the PySpark shell. You can run Spark commands and check the
status of the cluster by visiting the Master web UI.
8. Optional: configure Spark: you can configure Spark by editing the conf/spark-
defaults.conf file.
You have now installed Spark in standalone mode on your Linux machine. You can now use Spark
to run big data processing and analytics tasks.
You should make sure that the version of Hadoop you are running is compatible with the version
of Spark you installed. You should also check the system requirements for Spark before installing
it, as it requires a certain amount of memory and disk space.
Spark : Advantages of Google’s Serverless Spark

Google’s Serverless Spark has several advantages compared to traditional Spark clusters:
1. Cost-effective: Serverless Spark eliminates the need for dedicated servers and
infrastructure, reducing costs for managing, scaling and maintaining Spark clusters.
2. Scalability: Serverless Spark can automatically scale up or down based on the workload,
without the need for manual intervention.
3. Improved performance: With serverless Spark, you only pay for what you use, and the
execution of Spark jobs is optimized for maximum efficiency, resulting in improved
performance.
4. Flexibility: Serverless Spark provides the ability to run Spark jobs on a variety of different
compute resources, including virtual machines and Kubernetes clusters, making it easy to
switch between different execution environments.
5. Ease of use: Serverless Spark provides a simple and intuitive interface for Spark users,
making it easier to run Spark jobs without the need for deep technical knowledge.
6. Integration with Google Cloud services: Serverless Spark integrates seamlessly with Google
Cloud services, providing a comprehensive platform for data processing and analysis.
Serverless Spark provides organizations with a cost-effective, scalable, and flexible solution for
running Spark jobs, while also improving performance and reducing the complexity of Spark
administration.

15. Name a few commonly used Spark Ecosystems.


Spark SQL (formerly Shark), Spark Streaming, GraphX, MLlib and SparkR.
16. What is ‘Spark Streaming’?
Spark supports stream processing, essentially an extension to the Spark API that allows stream
processing of live data streams. Data from different sources like Flume and HDFS is streamed and
processed to file systems, live dashboards and databases. It is similar to batch processing in that
the input data is divided into micro-batches.
Business use cases for Spark Streaming: each Spark component has its own use case. Whenever you
want to analyze data with a latency of less than 15 minutes but greater than about 2 minutes, i.e.
near real time, Spark Streaming is the right choice.
17. What is ‘GraphX’ in Spark?
‘GraphX’ is a component in Spark which is used for graph processing. It helps to build and
transform interactive graphs.
18. What is the function of ‘MLlib’?
‘MLlib’ is Spark’s machine learning library. It aims at making machine learning easy and scalable
with common learning algorithms and real-life use cases including clustering, regression, filtering,
and dimensionality reduction, among others.
19. What is ‘Spark SQL’?
Spark SQL is a Spark interface to work with structured as well as semi-structured data. It has the
capability to load data from multiple structured sources like ‘textfiles’, JSON files, Parquet files,
among others. Spark SQL provides a special type of RDD called SchemaRDD. These are row
objects, where each object represents a record.
Here’s how you can create an SQL context in Spark SQL:
SQL context: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
20. What is a ‘Parquet’ in Spark?
‘Parquet’ is a columnar format file supported by many data processing systems. Spark SQL
performs both read and write operations with the ‘Parquet’ file.
Organizing by column allows for better compression, as data is more homogeneous.
I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data.
Better compression also reduces the bandwidth required to read the input.
As we store data of the same type in each column, we can use encoding better suited to the
modern processors.
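A minimal read/write sketch, assuming an existing SparkSession spark and a DataFrame df with a
"name" column (the path and column are illustrative):

# Write the DataFrame as Parquet, then read back only one column;
# the columnar layout means only that column's data needs to be scanned
df.write.mode("overwrite").parquet("/tmp/people.parquet")
spark.read.parquet("/tmp/people.parquet").select("name").show()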
21. What is an ‘Accumulator’?
‘Accumulators’ are Spark’s offline debuggers. Similar to ‘Hadoop Counters’, ‘Accumulators’
provide the number of ‘events’ in a program.
Accumulators are variables that can be added to through associative operations. Spark natively
supports accumulators of numeric value types and standard mutable collections.
‘aggregateByKey()’ and ‘combineByKey()’ use accumulators. When Spark ships code to every
executor, the variables become local to that executor and their updated values are not relayed back
to the driver. To avoid this problem we make the variable an accumulator, so that all the updates to
the variable in every executor are relayed back to the driver.
22. Which file systems does Spark support?
Hadoop Distributed File System (HDFS)
Local File system
S3
23. What is ‘YARN’?
‘YARN’ is a large-scale, distributed operating system for big data applications. It is one of the key
features of Spark, providing a central and resource management platform to deliver scalable
operations across the cluster.
24. List the benefits of Spark over MapReduce.
Due to the availability of in-memory processing, Spark implements the processing around 10-100x
faster than Hadoop MapReduce.
Unlike MapReduce, Spark provides in-built libraries to perform multiple tasks from the same core,
like batch processing, streaming, machine learning and interactive SQL queries, among others.
MapReduce is highly disk-dependent whereas Spark promotes caching and in-memory data
storage
Spark is capable of iterative computation while MapReduce is not.
Additionally, Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses
replication to achieve fault tolerance while Spark uses a different data storage model, resilient
distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes
network input and output.
25. What is a ‘Spark Executor’?
When ‘SparkContext’ connects to a cluster manager, it acquires an ‘Executor’ on the cluster
nodes. ‘Executors’ are Spark processes that run computations and store the data on the worker
node. The final tasks by ‘SparkContext’ are transferred to executors.
26. List the various types of ‘Cluster Managers’ in Spark.
The Spark framework supports three major types of Cluster Managers:
a. Standalone: a basic manager to set up a cluster
b. Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce
and other applications
c. Yarn: responsible for resource management in Hadoop
27. What is a ‘worker node’?
‘Worker node’ refers to any node that can run the application code in a cluster.
28. Define ‘PageRank’.
‘PageRank’ is a measure of the importance of each vertex in a graph, and a unique feature and
algorithm in GraphX. For instance, an edge from u to v represents an endorsement of v’s importance
by u. In simple terms, if a user on Instagram is followed massively, that user will rank high on the
platform.
29. Can we do real-time processing using Spark SQL?
Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of
that.
30. What is the biggest shortcoming of Spark?
Spark utilizes more storage space compared to Hadoop and MapReduce.
Also, Spark streaming is not actually streaming, in the sense that some of the window functions
cannot properly work on top of micro batching.
No File Management System
Expensive – In-memory capability
31. What is Spark?
Spark is a parallel data processing framework. It allows you to develop fast, unified big data
applications that combine batch, streaming and interactive analytics.
32. Why Spark?
Spark is a third-generation distributed data processing platform. It is a unified big data solution
for all big data processing problems such as batch, interactive and streaming processing, so it can
ease many big data problems.
33. What is RDD?
Spark’s primary core abstraction is called the Resilient Distributed Dataset (RDD). An RDD is a
collection of partitioned data that satisfies these properties: immutable, distributed, lazily
evaluated and cacheable.
34. What is Immutable?
Once created and assigned a value, it is not possible to change it; this property is called
immutability. Spark RDDs are immutable by default and do not allow updates or modifications. Note
that it is the data values that are immutable, not the variable that refers to the collection.
35. What is Distributed?
The data in an RDD is automatically distributed across different parallel computing nodes.
36. What is Lazy evaluated?
When you execute a chain of operations, they are not necessarily evaluated immediately.
Transformations in particular are lazy and are only triggered when an action is called.
37. What is Cacheable?
Spark keeps all the data in memory for computation rather than going to the disk, so it can cache
data and compute up to 100 times faster than Hadoop.
38. What is Spark engine responsibility?
Spark is responsible for scheduling, distributing, and monitoring the application across the cluster.
39. What are common Spark Ecosystems?
Spark SQL(Shark) for SQL developers,
Spark Streaming for streaming data,
MLLib for machine learning algorithms,
GraphX for Graph computation,
SparkR to run R on Spark engine,
BlinkDB enabling interactive queries over massive data are common Spark ecosystems. GraphX,
SparkR and BlinkDB are in incubation stage.
40. What are Partitions?
A partition is a logical division of the data, an idea derived from MapReduce (split). Logical units
of data are derived specifically to process the data; small chunks of data also support scalability
and speed up the process. Input data, intermediate data and output data are all partitioned RDDs.
41. How does Spark partition the data?
Spark uses the MapReduce API to partition the data. In the InputFormat we can create the number of
partitions. By default the HDFS block size is the partition size (for best performance), but it is
possible to change the partition size, just like a split.
42. How does Spark store the data?
Spark is a processing engine; there is no storage engine. It can retrieve data from any storage
engine like HDFS, S3 and other data sources.
43. Is it mandatory to start Hadoop to run spark application?
No, it is not mandatory, but there is no separate storage in Spark, so it uses the local file system
to store the data. You can load data from the local system and process it; Hadoop or HDFS is not
mandatory to run a Spark application.
44. What is SparkContext?
When a programmer creates RDDs, a connection to the Spark cluster is needed, so a new SparkContext
object is created. The SparkContext tells Spark how to access the cluster, and a SparkConf is the
key factor for configuring the programmer’s application.
45. What is SparkCore functionalities?
SparkCore is the base engine of the Apache Spark framework. Memory management, fault tolerance,
scheduling and monitoring jobs, and interacting with storage systems are the primary functionalities
of Spark Core.
46. How is SparkSQL different from HQL and SQL?
SparkSQL is a special component on top of the Spark Core engine that supports SQL and the Hive
Query Language without changing any syntax. It is possible to join SQL tables and HQL tables.
47. When do we use Spark Streaming?
Spark Streaming is an API for real-time processing of streaming data. Spark Streaming gathers
streaming data from different sources like web server log files, social media data, stock market
data, or Hadoop-ecosystem tools like Flume and Kafka.
48. How does the Spark Streaming API work?
The programmer sets a specific time interval in the configuration; whatever data arrives within
that interval is grouped into a batch. The input stream (DStream) goes into Spark Streaming, and the
framework breaks it up into small chunks called batches, which are then fed into the Spark engine
for processing. The Spark Streaming API passes those batches to the core engine, which generates the
final results, also in the form of batches. This allows both streaming data and batch data to be
processed.
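A minimal sketch using the classic DStream API (deprecated in favour of Structured Streaming on
recent releases); the host, port and 5-second batch interval are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamWordCount")
ssc = StreamingContext(sc, 5)                     # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)   # read a text stream from a socket
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's word counts
ssc.start()
ssc.awaitTermination()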
49. What is Spark MLlib?
Mahout is a machine learning library for Hadoop; similarly, MLlib is Spark’s machine learning
library. MLlib provides different algorithms that scale out on the cluster for data processing. Most
data scientists use this MLlib library.
50. What is GraphX?
GraphX is a Spark API for manipulating Graphs and collections. It unifies ETL, other analysis, and
iterative graph computation. It is one of the fastest graph systems and provides fault tolerance
and ease of use without requiring special skills.
51. What is File System API?
FS API can read data from different storage devices like HDFS, S3 or local FileSystem. Spark uses
FS API to read data from different storage engines.
52. Why are Partitions immutable?
Every transformation generates a new partition. Partitions use the HDFS API so that each partition
is immutable, distributed and fault tolerant. Partitions are also aware of data locality.
53. What is Transformation in spark?
Spark provides two special types of operations on RDDs: transformations and actions.
Transformations are lazy operations that hold the data until an action is called.
Each transformation generates/returns a new RDD. map, flatMap, groupByKey, reduceByKey, filter,
cogroup, join, sortByKey, union, distinct and sample are common Spark transformations.
54. What is Action in Spark?
Actions are RDD operations whose values are returned back to the Spark driver program, and they
kick off a job to execute on a cluster. A transformation’s output is an action’s input. reduce,
collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey and foreach are
common actions in Apache Spark.
55. What is RDD Lineage?
Lineage is the process an RDD uses to reconstruct lost partitions. Spark does not replicate the
data in memory; if data is lost, the RDD uses its lineage to rebuild the lost data. Each RDD
remembers how it was built from other datasets.
56. What is Map and flatMap in Spark?
map processes a specific line or row of data at a time. In flatMap, each input item can be mapped
to multiple output items (so the function should return a Seq rather than a single item); it is
therefore most frequently used to return the elements of an array. map and flatMap are similar in
the sense that they take a line from the input RDD and apply a function to it. The way they differ
is that the function in map returns only one element, while the function in flatMap can return a
list of elements (0 or more) as an iterator. map is defined in the abstract class RDD in Spark and
is a transformation, which means it is a lazy operation.
57. What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather
than shipping a copy of it with tasks. Spark supports two types of shared variables: broadcast
variables (like the Hadoop distributed cache) and accumulators (like Hadoop counters). Broadcast
variables are stored as array buffers, which send read-only values to the worker nodes.
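A small sketch of a broadcast lookup table, assuming an existing SparkContext sc (the data is
illustrative):

lookup = {"US": "United States", "IN": "India", "DE": "Germany"}
bc_lookup = sc.broadcast(lookup)     # shipped once per executor, read-only

codes = sc.parallelize(["US", "IN", "DE", "US"])
names = codes.map(lambda c: bc_lookup.value.get(c, "Unknown"))
print(names.collect())               # ['United States', 'India', 'Germany', 'United States']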
58. What are Accumulators in Spark?
Accumulators are Spark’s offline debuggers. Spark accumulators are similar to Hadoop counters: you
can use them to count the number of events and track what is happening during a job. Only the
driver program can read an accumulator’s value, not the tasks.
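For instance, an accumulator can count bad records during parsing; the sketch below assumes an
existing SparkContext sc and uses made-up data:

bad_records = sc.accumulator(0)

def parse(line):
    if not line.strip().isdigit():
        bad_records.add(1)           # tasks may only add to the accumulator
        return 0
    return int(line)

total = sc.parallelize(["1", "2", "oops", "4"]).map(parse).sum()
print(total)                         # 7
print(bad_records.value)             # 1 -- only the driver can read the value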
59. How RDD persist the data?
There are two methods to persist the data: persist(), which lets you choose the storage level, and
cache(), which keeps the data in memory with the default level. Different storage-level options
exist, such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and many more; persist() and cache() use
different options depending on the task.
60. What do you understand by SchemaRDD?
An RDD that consists of row objects (wrappers around basic string or integer arrays) with
schema information about the type of data in each column. SchemaRDDs are composed Row
objects along with a schema that describes the data types of each column in the row.
61. Define a worker node.
A node that can run the Spark application code in a cluster can be called as a worker node. A
worker node can have more than one worker, which is configured by setting the
SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the
SPARK_WORKER_INSTANCES property is not defined.
62. What do you understand by Lazy Evaluation?
Spark is deliberate in the manner in which it operates on data. When you tell Spark to operate on a
given dataset, it heeds the instructions and makes a note of them, but does nothing unless the final
result is asked for. When a transformation like map() is called on an RDD, the operation is not
performed immediately. Transformations in Spark are not evaluated until you perform an action. This
helps optimize the overall data processing workflow.
63. Explain about the core components of a distributed Spark application.
Driver – The process that runs the main () method of the program to create RDDs and perform
transformations and actions on them.
Executor – The worker processes that run the individual tasks of a Spark job.
Cluster Manager – A pluggable component in Spark, to launch Executors and Drivers. The cluster
manager allows Spark to run on top of other external managers like Apache Mesos or YARN.

64. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
Data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance
through lineage. RDD always has the information on how to build from other datasets. If any
partition of a RDD is lost due to failure, lineage helps build only that particular lost partition.
65 How can you achieve high availability in Apache Spark?
Implementing single node recovery with local file system
Using StandBy Masters with Apache ZooKeeper.
66. How does Spark use Akka?
Spark uses Akka basically for scheduling. All the workers request a task from the master after
registering, and the master simply assigns the task. Spark uses Akka for messaging between the
workers and the master.
67 How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring
any admin rights. SIMR provides a quick way for Hadoop MapReduce 1 users to use Apache Spark.
68. Does Apache Spark provide checkpointing?
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time
consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a
REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user.
Checkpoints are useful when the lineage graphs are long and have wide dependencies.
69 How Spark handles monitoring and logging in Standalone mode?
Spark has a web based user interface for monitoring the cluster in standalone mode that shows
the cluster and job statistics. The log output for each job is written to the work directory of the
slave nodes.
70 What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations,
however it is often suggested that users call persist () method on the RDD in case they plan to
reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a
combination of both with different replication levels.
The various storage/persistence levels in Spark are –
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
71. What is the difference between persist() and cache()
Answer: persist() allows the user to specify the storage level, whereas cache() uses the default
storage level. With cache(), you use only the default storage level MEMORY_ONLY; with persist(),
you can specify which storage level you want, so cache() is the same as calling persist() with the
default storage level. Spark has many levels of persistence to choose from based on what our goals
are. The default persist() will store the data in the JVM heap as unserialized objects. When we
write data out to disk, that data is always serialized.
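A quick sketch of the difference, assuming an existing SparkContext sc:

from pyspark import StorageLevel

rdd1 = sc.parallelize(range(100))
rdd1.cache()                                  # same as persist() with the default storage level

rdd2 = sc.parallelize(range(100))
rdd2.persist(StorageLevel.MEMORY_AND_DISK)    # explicitly chosen storage level

rdd1.count()                                  # actions materialize the persisted data
rdd2.count()
rdd2.unpersist()                              # release the storage when no longer needed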
72. How can you remove the elements with a key present in any other RDD?
Use the subtractByKey () function. Remove elements with a key present in the other RDD.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtractByKey(y).collect())
[('b', 4), ('b', 5)]
73. What is Spark Core?
It has all the basic functionalities of Spark, like – memory management, fault recovery, interacting
with storage systems, scheduling tasks, etc.
74. Is Apache Spark a good fit for Reinforcement learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering,
regression, classification.
75.Explain about the popular use cases of Apache Spark
Apache Spark is mainly used for
Iterative machine learning.
Interactive data analytics and processing.
Stream processing
Sensor data processing
76. Explain about the different types of transformations on DStreams?
Stateless Transformations- Processing of the batch does not depend on the output of the
previous batch. Examples – map (), reduceByKey (), filter ().
Stateful Transformations- Processing of the batch depends on the intermediary results of the
previous batch. Examples –Transformations that depend on sliding windows.
77. What do you understand by Pair RDD?
Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are
referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a
reduceByKey () method that collects data based on each key and a join () method that combines
different RDDs together, based on the elements having the same key.Spark provides special
operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a
useful building block in many programs, as they expose operations that allow you to act on each
key in parallel or regroup data across the network
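A small illustration with hardcoded values (assuming an active SparkContext named sc):
sales = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
regions = sc.parallelize([("a", "east"), ("b", "west")])
print(sales.reduceByKey(lambda x, y: x + y).collect())   # [('a', 4), ('b', 2)] (order may vary)
print(sales.join(regions).collect())                     # [('a', (1, 'east')), ('a', (3, 'east')), ('b', (2, 'west'))]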
78. What are the key features of Apache Spark that you like?
Spark provides advanced analytic options like graph algorithms, machine learning, streaming
data, etc
It has built-in APIs in multiple languages like Java, Scala, Python and R
It has good performance gains, as it helps run an application in the Hadoop cluster ten times
faster on disk and 100 times faster in memory.
79. How Spark uses Hadoop?
Spark has its own cluster management and computation engine, and mainly uses Hadoop for storage.
80. What are the various data sources available in SparkSQL?
Parquet file
JSON Datasets
Hive tables
81. What is the advantage of a Parquet file?
A Parquet file is a columnar format file (technically a hybrid columnar format): all the column data
is stored in a single, compressed file. Parquet helps to –
Limit I/O operations
Consume less space
Fetch only the required columns
Offer better write performance by storing metadata at the end of the file
82. How can you compare Hadoop and Spark in terms of ease of use?
Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it
considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for
different languages like Java, Python or Scala and also includes Shark i.e. Spark SQL for SQL
lovers – making it comparatively easier to use than Hadoop.
83. Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and
renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query
accuracy’ with response time. BlinkDB is a massively parallel, approximate query engine for
running interactive SQL queries on large volumes of data.
84. Which spark library allows reliable file sharing at memory speed
across different cluster frameworks?
Tachyon, a memory centric fault-tolerant distributed file system, which enables reliable file
sharing at memory-speed across cluster frameworks, such as Spark and MapReduce
85. What is Catalyst framework?
Catalyst is the query optimization framework present in Spark SQL. It allows
Spark to automatically transform SQL queries by adding new optimizations to build a faster
processing system.
86. When running Spark applications, is it necessary to install Spark on all the nodes of YARN
cluster?
Spark need not be installed when running a job under YARN or Mesos because Spark can execute
on top of YARN or Mesos clusters without any change to the cluster.
87. What is a DStream?
A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
sequence of RDDs (of the same type) representing a continuous stream of data (see spark.RDD
for more details on RDDs). DStreams can either be created from live data (such as data from
HDFS, Kafka or Flume) or generated by transforming existing DStreams using
operations such as map, window and reduceByKeyAndWindow. While a Spark Streaming
program is running, each DStream periodically generates a RDD, either from live data or by
transforming the RDD generated by a parent DStream.
DStreams have two operations –
Transformations that produce a new DStream.
Output operations that write data to an external system.
DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an
immutable, distributed dataset
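A minimal DStream sketch using the classic Spark Streaming API (the host, port and batch interval below are hypothetical; sc is an existing SparkContext):
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, batchDuration=10)      # one RDD is generated every 10 seconds
lines = ssc.socketTextStream("localhost", 9999)   # hypothetical TCP source
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)   # transformation
counts.pprint()                                   # output operation
# ssc.start(); ssc.awaitTermination()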
88. What is the significance of Sliding Window operation?
Sliding Window controls transmission of data packets between various computer networks.
Spark Streaming library provides windowed computations where the transformations on RDDs
are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within
the particular window are combined and operated upon to produce new RDDs of the windowed
DStream.
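Continuing the sketch from the previous answer, a windowed count over the last 30 seconds that slides every 10 seconds (both durations are illustrative and must be multiples of the batch interval):
windowed_counts = words.map(lambda w: (w, 1)) \
    .reduceByKeyAndWindow(lambda a, b: a + b, None, windowDuration=30, slideDuration=10)
windowed_counts.pprint()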
89. What are the benefits of using Spark with Apache Mesos?
It renders/provides scalable partitioning among various Spark instances and dynamic partitioning
between Spark and other big data frameworks.
89. What are the advantages of using Apache Spark over Hadoop
MapReduce for big data processing?
Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.
Spark is up to 100 times faster than Hadoop for big data processing as it stores the data in-memory,
by placing it in Resilient Distributed Datasets (RDDs).
Spark is easier to program as it comes with an interactive mode.
It provides complete recovery using lineage graph whenever something goes wrong.
90. What is Shark?
Most of the data users know only SQL and are not good at programming. Shark is a tool,
developed for people who are from a database background – to access Scala MLlib capabilities
through Hive like SQL interface. Shark tool helps data users run Hive on Spark – offering
compatibility with Hive metastore, queries and data.
91. List some use cases where Spark outperforms Hadoop in processing.
Sensor Data Processing – Apache Spark’s ‘In-memory computing’ works best here, as data is
retrieved and combined from different sources.
Spark is preferred over Hadoop for real time querying of data
Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache
Spark is the best solution.

92. What is a Sparse Vector?


A sparse vector has two parallel arrays –one for indices and the other for values. These vectors
are used for storing non-zero entries to save space.
93. What is RDD?
RDDs (Resilient Distributed Datasets) are basic abstraction in Apache Spark that represent the
data coming into the system in object format. RDDs are used for in-memory computations on
large clusters, in a fault tolerant manner. RDDs are read-only, partitioned collections of records
that are –
Immutable – RDDs cannot be altered.
Resilient – If a node holding the partition fails the other node takes the data.
94. Explain about transformations and actions in the context of RDDs.
Transformations are lazily evaluated functions that produce a new RDD; they are executed only
when an action is called. Some examples of transformations include map, filter and reduceByKey.
Actions are the results of RDD computations or transformations. After an action is performed,
the data from RDD moves back to the local machine. Some examples of actions include reduce,
collect, first, and take.
95. What are the languages supported by Apache Spark for
developing big data applications?
Scala, Java, Python, R and Clojure
96. Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.
97. Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
98. Explain about the different cluster managers in Apache Spark
The 3 different clusters managers supported in Apache Spark are:
YARN
Apache Mesos -Has rich resource scheduling capabilities and is well suited to run Spark along
with other applications. It is advantageous when several users run interactive shells because it
scales down the CPU allocation between commands.
Standalone deployments – Well suited for new deployments which only run and are easy to set
up

99. How can Spark be connected to Apache Mesos?


To connect Spark with Mesos-
Configure the spark driver program to connect to Mesos. Spark binary package should be in a
location accessible by Mesos. (or)
Install Apache Spark in the same location as that of Apache Mesos and configure the property
‘spark.mesos.executor.home’ to point to the location where it is installed.
100. How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and
reliable manner. The various ways in which data transfers can be minimized when working with
Apache Spark are:
Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and
large RDDs.
Using Accumulators – Accumulators help update the values of variables in parallel while
executing.
The most common way is to avoid operations ByKey, repartition or any other operations which
trigger shuffles.
101. Why is there a need for broadcast variables when working with Apache Spark?
These are read only variables, present in-memory cache on every machine. When working with
Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every
task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the
memory which enhances the retrieval efficiency when compared to an RDD lookup ().
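A short sketch of a broadcast lookup table (assuming an active SparkContext named sc; the mapping is hardcoded for illustration):
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})
codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda c: country_lookup.value.get(c, "Unknown"))
print(names.collect())   # ['India', 'United States', 'India']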
102. Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate
service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or
Hadoop.
103. What is lineage graph?
The RDDs in Spark, depend on one or more other RDDs. The representation of dependencies in
between RDDs is known as the lineage graph. Lineage graph information is used to compute
each RDD on demand, so that whenever a part of persistent RDD is lost, the data that is lost can
be recovered using the lineage graph information.
104. How can you trigger automatic clean-ups in Spark to handle
accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long
running jobs into different batches and writing the intermediary results to the disk.
105. Explain about the major libraries that constitute the Spark Ecosystem
Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like
clustering, regression, classification, etc.
Spark Streaming – This library is used to process real time streaming data.
Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices,
subgraph, aggregateMessages, etc.
Spark SQL – Helps execute SQL like queries on Spark data using standard visualization or BI tools.
106. What is Piping?
Spark provides a pipe() method on RDDs. Spark’s pipe() lets us write parts of jobs using any
language we want as long as it can read and write to Unix standard streams. With pipe(), you can
write a transformation of an RDD that reads each RDD element from standard input as a String,
manipulates that String however you like, and then writes the result(s) as Strings to standard
output.Pipe operator in Spark, allows developer to process RDD data using external applications.
Sometimes in data analysis, we need to use an external library which may not be written using
Java/Scala
pipeRDD = dataRDD.pipe(scriptPath)
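For example (a sketch that assumes a Unix-like environment where the cat command is available; scriptPath above would normally point to your own script):
dataRDD = sc.parallelize(["one", "two", "three"])
piped = dataRDD.pipe("cat")     # each element is written to the command's stdin and read back from its stdout
print(piped.collect())          # ['one', 'two', 'three']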
107. What are the steps that occur when you run a Spark application on a cluster?
-The user submits an application using spark-submit.
-spark-submit launches the driver program and invokes the main() method specified by the user.
-The driver program contacts the cluster manager to ask for resources to launch executors.
-The cluster manager launches executors on behalf of the driver program.
-The driver process runs through the user application. Based on the RDD actions and
transformations in the program, the driver sends work to executors in the form of tasks.
-Tasks are run on executor processes to compute and save results.
-If the driver’s main() method exits or it calls SparkContext.stop(),it will terminate the executors
and release resources from the cluster manager.
108. What is a schema RDD/DataFrame?
A SchemaRDD is an RDD composed of Row objects with additional schema information of the
types in each column. Row objects are just wrappers around arrays of basic types (e.g., integers
and strings).
109. What are Row objects?
Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of
fields.Row objects have a number of getter functions to obtain the value of each field given its
index. The standard getter, get (or apply in Scala), takes a column number and returns an Object
type (or Any in Scala) that we are responsible for casting to the correct type. For Boolean, Byte,
Double, Float, Int, Long, Short, and String, there is a getType() method, which returns that type.
For example, getString(0) would return field 0 as a string.
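In PySpark specifically, Row objects behave like named tuples, so fields are accessed by attribute, name or index rather than through typed getters (a small sketch, assuming an active SparkSession named spark):
from pyspark.sql import Row
person = Row(name="Sachin", age=30)
print(person.name)        # access by attribute
print(person[1])          # access by position
df = spark.createDataFrame([person])
first_row = df.first()    # first() returns a Row
print(first_row["name"], first_row.age)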
110. Explain Spark Streaming Architecture?
Spark Streaming uses a ‘micro-batch’ architecture, where Spark Streaming receives data from
various input sources and groups it into small batches. New batches are created at regular time
intervals. At the beginning of each time interval a new batch is created, and any data that arrives
during that interval gets added to that batch. At the end of the time interval the batch is done
growing. The size of the time intervals is determined by a parameter called the batch interval.
Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The
processed results can then be pushed out to external systems in batches.
111. How Spark achieves fault tolerance?
Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to
achieve fault tolerance whereas Spark uses different data storage model, RDD. RDDs achieve
fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough
information to rebuild just that partition.This removes the need for replication to achieve fault
tolerance.
112. What is a Apache Spark Executor ?
When ‘SparkContext’ connects to a cluster manager, it acquires an ‘Executor’ on the cluster
nodes. ‘Executors’ are Spark processes that run computations and store the data on the worker
node. The final tasks by ‘SparkContext’ are transferred to executors. Spark executors are worker
processes responsible for running the individual tasks in a given Spark job. Executors are
launched once at the beginning of a Spark application and typically run for the entire lifetime of
an application, though Spark applications can continue if executors fail. Executors have two
roles. First, they run the tasks that make up the application and return results to the driver.
Second, they provide in-memory storage for RDDs that are cached by user programs, through a
service called the Block Manager that lives within each executor.
113. What is Checkpointing
Checkpointing is a process of truncating RDD lineage graph and saving it to a reliable distributed
(HDFS) or local file system.
There are two types of checkpointing:
reliable – in Spark (core), RDD checkpointing that saves the actual intermediate RDD data to a
reliable distributed file system, e.g. HDFS.
local – in Spark Streaming or GraphX – RDD checkpointing that truncates RDD lineage graph.
Before checkpointing is used, a Spark developer has to set the checkpoint directory using
SparkContext.setCheckpointDir(directory: String) method.
Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with
SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed.
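A minimal sketch (the checkpoint directory below is hypothetical and should be an HDFS path when running on a cluster):
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # hypothetical location
rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
rdd.checkpoint()                                 # mark the RDD for checkpointing
rdd.count()                                      # an action triggers the actual checkpoint
print(rdd.isCheckpointed())                      # True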
114. How to see number of partitions
rdd.getNumPartitions()
115. Explain MapPartition
mapPartitions transformation is faster than map since it calls your function once/partition, not
once/element.Whenever you have heavyweight initialization that should be done once for many
RDD elements rather than once per RDD element, and if this initialization, such as creation of
objects from a third-party library, cannot be serialized (so that Spark can transmit it across the
cluster to the worker nodes), use mapPartitions()
map works the function being utilized at a per element level while mapPartitions exercises the
function at the partition level.
Example Scenario : if we have 100K elements in a particular RDD partition then we will fire off the
function being used by the mapping transformation 100K times when we use map.
Conversely, if we use mapPartitions then we will only call the particular function one time, but we
will pass in all 100K records and get back all responses in one function call.The entire content of
the respective partitions is available as a sequential stream of values via the input argument
(Iterarator[T]). The custom function must return yet another Iterator[U]. The combined result
iterators are automatically converted into a new RDD.
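A small sketch contrasting the two (assuming an active SparkContext named sc):
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)        # 2 partitions

def sum_partition(iterator):
    # called once per partition; receives an iterator over all elements of that partition
    yield sum(iterator)

print(rdd.map(lambda x: x + 1).collect())          # the lambda runs once per element
print(rdd.mapPartitions(sum_partition).collect())  # [6, 15] - one value per partition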
116. What is cache and persist
cache is merely persist with the default storage level MEMORY_ONLY.
persist(MEMORY_ONLY) .Note that cache() is an alias for persist(StorageLevel.MEMORY_ONLY)
which may not be ideal for datasets larger than available cluster memory.
120. Explain Lambda
A lambda function is an anonymous function because it has no name, and it can contain only a
single expression. Python supports the creation of anonymous functions (i.e. functions that are not
bound to a name) at runtime, using a construct called lambda. Like def, lambda creates a function
to be called later, but it returns the function instead of assigning it to a name. This is why lambdas
are sometimes known as anonymous functions.
add = lambda num1, num2: num1 + num2
print(add(1, 3))
121. map vs flatMap
map :It returns a new RDD by applying a function to each element of the RDD. Function in map
can return only one item.
flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD,
but output is flattened.
Also, function in flatMap can return a list of elements (0 or more)
Example1:-
sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()
Output:
[[1, 2], [1, 2, 3], [1, 2, 3, 4]]
sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()
Output: notice o/p is flattened out in a single list
[1, 2, 1, 2, 3, 1, 2, 3, 4]
122. What are the list of Transformations in Apache Spark ?
Map
Filter
Flatmap
MapPartitions
MapPartitionsWithIndex
Sample
Union
Intersection
Distinct
Cartesian
Pipe
Coalesce
Repartition
RepartitionAndSortWithinPartitions
123. What are the list of Actions in Apache Spark ?
Reduce
Collect
Count
First
Take
TakeSample
TakeOrdered
SaveAsTextFile
CountByKey
ForEach
124. What are the list of Key-Value Pair RDDs in Apache Spark ?
GroupByKey
ReduceByKey
AggregateByKey
SortByKey
Join
CoGroup
125. What are the list of Input And Output functions in Apache Spark ?
WholeTextFile
Pickle Files
HadoopInputFormat
HadoopOutputFormat
126. What are the list of modules to increase the Performance in Apache Spark ?
Broadcast Variables
Accumulators
Using A Custom Accumulator
Partitioning
127. What is Apache Mesos
Apache Mesos is a general-purpose cluster manager that can run both analytics workloads and
long-running services (e.g., web applications or key/value stores) on a cluster. To use Spark on
Mesos, pass a mesos:// URI to spark-submit:
spark-submit --master mesos://masternode:5050 yourapp
128. What is a stage in Apache Spark?
A stage is a set of independent tasks all computing the same function that need to run as part of
a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run by
the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
DAGScheduler runs these stages in topological order.
Each Stage can either be a shuffle map stage, in which case its tasks’ results are input for another
stage, or a result stage, in which case its tasks directly compute the action that initiated a job
(e.g. count(), save(), etc). For shuffle map stages, we also track the nodes that each output
partition is on.
Each Stage also has a jobId, identifying the job that first submitted the stage. When FIFO
scheduling is used, this allows Stages from earlier jobs to be computed first or recovered faster
on failure.
129. What is coalesce() and repartition() Apache Spark?
Sometimes, we want to change the partitioning of an RDD outside the context of grouping and
aggregation operations. For those cases, Spark provides the repartition() function, which shuffles
the data across the network to create a new set of partitions. Keep in mind that repartitioning
your data is a fairly expensive operation. Spark also has an optimized version of repartition()
called coalesce() that allows avoiding data movement, but only if you are decreasing the number
of RDD partitions. To know whether you can safely call coalesce(), you can check the size of the
RDD using rdd.partitions.size() in Java/Scala and rdd.getNumPartitions() in Python and make sure
that you are coalescing.
coalesce() avoids a full shuffle. If it’s known that the number is decreasing then the executor can
safely keep data on the minimum number of partitions, only moving the data off the extra nodes,
onto the nodes that we kept.
So, it would go something like this:
Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12
Then coalesce down to 2 partitions:
Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)
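In code, that looks roughly like this (a sketch assuming an active SparkContext named sc):
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())                  # 4
more = rdd.repartition(8)                      # full shuffle; can increase or decrease the partition count
fewer = rdd.coalesce(2)                        # narrow dependency; only for decreasing the partition count
print(more.getNumPartitions(), fewer.getNumPartitions())   # 8 2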
130. What is sortByKey in Apache Spark ?
The sortByKey() function takes a parameter called ascending indicating whether we want it in
ascending order (it defaults to true).
rdd.sortByKey(ascending=True, numPartitions=None, keyfunc = lambda x: str(x))
131. What are the Actions Available on Pair RDDs in Apache Spark?
countByKey()
collectAsMap()
lookup(key)
132. What are the Operations That Benefit from Partitioning in Apache Spark ?
Many of Spark’s operations involve shuffling data by key across the network. All of these will
benefit from partitioning. As of Spark 1.0, the operations that benefit from partitioning are
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(),
combineByKey(), and lookup().
133. What are the Operations That Affect Partitioning in Apache Spark ?
All the operations that result in a partitioner being set on the output RDD: cogroup(),
groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(),
combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner),
flatMapValues() (if parent has a partitioner), and filter() (if parent has a partitioner). All other
operations will produce a result with no partitioner.
134. What are the Custom Partitioners in Apache Spark ?
To implement a custom partitioner, you need to subclass the org.apache.spark.Partitioner class
and implement three methods:
numPartitions: Int, which returns the number of partitions you will create.
getPartition(key: Any): Int, which returns the partition ID (0 to numPartitions-1) for a given key.
equals(), the standard Java equality method. This is important to implement because Spark will
need to test your Partitioner object against other instances of itself when it decides whether two
of your RDDs are partitioned the same way!
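The subclassing approach above applies to the Scala/Java API. In PySpark the same effect is usually achieved by passing a partitioning function to partitionBy (a sketch with a hardcoded pair RDD):
pairs = sc.parallelize([("apple", 1), ("banana", 2), ("avocado", 3)])
# send keys starting with 'a' to partition 0 and everything else to partition 1
partitioned = pairs.partitionBy(2, lambda key: 0 if key.startswith("a") else 1)
print(partitioned.glom().collect())   # [[('apple', 1), ('avocado', 3)], [('banana', 2)]] (order may vary)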
135. Explain addPyFile(path) in Apache Spark ?
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The
path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or
an HTTP, HTTPS or FTP URI.
136. What is applicationId in Apache Spark ?
A unique identifier for the Spark application. Its format depends on the scheduler
implementation.
In case of local spark app something like ‘local1433865536131’
In case of YARN something like ‘application_1433865536131_34483’
137. What are binaryFiles(path, minPartitions=None) in Apache Spark ?
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any
Hadoop-supported file system URI as a byte array. Each file is read as a single record and
returned in a key-value pair, where the key is the path of each file and the value is the content of
each file.
138. What is binaryRecords(path, recordLength) in Apache Spark ?
Load data from a flat binary file, assuming each record is a set of numbers with the specified
numerical format (see ByteBuffer), and the number of bytes per record is constant.
Parameters: path – Directory to the input data files
recordLength – The length at which to split the records
139. What are broadcast(value) in Apache Spark ?
Broadcast a readonly variable to the cluster, returning a
L{Broadcast<pyspark.broadcast.Broadcast>} object for reading it in distributed functions. The
variable will be sent to each cluster only once.
140. What is the function of cancelAllJobs() in Apache spark ?
Cancel all jobs that have been scheduled or are running.
141. What will cancelJobGroup(groupId) do in Apache spark ?
Cancel active jobs for the specified group. See SparkContext.setJobGroup for more information.
142. What is clearFiles() in Apache Spark ?
Clear the job’s list of files added by addFile or addPyFile so that they do not get downloaded to
any new nodes.
143. What is defaultMinPartitions in Apache Spark ?
Default min number of partitions for Hadoop RDDs when not given by user
144. What is defaultParallelism in Apache Spark ?
Default level of parallelism to use when not given by user (e.g. for reduce tasks)
145. What is dump_profiles(path) in Apache Spark ?
Dump the profile stats into directory path
146. What is emptyRDD() in Apache Spark
Create an RDD that has no partitions or elements.
147. What will getLocalProperty(key) do in Apache Spark ?
Get a local property set in this thread, or null if it is missing. See setLocalProperty
148. What is sequenceFile in Apache Spark ? How to access ?
sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None,
valueConverter=None, minSplits=None, batchSize=0)
Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file
system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as
follows:
1. A Java RDD is created from the SequenceFile or other InputFormat, and the
key and value Writable classes
2. Serialization is attempted via Pyrolite pickling
3. If this fails, the fallback is to call ‘toString’ on each key and value
4. PickleSerializer is used to deserialize pickled objects on the Python side
Parameters: path – path to sequence file
keyClass – fully qualified classname of key Writable class (e.g.
“org.apache.hadoop.io.Text”)
valueClass – fully qualified classname of value Writable class (e.g.
“org.apache.hadoop.io.LongWritable”)
keyConverter –
valueConverter –
minSplits – minimum splits in dataset (default min(2, sc.defaultParallelism))
batchSize – The number of Python objects represented as a single Java
object. (default 0, choose batchSize automatically)
149. What is setCheckpointDir(dirName) in Apache Spark ?
Set the directory under which RDDs are going to be checkpointed. The directory must be a HDFS
path if running on a cluster.
150. What is sparkUser() in Apache Spark ?
Get SPARK_USER for user who is running SparkContext.
>>> sc.sparkUser()
151. What is startTime in Apache Spark ?
Return the epoch time when the Spark Context was started.
>>> sc._jsc.startTime()
1510567393189L
152. What is statusTracker() in Apache Spark ?
Return StatusTracker object (Low-level status reporting APIs for monitoring job and stage
progress.)
>>> status = sc.statusTracker()
>>> print status
<pyspark.status.StatusTracker object at 0x7f954a5c1dd0>
153. What is stop() in Apache Spark ?
Shut down the SparkContext.
154. What is textFile(name, minPartitions=None, use_unicode=True) in Apache Spark ?
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported
file system URI, and return it as an RDD of Strings.
155. What is union(rdds) in Apache Spark ?
Build the union of a list of RDDs
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([5, 6, 7,4])
x = sc.union([rdd1,rdd2])
print x.collect()
1. What is a Snowflake data warehouse?
Snowflake is a data warehouse built on top of the Amazon Web Services or Microsoft Azure
cloud infrastructure. There’s no hardware or software to select, install, configure, or manage, so
it’s ideal for organizations that don’t want to dedicate resources for setup, maintenance, and
support of in-house servers.
2. Explain Database storage in snowflake ?
The database storage layer holds all data loaded into Snowflake, including structured and
semistructured data. Snowflake automatically manages all aspects of how the data is stored:
organization, file size, structure, compression, metadata, and statistics. This storage layer runs
independently of compute resources.
3. Explain Compute layer in snowflake ?
The compute layer is made up of virtual warehouses that execute data processing tasks required
for queries. Each virtual warehouse (or cluster) can access all the data in the storage layer, then
work independently, so the warehouses do not share, or compete for, compute resources. This
enables nondisruptive, automatic scaling, which means that while queries are running, compute
resources can scale without the need to redistribute or rebalance the data in the storage layer.
4. Explain the Snowflake benefits?
1. Performance and speed
2. Storage and support for structured and semistructured data
3. Concurrency and accessibility
4. Seamless data sharing
5. Availability and security
5. What is Seamless data sharing in snowflake?
Snowflake’s architecture enables data sharing among Snowflake users. It also allows
organizations to seamlessly share data with any data consumer — whether they are a Snowflake
customer or not — through reader accounts that can be created directly from the user interface.
This functionality allows the provider to create and manage a Snowflake account for a consumer.
6. How snowflake handles Concurrency and accessibility ?
Snowflake addresses concurrency issues with its unique multicluster architecture: Queries from
one virtual warehouse never affect the queries from another, and each virtual warehouse can
scale up or down as required. Data analysts and data scientists can get what they need, when
they need it, without waiting for other loading and processing tasks to complete.
7. What are the Supported Cloud Platforms in snowflake ?
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Microsoft Azure (Azure)
8. Explain Data Loading in snowflake ?
Snowflake supports loading data from files staged in any of the following locations, regardless of
the cloud platform for your Snowflake account:
Internal (i.e. Snowflake) stages
Amazon S3
Google Cloud Storage
Microsoft Azure blob storage
9. What are the different editions of Snowflake ?
Standard Edition
Enterprise Edition
Business Critical Edition
Virtual Private Snowflake (VPS)
10. Can we consider snowflake as an ETL tool?
Yes, if can Extracts data from the source and creates data files. Data files support multiple data
formats like JSON, CSV, XML, and more. Loads data to an internal or external stage. Data can be
staged in an internal, Microsoft Azure blob, Amazon S3 bucket, or Snowflake managed location.
Data is copied into a Snowflake database table using the COPY INTO command.
PySpark : from_utc_timestamp Function: A Detailed Guide

The from_utc_timestamp function in PySpark is a highly useful function that allows users to
convert UTC time to a specified timezone. This conversion can be essential when you're dealing
with data that spans different time zones. In this article, we’re going to deep dive into this
function, exploring its syntax, use-cases, and providing examples for a better understanding.
Syntax
The function from_utc_timestamp accepts two parameters:
1. The timestamp to convert from UTC.
2. The string that represents the timezone to convert to.
The syntax is as follows:

from pyspark.sql.functions import from_utc_timestamp


from_utc_timestamp(timestamp, tz)

Use-Case Scenario
Imagine you’re a data analyst working with a global company that receives sales data from
different regions around the world. The data you’re working with includes the timestamp of each
transaction, which is stored in UTC time. However, for your analysis, you need to convert these
timestamps into local times to get a more accurate picture of customer behaviors during their
local hours. Here, the from_utc_timestamp function comes into play.
Detailed Examples
First, let’s start by creating a PySpark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Learning @ Freshers.in from_utc_timestamp').getOrCreate()

Let’s assume we have a data frame with sales data, which includes a timestamp column with UTC
times. We’ll use hardcoded values for simplicity:

from pyspark.sql.functions import to_utc_timestamp, lit


from pyspark.sql.types import TimestampType
data = [("1", "2023-01-01 13:30:00"),
("2", "2023-02-01 14:00:00"),
("3", "2023-03-01 15:00:00")]
df = spark.createDataFrame(data, ["sale_id", "timestamp"])
# Cast the timestamp column to timestamp type
df = df.withColumn("timestamp", df["timestamp"].cast(TimestampType()))

Now, our data frame has a ‘timestamp’ column with UTC times. Let’s convert these to New York
time using the from_utc_timestamp function:

from pyspark.sql.functions import from_utc_timestamp


df = df.withColumn("NY_time", from_utc_timestamp(df["timestamp"], "America/New_York"))
df.show(truncate=False)

Output

+-------+-------------------+-------------------+
|sale_id|timestamp |NY_time |
+-------+-------------------+-------------------+
|1 |2023-01-01 13:30:00|2023-01-01 08:30:00|
|2 |2023-02-01 14:00:00|2023-02-01 09:00:00|
|3 |2023-03-01 15:00:00|2023-03-01 10:00:00|
+-------+-------------------+-------------------+

As you can see, the from_utc_timestamp function correctly converted our UTC times to New
York local times considering the time difference.
Remember that PySpark supports all timezones that are available in Python. To list all available
timezones, you can use pytz library:

import pytz
for tz in pytz.all_timezones:
print(tz)

PySpark : Fixing ‘TypeError: an integer is required (got type bytes)’ Error in PySpark with Spark
2.4.4

Apache Spark is an open-source distributed general-purpose cluster-computing framework.


PySpark is the Python library for Spark, and it provides an easy-to-use API for Spark
programming. However, sometimes, you might run into an error like TypeError: an integer is
required (got type bytes) when trying to use PySpark after installing Spark 2.4.4.
This issue is typically related to a Python version compatibility problem, especially if you are using
Python 3.7 or later versions. Fortunately, there’s a straightforward way to address it. This article
will guide you through the process of fixing this error so that you can run your PySpark
applications smoothly.
Let’s assume we’re trying to run the following simple PySpark code that reads a CSV file and
displays its content:

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("CSV Reader").getOrCreate()
data = spark.read.csv('sample.csv', inferSchema=True, header=True)
data.show()

OR with hardcoded values

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = SparkSession.builder.appName("DataFrame Creator").getOrCreate()
data = [("John", 1), ("Doe", 2)]
schema = StructType([
StructField("Name", StringType(), True),
StructField("ID", IntegerType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()

We will have this error message:

TypeError: an integer is required (got type bytes)

How to resolve
First you can try installing again
pip install --upgrade pyspark

The issue occurs due to a compatibility problem with Python 3.7 or later versions and PySpark
with Spark 2.4.4. PySpark uses an outdated method to check for a file type, which leads to this
TypeError.
A quick fix for this issue is to downgrade your Python version to 3.6. However, if you don’t want
to downgrade your Python version, you can apply a patch to PySpark’s codebase.
The patch involves modifying the pyspark/serializers.py file in your PySpark directory:
1. Open the pyspark/serializers.py file in a text editor. The exact path depends on your PySpark
installation.
2. Find the following function definition (around line 377):

def _read_with_length(stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        return None
    return stream.read(length)

3. Replace the return stream.read(length) line with the following code:

    result = stream.read(length)
    if length and not result:
        raise EOFError
    return result

4. Save and close the file.
This patch adds a check to ensure that the stream has not reached the end before attempting to
read from it, which is the cause of the TypeError.
Now, try running your PySpark code again. The error should be resolved, and you should be able
to run your PySpark application successfully.

PySpark : Converting Decimal to Integer in PySpark: A Detailed Guide

One of PySpark’s capabilities is the conversion of decimal values to integers. This conversion is
beneficial when you need to eliminate fractional parts of numbers for specific calculations or
simplify your data for particular analyses. PySpark allows for this conversion, and importantly,
treats NULL inputs to produce NULL outputs, preserving the integrity of your data.
In this article, we will walk you through a step-by-step guide to convert decimal values to integer
numbers in PySpark.
PySpark’s Integer Casting Function.
The conversion of decimal to integer in PySpark is facilitated using the cast function. The cast
function allows us to change the data type of a DataFrame column to another type. In our case,
we are changing a decimal type to an integer type.
Here’s the general syntax to convert a decimal column to integer:

from pyspark.sql.functions import col


df.withColumn("integer_column", col("decimal_column").cast("integer"))

In the above code:
df is your DataFrame.
integer_column is the new column with integer values.
decimal_column is the column you want to convert from decimal to integer.
Now, let’s illustrate this process with a practical example. We will first initialize a PySpark session
and create a DataFrame:

from pyspark.sql import SparkSession


from pyspark.sql.functions import col
spark = SparkSession.builder.appName("DecimalToIntegers").getOrCreate()
data = [("Sachin", 10.5), ("Ram", 20.8), ("Vinu", 30.3), (None, None)]
df = spark.createDataFrame(data, ["Name", "Score"])
df.show()


+------+-----+
| Name|Score|
+------+-----+
|Sachin| 10.5|
| Ram| 20.8|
| Vinu| 30.3|
| null| null|
+------+-----+

Let’s convert the ‘Score’ column to integer:

df = df.withColumn("Score", col("Score").cast("integer"))
df.show()


+------+-----+
| Name|Score|
+------+-----+
|Sachin| 10|
| Ram| 20|
| Vinu| 30|
| null| null|
+------+-----+

The ‘Score’ column values are now converted into integers. The decimal parts have been
truncated, and not rounded. Also, observe how the NULL value remained NULL after the
conversion.
PySpark’s flexible and powerful data manipulation functions, like cast, make it a highly capable
tool for data analysis.

PySpark : A Comprehensive Guide to Converting Expressions to Fixed-Point Numbers in PySpark
Among PySpark’s numerous features, one that stands out is its ability to convert input
expressions into fixed-point numbers. This feature comes in handy when dealing with data that
requires a high level of precision or when we want to control the decimal places of numbers to
maintain consistency across datasets.
In this article, we will walk you through a detailed explanation of how to convert input
expressions to fixed-point numbers using PySpark. Note that PySpark’s fixed-point function,
when given a NULL input, will output NULL.
Understanding Fixed-Point Numbers
Before we get started, it’s essential to understand what fixed-point numbers are. A fixed-point
number has a specific number of digits before and after the decimal point. Unlike floating-point
numbers, where the decimal point can ‘float’, in fixed-point numbers, the decimal point is ‘fixed’.
PySpark’s Fixed-Point Function
PySpark uses the cast function combined with the DecimalType function to convert an
expression to a fixed-point number. DecimalType allows you to specify the total number of digits
as well as the number of digits after the decimal point.
Here is the syntax for converting an expression to a fixed-point number:

from pyspark.sql.functions import col


from pyspark.sql.types import DecimalType
df.withColumn("fixed_point_column", col("input_column").cast(DecimalType(precision, scale)))

In the above code:
df is the DataFrame.
fixed_point_column is the new column with the fixed-point number.
input_column is the column you want to convert.
precision is the total number of digits.
scale is the number of digits after the decimal point.
A Practical Example
Let’s work through an example to demonstrate this.
Firstly, let’s initialize a PySpark session and create a DataFrame:

from pyspark.sql import SparkSession


from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
spark = SparkSession.builder.appName("FixedPointNumbers").getOrCreate()
data = [("Sachin", 10.123456), ("James", 20.987654), ("Smitha ", 30.111111), (None, None)]
df = spark.createDataFrame(data, ["Name", "Score"])
df.show()


+-------+---------+
| Name| Score|
+-------+---------+
| Sachin|10.123456|
| James|20.987654|
|Smitha |30.111111|
| null| null|
+-------+---------+

Next, let’s convert the ‘Score’ column to a fixed-point number with a total of 5 digits, 2 of which
are after the decimal point:

df = df.withColumn("Score", col("Score").cast(DecimalType(5, 2)))


df.show()


+-------+-----+
| Name|Score|
+-------+-----+
| Sachin|10.12|
| James|20.99|
|Smitha |30.11|
| null| null|
+-------+-----+

The score column values are now converted into fixed-point numbers. Notice how the NULL
value remained NULL after the conversion, which adheres to PySpark’s rule of NULL input
leading to NULL output.
PySpark : Skipping Sundays in Date Computations

When working with data in fields such as finance or certain business operations, it’s often the
case that weekends or specific days of the week, such as Sundays, are considered non-working
days or holidays. In these situations, you might need to compute the next business day from a
given date or timestamp, excluding these non-working days. This article will walk you through
the process of accomplishing this task using PySpark, the Python library for Apache Spark. We’ll
provide a detailed example to ensure a clear understanding of this operation.
Setting Up the Environment
Firstly, we need to set up our PySpark environment. Assuming you’ve properly installed Spark
and PySpark, you can initialize a SparkSession as follows:

from pyspark.sql import SparkSession


spark = SparkSession.builder \
.appName("Freshers.in Learning @ Skipping Sundays in Date Computations") \
.getOrCreate()

Understanding date_add, date_format Functions and Conditional Statements
The functions we’ll be using in this tutorial are PySpark’s built-
in date_add and date_format functions, along with the when function for conditional logic.
The date_add function adds a number of days to a date or timestamp, while
the date_format function converts a date or timestamp to a string based on a given format. The
when function allows us to create a new column based on conditional logic.
Creating a DataFrame with Timestamps
Let’s start by creating a DataFrame that contains some sample timestamps:

from pyspark.sql import functions as F


from pyspark.sql.types import TimestampType
data = [("2023-01-14 13:45:30",), ("2023-02-25 08:20:00",), ("2023-07-07 22:15:00",), ("2023-07-08
22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)


+-------------------+
|Timestamp |
+-------------------+
|2023-01-14 13:45:30|
|2023-02-25 08:20:00|
|2023-07-07 22:15:00|
|2023-07-08 22:15:00|
+-------------------+

Getting the Next Day Excluding Sundays
To get the next day from each timestamp, excluding Sundays, we first use the date_add function
to compute the next day. Then we use date_format to get the day of the week. If this day is a
Sunday, we use date_add again to get the following day:

df = df.withColumn("Next_Day", F.date_add(F.col("Timestamp"), 1))


df = df.withColumn("Next_Day",
F.when(F.date_format(F.col("Next_Day"), "EEEE") == "Sunday",
F.date_add(F.col("Next_Day"), 1))
.otherwise(F.col("Next_Day")))
df.show(truncate=False)

Result 

+-------------------+----------+
|Timestamp |Next_Day |
+-------------------+----------+
|2023-01-14 13:45:30|2023-01-16|
|2023-02-25 08:20:00|2023-02-27|
|2023-07-07 22:15:00|2023-07-08|
|2023-07-08 22:15:00|2023-07-10|
+-------------------+----------+

In the Next_Day column, you’ll see that if the next day would have been a Sunday, it has been
replaced with the following Monday.
The use of date_add, date_format, and conditional logic with when function enables us to easily
compute the next business day from a given date or timestamp, while excluding non-working
days like Sundays.
PySpark : Getting the Next and Previous Day from a Timestamp

In data processing and analysis, there can often arise situations where you might need to
compute the next day or the previous day from a given date or timestamp. This article will guide
you through the process of accomplishing these tasks using PySpark, the Python library for
Apache Spark. Detailed examples will be provided to ensure a clear understanding of these
operations.
Setting Up the Environment
Firstly, we need to set up our PySpark environment. Assuming you have properly installed Spark
and PySpark, you can initialize a SparkSession as follows:

from pyspark.sql import SparkSession


spark = SparkSession.builder \
.appName("Freshers.in Learning @ Next Day and Previous Day") \
.getOrCreate()

Creating a DataFrame with Timestamps
Let’s start by creating a DataFrame containing some sample timestamps:

from pyspark.sql import functions as F


from pyspark.sql.types import TimestampType
data = [("2023-01-15 13:45:30",), ("2023-02-22 08:20:00",), ("2023-07-07 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)


+-------------------+
|Timestamp |
+-------------------+
|2023-01-15 13:45:30|
|2023-02-22 08:20:00|
|2023-07-07 22:15:00|
+-------------------+

Getting the Next Day
To get the next day from each timestamp, we use the date_add function, passing in the
timestamp column and the number 1 to indicate that we want to add one day:

df.withColumn("Next_Day", F.date_add(F.col("Timestamp"), 1)).show(truncate=False)


+-------------------+----------+
|Timestamp |Next_Day |
+-------------------+----------+
|2023-01-15 13:45:30|2023-01-16|
|2023-02-22 08:20:00|2023-02-23|
|2023-07-07 22:15:00|2023-07-08|
+-------------------+----------+

The Next_Day column shows the date of the day after each timestamp.
Getting the Previous Day
To get the previous day, we use the date_sub function, again passing in the timestamp column
and the number 1 to indicate that we want to subtract one day:

df.withColumn("Previous_Day", F.date_sub(F.col("Timestamp"), 1)).show(truncate=False)


+-------------------+------------+
|Timestamp |Previous_Day|
+-------------------+------------+
|2023-01-15 13:45:30|2023-01-14 |
|2023-02-22 08:20:00|2023-02-21 |
|2023-07-07 22:15:00|2023-07-06 |
+-------------------+------------+

The Previous_Day column shows the date of the day before each timestamp.
PySpark provides simple yet powerful functions for manipulating dates and timestamps. The
date_add and date_sub functions allow us to easily compute the next day and previous day from
a given date or timestamp
PySpark : Determining the Last Day of the Month and Year from a Timestamp

Working with dates and times is a common operation in data processing. Sometimes, it’s
necessary to compute the last day of a month or year based on a given date or timestamp. This
article will guide you through how to accomplish these tasks using PySpark, the Python library for
Apache Spark, with examples to enhance your understanding.
Setting up the Environment
Firstly, it’s important to set up our PySpark environment. Assuming you’ve installed Spark and
PySpark correctly, you can initialize a SparkSession as follows:

from pyspark.sql import SparkSession


spark = SparkSession.builder \
.appName("freshers.in Learning : Date and Time Operations") \
.getOrCreate()

Understanding last_day and year Functions
The functions we’ll be utilizing in this tutorial are PySpark’s built-in last_day and year functions.
The last_day function takes a date column and returns the last day of the month. The year
function returns the year of a date as a number.
Getting the Last Day of the Month
To demonstrate, let’s create a DataFrame with some sample timestamps:
data = [("2023-01-15 13:45:30",), ("2023-02-22 08:20:00",), ("2023-07-07 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)


+-------------------+
|Timestamp |
+-------------------+
|2023-01-15 13:45:30|
|2023-02-22 08:20:00|
|2023-07-07 22:15:00|
+-------------------+

Now, we can use the last_day function to get the last day of the month for each timestamp:

df.withColumn("Last_Day_of_Month", F.last_day(F.col("Timestamp"))).show(truncate=False)


+-------------------+-----------------+
|Timestamp |Last_Day_of_Month|
+-------------------+-----------------+
|2023-01-15 13:45:30|2023-01-31 |
|2023-02-22 08:20:00|2023-02-28 |
|2023-07-07 22:15:00|2023-07-31 |
+-------------------+-----------------+

The new Last_Day_of_Month column shows the last day of the month for each corresponding
timestamp.
Getting the Last Day of the Year
Determining the last day of the year is slightly more complex, as there isn’t a built-in function for
this in PySpark. However, we can accomplish it by combining the year function with some string
manipulation. Here’s how:

df.withColumn("Year", F.year(F.col("Timestamp")))\
.withColumn("Last_Day_of_Year", F.expr("make_date(Year, 12, 31)"))\
.show(truncate=False)

In the code above, we first extract the year from the timestamp using the year function. Then, we
construct a new date representing the last day of that year using the make_date function. The
make_date function creates a date from the year, month, and day values.
PySpark’s last_day function makes it straightforward to determine the last day of the month for
a given date or timestamp, finding the last day of the year requires a bit more creativity. By
combining the year and make_date functions, however, you can achieve this with relative ease.

+-------------------+----+----------------+
|Timestamp |Year|Last_Day_of_Year|
+-------------------+----+----------------+
|2023-01-15 13:45:30|2023|2023-12-31 |
|2023-02-22 08:20:00|2023|2023-12-31 |
|2023-07-07 22:15:00|2023|2023-12-31 |
+-------------------+----+----------------+

PySpark : Adding and Subtracting Months to a Date or Timestamp while Preserving End-of-
Month Information

This article will explain how to add or subtract a specific number of months from a date or
timestamp while preserving end-of-month information. This is especially useful when dealing
with financial, retail, or similar data, where preserving the end-of-month status of a date is
critical.
Setting up the Environment
Before we begin, we must set up our PySpark environment. Assuming you’ve installed Spark and
PySpark properly, you should be able to initialize a SparkSession as follows:

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("freshers.in Learning Adding and Subtracting Months
").getOrCreate()

Understanding add_months and date_add Functions
We will utilize PySpark’s built-in functions add_months and date_add or date_sub for our
operations. The add_months function adds a specified number of months to a date, and if the
original date was the last day of the month, the resulting date will also be the last day of the new
month.
The date_add or date_sub function, on the other hand, adds or subtracts a certain number of
days from a date, which is not ideal for preserving end-of-month information.
Using add_months Function
To demonstrate, let’s create a DataFrame with some sample dates:

from pyspark.sql import functions as F


from pyspark.sql.types import DateType
data = [("2023-01-31",), ("2023-02-28",), ("2023-07-15",)]
df = spark.createDataFrame(data, ["Date"])
df = df.withColumn("Date", F.col("Date").cast(DateType()))
df.show()

+----------+
| Date|
+----------+
|2023-01-31|
|2023-02-28|
|2023-07-15|
+----------+

Now, we will add two months to each date using add_months:

df.withColumn("New_Date", F.add_months(F.col("Date"), 2)).show()


+----------+----------+
| Date| New_Date|
+----------+----------+
|2023-01-31|2023-03-31|
|2023-02-28|2023-04-28|
|2023-07-15|2023-09-15|
+----------+----------+

Note how the dates originally at the end of a month are still at the end of the month in
the New_Date column.
Subtracting Months
Subtracting months is as simple as adding months. We simply use a negative number as the
second parameter to the add_months function:

df.withColumn("New_Date", F.add_months(F.col("Date"), -2)).show()


+----------+----------+
| Date| New_Date|
+----------+----------+
|2023-01-31|2022-11-30|
|2023-02-28|2022-12-28|
|2023-07-15|2023-05-15|
+----------+----------+

Adding or Subtracting Months to a Timestamp
To work with timestamps instead of dates, we need to cast our column to a TimestampType.
Let’s create a new DataFrame to demonstrate:

from pyspark.sql.types import TimestampType


data = [("2023-01-31 13:45:30",), ("2023-02-28 08:20:00",), ("2023-07-15 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"])
df = df.withColumn("Timestamp", F.col("Timestamp").cast(TimestampType()))
df.show(truncate=False)


+-------------------+
|Timestamp |
+-------------------+
|2023-01-31 13:45:30|
|2023-02-28 08:20:00|
|2023-07-15 22:15:00|
+-------------------+

Then, we can add or subtract months as before:

df.withColumn("New_Timestamp", F.add_months(F.col("Timestamp"), 2)).show(truncate=False)


df.withColumn("New_Timestamp", F.add_months(F.col("Timestamp"), -
2)).show(truncate=False)


+-------------------+-------------+
|Timestamp |New_Timestamp|
+-------------------+-------------+
|2023-01-31 13:45:30|2023-03-31 |
|2023-02-28 08:20:00|2023-04-28 |
|2023-07-15 22:15:00|2023-09-15 |
+-------------------+-------------+


+-------------------+-------------+
|Timestamp |New_Timestamp|
+-------------------+-------------+
|2023-01-31 13:45:30|2022-11-30 |
|2023-02-28 08:20:00|2022-12-28 |
|2023-07-15 22:15:00|2023-05-15 |
+-------------------+-------------+

PySpark's built-in add_months function provides a straightforward way to add or subtract a specified number of months from dates and timestamps, clamping the day of month when the target month is shorter. Keep in mind that it returns a date, so the time-of-day portion of a timestamp column is dropped, as the New_Timestamp column above shows.
PySpark : Understanding Joins in PySpark using DataFrame API

Apache Spark, a fast and general-purpose cluster computing system, provides high-level APIs in
various programming languages like Java, Scala, Python, and R, along with an optimized engine
supporting general computation graphs. One of the many powerful functionalities that PySpark
provides is the ability to perform various types of join operations on datasets.
This article will explore how to perform the following types of join operations in PySpark using
the DataFrame API:
 Inner Join
 Left Join
 Right Join
 Full Outer Join
 Left Semi Join
 Left Anti Join
 Joins with Multiple Conditions
To illustrate these join operations, we will use two sample data frames –
‘freshers_personal_details’ and ‘freshers_academic_details’.
Sample Data

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('JoinExample').getOrCreate()
freshers_personal_details = spark.createDataFrame([
('1', 'Sachin', 'New York'),
('2', 'Shekar', 'Bangalore'),
('3', 'Antony', 'Chicago'),
('4', 'Sharat', 'Delhi'),
('5', 'Vijay', 'London'),
], ['Id', 'Name', 'City'])
freshers_academic_details = spark.createDataFrame([
('1', 'Computer Science', 'MIT', '3.8'),
('2', 'Electrical Engineering', 'Stanford', '3.5'),
('3', 'Physics', 'Princeton', '3.9'),
('6', 'Mathematics', 'Harvard', '3.7'),
('7', 'Chemistry', 'Yale', '3.6'),
], ['Id', 'Major', 'University', 'GPA'])

We have ‘Id’ as a common column between the two data frames which we will use as a key for
joining.
Inner Join
The inner join in PySpark returns rows from both data frames where key records of the first data
frame match the key records of the second data frame.

inner_join_df = freshers_personal_details.join(freshers_academic_details, on=['Id'], how='inner')


inner_join_df.show()

Output

+---+------+---------+--------------------+----------+---+
| Id| Name| City| Major|University|GPA|
+---+------+---------+--------------------+----------+---+
| 1|Sachin| New York| Computer Science| MIT|3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford|3.5|
| 3|Antony| Chicago| Physics| Princeton|3.9|
+---+------+---------+--------------------+----------+---+

Left Join (Left Outer Join)
The left join in PySpark returns all rows from the first data frame along with the matching rows
from the second data frame. If there is no match, the result is NULL on the right side.

left_join_df = freshers_personal_details.join(freshers_academic_details, on=['Id'], how='left')


left_join_df.show()

Output

+---+------+---------+--------------------+----------+----+
| Id| Name| City| Major|University| GPA|
+---+------+---------+--------------------+----------+----+
| 1|Sachin| New York| Computer Science| MIT| 3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford| 3.5|
| 3|Antony| Chicago| Physics| Princeton| 3.9|
| 5| Vijay| London| null| null|null|
| 4|Sharat| Delhi| null| null|null|
+---+------+---------+--------------------+----------+----+

Right Join (Right Outer Join)
The right join in PySpark returns all rows from the second data frame and the matching rows
from the first data frame. If there is no match, the result is NULL on the left side.

right_join_df = freshers_personal_details.join(freshers_academic_details, on=['Id'], how='right')


right_join_df.show()

Output

+---+------+---------+--------------------+----------+---+
| Id| Name| City| Major|University|GPA|
+---+------+---------+--------------------+----------+---+
| 1|Sachin| New York| Computer Science| MIT|3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford|3.5|
| 7| null| null| Chemistry| Yale|3.6|
| 3|Antony| Chicago| Physics| Princeton|3.9|
| 6| null| null| Mathematics| Harvard|3.7|
+---+------+---------+--------------------+----------+---+

Full Outer Join
The full outer join in PySpark returns all rows from both data frames where there is a match in
either of the data frames.

full_outer_join_df = freshers_personal_details.join(freshers_academic_details, on=['Id'], how='outer')
full_outer_join_df.show()

Output

+---+------+---------+--------------------+----------+----+
| Id| Name| City| Major|University| GPA|
+---+------+---------+--------------------+----------+----+
| 1|Sachin| New York| Computer Science| MIT| 3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford| 3.5|
| 3|Antony| Chicago| Physics| Princeton| 3.9|
| 4|Sharat| Delhi| null| null|null|
| 5| Vijay| London| null| null|null|
| 6| null| null| Mathematics| Harvard| 3.7|
| 7| null| null| Chemistry| Yale| 3.6|
+---+------+---------+--------------------+----------+----+

Left Semi Join
The left semi join in PySpark returns all the rows from the first data frame where there is a match
in the second data frame on the key.
left_semi_join_df = freshers_personal_details.join(freshers_academic_details, on=['Id'], how='leftsemi')
left_semi_join_df.show()


+---+------+---------+
| Id| Name| City|
+---+------+---------+
| 1|Sachin| New York|
| 2|Shekar|Bangalore|
| 3|Antony| Chicago|
+---+------+---------+

Left Anti Join
The left anti join in PySpark returns all the rows from the first data frame where there is no match
in the second data frame on the key.

left_anti_join_df = freshers_personal_details.join(freshers_academic_details, on=['Id'], how='leftanti')
left_anti_join_df.show()

Output

+---+------+------+
| Id| Name| City|
+---+------+------+
| 5| Vijay|London|
| 4|Sharat| Delhi|
+---+------+------+

Joins with Multiple Conditions
In PySpark, we can also perform join operations based on multiple conditions.

freshers_additional_details = spark.createDataFrame([
('1', 'Sachin', 'Python'),
('2', 'Shekar', 'Java'),
('3', 'Sanjo', 'C++'),
('6', 'Rakesh', 'Scala'),
('7', 'Sorya', 'JavaScript'),
], ['Id', 'Name', 'Programming_Language'])
# Perform inner join based on multiple conditions
multi_condition_join_df = freshers_personal_details.join(
    freshers_additional_details,
    (freshers_personal_details['Id'] == freshers_additional_details['Id']) &
    (freshers_personal_details['Name'] == freshers_additional_details['Name']),
    how='inner'
)
multi_condition_join_df.show()

Output

+---+------+---------+---+------+--------------------+
| Id| Name| City| Id| Name|Programming_Language|
+---+------+---------+---+------+--------------------+
| 1|Sachin| New York| 1|Sachin| Python|
| 2|Shekar|Bangalore| 2|Shekar| Java|
+---+------+---------+---+------+--------------------+

Note: when working with larger datasets, the choice of join type and the order of operations can have a significant impact on the performance of the Spark application. For example, broadcasting the smaller DataFrame avoids shuffling the larger one, as sketched below.
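A minimal sketch of a broadcast join hint, reusing the DataFrames above (whether broadcasting helps depends on the actual table sizes, which is an assumption to verify for your cluster):

from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the smaller DataFrame so the larger one is not shuffled
broadcast_join_df = freshers_personal_details.join(
    broadcast(freshers_academic_details), on=['Id'], how='inner'
)
broadcast_join_df.show()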

PySpark : Reversing the order of lists in a dataframe column using PySpark

pyspark.sql.functions.reverse
Collection function: returns a reversed string or an array with reverse order of elements.
In order to reverse the order of lists in a dataframe column, we can use the PySpark function
reverse() from pyspark.sql.functions. Here’s an example.
Let’s start by creating a sample dataframe with a list of strings.

from pyspark.sql import SparkSession


from pyspark.sql.functions import reverse
spark = SparkSession.builder.getOrCreate()
#Create a sample data
data = [("Sachin", ["Python", "C", "Go"]),
("Renjith", ["RedShift", "Snowflake", "Oracle"]),
("Ahamed", ["Android", "MacOS", "Windows"])]
#Create DataFrame
df = spark.createDataFrame(data, ["Name", "Techstack"])
df.show()

Output

+-------+--------------------+
| Name| Techstack|
+-------+--------------------+
| Sachin| [Python, C, Go]|
|Renjith|[RedShift, Snowfl...|
| Ahamed|[Android, MacOS, ...|
+-------+--------------------+

Now, we can apply the reverse() function to the “Techstack” column to reverse the order of the
list.

df_reversed = df.withColumn("Techstack", reverse(df["Techstack"]))


df_reversed.show()

Output

+-------+--------------------+
| Name| Techstack|
+-------+--------------------+
| Sachin| [Go, C, Python]|
|Renjith|[Oracle, Snowflak...|
| Ahamed|[Windows, MacOS, ...|
+-------+--------------------+

As you can see, the order of the elements in each list in the "Techstack" column has been reversed. The withColumn() function adds a new column or replaces an existing column with the same name; here it replaces the "Techstack" column with a version in which each list is reversed.
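Since reverse() also works on plain string columns, the same function can flip the characters of the Name column; a minimal sketch with the same df:

from pyspark.sql.functions import reverse

# reverse() on a StringType column reverses its characters, e.g. 'Sachin' -> 'nihcaS'
df.withColumn("Name_reversed", reverse(df["Name"])).show()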
PySpark : Reversing the order of strings in a list using PySpark

Lets create a sample data in the form of a list of strings.

from pyspark import SparkContext, SparkConf


from pyspark.sql import SparkSession
conf = SparkConf().setAppName('Reverse String @ Freshers.in Learning')
sc = SparkContext.getOrCreate(conf)
spark = SparkSession(sc)
# Sample data
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']
# Parallelize the data with Spark
rdd = sc.parallelize(data)

Now, we can apply a map operation on this RDD (Resilient Distributed Datasets, the fundamental
data structure of Spark). The map operation applies a given function to each element of the RDD
and returns a new RDD.
We will use the built-in Python function reversed() inside a map operation to reverse each string. reversed() returns a reverse iterator, so we join it back into a string with ''.join().

# Apply map operation to reverse the strings


reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))

The lambda function here is a simple anonymous function that takes one argument, x, and
returns the reversed string. x is each element of the RDD (each string in this case).
After this operation, we have a new RDD where each string from the original RDD has been
reversed. You can collect the results back to the driver program using the collect() action.

# Collect the results


reversed_data = reversed_rdd.collect()

# Print the reversed strings


for word in reversed_data:
    print(word)

As you can see, the order of characters in each string from the list has been reversed. Note that
Spark operations are lazily evaluated, meaning the actual computations (like reversing the
strings) only happen when an action (like collect()) is called. This feature allows Spark to optimize
the overall data processing workflow.
Complete code

from pyspark import SparkContext, SparkConf


from pyspark.sql import SparkSession
conf = SparkConf().setAppName('Reverse String @ Freshers.in Learning')
sc = SparkContext.getOrCreate(conf)
spark = SparkSession(sc)
#Sample data for testing
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']
#Parallelize the data with Spark
rdd = sc.parallelize(data)
reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))
#Collect the results
reversed_data = reversed_rdd.collect()
#Print the reversed strings
for word in reversed_data:
    print(word)

PySpark : Generating a 64-bit hash value in PySpark

Introduction to 64-bit Hashing


A hash function is a function that can be used to map data of arbitrary size to fixed-size values.
The values returned by a hash function are called hash codes, hash values, or simply hashes.
When we say a hash value is a “signed 64-bit” value, it means the hash function outputs a 64-bit
integer that can represent both positive and negative numbers. In computing, a 64-bit integer
can represent a vast range of numbers, from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807.
A 64-bit hash function can be useful in a variety of scenarios, particularly when working with large
data sets. It can be used for quickly comparing complex data structures, indexing data, and
checking data integrity.
Use of 64-bit Hashing in PySpark
PySpark's built-in hash() function returns a 32-bit integer hash, and Spark 3.0+ also offers xxhash64() for a signed 64-bit hash (a sketch of it appears after the example below). Another option is the MurmurHash3 implementation from Python's mmh3 library, whose hash64 helper returns a pair of signed 64-bit integers; the example below uses the first value of that pair. You can install the library using pip:

pip install mmh3

Here is an example of how to generate a 64-bit hash value in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
import mmh3

#Create a Spark session


spark = SparkSession.builder.appName("freshers.in Learning for 64-bit Hashing in PySpark").getOrCreate()

#Creating sample data


data = [("Sachin",), ("Ramesh",), ("Babu",)]
df = spark.createDataFrame(data, ["Name"])

#Function to generate 64-bit hash


def hash_64(input):
    return mmh3.hash64(input.encode('utf-8'))[0]
#Create a UDF for the 64-bit hash function
hash_64_udf = udf(lambda z: hash_64(z), LongType())

#Apply the UDF to the DataFrame


df_hashed = df.withColumn("Name_hashed", hash_64_udf(df['Name']))

#Show the DataFrame


df_hashed.show()

In this example, we create a Spark session and a DataFrame df with a single column “Name”.
Then, we define the function hash_64 to generate a 64-bit hash of an input string. After that, we
create a user-defined function (UDF) hash_64_udf using PySpark SQL functions. Finally, we apply
this UDF to the column “Name” in the DataFrame df and create a new DataFrame df_hashed
with the 64-bit hashed values of the names.
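If you are on Spark 3.0 or later, the built-in xxhash64 function returns a signed 64-bit hash directly, with no UDF or external library; a minimal sketch against the same df (note that xxhash64 uses a different algorithm than MurmurHash3, so the values will differ):

from pyspark.sql import functions as F

# Built-in 64-bit hash (Spark 3.0+); returns a signed LongType column
df_xx = df.withColumn("Name_xxhash64", F.xxhash64(F.col("Name")))
df_xx.show(truncate=False)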
Advantages and Drawbacks of 64-bit Hashing
Advantages:
1. Large Range: A 64-bit hash value has a very large range of possible values, which can help
reduce hash collisions (different inputs producing the same hash output).
2. Fast Comparison and Lookup: Hashing can turn time-consuming operations such as string
comparison into a simple integer comparison, which can significantly speed up certain
operations like data lookups.
3. Data Integrity Checks: Hash values can provide a quick way to check if data has been
altered.
Drawbacks:
1. Collisions: While the possibility is reduced, hash collisions can still occur where different
inputs produce the same hash output.
2. Not for Security: A hash value is not meant for security purposes. It can be reverse-
engineered to get the original input.
3. Data Loss: Hashing is a one-way function. Once data is hashed, it cannot be converted back
to the original input.
Introduction to MD5 Hash
MD5 (Message Digest Algorithm 5) is a widely used cryptographic hash function that produces a
128-bit (16-byte) hash value. It is commonly used to check the integrity of files. However, MD5 is not collision-resistant: different inputs that hash to the same output can be constructed in practice, which makes it unsuitable for security-sensitive uses such as SSL certificates or digital signatures.
An MD5 hash is typically expressed as a 32-digit hexadecimal number.
Use of MD5 Hash in PySpark
You can generate a 32-character hex-encoded string containing the 128-bit MD5 message digest in PySpark either with the built-in md5 function (pyspark.sql.functions.md5) or by wrapping Python's hashlib in a User Defined Function (UDF). The UDF approach is shown first; a sketch of the built-in function follows the output.
Here is how you can create an MD5 hash of a certain string column in PySpark.

from pyspark.sql import SparkSession


from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import hashlib
#Create a Spark session
spark = SparkSession.builder.appName("freshers.in Learning MD5 hash ").getOrCreate()

#Creating sample data


data = [("Sachin",), ("Ramesh",), ("Krishna",)]
df = spark.createDataFrame(data, ["Name"])

#Function for generating MD5 hash


def md5_hash(input):
    return hashlib.md5(input.encode('utf-8')).hexdigest()

#UDF for the MD5 function


md5_udf = udf(lambda z: md5_hash(z), StringType())

#Apply the above UDF to the DataFrame


df_hashed = df.withColumn("Name_hashed", md5_udf(df['Name']))

df_hashed.show(20,False)

In this example, we first create a Spark session and a DataFrame df with a single column “Name”.
Then, we define the function md5_hash to generate an MD5 hash of an input string. After that,
we create a user-defined function (UDF) md5_udf using PySpark SQL functions. Finally, we apply
this UDF to the column “Name” in the DataFrame df and create a new DataFrame df_hashed
with the MD5 hashed values of the names.
Output

+----+--------------------------------+
|Name|Name_hashed |
+----+--------------------------------+
|John|61409aa1fd47d4a5332de23cbf59a36f|
|Jane|2b95993380f8be6bd4bd46bf44f98db9|
|Mike|1b83d5da74032b6a750ef12210642eea|
+----+--------------------------------+
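For reference, the built-in md5 function mentioned earlier can produce the same kind of digest without a UDF; a minimal sketch, assuming the same df as above:

from pyspark.sql import functions as F

# Built-in alternative: md5 operates on binary input, so cast the string column first
df_builtin = df.withColumn("Name_hashed", F.md5(F.col("Name").cast("binary")))
df_builtin.show(20, False)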

PySpark : Introduction to BASE64_ENCODE and its Applications in PySpark



Introduction to BASE64_ENCODE and its Applications in PySpark


BASE64 is a group of similar binary-to-text encoding schemes that represent binary data in an
ASCII string format by translating it into a radix-64 representation. It is designed to carry data
stored in binary formats across channels that are designed to deal with text. This ensures that
the data remains intact without any modification during transport.
BASE64_ENCODE is a function used to encode data into this base64 format.
Where is BASE64_ENCODE used?
Base64 encoding schemes are commonly used when there is a need to encode binary data,
especially when that data needs to be stored or sent over media that are designed to deal with
text. This encoding helps to ensure that the data remains intact without modification during
transport.
Base64 is used commonly in a number of applications including email via MIME, as well as storing
complex data in XML or JSON.
Advantages of BASE64_ENCODE
1. Data Integrity: Base64 ensures that data remains intact without modification during
transport.
2. Usability: It can be used to send binary data, such as images or files, over channels
designed to transmit text-based data.
3. Security: While it’s not meant to be a secure encryption method, it does provide a layer of
obfuscation.
How to Encode the Input Using Base64 Encoding in PySpark
PySpark ships a built-in base64 function (pyspark.sql.functions.base64) that encodes a binary column as a base64 string. It can also use Python's built-in libraries through a User Defined Function (UDF), which is the approach demonstrated below; a sketch of the built-in function appears at the end of this section.

from pyspark.sql.functions import udf


from pyspark.sql.types import StringType
import base64

def base64_encode(input):
    try:
        return base64.b64encode(input.encode('utf-8')).decode('utf-8')
    except Exception as e:
        return None

base64_encode_udf = udf(lambda z: base64_encode(z), StringType())

df_encoded = df.withColumn('encoded_column', base64_encode_udf(df['column_to_encode']))

Example with Data
The BASE64_ENCODE function is a handy tool for preserving binary data integrity when it needs
to be stored and transferred over systems that are designed to handle text.

# Import the required libraries


from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import base64
# Start a Spark Session
spark = SparkSession.builder.appName("freshers.in Learning for BASE64_ENCODE").getOrCreate()
# Create a sample DataFrame
data = [('Sachin', 'Tendulkar', 'sachin.tendulkar@freshers.in'),
        ('Mahesh', 'Babu', 'mahesh.babu@freshers.in'),
        ('Mohan', 'Lal', 'mohan.lal@freshers.in')]
df = spark.createDataFrame(data, ["First Name", "Last Name", "Email"])
# Display original DataFrame
df.show(20,False)
# Define the base64 encode function
def base64_encode(input):
    try:
        return base64.b64encode(input.encode('utf-8')).decode('utf-8')
    except Exception as e:
        return None
# Create a UDF for the base64 encode function
base64_encode_udf = udf(lambda z: base64_encode(z), StringType())
# Add a new column to the DataFrame with the encoded email
df_encoded = df.withColumn('Encoded Email', base64_encode_udf(df['Email']))
# Display the DataFrame with the encoded column
df_encoded.show(20,False)

Output

+----------+---------+----------------------------+
|First Name|Last Name|Email                       |
+----------+---------+----------------------------+
|Sachin    |Tendulkar|sachin.tendulkar@freshers.in|
|Mahesh    |Babu     |mahesh.babu@freshers.in     |
|Mohan     |Lal      |mohan.lal@freshers.in       |
+----------+---------+----------------------------+

+----------+---------+----------------------------+----------------------------------------+
|First Name|Last Name|Email                       |Encoded Email                           |
+----------+---------+----------------------------+----------------------------------------+
|Sachin    |Tendulkar|sachin.tendulkar@freshers.in|c2FjaGluLnRlbmR1bGthckBmcmVzaGVycy5pbg==|
|Mahesh    |Babu     |mahesh.babu@freshers.in     |bWFoZXNoLmJhYnVAZnJlc2hlcnMuaW4=        |
|Mohan     |Lal      |mohan.lal@freshers.in       |bW9oYW4ubGFsQGZyZXNoZXJzLmlu            |
+----------+---------+----------------------------+----------------------------------------+

In this script, we first create a SparkSession, which is the entry point to any functionality in Spark.
We then create a DataFrame with some sample data.
The base64_encode function takes an input string and returns the Base64 encoded version of the
string. We then create a user-defined function (UDF) out of this, which can be applied to our
DataFrame.
Finally, we create a new DataFrame, df_encoded, which includes a new column ‘Encoded Email’.
This column is the result of applying our UDF to the ‘Email’ column of the original DataFrame.
When you run the df.show() and df_encoded.show(), it will display the original and the base64
encoded DataFrames respectively.
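For comparison, the built-in base64 function mentioned earlier avoids the UDF entirely; a minimal sketch, assuming the same df:

from pyspark.sql import functions as F

# Built-in alternative: base64 encodes a binary column, so cast the string column first
df_builtin = df.withColumn("Encoded Email", F.base64(F.col("Email").cast("binary")))
df_builtin.show(20, False)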
PySpark : Understanding the PySpark next_day Function
Time series data often involves handling and manipulating dates. Apache Spark, through its
PySpark interface, provides an arsenal of date-time functions that simplify this task. One such
function is next_day(), a powerful function used to find the next specified day of the week from a
given date. This article will provide an in-depth look into the usage and application of the
next_day() function in PySpark.
The next_day() function takes two arguments: a date and a day of the week. The function returns
the next specified day after the given date. For instance, if the given date is a Monday and the
specified day is ‘Thursday’, the function will return the date of the coming Thursday.
The next_day() function recognizes the day of the week case-insensitively, and both in full (like
‘Monday’) and abbreviated form (like ‘Mon’). 
To begin with, let’s initialize a SparkSession, the entry point to any Spark functionality.

from pyspark.sql import SparkSession


# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Create a DataFrame with a single column date filled with some hardcoded date values.

data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()

Output

+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+

Given the dates are in string format, we need to convert them into date type using the to_date
function.

from pyspark.sql.functions import col, to_date


df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
df.show()

Use the next_day() function to find the next Sunday from the given date.

from pyspark.sql.functions import next_day


df = df.withColumn("next_sunday", next_day("date", 'Sunday'))
df.show()

Result DataFrame 

+----------+-----------+
| date|next_sunday|
+----------+-----------+
|2023-07-04| 2023-07-09|
|2023-12-31| 2024-01-07|
|2022-02-28| 2022-03-06|
+----------+-----------+

The next_day() function in PySpark is a powerful tool for manipulating date-time data,
particularly when you need to perform operations based on the days of the week.
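Because the day-of-week argument is case-insensitive and may be abbreviated, the next Monday can be found with the short form 'Mon'; a minimal sketch reusing the df above:

from pyspark.sql.functions import next_day

# The abbreviated form 'Mon' behaves the same as 'Monday'
df.withColumn("next_monday", next_day("date", "Mon")).show()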
PySpark : Extracting the Month from a Date in PySpark

Working with dates and time is a common task in data analysis. Apache Spark provides a variety
of functions to manipulate date and time data types, including a function to extract the month
from a date. In this article, we will explore how to use the month() function in PySpark to extract
the month of a given date as an integer.
The month() function extracts the month part from a given date and returns it as an integer. For
example, if you have a date “2023-07-04”, applying the month() function to this date will return
the integer value 7.
Firstly, let’s start by setting up a SparkSession, which is the entry point to any Spark functionality.

from pyspark.sql import SparkSession


# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Create a DataFrame with a single column called date that contains some hard-coded date values.

data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()

Output

+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+

As our dates are in string format, we need to convert them into date type using
the to_date function.

from pyspark.sql.functions import col, to_date


df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
df.show()

Let’s use the month() function to extract the month from the date column.

from pyspark.sql.functions import month


df = df.withColumn("month", month("date"))
df.show()

Result

+----------+-----+
|      date|month|
+----------+-----+
|2023-07-04|    7|
|2023-12-31|   12|
|2022-02-28|    2|
+----------+-----+

As you can see, the month column contains the month part of the corresponding date in the date
column. The month() function in PySpark provides a simple and effective way to retrieve the
month part from a date, making it a valuable tool in a data scientist’s arsenal. This function, along
with other date-time functions in PySpark, simplifies the process of handling date-time data.
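The companion functions year(), dayofmonth() and dayofweek() follow the same pattern, so a fuller breakdown of the date column can be produced in a single select; a minimal sketch reusing the df above:

from pyspark.sql.functions import year, month, dayofmonth, dayofweek

df.select(
    "date",
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day"),
    dayofweek("date").alias("weekday")  # 1 = Sunday ... 7 = Saturday
).show()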
PySpark : Calculating the Difference Between Dates with PySpark: The months_between
Function
When working with time series data, it is often necessary to calculate the time difference
between two dates. Apache Spark provides an extensive collection of functions to perform date-
time manipulations, and months_between is one of them. This function computes the number of
months between two dates. If the first date (date1) is later than the second one (date2), the
result will be positive. Notably, if both dates are on the same day of the month, the function will
return a precise whole number. This article will guide you on how to utilize this function in
PySpark.
Firstly, we need to create a SparkSession, which is the entry point to any functionality in Spark.

from pyspark.sql import SparkSession


# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Let’s create a DataFrame with hardcoded dates for illustration purposes. We’ll create two
columns, date1 and date2, which will contain our dates in string format.

data = [("2023-07-04", "2022-07-04"),


("2023-12-31", "2022-01-01"),
("2022-02-28", "2021-02-28")]
df = spark.createDataFrame(data, ["date1", "date2"])
df.show()

Output

+----------+----------+
| date1| date2|
+----------+----------+
|2023-07-04|2022-07-04|
|2023-12-31|2022-01-01|
|2022-02-28|2021-02-28|
+----------+----------+

In this DataFrame, date1 is always later than date2. Now, we need to convert the date strings to
date type using the to_date function.

from pyspark.sql.functions import col, to_date


df = df.withColumn("date1", to_date(col("date1"), "yyyy-MM-dd"))
df = df.withColumn("date2", to_date(col("date2"), "yyyy-MM-dd"))
df.show()

Let’s use the months_between function to calculate the number of months between date1 and
date2.

from pyspark.sql.functions import months_between


df = df.withColumn("months_between", months_between("date1", "date2"))
df.show()

Result

+----------+----------+--------------+
| date1| date2|months_between|
+----------+----------+--------------+
|2023-07-04|2022-07-04| 12.0|
|2023-12-31|2022-01-01| 23.96774194|
|2022-02-28|2021-02-28| 12.0|
+----------+----------+--------------+

months_between returns a floating-point number indicating the number of months between the
two dates. The function considers the day of the month as well, hence for the first and the last
row where the day of the month is the same for date1 and date2, the returned number is a whole
number.
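By default, months_between rounds the result to 8 decimal places; on Spark 2.4 and later a third roundOff argument can disable that rounding when the raw fraction is needed. A minimal sketch with the same df:

from pyspark.sql.functions import months_between

# roundOff=False keeps the full fractional precision instead of rounding to 8 decimals
df.withColumn("months_between_raw", months_between("date1", "date2", roundOff=False)) \
  .show(truncate=False)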
PySpark : Retrieving Unique Elements from two arrays in PySpark

Let’s start by creating a DataFrame named freshers_in. We’ll make it contain two array columns
named ‘array1’ and ‘array2’, filled with hard-coded values.

from pyspark.sql import SparkSession


from pyspark.sql.functions import array

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

data = [(["java", "c++", "python"], ["python", "java", "scala"]),


(["javascript", "c#", "java"], ["java", "javascript", "php"]),
(["ruby", "php", "c++"], ["c++", "ruby", "perl"])]

# Create DataFrame
freshers_in = spark.createDataFrame(data, ["array1", "array2"])
freshers_in.show(truncate=False)

The show() function will display the DataFrame freshers_in, which should look something like
this:
+----------------------+-----------------------+
|array1                |array2                 |
+----------------------+-----------------------+
|[java, c++, python]   |[python, java, scala]  |
|[javascript, c#, java]|[java, javascript, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |
+----------------------+-----------------------+

To create a new array column containing unique elements from ‘array1’ and ‘array2’, we can
utilize the concat() function to merge the arrays and the array_distinct() function to extract the
unique elements.

from pyspark.sql.functions import array_distinct, concat


# Add 'unique_elements' column
freshers_in = freshers_in.withColumn("unique_elements", array_distinct(concat("array1",
"array2")))
freshers_in.show(truncate=False)

Result 

+----------------------+-----------------------+---------------------------+
|array1                |array2                 |unique_elements            |
+----------------------+-----------------------+---------------------------+
|[java, c++, python]   |[python, java, scala]  |[java, c++, python, scala] |
|[javascript, c#, java]|[java, javascript, php]|[javascript, c#, java, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |[ruby, php, c++, perl]     |
+----------------------+-----------------------+---------------------------+

unique_elements column is a unique combination of the elements from the ‘array1’ and ‘array2’
columns.
Note that PySpark’s array functions treat NULLs as valid array elements. If your arrays could
contain NULLs, and you want to exclude them from the result, you should filter them out before
applying the array_distinct and concat operations.
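One way to drop the NULL elements first, assuming Spark 3.1+ where the higher-order filter function is available in the Python API, is sketched below:

from pyspark.sql import functions as F

# Drop NULL elements from the concatenated array before de-duplicating (requires Spark 3.1+)
freshers_in = freshers_in.withColumn(
    "unique_elements",
    F.array_distinct(F.filter(F.concat("array1", "array2"), lambda x: x.isNotNull()))
)
freshers_in.show(truncate=False)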

Extracting Unique Values From Array Columns in PySpark

When dealing with data in Spark, you may find yourself needing to extract distinct values from
array columns. This can be particularly challenging when working with large datasets, but
PySpark’s array and dataframe functions can make this process much easier.
In this article, we’ll walk you through how to extract an array containing the distinct values from
arrays in a column in PySpark. We will demonstrate this process using some sample data, which
you can execute directly.
Let’s create a PySpark DataFrame to illustrate this process:

from pyspark.sql import SparkSession


from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

data = [("James", ["Java", "C++", "Python"]),


("Michael", ["Python", "Java", "C++", "Java"]),
("Robert", ["CSharp", "VB", "Python", "Java", "Python"])]

df = spark.createDataFrame(data, ["Name", "Languages"])


df.show(truncate=False)

Result

+-------+----------------------------------+
|Name   |Languages                         |
+-------+----------------------------------+
|James  |[Java, C++, Python]               |
|Michael|[Python, Java, C++, Java]         |
|Robert |[CSharp, VB, Python, Java, Python]|
+-------+----------------------------------+

Here, the column Languages is an array type column containing programming languages known
by each person. As you can see, there are some duplicate values in each array. Now, let’s extract
the distinct values from this array.
Using explode and distinct Functions
The first method involves using the explode function to convert the array into individual rows
and then using the distinct function to remove duplicates:

df2 = df.withColumn("Languages", explode(df["Languages"])) \
        .dropDuplicates(["Name", "Languages"])

df2.show(truncate=False)

Result

+-------+---------+
|Name |Languages|
+-------+---------+
|James |Python |
|James |Java |
|James |C++ |
|Michael|Java |
|Robert |Java |
|Robert |CSharp |
|Robert |Python |
|Robert |VB |
|Michael|C++ |
|Michael|Python |
+-------+---------+

Here, the explode function creates a new row for each element in the given array or map column,
and the dropDuplicates function eliminates duplicate rows.
However, the result is not an array but rather individual rows. To get an array of distinct values
for each person, we can group the data by the ‘Name’ column and use the collect_list function:

df3 = df2.groupBy("Name").agg(collect_list("Languages").alias("DistinctLanguages"))
df3.show(truncate=False)

Result

+-------+--------------------------+
|Name |DistinctLanguages |
+-------+--------------------------+
|James |[Python, Java, C++] |
|Michael|[Java, C++, Python] |
|Robert |[Java, CSharp, Python, VB]|
+-------+--------------------------+
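If the goal is only to remove duplicates inside each row's array, the explode and groupBy round trip is not needed; array_distinct does it in one step. A minimal sketch on the original df:

from pyspark.sql.functions import array_distinct

# De-duplicate each Languages array in place, keeping the first occurrence order
df.withColumn("DistinctLanguages", array_distinct("Languages")).show(truncate=False)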

If you want the list of all the languages without duplicates across every person, you can do the following:

df4 = df.select(explode(df["Languages"])).dropDuplicates(["col"])
df4.show(truncate=False)


+------+
|col |
+------+
|C++ |
|Python|
|Java |
|CSharp|
|VB |
+------+

PySpark : Returning an Array that Contains Matching Elements in Two Input Arrays in PySpark

This article will focus on a particular use case: returning an array that contains the matching
elements in two input arrays in PySpark. To illustrate this, we’ll use PySpark’s built-in functions
and DataFrame transformations.
PySpark does not provide a direct function to compare arrays and return the matching elements.
However, you can achieve this by utilizing some of its in-built functions like explode, collect_list,
and array_intersect.
Let’s assume we have a DataFrame that has two columns, both of which contain arrays:

from pyspark.sql import SparkSession


from pyspark.sql.functions import array
spark = SparkSession.builder.getOrCreate()
data = [
("1", list(["apple", "banana", "cherry"]), list(["banana", "cherry", "date"])),
("2", list(["pear", "mango", "peach"]), list(["mango", "peach", "lemon"])),
]
df = spark.createDataFrame(data, ["id", "Array1", "Array2"])
df.show()

DataFrame is created successfully.
To return an array with the matching elements in ‘Array1’ and ‘Array2’, use
the array_intersect function:

from pyspark.sql.functions import array_intersect

df_with_matching_elements = df.withColumn("MatchingElements", array_intersect(df.Array1, df.Array2))
df_with_matching_elements.show(20, False)

The ‘MatchingElements’ column will contain the matching elements in ‘Array1’ and ‘Array2’ for
each row.
Using the PySpark array_intersect function, you can efficiently find matching elements in two
arrays. This function is not only simple and efficient but also scalable, making it a great tool for
processing and analyzing big data with PySpark. It’s important to remember, however, that this
approach works on a row-by-row basis. If you want to find matches across all rows in the
DataFrame, you’ll need to apply a different technique.

+---+-----------------------+----------------------+----------------+
|id |Array1                 |Array2                |MatchingElements|
+---+-----------------------+----------------------+----------------+
|1  |[apple, banana, cherry]|[banana, cherry, date]|[banana, cherry]|
|2  |[pear, mango, peach]   |[mango, peach, lemon] |[mango, peach]  |
+---+-----------------------+----------------------+----------------+
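The complementary question, which elements of Array1 are missing from Array2, can be answered with array_except; a minimal sketch on the same df:

from pyspark.sql.functions import array_except

# Elements present in Array1 but missing from Array2, evaluated per row
df.withColumn("OnlyInArray1", array_except(df.Array1, df.Array2)).show(20, False)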

PySpark : Reversing the order of strings in a list using PySpark


 USER  JULY 5, 2023  LEAVE A COMMENTON PYSPARK : REVERSING THE ORDER OF STRINGS IN A
LIST USING PYSPARK

Lets create a sample data in the form of a list of strings.

from pyspark import SparkContext, SparkConf


from pyspark.sql import SparkSession
conf = SparkConf().setAppName('Reverse String @ Freshers.in Learning')
sc = SparkContext.getOrCreate();
spark = SparkSession(sc)
# Sample data
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']
# Parallelize the data with Spark
rdd = sc.parallelize(data)

Python
COPY
Now, we can apply a map operation on this RDD (Resilient Distributed Datasets, the fundamental
data structure of Spark). The map operation applies a given function to each element of the RDD
and returns a new RDD.
We will use the built-in Python function reversed() inside a map operation to reverse the order of
each string. reversed() returns a reverse iterator, so we have to join it back into a string with
”.join().

# Apply map operation to reverse the strings


reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))

Python
COPY
The lambda function here is a simple anonymous function that takes one argument, x, and
returns the reversed string. x is each element of the RDD (each string in this case).
After this operation, we have a new RDD where each string from the original RDD has been
reversed. You can collect the results back to the driver program using the collect() action.
# Collect the results
reversed_data = reversed_rdd.collect()

# Print the reversed strings


for word in reversed_data:
print(word)

Python
COPY
As you can see, the order of characters in each string from the list has been reversed. Note that
Spark operations are lazily evaluated, meaning the actual computations (like reversing the
strings) only happen when an action (like collect()) is called. This feature allows Spark to optimize
the overall data processing workflow.
Complete code

from pyspark import SparkContext, SparkConf


from pyspark.sql import SparkSession
conf = SparkConf().setAppName('Reverse String @ Freshers.in Learning')
sc = SparkContext.getOrCreate();
spark = SparkSession(sc)
#Sample data for testing
data = ['Sachin', 'Narendra', 'Arun', 'Oracle', 'Redshift']
#Parallelize the data with Spark
rdd = sc.parallelize(data)
reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))
#Collect the results
reversed_data = reversed_rdd.collect()
#Print the reversed strings
for word in reversed_data:
print(word)

 POSTED INSPARK
PySpark : Generating a 64-bit hash value in PySpark
 USER  JULY 5, 2023  LEAVE A COMMENTON PYSPARK : GENERATING A 64-BIT HASH VALUE IN
PYSPARK

Introduction to 64-bit Hashing


A hash function is a function that can be used to map data of arbitrary size to fixed-size values.
The values returned by a hash function are called hash codes, hash values, or simply hashes.
When we say a hash value is a “signed 64-bit” value, it means the hash function outputs a 64-bit
integer that can represent both positive and negative numbers. In computing, a 64-bit integer
can represent a vast range of numbers, from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807.
A 64-bit hash function can be useful in a variety of scenarios, particularly when working with large
data sets. It can be used for quickly comparing complex data structures, indexing data, and
checking data integrity.
Use of 64-bit Hashing in PySpark
While PySpark does not provide a direct function for 64-bit hashing, it does provide a function
hash() that returns a hash as an integer, which is usually a 32-bit hash. For a 64-bit hash, we can
consider using the murmur3 hash function from Python’s mmh3 library, which produces a 128-bit
hash and can be trimmed down to 64-bit. You can install the library using pip:

pip install mmh3

Bash
COPY
Here is an example of how to generate a 64-bit hash value in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
import mmh3

#Create a Spark session


spark = SparkSession.builder.appName("freshers.in Learning for 64-bit Hashing in PySpark
").getOrCreate()

#Creating sample data


data = [("Sachin",), ("Ramesh",), ("Babu",)]
df = spark.createDataFrame(data, ["Name"])

#Function to generate 64-bit hash


def hash_64(input):
return mmh3.hash64(input.encode('utf-8'))[0]

#Create a UDF for the 64-bit hash function


hash_64_udf = udf(lambda z: hash_64(z), LongType())

#Apply the UDF to the DataFrame


df_hashed = df.withColumn("Name_hashed", hash_64_udf(df['Name']))

#Show the DataFrame


df_hashed.show()

Python
COPY
In this example, we create a Spark session and a DataFrame df with a single column “Name”.
Then, we define the function hash_64 to generate a 64-bit hash of an input string. After that, we
create a user-defined function (UDF) hash_64_udf using PySpark SQL functions. Finally, we apply
this UDF to the column “Name” in the DataFrame df and create a new DataFrame df_hashed
with the 64-bit hashed values of the names.
Advantages and Drawbacks of 64-bit Hashing
Advantages:
1. Large Range: A 64-bit hash value has a very large range of possible values, which can help
reduce hash collisions (different inputs producing the same hash output).
2. Fast Comparison and Lookup: Hashing can turn time-consuming operations such as string
comparison into a simple integer comparison, which can significantly speed up certain
operations like data lookups.
3. Data Integrity Checks: Hash values can provide a quick way to check if data has been
altered.
Drawbacks:
1. Collisions: While the possibility is reduced, hash collisions can still occur where different
inputs produce the same hash output.
2. Not for Security: A hash value is not meant for security purposes. It can be reverse-
engineered to get the original input.
3. Data Loss: Hashing is a one-way function. Once data is hashed, it cannot be converted back
to the original input.

Introduction to BASE64_ENCODE and its Applications in PySpark


BASE64 is a group of similar binary-to-text encoding schemes that represent binary data in an
ASCII string format by translating it into a radix-64 representation. It is designed to carry data
stored in binary formats across channels that are designed to deal with text. This ensures that
the data remains intact without any modification during transport.
BASE64_ENCODE is a function used to encode data into this base64 format.
Where is BASE64_ENCODE used?
Base64 encoding schemes are commonly used when there is a need to encode binary data,
especially when that data needs to be stored or sent over media that are designed to deal with
text. This encoding helps to ensure that the data remains intact without modification during
transport.
Base64 is used commonly in a number of applications including email via MIME, as well as storing
complex data in XML or JSON.
Advantages of BASE64_ENCODE
1. Data Integrity: Base64 ensures that data remains intact without modification during
transport.
2. Usability: It can be used to send binary data, such as images or files, over channels
designed to transmit text-based data.
3. Security: While it’s not meant to be a secure encryption method, it does provide a layer of
obfuscation.
How to Encode the Input Using Base64 Encoding in PySpark
PySpark, the Python library for Spark programming, does not natively support Base64 encoding
functions until the version that’s available as of my knowledge cutoff in September 2021.
However, PySpark can easily use Python’s built-in libraries, and we can create a User Defined
Function (UDF) to perform Base64 encoding. Below is a sample way of how you can achieve that.

from pyspark.sql.functions import udf


from pyspark.sql.types import StringType
import base64

def base64_encode(input):
try:
return base64.b64encode(input.encode('utf-8')).decode('utf-8')
except Exception as e:
return None

base64_encode_udf = udf(lambda z: base64_encode(z), StringType())

df_encoded = df.withColumn('encoded_column', base64_encode_udf(df['column_to_encode']))

Python
COPY
Example with Data
The BASE64_ENCODE function is a handy tool for preserving binary data integrity when it needs
to be stored and transferred over systems that are designed to handle text.

# Import the required libraries


from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import base64
# Start a Spark Session
spark = SparkSession.builder.appName("freshers.in Learning for BASE64_ENCODE
").getOrCreate()
# Create a sample DataFrame
data = [('Sachin', 'Tendulkar', '[email protected]'),
('Mahesh', 'Babu', '[email protected]'),
('Mohan', 'Lal', '[email protected]')]
df = spark.createDataFrame(data, ["First Name", "Last Name", "Email"])
# Display original DataFrame
df.show(20,False)
# Define the base64 encode function
def base64_encode(input):
try:
return base64.b64encode(input.encode('utf-8')).decode('utf-8')
except Exception as e:
return None
# Create a UDF for the base64 encode function
base64_encode_udf = udf(lambda z: base64_encode(z), StringType())
# Add a new column to the DataFrame with the encoded email
df_encoded = df.withColumn('Encoded Email', base64_encode_udf(df['Email']))
# Display the DataFrame with the encoded column
df_encoded.show(20,False)

Python
COPY
Output

+----------+---------+----------------------------+
|First Name|Last Name|Email |
+----------+---------+----------------------------+
|Sachin |Tendulkar|[email protected]|
|Mahesh |Babu |[email protected] |
|Mohan |Lal |[email protected] |
+----------+---------+----------------------------+

+----------+---------+----------------------------+----------------------------------------+
|First Name|Last Name|Email |Encoded Email |
+----------+---------+----------------------------+----------------------------------------+
|Sachin |Tendulkar|[email protected]|
c2FjaGluLnRlbmR1bGthckBmcmVzaGVycy5pbg==|
|Mahesh |Babu |[email protected] |bWFoZXNoLmJhYnVAZnJlc2hlcnMuaW4= |
|Mohan |Lal |[email protected] |bW9oYW4ubGFsQGZyZXNoZXJzLmlu |
+----------+---------+----------------------------+----------------------------------------+

Bash
COPY
In this script, we first create a SparkSession, which is the entry point to any functionality in Spark.
We then create a DataFrame with some sample data.
The base64_encode function takes an input string and returns the Base64 encoded version of the
string. We then create a user-defined function (UDF) out of this, which can be applied to our
DataFrame.
Finally, we create a new DataFrame, df_encoded, which includes a new column ‘Encoded Email’.
This column is the result of applying our UDF to the ‘Email’ column of the original DataFrame.
When you run the df.show() and df_encoded.show(), it will display the original and the base64
encoded DataFrames respectively.
PySpark : Understanding the PySpark next_day Function
 USER  JULY 4, 2023  LEAVE A COMMENTON PYSPARK : UNDERSTANDING THE PYSPARK
NEXT_DAY FUNCTION

Time series data often involves handling and manipulating dates. Apache Spark, through its
PySpark interface, provides an arsenal of date-time functions that simplify this task. One such
function is next_day(), a powerful function used to find the next specified day of the week from a
given date. This article will provide an in-depth look into the usage and application of the
next_day() function in PySpark.
The next_day() function takes two arguments: a date and a day of the week. The function returns
the next specified day after the given date. For instance, if the given date is a Monday and the
specified day is ‘Thursday’, the function will return the date of the coming Thursday.
The next_day() function recognizes the day of the week case-insensitively, and both in full (like
‘Monday’) and abbreviated form (like ‘Mon’). 
To begin with, let’s initialize a SparkSession, the entry point to any Spark functionality.

from pyspark.sql import SparkSession


# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Python
COPY
Create a DataFrame with a single column date filled with some hardcoded date values.

data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()

Python
COPY
Output

+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+

Bash
COPY
Given the dates are in string format, we need to convert them into date type using the to_date
function.

from pyspark.sql.functions import col, to_date


df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
df.show()

Bash
COPY
Use the next_day() function to find the next Sunday from the given date.

from pyspark.sql.functions import next_day


df = df.withColumn("next_sunday", next_day("date", 'Sunday'))
df.show()

Python
COPY
Result DataFrame 

+----------+-----------+
| date|next_sunday|
+----------+-----------+
|2023-07-04| 2023-07-09|
|2023-12-31| 2024-01-07|
|2022-02-28| 2022-03-05|
+----------+-----------+

Bash
COPY
The next_day() function in PySpark is a powerful tool for manipulating date-time data,
particularly when you need to perform operations based on the days of the week.
PySpark : Extracting the Month from a Date in PySpark
 USER  JULY 4, 2023  LEAVE A COMMENTON PYSPARK : EXTRACTING THE MONTH FROM A DATE
IN PYSPARK

Working with dates and time is a common task in data analysis. Apache Spark provides a variety
of functions to manipulate date and time data types, including a function to extract the month
from a date. In this article, we will explore how to use the month() function in PySpark to extract
the month of a given date as an integer.
The month() function extracts the month part from a given date and returns it as an integer. For
example, if you have a date “2023-07-04”, applying the month() function to this date will return
the integer value 7.
Firstly, let’s start by setting up a SparkSession, which is the entry point to any Spark functionality.

from pyspark.sql import SparkSession


# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Python
COPY
Create a DataFrame with a single column called date that contains some hard-coded date values.

data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()

Python
COPY
Output

+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+

Bash
COPY
As our dates are in string format, we need to convert them into date type using
the to_date function.

from pyspark.sql.functions import col, to_date


df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
df.show()

Python
COPY
Let’s use the month() function to extract the month from the date column.

from pyspark.sql.functions import month


df = df.withColumn("month", month("date"))
df.show()

Python
COPY
Result

+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+

Bash
COPY
As you can see, the month column contains the month part of the corresponding date in the date
column. The month() function in PySpark provides a simple and effective way to retrieve the
month part from a date, making it a valuable tool in a data scientist’s arsenal. This function, along
with other date-time functions in PySpark, simplifies the process of handling date-time data.
 POSTED INSPARK
PySpark : Calculating the Difference Between Dates with PySpark: The months_between
Function
 USER  JULY 4, 2023  LEAVE A COMMENTON PYSPARK : CALCULATING THE DIFFERENCE
BETWEEN DATES WITH PYSPARK: THE MONTHS_BETWEEN FUNCTION

When working with time series data, it is often necessary to calculate the time difference
between two dates. Apache Spark provides an extensive collection of functions to perform date-
time manipulations, and months_between is one of them. This function computes the number of
months between two dates. If the first date (date1) is later than the second one (date2), the
result will be positive. Notably, if both dates are on the same day of the month, the function will
return a precise whole number. This article will guide you on how to utilize this function in
PySpark.
Firstly, we need to create a SparkSession, which is the entry point to any functionality in Spark.

from pyspark.sql import SparkSession


# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Python
COPY
Let’s create a DataFrame with hardcoded dates for illustration purposes. We’ll create two
columns, date1 and date2, which will contain our dates in string format.

data = [("2023-07-04", "2022-07-04"),


("2023-12-31", "2022-01-01"),
("2022-02-28", "2021-02-28")]
df = spark.createDataFrame(data, ["date1", "date2"])
df.show()

Python
COPY
Output
+----------+----------+
| date1| date2|
+----------+----------+
|2023-07-04|2022-07-04|
|2023-12-31|2022-01-01|
|2022-02-28|2021-02-28|
+----------+----------+

Bash
COPY
In this DataFrame, date1 is always later than date2. Now, we need to convert the date strings to
date type using the to_date function.

from pyspark.sql.functions import col, to_date


df = df.withColumn("date1", to_date(col("date1"), "yyyy-MM-dd"))
df = df.withColumn("date2", to_date(col("date2"), "yyyy-MM-dd"))
df.show()

Python
COPY
Let’s use the months_between function to calculate the number of months between date1 and
date2.

from pyspark.sql.functions import months_between


df = df.withColumn("months_between", months_between("date1", "date2"))
df.show()

Python
COPY
Result

+----------+----------+--------------+
| date1| date2|months_between|
+----------+----------+--------------+
|2023-07-04|2022-07-04| 12.0|
|2023-12-31|2022-01-01| 23.96774194|
|2022-02-28|2021-02-28| 12.0|
+----------+----------+--------------+

Python
COPY
months_between returns a floating-point number indicating the number of months between the
two dates. The function considers the day of the month as well, hence for the first and the last
row where the day of the month is the same for date1 and date2, the returned number is a whole
number.
 POSTED INSPARK
PySpark : Retrieving Unique Elements from two arrays in PySpark
 USER  JULY 4, 2023  LEAVE A COMMENTON PYSPARK : RETRIEVING UNIQUE ELEMENTS FROM
TWO ARRAYS IN PYSPARK
Let’s start by creating a DataFrame named freshers_in. We’ll make it contain two array columns
named ‘array1’ and ‘array2’, filled with hard-coded values.

from pyspark.sql import SparkSession


from pyspark.sql.functions import array

# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

data = [(["java", "c++", "python"], ["python", "java", "scala"]),


(["javascript", "c#", "java"], ["java", "javascript", "php"]),
(["ruby", "php", "c++"], ["c++", "ruby", "perl"])]

# Create DataFrame
freshers_in = spark.createDataFrame(data, ["array1", "array2"])
freshers_in.show(truncate=False)

Python
COPY
The show() function will display the DataFrame freshers_in, which should look something like
this:

+----------------------+-----------------------+
|array1                |array2                 |
+----------------------+-----------------------+
|[java, c++, python]   |[python, java, scala]  |
|[javascript, c#, java]|[java, javascript, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |
+----------------------+-----------------------+

Bash
COPY
To create a new array column containing unique elements from ‘array1’ and ‘array2’, we can
utilize the concat() function to merge the arrays and the array_distinct() function to extract the
unique elements.

from pyspark.sql.functions import array_distinct, concat


# Add 'unique_elements' column
freshers_in = freshers_in.withColumn("unique_elements", array_distinct(concat("array1",
"array2")))
freshers_in.show(truncate=False)

Python
COPY
Result 

+----------------------+-----------------------+---------------------------+
|array1                |array2                 |unique_elements            |
+----------------------+-----------------------+---------------------------+
|[java, c++, python]   |[python, java, scala]  |[java, c++, python, scala] |
|[javascript, c#, java]|[java, javascript, php]|[javascript, c#, java, php]|
|[ruby, php, c++]      |[c++, ruby, perl]      |[ruby, php, c++, perl]     |
+----------------------+-----------------------+---------------------------+

Bash
COPY
The unique_elements column holds the distinct union of the elements from the ‘array1’ and ‘array2’ columns.
Note that PySpark’s array functions treat NULLs as valid array elements. If your arrays could
contain NULLs, and you want to exclude them from the result, you should filter them out before
applying the array_distinct and concat operations
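One way to do that filtering is with a SQL higher-order filter expression over the combined array. The following is a minimal sketch reusing the freshers_in DataFrame from above, assuming Spark 2.4+ where the filter higher-order function is available.

from pyspark.sql.functions import array_distinct, expr

# Drop NULL entries from the combined array before deduplicating
freshers_in = freshers_in.withColumn(
    "unique_elements",
    array_distinct(expr("filter(concat(array1, array2), x -> x is not null)"))
)
freshers_in.show(truncate=False)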
Extracting Unique Values From Array Columns in PySpark
When dealing with data in Spark, you may find yourself needing to extract distinct values from
array columns. This can be particularly challenging when working with large datasets, but
PySpark’s array and dataframe functions can make this process much easier.
In this article, we’ll walk you through how to extract an array containing the distinct values from
arrays in a column in PySpark. We will demonstrate this process using some sample data, which
you can execute directly.
Let’s create a PySpark DataFrame to illustrate this process:

from pyspark.sql import SparkSession


from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

data = [("James", ["Java", "C++", "Python"]),


("Michael", ["Python", "Java", "C++", "Java"]),
("Robert", ["CSharp", "VB", "Python", "Java", "Python"])]

df = spark.createDataFrame(data, ["Name", "Languages"])


df.show(truncate=False)

Python
COPY
Result

+-------+----------------------------------+
|Name   |Languages                         |
+-------+----------------------------------+
|James  |[Java, C++, Python]               |
|Michael|[Python, Java, C++, Java]         |
|Robert |[CSharp, VB, Python, Java, Python]|
+-------+----------------------------------+
Bash
COPY
Here, the column Languages is an array type column containing programming languages known
by each person. As you can see, there are some duplicate values in each array. Now, let’s extract
the distinct values from this array.
Using explode and distinct Functions
The first method involves using the explode function to convert the array into individual rows
and then using the distinct function to remove duplicates:

df2 = df.withColumn("Languages", explode(df["Languages"]))\


.dropDuplicates(["Name", "Languages"])

df2.show(truncate=False)

Python
COPY
Result

+-------+---------+
|Name |Languages|
+-------+---------+
|James |Python |
|James |Java |
|James |C++ |
|Michael|Java |
|Robert |Java |
|Robert |CSharp |
|Robert |Python |
|Robert |VB |
|Michael|C++ |
|Michael|Python |
+-------+---------+

Bash
COPY
Here, the explode function creates a new row for each element in the given array or map column,
and the dropDuplicates function eliminates duplicate rows.
However, the result is not an array but rather individual rows. To get an array of distinct values
for each person, we can group the data by the ‘Name’ column and use the collect_list function:

df3 = df2.groupBy("Name").agg(collect_list("Languages").alias("DistinctLanguages"))
df3.show(truncate=False)

Python
COPY
Result

+-------+--------------------------+
|Name |DistinctLanguages |
+-------+--------------------------+
|James |[Python, Java, C++] |
|Michael|[Java, C++, Python] |
|Robert |[Java, CSharp, Python, VB]|
+-------+--------------------------+

Bash
COPY
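As a side note, if per-row deduplication is all you need, the built-in array_distinct function (Spark 2.4+) does it in a single step without explode and groupBy; a minimal sketch on the same df:

from pyspark.sql.functions import array_distinct

# One-step alternative: keep each row and drop duplicate elements within its array
df_simple = df.withColumn("DistinctLanguages", array_distinct("Languages"))
df_simple.show(truncate=False)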
If you want to get the list of all the languages without duplicates, you can run the following:

df4 = df.select(explode(df["Languages"])).dropDuplicates(["col"])
df4.show(truncate=False)

Python
COPY

+------+
|col |
+------+
|C++ |
|Python|
|Java |
|CSharp|
|VB |
+------+

PySpark : Returning an Array that Contains Matching Elements in Two Input Arrays in PySpark
This article will focus on a particular use case: returning an array that contains the matching
elements in two input arrays in PySpark. To illustrate this, we’ll use PySpark’s built-in functions
and DataFrame transformations.
PySpark provides a built-in function, array_intersect, that compares two array columns row by row and returns the elements they have in common, so this task needs no custom code.
Let’s assume we have a DataFrame that has two columns, both of which contain arrays:

from pyspark.sql import SparkSession


from pyspark.sql.functions import array
spark = SparkSession.builder.getOrCreate()
data = [
("1", list(["apple", "banana", "cherry"]), list(["banana", "cherry", "date"])),
("2", list(["pear", "mango", "peach"]), list(["mango", "peach", "lemon"])),
]
df = spark.createDataFrame(data, ["id", "Array1", "Array2"])
df.show()

Python
COPY
DataFrame is created successfully.
To return an array with the matching elements in ‘Array1’ and ‘Array2’, use
the array_intersect function:

from pyspark.sql.functions import array_intersect

df_with_matching_elements = df.withColumn("MatchingElements", array_intersect(df.Array1,


df.Array2))
df_with_matching_elements.show(20,False)

Python
COPY
The ‘MatchingElements’ column will contain the matching elements in ‘Array1’ and ‘Array2’ for
each row.
Using the PySpark array_intersect function, you can efficiently find matching elements in two
arrays. This function is not only simple and efficient but also scalable, making it a great tool for
processing and analyzing big data with PySpark. It’s important to remember, however, that this
approach works on a row-by-row basis. If you want to find matches across all rows in the
DataFrame, you’ll need to apply a different technique.

+---+-----------------------+----------------------+----------------+
|id |Array1 |Array2 |MatchingElements|
+---+-----------------------+----------------------+----------------+
|1 |[apple, banana, cherry]|[banana, cherry, date]|[banana, cherry]|
|2 |[pear, mango, peach] |[mango, peach, lemon] |[mango, peach] |
+---+-----------------------+----------------------+----------------+
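As a rough illustration of that cross-row case, one possible approach is to explode the arrays and keep only the elements that appear in every row. This is an illustrative sketch using the df defined above, not a built-in API:

from pyspark.sql import functions as F

# Elements of Array1 that occur in every row of the DataFrame
total_rows = df.count()
common_elements = (
    df.select("id", F.explode("Array1").alias("element"))
      .distinct()                            # one (id, element) pair per row
      .groupBy("element")
      .count()
      .filter(F.col("count") == total_rows)  # present in all rows
      .select("element")
)
common_elements.show()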

PySpark : Creating Ranges in PySpark DataFrame with Custom Start, End, and Increment Values
PySpark's built-in sequence function (available since Spark 2.4) can generate an array from a start, end, and increment value, but only for integral, date, and timestamp types. For float values there is no such built-in option, so we can use a workaround and apply a UDF (User-Defined Function) to create a list between the start_val and end_val with increments of increment_val.
Here’s how to do it:

from pyspark.sql import SparkSession


from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([(1, 10, 2), (3, 6, 1), (10, 20, 5)], ['start_val', 'end_val',
'increment_val'])
# Define UDF to create the range
def create_range(start, end, increment):
    return list(range(start, end + 1, increment))
create_range_udf = udf(create_range, ArrayType(IntegerType()))
# Apply the UDF
df = df.withColumn('range', create_range_udf(df['start_val'], df['end_val'], df['increment_val']))
# Show the DataFrame
df.show(truncate=False)

Python
COPY
This will create a new column called range in the DataFrame that contains a list from start_val to
end_val with increments of increment_val.
Result

+---------+-------+-------------+------------------+
|start_val|end_val|increment_val|range |
+---------+-------+-------------+------------------+
|1 |10 |2 |[1, 3, 5, 7, 9] |
|3 |6 |1 |[3, 4, 5, 6] |
|10 |20 |5 |[10, 15, 20] |
+---------+-------+-------------+------------------+

Bash
COPY
Remember that using Python UDFs might have a performance impact when dealing with large
volumes of data, as data needs to be moved from the JVM to Python, which is an expensive
operation. It is usually a good idea to profile your Spark application and ensure the performance
is acceptable.
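When the start, end, and increment columns are integers, as in this example, the built-in sequence function (Spark 2.4+) avoids the UDF entirely; a minimal sketch:

from pyspark.sql.functions import sequence

# Built-in alternative for integer columns; the end value is inclusive when reachable
df_builtin = df.withColumn("range", sequence("start_val", "end_val", "increment_val"))
df_builtin.show(truncate=False)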
Second option [not recommended, shown only for reference]: this version relies on NumPy, whose numeric types do not always serialize cleanly to Spark's IntegerType, and np.arange excludes the end value.

from pyspark.sql import SparkSession


from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
import numpy as np
# Start SparkSession
spark = SparkSession.builder \
.appName('Array Sequence Generator') \
.getOrCreate()
# Sample DataFrame
df = spark.createDataFrame([
(1, 10, 2),
(5, 20, 3),
(0, 15, 5)
], ["start_val", "end_val", "increment_val"])
# Define UDF
def sequence_array(start, end, step):
    return list(np.arange(start, end, step))
sequence_array_udf = udf(sequence_array, ArrayType(IntegerType()))
# Use the UDF
df = df.withColumn("sequence", sequence_array_udf(df.start_val, df.end_val, df.increment_val))
# Show the DataFrame
df.show(truncate=False)

Python
COPY
In this example, the sequence_array function uses numpy’s arange function to generate a
sequence of numbers given a start, end, and step value. The udf function is used to convert this
function into a UDF that can be used with PySpark DataFrames.
The DataFrame df is created with three columns: start_val, end_val, and increment_val. The UDF
sequence_array_udf is then used to generate a new column “sequence” in the DataFrame, which
contains arrays of numbers starting at start_val, ending at end_val (exclusive), and incrementing
by increment_val.
PySpark : How to Prepending an Element to an Array on specific condition in PySpark
If you want to prepend an element to the array only when the array contains a specific word, you
can achieve this with the help of PySpark’s when() and otherwise() functions along with
array_contains(). The when() function allows you to specify a condition, the array_contains()
function checks if an array contains a certain value, and the otherwise() function allows you to
specify what should happen if the condition is not met.
Here is the example to prepend an element only when the array contains the word “four”.

from pyspark.sql import SparkSession


from pyspark.sql.functions import array
from pyspark.sql.functions import when, array_contains, lit, array, concat
# Initialize a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame
data = [("fruits", ["apple", "banana", "cherry", "date", "elderberry"]),
("numbers", ["one", "two", "three", "four", "five"]),
("colors", ["red", "blue", "green", "yellow", "pink"])]
df = spark.createDataFrame(data, ["Category", "Items"])
df.show()
######################
# Element to prepend
#####################
element = "zero"
# Prepend the element only when the array contains "four"
df = df.withColumn("Items", when(array_contains(df["Items"], "four"),
concat(array(lit(element)), df["Items"]))
.otherwise(df["Items"]))
df.show(20,False)

Python
COPY
Source Data

+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+

Bash
COPY
Output

+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[zero, one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+

Bash
COPY
In this code, when(array_contains(df[“Items”], “four”), concat(array(lit(element)),
df[“Items”])) prepends the element to the array if the array contains “four“. If the array does not
contain “four“, otherwise(df[“Items”]) leaves the array as it is.
This results in a new DataFrame where “zero” is prepended to the array in the “Items” column
only if the array contains “four“.
PySpark : Prepending an Element to an Array in PySpark
When dealing with arrays in PySpark, a common requirement is to prepend an element at the
beginning of an array, effectively creating a new array that includes the new element as well as all
elements from the source array. PySpark doesn't have a built-in function for prepending.
However, you can achieve this by using a combination of existing PySpark functions. This article
guides you through this process with a working example.
Creating the DataFrame
Let’s first create a PySpark DataFrame with an array column to use in the demonstration:

from pyspark.sql import SparkSession


from pyspark.sql.functions import array
# Initialize a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame
data = [("fruits", ["apple", "banana", "cherry", "date", "elderberry"]),
("numbers", ["one", "two", "three", "four", "five"]),
("colors", ["red", "blue", "green", "yellow", "pink"])]
df = spark.createDataFrame(data, ["Category", "Items"])
df.show()

Python
COPY
Source data output

+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+

Bash
COPY
Prepending an Element to an Array
The approach to prepending an element to an array in PySpark involves combining the array()
and concat() functions. We will create a new array with the element to prepend and concatenate
it with the original array:

from pyspark.sql.functions import array, concat, lit


# Element to prepend
element = "zero"
# Prepend the element
df = df.withColumn("Items", concat(array(lit(element)), df["Items"]))
df.show(20,False)

Python
COPY
This code creates a new column “Items” by concatenating a new array containing the element to
prepend (“zero”) with the existing “Items” array.
The lit() function is used to create a column of literal value. The array() function is used to create
an array with the literal value, and the concat() function is used to concatenate two arrays.
This results in a new DataFrame where “zero” is prepended to each array in the “Items” column.
While PySpark doesn’t provide a built-in function for prepending an element to an array, we can
achieve the same result by creatively using the functions available. We walked through an
example of how to prepend an element to an array in a PySpark DataFrame. This method
highlights the flexibility of PySpark and how it can handle a variety of data manipulation tasks by
combining its available functions.

+--------+-----------------------------------------------+
|Category|Items |
+--------+-----------------------------------------------+
|fruits |[zero, apple, banana, cherry, date, elderberry]|
|numbers |[zero, one, two, three, four, five] |
|colors |[zero, red, blue, green, yellow, pink] |
+--------+-----------------------------------------------+
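If you are on a recent Spark release there may be built-in alternatives: array_insert was added in Spark 3.4 and array_prepend in Spark 3.5. The sketch below assumes Spark 3.4+ and uses array_insert at position 1 to achieve the same prepend:

from pyspark.sql.functions import array_insert, lit

# Spark 3.4+ only: insert "zero" at 1-based position 1, i.e. prepend it
df_v2 = df.withColumn("Items", array_insert("Items", lit(1), lit("zero")))
df_v2.show(truncate=False)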

PySpark : Finding the Index of the First Occurrence of an Element in an Array in PySpark
This article will walk you through the steps on how to find the index of the first occurrence of an
element in an array in PySpark with a working example.
Installing PySpark
Before we get started, you’ll need to have PySpark installed. You can install it via pip:

pip install pyspark

Bash
COPY
Creating the DataFrame
Let’s first create a PySpark DataFrame with an array column for demonstration purposes.

from pyspark.sql import SparkSession


from pyspark.sql.functions import array
# Initiate a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame
data = [("fruits", ["apple", "banana", "cherry", "date", "elderberry"]),
("numbers", ["one", "two", "three", "four", "five"]),
("colors", ["red", "blue", "green", "yellow", "pink"])]
df = spark.createDataFrame(data, ["Category", "Items"])
df.show(20,False)

Python
COPY
Source data

+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+

Bash
COPY
Defining the UDF
PySpark does ship the built-in array_position function (which returns a 1-based position, or 0 when the element is missing), but to get a 0-based index with null for missing items we'll create a User-Defined Function (UDF).
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# Define the UDF to find the index
def find_index(array, item):
    try:
        return array.index(item)
    except ValueError:
        return None
# Register the UDF
find_index_udf = udf(find_index, IntegerType())

Python
COPY
This UDF takes two arguments: an array and an item. It tries to return the index of the item in the
array. If the item is not found, it returns None.
Applying the UDF
To pass a literal value to the UDF, use the lit function from pyspark.sql.functions. We can now apply the UDF to our DataFrame to find the index of an element.

from pyspark.sql.functions import lit


# Use the UDF to find the index
df = df.withColumn("ItemIndex", find_index_udf(df["Items"], lit("three")))
df.show(20,False)

Python
COPY
Final Output

+--------+-----------------------------------------+---------+
|Category|Items |ItemIndex|
+--------+-----------------------------------------+---------+
|fruits |[apple, banana, cherry, date, elderberry]|null |
|numbers |[one, two, three, four, five] |2 |
|colors |[red, blue, green, yellow, pink] |null |
+--------+-----------------------------------------+---------+

Bash
COPY
This will add a new column to the DataFrame, “ItemIndex”, that contains the index of the first
occurrence of “three” in the “Items” column. If “three” is not found in an array, the
corresponding entry in the “ItemIndex” column will be null.
lit(“three”) creates a Column of literal value “three”, which is then passed to the UDF. This
ensures that the UDF correctly interprets “three” as a string value, not a column name.
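For comparison, the built-in array_position function (Spark 2.4+) covers the common case without a UDF; note that it returns a 1-based position and 0 when the element is absent, instead of a 0-based index and null:

from pyspark.sql.functions import array_position

# Built-in alternative: 1-based position of "three", 0 if it is not in the array
df_pos = df.withColumn("ItemPosition", array_position("Items", "three"))
df_pos.show(truncate=False)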
PySpark : Returning the input values, pivoted into an ARRAY
To pivot data in PySpark into an array, you can use a combination of groupBy, pivot, and
collect_list functions. The groupBy function is used to group the DataFrame using the specified
columns, pivot can be used to pivot a column of the DataFrame and perform a specified
aggregation, and collect_list function collects and returns a list of non-unique elements.
Below is an example where I create a DataFrame, and then pivot the ‘value’ column into an array
based on ‘id’ and ‘type’.

from pyspark.sql import SparkSession


from pyspark.sql.functions import collect_list
# Spark session
spark = SparkSession.builder.appName('pivot_to_array').getOrCreate()
# Creating DataFrame
data = [("1", "type1", "value1"), ("1", "type2", "value2"), ("2", "type1", "value3"), ("2", "type2",
"value4")]
df = spark.createDataFrame(data, ["id", "type", "value"])
# DataFrame
df.show()

Python
COPY
Result

+---+-----+------+
| id| type| value|
+---+-----+------+
| 1|type1|value1|
| 1|type2|value2|
| 2|type1|value3|
| 2|type2|value4|
+---+-----+------+

Bash
COPY

# Pivot and collect values into array


df_pivot = df.groupBy("id").pivot("type").agg(collect_list("value"))
# Pivoted DataFrame
df_pivot.show()

Python
COPY
Final Output (expected result; row order may vary)

+---+--------+--------+
| id|   type1|   type2|
+---+--------+--------+
|  1|[value1]|[value2]|
|  2|[value3]|[value4]|
+---+--------+--------+
In this example, groupBy(“id”) groups the DataFrame by ‘id’, pivot(“type”) pivots the ‘type’
column, and agg(collect_list(“value”)) collects the ‘value’ column into an array for each group.
The resulting DataFrame will have one row for each unique ‘id’, and a column for each unique
‘type’, with the values in these columns being arrays of the corresponding ‘value’ entries.
‘collect_list’ collects all values including duplicates. If you want to collect only unique values, use
‘collect_set’ instead.
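A minimal sketch of that collect_set variant on the same df:

from pyspark.sql.functions import collect_set

# Same pivot, but duplicate values within each group are dropped
df_pivot_unique = df.groupBy("id").pivot("type").agg(collect_set("value"))
df_pivot_unique.show()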
PySpark : Extract values from JSON strings within a DataFrame in PySpark [json_tuple]
pyspark.sql.functions.json_tuple
PySpark provides a powerful function called json_tuple that allows you to extract values from
JSON strings within a DataFrame. This function is particularly useful when you’re working with
JSON data and need to retrieve specific values or attributes from the JSON structure. In this
article, we will explore the json_tuple function in PySpark and demonstrate its usage with an
example.
Understanding json_tuple
The json_tuple function in PySpark extracts the values of specified attributes from JSON strings
within a DataFrame. It takes two or more arguments: the first argument is the input column
containing JSON strings, and the subsequent arguments are the attribute names you want to
extract from the JSON.
The json_tuple function returns a tuple of columns, where each column represents the extracted
value of the corresponding attribute from the JSON string.
Example Usage
Let’s dive into an example to understand how to use json_tuple in PySpark. Consider the
following sample data:

from pyspark.sql import SparkSession


# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Sample data as a DataFrame
data = [
('{"name": "Sachin", "age": 30}',),
('{"name": "Narendra", "age": 25}',),
('{"name": "Jacky", "age": 40}',)
]
df = spark.createDataFrame(data, ['json_data'])
# Show the DataFrame
df.show(truncate=False)

Python
COPY
Output:

+-------------------------------+
|json_data                      |
+-------------------------------+
|{"name": "Sachin", "age": 30}  |
|{"name": "Narendra", "age": 25}|
|{"name": "Jacky", "age": 40}   |
+-------------------------------+

Bash
COPY
In this example, we have a DataFrame named df with a single column called ‘json_data’, which
contains JSON strings representing people’s information.
Now, let’s use the json_tuple function to extract the values of the ‘name’ and ‘age’ attributes
from the JSON strings:

from pyspark.sql.functions import json_tuple


# Extract 'name' and 'age' attributes using json_tuple
extracted_data = df.select(json_tuple('json_data', 'name', 'age').alias('name', 'age'))
# Show the extracted data
extracted_data.show(truncate=False)

Python
COPY
Output
+--------+---+
|name    |age|
+--------+---+
|Sachin  |30 |
|Narendra|25 |
|Jacky   |40 |
+--------+---+

Bash
COPY
In the above code, we use the json_tuple function to extract the ‘name’ and ‘age’ attributes from
the ‘json_data’ column. We specify the attribute names as arguments to json_tuple (‘name’ and
‘age’), and use the alias method to assign meaningful column names to the extracted attributes.
The resulting extracted_data DataFrame contains two columns, ‘name’ and ‘age’, populated with the extracted values.
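When you only need a field or two, or a field nested deeper in the JSON, the get_json_object function with a JSON path expression is a common alternative; a minimal sketch on the same df:

from pyspark.sql.functions import get_json_object

# Extract individual attributes with JSON path expressions
extracted_alt = df.select(
    get_json_object("json_data", "$.name").alias("name"),
    get_json_object("json_data", "$.age").alias("age")
)
extracted_alt.show(truncate=False)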
PySpark : Finding the cube root of the given value using PySpark
The pyspark.sql.functions.cbrt(col) function in PySpark computes the cube root of the given
value. It takes a column as input and returns a new column with the cube root values.
To use the cbrt function in PySpark, you need to import it from the pyspark.sql.functions module. Here is an example to illustrate the usage of pyspark.sql.functions.cbrt(col):

from pyspark.sql import SparkSession


from pyspark.sql.functions import col, cbrt
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame with a column of values
data = [(1,), (8,), (27,), (64,)]
df = spark.createDataFrame(data, ['value'])
# Apply the cube root transformation using cbrt() function
transformed_df = df.withColumn('cbrt_value', cbrt(col('value')))
# Show the transformed DataFrame
transformed_df.show()

Python
COPY
Output

+-----+----------+
|value|cbrt_value|
+-----+----------+
| 1| 1.0|
| 8| 2.0|
| 27| 3.0|
| 64| 4.0|
+-----+----------+

Bash
COPY
We import the cbrt function from pyspark.sql.functions. Then, we use the cbrt() function directly
in the withColumn method to apply the cube root transformation to the ‘value’ column. The
col(‘value’) expression retrieves the column ‘value’, and cbrt(col(‘value’)) computes the cube
root of that column.
Now, the transformed_df DataFrame will contain the expected cube root values in the
‘cbrt_value’ column.
PySpark : Identify the grouping level in data after performing a group by operation with cube
or rollup in PySpark [grouping_id]
pyspark.sql.functions.grouping_id(*cols)
This function is valuable when you need to identify the grouping level in data after performing a
group by operation with cube or rollup. In this article, we will delve into the details of the
grouping_id function and its usage with an example.
The grouping_id function signature in PySpark is as follows:

pyspark.sql.functions.grouping_id(*cols)

Bash
COPY
The function can be called with no arguments, in which case it covers all of the grouping columns, or with the specific columns you are interested in.
The grouping_id function is used in conjunction with the cube or rollup operations and returns a bit vector that identifies the level of grouping: each bit is 0 when the corresponding column takes part in the grouping and 1 when it has been aggregated away. Rows grouped by all columns therefore receive ID 0, while the grand-total row receives the largest ID.
Example Usage
Let’s go through a simple example to understand the usage of the grouping_id function.
Suppose we have a DataFrame named df containing three columns: ‘City’, ‘Product’, and ‘Sales’.

from pyspark.sql import SparkSession


spark = SparkSession.builder.getOrCreate()
data = [("New York", "Apple", 100),
("Los Angeles", "Orange", 200),
("New York", "Banana", 150),
("Los Angeles", "Apple", 120),
("New York", "Orange", 75),
("Los Angeles", "Banana", 220)]
df = spark.createDataFrame(data, ["City", "Product", "Sales"])
df.show()

Python
COPY
Result : DataFrame

+-----------+-------+-----+
| City|Product|Sales|
+-----------+-------+-----+
| New York| Apple| 100|
|Los Angeles| Orange| 200|
| New York| Banana| 150|
|Los Angeles| Apple| 120|
| New York| Orange| 75|
|Los Angeles| Banana| 220|
+-----------+-------+-----+

Python
COPY
Now, let’s perform a cube operation on the ‘City’ and ‘Product’ columns and compute the total
‘Sales’ for each group. Also, let’s add a grouping_id column to identify the level of grouping.

from pyspark.sql.functions import sum, grouping_id


df_grouped = df.cube("City", "Product").agg(sum("Sales").alias("TotalSales"),
grouping_id().alias("GroupingID"))
df_grouped.orderBy("GroupingID").show()

Python
COPY
The orderBy function is used here to sort the result by the ‘GroupingID’ column. The output will
look something like this:

+-----------+-------+----------+----------+
| City|Product|TotalSales|GroupingID|
+-----------+-------+----------+----------+
| New York| Banana| 150| 0|
|Los Angeles| Orange| 200| 0|
|Los Angeles| Apple| 120| 0|
| New York| Apple| 100| 0|
| New York| Orange| 75| 0|
|Los Angeles| Banana| 220| 0|
| New York| null| 325| 1|
|Los Angeles| null| 540| 1|
| null| Apple| 220| 2|
| null| Banana| 370| 2|
| null| Orange| 275| 2|
| null| null| 865| 3|
+-----------+-------+----------+----------+

Bash
COPY
As you can see, the grouping_id function provides a numerical identifier that describes the level
of grouping in the DataFrame, with smaller values corresponding to more columns being used for
grouping.
The grouping_id function is a powerful tool for understanding the level of grouping in your data
when using cube or rollup operations in PySpark. It provides valuable insights, especially when
dealing with complex datasets with multiple levels of aggregation.
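Because the ID is deterministic, it can be used to pick out specific aggregation levels; a small sketch reusing the df_grouped result above:

# Grand total only (both City and Product aggregated away)
df_grouped.filter("GroupingID = 3").show()

# City-level subtotals only (Product aggregated away)
df_grouped.filter("GroupingID = 1").show()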
PySpark : Calculating the exponential of a given column in PySpark [exp]
PySpark offers the exp function in its pyspark.sql.functions module, which calculates the
exponential of a given column.
In this article, we will delve into the details of this function, exploring its usage through an
illustrative example.
Function Signature
The exp function signature in PySpark is as follows:

pyspark.sql.functions.exp(col)

Bash
COPY
The function takes a single argument:
col: A column expression representing a column in a DataFrame. The column should contain
numeric data for which you want to compute the exponential.
Example Usage
Let’s examine a practical example to better understand the exp function. Suppose we have a
DataFrame named df containing a single column, col1, with five numeric values.

from pyspark.sql import SparkSession


from pyspark.sql.functions import lit
spark = SparkSession.builder.getOrCreate()
data = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
df = spark.createDataFrame(data, ["col1"])
df.show()

Python
COPY
Result :  DataFrame:

+----+
|col1|
+----+
| 1.0|
| 2.0|
| 3.0|
| 4.0|
| 5.0|
+----+

Bash
COPY
Now, we wish to compute the exponential of each value in the col1 column. We can achieve this
using the exp function:

from pyspark.sql.functions import exp


df_exp = df.withColumn("col1_exp", exp(df["col1"]))
df_exp.show()

Python
COPY
In this code, the withColumn function is utilized to add a new column to the DataFrame. This new
column, col1_exp, will contain the exponential of each value in the col1 column. The output will
resemble the following:

+----+------------------+
|col1| col1_exp|
+----+------------------+
| 1.0|2.7182818284590455|
| 2.0| 7.38905609893065|
| 3.0|20.085536923187668|
| 4.0|54.598150033144236|
| 5.0| 148.4131591025766|
+----+------------------+

Bash
COPY
As you can see, the col1_exp column now holds the exponential of the values in the col1 column.
PySpark’s exp function is a beneficial tool for computing the exponential of numeric data. It is a
must-have in the toolkit of data scientists and engineers dealing with large datasets, as it
empowers them to perform complex transformations with ease.
PySpark : An Introduction to the PySpark encode Function
PySpark provides the encode function in its pyspark.sql.functions module, which is useful for encoding a column of strings into a binary column using a specified character set.
In this article, we will discuss this function in detail and walk through an example of how it can be
used in a real-world scenario.
Function Signature
The encode function signature in PySpark is as follows:

pyspark.sql.functions.encode(col, charset)

Bash
COPY
This function takes two arguments:
col: A column expression representing a column in a DataFrame. This column should contain
string data to be encoded into binary.
charset: A string representing the character set to be used for encoding. This can be one of US-
ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, or UTF-16.
Example Usage
Let’s walk through a simple example to understand how to use this function.
Assume we have a DataFrame named df containing one column, col1, which has two rows of
strings: ‘Hello’ and ‘World’.

from pyspark.sql import SparkSession


from pyspark.sql.functions import lit
spark = SparkSession.builder.getOrCreate()
data = [("Hello",), ("World",)]
df = spark.createDataFrame(data, ["col1"])
df.show()

Python
COPY
This will display the following DataFrame:

+-----+
|col1 |
+-----+
|Hello|
|World|
+-----+

Bash
COPY
Now, let’s say we want to encode these strings into a binary format using the UTF-8 charset. We
can do this using the encode function as follows:

from pyspark.sql.functions import encode


df_encoded = df.withColumn("col1_encoded", encode(df["col1"], "UTF-8"))
df_encoded.show()

Python
COPY
The withColumn function is used here to add a new column to the DataFrame. This new column,
col1_encoded, will contain the binary encoded representation of the strings in the col1 column.
The output will look something like this:

+-----+----------------+
|col1 |col1_encoded    |
+-----+----------------+
|Hello|[48 65 6C 6C 6F]|
|World|[57 6F 72 6C 64]|
+-----+----------------+

Bash
COPY
The col1_encoded column now contains the binary representation of the strings in the col1
column, encoded using the UTF-8 character set.
PySpark’s encode function is a useful tool for converting string data into binary format, and it’s incredibly flexible with its ability to support multiple character sets. It’s a valuable tool for any data scientist or engineer who works with large datasets and needs to perform character-set conversions or prepare string data for binary storage and transfer.
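The reverse operation is available through the decode function; a minimal sketch that restores the original strings from the df_encoded DataFrame above:

from pyspark.sql.functions import decode

# Decode the binary column back into a string column using the same charset
df_decoded = df_encoded.withColumn("col1_decoded", decode(df_encoded["col1_encoded"], "UTF-8"))
df_decoded.show()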
PySpark : Subtracting a specified number of days from a given date in PySpark [date_sub]
In this article, we will delve into the date_sub function in PySpark. This versatile function allows
us to subtract a specified number of days from a given date, enabling us to perform date-based
operations and gain valuable insights from our data.
from pyspark.sql.functions import date_sub
Understanding date_sub:
The date_sub function in PySpark facilitates date subtraction by subtracting a specified number
of days from a given date. It helps us analyze historical data, calculate intervals, and perform
various time-based computations within our Spark applications.
Syntax:
The syntax for using date_sub in PySpark is as follows:

date_sub(start_date, days)

Python
COPY
Here, start_date represents the initial date from which we want to subtract days, and days
indicates the number of days to subtract.
Example Usage:
To illustrate the usage of date_sub in PySpark, let’s consider a scenario where we have a dataset
containing sales records. We want to analyze sales data from the past 7 days.
Step 1: Importing the necessary libraries and creating a SparkSession.

from pyspark.sql import SparkSession


from pyspark.sql.functions import date_sub

# Create a SparkSession
spark = SparkSession.builder \
.appName("date_sub Example at Freshers.in") \
.getOrCreate()

Python
COPY
Step 2: Creating a sample DataFrame with hardcoded values.

# Sample DataFrame with hardcoded values


data = [("Product A", "2023-05-15", 100),
("Product B", "2023-05-16", 150),
("Product C", "2023-05-17", 200),
("Product D", "2023-05-18", 120),
("Product E", "2023-05-19", 90),
("Product F", "2023-05-20", 180),
("Product G", "2023-05-21", 210),
("Product H", "2023-05-22", 160)]
df = spark.createDataFrame(data, ["Product", "Date", "Sales"])

# Show the initial DataFrame


df.show()

Python
COPY
Result 
+---------+----------+-----+
| Product | Date |Sales|
+---------+----------+-----+
|Product A|2023-05-15| 100|
|Product B|2023-05-16| 150|
|Product C|2023-05-17| 200|
|Product D|2023-05-18| 120|
|Product E|2023-05-19| 90|
|Product F|2023-05-20| 180|
|Product G|2023-05-21| 210|
|Product H|2023-05-22| 160|
+---------+----------+-----+

Bash
COPY
Step 3: Subtracting days using date_sub.

# Subtract 7 days from the current date


df_subtracted = df.withColumn("SubtractedDate", date_sub(df.Date, 7))

# Show the resulting DataFrame


df_subtracted.show()

Python
COPY
Result 

+---------+----------+-----+--------------+
| Product| Date|Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product A|2023-05-15| 100| 2023-05-08|
|Product B|2023-05-16| 150| 2023-05-09|
|Product C|2023-05-17| 200| 2023-05-10|
|Product D|2023-05-18| 120| 2023-05-11|
|Product E|2023-05-19| 90| 2023-05-12|
|Product F|2023-05-20| 180| 2023-05-13|
|Product G|2023-05-21| 210| 2023-05-14|
|Product H|2023-05-22| 160| 2023-05-15|
+---------+----------+-----+--------------+

Bash
COPY
In the above code snippet, we used the `date_sub` function to subtract 7 days from the
“Date” column in the DataFrame. The resulting column, “SubtractedDate,” contains the dates
obtained after subtracting 7 days.
Step 4: Filtering data based on the subtracted date.

# Filter sales data from the past 7 days


recent_sales = df_subtracted.filter(df_subtracted.SubtractedDate >= '2023-05-15')

# Show the filtered DataFrame


recent_sales.show()

Python
COPY
Result

+---------+----------+-----+--------------+
| Product | Date |Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product H|2023-05-22| 160| 2023-05-15|
+---------+----------+-----+--------------+

Bash
COPY
By filtering on the “SubtractedDate” column we kept only the rows whose subtracted date is on or after ‘2023-05-15’, i.e. orders whose original date falls on or after ‘2023-05-22’. With this sample data, only Product H satisfies the condition.
Here we explored the functionality of PySpark’s date_sub function, which allows us to subtract a
specified number of days from a given date. By incorporating this powerful function into our
PySpark workflows, we can perform date-based operations, analyze historical data, and gain
valuable insights from our datasets. Whether it’s calculating intervals, filtering data based on
specific timeframes, or performing time-based computations, the date_sub function proves to be
an invaluable tool for date subtraction in PySpark applications.
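In day-to-day pipelines the cutoff is usually computed dynamically instead of being hardcoded; a brief sketch that combines current_date with date_sub on the same df:

from pyspark.sql.functions import col, current_date, date_sub, to_date

# Keep orders from the last 7 days relative to the day the job runs
recent = df.filter(to_date(col("Date"), "yyyy-MM-dd") >= date_sub(current_date(), 7))
recent.show()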
PySpark : A Comprehensive Guide to PySpark’s current_date and current_timestamp Functions
PySpark enables data engineers and data scientists to perform distributed data processing tasks
efficiently. In this article, we will explore two essential PySpark functions: current_date and
current_timestamp. These functions allow us to retrieve the current date and timestamp within a
Spark application, enabling us to perform time-based operations and gain valuable insights from
our data.
Understanding current_date and current_timestamp:
Before diving into the details, let’s take a moment to understand the purpose of these functions:
current_date: This function returns the current date as a date type in the format ‘yyyy-MM-dd’. It
retrieves the date based on the system clock of the machine running the Spark application.
current_timestamp: This function returns the current timestamp as a timestamp type in the
format ‘yyyy-MM-dd HH:mm:ss.sss’. It provides both the date and time information based on the
system clock of the machine running the Spark application.
Example Usage:
To demonstrate the usage of current_date and current_timestamp in PySpark, let’s consider a
scenario where we have a dataset containing customer orders. We want to analyze the orders
placed on the current date and timestamp.
Step 1: Importing the necessary libraries and creating a SparkSession.

from pyspark.sql import SparkSession


from pyspark.sql.functions import current_date, current_timestamp

# Create a SparkSession
spark = SparkSession.builder \
.appName("Current Date and Timestamp Example at Freshers.in") \
.getOrCreate()

Python
COPY
Step 2: Creating a sample DataFrame.

# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "OrderID"])

# Adding current date and timestamp columns


df_with_date = df.withColumn("CurrentDate", current_date())
df_with_timestamp = df_with_date.withColumn("CurrentTimestamp", current_timestamp())

# Show the resulting DataFrame


df_with_timestamp.show()

Python
COPY
Output

+-------+------+------------+--------------------+
| Name|OrderID|CurrentDate | CurrentTimestamp |
+-------+------+------------+--------------------+
| Alice| 1| 2023-05-22|2023-05-22 10:15:...|
| Bob| 2| 2023-05-22|2023-05-22 10:15:...|
|Charlie| 3| 2023-05-22|2023-05-22 10:15:...|
+-------+------+------------+--------------------+

Bash
COPY
As seen in the output, we added two new columns to the DataFrame: “CurrentDate” and
“CurrentTimestamp.” These columns contain the current date and timestamp for each row in the
DataFrame.
Step 3: Filtering data based on the current date.

# Filter orders placed on the current date


current_date_orders = df_with_timestamp.filter(df_with_timestamp.CurrentDate ==
current_date())

# Show the filtered DataFrame


current_date_orders.show()

Python
COPY
Output:

+-------+------+------------+--------------------+
| Name|OrderID|CurrentDate | CurrentTimestamp |
+-------+------+------------+--------------------+
| Alice| 1| 2023-05-22|2023-05-22 10:15:...|
| Bob| 2| 2023-05-22|2023-05-22 10:15:...|
|Charlie| 3| 2023-05-22|2023-05-22 10:15:...|
+-------+------+------------+--------------------+

Bash
COPY
Step 4: Performing time-based operations using current_timestamp.

# Calculate the time difference between current timestamp and order placement time
df_with_timestamp = df_with_timestamp.withColumn("TimeElapsed", current_timestamp() -
df_with_timestamp.CurrentTimestamp)

# Show the DataFrame with the time elapsed


df_with_timestamp.show()

Python
COPY
Output

+-------+------+------------+--------------------+-------------------+
| Name|OrderID|CurrentDate | CurrentTimestamp | TimeElapsed |
+-------+------+------------+--------------------+-------------------+
| Alice| 1| 2023-05-22|2023-05-22 10:15:...| 00:01:23.456789 |
| Bob| 2| 2023-05-22|2023-05-22 10:15:...| 00:00:45.678912 |
|Charlie| 3| 2023-05-22|2023-05-22 10:15:...| 00:02:10.123456 |
+-------+------+------------+--------------------+-------------------+

Bash
COPY
In the above code snippet, we subtract the “CurrentTimestamp” column from current_timestamp(). Because both expressions are evaluated within the same query, the real difference will normally be zero or negligible, so the durations shown above are purely illustrative. In practice you would compare current_timestamp() against a stored event timestamp such as the actual order placement time; the resulting interval is useful for analyzing time-based metrics and understanding the timing patterns of the orders.
In this article, we explored the powerful PySpark functions current_date and current_timestamp.
These functions provide us with the current date and timestamp within a Spark application,
enabling us to perform time-based operations and gain valuable insights from our data. By
incorporating these functions into our PySpark workflows, we can effectively handle time-related
tasks and leverage temporal information for various data processing and analysis tasks.
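To measure a real elapsed time you would normally compare current_timestamp() against a stored event timestamp rather than a column that was itself produced by current_timestamp(). The sketch below uses hypothetical hardcoded order times purely for illustration:

from pyspark.sql.functions import col, current_timestamp, to_timestamp, unix_timestamp

# Hypothetical orders with a stored placement timestamp
orders = spark.createDataFrame(
    [("Alice", "2023-05-22 09:30:00"), ("Bob", "2023-05-22 10:05:00")],
    ["Name", "OrderTime"],
).withColumn("OrderTime", to_timestamp("OrderTime", "yyyy-MM-dd HH:mm:ss"))

# Elapsed seconds between order placement and the time the query runs
orders = orders.withColumn(
    "ElapsedSeconds",
    unix_timestamp(current_timestamp()) - unix_timestamp(col("OrderTime")),
)
orders.show(truncate=False)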
PySpark : Understanding the ‘take’ Action in PySpark with Examples. [Retrieves a specified
number of elements from the beginning of an RDD or DataFrame]
In this article, we will focus on the ‘take’ action, which is commonly used in PySpark operations.
We’ll provide a brief explanation of the ‘take’ action, followed by a simple example to help you
understand its usage.
What is the ‘take’ Action in PySpark?
The ‘take’ action in PySpark retrieves a specified number of elements from the beginning of an
RDD (Resilient Distributed Dataset) or DataFrame. It is an action operation, which means it
triggers the execution of any previous transformations on the data, returning the result to the
driver program. This operation is particularly useful for previewing the contents of an RDD or
DataFrame without having to collect all the elements, which can be time-consuming and
memory-intensive for large datasets.
Syntax:
take(num)
Where num is the number of elements to retrieve from the RDD or DataFrame.
Simple Example
Let’s go through a simple example using the ‘take’ action in PySpark. First, we’ll create a PySpark
RDD and then use the ‘take’ action to retrieve a specified number of elements.
RDD Version
Step 1: Start a PySpark session
Before starting with the example, you’ll need to start a PySpark session:

from pyspark.sql import SparkSession


spark = SparkSession.builder \
.appName("Understanding the 'take' action in PySpark") \
.getOrCreate()

Python
COPY
Step 2: Create an RDD
Now, let’s create an RDD containing some numbers:

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


rdd = spark.sparkContext.parallelize(data)

Python
COPY
Step 3: Use the ‘take’ action
We’ll use the ‘take’ action to retrieve the first 5 elements of the RDD:

first_five_elements = rdd.take(5)
print("The first five elements of the RDD are:", first_five_elements)

Python
COPY
Output:
The first five elements of the RDD are: [1, 2, 3, 4, 5]

Bash
COPY
We introduced the ‘take’ action in PySpark, which allows you to retrieve a specified number of
elements from the beginning of an RDD or DataFrame. We provided a simple example to help
you understand how the ‘take’ action works. It is a handy tool for previewing the contents of an
RDD or DataFrame, especially when working with large datasets, and can be a valuable part of
your PySpark toolkit.
DataFrame Version
Let’s go through an example using a DataFrame and the ‘take’ action in PySpark. We’ll create a
DataFrame with some sample data, and then use the ‘take’ action to retrieve a specified number
of rows.

from pyspark.sql import SparkSession


spark = SparkSession.builder \
.appName("Understanding the 'take' action in PySpark with DataFrames") \
.getOrCreate()
from pyspark.sql import Row
data = [
Row(name="Alice", age=30, city="New York"),
Row(name="Bob", age=28, city="San Francisco"),
Row(name="Cathy", age=25, city="Los Angeles"),
Row(name="David", age=32, city="Chicago"),
Row(name="Eva", age=29, city="Seattle")
]
schema = "name STRING, age INT, city STRING"
df = spark.createDataFrame(data, schema=schema)
first_three_rows = df.take(3)
print("The first three rows of the DataFrame are:")
for row in first_three_rows:
    print(row)

Python
COPY
Output

The first three rows of the DataFrame are:


Row(name='Alice', age=30, city='New York')
Row(name='Bob', age=28, city='San Francisco')
Row(name='Cathy', age=25, city='Los Angeles')

Bash
COPY
We created a DataFrame with some sample data and used the ‘take’ action to retrieve a specified
number of rows. This operation is useful for previewing the contents of a DataFrame, especially
when working with large datasets.
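A related pattern worth knowing: limit(n) also restricts the result to n rows, but it returns a new, lazily evaluated DataFrame instead of bringing rows back to the driver the way take(n) does; a minimal sketch on the same df:

# limit() stays distributed and lazy; show() then displays the rows
df.limit(3).show()

# take() eagerly returns a Python list of Row objects to the driver
rows = df.take(3)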
PySpark : Exploring PySpark’s joinByKey on DataFrames: [combining data from two different
DataFrames] – A Comprehensive Guide
In PySpark, join operations are a fundamental technique for combining data from two different
DataFrames based on a common key. While there isn’t a specific joinByKey function, PySpark
provides various join functions that are applicable to DataFrames. In this article, we will explore
the different types of join operations available in PySpark for DataFrames and provide a concrete
example with hardcoded values instead of reading from a file.
Types of Join Operations in PySpark for DataFrames
1. Inner join: Combines rows from both DataFrames that have matching keys.
2. Left outer join: Retains all rows from the left DataFrame and matching rows from the right
DataFrame, filling with null values when there is no match.
3. Right outer join: Retains all rows from the right DataFrame and matching rows from the
left DataFrame, filling with null values when there is no match.
4. Full outer join: Retains all rows from both DataFrames, filling with null values when there is
no match.
Inner join using DataFrames
Suppose we have two datasets, one containing sales data for a chain of stores, and the other
containing store information. The sales data includes store ID, product ID, and the number of
units sold, while the store information includes store ID and store location. Our goal is to
combine these datasets based on store ID.

#Exploring PySpark's joinByKey on DataFrames: A Comprehensive Guide @ Freshers.in


from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize the Spark session


spark = SparkSession.builder.appName("join example @ Freshers.in ").getOrCreate()

# Sample sales data as (store_id, product_id, units_sold)


sales_data = [
Row(store_id=1, product_id=6567876, units_sold=5),
Row(store_id=2, product_id=6567876, units_sold=7),
Row(store_id=1, product_id=102, units_sold=3),
Row(store_id=2, product_id=9878767, units_sold=10),
Row(store_id=3, product_id=6567876, units_sold=4),
Row(store_id=3, product_id=5565455, units_sold=6),
Row(store_id=4, product_id=9878767, units_sold=6),
Row(store_id=4, product_id=5565455, units_sold=6),
Row(store_id=4, product_id=9878767, units_sold=6),
Row(store_id=5, product_id=5565455, units_sold=6),
]

# Sample store information as (store_id, store_location)


store_info = [
Row(store_id=1, store_location="New York"),
Row(store_id=2, store_location="Los Angeles"),
Row(store_id=3, store_location="Chicago"),
Row(store_id=1, store_location="Maryland"),
Row(store_id=2, store_location="Texas")
]

# Create DataFrames from the sample data


sales_df = spark.createDataFrame(sales_data)
store_info_df = spark.createDataFrame(store_info)

# Perform the join operation


joined_df = sales_df.join(store_info_df, on="store_id", how="inner")

# Collect the results and print


for row in joined_df.collect():
    print(f"Store {row.store_id} ({row.store_location}) sales data: "
          f"(Product {row.product_id}, Units Sold {row.units_sold})")

Python
COPY
Output:

Store 1 (New York) sales data: (Product 6567876, Units Sold 5)


Store 1 (Maryland) sales data: (Product 6567876, Units Sold 5)
Store 1 (New York) sales data: (Product 102, Units Sold 3)
Store 1 (Maryland) sales data: (Product 102, Units Sold 3)
Store 2 (Los Angeles) sales data: (Product 6567876, Units Sold 7)
Store 2 (Texas) sales data: (Product 6567876, Units Sold 7)
Store 2 (Los Angeles) sales data: (Product 9878767, Units Sold 10)
Store 2 (Texas) sales data: (Product 9878767, Units Sold 10)
Store 3 (Chicago) sales data: (Product 6567876, Units Sold 4)
Store 3 (Chicago) sales data: (Product 5565455, Units Sold 6)

Bash
COPY
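The same two DataFrames can be combined with the other join types simply by changing the how argument. Below is a minimal sketch (an illustrative addition reusing sales_df and store_info_df from above; outputs are omitted):

# Left outer join: keep every sales row; store_location is null where a store has no match
left_df = sales_df.join(store_info_df, on="store_id", how="left")

# Right outer join: keep every store row; sales columns are null where a store had no sales
right_df = sales_df.join(store_info_df, on="store_id", how="right")

# Full outer join: keep rows from both DataFrames
full_df = sales_df.join(store_info_df, on="store_id", how="outer")

left_df.show()
right_df.show()
full_df.show()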
PySpark : Exploring PySpark’s joinByKey on RDD : A Comprehensive Guide

In PySpark, join operations are a fundamental technique for combining data from two different
RDDs based on a common key. Although there isn’t a specific joinByKey function, PySpark
provides several join functions that are applicable to Key-Value pair RDDs. In this article, we will
explore the different types of join operations available in PySpark and provide a concrete
example with hardcoded values instead of reading from a file.
Types of Join Operations in PySpark
1. join: Performs an inner join between two RDDs based on matching keys.
2. leftOuterJoin: Performs a left outer join between two RDDs, retaining all keys from the left
RDD and matching keys from the right RDD.
3. rightOuterJoin: Performs a right outer join between two RDDs, retaining all keys from the
right RDD and matching keys from the left RDD.
4. fullOuterJoin: Performs a full outer join between two RDDs, retaining all keys from both
RDDs.
Example: Inner join using ‘join’
Suppose we have two datasets, one containing sales data for a chain of stores, and the other
containing store information. The sales data includes store ID, product ID, and the number of
units sold, while the store information includes store ID and store location. Our goal is to
combine these datasets based on store ID.

#PySpark's joinByKey on RDD: A Comprehensive Guide @ Freshers.in


from pyspark import SparkContext
# Initialize the Spark context
sc = SparkContext("local", "join @ Freshers.in")

# Sample sales data as (store_id, (product_id, units_sold))


sales_data = [
(1, (6567876, 5)),
(2, (6567876, 7)),
(1, (4643987, 3)),
(2, (4643987, 10)),
(3, (6567876, 4)),
(4, (9878767, 6)),
(4, (5565455, 6)),
(4, (9878767, 6)),
(5, (5565455, 6)),
]

# Sample store information as (store_id, store_location)


store_info = [
(1, "New York"),
(2, "Los Angeles"),
(3, "Chicago"),
(4, "Maryland"),
(5, "Texas")
]

# Create RDDs from the sample data


sales_rdd = sc.parallelize(sales_data)
store_info_rdd = sc.parallelize(store_info)

# Perform the join operation


joined_rdd = sales_rdd.join(store_info_rdd)

# Collect the results and print


for store_id, (sales, location) in joined_rdd.collect():
    print(f"Store {store_id} ({location}) sales data: {sales}")

Python
COPY
Output:

Store 2 (Los Angeles) sales data: (6567876, 7)


Store 2 (Los Angeles) sales data: (4643987, 10)
Store 4 (Maryland) sales data: (9878767, 6)
Store 4 (Maryland) sales data: (5565455, 6)
Store 4 (Maryland) sales data: (9878767, 6)
Store 1 (New York) sales data: (6567876, 5)
Store 1 (New York) sales data: (4643987, 3)
Store 3 (Chicago) sales data: (6567876, 4)
Store 5 (Texas) sales data: (5565455, 6)

Bash
COPY
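The outer variants follow the same pattern; unmatched keys produce None on the missing side. A short sketch reusing sales_rdd and store_info_rdd from the example above:

# Left outer join: every key from sales_rdd is kept; location is None when there is no match
left_rdd = sales_rdd.leftOuterJoin(store_info_rdd)

# Full outer join: keys from both RDDs are kept
full_rdd = sales_rdd.fullOuterJoin(store_info_rdd)

for store_id, (sales, location) in left_rdd.collect():
    print(f"Store {store_id} ({location}) sales data: {sales}")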
In this article, we explored the different types of join operations in PySpark for Key-Value pair
RDDs. We provided a concrete example using hardcoded values for an inner join between two
RDDs based on a common key. By leveraging join operations in PySpark, you can combine data
from various sources, enabling more comprehensive data analysis and insights.
PySpark : Unraveling PySpark’s groupByKey: A Comprehensive Guide

In this article, we will explore the groupByKey transformation in PySpark. groupByKey is an


essential tool when working with Key-Value pair RDDs (Resilient Distributed Datasets), as it
allows developers to group the values for each key. We will discuss the syntax, usage, and
provide a concrete example with hardcoded values instead of reading from a file.
What is groupByKey?
groupByKey is a transformation operation in PySpark that groups the values for each key in a
Key-Value pair RDD. This operation takes no arguments and returns an RDD of (key, values) pairs,
where ‘values’ is an iterable of all values associated with a particular key.
Syntax
The syntax for the groupByKey function is as follows:
groupByKey()
Example
Let’s dive into an example to better understand the usage of groupByKey. Suppose we have a
dataset containing sales data for a chain of stores. The data includes store ID, product ID, and the
number of units sold. Our goal is to group the sales data by store ID.

#Unraveling PySpark's groupByKey: A Comprehensive Guide @ Freshers.in


from pyspark import SparkContext
# Initialize the Spark context
sc = SparkContext("local", "groupByKey @ Freshers.in")

# Sample sales data as (store_id, (product_id, units_sold))


sales_data = [
(1, (6567876, 5)),
(2, (6567876, 7)),
(1, (4643987, 3)),
(2, (4643987, 10)),
(3, (6567876, 4)),
(4, (9878767, 6)),
(4, (5565455, 6)),
(4, (9878767, 6)),
(5, (5565455, 6)),
]

# Create the RDD from the sales_data list


sales_rdd = sc.parallelize(sales_data)

# Perform the groupByKey operation


grouped_sales_rdd = sales_rdd.groupByKey()

# Collect the results and print


for store_id, sales in grouped_sales_rdd.collect():
    sales_list = list(sales)
    print(f"Store {store_id} sales data: {sales_list}")

Python
COPY
Output:

Store 1 sales data: [(6567876, 5), (4643987, 3)]


Store 2 sales data: [(6567876, 7), (4643987, 10)]
Store 3 sales data: [(6567876, 4)]
Store 4 sales data: [(9878767, 6), (5565455, 6), (9878767, 6)]
Store 5 sales data: [(5565455, 6)]

Bash
COPY
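If the end goal is a per-key aggregate rather than the grouped values themselves, the grouped RDD can be post-processed with mapValues, as in the sketch below; note that for plain sums reduceByKey is usually preferred because it combines values on the map side before shuffling:

# Sum the units_sold component of each store's grouped sales records
total_units_rdd = grouped_sales_rdd.mapValues(lambda sales: sum(units for _, units in sales))

for store_id, total_units in total_units_rdd.collect():
    print(f"Store {store_id} sold a total of {total_units} units.")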
Here, we have explored the groupByKey transformation in PySpark. This powerful function
allows developers to group values by their corresponding keys in Key-Value pair RDDs. We
covered the syntax, usage, and provided an example using hardcoded values. By leveraging
groupByKey, you can effectively organize and process your data in PySpark, making it an
indispensable tool in your Big Data toolkit.
PySpark : Mastering PySpark’s reduceByKey: A Comprehensive Guide

In this article, we will explore the reduceByKey transformation in PySpark. reduceByKey is a


crucial tool when working with Key-Value pair RDDs (Resilient Distributed Datasets), as it allows
developers to aggregate data by keys using a given function. We will discuss the syntax, usage,
and provide a concrete example with hardcoded values instead of reading from a file.
What is reduceByKey?
reduceByKey is a transformation operation in PySpark that enables the aggregation of values for
each key in a Key-Value pair RDD. This operation takes a single argument: the function to perform
the aggregation. It applies the aggregation function cumulatively to the values of each key.
Syntax
The syntax for the reduceByKey function is as follows:
reduceByKey(func)
where:
 func: The function that will be used to aggregate the values for each key
Example
Let’s dive into an example to better understand the usage of reduceByKey. Suppose we have a
dataset containing sales data for a chain of stores. The data includes store ID, product ID, and the
number of units sold. Our goal is to calculate the total units sold for each store.

#Mastering PySpark's reduceByKey: A Comprehensive Guide @ Freshers.in


from pyspark import SparkContext

# Initialize the Spark context


sc = SparkContext("local", "reduceByKey @ Freshers.in ")

# Sample sales data as (store_id, (product_id, units_sold))


sales_data = [
(1, (6567876, 5)),
(2, (6567876, 7)),
(1, (4643987, 3)),
(2, (4643987, 10)),
(3, (6567876, 4)),
(4, (9878767, 6)),
(4, (5565455, 6)),
(4, (9878767, 6)),
(5, (5565455, 6)),
]

# Create the RDD from the sales_data list


sales_rdd = sc.parallelize(sales_data)

# Map the data to (store_id, units_sold) pairs


store_units_rdd = sales_rdd.map(lambda x: (x[0], x[1][1]))

# Define the aggregation function


def sum_units(a, b):
    return a + b

# Perform the reduceByKey operation


total_sales_rdd = store_units_rdd.reduceByKey(sum_units)

# Collect the results and print


for store_id, total_units in total_sales_rdd.collect():
    print(f"Store {store_id} sold a total of {total_units} units.")

Python
COPY
Output:

Store 1 sold a total of 8 units.


Store 2 sold a total of 17 units.
Store 3 sold a total of 4 units.
Store 4 sold a total of 18 units.
Store 5 sold a total of 6 units.

Bash
COPY
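The named function is optional; the same aggregation is often written inline with a lambda:

# Equivalent aggregation using a lambda instead of sum_units
total_sales_rdd = store_units_rdd.reduceByKey(lambda a, b: a + b)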
Here we have explored the reduceByKey transformation in PySpark. This powerful function
allows developers to perform aggregations on Key-Value pair RDDs efficiently. We covered the
syntax, usage, and provided an example using hardcoded values. By leveraging reduceByKey, you
can simplify and optimize your data processing tasks in PySpark.
PySpark : Harnessing the Power of PySpark's foldByKey [aggregate data by keys using a given function]

In this article, we will explore the foldByKey transformation in PySpark. foldByKey is an essential
tool when working with Key-Value pair RDDs (Resilient Distributed Datasets), as it allows
developers to aggregate data by keys using a given function. We will discuss the syntax, usage,
and provide a concrete example with hardcoded values instead of reading from a file.
What is foldByKey?
foldByKey is a transformation operation in PySpark that enables the aggregation of values for
each key in a Key-Value pair RDD. This operation takes two arguments: the initial zero value and
the function to perform the aggregation. It applies the aggregation function cumulatively to the
values of each key, starting with the initial zero value.
Syntax
The syntax for the foldByKey function is as follows:
foldByKey(zeroValue, func)
where:
 zeroValue: The initial value used for the aggregation (commonly known as the zero value)
 func: The function that will be used to aggregate the values for each key
Example
Let’s dive into an example to better understand the usage of foldByKey. Suppose we have a
dataset containing sales data for a chain of stores. The data includes store ID, product ID, and the
number of units sold. Our goal is to calculate the total units sold for each store.

#Harnessing the Power of PySpark: A Deep Dive into foldByKey @ Freshers.in


from pyspark import SparkContext
# Initialize the Spark context
sc = SparkContext("local", "foldByKey @ Freshers.in ")
# Sample sales data as (store_id, (product_id, units_sold))
sales_data = [
(1, (189876, 5)),
(2, (189876, 7)),
(1, (267434, 3)),
(2, (267434, 10)),
(3, (189876, 4)),
(3, (267434, 6)),
]

# Create the RDD from the sales_data list


sales_rdd = sc.parallelize(sales_data)

# Map the data to (store_id, units_sold) pairs so the values match the type of the zero value


store_units_rdd = sales_rdd.map(lambda x: (x[0], x[1][1]))

# Define the aggregation function


def sum_units(a, b):
    return a + b

# Perform the foldByKey operation
total_sales_rdd = store_units_rdd.foldByKey(0, sum_units)

# Collect the results and print


for store_id, total_units in total_sales_rdd.collect():
    print(f"Store {store_id} sold a total of {total_units} units.")

Python
COPY
Output:

Store 1 sold a total of 8 units.


Store 2 sold a total of 17 units.
Store 3 sold a total of 10 units.

Bash
COPY
Here we have explored the foldByKey transformation in PySpark. This powerful function allows
developers to perform aggregations on Key-Value pair RDDs efficiently. We covered the syntax,
usage, and provided an example using hardcoded values. By leveraging foldByKey, you can
simplify and optimize your data processing tasks in PySpark, making it an essential tool in your
Big Data toolkit.
PySpark : Aggregation operations on key-value pair RDDs [combineByKey in PySpark]

In this article, we will explore the use of combineByKey in PySpark, a powerful and flexible
method for performing aggregation operations on key-value pair RDDs. We will provide a
detailed example.
First, let’s create a PySpark RDD:

# Using combineByKey in PySpark with a Detailed Example


from pyspark import SparkContext
sc = SparkContext("local", "combineByKey Example")
data = [("America", 1), ("Botswana", 2), ("America", 3), ("Botswana", 4), ("America", 5),
("Egypt", 6)]
rdd = sc.parallelize(data)

Python
COPY
Using combineByKey
Now, let’s use the combineByKey method to compute the average value for each key in the RDD:

def create_combiner(value):
    return (value, 1)

def merge_value(acc, value):
    total, count = acc
    return (total + value, count + 1)

def merge_combiners(acc1, acc2):
    total1, count1 = acc1
    total2, count2 = acc2
    return (total1 + total2, count1 + count2)

result_rdd = rdd.combineByKey(create_combiner, merge_value, merge_combiners)


average_rdd = result_rdd.mapValues(lambda acc: acc[0] / acc[1])
result_data = average_rdd.collect()

print("Average values per key:")


for key, value in result_data:
    print(f"{key}: {value:.2f}")

Python
COPY
In this example, we used the combineByKey method on the RDD, which requires three functions
as arguments:
1. create_combiner: A function that initializes the accumulator for each key. In our case, it creates a tuple with the value and a count of 1.
2. merge_value: A function that updates the accumulator for a key with a new value. It takes
the current accumulator and the new value, then updates the sum and count.
3. merge_combiners: A function that merges two accumulators for the same key. It takes
two accumulators and combines their sums and counts.
We then use mapValues to compute the average value for each key by dividing the sum by the
count.
The output will be:

Average values per key:


America: 3.00
Botswana: 3.00
Egypt: 6.00

Bash
COPY
Notes: 

RDD.combineByKey(createCombiner: Callable[[V], U], mergeValue: Callable[[U, V], U],


mergeCombiners: Callable[[U, U], U], numPartitions: Optional[int] = None, partitionFunc:
Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, U]]

Bash
COPY
Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C. Here users can
control the partitioning of the output RDD.
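For instance, the number of output partitions can be set through the numPartitions argument. A minimal sketch, reusing the functions defined above:

# Same aggregation, but force the combined result into 4 partitions
result_rdd_4p = rdd.combineByKey(create_combiner, merge_value, merge_combiners, numPartitions=4)
print(result_rdd_4p.getNumPartitions())  # 4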
PySpark : Retrieves the key-value pairs from an RDD as a dictionary [collectAsMap in PySpark]
In this article, we will explore the use of collectAsMap in PySpark, a method that retrieves the
key-value pairs from an RDD as a dictionary. We will provide a detailed example using hardcoded
values as input.
First, let’s create a PySpark RDD:

#collectAsMap in PySpark @ Freshers.in


from pyspark import SparkContext
sc = SparkContext("local", "collectAsMap @ Freshers.in ")
data = [("America", 1), ("Botswana", 2), ("Costa Rica", 3), ("Denmark", 4), ("Egypt", 5)]
rdd = sc.parallelize(data)

Python
COPY
Using collectAsMap
Now, let’s use the collectAsMap method to retrieve the key-value pairs from the RDD as a
dictionary:

result_map = rdd.collectAsMap()
print("Result as a Dictionary:")
for key, value in result_map.items():
    print(f"{key}: {value}")

Python
COPY
In this example, we used the collectAsMap method on the RDD, which returns a dictionary
containing the key-value pairs in the RDD. This can be useful when you need to work with the
RDD data as a native Python dictionary.
Output will be:

Result as a Dictionary:
America: 1
Botswana: 2
Costa Rica: 3
Denmark: 4
Egypt: 5

Bash
COPY
The resulting dictionary contains the key-value pairs from the RDD, which can now be accessed
and manipulated using standard Python dictionary operations.
Keep in mind that using collectAsMap can cause the driver to run out of memory if the RDD has a
large number of key-value pairs, as it collects all data to the driver. Use this method judiciously
and only when you are certain that the resulting dictionary can fit into the driver’s memory.
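If only a handful of keys are needed, collecting the whole RDD can be avoided; for example, lookup returns just the values for a single key. A small sketch using the RDD above:

# Fetch the values for one key without bringing the entire RDD to the driver
print(rdd.lookup("Botswana"))  # [2]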
Here, we explored the use of collectAsMap in PySpark, a method that retrieves the key-value
pairs from an RDD as a dictionary. We provided a detailed example using hardcoded values as
input, showcasing how to create an RDD with key-value pairs, use the collectAsMap method, and
interpret the results. collectAsMap can be useful in various scenarios when you need to work
with RDD data as a native Python dictionary, but it’s important to be cautious about potential
memory issues when using this method on large RDDs.
PySpark : Remove any key-value pair that has a key present in another RDD [subtractByKey]

In this article, we will explore the use of subtractByKey in PySpark, a transformation that returns
an RDD consisting of key-value pairs from one RDD by removing any pair that has a key present in
another RDD. We will provide a detailed example using hardcoded values as input.
First, let’s create two PySpark RDDs

#Using subtractByKey in PySpark @Freshers.in


from pyspark import SparkContext
sc = SparkContext("local", "subtractByKey @ Freshers.in ")
data1 = [("America", 1), ("Botswana", 2), ("Costa Rica", 3), ("Denmark", 4), ("Egypt", 5)]
data2 = [("Botswana", 20), ("Denmark", 40), ("Finland", 60)]

rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)

Python
COPY
Using subtractByKey
Now, let’s use the subtractByKey method to create a new RDD by removing key-value pairs from
rdd1 that have keys present in rdd2:

result_rdd = rdd1.subtractByKey(rdd2)
result_data = result_rdd.collect()
print("Result of subtractByKey:")
for element in result_data:
    print(element)

Python
COPY
In this example, we used the subtractByKey method on rdd1 and passed rdd2 as an argument.
The method returns a new RDD containing key-value pairs from rdd1 after removing any pair with
a key present in rdd2. The collect method is then used to retrieve the results.
Interpreting the Results

Result of subtractByKey:
('Costa Rica', 3)
('America', 1)
('Egypt', 5)

Bash
COPY
The resulting RDD contains key-value pairs from rdd1 with the key-value pairs having keys
“Botswana” and “Denmark” removed, as these keys are present in rdd2.
In this article, we explored the use of subtractByKey in PySpark, a transformation that returns an
RDD consisting of key-value pairs from one RDD by removing any pair that has a key present in
another RDD. We provided a detailed example using hardcoded values as input, showcasing how
to create two RDDs with key-value pairs, use the subtractByKey method, and interpret the
results. subtractByKey can be useful in various scenarios, such as filtering out unwanted data
based on keys or performing set-like operations on key-value pair RDDs.
PySpark : Assigning a unique identifier to each element in an RDD [ zipWithUniqueId in PySpark]

In this article, we will explore the use of zipWithUniqueId in PySpark, a method that assigns a
unique identifier to each element in an RDD. We will provide a detailed example using hardcoded
values as input.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
First, let’s create a PySpark RDD

#Using zipWithUniqueId in PySpark at Freshers.in


from pyspark import SparkContext
sc = SparkContext("local", "zipWithUniqueId @ Freshers.in")
data = ["America", "Botswana", "Costa Rica", "Denmark", "Egypt"]
rdd = sc.parallelize(data)

Python
COPY
Using zipWithUniqueId
Now, let’s use the zipWithUniqueId method to assign a unique identifier to each element in the
RDD:

unique_id_rdd = rdd.zipWithUniqueId()
unique_id_data = unique_id_rdd.collect()
print("Data with Unique IDs:")
for element in unique_id_data:
    print(element)

Python
COPY
In this example, we used the zipWithUniqueId method on the RDD, which creates a new RDD
containing tuples of the original elements and their corresponding unique identifier. The collect
method is then used to retrieve the results.
Interpreting the Results

Data with Unique IDs:


('America', 0)
('Botswana', 1)
('Costa Rica', 2)
('Denmark', 3)
('Egypt', 4)

PySpark : Feature that allows you to truncate the lineage of RDDs [Checkpointing in PySpark - used when you have a long chain of transformations]
In this article, we will explore checkpointing in PySpark, a feature that allows you to truncate the
lineage of RDDs, which can be beneficial in certain situations where you have a long chain of
transformations. We will provide a detailed example using hardcoded values as input.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
 A local directory to store checkpoint files
Let’s create a PySpark RDD

from pyspark import SparkContext

sc = SparkContext("local", "Checkpoint Example")


sc.setCheckpointDir("checkpoint_directory")  # Replace with the path to your local checkpoint directory

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

Python
COPY
Performing Transformations
Now, let’s apply several transformations to the RDD:

rdd1 = rdd.map(lambda x: x * 2)
rdd2 = rdd1.filter(lambda x: x > 2)
rdd3 = rdd2.map(lambda x: x * 3)

Python
COPY
Applying Checkpoint
Next, let’s apply a checkpoint to rdd2:

rdd2.checkpoint()

Python
COPY
By calling the checkpoint method on rdd2, we request PySpark to truncate the lineage of rdd2
during the next action. This will save the state of rdd2 to the checkpoint directory, and
subsequent operations on rdd2 and its derived RDDs will use the checkpointed data instead of
computing the full lineage.
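Because writing the checkpoint recomputes the RDD, the Spark documentation recommends persisting it first so the data is not evaluated twice. A brief sketch of that pattern (a general recommendation, not part of the example above):

# Persist before checkpointing so the checkpoint write reuses the cached partitions
rdd2.cache()
rdd2.checkpoint()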
Executing an Action
Finally, let’s execute an action on rdd3 to trigger the checkpoint:

result = rdd3.collect()
print("Result:", result)

Python
COPY
Output

Result: [12, 18, 24, 30]

Bash
COPY
When executing the collect action on rdd3, PySpark will process the checkpoint for rdd2. The
lineage of rdd3 will now be based on the checkpointed data instead of the full lineage from the
original RDD.
Analyzing the Benefits of Checkpointing
Checkpointing can be helpful in situations where you have a long chain of transformations,
leading to a large lineage graph. A large lineage graph may result in performance issues due to
the overhead of tracking dependencies and can also cause stack overflow errors during recursive
operations.
By applying checkpoints, you can truncate the lineage, reducing the overhead of tracking
dependencies and mitigating the risk of stack overflow errors.
However, checkpointing comes at the cost of writing data to the checkpoint directory, which can
be a slow operation, especially when using distributed file systems like HDFS. Therefore, it’s
essential to use checkpointing judiciously and only when necessary.
In this article, we explored checkpointing in PySpark, a feature that allows you to truncate the
lineage of RDDs. We provided a detailed example using hardcoded values as input, showcasing
how to create an RDD, apply transformations, set up checkpointing, and execute an action that
triggers the checkpoint. Checkpointing can be beneficial when dealing with long chains of
transformations that may cause performance issues or stack overflow errors. However, it’s
important to consider the trade-offs and use checkpointing only when necessary, as it can
introduce additional overhead due to writing data to the checkpoint directory.
PySpark : Assigning an index to each element in an RDD [zipWithIndex in PySpark]

In this article, we will explore the use of zipWithIndex in PySpark, a method that assigns an index
to each element in an RDD. We will provide a detailed example using hardcoded values as input.
First, let’s create a PySpark RDD

from pyspark import SparkContext


sc = SparkContext("local", "zipWithIndex Example @ Freshers.in")
data = ["USA", "INDIA", "CHINA", "JAPAN", "CANADA"]
rdd = sc.parallelize(data)

Python
COPY
Using zipWithIndex
Now, let’s use the zipWithIndex method to assign an index to each element in the RDD:

indexed_rdd = rdd.zipWithIndex()
indexed_data = indexed_rdd.collect()
print("Indexed Data:")
for element in indexed_data:
    print(element)

Python
COPY
In this example, we used the zipWithIndex method on the RDD, which creates a new RDD
containing tuples of the original elements and their corresponding index. The collect method is
then used to retrieve the results.
Interpreting the Results
The output of the example will be:

Indexed Data:
('USA', 0)
('INDIA', 1)
('CHINA', 2)
('JAPAN', 3)
('CANADA', 4)

Bash
COPY
Each element in the RDD is now paired with an index, starting from 0. The zipWithIndex method
assigns the index based on the position of each element in the RDD.
Keep in mind that zipWithIndex might cause a performance overhead since it requires a full pass
through the RDD to assign indices. Consider using alternatives such as zipWithUniqueId if unique
identifiers are sufficient for your use case, as it avoids this performance overhead.
In this article, we explored the use of zipWithIndex in PySpark, a method that assigns an index to
each element in an RDD. We provided a detailed example using hardcoded values as input,
showcasing how to create an RDD, use the zipWithIndex method, and interpret the results.
zipWithIndex can be useful when you need to associate an index with each element in an RDD,
but be cautious about the potential performance overhead it may introduce.
PySpark : Covariance Analysis in PySpark with a detailed example
In this article, we will explore covariance analysis in PySpark, a statistical measure that describes
the degree to which two continuous variables change together. We will provide a detailed
example using hardcoded values as input.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
First, let’s create a PySpark DataFrame with hardcoded values:

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
.appName("Covariance Analysis Example") \
.getOrCreate()

data_schema = StructType([
StructField("name", StringType(), True),
StructField("variable1", DoubleType(), True),
StructField("variable2", DoubleType(), True),
])
data = spark.createDataFrame([
("A", 1.0, 2.0),
("B", 2.0, 3.0),
("C", 3.0, 4.0),
("D", 4.0, 5.0),
("E", 5.0, 6.0),
], data_schema)

data.show()

Python
COPY
Output

+----+---------+---------+
|name|variable1|variable2|
+----+---------+---------+
| A| 1.0| 2.0|
| B| 2.0| 3.0|
| C| 3.0| 4.0|
| D| 4.0| 5.0|
| E| 5.0| 6.0|
+----+---------+---------+

Bash
COPY
Calculating Covariance
Now, let’s calculate the covariance between variable1 and variable2:

covariance_value = data.stat.cov("variable1", "variable2")


print(f"Covariance between variable1 and variable2: {covariance_value:.2f}")

Python
COPY
Output

Covariance between variable1 and variable2: 2.50

Bash
COPY
In this example, we used the cov function from the stat module of the DataFrame API to
calculate the covariance between the two variables.
Interpreting the Results
Covariance values can be positive, negative, or zero, depending on the relationship between the
two variables:
 Positive covariance: Indicates that as one variable increases, the other variable also
increases.
 Negative covariance: Indicates that as one variable increases, the other variable decreases.
 Zero covariance: Indicates that the two variables are independent and do not change
together.
In our example, the covariance value is 2.5, which indicates a positive relationship between
variable1 and variable2. This means that as variable1 increases, variable2 also increases, and vice
versa.
It’s important to note that covariance values are not standardized, making them difficult to
interpret in isolation. For a standardized measure of the relationship between two variables, you
may consider using correlation analysis instead.
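For reference, the same stat interface exposes a Pearson correlation directly; a one-line sketch on the DataFrame above:

# Pearson correlation between the two columns (returns a float between -1 and 1)
correlation_value = data.stat.corr("variable1", "variable2")
print(f"Correlation between variable1 and variable2: {correlation_value:.2f}")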
Here we explored covariance analysis in PySpark, a statistical measure that describes the degree
to which two continuous variables change together. We provided a detailed example using
hardcoded values as input, showcasing how to create a DataFrame, calculate the covariance
between two variables, and interpret the results. Covariance analysis can be useful in various
fields to understand the relationships between variables and make data-driven decisions.
However, due to the lack of standardization, it’s often more informative to use correlation
analysis for comparing the strength of relationships between different pairs of variables.
PySpark : Correlation Analysis in PySpark with a detailed example

In this article, we will explore correlation analysis in PySpark, a statistical technique used to
measure the strength and direction of the relationship between two continuous variables. We
will provide a detailed example using hardcoded values as input.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
Creating a PySpark DataFrame with Hardcoded Values
First, let’s create a PySpark DataFrame with hardcoded values:

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
.appName("Correlation Analysis Example") \
.getOrCreate()

data_schema = StructType([
StructField("name", StringType(), True),
StructField("variable1", DoubleType(), True),
StructField("variable2", DoubleType(), True),
])

data = spark.createDataFrame([
("A", 1.0, 2.0),
("B", 2.0, 3.0),
("C", 3.0, 4.0),
("D", 4.0, 5.0),
("E", 5.0, 6.0),
], data_schema)
data.show()

Python
COPY
Output

+----+---------+---------+
|name|variable1|variable2|
+----+---------+---------+
| A| 1.0| 2.0|
| B| 2.0| 3.0|
| C| 3.0| 4.0|
| D| 4.0| 5.0|
| E| 5.0| 6.0|
+----+---------+---------+

Bash
COPY
Calculating Correlation
Now, let’s calculate the correlation between variable1 and variable2:

from pyspark.ml.stat import Correlation


from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=["variable1", "variable2"], outputCol="features")


data_vector = vector_assembler.transform(data).select("features")

correlation_matrix = Correlation.corr(data_vector, "features").collect()[0][0]


correlation_value = correlation_matrix[0, 1]
print(f"Correlation between variable1 and variable2: {correlation_value:.2f}")

Python
COPY
Output

Correlation between variable1 and variable2: 1.00

Bash
COPY
In this example, we used the VectorAssembler to combine the two variables into a single feature
vector column called features. Then, we used the Correlation module from pyspark.ml.stat to
calculate the correlation between the two variables. The corr function returns a correlation
matrix, from which we can extract the correlation value between variable1 and variable2.
Interpreting the Results
The correlation value ranges from -1 to 1, where:
 -1 indicates a strong negative relationship
 0 indicates no relationship
 1 indicates a strong positive relationship
In our example, the correlation value is 1.0, which indicates a strong positive relationship
between variable1 and variable2. This means that as variable1 increases, variable2 also increases,
and vice versa.
In this article, we explored correlation analysis in PySpark, a statistical technique used to measure
the strength and direction of the relationship between two continuous variables. We provided a
detailed example using hardcoded values as input, showcasing how to create a DataFrame,
calculate the correlation between two variables, and interpret the results. Correlation analysis
can be useful in various fields, such as finance, economics, and social sciences, to understand the
relationships between variables and make data-driven decisions.
PySpark : Understanding Broadcast Joins in PySpark with a detailed example

In this article, we will explore broadcast joins in PySpark, which is an optimization technique used
when joining a large DataFrame with a smaller DataFrame. This method reduces the data
shuffling between nodes, resulting in improved performance. We will provide a detailed example
using hardcoded values as input.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
Let’s create two PySpark DataFrames with hardcoded values:

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder \
.appName("Broadcast Join Example @ Freshers.in") \
.getOrCreate()

orders_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("customer_id", IntegerType(), True),
StructField("product_id", IntegerType(), True),
])

orders_data = spark.createDataFrame([
(1, 101, 1001),
(2, 102, 1002),
(3, 103, 1001),
(4, 104, 1003),
(5, 105, 1002),
], orders_schema)

products_schema = StructType([
StructField("product_id", IntegerType(), True),
StructField("product_name", StringType(), True),
StructField("price", IntegerType(), True),
])
products_data = spark.createDataFrame([
(1001, "Product A", 50),
(1002, "Product B", 60),
(1003, "Product C", 70),
], products_schema)

orders_data.show()
products_data.show()

Python
COPY
Performing Broadcast Join
Now, let’s use the broadcast join to join the orders_data DataFrame with the products_data
DataFrame:

from pyspark.sql.functions import broadcast

joined_data = orders_data.join(broadcast(products_data), on="product_id", how="inner")


joined_data.show()

Python
COPY
In this example, we used the broadcast function from pyspark.sql.functions to indicate that the
products_data DataFrame should be broadcasted to all worker nodes. This is useful when joining
a small DataFrame (in this case, products_data) with a large DataFrame (in this case,
orders_data). Broadcasting the smaller DataFrame reduces the amount of data shuffling and
network overhead, resulting in improved performance.
It’s essential to broadcast only small DataFrames because broadcasting a large DataFrame can
cause memory issues due to the replication of data across all worker nodes.
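Spark can also broadcast one side of a join automatically when its estimated size is below the spark.sql.autoBroadcastJoinThreshold setting (10 MB by default). A small illustrative sketch of adjusting that threshold (not part of the original example):

# Raise the automatic broadcast threshold to roughly 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Setting it to -1 disables automatic broadcast joins entirely
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)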
Analyzing the Join Results
The resulting joined_data DataFrame contains the following columns:
 order_id
 customer_id
 product_id
 product_name
 price
This DataFrame provides a combined view of the orders and products, allowing for further
analysis, such as calculating the total order value or finding the most popular products.
In this article, we explored broadcast joins in PySpark, an optimization technique for joining a
large DataFrame with a smaller DataFrame. We provided a detailed example using hardcoded
values as input to create two DataFrames and perform a broadcast join. This method can
significantly improve performance by reducing data shuffling and network overhead during join
operations. However, it’s crucial to use broadcast joins only with small DataFrames, as
broadcasting large DataFrames can cause memory issues.
PySpark : Splitting a DataFrame into multiple smaller DataFrames [randomSplit function in
PySpark]
In this article, we will discuss the randomSplit function in PySpark, which is useful for splitting a
DataFrame into multiple smaller DataFrames based on specified weights. This function is
particularly helpful when you need to divide a dataset into training and testing sets for machine
learning tasks. We will provide a detailed example using hardcoded values as input.
First, let’s create a PySpark DataFrame :

from datetime import datetime


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder \
.appName("RandomSplit @ Freshers.in Example") \
.getOrCreate()

schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("timestamp", TimestampType(), True)
])

data = spark.createDataFrame([
("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
("Barry", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
("Charlie", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f")),
("David", 28, datetime.strptime("2023-03-15 18:20:45.567", "%Y-%m-%d %H:%M:%S.%f")),
("Eva", 22, datetime.strptime("2023-04-21 10:34:25.890", "%Y-%m-%d %H:%M:%S.%f"))
], schema)

data.show(20,False)

Python
COPY
Output

+-------+---+--------------------+
| name|age| timestamp|
+-------+---+--------------------+
| Sachin| 30|2022-12-01 12:30:...|
| Barry| 25|2023-01-10 16:45:...|
|Charlie| 35|2023-02-07 09:15:...|
| David| 28|2023-03-15 18:20:...|
| Eva| 22|2023-04-21 10:34:...|
+-------+---+--------------------+

Bash
COPY
Using randomSplit Function
Now, let’s use the randomSplit function to split the DataFrame into two smaller DataFrames. In
this example, we will split the data into 70% for training and 30% for testing:
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)
train_data.show()
test_data.show()

Python
COPY
Output

+------+---+-----------------------+
|name |age|timestamp |
+------+---+-----------------------+
|Barry |25 |2023-01-10 16:45:35.789|
|Sachin|30 |2022-12-01 12:30:15.123|
|David |28 |2023-03-15 18:20:45.567|
|Eva |22 |2023-04-21 10:34:25.89 |
+------+---+-----------------------+

+-------+---+-----------------------+
|name |age|timestamp |
+-------+---+-----------------------+
|Charlie|35 |2023-02-07 09:15:30.246|
+-------+---+-----------------------+

Bash
COPY
The randomSplit function accepts two arguments: a list of weights for each DataFrame and a
seed for reproducibility. In this example, we’ve used the weights [0.7, 0.3] to allocate
approximately 70% of the data to the training set and 30% to the testing set. The seed value 42
ensures that the split will be the same every time we run the code.
Please note that the actual number of rows in the resulting DataFrames might not exactly match
the specified weights due to the random nature of the function. However, with a larger dataset,
the split will be closer to the specified weights.
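To see the proportions converge, the same call can be made on a larger, generated DataFrame; a quick sketch (an illustrative addition):

# With 100,000 rows the 70/30 split is much closer to the requested weights
big_df = spark.range(0, 100000)
big_train, big_test = big_df.randomSplit([0.7, 0.3], seed=42)
print(big_train.count(), big_test.count())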
Here we demonstrated how to use the randomSplit function in PySpark to divide a DataFrame
into smaller DataFrames based on specified weights. This function is particularly useful for
creating training and testing sets for machine learning tasks. We provided an example using
hardcoded values as input, showcasing how to create a DataFrame and perform the random split.
PySpark : Using randomSplit Function in PySpark for train and test data

In this article, we will discuss the randomSplit function in PySpark, which is useful for splitting a
DataFrame into multiple smaller DataFrames based on specified weights. This function is
particularly helpful when you need to divide a dataset into training and testing sets for machine
learning tasks. We will provide a detailed example using hardcoded values as input.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
Loading the Dataset with Hardcoded Values
First, let’s create a PySpark DataFrame with hardcoded values:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder \
.appName("RandomSplit @ Freshers.in Example") \
.getOrCreate()
schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("timestamp", TimestampType(), True)
])
data = spark.createDataFrame([
("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
("Barry", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
("Charlie", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f")),
("David", 28, datetime.strptime("2023-03-15 18:20:45.567", "%Y-%m-%d %H:%M:%S.%f")),
("Eva", 22, datetime.strptime("2023-04-21 10:34:25.890", "%Y-%m-%d %H:%M:%S.%f"))
], schema)
data.show(20,False)

Python
COPY
Output

+-------+---+--------------------+
| name|age| timestamp|
+-------+---+--------------------+
| Sachin| 30|2022-12-01 12:30:...|
| Barry| 25|2023-01-10 16:45:...|
|Charlie| 35|2023-02-07 09:15:...|
| David| 28|2023-03-15 18:20:...|
| Eva| 22|2023-04-21 10:34:...|
+-------+---+--------------------+

Bash
COPY
Using randomSplit Function
Now, let’s use the randomSplit function to split the DataFrame into two smaller DataFrames. In
this example, we will split the data into 70% for training and 30% for testing:

train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)


train_data.show()
test_data.show()

Python
COPY
The randomSplit function accepts two arguments: a list of weights for each DataFrame and a
seed for reproducibility. In this example, we’ve used the weights [0.7, 0.3] to allocate
approximately 70% of the data to the training set and 30% to the testing set. The seed value 42
ensures that the split will be the same every time we run the code.
Please note that the actual number of rows in the resulting DataFrames might not exactly match
the specified weights due to the random nature of the function. However, with a larger dataset,
the split will be closer to the specified weights.
In this article, we demonstrated how to use the randomSplit function in PySpark to divide a
DataFrame into smaller DataFrames based on specified weights. This function is particularly
useful for creating training and testing sets for machine learning tasks. We provided an example
using hardcoded values as input, showcasing how to create a DataFrame and perform the
random split.
PySpark : Extracting Time Components and Converting Timezones with PySpark

In this article, we will be working with a dataset containing a column with names, ages, and
timestamps. Our goal is to extract various time components from the timestamps, such as hours,
minutes, seconds, milliseconds, and more. We will also demonstrate how to convert the
timestamps to a specific timezone using PySpark. To achieve this, we will use the PySpark and
PySpark SQL functions.
Prerequisites
 Python 3.7 or higher
 PySpark library
 Java 8 or higher
Input Data
First, let’s load the dataset into a PySpark DataFrame:

#Extracting Time Components and Converting Timezones with PySpark


from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
spark = SparkSession.builder \
.appName("Time Components and Timezone Conversion @ Freshers.in") \
.getOrCreate()
schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("timestamp", TimestampType(), True)
])
#data = spark.read.csv("data.csv", header=True, inferSchema=True)
data = spark.createDataFrame([
("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
("Wilson", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
("Johnson", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f"))
], schema)
data.printSchema()
data.show(20, False)

Python
COPY
Input data results
Schema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)

Bash
COPY
Data frame output

+-------+---+-----------------------+
|name |age|timestamp |
+-------+---+-----------------------+
|Sachin |30 |2022-12-01 12:30:15.123|
|Wilson |25 |2023-01-10 16:45:35.789|
|Johnson|35 |2023-02-07 09:15:30.246|
+-------+---+-----------------------+

Bash
COPY
Now, we will extract various time components from the ‘timestamp’ column using PySpark SQL
functions:

from pyspark.sql.functions import (


hour, minute, second, year, month, dayofmonth, weekofyear, quarter, substring)

Python
COPY
# 1. Extract hour

data.withColumn("hour", hour("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+----+
|name   |age|timestamp              |hour|
+-------+---+-----------------------+----+
|Sachin |30 |2022-12-01 12:30:15.123|12  |
|Wilson |25 |2023-01-10 16:45:35.789|16  |
|Johnson|35 |2023-02-07 09:15:30.246|9   |
+-------+---+-----------------------+----+

Bash
COPY
# 2. Extract minute

data.withColumn("minute", minute("timestamp")).show(20, False)


Python
COPY
Output

+-------+---+-----------------------+------+
|name   |age|timestamp              |minute|
+-------+---+-----------------------+------+
|Sachin |30 |2022-12-01 12:30:15.123|30    |
|Wilson |25 |2023-01-10 16:45:35.789|45    |
|Johnson|35 |2023-02-07 09:15:30.246|15    |
+-------+---+-----------------------+------+

Bash
COPY
# 3. Extract second

data.withColumn("second", second("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+------+
|name   |age|timestamp              |second|
+-------+---+-----------------------+------+
|Sachin |30 |2022-12-01 12:30:15.123|15    |
|Wilson |25 |2023-01-10 16:45:35.789|35    |
|Johnson|35 |2023-02-07 09:15:30.246|30    |
+-------+---+-----------------------+------+

Bash
COPY
# 4. Extract millisecond

data.withColumn("millisecond", (substring("timestamp", 21, 3)).cast("int")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+-----------+
|name   |age|timestamp              |millisecond|
+-------+---+-----------------------+-----------+
|Sachin |30 |2022-12-01 12:30:15.123|123        |
|Wilson |25 |2023-01-10 16:45:35.789|789        |
|Johnson|35 |2023-02-07 09:15:30.246|246        |
+-------+---+-----------------------+-----------+

Bash
COPY
# 5. Extract year

data.withColumn("year", year("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+----+
|name   |age|timestamp              |year|
+-------+---+-----------------------+----+
|Sachin |30 |2022-12-01 12:30:15.123|2022|
|Wilson |25 |2023-01-10 16:45:35.789|2023|
|Johnson|35 |2023-02-07 09:15:30.246|2023|
+-------+---+-----------------------+----+

Bash
COPY
# 6. Extract month

data.withColumn("month", month("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+-----+
|name   |age|timestamp              |month|
+-------+---+-----------------------+-----+
|Sachin |30 |2022-12-01 12:30:15.123|12   |
|Wilson |25 |2023-01-10 16:45:35.789|1    |
|Johnson|35 |2023-02-07 09:15:30.246|2    |
+-------+---+-----------------------+-----+

Bash
COPY
# 7. Extract day

data.withColumn("day", dayofmonth("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+---+
|name   |age|timestamp              |day|
+-------+---+-----------------------+---+
|Sachin |30 |2022-12-01 12:30:15.123|1  |
|Wilson |25 |2023-01-10 16:45:35.789|10 |
|Johnson|35 |2023-02-07 09:15:30.246|7  |
+-------+---+-----------------------+---+

Bash
COPY
# 8. Extract week

data.withColumn("week", weekofyear("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+----+
|name   |age|timestamp              |week|
+-------+---+-----------------------+----+
|Sachin |30 |2022-12-01 12:30:15.123|48  |
|Wilson |25 |2023-01-10 16:45:35.789|2   |
|Johnson|35 |2023-02-07 09:15:30.246|6   |
+-------+---+-----------------------+----+

Bash
COPY
# 9. Extract quarter

data.withColumn("quarter", quarter("timestamp")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+-------+
|name   |age|timestamp              |quarter|
+-------+---+-----------------------+-------+
|Sachin |30 |2022-12-01 12:30:15.123|4      |
|Wilson |25 |2023-01-10 16:45:35.789|1      |
|Johnson|35 |2023-02-07 09:15:30.246|1      |
+-------+---+-----------------------+-------+

Bash
COPY
# 10. Convert timestamp to specific timezone
To convert the timestamps to a specific timezone, we will use the PySpark
SQL from_utc_timestamp function. In this example, we will convert the timestamps to the
‘America/New_York’ timezone:

from pyspark.sql.functions import from_utc_timestamp


data.withColumn("timestamp_local", from_utc_timestamp("timestamp",
"America/New_York")).show(20, False)

Python
COPY
Output

+-------+---+-----------------------+-----------------------+
|name   |age|timestamp              |timestamp_local        |
+-------+---+-----------------------+-----------------------+
|Sachin |30 |2022-12-01 12:30:15.123|2022-12-01 07:30:15.123|
|Wilson |25 |2023-01-10 16:45:35.789|2023-01-10 11:45:35.789|
|Johnson|35 |2023-02-07 09:15:30.246|2023-02-07 04:15:30.246|
+-------+---+-----------------------+-----------------------+
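The inverse conversion is available as well: to_utc_timestamp interprets a timestamp as local time in the given timezone and converts it back to UTC. A short sketch on the same DataFrame:

from pyspark.sql.functions import to_utc_timestamp

data.withColumn("timestamp_utc", to_utc_timestamp("timestamp", "America/New_York")).show(20, False)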

PySpark : Understanding PySpark’s map_from_arrays Function with detailed examples



PySpark provides a wide range of functions to manipulate and transform data within
DataFrames. In this article, we will focus on the map_from_arrays function, which allows you to
create a map column by combining two arrays. We will discuss the functionality, syntax, and
provide a detailed example with input data to illustrate its usage.
1. The map_from_arrays Function in PySpark
The map_from_arrays function is a part of the PySpark SQL library, which provides various
functions to work with different data types. This function creates a map column by combining
two arrays, where the first array contains keys, and the second array contains values. The
resulting map column is useful for representing key-value pairs in a compact format.
Syntax:

pyspark.sql.functions.map_from_arrays(keys, values)

Python
COPY
keys: An array column containing the map keys.
values: An array column containing the map values.
2. A Detailed Example of Using the map_from_arrays Function
Let’s create a PySpark DataFrame with two array columns, representing keys and values, and
apply the map_from_arrays function to combine them into a map column.
First, let’s import the necessary libraries and create a sample DataFrame:

from pyspark.sql import SparkSession


from pyspark.sql.functions import map_from_arrays
from pyspark.sql.types import StringType, ArrayType
# Create a Spark session
spark = SparkSession.builder.master("local").appName("map_from_arrays Function Example").getOrCreate()
# Sample data
data = [(["a", "b", "c"], [1, 2, 3]), (["x", "y", "z"], [4, 5, 6])]
# Define the schema
schema = ["Keys", "Values"]
# Create the DataFrame
df = spark.createDataFrame(data, schema)

Python
COPY
Now that we have our DataFrame, let’s apply the map_from_arrays function to it:

# Apply the map_from_arrays function


df = df.withColumn("Map", map_from_arrays(df["Keys"], df["Values"]))
# Show the results
df.show(truncate=False)

Python
COPY
Output

+---------+---------+------------------------+
|Keys |Values |Map |
+---------+---------+------------------------+
|[a, b, c]|[1, 2, 3]|{a -> 1, b -> 2, c -> 3}|
|[x, y, z]|[4, 5, 6]|{x -> 4, y -> 5, z -> 6}|
+---------+---------+------------------------+

Bash
COPY
In this example, we created a PySpark DataFrame with two array columns, “Keys” and “Values”,
and applied the map_from_arrays function to combine them into a “Map” column. The output
DataFrame displays the original keys and values arrays, as well as the resulting map column.
The PySpark map_from_arrays function is a powerful and convenient tool for working with array
columns and transforming them into a map column. With the help of the detailed example
provided in this article, you should be able to effectively use the map_from_arrays function in
your own PySpark projects.
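Once the map column exists, individual values can be read out by key, for example with getItem (keys missing from a row come back as null). A brief sketch continuing from the DataFrame above:

# Pull single keys out of the map column
df.select(df["Map"].getItem("a").alias("a_value"), df["Map"].getItem("x").alias("x_value")).show()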
PySpark : Understanding PySpark’s LAG and LEAD Window Functions with detailed examples

One of PySpark's powerful features is the ability to work with window functions, which allow for
complex calculations and data manipulation tasks. In this article, we will focus on two common
window functions in PySpark: LAG and LEAD. We will discuss their functionality, syntax, and
provide a detailed example with input data to illustrate their usage.
1. LAG and LEAD Window Functions in PySpark
LAG and LEAD are window functions used to access the previous (LAG) or the next (LEAD) row in
a result set, allowing you to perform calculations or comparisons across rows. These functions
can be especially useful for time series analysis or when working with ordered data.
Syntax:
LAG(column, offset=1, default=None)
LEAD(column, offset=1, default=None)

Python
COPY
column: The column or expression to apply the LAG or LEAD function on.
offset: The number of rows to look behind (LAG) or ahead (LEAD) from the current row (default
is 1).
default: The value to return when no previous or next row exists. If not specified, it returns NULL.
2. A Detailed Example of Using LAG and LEAD Functions
Let’s create a PySpark DataFrame with sales data and apply LAG and LEAD functions to calculate
the previous and next month’s sales, respectively.
First, let’s import the necessary libraries and create a sample DataFrame:

from pyspark.sql import SparkSession


from pyspark.sql.functions import to_date
from pyspark.sql.types import StringType, IntegerType, DateType
from pyspark.sql.window import Window
# Create a Spark session
spark = SparkSession.builder.master("local").appName("LAG and LEAD Functions Example").getOrCreate()
# Sample data
data = [("2023-01-01", 100), ("2023-02-01", 200), ("2023-03-01", 300), ("2023-04-01", 400)]
# Define the schema
schema = ["Date", "Sales"]
# Create the DataFrame
df = spark.createDataFrame(data, schema)
# Convert the date string to date type
df = df.withColumn("Date", to_date(df["Date"], "yyyy-MM-dd"))

Python
COPY
Now that we have our DataFrame, let’s apply the LAG and LEAD functions using a Window
specification:

from pyspark.sql.functions import lag, lead

# Define the window specification


window_spec = Window.orderBy("Date")

# Apply the LAG and LEAD functions


df = df.withColumn("Previous Month Sales", lag(df["Sales"]).over(window_spec))
df = df.withColumn("Next Month Sales", lead(df["Sales"]).over(window_spec))

# Show the results


df.show()

Python
COPY
This will have the following output:
+----------+-----+--------------------+----------------+
| Date|Sales|Previous Month Sales|Next Month Sales|
+----------+-----+--------------------+----------------+
|2023-01-01| 100| null| 200|
|2023-02-01| 200| 100| 300|
|2023-03-01| 300| 200| 400|
|2023-04-01| 400| 300| null|
+----------+-----+--------------------+----------------+

Bash
COPY
In this example, we used the LAG function to obtain the sales from the previous month and the LEAD function to obtain the sales from the following month, with null returned where no previous or next row exists.
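A common follow-up is to derive a month-over-month change from the lagged value; a brief sketch built on the same DataFrame and window specification:

from pyspark.sql.functions import col

# Month-over-month change in sales (null for the first row, where there is no previous month)
df = df.withColumn("MoM Change", col("Sales") - lag(col("Sales")).over(window_spec))
df.show()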
PySpark : Exploring PySpark’s last_day function with detailed examples

PySpark provides an easy-to-use interface for programming Spark with the Python programming
language. Among the numerous functions available in PySpark, the last_day function is used to
retrieve the last date of the month for a given date. In this article, we will discuss the PySpark
last_day function, its syntax, and a detailed example illustrating its use with input data.
1. The last_day function in PySpark
The last_day function is a part of the PySpark SQL library, which provides various functions to
work with dates and times. It is useful when you need to perform time-based aggregations or
calculations based on the end of the month.
Syntax:

pyspark.sql.functions.last_day(date)

Python
COPY
Where date is a column or an expression that returns a date or a timestamp.
2. A detailed example of using the last_day function
To illustrate the usage of the last_day function, let’s create a PySpark DataFrame containing date
information and apply the function to it.
First, let’s import the necessary libraries and create a sample DataFrame:

from pyspark.sql import SparkSession


from pyspark.sql.functions import last_day, to_date
from pyspark.sql.types import StringType, DateType

# Create a Spark session


spark = SparkSession.builder.master("local").appName("last_day Function Example @
Freshers.in ").getOrCreate()

# Sample data
data = [("2023-01-15",), ("2023-02-25",), ("2023-03-05",), ("2023-04-10",)]
# Define the schema
schema = ["Date"]

# Create the DataFrame


df = spark.createDataFrame(data, schema)

# Convert the date string to date type


df = df.withColumn("Date", to_date(df["Date"], "yyyy-MM-dd"))

Python
COPY
Now that we have our DataFrame, let’s apply the last_day function to it:

# Apply the last_day function


df = df.withColumn("Last Day of Month", last_day(df["Date"]))
# Show the results
df.show()

Python
COPY
Output

+----------+-----------------+
| Date|Last Day of Month|
+----------+-----------------+
|2023-01-15| 2023-01-31|
|2023-02-25| 2023-02-28|
|2023-03-05| 2023-03-31|
|2023-04-10| 2023-04-30|
+----------+-----------------+

Bash
COPY
In this example, we created a PySpark DataFrame with a date column and applied
the last_day function to calculate the last day of the month for each date. The output DataFrame
displays the original date along with the corresponding last day of the month.
The PySpark last_day function is a powerful and convenient tool for working with dates,
particularly when you need to determine the last day of the month for a given date. With the help
of the detailed example provided in this article, you should be able to effectively use the last_day
function in your own PySpark projects.
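Since last_day is commonly used for month-end aggregations, here is a minimal sketch, reusing the df built above, that groups the rows by the last day of their month and counts how many dates fall into each month.

from pyspark.sql.functions import last_day, count

# group by the last day of each month and count the dates falling into that month
df_monthly = df.groupBy(last_day(df["Date"]).alias("Month End")).agg(count("*").alias("Rows"))
df_monthly.show()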
PySpark : Format phone numbers in a specific way using PySpark

In this article, we’ll be working with a PySpark DataFrame that contains a column of phone
numbers. We’ll use PySpark’s string manipulation functions to format these phone numbers in a
specific way, and then save the formatted phone numbers to a new DataFrame.
Sample Data
To demonstrate how to format phone numbers in PySpark, we’ll create a sample DataFrame with
some phone numbers. Here’s the code to create the sample data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FormatPhoneNumbers").getOrCreate()
data = [("John", "123-456-7890"),
("Jane", "234-567-8901"),
("Bob", "345-678-9012")]
df = spark.createDataFrame(data, ["name", "phone_number"])
df.show()

Python
COPY
Output

+----+------------+
|name|phone_number|
+----+------------+
|John|123-456-7890|
|Jane|234-567-8901|
| Bob|345-678-9012|
+----+------------+

Bash
COPY
The sample data consists of a DataFrame with two columns: “name” and “phone_number”. The
phone numbers are in the format “XXX-XXX-XXXX”.
Formatting Phone Numbers
Now that we have our sample data, we can start formatting the phone numbers. Here’s the code
to remove any non-numeric characters from the phone numbers:

from pyspark.sql.functions import regexp_replace


df = df.withColumn("phone_number", regexp_replace("phone_number", "[^0-9]", ""))
df.show()

Python
COPY
This code uses PySpark’s regexp_replace() function to remove any characters that are not digits
from the phone numbers. Now we have phone numbers that only contain digits.
Output

+----+------------+
|name|phone_number|
+----+------------+
|John| 1234567890|
|Jane| 2345678901|
| Bob| 3456789012|
+----+------------+

Bash
COPY
Next, we’ll format the phone numbers in the desired way. Let’s say we want to format the phone
numbers as “(XXX) XXX-XXXX”. Here’s the code to do that:

from pyspark.sql.functions import regexp_replace, col, concat, lit, substring

df = df.withColumn("phone_number", regexp_replace(col("phone_number"), "[^0-9]", ""))


df = df.withColumn("phone_number",
concat(lit("("), substring(col("phone_number"), 1, 3), lit(") "),
substring(col("phone_number"), 4, 3), lit("-"),
substring(col("phone_number"), 7, 4)))
df.show()

Python
COPY
Output

+----+--------------+
|name| phone_number|
+----+--------------+
|John|(123) 456-7890|
|Jane|(234) 567-8901|
| Bob|(345) 678-9012|
+----+--------------+

Bash
COPY
This code uses PySpark’s substring() function to extract the first three digits, the next three
digits, and the final four digits of each phone number, and then concatenates them together with
the desired formatting.
We learned how to use PySpark to format phone numbers in a specific way. We used PySpark’s
string manipulation functions to remove non-numeric characters from the phone numbers and to
format the phone numbers with the desired formatting.
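As an alternative to concatenating substrings, the same formatting can be done in a single regexp_replace call with capture groups. The sketch below assumes a DataFrame named df_digits (a hypothetical name) whose phone_number column holds exactly ten digits, as after the digit-stripping step above.

from pyspark.sql.functions import regexp_replace, col

# df_digits is assumed to hold phone numbers as ten digits only, e.g. "1234567890"
df_alt = df_digits.withColumn(
    "phone_number",
    regexp_replace(col("phone_number"), r"^(\d{3})(\d{3})(\d{4})$", "($1) $2-$3")
)
df_alt.show()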
PySpark : PySpark to extract specific fields from XML data

XML data is commonly used in data exchange and storage, and it can contain complex
hierarchical structures. PySpark provides a simple and efficient way to extract specific fields from
XML data using its built-in functions.
Sample XML data

<?xml version="1.0" encoding="UTF-8"?>


<People>
<Person>
<Name>John</Name>
<Age>30</Age>
<City>New York</City>
</Person>
<Person>
<Name>Jane</Name>
<Age>35</Age>
<City>London</City>
</Person>
<Person>
<Name>Jack</Name>
<Age>25</Age>
<City>Paris</City>
</Person>
<Person>
<Name>Emily</Name>
<Age>40</Age>
<City>Tokyo</City>
</Person>
</People>

Bash
COPY
Input Data
Let’s assume we have the following dataset that contains XML data in a column:

+----+-----------------------+
| ID | XML |
+----+-----------------------+
| 1 |<Person><Name>John</Name><Age>30</Age><City>New York</City></Person>|
| 2 |<Person><Name>Jane</Name><Age>35</Age><City>London</City></Person>|
| 3 |<Person><Name>Jack</Name><Age>25</Age><City>Paris</City></Person>|
| 4 |<Person><Name>Emily</Name><Age>40</Age><City>Tokyo</City></Person>|
+----+-----------------------+

Bash
COPY
Extracting Specific Fields from XML Data in PySpark
To extract specific fields from XML data in PySpark, we can use Spark SQL's built-in XPath
functions, such as xpath_string, through the expr function. xpath_string evaluates an XPath
expression against the XML data and returns the matching text as a string (related functions
such as xpath return arrays).
For example, to extract the Name and Age fields from the XML data in the input DataFrame, we
can use the following code:

from pyspark.sql import SparkSession


from pyspark.sql.functions import expr

# create a SparkSession
spark = SparkSession.builder.appName("ExtractXMLFields").getOrCreate()

# load the input data into a DataFrame


df = spark.createDataFrame([
(1, "<Person><Name>John</Name><Age>30</Age><City>New York</City></Person>"),
(2, "<Person><Name>Jane</Name><Age>35</Age><City>London</City></Person>"),
(3, "<Person><Name>Jack</Name><Age>25</Age><City>Paris</City></Person>"),
(4, "<Person><Name>Emily</Name><Age>40</Age><City>Tokyo</City></Person>")
], ["ID", "XML"])

# extract the Name and Age fields from the XML column using xpath_string
df_xml = df.select("ID",
    expr("xpath_string(XML, '/Person/Name')").alias("Name"),
    expr("xpath_string(XML, '/Person/Age')").alias("Age"))

# show the result


df_xml.show()

Python
COPY
Output

+---+----+---+
| ID|Name|Age|
+---+----+---+
| 1|John| 30|
| 2|Jane| 35|
| 3|Jack| 25|
| 4|Emily| 40|
+---+----+---+

Bash
COPY
As we can see, the output DataFrame contains the Name and Age fields extracted from the XML
data in the input DataFrame.
Extracting specific fields from XML data in PySpark is a simple and efficient process using the
built-in XPath functions. By specifying the XPath expression that matches the desired fields, we can easily
extract the specific fields from the XML data. This is an essential step in data preprocessing and
cleaning that facilitates the analysis and modeling of complex hierarchical data structures.
Sample code to read XML data from a file using PySpark

from pyspark.sql import SparkSession


# create a SparkSession
spark = SparkSession.builder.appName("ReadXML").getOrCreate()
# read the XML file into a DataFrame
# (this requires the external spark-xml package to be available on the cluster)
df = spark.read.format("xml").option("rowTag", "Person").load("path/to/xml/file.xml")
# show the result
df.show()

PySpark : Replacing special characters with a specific value using PySpark.


Working with datasets that contain special characters can be a challenge in data preprocessing
and cleaning. PySpark provides a simple and efficient way to replace special characters with a
specific value using its built-in functions.
Input Data
Let’s assume we have the following dataset that contains columns with special characters:

+----+-----------------+------------------+
| ID | First_Name | Last_Name |
+----+-----------------+------------------+
| 1 | John-Doe | Smith & Jones |
| 2 | Jane~Johnson | Lee*Chang&Kim |
| 3 | Jack!Brown | Lee+Park |
| 4 | Emily?Wong$Li | Perez/Sanchez |
+----+-----------------+------------------+

Bash
COPY
Replacing Special Characters with a Specific Value in PySpark
To replace special characters with a specific value in PySpark, we can use
the regexp_replace function. The regexp_replace function replaces all occurrences of a specified
regular expression pattern with a specified replacement value.
For example, to replace all special characters in the input DataFrame with an
underscore (_) character, we can use the following code:

from pyspark.sql import SparkSession


from pyspark.sql.functions import regexp_replace
# create a SparkSession
spark = SparkSession.builder.appName("ReplaceSpecialChars").getOrCreate()
# load the input data into a DataFrame
df = spark.createDataFrame([
(1, "John-Doe", "Smith & Jones"),
(2, "Jane~Johnson", "Lee*Chang&Kim"),
(3, "Jack!Brown", "Lee+Park"),
(4, "Emily?Wong$Li", "Perez/Sanchez")
], ["ID", "First_Name", "Last_Name"])
# replace all special characters in the First_Name and Last_Name columns with an underscore
character
df_clean = df.select("ID",
regexp_replace("First_Name", "[^a-zA-Z0-9]+", "_").alias("First_Name"),
regexp_replace("Last_Name", "[^a-zA-Z0-9]+", "_").alias("Last_Name"))
# show the result
df_clean.show()

Python
COPY
Output

+---+-----------+--------------+
| ID| First_Name| Last_Name|
+---+-----------+--------------+
| 1| John_Doe| Smith_Jones|
| 2| Jane_Johnson| Lee_Chang_Kim|
| 3| Jack_Brown| Lee_Park|
| 4| Emily_Wong_Li| Perez_Sanchez|
+---+-----------+--------------+

Bash
COPY
The output DataFrame contains the First_Name and Last_Name columns with all special
characters replaced by an underscore character. Replacing special characters with a specific value
in PySpark is a simple and efficient process using the regexp_replace function. By specifying the
regular expression pattern that matches the special characters and the replacement value, we
can easily replace all special characters in a column with a specific value. This is an essential step
in data preprocessing and cleaning that ensures the consistency and reliability of data analysis
and modeling results.
PySpark : Dataset has a column that contains a string with multiple values separated by a
delimiter. Count the number of occurrences of each value using PySpark.

Counting the number of occurrences of each value in a string column with multiple values
separated by a delimiter is a common task in data preprocessing and cleaning. PySpark provides a
simple and efficient way to split the string column into multiple columns based on the delimiter
and count the number of occurrences of each value using its built-in functions.
Input Data
Let’s assume we have the following dataset that contains a string column with multiple values
separated by a comma:

+----+------------+
| ID | Items |
+----+------------+
| 1 | A,B,C,D,E |
| 2 | A,C,F,G,H |
| 3 | B,C,D,G,H |
| 4 | A,C,D,E,F |
+----+------------+

Bash
COPY
Counting the Number of Occurrences of Each Value in a String Column in PySpark
To count the number of occurrences of each value in a string column with multiple values
separated by a delimiter in PySpark, we can use the split and explode functions. The split function
splits the string column into an array of strings based on the delimiter, and the explode function
creates a new row for each element in the array. We can then group the rows by the exploded
column and count the number of occurrences of each value using the count function.
For example, to count the number of occurrences of each value in the Items column in the input
DataFrame, we can use the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, count

# create a SparkSession
spark = SparkSession.builder.appName("CountOccurrences").getOrCreate()

# load the input data into a DataFrame


df = spark.createDataFrame([
(1, "A,B,C,D,E"),
(2, "A,C,F,G,H"),
(3, "B,C,D,G,H"),
(4, "A,C,D,E,F")
], ["ID", "Items"])

# split the Items column into an array of strings and explode the array into multiple rows
df_exploded = df.select("ID", explode(split("Items", ",")).alias("Item"))

# group the rows by the Item column and count the number of occurrences of each value
df_count = df_exploded.groupBy("Item").agg(count("ID").alias("Count")).orderBy("Count",
ascending=False)

# show the result


df_count.show()

Python
COPY
Output

+----+-----+
|Item|Count|
+----+-----+
|   C|    4|
|   D|    3|
|   A|    3|
|   F|    2|
|   E|    2|
|   B|    2|
|   H|    2|
|   G|    2|
+----+-----+

Bash
COPY
Counting the number of occurrences of each value in a string column with multiple values
separated by a delimiter in PySpark is a simple and efficient process using the split, explode, and
count functions. By splitting the string column into an array of strings and exploding the array
into multiple rows, we can easily count the number of occurrences of each value in the column.
PySpark : Dataset has datetime column. Need to convert this column to a different timezone.
Working with datetime data in different timezones can be a challenge in data analysis and
modeling. PySpark provides a simple and efficient way to convert a datetime column to a
different timezone using its built-in functions.
Input Data
Let’s assume we have the following dataset that contains a datetime column:

+----+---------------------+
| ID | Timestamp |
+----+---------------------+
| 1 | 2023-04-02 10:30:00 |
| 2 | 2023-04-02 12:00:00 |
| 3 | 2023-04-02 13:30:00 |
| 4 | 2023-04-02 15:00:00 |
+----+---------------------+

Bash
COPY
Converting a Datetime Column to a Different Timezone in PySpark
To convert a datetime column to a different timezone in PySpark, we can use the
from_utc_timestamp and to_utc_timestamp functions. The from_utc_timestamp function
converts a UTC timestamp to a local timestamp in the specified timezone, and the
to_utc_timestamp function converts a local timestamp in the specified timezone to a UTC
timestamp.
For example, to convert the Timestamp column in the input DataFrame to the
“America/New_York” timezone, we can use the following code:

from pyspark.sql import SparkSession


from pyspark.sql.functions import from_utc_timestamp, to_utc_timestamp

# create a SparkSession
spark = SparkSession.builder.appName("ConvertTimezone").getOrCreate()

# load the input data into a DataFrame


df = spark.createDataFrame([
(1, "2023-04-02 10:30:00"),
(2, "2023-04-02 12:00:00"),
(3, "2023-04-02 13:30:00"),
(4, "2023-04-02 15:00:00")
], ["ID", "Timestamp"])

# convert the Timestamp column (assumed to be in UTC) to the "America/New_York" timezone


df_new = df.select("ID", from_utc_timestamp("Timestamp", "America/New_York").alias("Timestamp"))
df_new.show()

Python
COPY
The output of this code will be:
+---+-------------------+
| ID| Timestamp|
+---+-------------------+
|  1|2023-04-02 06:30:00|
|  2|2023-04-02 08:00:00|
|  3|2023-04-02 09:30:00|
|  4|2023-04-02 11:00:00|
+---+-------------------+

Bash
COPY
As we can see, the Timestamp column in the output DataFrame is converted to the
“America/New_York” timezone, and the ID column is retained.
Converting a datetime column to a different timezone in PySpark is a simple and efficient process
using the from_utc_timestamp and to_utc_timestamp functions. By specifying the input
timezone and the output timezone, we can easily convert the datetime column to the desired
timezone. This is an essential step in data preprocessing and data cleaning that ensures the
accuracy and consistency of datetime data across different timezones
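When the source timestamps are not in UTC, the two functions can be chained: to_utc_timestamp first normalizes the local time to UTC, and from_utc_timestamp then renders it in the target zone. A minimal sketch, reusing the df created above and treating its Timestamp strings as Asia/Kolkata local times (an assumption made only for illustration):

from pyspark.sql.functions import from_utc_timestamp, to_utc_timestamp

# treat Timestamp as Asia/Kolkata local time, normalize it to UTC,
# then render it in the America/New_York timezone
df_ny = df.select(
    "ID",
    from_utc_timestamp(
        to_utc_timestamp("Timestamp", "Asia/Kolkata"),
        "America/New_York"
    ).alias("Timestamp_NY")
)
df_ny.show()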
PySpark : Dataset with columns that contain duplicate values, how to keep only the last
occurrence of each value.
Duplicate values in a dataset can cause problems for data analysis and modeling. It is often
necessary to remove duplicates and keep only the last occurrence of each value to ensure the
accuracy and reliability of the results. PySpark provides a simple and efficient way to remove
duplicates and keep only the last occurrence of each value using its built-in functions.
Input Data
Let’s assume we have the following dataset that contains duplicate values:

+------+-------+-------+
| Name | Score | Grade |
+------+-------+-------+
| John | 80 | B |
| Jane | 90 | A |
| John | 85 | B |
| Jane | 95 | A |
| Mary | 75 | C |
| John | 90 | A |
+------+-------+-------+

Bash
COPY
Removing Duplicates and Keeping the Last Occurrence in PySpark
The dropDuplicates function removes duplicate rows based on the specified column(s), but it
always keeps an arbitrary "first" occurrence and does not let you control which row survives. To
keep the last occurrence of each value instead, we can use a Window together with the
row_number() function. The Window function is used to define a window or a group of rows based
on one or more columns, and the row_number() function is used to assign a unique sequential
number to each row within the window. By ordering the rows within the window in descending
order based on a timestamp, an ID, or, as in this example, the Score column, row number 1
corresponds to the last (highest-ranked) occurrence of each value.
Here is an example code snippet that demonstrates how to remove duplicates and keep only one
row per Name (the one with the highest Score):

from pyspark.sql import SparkSession


from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
# create a SparkSession
spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()
# load the input data into a DataFrame
df = spark.createDataFrame([
("John", 80, "B"),
("Jane", 90, "A"),
("John", 85, "B"),
("Jane", 95, "A"),
("Mary", 75, "C"),
("John", 90, "A")
], ["Name", "Score", "Grade"])
# create a window function to partition the data by Name and order by Score in descending order
window = Window.partitionBy(df["Name"]).orderBy(df["Score"].desc())
# assign a sequential number to each row within the window
df_last = df.select("*", row_number().over(window).alias("row_number"))
# filter the rows that have the highest row_number within each partition
df_last = df_last.filter(df_last.row_number == 1).drop("row_number")
# show the result
df_last.show()

Python
COPY
Output

+----+-----+-----+
|Name|Score|Grade|
+----+-----+-----+
|Jane| 95| A|
|John| 90| A|
|Mary| 75| C|
+----+-----+-----+

Bash
COPY
As we can see, the output DataFrame contains only one row per unique Name value (the one with
the highest Score), and the Score and Grade columns are retained.
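An equivalent approach keeps the rows whose Score equals the per-Name maximum, which avoids the explicit row number. A minimal sketch reusing the df from the example above; note that ties on the maximum Score would keep more than one row.

from pyspark.sql.functions import max as max_, col
from pyspark.sql.window import Window

# compute the maximum Score per Name and keep only the rows that match it
w = Window.partitionBy("Name")
df_last_alt = (
    df.withColumn("max_score", max_("Score").over(w))
      .filter(col("Score") == col("max_score"))
      .drop("max_score")
)
df_last_alt.show()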
PySpark : Large dataset that does not fit into memory. How can you use PySpark to process this
dataset

Processing large datasets that do not fit into memory can be challenging for traditional
programming approaches. However, PySpark, a Python library for Apache Spark, offers a scalable
and distributed computing solution for processing large datasets. In this article, we will explore
how to use PySpark to process large datasets that do not fit into memory.
Introduction to PySpark
PySpark is a Python library for Apache Spark, an open-source distributed computing framework.
Apache Spark provides a unified engine for big data processing, including batch processing,
streaming, and machine learning. PySpark offers a simple and easy-to-use interface for Python
developers to access the powerful capabilities of Apache Spark.
PySpark provides two main abstractions for distributed data processing: Resilient Distributed
Datasets (RDDs) and DataFrames. RDDs are a distributed collection of objects that can be
processed in parallel across a cluster of computers. DataFrames are a distributed collection of
structured data that can be manipulated using SQL-like queries.
Loading Data into PySpark
To process large datasets in PySpark, we first need to load the data into PySpark. PySpark
supports reading data from various data sources, including Hadoop Distributed File System
(HDFS), Amazon S3, and local file systems.
To load data into PySpark, we can use the spark.read method, which returns a DataFrame. The
spark.read method supports various data formats, such as CSV, JSON, and Parquet. For example,
to load a CSV file into PySpark, we can use the following code:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("LoadData").getOrCreate()

# load a CSV file into a DataFrame


df = spark.read.csv("path/to/csv/file")

Python
COPY
Processing Data in PySpark
Once we have loaded the data into PySpark, we can process it using various PySpark functions.
PySpark provides a rich set of built-in functions for data manipulation, including filtering,
aggregation, and joining.
For example, to filter the rows in a DataFrame that satisfy a certain condition, we can use the
filter function. The filter function takes a boolean expression as its argument and returns a new
DataFrame that contains only the rows that satisfy the expression. For example, to filter the rows
in the df DataFrame that have a value greater than 10 in the age column, we can use the following
code:

# filter the rows in the DataFrame


filtered_df = df.filter(df.age > 10)

Python
COPY
To aggregate the data in a DataFrame, we can use the groupBy function, which groups the data
by one or more columns and applies an aggregation function, such as sum, count, or avg, to each
group. For example, to calculate the total number of rows in the df DataFrame grouped by the
gender column, we can use the following code:

# group the data by the gender column and count the number of rows in each group
grouped_df = df.groupBy("gender").count()

Python
COPY
To join two DataFrames, we can use the join function, which joins the DataFrames based on a
common column. For example, to join the df1 and df2 DataFrames on the id column, we can use
the following code:

# join the df1 and df2 DataFrames on the id column


joined_df = df1.join(df2, "id")

Python
COPY
Processing Large Datasets with PySpark
To process large datasets that do not fit into memory with PySpark, we can take advantage of
PySpark’s distributed computing capabilities. PySpark automatically distributes the data across
multiple nodes in a cluster and processes the data in parallel. This allows us to process large
datasets that would otherwise be impossible to handle with a single machine.
One way to distribute the data in PySpark is to partition the data into smaller chunks. Each
partition is processed independently by a worker node in the cluster. PySpark provides various
functions for partitioning the data, such as repartition and coalesce.
For example, to repartition the data in the df DataFrame into four partitions, we can use the
following code:

# repartition the data into four partitions


repartitioned_df = df.repartition(4)

Python
COPY
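The coalesce function mentioned above reduces the number of partitions without a full shuffle, which is useful just before writing output. A minimal sketch, reusing repartitioned_df from the previous snippet (the output path is a placeholder):

# reduce the DataFrame to a single partition without a full shuffle, e.g. to write one output file
coalesced_df = repartitioned_df.coalesce(1)

# write the result out (path is a placeholder)
coalesced_df.write.mode("overwrite").csv("path/to/output")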
We can also use PySpark’s caching mechanism to improve the performance of iterative
algorithms that access the same data multiple times. Caching stores the data in memory or on
disk, so that subsequent accesses to the data are faster.
To cache a DataFrame, we can use the cache or persist function. For example, to cache the df
DataFrame in memory, we can use the following code:

# cache the DataFrame in memory


df.cache()

Python
COPY
It is important to note that caching can consume a significant amount of memory or disk space,
so it should be used judiciously.
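For finer control than cache(), persist() accepts an explicit storage level, and unpersist() releases the data when it is no longer needed. A minimal sketch:

from pyspark import StorageLevel

# keep the DataFrame in memory, spilling to disk if it does not fit
df.persist(StorageLevel.MEMORY_AND_DISK)

# run one or more actions that reuse the cached data
df.count()

# release the cached data once it is no longer needed
df.unpersist()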
PySpark provides a powerful and scalable solution for processing large datasets that do not fit
into memory. By distributing the data across multiple nodes in a cluster and processing the data
in parallel, PySpark allows us to handle datasets that would otherwise be impossible to handle
with a single machine. PySpark also provides a rich set of functions for data manipulation,
including filtering, aggregation, and joining. By using PySpark’s partitioning and caching
mechanisms, we can further improve the performance of our data processing tasks.
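Putting these pieces together, a typical out-of-core job reads from distributed storage, transforms the data in parallel, and writes the result back without ever collecting it to the driver. The sketch below reuses the column names from the earlier examples (age, gender) and placeholder paths, so treat it as an illustration rather than a ready-made pipeline.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("LargeDatasetPipeline").getOrCreate()

# read the large dataset directly from distributed storage (path is a placeholder)
df = spark.read.csv("path/to/large/dataset", header=True, inferSchema=True)

# filter and aggregate in parallel across the cluster (assumed columns: age, gender)
result = (
    df.filter(df.age > 10)
      .groupBy("gender")
      .agg(avg("age").alias("avg_age"))
)

# write the (now small) aggregated result back to storage instead of collecting it to the driver
result.write.mode("overwrite").parquet("path/to/output")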
PySpark : RowMatrix in PySpark : Distributed matrix consisting of rows

RowMatrix is a class in PySpark’s MLLib library that represents a distributed matrix consisting of
rows. Each row in the matrix is represented as a SparseVector or DenseVector object. RowMatrix
is useful for performing operations on large datasets that are too big to fit into memory on a
single machine.
Creating a RowMatrix in PySpark
To create a RowMatrix in PySpark, you first need to create an RDD of vectors. You can create an
RDD of DenseVectors by calling the parallelize() method on a list of DenseVector objects:

!pip install pyspark


from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors, DenseVector, SparseVector
from pyspark.mllib.linalg.distributed import RowMatrix, IndexedRowMatrix, CoordinateMatrix
# create a SparkSession object
spark = SparkSession.builder.appName("RowMatrixExample").getOrCreate()
# create an RDD of vectors
data = [
DenseVector([1.0, 2.0, 3.0]),
DenseVector([4.0, 5.0, 6.0]),
DenseVector([7.0, 8.0, 9.0])
]
rdd = spark.sparkContext.parallelize(data)
row_matrix = RowMatrix(rdd)

Python
COPY
The rows can also be SparseVector objects, built from the indices and values of the non-zero
elements. Beyond construction, a RowMatrix can be converted to other distributed matrix types;
for example, the following code converts it to an IndexedRowMatrix, transposes it via a
CoordinateMatrix, and computes column similarities:

# convert the RowMatrix to an IndexedRowMatrix, transpose it via a CoordinateMatrix, and compute column similarities
indexed_matrix = IndexedRowMatrix(row_matrix.rows.zipWithIndex().map(lambda x: (x[1],
x[0])))
coordinate_matrix = indexed_matrix.toCoordinateMatrix()
transposed_coordinate_matrix = coordinate_matrix.transpose()
col_stats = transposed_coordinate_matrix.toIndexedRowMatrix().columnSimilarities()

# print the column similarities (a CoordinateMatrix)


print(col_stats)

Python
COPY
This example creates a SparkSession object, creates an RDD of vectors, and creates a RowMatrix
object from the RDD. It then converts the RowMatrix to an IndexedRowMatrix using the
IndexedRowMatrix() constructor and the zipWithIndex() method, converts the
IndexedRowMatrix to a CoordinateMatrix using the toCoordinateMatrix() method, transposes
the CoordinateMatrix using the transpose() method, and converts the resulting transposed
CoordinateMatrix back to an IndexedRowMatrix using the toIndexedRowMatrix() method.
Finally, it computes the column similarities of the transposed IndexedRowMatrix using the
columnSimilarities() method and prints the result.
Compute Singular Value Decomposition (SVD)
To compute the Singular Value Decomposition (SVD) of a RowMatrix, you call the computeSVD()
method.  The SVD is a factorization of a matrix into three matrices: U, S, and V. The U matrix is a
unitary matrix whose columns are the left singular vectors of the original matrix. The V matrix is a
unitary matrix whose columns are the right singular vectors of the original matrix. The S matrix is
a diagonal matrix whose diagonal entries are the singular values of the original matrix.

svd = row_matrix.computeSVD(k=2, computeU=True)


U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are returned as a local DenseVector.
V = svd.V # The V factor is a local dense matrix.

Python
COPY
The k parameter specifies the number of singular values to compute. The computeU parameter
specifies whether to compute the U matrix.
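One common use of the factors is to project the original rows onto the top-k right singular vectors for dimensionality reduction. A minimal sketch reusing row_matrix and the svd computed above; RowMatrix.multiply takes a local matrix, so the V factor can be passed directly.

# project each row of the original matrix onto the top-k right singular vectors
projected = row_matrix.multiply(svd.V)

# the result is another distributed RowMatrix with k columns
print(projected.numRows(), projected.numCols())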
PySpark : cannot import name ‘RowMatrix’ from ‘pyspark.ml.linalg’
This error occurs because the RowMatrix class belongs to the older, RDD-based MLlib API and
lives in the pyspark.mllib.linalg.distributed module, not in pyspark.ml.linalg.
The newer DataFrame-based pyspark.ml.linalg module only provides local matrix types such as
DenseMatrix and SparseMatrix, so importing RowMatrix from it fails with the error above. If you
need the distributed RowMatrix, import it with from pyspark.mllib.linalg.distributed import RowMatrix.
If a local (non-distributed) matrix of row vectors is sufficient, you can use the DenseMatrix or
SparseMatrix classes from pyspark.ml.linalg instead. Here's an example of how to build a
DenseMatrix from an RDD of row vectors:

from pyspark.sql import SparkSession


from pyspark.ml.linalg import Vectors, DenseMatrix
# create a SparkSession object
spark = SparkSession.builder.appName("DenseMatrixExample").getOrCreate()
# create an RDD of row vectors
rows = spark.sparkContext.parallelize([
Vectors.dense([1.0, 2.0, 3.0]),
Vectors.dense([4.0, 5.0, 6.0]),
Vectors.dense([7.0, 8.0, 9.0])
])
# collect the row vectors as a list of arrays
data = rows.map(lambda x: x.toArray()).collect()
# flatten the values in row-major order; DenseMatrix expects a flat list of numRows * numCols values
flat_values = [float(v) for row in data for v in row]
# create a DenseMatrix (isTransposed=True tells it the values are in row-major order)
mat = DenseMatrix(numRows=len(data), numCols=len(data[0]), values=flat_values, isTransposed=True)
# print the DenseMatrix
print(mat)

Python
COPY
In this example, we first create an RDD of row vectors using the parallelize method. We then
collect the row vectors as a list of arrays using the map and toArray methods and flatten them
into a single row-major list of values. We create a DenseMatrix from the flattened values using
the DenseMatrix constructor (passing isTransposed=True so the values are read in row-major
order). Finally, we print the DenseMatrix.
PySpark : Py4JJavaError: An error occurred while calling o46.computeSVD.

The error message “Py4JJavaError: An error occurred while calling o46.computeSVD” usually
occurs when there is an issue with the singular value decomposition (SVD) computation in
PySpark. The computeSVD method is used to compute the SVD of a distributed matrix in PySpark.
Here are some common reasons why this error may occur and how to resolve them:
Insufficient Memory: The SVD computation is a memory-intensive operation, and if there is not
enough memory available, the computation may fail with an error. You can try increasing the
memory allocated to the PySpark driver or executor to resolve this issue.

from pyspark import SparkConf

conf = SparkConf().setAppName("MyApp").set("spark.driver.memory", "4g").set("spark.executor.memory", "2g")

Python
COPY
Incorrect Number of Singular Values: The computeSVD method takes an argument k which
specifies the number of singular values to compute. If you set k to a value that is larger than the
number of rows or columns in the matrix, the computation may fail with an error. Make sure that
k is set to a value that is less than or equal to the minimum of the number of rows and columns in
the matrix.

svd = mat.computeSVD(k, computeU=True)

Python
COPY
Unsupported Data Types: The computeSVD method only works with matrices whose elements
are of numeric data types. If your matrix contains non-numeric data types, the computation may
fail with an error. Make sure that your matrix only contains numeric data types before calling the
computeSVD method.

from pyspark.mllib.linalg import Matrices


# create a dense matrix of floating point values
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mat = Matrices.dense(3, 2, [elem for sublist in data for elem in sublist])

Python
COPY
Unstable Matrix: The SVD computation can fail if the matrix is unstable or has a high condition
number. In such cases, you may need to preprocess the matrix to make it more stable before
calling the computeSVD method.

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler

# create an RDD of row vectors (StandardScaler works on an RDD of vectors, not a local matrix;
# this assumes an existing SparkSession named spark)
rows = spark.sparkContext.parallelize([
    Vectors.dense([1.0, 2.0]),
    Vectors.dense([3.0, 4.0]),
    Vectors.dense([5.0, 6.0])
])
# standardize the rows (zero mean, unit variance) to make the matrix more stable
scaler = StandardScaler(withMean=True, withStd=True).fit(rows)
rows_std = scaler.transform(rows)

Python
COPY
In this example, we first create an RDD of row vectors holding the floating-point values. We then
use the StandardScaler class to standardize the rows and make the matrix more stable. The
standardized rows are stored in rows_std, from which a RowMatrix can be built for the SVD computation.
PySpark : TypeError: Cannot convert type into Vector

The error message “TypeError: Cannot convert type <class ‘pyspark.ml.linalg.DenseVector’> into
Vector” usually occurs when you are trying to use an instance of pyspark.ml.linalg.DenseVector in
a place where PySpark is expecting an instance of pyspark.mllib.linalg.Vector.
This error occurs because pyspark.ml.linalg.DenseVector and pyspark.mllib.linalg.Vector are two
different vector classes in PySpark and are not interchangeable. pyspark.ml.linalg.DenseVector is
a newer vector class introduced in PySpark version 2.0, whereas pyspark.mllib.linalg.Vector is an
older vector class that was used in earlier versions of PySpark.
To resolve this error, you can convert the pyspark.ml.linalg.DenseVector instance to a
pyspark.mllib.linalg.Vector instance before using it in the code that is causing the error. You can
do this using the fromML() method provided by the pyspark.mllib.linalg.Vectors factory class.
Here’s an example of how to convert a pyspark.ml.linalg.DenseVector instance to a
pyspark.mllib.linalg.Vector instance:

from pyspark.ml.linalg import DenseVector


from pyspark.mllib.linalg import Vectors
# Create a DenseVector instance
dense_vector = DenseVector([1.0, 2.0, 3.0])
# Convert the DenseVector instance to a Vector instance
vector = Vectors.fromML(dense_vector)
# Now you can use the Vector instance wherever PySpark expects a Vector

Python
COPY
In this example, we first create a pyspark.ml.linalg.DenseVector instance called dense_vector. We
then convert it to a pyspark.mllib.linalg.Vector instance called vector using the fromML() method
provided by pyspark.mllib.linalg.Vectors. Finally, we can use the vector instance
wherever PySpark expects a pyspark.mllib.linalg.Vector instance.
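The reverse conversion, from the older MLlib vector type to the newer ML vector type, is also available through the asML() method on MLlib vectors. A minimal sketch:

from pyspark.mllib.linalg import Vectors as MLlibVectors

# create an MLlib DenseVector and convert it to a pyspark.ml.linalg DenseVector
mllib_vector = MLlibVectors.dense([1.0, 2.0, 3.0])
ml_vector = mllib_vector.asML()

print(type(ml_vector))  # pyspark.ml.linalg.DenseVector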
PySpark : Dropping duplicate rows in Pyspark – A Comprehensive Guide with example

PySpark provides several methods to remove duplicate rows from a dataframe. In this article, we
will go over the steps to drop duplicate rows in Pyspark.
First, let’s create a sample dataframe with 5 columns. We will use the createDataFrame() method
of the SparkSession object to create a dataframe.

from pyspark.sql import SparkSession


from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName("drop-duplicates").getOrCreate()
# Create a sample dataframe with 5 columns
data = [("Sachin", 25, "M", "New York", 20000),
("Sharry", 30, "F", "San Francisco", 30000),
("Mandid", 35, "M", "Los Angeles", 40000),
("Sachin", 25, "M", "New York", 20000),
("Sharry", 30, "F", "San Francisco", 30000)]
schema = StructType([
StructField("Name", StringType(), True),
StructField("Age", IntegerType(), True),
StructField("Gender", StringType(), True),
StructField("City", StringType(), True),
StructField("Salary", IntegerType(), True)
])
df = spark.createDataFrame(data, schema=schema)
df.show()

Python
COPY
Output

+------+---+------+-------------+------+
| Name|Age|Gender| City|Salary|
+------+---+------+-------------+------+
|Sachin| 25| M| New York| 20000|
|Sharry| 30| F|San Francisco| 30000|
|Mandid| 35| M| Los Angeles| 40000|
|Sachin| 25| M| New York| 20000|
|Sharry| 30| F|San Francisco| 30000|
+------+---+------+-------------+------+

Bash
COPY
As you can see, there are some duplicate rows in the dataframe. Now, let’s drop these duplicate
rows.
Method 1: Using dropDuplicates()
The simplest way to drop duplicate rows in Pyspark is to use the dropDuplicates() method. This
method returns a new dataframe with the duplicate rows removed.

# Drop duplicate rows using dropDuplicates() method


df = df.dropDuplicates()
df.show()

Python
COPY
Output : Duplicate rows have been removed.

+------+---+------+-------------+------+
| Name|Age|Gender| City|Salary|
+------+---+------+-------------+------+
|Sachin| 25| M| New York| 20000|
|Mandid| 35| M| Los Angeles| 40000|
|Sharry| 30| F|San Francisco| 30000|
+------+---+------+-------------+------+

Bash
COPY
Method 2: Using groupBy() and agg() functions
Another way to drop duplicate rows is to use the groupBy() and agg() functions. This method
groups the dataframe by all the columns and then aggregates the data using any aggregation
function, such as first() or last(). This method is useful when you want to retain only one row for
each combination of column values.

# Drop duplicate rows using groupBy() and agg() functions


df = df.groupBy(df.columns)\
.agg(first(df.columns[0]), first(df.columns[1]), first(df.columns[2]), first(df.columns[3]),
first(df.columns[4]))
df.show()

Python
COPY
The extra first(...) columns can be dropped if they are not required; they are kept here only to illustrate how the aggregation works.

+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+
| Name|Age|Gender| City|Salary|first(Name)|first(Age)|first(Gender)| first(City)|first(Salary)|
+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+
|Sharry| 30| F|San Francisco| 30000| Sharry| 30| F|San Francisco| 30000|
|Mandid| 35| M| Los Angeles| 40000| Mandid| 35| M| Los Angeles| 40000|
|Sachin| 25| M| New York| 20000| Sachin| 25| M| New York| 20000|
+------+---+------+-------------+------+-----------+----------+-------------+-------------+-------------+

Python
COPY
As you can see, the duplicate rows have been removed and only one row is retained for each
combination of column values.
PySpark : Replacing null values in a PySpark dataframe column with 0 or any value you wish.

To replace null values in a PySpark DataFrame column with a numeric value (e.g., 0), you can use
the na.fill() method, which replaces all null values in a DataFrame with a specified value, or a
conditional expression built with when() and otherwise(), as shown below.
In this example, we create a PySpark DataFrame with 3 columns: "id", "name", and "age". The
"id" and "age" columns are of IntegerType, and the "name" column is of StringType. We also include some
sample data, including a null value in the "age" column, to demonstrate how to handle null
values in PySpark.

from pyspark.sql import SparkSession


from pyspark.sql.functions import when, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Create a SparkSession
spark = SparkSession.builder.appName("None to 0 pyspark @ Freshers.in").getOrCreate()
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True)
])
# Create a list of tuples containing sample data
data = [
(1, "Barry", 25),
(2, "Charlie", 30),
(3, "Marrie", 35),
(4, "Gold", None),
(5, "Twinkle", 28)
]
# Create a DataFrame from the list of tuples and the schema
df = spark.createDataFrame(data, schema)
# Show the DataFrame
df.show()
df = df.withColumn("age", when(col("age").isNull(), 0).otherwise(col("age")))
df.show()

Python
COPY
In this example, the DataFrame is named df and the column whose null values we replace with 0 is age.
Here, when(col(“age”).isNull(), 0) creates a conditional expression that checks if the value in the
age column is null. If it is null, it replaces it with the integer value 0. Otherwise, it leaves the value
unchanged. The otherwise(col(“age”)) function is used to ensure that the original value is
retained for any non-null values.
The withColumn() method is used to apply the above transformation to the age column in the df
DataFrame. The resulting DataFrame will have null values in the age column replaced with the
integer value 0.
Output before changing 

+---+-------+----+
| id| name| age|
+---+-------+----+
| 1| Barry| 25|
| 2|Charlie| 30|
| 3| Marrie| 35|
| 4| Gold|null|
| 5|Twinkle| 28|
+---+-------+----+

Bash
COPY
Output after changing 

+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Barry| 25|
| 2|Charlie| 30|
| 3| Marrie| 35|
| 4| Gold| 0|
| 5|Twinkle| 28|
+---+-------+---+

PySpark : unix_timestamp function – A comprehensive guide



One of the key functionalities of PySpark is the ability to transform data into the desired format.
In some cases, it is necessary to convert a date or timestamp into a numerical format for further
analysis. The unix_timestamp() function is a PySpark function that helps to convert a date or
timestamp to a Unix timestamp, which is a numerical representation of time.
In this article, we will discuss the unix_timestamp() function in PySpark in detail, including its
syntax, parameters, and examples.
Syntax
The unix_timestamp() function in PySpark has the following syntax:

pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

The timestamp parameter is the date or timestamp that you want to convert to a Unix timestamp.
The format parameter is optional and specifies the format of the input timestamp.
If the format parameter is not specified, PySpark uses the default format, which is "yyyy-MM-dd HH:mm:ss".
Parameters
The unix_timestamp() function in PySpark takes two parameters:
timestamp: This is the date or timestamp that you want to convert to Unix timestamp. The
timestamp can be a string or a column of a DataFrame.
format: This is an optional parameter that specifies the format of the input timestamp. The
format should be a string that conforms to the date format pattern syntax. If this parameter is
not specified, PySpark will use the default format, which is “yyyy-MM-dd HH:mm:ss”.
Examples
Let’s look at some examples to understand how the unix_timestamp() function works in
PySpark.
Example 1: Converting a Date to a Unix Timestamp
Suppose we have a date string "2023-04-08" that we want to convert to a Unix timestamp. With the
session time zone set to America/Los_Angeles, we can use the unix_timestamp() function as follows:

from pyspark.sql.functions import unix_timestamp


spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
time_df = spark.createDataFrame([('2023-04-08',)], ['dt'])
time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()

Python
COPY
In this example, we pass the timestamp as a string to the unix_timestamp() function. The
function returns the Unix timestamp of the input timestamp.
Output:

[Row(unix_time=1680937200)]

Bash
COPY
Example 2: Converting a Timestamp with a Custom Format to Unix Timestamp
Suppose we want to parse the same date string but specify its format explicitly. We can use the
unix_timestamp() function and pass the format parameter as follows:

from pyspark.sql.functions import unix_timestamp


spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
time_df = spark.createDataFrame([('2023-04-08',)], ['dt'])
format = "yyyy-MM-dd"
time_df.select(unix_timestamp('dt', format).alias('unix_time')).collect()
Python
COPY
Output:

[Row(unix_time=1680937200)]

Bash
COPY
In this example, we pass the timestamp and format parameters to the unix_timestamp()
function. The function returns the Unix timestamp of the input timestamp using the specified
format.
Example 3: Converting a Timestamp Column to Unix Timestamp Column in a DataFrame
Suppose we have a DataFrame that contains a timestamp column “timestamp” that we want to
convert to a Unix timestamp column. We can use the unix_timestamp() function with the col()
function to do this as follows:

from pyspark.sql.functions import unix_timestamp, col


from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("UnixTimestamp @Freshers.in ").getOrCreate()
df = spark.createDataFrame([
("2023-04-24 12:30:00",),
("2023-04-24 13:30:00",),
("2023-04-24 14:30:00",),
("2023-04-24 15:30:00",)
], ["timestamp"])
df = df.withColumn("unix_timestamp", unix_timestamp(col("timestamp")))
df.show()

Python
COPY
Output

+-------------------+--------------+
|          timestamp|unix_timestamp|
+-------------------+--------------+
|2023-04-24 12:30:00|    1682364600|
|2023-04-24 13:30:00|    1682368200|
|2023-04-24 14:30:00|    1682371800|
|2023-04-24 15:30:00|    1682375400|
+-------------------+--------------+

Bash
COPY
In this example, we first create a DataFrame with a timestamp column “timestamp”. We then use
the withColumn() function to add a new column “unix_timestamp” to the DataFrame, which
contains the Unix timestamp of the “timestamp” column. We use the col() function to refer to
the “timestamp” column in the DataFrame.
The unix_timestamp() function is a useful PySpark function for converting a date or timestamp to
Unix timestamp. In this article, we discussed the syntax and parameters of the unix_timestamp()
function, as well as provided some examples of how to use the function. The unix_timestamp()
function is a powerful tool for transforming data and can be used in various data processing
scenarios.
PySpark : Reading parquet file stored on Amazon S3 using PySpark

To read a Parquet file stored on Amazon S3 using PySpark, you can use the following code:

from pyspark.sql import SparkSession


# create a SparkSession
spark = SparkSession.builder \
.appName("Read S3 Parquet file") \
.getOrCreate()
# set S3 credentials if necessary
spark.conf.set("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
spark.conf.set("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
# read parquet file from S3
df = spark.read.parquet("s3a://freshers_bkt/training/view_country/parquet_file")
# show data
df.show()

Python
COPY
If your environment is already configured with S3 access (for example through an instance profile
or IAM role), you can remove the spark.conf.set lines and read the data directly with
spark.read.parquet(); just make sure the path uses the s3a:// scheme.
In this code, you first create a SparkSession. Then, you can set the S3 credentials if necessary
using the spark.conf.set() method. Finally, you can read the Parquet file from S3 using the
spark.read.parquet() method and passing the S3 path of the file as an argument. Once you have
read the file, you can use the df.show() method to display the data.
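Writing a DataFrame back to S3 works the same way through the s3a scheme. A minimal sketch (the bucket and path below are placeholders):

# write the DataFrame back to S3 as Parquet, overwriting any existing output at that path
df.write.mode("overwrite").parquet("s3a://freshers_bkt/training/view_country/output_parquet")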
PySpark : Setting PySpark parameters – A complete Walkthru [3 Ways]

In PySpark, you can set various parameters to configure your Spark application. These
parameters can be set in different ways depending on your use case. Here are a few examples:
Setting Spark properties programmatically:
You can set Spark properties directly in your PySpark code using the SparkConf object. Here’s an
example:

from pyspark.conf import SparkConf


from pyspark.sql import SparkSession
conf = SparkConf()
conf.setAppName("MyPySparkApp")
conf.setMaster("local[2]")
conf.set("spark.executor.memory", "32g")
conf.set("spark.driver.memory", "32g")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

Python
COPY
In this example, we create a SparkConf object and set several properties using the set method.
We set the application name, the master URL to run the application locally with 2 cores, and the
amount of memory allocated for both the driver and the executor to 32GB. Finally, we pass the
conf object to the SparkSession builder using the config method to create a Spark session.
Setting Spark properties through spark-defaults.conf file:
You can also set Spark properties by creating a spark-defaults.conf file and placing it in the conf
directory of your Spark installation. In this file, you can specify Spark properties and their values,
one per line. For example:

spark.app.name=MyPySparkApp
spark.master=local[2]
spark.executor.memory=32g
spark.driver.memory=32g

Python
COPY
In this case, we set the same properties as in the previous example, but through a configuration
file. Note that you need to make sure the spark-defaults.conf file is properly placed and
configured to take effect.
Setting Spark properties through command-line options:
You can also set Spark properties through command-line options when you run your PySpark
application. For example:

pyspark --conf spark.app.name=MyPySparkApp --conf spark.master=local[2] \
  --conf spark.executor.memory=32g --conf spark.driver.memory=32g

Bash
COPY
In this case, we use the --conf option followed by the property name and value to set the same
properties as in the previous examples.
PySpark : Using CASE WHEN for Spark SQL to conditionally execute expressions : Dataframe
and SQL way explained

The WHEN clause is used in Spark SQL to conditionally execute expressions. It’s similar to a CASE
statement in SQL and can be used to perform conditional logic in your Spark SQL queries.
Here’s the syntax of the WHEN clause:

CASE WHEN condition THEN value [WHEN condition2 THEN value2 ...] [ELSE default_value] END

Bash
COPY
The condition is a Boolean expression that evaluates to either true or false. If the condition is
true, then the value is returned. If the condition is false, then the next WHEN clause (if any) is
evaluated. If none of the WHEN clauses evaluate to true, then the ELSE clause (if any) is
executed.
Example 
Let our data be 
+------+---+
| name |age|
+------+---+
|Sachin| 12|
|Barry | 25|
|Suzy | 72|
+------+---+

Bash
COPY
We can use the WHEN clause to add an age_group column to the table that classifies each person
as a child, adult, or senior based on their age:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when


spark = SparkSession.builder.appName("WHEN Clause Freshers.in").getOrCreate()
data = [("Sachin", 12), ("Barry", 25), ("Suzy", 72)]
df = spark.createDataFrame(data, ["name", "age"])

df.withColumn("age_group",
when(df["age"] < 18, "Child")
.when(df["age"] < 65, "Adult")
.otherwise("Senior")
).show()

Python
COPY
Output

+------+---+---------+
|  name|age|age_group|
+------+---+---------+
|Sachin| 12|    Child|
| Barry| 25|    Adult|
|  Suzy| 72|   Senior|
+------+---+---------+

Bash
COPY
In this example, we’re using the withColumn function to add a new column named age_group to
the df DataFrame. We’re using the when function to determine the age_group of each person
based on their age. If a person is under 18, they’re classified as a child. If they’re between 18 and
65, they’re classified as an adult. If they’re 65 or older, they’re classified as a senior.
Note that we’re using the otherwise function to specify what value to return if none of the when
conditions evaluate to true. In this example, if a person’s age is greater than or equal to 65, the
otherwise function is executed and returns the value ‘Senior’.
To do the same using Spark SQL, first register a temporary view from the DataFrame created in the
program above:
df.createOrReplaceTempView("people_temp")

Python
COPY
The below is the SQL that you need to execute thru Spark

SELECT name, age,


CASE
WHEN age < 18 THEN 'Child'
WHEN age < 65 THEN 'Adult'
ELSE 'Senior' END AS age_group
FROM people_temp

SQL
COPY
Spark Code

spark.sql("SELECT name, age, CASE WHEN age < 18 THEN 'Child' WHEN age < 65 THEN 'Adult'
ELSE 'Senior' END AS age_group FROM people_temp").show()

SQL
COPY
This will output:

+------+---+---------+
|  name|age|age_group|
+------+---+---------+
|Sachin| 12|    Child|
| Barry| 25|    Adult|
|  Suzy| 72|   Senior|
+------+---+---------+

Bash
COPY
In this example, we're using the CASE WHEN clause to determine the age_group of each person
in the people_temp view based on their age. If a person is under 18, they're classified as a child. If
they’re between 18 and 65, they’re classified as an adult. If they’re 65 or older, they’re classified
as a senior.
Note that the WHEN clause is part of a larger expression that includes an ELSE clause. The ELSE
clause specifies what value to return if none of the WHEN conditions evaluate to true. In this
example, if a person’s age is greater than or equal to 65, the ELSE clause is executed and returns
the value ‘Senior’.
You can also use multiple WHEN clauses in a single expression, as shown in the example above.
The WHEN clauses are evaluated in order, so it’s important to order them appropriately to ensure
that the correct value is returned.
CASE WHEN  clause in Spark SQL is a useful tool for performing conditional logic in your queries.
By using the WHEN clause with ELSE, you can specify what value to return based on the result of
the condition. This can help you to write more complex queries that can handle a variety of
scenarios.
 POSTED INSPARK
Spark : Calculation of executor memory in Spark – A complete info.

The executor memory is the amount of memory allocated to each executor in a Spark cluster. It
determines the amount of data that can be processed in memory and can significantly affect the
performance of your Spark applications. Therefore, it’s important to carefully calculate the
amount of executor memory that your Spark applications need.
To calculate the executor memory, you need to consider the following factors:
1. Available cluster resources: The amount of memory available in your cluster should be
considered when calculating the executor memory. You don’t want to allocate more
memory than what’s available, as it can lead to performance issues or even failures.
2. Application requirements: The amount of executor memory required by your Spark
application depends on the size of your data and the complexity of your processing logic.
For example, if you’re processing a large dataset or performing complex computations,
you may need more executor memory.
3. Overhead: Spark needs some memory overhead to manage tasks and shuffle data. You
should allocate enough memory for overhead to ensure that your application doesn’t run
out of memory.
Here’s the formula to calculate the executor memory:
executor_memory  =   ( total_memory * 0.8 - memory_overhead )  / num_executors
where:
1. total_memory is the total memory available in your cluster. You can get this information
from your cluster manager, such as YARN or Mesos.
2. memory_overhead is the amount of memory allocated for Spark overhead. The default
value is 10% of the executor memory, but you can adjust it using the
spark.yarn.executor.memoryOverhead or spark.executor.memoryOverhead configuration
properties.
3. num_executors is the number of executors that you want to run in your Spark application.
You can adjust it using the spark.executor.instances configuration property.
For example, let's say you have a cluster with 100 GB of memory, you want to run 4 executors, and
you estimate the total memory overhead at 4 GB. To calculate the executor memory, you can use the
formula:
executor_memory = (100 GB * 0.8 - 4 GB) / 4 = 19 GB
This means that you should allocate roughly 19 GB of memory to each executor to ensure optimal
performance.
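As a minimal sketch, the same back-of-the-envelope calculation can be expressed in Python (the 0.8 usable-memory factor and the 4 GB overhead estimate are the assumptions from the example above):

def executor_memory_gb(total_memory_gb, memory_overhead_gb, num_executors):
    """Rough executor memory estimate: (usable memory - overhead) / executors."""
    usable_gb = total_memory_gb * 0.8   # keep ~20% aside for the OS and other services
    return (usable_gb - memory_overhead_gb) / num_executors

# 100 GB cluster, ~4 GB total overhead, 4 executors -> 19.0 GB per executor
print(executor_memory_gb(100, 4, 4))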
Calculating the executor memory in Spark is an important task to ensure that your applications
run efficiently and avoid out-of-memory errors. By taking into account the available cluster
resources, application requirements, and overhead, you can determine the optimal amount of
executor memory for your Spark applications.
Another example
If we want to provide each executor with 5 GB of memory, we must request 5 GB plus max(384 MB,
10% of 5 GB) of off-heap overhead, which comes to roughly 5.5 GB.
spark.executor.memory=5.5g
spark.memory.fraction=0.6
spark.memory.storageFraction=0.5
 Java heap memory: 5 GB (5 * 1024 MB = 5120 MB)
 Reserved memory: 300 MB
 Usable memory: 5120 MB - 300 MB = 4820 MB
 Spark memory: usable memory * spark.memory.fraction = 4820 MB * 0.6 = 2892 MB
 Spark storage memory: Spark memory * spark.memory.storageFraction = 2892 MB * 0.5 = 1446 MB
 Spark execution memory: Spark memory * (1.0 - spark.memory.storageFraction) = 2892 MB * 0.5 = 1446 MB
 User memory: usable memory * (1.0 - spark.memory.fraction) = 4820 MB * 0.4 = 1928 MB


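For reference, a minimal sketch of how these settings could be applied when building a SparkSession; the property names are standard Spark configuration keys, and the values simply mirror the example above:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-memory-example")
    .config("spark.executor.memory", "5g")             # on-heap memory per executor
    .config("spark.executor.memoryOverhead", "512m")   # max(384 MB, 10% of 5 GB) off-heap overhead
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)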
PySpark : PySpark program to write DataFrame to Snowflake table.

Overview of Snowflake and PySpark.


Snowflake is a cloud-based data warehousing platform that allows users to store and analyze
large volumes of data using SQL. It provides high-performance, scalable, and secure storage and
analytics solutions. Snowflake is highly compatible with various programming languages,
including Python.
PySpark is the Python API for Apache Spark, a fast and general-purpose cluster computing
system. PySpark provides a simple and efficient way to process large amounts of data in parallel.
To write DataFrame to a Snowflake table using PySpark, we need to perform the following steps:
1. Configure Spark to use the Snowflake JDBC driver
2. Create a Spark session and load data into a DataFrame
3. Write the DataFrame to a Snowflake table using the Snowflake connector
Here’s an example program that demonstrates how to write DataFrame to Snowflake table using
PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col
import os

# Set Snowflake credentials as environment variables


os.environ['SNOWFLAKE_USER'] = 'username'
os.environ['SNOWFLAKE_PASSWORD'] = 'password'
os.environ['SNOWFLAKE_ACCOUNT'] = 'account_name'
os.environ['SNOWFLAKE_DATABASE'] = 'database_name'
os.environ['SNOWFLAKE_SCHEMA'] = 'schema_name'
os.environ['SNOWFLAKE_WAREHOUSE'] = 'warehouse_name'

# Create Spark session


spark = SparkSession.builder.appName('SnowflakeWrite').getOrCreate()

# Define schema for DataFrame


schema = StructType([
StructField('id', IntegerType(), True),
StructField('name', StringType(), True),
StructField('age', IntegerType(), True),
StructField('city', StringType(), True)
])

# Load data into DataFrame


data = [
(1, 'John', 25, 'New York'),
(2, 'Jane', 30, 'San Francisco'),
(3, 'Bob', 40, 'Chicago')
]

df = spark.createDataFrame(data, schema)

# Write DataFrame to Snowflake table


# The option names below follow the Snowflake Spark connector (sf*) conventions and assume
# the spark-snowflake connector is on the classpath; older connector versions may require
# .format('net.snowflake.spark.snowflake') instead of the short name 'snowflake'.
df.write \
    .format('snowflake') \
    .option('sfURL', os.environ['SNOWFLAKE_ACCOUNT'] + '.snowflakecomputing.com') \
    .option('sfUser', os.environ['SNOWFLAKE_USER']) \
    .option('sfPassword', os.environ['SNOWFLAKE_PASSWORD']) \
    .option('sfDatabase', os.environ['SNOWFLAKE_DATABASE']) \
    .option('sfSchema', os.environ['SNOWFLAKE_SCHEMA']) \
    .option('sfWarehouse', os.environ['SNOWFLAKE_WAREHOUSE']) \
    .option('dbtable', 'table_name') \
    .mode('overwrite') \
    .save()

# Read the data from the Snowflake table


df_from_snowflake = spark.read \
    .format('snowflake') \
    .option('sfURL', os.environ['SNOWFLAKE_ACCOUNT'] + '.snowflakecomputing.com') \
    .option('sfUser', os.environ['SNOWFLAKE_USER']) \
    .option('sfPassword', os.environ['SNOWFLAKE_PASSWORD']) \
    .option('sfDatabase', os.environ['SNOWFLAKE_DATABASE']) \
    .option('sfSchema', os.environ['SNOWFLAKE_SCHEMA']) \
    .option('sfWarehouse', os.environ['SNOWFLAKE_WAREHOUSE']) \
    .option('dbtable', 'table_name') \
    .load()
# Show the data from the Snowflake table
df_from_snowflake.show()

Python
COPY
In this example program, we set the Snowflake credentials as environment variables, create a
Spark session, define a schema for the DataFrame, and load data into the DataFrame. We then
write the DataFrame to a Snowflake table using the Snowflake connector, specifying the table
name and Snowflake credentials. We also read the data from the Snowflake table and show it in
the console. The .option() method is used to specify various connection and configuration
options for the Snowflake connector, such as the Snowflake account, database, schema,
warehouse, and authentication credentials. Finally, we use the .show() method to display the
data from the Snowflake table.
Note that the mode parameter is set to 'overwrite' in the df.write statement. This means that if
the table already exists, it will be truncated and the new data will overwrite the existing data. If
you want to append data to an existing table, you can set mode to 'append' instead.
PySpark : LongType and ShortType data types in PySpark

pyspark.sql.types.LongType
pyspark.sql.types.ShortType
In this article, we will explore PySpark’s LongType and ShortType data types, their properties,
and how to work with them.
PySpark is a powerful data processing framework that allows users to work with large-scale
datasets. It provides several data types that allow users to represent data in different ways. Two
of these data types are LongType and ShortType.
LongType:
LongType is a data type in PySpark that represents signed 64-bit integer values. The range of
values that can be represented by a LongType variable is from -9223372036854775808 to
9223372036854775807. This data type is useful when working with large numbers, such as
timestamps or IDs.
ShortType:
ShortType is a data type in PySpark that represents signed 16-bit integer values. The range of
values that can be represented by a ShortType variable is from -32768 to 32767. This data type is
useful when working with small integers, such as counts or indices.
To use LongType or ShortType in PySpark, we need to import the LongType and ShortType
classes from the pyspark.sql.types module. Here is an example:

from pyspark.sql.types import LongType, ShortType

# create a LongType variable


long_var = LongType()

# create a ShortType variable


short_var = ShortType()

Python
COPY
Now that we have created LongType and ShortType variables, we can use them to define the
schema of a DataFrame. Here is an example:

from pyspark.sql.types import StructType, StructField

# define the schema of the DataFrame


schema = StructType([
    StructField("id", LongType(), True),
    StructField("age", ShortType(), True)
])

# create the DataFrame


data = [(1, 25), (2, 30), (3, 40)]
df = spark.createDataFrame(data, schema)

# show the DataFrame


df.show()

Python
COPY
In the above example, we define the schema of the DataFrame using the StructType and
StructField classes. The StructField class takes three arguments: the name of the field, the data
type, and a boolean value that indicates whether the field can be null or not. We then create the
DataFrame using the spark.createDataFrame() method and pass in the data and schema as
arguments.
We can perform various operations on LongType and ShortType variables in PySpark, such as
arithmetic operations and comparisons. Here is an example:

# create a DataFrame with LongType and ShortType columns


data = [(1, 100), (2, 200), (3, 300)]
df = spark.createDataFrame(data, ["id", "count"])

# perform arithmetic operations


df.withColumn("count_double", df["count"] * 2).show()
df.withColumn("count_sum", df["count"] + 100).show()

# perform comparisons
df.filter(df["count"] > 200).show()
df.filter(df["count"] == 200).show()

Python
COPY
In the above example, we create a DataFrame with LongType and ShortType columns and
perform arithmetic operations and comparisons on them. The withColumn() method adds a new
column to the DataFrame that is the result of an arithmetic operation on the existing columns.
The filter() method filters the rows of the DataFrame based on a comparison condition.
LongType and ShortType are useful data types in PySpark that allow users to represent large and
small integers, respectively. They can be used to define the schema of a DataFrame, perform
arithmetic operations and comparisons, and more.
PySpark : HiveContext in PySpark – A brief explanation

One of the key components of PySpark is the HiveContext, which provides a SQL-like interface to
work with data stored in Hive tables. The HiveContext provides a way to interact with Hive from
PySpark, allowing you to run SQL queries against tables stored in Hive. Hive is a data
warehousing system built on top of Hadoop, and it provides a way to store and manage large
datasets. By using the HiveContext, you can take advantage of the power of Hive to query and
analyze data in PySpark.
The HiveContext is created using the SparkContext, which is the entry point for PySpark. Once
you have created a SparkContext, you can create a HiveContext as follows:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sparkContext)
The HiveContext provides a way to create DataFrame objects from Hive tables, which can be
used to perform various operations on the data. For example, you can use the select method to
select specific columns from a table, and you can use the filter method to filter rows based on
certain conditions.
# create a DataFrame from a Hive table
df = hiveContext.table("my_table")
# select specific columns from the DataFrame
df.select("col1", "col2")
# filter rows based on a condition
df.filter(df.col1 > 10)
You can also create temporary tables in the HiveContext, which are not persisted to disk but can
be used in subsequent queries. To create a temporary table, you can use the registerTempTable
method:
# create a temporary table from a DataFrame
df.registerTempTable("my_temp_table")
# query the temporary table
hiveContext.sql("SELECT * FROM my_temp_table WHERE col1 > 10")
In addition to querying and analyzing data, the HiveContext also provides a way to write data
back to Hive tables. You can use the saveAsTable method to write a DataFrame to a new or
existing Hive table:
# write a DataFrame to a Hive table
df.write.saveAsTable("freshers_in_table")
The HiveContext in PySpark provides a powerful SQL-like interface for working with data stored
in Hive. It allows you to easily query and analyze large datasets, and it provides a way to write
data back to Hive tables. By using the HiveContext, you can take advantage of the power of Hive
in your PySpark applications.
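Putting these pieces together, here is a minimal end-to-end sketch. It assumes a Hive-enabled Spark installation, an existing Hive table called my_table with a numeric column col1, and an illustrative target table name; in Spark 2.x and later the same can be achieved with SparkSession.builder.enableHiveSupport().

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local", "HiveContext example")
hiveContext = HiveContext(sc)

# load an existing Hive table into a DataFrame (assumes my_table exists in Hive)
df = hiveContext.table("my_table")

# register a temporary table and query it with SQL
df.registerTempTable("my_temp_table")
result = hiveContext.sql("SELECT * FROM my_temp_table WHERE col1 > 10")
result.show()

# write the filtered result back to a new Hive table (illustrative name)
result.write.saveAsTable("my_table_filtered")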
PySpark: Explanation of PySpark Full Outer Join with example.

One of the most commonly used operations in PySpark is joining two dataframes together. Full
outer join is one of the four types of joins that can be performed in PySpark. In this article, we will
explain what a full outer join is and how to perform it in PySpark with a detailed example.
What is a Full Outer Join?
A full outer join combines the rows from both the left and right dataframes, filling in null values
where there is no match. In other words, it returns all the rows from both dataframes, and where
there are no matches, the columns from the other dataframe will be filled with null values.
To illustrate this concept, consider the following two dataframes:

df1:
+---+-------------+
| id|         name|
+---+-------------+
|  1|     John Bob|
|  2|    Alice Jim|
|  3|Charlie Barry|
+---+-------------+

df2:
+---+-----------------+
| id|            email|
+---+-----------------+
|  1|[email protected]|
|  4|[email protected]|
+---+-----------------+

Bash
COPY
A full outer join between df1 and df2 on the id column would result in the following dataframe:

+---+-------------+-----------------+
| id|         name|            email|
+---+-------------+-----------------+
|  1|     John Bob|[email protected]|
|  2|    Alice Jim|             null|
|  3|Charlie Barry|             null|
|  4|         null|[email protected]|
+---+-------------+-----------------+

Bash
COPY
As you can see, the resulting dataframe contains all the rows from both df1 and df2, and fills in
null values where there are no matches.
How to perform a Full Outer Join in PySpark
To perform a full outer join in PySpark, we can use the join() function on one of the dataframes
and specify the other dataframe, the join type, and the join condition. Here is the general syntax:

joined_df = df1.join(df2, on="join_column", how="full_outer")

Python
COPY
where df1 and df2 are the dataframes to be joined, “join_column” is the column to join on, and
“full_outer” is the join type.
Let’s use the same dataframes from the previous example to perform a full outer join in PySpark:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("Full Outer Join Example at Freshers.in").getOrCreate()

# create df1
df1 = spark.createDataFrame([(1, "John Bob"), (2, "Alice Jim"), (3, "Charlie Barry")], ["id", "name"])
df1.show()

# create df2
df2 = spark.createDataFrame([(1, "[email protected]"), (4, "[email protected]")], ["id", "email"])

# perform a full outer join on the 'id' column
joined_df = df1.join(df2, on="id", how="full_outer")

# display the resulting dataframe
joined_df.show()

Python
COPY
Result of df1

+---+-------------+
| id|         name|
+---+-------------+
|  1|     John Bob|
|  2|    Alice Jim|
|  3|Charlie Barry|
+---+-------------+

Bash
COPY
Result of joined_df

+---+-------------+-----------------+
| id|         name|            email|
+---+-------------+-----------------+
|  1|     John Bob|[email protected]|
|  3|Charlie Barry|             null|
|  2|    Alice Jim|             null|
|  4|         null|[email protected]|
+---+-------------+-----------------+

PySpark : Reading from multiple files , how to get the file which contain each record in PySpark
[input_file_name]
pyspark.sql.functions.input_file_name
One of the most useful features of PySpark is the ability to access metadata about the input files
being processed in a job. This metadata can be used to perform a variety of tasks, including
filtering data based on file name, partitioning data based on file location, and more.
The input_file_name function is a built-in PySpark function that allows you to access the name of
the file being processed in a PySpark job. This function is available in PySpark 2.2 and later
versions and can be used to extract information about the input file being processed.
To use input_file_name, you first need to import it from the PySpark SQL functions module.
Here’s an example:

from pyspark.sql.functions import input_file_name


Python
COPY
Once you’ve imported input_file_name, you can use it in a PySpark DataFrame transformation to
extract information about the input file. Here’s an example:

from pyspark.sql.functions import input_file_name


# create a DataFrame from a CSV file
df = spark.read.format('csv').load('D:\Learning\PySpark\infiles\*')
df.show(20,False)

Python
COPY
Result

+---+-----+---+----+
|_c0| _c1|_c2| _c3|
+---+-----+---+----+
|301| John| 30|3000|
|302|Jibin| 31|3050|
|303|Jerry| 32|3075|
|101| Sam| 10|1000|
|102|Peter| 11|1050|
|103| Eric| 12|1075|
|201|Albin| 20|2000|
|202| Eldo| 21|2050|
|203| Joy| 22|2075|
+---+-----+---+----+

Bash
COPY
Implementing input_file_name()

# add a new column to the DataFrame containing the file name


df = df.withColumn('input_file', input_file_name())
df.show(20,False)

Python
COPY
Output

+---+-----+---+----+--------------------------------------------------+
|_c0|_c1 |_c2|_c3 |input_file |
+---+-----+---+----+--------------------------------------------------+
|301|John |30 |3000|file:///D:/Learning/PySpark/infiles/21-03-2023.csv|
|302|Jibin|31 |3050|file:///D:/Learning/PySpark/infiles/21-03-2023.csv|
|303|Jerry|32 |3075|file:///D:/Learning/PySpark/infiles/21-03-2023.csv|
|101|Sam |10 |1000|file:///D:/Learning/PySpark/infiles/19-03-2023.csv|
|102|Peter|11 |1050|file:///D:/Learning/PySpark/infiles/19-03-2023.csv|
|103|Eric |12 |1075|file:///D:/Learning/PySpark/infiles/19-03-2023.csv|
|201|Albin|20 |2000|file:///D:/Learning/PySpark/infiles/20-03-2023.csv|
|202|Eldo |21 |2050|file:///D:/Learning/PySpark/infiles/20-03-2023.csv|
|203|Joy |22 |2075|file:///D:/Learning/PySpark/infiles/20-03-2023.csv|
+---+-----+---+----+--------------------------------------------------+

PySpark : Exploding a column of arrays or maps into multiple rows in a Spark DataFrame
[posexplode_outer]

pyspark.sql.functions.posexplode_outer
The posexplode_outer function in PySpark is part of the pyspark.sql.functions module and is
used to explode a column of arrays or maps into multiple rows in a Spark DataFrame. This
function is similar to the posexplode function, but it also includes the original row even if the
column is empty or null. In other words, the posexplode_outer function is an “outer explode”
that preserves the original rows even if they don’t have any values in the exploded column.
Returns a new row for each element in the specified array or map, along with its position. In
contrast to posexplode, the row (null, null) is produced if the array or map is null or empty.
Unless otherwise provided, uses the default column names pos for position, col for array items,
and key and value for map elements.
Here is an example of how to use the posexplode_outer function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import posexplode_outer
# Start a SparkSession
spark = SparkSession.builder.appName("PosexplodeOuter Example @ Freshers.in").getOrCreate()
# Create a DataFrame
data = [([1, 2, 3],), ([4, 5],), ([],), (None,)]
df = spark.createDataFrame(data, ["values"])
df.show()

Python
COPY

+---------+
| values|
+---------+
|[1, 2, 3]|
| [4, 5]|
| []|
| null|
+---------+

Bash
COPY
Applying posexplode_outer function
# Use the posexplode_outer function to transform the values column
df2 = df.select("values", posexplode_outer("values").alias("position", "value"))
df2.show()

Python
COPY
Result

+---------+--------+-----+
| values|position|value|
+---------+--------+-----+
|[1, 2, 3]| 0| 1|
|[1, 2, 3]| 1| 2|
|[1, 2, 3]| 2| 3|
| [4, 5]| 0| 4|
| [4, 5]| 1| 5|
| []| null| null|
| null| null| null|
+---------+--------+-----+

PySpark : Transforming a column of arrays or maps into multiple columns, with one row for
each element in the array or map [posexplode]

pyspark.sql.functions.posexplode
The posexplode function in PySpark is part of the pyspark.sql.functions module and is used to
transform a column of arrays or maps into multiple columns, with one row for each element in
the array or map. The posexplode function is similar to the explode function, but it also returns
the position of each element in the array or map in a separate column.
From official doc “Returns a new row for each element with position in the given array or map.
Uses the default column name pos  for position, and col for elements in the array
and  key  and value for elements in the map unless specified otherwise.“
Here is an example of how to use the posexplode function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import posexplode
# Start a SparkSession
spark = SparkSession.builder.appName("PosExplode example @ Freshers.in").getOrCreate()
# Create a DataFrame
data = [([101, 202, 330],), ([43, 51],), ([66],)]
df = spark.createDataFrame(data, ["values"])
df.show()

Python
COPY
+---------------+
| values|
+---------------+
|[101, 202, 330]|
| [43, 51]|
| [66]|
+---------------+

Bash
COPY
Applying posexplode function

# Use the posexplode function to transform the values column


df = df.select("values", posexplode("values").alias("position", "value"))
df.show()

Python
COPY

+---------------+--------+-----+
| values|position|value|
+---------------+--------+-----+
|[101, 202, 330]| 0| 101|
|[101, 202, 330]| 1| 202|
|[101, 202, 330]| 2| 330|
| [43, 51]| 0| 43|
| [43, 51]| 1| 51|
| [66]| 0| 66|
+---------------+--------+-----+

Bash
COPY
As you can see, the posexplode function has transformed the values column into two separate
columns: position and value. The position column contains the position of each element in the
array, and the value column contains the value of each element. Each row of the DataFrame
represents a single element in the array, with its position and value.
In conclusion, the posexplode function in PySpark is a useful tool for transforming arrays and
maps in Spark dataframes. Whether you need to extract the position and value of each element
in an array or perform more complex operations, the pyspark.sql.functions module provides the
tools you need to get the job done.
PySpark : Calculate the percent rank of a set of values in a DataFrame column using
PySpark[percent_rank]

pyspark.sql.functions.percent_rank
PySpark provides a percent_rank function as part of the pyspark.sql.functions module, which is
used to calculate the percent rank of a set of values in a DataFrame column. The percent rank is a
value between 0 and 1 that indicates the relative rank of a value within a set of values. The lower
the value of percent rank, the lower the rank of the value.
Here is an example to demonstrate the use of the percent_rank function in PySpark:

from pyspark.sql import Window


from pyspark.sql import SparkSession
from pyspark.sql.functions import percent_rank

# Start a SparkSession
spark = SparkSession.builder.appName("PercentRank Example @ Freshers.in").getOrCreate()
# Create a DataFrame
data = [(1,), (2,), (3,), (4,), (5,)]
df = spark.createDataFrame(data, ["value"])

# Use the percent_rank function to calculate the percent rank of the values in the DataFrame
df = df.select("value", percent_rank().over(Window.orderBy("value")).alias("percent_rank"))
df.show()

Python
COPY
Output

+-----+------------+
|value|percent_rank|
+-----+------------+
| 1| 0.0|
| 2| 0.25|
| 3| 0.5|
| 4| 0.75|
| 5| 1.0|
+-----+------------+

Bash
COPY
As you can see, the percent_rank function has calculated the percent rank of each value in the
DataFrame. The values are sorted in ascending order, and the percent rank of each value is
calculated based on its position within the set of values. The lower the value of percent rank, the
lower the rank of the value.
The percent_rank function is especially useful when working with large datasets, as it provides a
quick and efficient way to determine the relative rank of a set of values. Additionally, the function
can be used in combination with other functions in the pyspark.sql.functions module to perform
more complex operations on DataFrames.
In conclusion, the percent_rank function in PySpark is a valuable tool for working with data in
Spark dataframes. Whether you need to determine the relative rank of a set of values or perform
more complex operations, the pyspark.sql.functions module provides the tools you need to get
the job done.
PySpark : Extracting minutes of a given date as integer in PySpark [minute]

pyspark.sql.functions.minute
The minute function in PySpark is part of the pyspark.sql.functions module, and is used to extract
the minute from a date or timestamp. The minute function takes a single argument, which is a
column containing a date or timestamp, and returns an integer representing the minute
component of the date or timestamp.
Here is an example of how to use the minute function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import minute
# Start a SparkSession
spark = SparkSession.builder.appName("MinuteExample at Freshers.in").getOrCreate()
# Create a DataFrame
data = [("2023-10-01 11:30:00",),
("2023-11-12 08:45:00",),
("2023-12-15 09:15:00",)]
df = spark.createDataFrame(data, ["timestamp"])
# Use the minute function to extract the minute component
df = df.select("timestamp", minute("timestamp").alias("minute"))
df.show()

Python
COPY
Result

+-------------------+------+
| timestamp|minute|
+-------------------+------+
|2023-10-01 11:30:00| 30|
|2023-11-12 08:45:00| 45|
|2023-12-15 09:15:00| 15|
+-------------------+------+

Bash
COPY
As you can see, the minute function has extracted the minute component of each timestamp in
the DataFrame and returned it as an integer.
In addition to the minute function, the pyspark.sql.functions module also includes functions for
extracting other components of a date or timestamp, such as the hour, day, month, and year.
These functions can be used in combination with each other to perform more complex
operations on dates and timestamps.
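For instance, a short sketch extracting several components from the same timestamp column (hour, dayofmonth, month, and year are all built-in functions in pyspark.sql.functions):

from pyspark.sql.functions import hour, dayofmonth, month, year

df.select(
    "timestamp",
    hour("timestamp").alias("hour"),
    dayofmonth("timestamp").alias("day"),
    month("timestamp").alias("month"),
    year("timestamp").alias("year")
).show()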
In conclusion, the minute function in PySpark is a useful tool for working with dates and
timestamps in Spark dataframes. Whether you need to extract the minute component of a date
or perform more complex operations, the pyspark.sql.functions module provides the tools you
need to get the job done.
PySpark : Function to perform simple column transformations [expr]
pyspark.sql.functions.expr
The expr module is part of the PySpark SQL module and is used to create column expressions
that can be used to perform operations on Spark dataframes. These expressions can be used to
transform columns, calculate new columns based on existing columns, and perform various other
operations on Spark dataframes.
One of the most common uses for expr is to perform simple column transformations. For
example, you can use the expr function to convert a string column to a numeric column by using
the cast function. Here is an example:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1, "100"), (2, "200"), (3, "300")], ["id", "value"])
df.printSchema()

Python
COPY

root
|-- id: long (nullable = true)
|-- value: string (nullable = true)

Bash
COPY
Use expr

df = df.withColumn("value", expr("cast(value as int)"))


df.printSchema()

Python
COPY

root
|-- id: long (nullable = true)
|-- value: integer (nullable = true)

Bash
COPY
In this example, we create a Spark dataframe with two columns, id and value. The value column is
a string column, but we want to convert it to a numeric column. To do this, we use the expr
function to create a column expression that casts the value column as an integer. The result is a
new Spark dataframe with the value column converted to a numeric column.
Another common use for expr is to perform operations on columns. For example, you can use
expr to create a new column that is the result of a calculation involving multiple columns. Here is
an example:

from pyspark.sql.functions import expr


df = spark.createDataFrame([(1, 100, 10), (2, 200, 20), (3, 300, 30)], ["id", "value1", "value2"])
df = df.withColumn("sum", expr("value1 + value2"))
df.show()

Python
COPY
Result

+---+------+------+---+
| id|value1|value2|sum|
+---+------+------+---+
| 1| 100| 10|110|
| 2| 200| 20|220|
| 3| 300| 30|330|
+---+------+------+---+

Bash
COPY
In this example, we create a Spark dataframe with three columns, id, value1, and value2. We use
the expr function to create a new column, sum, that is the result of adding value1 and value2. The
result is a new Spark dataframe with the sum column containing the result of the calculation.
The expr function also gives you access to other Spark SQL expressions that can be used to perform
operations on Spark dataframes. For example, you can use coalesce to select the
first non-null value from a set of columns, ifnull to return a specified value if a
column is null, and CASE WHEN to perform conditional operations on columns.
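As a brief sketch of those expressions (the DataFrame and column names below are illustrative):

from pyspark.sql.functions import expr

df2 = spark.createDataFrame([(1, None, 5), (2, 20, None), (3, 7, 9)], ["id", "a", "b"])

df2 = df2.withColumn("first_non_null", expr("coalesce(a, b)")) \
         .withColumn("a_or_zero", expr("ifnull(a, 0)")) \
         .withColumn("bucket", expr("CASE WHEN id < 3 THEN 'low' ELSE 'high' END"))
df2.show()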
In conclusion, the expr module in PySpark provides a convenient and flexible way to perform
operations on Spark dataframes. Whether you want to transform columns, calculate new
columns, or perform other operations, the expr module provides the tools you need to do so.
PySpark : Formatting numbers to a specific number of decimal places.
 
pyspark.sql.functions.format_number
One of the useful functions in PySpark is the format_number function, which is used to format
numbers to a specific number of decimal places. In this article, we will discuss the
PySpark format_number function and its usage.
The format_number function in PySpark is used to format numbers to a specific number of
decimal places. This function is useful when you need to display numbers in a specific format,
such as displaying a currency value with two decimal places. The format_number function takes
two arguments: the number to be formatted and the number of decimal places to format it to.
Here is an example of how to use the format_number function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import format_number

# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Format Number").getOrCreate()

# Create a dataframe with a column of numbers


data = [(1, 3.14159),
(2, 2.71828),
(3, 1.41421)]
df = spark.createDataFrame(data, ["id", "value"])

# Use the format_number function to format the numbers to two decimal places
df = df.select("id", format_number("value", 2).alias("value_formatted"))

# Show the resulting dataframe


df.show()

Python
COPY
In this example, we start by creating a SparkSession, which is the entry point for PySpark. Then,
we create a dataframe with two columns, id and value, where the value column contains
numbers.
Next, we use the format_number function to format the numbers in the value column to two
decimal places. The format_number function takes the value column and formats it to two
decimal places. We use the alias method to give a name to the newly created column, which is
value_formatted in this example.
Finally, we display the resulting dataframe using the show method, which outputs the following
result:

+---+---------------+
| id|value_formatted|
+---+---------------+
| 1| 3.14|
| 2| 2.72|
| 3| 1.41|
+---+---------------+

Bash
COPY
As you can see, the format_number function has formatted the numbers in the value column to
two decimal places, as specified in the second argument of the function.
In conclusion, the format_number function in PySpark is a useful tool for formatting numbers to
a specific number of decimal places. Whether you are a data scientist or a software engineer,
understanding the basics of the PySpark format_number function is crucial for performing
effective big data analysis. With its simple yet powerful functionality,
the format_number function can help you present your data in a clear and concise format,
making it easier to understand and analyze.
PySpark : Creating multiple rows for each element in the array[explode]

pyspark.sql.functions.explode
One of the important operations in PySpark is the explode function, which is used to convert a
column of arrays or maps into separate rows in a dataframe. In this article, we will discuss the
PySpark explode function and its usage.
The explode function in PySpark is used to take a column of arrays or maps and create multiple
rows for each element in the array or map. For example, if you have a dataframe with a column
that contains arrays, the explode function can be used to create separate rows for each element
in the array. This can be useful when you need to analyze the data at a more granular level.
Here is an example of how to use the explode function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import explode
# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Explode").getOrCreate()

# Create a dataframe with a column of arrays


data = [(1, ["coconut", "banana", "cherry"]),
(2, ["orange", "mandarins", "kiwi"])]
df = spark.createDataFrame(data, ["id", "fruits"])

# Use the explode function to convert the column of arrays into separate rows
df = df.select("id", explode("fruits").alias("fruit"))

# Show the resulting dataframe


df.show()

Python
COPY
In this example, we start by creating a SparkSession, which is the entry point for PySpark. Then,
we create a dataframe with two columns, id and fruits, where the fruits column contains arrays
of fruit names.
Next, we use the explode function to convert the column of arrays into separate rows. The
explode function takes a column of arrays as input and creates separate rows for each element in
the array. We use the alias method to give a name to the newly created column, which is fruit in
this example.
Finally, we display the resulting dataframe using the show method, which outputs the following
result:

+---+---------+
| id| fruit|
+---+---------+
| 1| coconut|
| 1| banana|
| 1| cherry|
| 2| orange|
| 2|mandarins|
| 2| kiwi|
+---+---------+

Bash
COPY
As you can see, the explode function has transformed the column of arrays into separate rows,
one for each element in the array. This allows us to analyze the data at a more granular level.
In conclusion, the explode function in PySpark is a useful tool for converting columns of arrays or
maps into separate rows in a dataframe. It allows you to analyze the data at a more granular
level, making it easier to perform complex data processing and analysis tasks. Whether you are a
data scientist or a software engineer, understanding the basics of the PySpark explode function
is crucial for performing effective big data analysis.
PySpark : How decode works in PySpark ?

One of the important concepts in PySpark is data encoding and decoding, which refers to the
process of converting data into a binary format and then converting it back into a readable
format.
In PySpark, encoding and decoding are performed using various methods that are available in the
library. The most commonly used methods are base64 encoding and decoding, which is a
standard encoding scheme that is used for converting binary data into ASCII text. This method is
used for transmitting binary data over networks, where text data is preferred over binary data.
Another popular method for encoding and decoding in PySpark is the JSON encoding and
decoding. JSON is a lightweight data interchange format that is easy to read and write. In
PySpark, JSON encoding is used for storing and exchanging data between systems, whereas
JSON decoding is used for converting the encoded data back into a readable format.
Additionally, PySpark also provides support for encoding and decoding data in the Avro format.
Avro is a data serialization system that is used for exchanging data between systems. It is similar
to JSON encoding and decoding, but it is more compact and efficient. Avro encoding and
decoding in PySpark is performed using the Avro library.
To perform encoding and decoding in PySpark, one must first create a Spark context and then
import the necessary libraries. The data to be encoded or decoded must then be loaded into the
Spark context, and the appropriate encoding or decoding method must be applied to the data.
Once the encoding or decoding is complete, the data can be stored or transmitted as needed.
In conclusion, encoding and decoding are important concepts in PySpark, as they are used for
storing and exchanging data between systems. PySpark provides support for base64 encoding
and decoding, JSON encoding and decoding, and Avro encoding and decoding, making it a
powerful tool for big data analysis. Whether you are a data scientist or a software engineer,
understanding the basics of PySpark encoding and decoding is crucial for performing effective
big data analysis.
Here is a sample PySpark program that demonstrates how to perform base64 decoding using
PySpark:

from pyspark import SparkContext


from pyspark.sql import SparkSession
import base64

# Initialize SparkContext and SparkSession


sc = SparkContext("local", "base64 decode example @ Freshers.in")
spark = SparkSession(sc)

# Load data into Spark dataframe


df = spark.createDataFrame([("data1", "ZGF0YTE="),("data2", "ZGF0YTI=")], ["key",
"encoded_data"])

# Create a UDF (User Defined Function) for decoding base64 encoded data
decode_udf = spark.udf.register("decode", lambda x: base64.b64decode(x).decode("utf-8"))

# Apply the UDF to the "encoded_data" column


df = df.withColumn("decoded_data", decode_udf(df["encoded_data"]))

# Display the decoded data


df.show()

Python
COPY
Output

+-----+------------+------------+
| key|encoded_data|decoded_data|
+-----+------------+------------+
|data1| ZGF0YTE=| data1|
|data2| ZGF0YTI=| data2|
+-----+------------+------------+

Bash
COPY
Explanation
1. The first step is to import the necessary
libraries, SparkContext and SparkSession from pyspark and base64 library.
2. Next, we initialize the SparkContext and SparkSession by creating an instance of
SparkContext with the name “local” and “base64 decode example” as the application
name.
3. In the next step, we create a Spark dataframe with two columns, key and encoded_data,
and load some sample data into the dataframe.
4. Then, we create a UDF (User Defined Function) called decode which takes a base64
encoded string as input and decodes it using the base64.b64decode method and returns
the decoded string. The .decode("utf-8") is used to convert the binary decoded data into a
readable string format.
5. After creating the UDF, we use the withColumn method to apply the UDF to
the encoded_data column of the dataframe and add a new column called decoded_data to
store the decoded data.
6. Finally, we display the decoded data using the show method.
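For the JSON case mentioned earlier, PySpark's built-in to_json and from_json functions cover most needs; here is a minimal sketch (the schema and column names are illustrative):

from pyspark.sql.functions import to_json, from_json, struct, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

json_df = spark.createDataFrame([("Sam", 10), ("Peter", 11)], ["name", "age"])

# encode each row as a JSON string
encoded = json_df.select(to_json(struct("name", "age")).alias("json_data"))
encoded.show(truncate=False)

# decode the JSON string back into columns
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
decoded = encoded.select(from_json(col("json_data"), schema).alias("parsed")).select("parsed.*")
decoded.show()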
PySpark : Extracting dayofmonth, dayofweek, and dayofyear in PySpark

1. pyspark.sql.functions.dayofmonth
2. pyspark.sql.functions.dayofweek
3. pyspark.sql.functions.dayofyear
One of the most common data manipulations in PySpark is working with date and time columns.
PySpark provides several functions to extract day-related information from date and time
columns, such as dayofmonth, dayofweek, and dayofyear. In this article, we will explore these
functions in detail.
dayofmonth
dayofmonth: The dayofmonth function returns the day of the month from a date column. The
function returns an integer between 1 and 31, representing the day of the month.
Syntax: df.select(dayofmonth(col("date_column_name"))).show()

from pyspark.sql.functions import dayofmonth


from pyspark.sql.functions import col
# Sample DataFrame
df = spark.createDataFrame([("2023-01-19",),("2023-02-11",),("2023-03-12",)], ["date"])
# Extract day of the month
df.select(dayofmonth(col("date"))).show()

Python
COPY
Output
+----------------+
|dayofmonth(date)|
+----------------+
| 19|
| 11|
| 12|
+----------------+

Bash
COPY
dayofweek
dayofweek: The dayofweek function returns the day of the week from a date column. The
function returns an integer between 1 (Sunday) and 7 (Saturday), representing the day of the
week.
Syntax: df.select(dayofweek(col("date_column_name"))).show()

from pyspark.sql.functions import dayofweek


from pyspark.sql.functions import col
# Sample DataFrame
df = spark.createDataFrame([("2023-01-19",),("2023-02-11",),("2023-03-12",)], ["date"])
# Extract day of the week
df.select(dayofweek(col("date"))).show()

Python
COPY
Output

+---------------+
|dayofweek(date)|
+---------------+
| 5|
| 7|
| 1|
+---------------+

Bash
COPY
dayofyear
dayofyear: The dayofyear function returns the day of the year from a date column. The function
returns an integer between 1 and 366, representing the day of the year.
Syntax: df.select(dayofyear(col("date_column_name"))).show()

from pyspark.sql.functions import dayofyear


from pyspark.sql.functions import col
# Sample DataFrame
df = spark.createDataFrame([("2023-01-19",),("2023-02-11",),("2023-03-12",)], ["date"])
# Extract day of the year
df.select(dayofyear(col("date"))).show()
Python
COPY
Output

+---------------+
|dayofyear(date)|
+---------------+
| 19|
| 42|
| 71|
+---------------+

Bash
COPY
The dayofmonth, dayofweek, and dayofyear functions in PySpark provide an easy way to extract
day-related information .
Spark : Calculate the number of unique elements in a column using PySpark
pyspark.sql.functions.countDistinct
In PySpark, the countDistinct function is used to calculate the number of unique elements in a
column, also known as the number of distinct values. DataFrame.distinct() returns a new DataFrame
with duplicate rows removed (considering all columns), whereas the PySpark SQL function
countDistinct() returns the distinct count for a selection of one or more columns. The
result of this function is the number of unique items in a group.
Here is an example of how to use the countDistinct function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import countDistinct
spark = SparkSession.builder.appName('Freshers.in countDistinct Learning').getOrCreate()
data = [("John", "Finance"), ("Jane", "IT"), ("Jim", "Finance"), ("Wilson John", "Travel"),
("Mike", "Travel")]
columns = ["Name","Dept"]
df = spark.createDataFrame(data=data,schema=columns)
# Using countDistrinct()
df.select(countDistinct("Dept",)).show()

Python
COPY
Output

+--------------------+
|count(DISTINCT Dept)|
+--------------------+
| 3|
+--------------------+

Bash
COPY
Returns a new Column for distinct count of col or cols.
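For comparison, a short sketch using the same DataFrame to get the distinct count over a combination of columns and the row-level distinct count:

# distinct count over the combination of Name and Dept
df.select(countDistinct("Name", "Dept")).show()

# number of distinct rows considering all columns
print(df.distinct().count())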
Google’s Serverless Spark has several advantages compared to traditional Spark clusters:
1. Cost-effective: Serverless Spark eliminates the need for dedicated servers and
infrastructure, reducing costs for managing, scaling and maintaining Spark clusters.
2. Scalability: Serverless Spark can automatically scale up or down based on the workload,
without the need for manual intervention.
3. Improved performance: With serverless Spark, you only pay for what you use, and the
execution of Spark jobs is optimized for maximum efficiency, resulting in improved
performance.
4. Flexibility: Serverless Spark provides the ability to run Spark jobs on a variety of different
compute resources, including virtual machines and Kubernetes clusters, making it easy to
switch between different execution environments.
5. Ease of use: Serverless Spark provides a simple and intuitive interface for Spark users,
making it easier to run Spark jobs without the need for deep technical knowledge.
6. Integration with Google Cloud services: Serverless Spark integrates seamlessly with Google
Cloud services, providing a comprehensive platform for data processing and analysis.
Serverless Spark provides organizations with a cost-effective, scalable, and flexible solution for
running Spark jobs, while also improving performance and reducing the complexity of Spark
administration.
PySpark : How to decode in PySpark ?

pyspark.sql.functions.decode
The pyspark.sql.functions.decode Function in PySpark
PySpark is a popular library for processing big data using Apache Spark. One of its many functions
is the pyspark.sql.functions.decode function, which is used to convert binary data into a string
using a specified character set. The pyspark.sql.functions.decode function takes two arguments:
the first argument is the binary data to be decoded, and the second argument is the character set
to use for decoding the binary data.
The pyspark.sql.functions.decode function in PySpark supports the following character sets: US-
ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16. The character set specified in the
second argument must match one of these supported character sets in order to perform the
decoding successfully.
Here’s a simple example to demonstrate the use of the pyspark.sql.functions.decode function in
PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import *

# Initializing Spark Session


spark = SparkSession.builder.appName("DecodeFunction").getOrCreate()

# Creating DataFrame with sample data


data = [("Team",),("Freshers.in",)]
df = spark.createDataFrame(data, ["binary_data"])

# Decoding binary data


df = df.withColumn("string_data", decode(col("binary_data"), "UTF-8"))

# Showing the result


df.show()

Python
COPY
Output

+-----------+-----------+
|binary_data|string_data|
+-----------+-----------+
| Team| Team|
|Freshers.in|Freshers.in|
+-----------+-----------+

Bash
COPY
In the above example, the pyspark.sql.functions.decode function is used to decode binary data
into a string. The first argument to the pyspark.sql.functions.decode function is the binary data
to be decoded, which is stored in the “binary_data” column. The second argument is the
character set to use for decoding the binary data, which is “UTF-8“. The function returns a new
column “string_data” that contains the decoded string data.
The pyspark.sql.functions.decode function is a useful tool for converting binary data into a string
format that can be more easily analyzed and processed. It is important to specify the correct
character set for the binary data, as incorrect character sets can result in incorrect decoded data.
In conclusion, the pyspark.sql.functions.decode function in PySpark is a valuable tool for
converting binary data into a string format. It supports a variety of character sets and is an
important tool for processing binary data in PySpark.
PySpark : Date Formatting : Converts a date, timestamp, or string to a string value with
specified format in PySpark
pyspark.sql.functions.date_format
In PySpark, dates and timestamps are stored as timestamp type. However, while working with
timestamps in PySpark, sometimes it becomes necessary to format the date in a specific way.
This is where the date_format function in PySpark comes in handy.
The date_format function in PySpark takes two arguments: the first argument is the date column,
and the second argument is the format in which the date needs to be formatted. The function
returns a new column with the formatted date.
Here’s a simple example to demonstrate the use of date_format function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import *

# Initializing Spark Session


spark = SparkSession.builder.appName("DateFormatting").getOrCreate()

# Creating DataFrame with sample data


data = [("2023-02-01",),("2023-02-02",),("2023-02-03",)]
df = spark.createDataFrame(data, ["date"])

# Formatting date column in desired format


df = df.withColumn("formatted_date", date_format(col("date"), "dd-MM-yyyy"))
# Showing the result
df.show()

Python
COPY
This will produce the following output:

+----------+--------------+
| date|formatted_date|
+----------+--------------+
|2023-02-01| 01-02-2023|
|2023-02-02| 02-02-2023|
|2023-02-03| 03-02-2023|
+----------+--------------+

Bash
COPY
In the above example, the date_format function is used to format the date column in the desired
format. The first argument to the date_format function is the date column, and the second
argument is the format in which the date needs to be formatted.
The date format string used in the second argument of the date_format function is made up of
special characters that represent various parts of the date and time. Some of the most commonly
used special characters in the date format string are:
 dd: Represents the day of the month (01 to 31).
 MM: Represents the month of the year (01 to 12).
 yyyy: Represents the year (with four digits).
There are many other special characters that can be used in the date format string to format the
date and time. A complete list of special characters can be found in the PySpark documentation.
In conclusion, the date_format function in PySpark is a useful tool for formatting dates and
timestamps. It provides an easy way to format dates in a specific format, which can be useful in
various data processing tasks.
PySpark : Adding a specified number of days to a date column in PySpark

pyspark.sql.functions.date_add
The date_add function in PySpark is used to add a specified number of days to a date column. It’s
part of the built-in Spark SQL functions and can be used in Spark DataFrames, Spark SQL and
Spark Datasets.
The basic syntax of the function is as follows:

date_add(date, days)

Python
COPY
where:
date is the date column that you want to add days to.
days is the number of days to add to the date column (can be a positive or negative number).
Here’s an example of how you can use the date_add function in PySpark:
from pyspark.sql import functions as F
df = spark.createDataFrame([("2023-01-01",),("2023-01-02",)], ["date"])
df = df.withColumn("new_date", F.date_add(df["date"], 1))
df.show()

Python
COPY
Result

+----------+----------+
|      date|  new_date|
+----------+----------+
|2023-01-01|2023-01-02|
|2023-01-02|2023-01-03|
+----------+----------+

Bash
COPY
Note that the date_add function only adds whole days to a date, not finer-grained units to a
timestamp. If you want to add a specified number of seconds, minutes, or hours to a timestamp
column, you can use the expr function to write a custom expression with an INTERVAL.
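A minimal sketch of that approach, adding an interval to a timestamp column (the column name and interval are illustrative):

from pyspark.sql import functions as F

ts_df = spark.createDataFrame([("2023-01-01 10:00:00",)], ["ts"]) \
             .withColumn("ts", F.col("ts").cast("timestamp"))

# add two hours using a SQL interval expression
ts_df = ts_df.withColumn("new_ts", F.expr("ts + INTERVAL 2 HOURS"))
ts_df.show(truncate=False)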
PySpark : How to Compute the cumulative distribution of a column in a DataFrame

pyspark.sql.functions.cume_dist
The cumulative distribution is a method used in probability and statistics to determine the
distribution of a random variable, X, at any given point. The cumulative distribution function
(CDF) of X, denoted by F(x), is defined as the probability that X will take a value less than or equal
to x.
In PySpark, the cume_dist function is used to compute the cumulative distribution of a column in
a DataFrame. This function computes the cumulative distribution of a column in a DataFrame,
with respect to the order specified in the sort order.
Here’s an example to demonstrate the usage of the cume_dist function in PySpark:

from pyspark.sql import SparkSession


from pyspark.sql.functions import cume_dist

# Initialize Spark session
spark = SparkSession.builder.appName("CumeDistExample").getOrCreate()

# Create a DataFrame with sample data
data = [("Mike Jack", 30), ("King Elene", 40), ("Barry Tim", 25), ("Yang Jakie", 35), ("Joby John", 20)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Sort the DataFrame by Age in ascending order
df = df.sort("Age")

# Compute the cumulative distribution of the Age column
cumulative_dist = df.selectExpr("cume_dist() over (order by Age) as cum_dist")
cumulative_dist.show()

Python
COPY
Output
+--------+
|cum_dist|
+--------+
| 0.2|
| 0.4|
| 0.6|
| 0.8|
| 1.0|
+--------+

Bash
COPY
In this example, the cumulative distribution of the Age column is calculated with respect to the
ascending order of the column. The result shows the cumulative distribution of the Age column,
with the first row having a cumulative distribution of 0.2, and the last row having a cumulative
distribution of 1.0, which indicates that 100% of the values are less than or equal to the
corresponding value.
PySpark : Truncate date and timestamp in PySpark [date_trunc and trunc]

pyspark.sql.functions.date_trunc(format, timestamp)
The truncation function offered by the Spark DataFrame SQL functions is date_trunc(), which returns
the date in the format "yyyy-MM-dd HH:mm:ss.SSSS" and truncates at the year, month, day, hour,
minute, or second unit. It returns the timestamp truncated to the unit specified by the format
argument.

from pyspark.sql.functions import date_trunc


from pyspark.sql.functions import lit
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("date_trunc_example").getOrCreate()
# Create a dataframe with a single column of timestamps
df = spark.createDataFrame([
("2023-01-01 10:10:10",),
("2023-02-01 11:11:11",),
("2023-03-01 12:12:12",),
("2023-03-31 11:11:11",),
("2023-01-31 11:11:11",),
("2023-02-01 11:11:11",)], ["timestamp"])
# Round down each timestamp to the nearest month
df_truncated = df.withColumn("month", date_trunc("month", df["timestamp"]))
df_truncated.show(20,False)

Python
COPY
Output

+-------------------+-------------------+
| timestamp| month|
+-------------------+-------------------+
|2023-01-01 10:10:10|2023-01-01 00:00:00|
|2023-02-01 11:11:11|2023-02-01 00:00:00|
|2023-03-01 12:12:12|2023-03-01 00:00:00|
|2023-03-31 11:11:11|2023-03-01 00:00:00|
|2023-01-31 11:11:11|2023-01-01 00:00:00|
|2023-02-01 11:11:11|2023-02-01 00:00:00|
+-------------------+-------------------+

Bash
COPY
In this example, we use the date_trunc function to round down the timestamps in the
“timestamp” column to the nearest month. The output is a new dataframe with two columns:
the original “timestamp” column and a new column “month” which contains the rounded-down
timestamps.
pyspark.sql.functions.trunc
The trunc function in PySpark also truncates date or timestamp values, but it returns a date rather
than a timestamp and only supports coarse units (year, quarter, month, and week in recent Spark
versions); finer units such as day are not supported and return null, as the last example below shows.
Here’s an example of using trunc on date or timestamp values in PySpark:

from pyspark.sql.functions import trunc
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("trunc example").getOrCreate()
# Create a dataframe with a single column of timestamps
df = spark.createDataFrame([
("2023-01-01 10:10:10",),
("2023-02-01 11:11:11",),
("2023-03-01 12:12:12",),
("2023-03-31 11:11:11",),
("2023-01-31 11:11:11",),
("2023-02-01 11:11:11",)], ["timestamp"])
df.select(df.timestamp,trunc(df.timestamp, 'year').alias('year')).show()

Result

+-------------------+----------+
| timestamp| year|
+-------------------+----------+
|2023-01-01 10:10:10|2023-01-01|
|2023-02-01 11:11:11|2023-01-01|
|2023-03-01 12:12:12|2023-01-01|
|2023-03-31 11:11:11|2023-01-01|
|2023-01-31 11:11:11|2023-01-01|
|2023-02-01 11:11:11|2023-01-01|
+-------------------+----------+


df.select(df.timestamp,trunc(df.timestamp, 'month').alias('month')).show()


+-------------------+----------+
| timestamp| month|
+-------------------+----------+
|2023-01-01 10:10:10|2023-01-01|
|2023-02-01 11:11:11|2023-02-01|
|2023-03-01 12:12:12|2023-03-01|
|2023-03-31 11:11:11|2023-03-01|
|2023-01-31 11:11:11|2023-01-01|
|2023-02-01 11:11:11|2023-02-01|
+-------------------+----------+


df.select(df.timestamp,trunc(df.timestamp, 'day').alias('day')).show()


+-------------------+----+
| timestamp| day|
+-------------------+----+
|2023-01-01 10:10:10|null|
|2023-02-01 11:11:11|null|
|2023-03-01 12:12:12|null|
|2023-03-31 11:11:11|null|
|2023-01-31 11:11:11|null|
|2023-02-01 11:11:11|null|
+-------------------+----+

Keep in mind that trunc() does not support the 'day' unit; passing 'day' simply returns null. Use date_trunc() instead when you need day-level (or finer) truncation.
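A minimal sketch of that workaround, reusing the df from the trunc example; to_date is another option when only the date part (without a time component) is needed:

from pyspark.sql.functions import date_trunc, to_date

df.select(
    df["timestamp"],
    date_trunc("day", df["timestamp"]).alias("day_ts"),  # timestamp at midnight, e.g. 2023-01-31 00:00:00
    to_date(df["timestamp"]).alias("day_date")           # date only, e.g. 2023-01-31
).show(truncate=False)
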
PySpark : Explain map in PySpark and how it can be used

‘map’ in PySpark is a transformation operation that allows you to apply a function to each
element in an RDD (Resilient Distributed Dataset), which is the basic data structure in PySpark.
The function takes a single element as input and returns a single output.
The result of the map operation is a new RDD where each element is the result of applying the
function to the corresponding element in the original RDD.
Example:
Suppose you have an RDD of integers, and you want to multiply each element by 2. You can use
the map transformation as follows:

# Assumes an existing SparkContext `sc` (e.g. sc = spark.sparkContext)
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Double every element, then bring the results back to the driver
result = rdd.map(lambda x: x * 2)
result.collect()
The output of this code will be [2, 4, 6, 8, 10]. The map operation takes a lambda function (or any
other function) that takes a single integer as input and returns its double. The collect action is
used to retrieve the elements of the RDD back to the driver program as a list.
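map is not limited to lambdas or to numeric data; any Python function that takes one element and returns one value can be passed. A small sketch, again assuming an existing SparkContext sc, that turns strings into (word, length) pairs:

# Assumes an existing SparkContext `sc` (e.g. sc = spark.sparkContext)
words = sc.parallelize(["spark", "pyspark", "rdd"])

def to_pair(word):
    # Return a (word, length) tuple for each element
    return (word, len(word))

pairs = words.map(to_pair)
print(pairs.collect())  # [('spark', 5), ('pyspark', 7), ('rdd', 3)]
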
PySpark : Explanation of MapType in PySpark with Example
MapType in PySpark is a data type used to represent a column value that maps keys to values,
similar to Python's built-in dictionary. All keys must share one declared data type and all values
another declared data type (the two may be the same).
Advantages of MapType in PySpark:
 It allows for a flexible schema, as the number of keys and their values can vary for each
row in a DataFrame.
 MapType is particularly useful when working with semi-structured data, where there is a
lot of variability in the structure of the data.
Example: Let’s say we have a DataFrame with the following schema:

root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- hobbies: map (nullable = true)
| |-- key: string
| |-- value: integer

Bash
COPY
We can create this DataFrame using the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

# Initialize a Spark session
spark = SparkSession.builder.appName("MapTypeExample").getOrCreate()

data = [("John", 30, {"reading": 3, "traveling": 5}),
        ("Jane", 25, {"cooking": 2, "painting": 4})]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("hobbies", MapType(StringType(), IntegerType()), True)
])

df = spark.createDataFrame(data, schema)
df.show(20, False)

Result

+----+---+------------------------------+
|name|age|hobbies |
+----+---+------------------------------+
|John|30 |[reading -> 3, traveling -> 5]|
|Jane|25 |[painting -> 4, cooking -> 2] |
+----+---+------------------------------+
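
Once the column is a MapType, individual keys can be looked up directly and the whole map can be flattened into rows with explode. A minimal sketch reusing the df defined above (the aliases reading_score, hobby and score are just illustrative):

from pyspark.sql.functions import col, explode

# Look up a single key; rows whose map does not contain the key get null
df.select("name", col("hobbies")["reading"].alias("reading_score")).show()

# Flatten the map into one row per (key, value) pair
df.select("name", explode("hobbies").alias("hobby", "score")).show()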
