
Spark DataFrames Project Exercise


Let's get some quick practice with your new Spark DataFrame skills. You will be asked some basic questions about stock market data, in this case Walmart stock from the years 2012-2017. This exercise simply asks a series of questions, unlike the future machine learning exercises, which will be a little looser and take the form of "Consulting Projects", but more on that later!

For now, just answer the questions and complete the tasks below.

Use the walmart_stock.csv file to answer the questions and complete the tasks below!

Start a simple Spark Session

In [1]:

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName('walmart').getOrCreate()

Load the Walmart Stock CSV file and have Spark infer the data types.

In [2]:

df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)

What are the column names?

In [3]:

df.columns

Out[3]:

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

What does the Schema look like?

In [5]:

df.printSchema()

root

|-- Date: string (nullable = true)

|-- Open: double (nullable = true)

|-- High: double (nullable = true)

|-- Low: double (nullable = true)

|-- Close: double (nullable = true)

|-- Volume: integer (nullable = true)

|-- Adj Close: double (nullable = true)


Print out the first 5 rows.

In [8]:

for line in df.head(5):
    print(line, '\n')

Row(Date='2012-01-03', Open=59.970001, High=61.060001, Low=59.869999, Close=60.330002, Volume=12668800, Adj Close=52.619234999999996)

Row(Date='2012-01-04', Open=60.209998999999996, High=60.349998, Low=59.470001, Close=59.709998999999996, Volume=9593300, Adj Close=52.078475)

Row(Date='2012-01-05', Open=59.349998, High=59.619999, Low=58.369999, Close=59.419998, Volume=12768200, Adj Close=51.825539)

Row(Date='2012-01-06', Open=59.419998, High=59.450001, Low=58.869999, Close=59.0, Volume=8069400, Adj Close=51.45922)

Row(Date='2012-01-09', Open=59.029999, High=59.549999, Low=58.919998, Close=59.18, Volume=6679300, Adj Close=51.616215000000004)

Use describe() to learn about the DataFrame.

In [10]:

df.describe().show()

+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|summary|      Date|              Open|             High|              Low|            Close|           Volume|        Adj Close|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  count|      1258|              1258|             1258|             1258|             1258|             1258|             1258|
|   mean|      null| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|
| stddev|      null|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|
|    min|2012-01-03|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|
|    max|2016-12-30|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|
+-------+----------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Bonus Question!

There are too many decimal places for mean and stddev in the describe() DataFrame. Format the numbers to show just two decimal places. Pay careful attention to the datatypes that .describe() returns; we didn't cover this exact formatting, but we covered something very similar. Check this link for a hint
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast)


If you get stuck on this, don't worry; just view the solutions.

In [18]:

from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)

data_schema = [StructField('summary', StringType(), True),
               StructField('Open', StringType(), True),
               StructField('High', StringType(), True),
               StructField('Low', StringType(), True),
               StructField('Close', StringType(), True),
               StructField('Volume', StringType(), True),
               StructField('Adj Close', StringType(), True)
               ]

final_struc = StructType(fields=data_schema)

# Note: final_struc is built here but not passed to the reader; the file is
# re-read with inferSchema, as the printSchema() output below shows.
df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)

In [19]:

df.printSchema()

root

|-- Date: string (nullable = true)

|-- Open: double (nullable = true)

|-- High: double (nullable = true)

|-- Low: double (nullable = true)

|-- Close: double (nullable = true)

|-- Volume: integer (nullable = true)

|-- Adj Close: double (nullable = true)

In [22]:

from pyspark.sql.functions import format_number

summary = df.describe()
summary.select(summary['summary'],
               format_number(summary['Open'].cast('float'), 2).alias('Open'),
               format_number(summary['High'].cast('float'), 2).alias('High'),
               format_number(summary['Low'].cast('float'), 2).alias('Low'),
               format_number(summary['Close'].cast('float'), 2).alias('Close'),
               format_number(summary['Volume'].cast('int'), 0).alias('Volume')
               ).show()

+-------+--------+--------+--------+--------+----------+

|summary| Open| High| Low| Close| Volume|

+-------+--------+--------+--------+--------+----------+

| count|1,258.00|1,258.00|1,258.00|1,258.00| 1,258|

| mean| 72.36| 72.84| 71.92| 72.39| 8,222,093|

| stddev| 6.77| 6.77| 6.74| 6.76| 4,519,780|

| min| 56.39| 57.06| 56.30| 56.42| 2,094,900|

| max| 90.80| 90.97| 89.25| 90.47|80,898,100|

+-------+--------+--------+--------+--------+----------+

Create a new DataFrame with a column called HV Ratio that is the ratio of the High price to the Volume of stock traded for a day.


In [23]:

df_hv = df.withColumn('HV Ratio', df['High']/df['Volume']).select(['HV Ratio'])


df_hv.show()

+--------------------+

| HV Ratio|

+--------------------+

|4.819714653321546E-6|

|6.290848613094555E-6|

|4.669412994783916E-6|

|7.367338463826307E-6|

|8.915604778943901E-6|

|8.644477436914568E-6|

|9.351828421515645E-6|

| 8.29141562102703E-6|

|7.712212102001476E-6|

|7.071764823529412E-6|

|1.015495466386981E-5|

|6.576354146362592...|

| 5.90145296180676E-6|

|8.547679455011844E-6|

|8.420709512685392E-6|

|1.041448341728929...|

|8.316075414862431E-6|

|9.721183814992126E-6|

|8.029436027707578E-6|

|6.307432259386365E-6|

+--------------------+

only showing top 20 rows
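
As an aside (not part of the original notebook), the same ratio can also be computed with a SQL expression string via selectExpr; a minimal sketch:

# Same HV Ratio via a SQL expression string; backticks quote the column name with a space.
df.selectExpr('High / Volume as `HV Ratio`').show(5)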

What day had the Peak High in Price?

In [25]:

df.orderBy(df['High'].desc()).select(['Date']).head(1)[0]['Date']

Out[25]:

'2015-01-13'
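
An equivalent approach (a sketch, not part of the original solution) is to compute the maximum High first and then filter on it:

from pyspark.sql.functions import max as max_

# Find the peak High, then look up the date(s) on which it occurred.
peak_high = df.agg(max_('High')).collect()[0][0]
df.filter(df['High'] == peak_high).select('Date').show()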

What is the mean of the Close column?

In [26]:

from pyspark.sql.functions import mean


df.select(mean('Close')).show()

+-----------------+

| avg(Close)|

+-----------------+

|72.38844998012726|

+-----------------+

What are the max and min of the Volume column?



In [27]:

from pyspark.sql.functions import min, max

In [28]:

df.select(max('Volume'),min('Volume')).show()

+-----------+-----------+

|max(Volume)|min(Volume)|

+-----------+-----------+

| 80898100| 2094900|

+-----------+-----------+

How many days was the Close lower than 60 dollars?

In [29]:

df.filter(df['Close'] < 60).count()

Out[29]:

81

What percentage of the time was the High greater than 80 dollars?

In other words, (Number of Days High>80)/(Total Days in the dataset)

In [107]:

df.filter('High > 80').count() * 100/df.count()

Out[107]:

9.141494435612083
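
For reference (a sketch, not from the original exercise), filter() accepts either a SQL expression string, as above, or a Column expression; both give the same count:

from pyspark.sql.functions import col

# Same percentage using a Column expression instead of a SQL string.
pct_above_80 = df.filter(col('High') > 80).count() * 100 / df.count()
print(pct_above_80)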

What is the Pearson correlation between High and Volume?

Hint
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameStatFunctions.corr)

In [31]:

from pyspark.sql.functions import corr

df.select(corr('High', 'Volume')).show()

+-------------------+

| corr(High, Volume)|

+-------------------+

|-0.3384326061737161|

+-------------------+
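
As a side note (not in the original notebook), the DataFrameStatFunctions API referenced in the hint returns the same value directly as a Python float:

# Pearson correlation returned directly as a plain float.
print(df.stat.corr('High', 'Volume'))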


What is the max High per year?

In [32]:

from pyspark.sql.functions import (dayofmonth, hour,
                                   dayofyear, month,
                                   year, weekofyear,
                                   format_number, date_format)

year_df = df.withColumn('Year', year(df['Date']))
year_df.groupBy('Year').max()['Year', 'max(High)'].show()

+----+---------+

|Year|max(High)|

+----+---------+

|2015|90.970001|

|2013|81.370003|

|2014|88.089996|

|2012|77.599998|

|2016|75.190002|

+----+---------+
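
An equivalent formulation (a sketch, not part of the original solution) aggregates only the High column and sorts the result by year:

from pyspark.sql.functions import year, max as max_

# Same result, but aggregating only High and ordering the output by Year.
(df.withColumn('Year', year(df['Date']))
   .groupBy('Year')
   .agg(max_('High').alias('max(High)'))
   .orderBy('Year')
   .show())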

What is the average Close for each Calendar Month?

In other words, across all the years, what is the average Close price for Jan, Feb, Mar, etc.? Your result will have a value for each of these months.

In [33]:

month_df = df.withColumn('Month', month(df['Date']))

month_df = month_df.groupBy('Month').mean()

month_df = month_df.orderBy('Month')

month_df['Month', 'avg(Close)'].show()

+-----+-----------------+

|Month| avg(Close)|

+-----+-----------------+

| 1|71.44801958415842|

| 2| 71.306804443299|

| 3|71.77794377570092|

| 4|72.97361900952382|

| 5|72.30971688679247|

| 6| 72.4953774245283|

| 7|74.43971943925233|

| 8|73.02981855454546|

| 9|72.18411785294116|

| 10|71.57854545454543|

| 11| 72.1110893069307|

| 12|72.84792478301885|

+-----+-----------------+
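
For comparison (a sketch, not from the original notebook), the same table can be produced by averaging only the Close column instead of calling mean() over every numeric column:

from pyspark.sql.functions import month, avg

# Average only the Close column, grouped by calendar month.
(df.withColumn('Month', month(df['Date']))
   .groupBy('Month')
   .agg(avg('Close').alias('avg(Close)'))
   .orderBy('Month')
   .show())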


Great Job!
