
exercise_3

May 12, 2023

1 Data Science I
1.1 Exercise 3: Data Munging, Data Cleaning, Rankings & Scores
Submission Deadline: May 15 2023, 07:00 UTC
University of Oldenburg, Summer 2023
Instructors: Maria Fernanda “MaFe” Davila Restrepo, Wolfram “Wolle” Wingerath
Submitted by: Akalin, Alp | Bagdatli, Ilayda | Yalcin, Mehmet

2 Part 1: Data Munging & Data Cleaning


1.) Find a table of storage prices over time (HDD/SSD).
a) How would you assess the quality of the data set you found? Do you need to do
any preparation before doing the analysis? (If so, what exactly did you do?)
Solution:
Based on the information available in the table, the data set appears reliable and trustworthy: it contains clear, consistent price-per-GB figures for both HDDs and SSDs over a significant period of time.

How much preparation is needed depends on the specific requirements of the analysis and on the data format. In this case, the data set is already in a structured tabular format and is largely clean. The one issue we did handle is missing values: SSD prices are not recorded for the earliest years (they appear as NaN), so we drop those rows before fitting any model. If the data set had also contained duplicates or inconsistencies, those would have to be cleaned up before starting the analysis.

Overall, it is always important to examine a data set carefully before any analysis and to ensure that it is clean, complete, and relevant to the research question at hand.
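To make these checks concrete, here is a minimal sketch (our own addition; the file name is hypothetical) of the kind of quality inspection we would run on such a table before analysis:

import pandas as pd

# Hypothetical file name; substitute the actual price table.
data = pd.read_excel("storage_prices.xlsx")

data.info()                     # column dtypes and non-null counts
print(data.isna().sum())        # missing values per column (e.g. early SSD prices)
print(data.duplicated().sum())  # number of fully duplicated rows
print(data.describe())          # value ranges, to spot implausible entries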
b) Analyze this data and make a projection about the cost/volume of data storage five
years from now.

Solution:

[31]: import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt

      # Load the data
      file_path = "C:/Users/10126426/Desktop/storage prices over time (HDD.SDD).xlsx"
      data = pd.read_excel(file_path)

      # Set the 'Year' column as the index
      data.set_index('Year', inplace=True)

      # Display the first few rows of the data
      print(data.head())

      HDD Price (per GB)  SSD Price (per GB)
Year
2000               11.05                 NaN
2001                5.72                 NaN
2002                2.70                 NaN
2003                1.42                 NaN
2004                0.83                 NaN

[32]: import pandas as pd
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LinearRegression

      # Load the data from an xlsx file
      df = pd.read_excel(file_path)

      # Drop any rows with NaN values
      df = df.dropna()

      # Plot the data
      plt.plot(df['Year'], df['HDD Price (per GB)'], label='HDD')
      plt.plot(df['Year'], df['SSD Price (per GB)'], label='SSD')
      plt.xlabel('Year')
      plt.ylabel('Price (USD per GB)')
      plt.title('Storage Prices Over Time')
      plt.legend()
      plt.show()

      # Fit a linear regression model to the data
      X = df['Year'].values.reshape(-1, 1)
      y = df['SSD Price (per GB)'].values.reshape(-1, 1)
      model = LinearRegression().fit(X, y)

      # Make a projection for five years from now
      future_year = [[2028]]
      predicted_price = model.predict(future_year)[0][0]
      print(f"The projected SSD price per GB in 2028 is: {predicted_price:.2f} USD")

The projected SSD price per GB in 2028 is: -2.04 USD


In this code, we load the data from an xlsx file using the pd.read_excel() function from the Pandas library and plot the trend of HDD and SSD prices over time using Matplotlib. Next, we fit a linear regression model to the SSD price data using Scikit-learn and use it to project the SSD price per GB five years from now (in 2028). Note that the projection comes out negative (−2.04 USD), which is impossible for a price: storage prices have fallen roughly exponentially toward zero, so a straight line fitted to them must eventually cross zero and keep falling. A linear model is therefore a poor choice for this extrapolation.
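A better-behaved alternative is to fit the regression in log space, so the projected price decays exponentially and stays positive. The following is a minimal sketch of this idea (our own addition, not part of the graded solution; it assumes df has been loaded and cleaned as above):

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit log(price) ~ year instead of price ~ year (exponential-decay assumption).
X = df['Year'].values.reshape(-1, 1)
log_model = LinearRegression().fit(X, np.log(df['SSD Price (per GB)'].values))

# Transform back out of log space for the projection.
pred_2028 = np.exp(log_model.predict([[2028]])[0])
print(f"Log-linear projected SSD price per GB in 2028: {pred_2028:.4f} USD")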
c) What will disk prices be in 25 or 50 years?
Solution:

[33]: # Fit a linear regression model to the historical SSD data
      X_ssd = df['Year'].values.reshape(-1, 1)
      y_ssd = df['SSD Price (per GB)'].values.reshape(-1, 1)
      model_ssd = LinearRegression().fit(X_ssd, y_ssd)

      # Make predictions for the historical SSD data and future years
      y_ssd_pred = model_ssd.predict(X_ssd)
      future_years = np.arange(df['Year'].max(), df['Year'].max() + 51).reshape(-1, 1)
      future_ssd_pred = model_ssd.predict(future_years)

      # Plot the historical SSD data and the linear regression line
      plt.scatter(X_ssd, y_ssd)
      plt.plot(X_ssd, y_ssd_pred, color='red')
      plt.title('Historical SSD Prices')
      plt.xlabel('Year')
      plt.ylabel('Price (per GB)')
      plt.show()

      # Plot the future SSD price predictions
      plt.plot(future_years, future_ssd_pred, color='green')
      plt.title('Projected SSD Prices')
      plt.xlabel('Year')
      plt.ylabel('Price (per GB)')
      plt.show()

      # Fit a linear regression model to the historical HDD data
      X_hdd = df['Year'].values.reshape(-1, 1)
      y_hdd = df['HDD Price (per GB)'].values.reshape(-1, 1)
      model_hdd = LinearRegression().fit(X_hdd, y_hdd)

      # Make predictions for the historical HDD data and future years
      y_hdd_pred = model_hdd.predict(X_hdd)
      future_hdd_pred = model_hdd.predict(future_years)

      # Plot the historical HDD data and the linear regression line
      plt.scatter(X_hdd, y_hdd)
      plt.plot(X_hdd, y_hdd_pred, color='blue')
      plt.title('Historical HDD Prices')
      plt.xlabel('Year')
      plt.ylabel('Price (per GB)')
      plt.show()

      # Plot the future HDD price predictions
      plt.plot(future_years, future_hdd_pred, color='purple')
      plt.title('Projected HDD Prices')
      plt.xlabel('Year')
      plt.ylabel('Price (per GB)')
      plt.show()

[Figures: scatter of historical SSD prices with the fitted regression line, projected SSD prices, historical HDD prices with the fitted regression line, and projected HDD prices; all plots show price per GB against year.]
We first load the data from an xlsx file, drop any rows with NaN values, and fit separate linear regression models to the historical SSD and HDD prices, which we then extrapolate 50 years into the future.
Again, keep in mind that these projections are only estimates based on historical data and trends; many external factors can affect actual future prices. In particular, the fitted SSD line already predicts a negative price by 2028, and both declining lines eventually go negative, which is impossible for prices, so the 25- and 50-year figures should not be read literally. The purpose of these visualizations is to give a rough idea of how HDD and SSD prices might evolve based on historical data.
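Under the same log-space model sketched in Part 1 b) (again our own addition, assuming df is loaded as above), the long-range projections remain positive:

import numpy as np
from sklearn.linear_model import LinearRegression

# Refit the log-space model (exponential-decay assumption).
X = df['Year'].values.reshape(-1, 1)
log_model = LinearRegression().fit(X, np.log(df['SSD Price (per GB)'].values))

# Roughly 25 and 50 years from 2023.
for year in (2048, 2073):
    price = np.exp(log_model.predict([[year]])[0])
    print(f"Log-linear projected SSD price per GB in {year}: {price:.2e} USD")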
2.) What types of outliers might you expect to occur in the following data sets?
a) Student grades
Solution:
In a data set of student grades, we might expect to see outliers that are caused by factors such as:
1. Errors in data entry, where a student’s grade is recorded incorrectly due to a mistake in data
input.
2. Extreme performances by individual students, such as a student who performs exceptionally
well or poorly on a particular assessment.
3. Cheating or academic misconduct, where a student’s grade is artificially inflated or deflated
due to academic dishonesty.
4. Variations in teaching quality or assessment difficulty, where a particular teacher or assessment
is significantly different from others in the data set.

5. External factors that impact student performance, such as illness or personal circumstances,
which may cause a student to perform unusually well or poorly compared to their peers.
It’s important to note that outliers in student grade data can be particularly sensitive, as they can
have significant consequences for individual students and their academic careers. Therefore, it’s
important to investigate any potential outliers and determine whether they are genuine or due to
errors or other factors.
b) Salary data
Solution:
In a data set of salary data, we might expect to see outliers that are caused by factors such as:
1. Executive salaries, which may be significantly higher than other salaries in the same company
or industry.
2. Extreme performances by individual employees, such as a highly successful salesperson or a
poorly performing employee with a high salary.
3. Seasonal or temporary employment, where an employee’s salary is higher or lower than usual
due to the nature of their employment.
4. Unusual bonuses or incentives, which may cause a particular employee’s salary to be higher
or lower than expected.
5. Data entry errors, where an employee’s salary is recorded incorrectly due to a mistake in data
input.
It’s important to note that outliers in salary data can have significant impacts on an organization’s
budget and financial planning, as well as the morale and motivation of employees. Therefore, it’s
important to investigate any potential outliers and determine whether they are genuine or due to
errors or other factors. Additionally, organizations may want to consider whether extreme salaries
are appropriate or sustainable in the long term.
c) Lifespans in Wikipedia
Solution:
1. Errors in data entry, where a person’s lifespan is recorded incorrectly due to a mistake in
data input.
2. Longevity records, where individuals who have lived exceptionally long lives are included in
the data set.
3. Unusual circumstances, such as accidents or illnesses, which may cause an individual’s lifespan
to be much shorter than expected.
4. Variations in lifespans across different cultures and historical periods, where certain individ-
uals may have lived much longer or shorter lives than others due to factors such as diet,
healthcare, or living conditions.
5. Individuals who have achieved significant accomplishments or notoriety, where their lifespan
is of interest to researchers or the general public.
It’s important to note that outliers in lifespan data can be particularly sensitive, as they can
have significant impacts on research and our understanding of longevity and aging. Therefore, it’s
important to investigate any potential outliers and determine whether they are genuine or due to
errors or other factors. Additionally, researchers may want to consider how variations in lifespans
across different cultures and historical periods may impact their analysis and conclusions.

3 Part 2: Scores & Rankings
3.) Let X represent a random variable drawn from the normal distribution defined by µ = 2 and σ = 3. Suppose we observe X = 5.08.
Find the Z-score of x, and determine how many standard deviations away from the mean that x is.
Solution:
To find the Z-score of X, we can use the formula:

Z = (X − µ) / σ

where X is the observed value, µ is the mean, and σ is the standard deviation. Substituting the given values, we get:

Z = (5.08 − 2) / 3 = 1.0267

Therefore, the Z-score of X is 1.0267. Since the Z-score by definition measures the number of standard deviations away from the mean, X is 1.0267 standard deviations above the mean; rearranging, X − µ = Z · σ = 1.0267 · 3 ≈ 3.08, so X lies approximately 3.08 units above the mean.
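As a quick sketch (our own addition), the same numbers can be checked in Python:

from scipy.stats import norm

mu, sigma, x = 2, 3, 5.08

# The Z-score is by definition the number of standard deviations from the mean.
z = (x - mu) / sigma
print(f"Z-score: {z:.4f}")  # 1.0267

# Share of the N(mu, sigma) distribution at or below x:
print(f"P(X <= {x}): {norm.cdf(x, loc=mu, scale=sigma):.4f}")  # approx. 0.8477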
4.) What percentage of the standard normal distribution (µ = 0, � = 1) is found in
each region?
a) Z > 1.13
Solution:
To find the percentage of the standard normal distribution that is in the region Z > 1.13, we can
use a standard normal distribution table or a calculator with a normal distribution function.
Using a standard normal distribution table, we can find the area under the curve to the right of Z
= 1.13, which is 0.1292 or approximately 12.92%. This means that approximately 12.92% of the
standard normal distribution is in the region Z > 1.13.
Alternatively, we can use a calculator or spreadsheet with a normal distribution function to find the same result. For example, the formula =1-NORMDIST(1.13, 0, 1, TRUE) in Microsoft Excel returns 0.1292, or approximately 12.92%.
b) Z < 0.18
Solution:

10
To find the percentage of the standard normal distribution that is in the region Z < 0.18, we can
use a standard normal distribution table or a calculator with a normal distribution function.
Using a standard normal distribution table, we can find the area under the curve to the left of Z
= 0.18, which is 0.5714 or approximately 57.14%. This means that approximately 57.14% of the
standard normal distribution is in the region Z < 0.18.
Alternatively, we can use a calculator or spreadsheet with a normal distribution function to find the same result. For example, the formula =NORMDIST(0.18, 0, 1, TRUE) in Microsoft Excel returns 0.5714, or approximately 57.14%.
c) Z > 8
Solution:
To find the percentage of the standard normal distribution that is in the region Z > 8, we can use
a standard normal distribution table or a calculator with a normal distribution function.
Using a standard normal distribution table, we can see that the area under the curve to the right
of Z = 8 is extremely small and close to 0. Therefore, we can approximate that the percentage of
the standard normal distribution in the region Z > 8 is effectively 0%.
Alternatively, we can compute the tail probability directly: P(Z > 8) is approximately 6.2E-16, which is extremely close to 0. (Computing it as 1 − Φ(8) in a spreadsheet is numerically unreliable this far into the tail because of floating-point cancellation; a survival function such as SciPy's norm.sf is the safer route, as shown in the sketch after part d).) This confirms that the percentage of the standard normal distribution in the region Z > 8 is effectively 0%.
d) |Z| < 0.5
Solution:
To find the percentage of the standard normal distribution that is in the region |Z| < 0.5, we can use a standard normal distribution table or a calculator with a normal distribution function.

The region |Z| < 0.5 is the area between Z = −0.5 and Z = 0.5. Because the standard normal distribution is symmetric around the mean of 0, this central area is

P(|Z| < 0.5) = Φ(0.5) − Φ(−0.5) = 2Φ(0.5) − 1

Using a standard normal distribution table, the area under the curve to the left of Z = 0.5 is Φ(0.5) = 0.6915. Substituting, we get 2 × 0.6915 − 1 = 0.3829, so approximately 38.29% of the standard normal distribution lies in the region |Z| < 0.5.

Note that simply doubling Φ(0.5) would give 138.3%, which cannot be a probability; the doubling applies to the central area via 2Φ(z) − 1, not to the full left-tail area Φ(z).
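As a sketch (our own addition), all four regions can be computed with SciPy; norm.sf is the survival function 1 − Φ(z), which stays accurate in the far tail:

from scipy.stats import norm

print(f"a) P(Z > 1.13)  = {norm.sf(1.13):.4f}")          # 0.1292
print(f"b) P(Z < 0.18)  = {norm.cdf(0.18):.4f}")         # 0.5714
print(f"c) P(Z > 8)     = {norm.sf(8):.2e}")             # approx. 6.2e-16
print(f"d) P(|Z| < 0.5) = {2 * norm.cdf(0.5) - 1:.4f}")  # 0.3829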
5.) Amanda took the Graduate Record Examination (GRE), and scored 160 in verbal reasoning and 157 in quantitative reasoning. The mean score for verbal reasoning was 151 with a standard deviation of 7, compared with mean µ = 153 and σ = 7.67 for quantitative reasoning. Assume that both distributions are normal.
a) What were Amanda’s Z-scores on these exam sections? Mark these scores on a
standard normal distribution curve.
Solution:
To find Amanda’s Z-scores for each section, we can use the formula:

Z = (X − µ) / σ

where X is the observed score, µ is the mean, and σ is the standard deviation.
For the verbal reasoning section, Amanda’s Z-score is:
Z_verbal = (160 - 151) / 7 = 1.29
For the quantitative reasoning section, her Z-score is:
Z_quantitative = (157 - 153) / 7.67 = 0.52
To mark these scores on a standard normal distribution curve, we plot both Z-scores on the x-axis of the standard normal density: 0.52 (quantitative) sits about half a standard deviation above the mean of 0, while 1.29 (verbal) lies noticeably further into the right tail.
b) Which section did she do better on, relative to other students?
Solution:
To compare Amanda’s performance relative to other students, we need to look at her percentile
ranks for each section. Percentile rank indicates the percentage of scores that fall below a given
score.
To find Amanda’s percentile rank for the verbal reasoning section, we can use a standard normal
distribution table or a calculator with a normal distribution function. Using a table, we can find
the area under the curve to the left of Amanda’s Z-score of 1.29, which is 0.9015 or approximately
90.15%. This means that Amanda scored higher than approximately 90.15% of other students who
took the verbal reasoning section.
To find Amanda’s percentile rank for the quantitative reasoning section, we can use the same
method. Using a calculator or table, we can find the area under the curve to the left of Amanda’s
Z-score of 0.52, which is 0.6985 or approximately 69.85%. This means that Amanda scored higher
than approximately 69.85% of other students who took the quantitative reasoning section.
Therefore, Amanda did better relative to other students in the verbal reasoning section, as her
percentile rank was higher (90.15%) compared to the percentile rank for the quantitative reasoning
section (69.85%).
c) Find her percentile scores for the two exams.
Solution:
To find Amanda’s percentile scores for the two exams, we can use the same method as before to find
the area under the curve to the left of her Z-scores. Then we can convert that area to a percentile
score using a standard normal distribution table or calculator.

For the verbal reasoning section, Amanda’s Z-score was 1.29. Using a standard normal distribution
table or calculator, we find that the area under the curve to the left of 1.29 is approximately 0.9015.
To convert this to a percentile score, we multiply by 100 to get:
Percentile score for verbal reasoning section = 0.9015 x 100 = 90.15%
This means that Amanda scored higher than approximately 90.15% of other students who took the
verbal reasoning section.
For the quantitative reasoning section, Amanda’s Z-score was 0.52. Using a standard normal distri-
bution table or calculator, we find that the area under the curve to the left of 0.52 is approximately
0.6985. To convert this to a percentile score, we multiply by 100 to get:
Percentile score for quantitative reasoning section = 0.6985 x 100 = 69.85%
This means that Amanda scored higher than approximately 69.85% of other students who took the
quantitative reasoning section.
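As a sketch (our own addition), both Z-scores and percentile ranks can be computed directly; small differences from the table values come from rounding Z before the lookup:

from scipy.stats import norm

# (section, score, mean, standard deviation) as given in the exercise.
for section, x, mu, sigma in [("verbal", 160, 151, 7), ("quantitative", 157, 153, 7.67)]:
    z = (x - mu) / sigma
    print(f"{section}: Z = {z:.2f}, percentile = {norm.cdf(z) * 100:.2f}%")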
6.) Identify three successful and well-used scoring functions in areas of personal in-
terest to you. For each, explain what makes it a good scoring function and how it is
used to create rankings in that domain.
Solution:
1. Search Engine Ranking Function: One of the most successful scoring functions is the PageR-
ank algorithm used by Google Search. This algorithm measures the relevance and authority
of a webpage by analyzing the number and quality of links pointing to it. The more links a
page has from other relevant and authoritative pages, the higher its PageRank score, and the
higher it appears in search engine results. This makes it a good scoring function because it can
effectively identify high-quality and relevant content on the web, and rank them accordingly
in search results.
2. Credit Score Function: A credit score is a numerical value that indicates a person’s credit-
worthiness and ability to pay back debts. It is used by lenders to evaluate the risk of lending
money to a borrower. The FICO score is one of the most widely used scoring functions for
credit evaluation. It considers factors such as payment history, credit utilization, length of
credit history, types of credit, and recent credit inquiries to calculate a score between 300
and 850. A higher score indicates a lower credit risk, making it a good scoring function for
lenders to assess the creditworthiness of an individual.
3. Sports Ranking Function: A commonly used scoring function in sports is the Elo rating
system. This system was originally developed for chess, but has been adopted by various
sports, including tennis, football, and basketball. It measures the relative skill level of players
or teams by analyzing their performance in previous matches and adjusting their ratings
based on the outcome of each match. A player or team’s rating increases when they win
against a stronger opponent, and decreases when they lose against a weaker opponent. This
makes it a good scoring function for ranking players or teams in a given sport, as it takes into
account their performance over time and the strength of their opponents.
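To make the Elo update rule concrete, here is a minimal sketch (our own addition; K = 32 is a conventional chess value, not something fixed by the exercise):

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    """
    # Expected score of A under the Elo logistic model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    # Ratings move in proportion to the surprise (actual minus expected).
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# An upset: a 1400-rated player beats a 1600-rated player and gains more
# points than they would for beating an equal or weaker opponent.
print(elo_update(1400, 1600, score_a=1.0))  # approx. (1424.3, 1575.7)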

4 Finally: Submission
Save your notebook and submit it (as both notebook and PDF file). And please don’t forget to

- … choose a file name according to convention (see Exercise Sheet 1) and to
- … include the execution output in your submission!

