MUNAR - Linear Regression.ipynb - Colaboratory
Linear Regression
Objective(s):
This activity aims to perform regression analysis using linear regression
Demonstrate how to use python to predict the outcome using linear regression.
Demonstrate how to use single and multiple features to predict the outcome using linear
regression.
Resources:
Jupyter Notebook
stores_dist.csv
real_estate_price_size_year.csv
Ames_Housing_Sales.csv
Procedure:
from google.colab import drive
drive.mount('/content/drive')
#import the libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#import the stores_dist.csv data
salesDist = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/stores_dist.csv')
#check the imported data
salesDist.head()
   district  annual net sales  number of stores in district
0         1             231.0                            12
1         2             156.0                            13
2         3              10.0                            16
3         4             519.0                             2
4         5             437.0                             6
Rename the annual sales to sales and number of stores in district to stores
#rename the annual sales to sales and the number of stores in district to stores
salesDist = salesDist.rename(columns={'annual net sales':'sales','number of stores in district':'stores'})
#check the salesDist data to verify if the columns were renamed.
salesDist.head()
   district  sales  stores
0         1  231.0      12
1         2  156.0      13
2         3   10.0      16
3         4  519.0       2
4         5  437.0       6
#check the correlation
salesDist.corr()
Drop the column with the lowest correlation and verify the dataframe if the column was deleted.
sales = salesDist.drop('district',axis=1)
sales.head()
sales stores
0 231.0 12
1 156.0 13
2 10.0 16
3 519.0 2
4 437.0 6
From the correlation coefficient data, what type of correlation did you observe between annual net
sales and number of stores in the district?
Negative correlation
Create a plot to visualize the data. You will also assign stores as the independent variable x and
sales as the dependent variable y .
# dependent variable for y axis
y = sales['sales']
# independent variable for x axis
x = sales.stores
# Display the plot inline
%matplotlib inline
# Increase the size of the plot
plt.figure(figsize=(20,10))
# Create a scatter plot: Number of stores in the District vs. Annual Net Sales
plt.plot(x,y, 'o', markersize = 15)
# Add axis labels and increase the font size
plt.ylabel('Annual Net Sales', fontsize = 30)
plt.xlabel('Number of Stores in the District', fontsize = 30)
# Increase the font size on the ticks on the x and y axis
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
# Display the scatter plot
plt.show()
m, b = np.polyfit(x,y,1)
print ('The slope of line is {:.2f}.'.format(m))
print ('The y-intercept is {:.2f}.'.format(b))
print ('The best fit simple linear regression line is {:.2f}x + {:.2f}.'.format(m,b))
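As a quick visual check (not part of the original procedure), the fitted line y = m*x + b can be overlaid on the scatter plot; this sketch assumes x, y, m and b are defined as above.

# Sketch: overlay the best-fit line from np.polyfit on the scatter plot
plt.figure(figsize=(20,10))
plt.plot(x, y, 'o', markersize = 15)      # data points
plt.plot(x, m*x + b, linewidth = 3)       # best-fit line
plt.xlabel('Number of Stores in the District', fontsize = 30)
plt.ylabel('Annual Net Sales', fontsize = 30)
plt.show()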
Using the linear regression line, you can predict the annual net sales based on the number of stores
in the district.
# Function to predict the net sales from the regression line
def predict(query):
    if query >= 1:
        predict = m * query + b
        return predict
    else:
        print("You must have at least 1 store in the district to predict the annual net sales.")
# Enter the number of stores in the function to generate the net sales prediction.
predict(4)
456.2313681207654
predict(16)
26.786342565077348
Import the libraries. Make sure to install all the libraries to avoid errors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/real_estate_price_size_year.csv')
data.head()
data.describe()
The average real estate price in the dataset is 292,289.47 currency units. The cheapest property, at
154,282.13, is 52.78% of the average price. The standard deviation is 77,051.73. Dividing the
standard deviation by the mean gives a coefficient of variation of 0.26, which is well below 1, so
the dataset is fairly concentrated around the mean.
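A minimal sketch to reproduce the figures cited above directly from the data (assuming data is the real_estate_price_size_year dataframe loaded earlier):

# Sketch: mean, standard deviation and coefficient of variation of price
mean_price = data['price'].mean()
std_price = data['price'].std()
print('mean = {:.2f}'.format(mean_price))
print('std = {:.2f}'.format(std_price))
print('coefficient of variation = {:.2f}'.format(std_price / mean_price))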
Calculate the multiple linear regression. Set the dependent and independent variables.
# Following the regression equation, our dependent variable (y) is the price
y = data ['price']
# Similarly, our independent variable (x) is the size and year
x1 = data [['size','year']]
# Add a constant. Essentially, we are adding a new column (equal in length to x1), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model, according to the OLS (ordinary least squares) method, with a dependent variable y and an independent variable x
results = sm.OLS(y,x).fit()
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be passed as keyword arguments.
  x = pd.concat(x[::order], 1)
# Print a nice summary of the regression.
results.summary()
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.41e+05. This might indicate that there are strong multicollinearity or other numerical problems.
Interpret the result. Explain the adjusted R-squared, coef, and standard error.
The result of the OLS regression shows that size and year are good predictors of price. Both
variables are statistically significant because their p-values equal 0.000. Each coefficient (coef)
signifies how much the mean of the dependent variable ( price ) changes for a one-unit shift in the
corresponding independent variable ( size or year ), holding the other variable in the model constant;
of the two, year has the larger positive effect on price. The standard error measures how precisely
each coefficient is estimated; the smaller it is relative to the coefficient, the more reliable the
estimate. The R-squared of the regression is the fraction of the variation in the dependent variable
( price ) that is accounted for by the independent variables ( size and year ). Adjusted R-squared, a
modified version of R-squared, adds precision and reliability by accounting for the impact of
additional independent variables that would otherwise inflate the R-squared measurement.
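The quantities discussed above can also be read directly from the fitted statsmodels results object rather than from the printed summary table; a minimal sketch, assuming results is the fitted OLS model from the cells above:

# Sketch: pull R-squared, adjusted R-squared, coefficients, standard errors and p-values
print('R-squared:          {:.3f}'.format(results.rsquared))
print('Adjusted R-squared: {:.3f}'.format(results.rsquared_adj))
print(results.params)    # coefficients for const, size and year
print(results.bse)       # standard errors of the coefficients
print(results.pvalues)   # p-values of the coefficients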
Import the data using Pandas. Check the data. Examine the data types and shape of the dataset.
import pandas as pd
import numpy as np
# Import the data using the file path
data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/Ames_Housing_Sales.csv')
print(data.shape)
(1379, 80)
data
data.dtypes.value_counts()

[output: counts of float64, object, and int64 columns; the table is not legible in this printout]

# Select the object (string) columns
mask = data.dtypes == object   # np.object is deprecated; use the builtin object
categorical_cols = data.columns[mask]
num_ohc_cols = (data[categorical_cols]
.apply(lambda x: x.nunique())
.sort_values(ascending=False))
# No need to encode if there is only one value
small_num_ohc_cols = num_ohc_cols.loc[num_ohc_cols>1]
# Number of one-hot columns is one less than the number of categories
small_num_ohc_cols -= 1
# This is 215 columns, assuming the original ones are dropped.
# This is quite a few extra columns!
small_num_ohc_cols.sum()
215
Create a new data set where all of the above categorical features will be one-hot encoded.
Use the dataframe .copy() method to create a completely separate copy of the dataframe
for one-hot encoding
On this new dataframe, one-hot encode each of the appropriate columns and add it back to
the dataframe. Be sure to drop the original column.
For the data that are not one-hot encoded, drop the columns that are string categoricals.
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Copy of the data
data_ohc = data.copy()
# The encoders
le = LabelEncoder()
ohc = OneHotEncoder()
for col in num_ohc_cols.index:

    # Integer encode the string categories
    dat = le.fit_transform(data_ohc[col]).astype(int)   # np.int is deprecated; use int

    # Remove the original column from the dataframe
    data_ohc = data_ohc.drop(col, axis=1)

    # One hot encode the data--this returns a sparse array
    new_dat = ohc.fit_transform(dat.reshape(-1,1))

    # Create unique column names
    n_cols = new_dat.shape[1]
    col_names = ['_'.join([col, str(x)]) for x in range(n_cols)]

    # Create the new dataframe
    new_df = pd.DataFrame(new_dat.toarray(),
                          index=data_ohc.index,
                          columns=col_names)

    # Append the new data to the dataframe
    data_ohc = pd.concat([data_ohc, new_df], axis=1)
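A more concise alternative (not the approach used in this activity) is pandas' get_dummies, which one-hot encodes all of the listed columns in a single call; a minimal sketch, assuming data and num_ohc_cols are defined as above:

# Sketch: one-hot encode every string categorical column with pandas
data_ohc_alt = pd.get_dummies(data, columns=list(num_ohc_cols.index))
print(data_ohc_alt.shape)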
data_ohc
print(data.shape[1])
80

# For the data that are not one-hot encoded, drop the string categorical columns
data = data.drop(num_ohc_cols.index, axis=1)
print(data.shape[1])
37
Create train and test splits of both data sets. To ensure the data gets split the same way, use the
same random_state in each of the two splits.
For each data set, fit a basic linear regression model on the training data.
Calculate the mean squared error on both the train and test sets for the respective models.
from sklearn.model_selection import train_test_split
y_col = 'SalePrice'
# Split the data that is not one-hot encoded
feature_cols = [x for x in data.columns if x != y_col]
X_data = data[feature_cols]
y_data = data[y_col]
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data,
test_size=0.3, random_state=42)
# Split the data that is one-hot encoded
feature_cols = [x for x in data_ohc.columns if x != y_col]
X_data_ohc = data_ohc[feature_cols]
y_data_ohc = data_ohc[y_col]
X_train_ohc, X_test_ohc, y_train_ohc, y_test_ohc = train_test_split(X_data_ohc, y_data_ohc,
test_size=0.3, random_state=42)
# Compare the indices to ensure they are identical
(X_train_ohc.index == X_train.index).all()
True
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
LR = LinearRegression()
# Storage for error values
error_df = list()
# Data that have not been one-hot encoded
LR = LR.fit(X_train, y_train)
y_train_pred = LR.predict(X_train)
y_test_pred = LR.predict(X_test)
error_df.append(pd.Series({'train': mean_squared_error(y_train, y_train_pred),
'test' : mean_squared_error(y_test, y_test_pred)},
name='no enc'))
# Data that have been one-hot encoded
LR = LR.fit(X_train_ohc, y_train_ohc)
y_train_ohc_pred = LR.predict(X_train_ohc)
y_test_ohc_pred = LR.predict(X_test_ohc)
error_df.append(pd.Series({'train': mean_squared_error(y_train_ohc, y_train_ohc_pred),
'test' : mean_squared_error(y_test_ohc, y_test_ohc_pred)},
name='one-hot enc'))
# Assemble the results
error_df = pd.concat(error_df, axis=1)
error_df
For each of the data sets (one-hot encoded and not encoded):
Scale all of the non-one-hot-encoded values using one of the following: StandardScaler ,
MinMaxScaler , MaxAbsScaler .
Compare the errors calculated on the test sets.
# Mute the SettingWithCopy warnings
pd.options.mode.chained_assignment = None
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
scalers = {'standard': StandardScaler(),
'minmax': MinMaxScaler(),
'maxabs': MaxAbsScaler()}
training_test_sets = {
'not_encoded': (X_train, y_train, X_test, y_test),
'one_hot_encoded': (X_train_ohc, y_train_ohc, X_test_ohc, y_test_ohc)}
# Get the list of float columns, and the float data
# so that we don't scale something we already scaled.
# We're supposed to scale the original data each time
mask = X_train.dtypes == float   # np.float is deprecated; use the builtin float
float_columns = X_train.columns[mask]
# initialize model
LR = LinearRegression()
# iterate over all possible combinations and get the errors
errors = {}
for encoding_label, (_X_train, _y_train, _X_test, _y_test) in training_test_sets.items():
    for scaler_label, scaler in scalers.items():
        trainingset = _X_train.copy()  # copy because we don't want to scale this more than once
        testset = _X_test.copy()
        trainingset[float_columns] = scaler.fit_transform(trainingset[float_columns])
        testset[float_columns] = scaler.transform(testset[float_columns])
        LR.fit(trainingset, _y_train)
        predictions = LR.predict(testset)
        key = encoding_label + ' - ' + scaler_label + ' scaling'
        errors[key] = mean_squared_error(_y_test, predictions)
errors = pd.Series(errors)
print(errors.to_string())
print('-' * 80)
for key, error_val in errors.items():
print(key, error_val)
--------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('talk')
sns.set_style('ticks')
sns.set_palette('dark')
ax = plt.axes()
# we are going to use y_test, y_test_pred
ax.scatter(y_test, y_test_pred, alpha=.5)
ax.set(xlabel='Ground truth',
ylabel='Predictions',
title='Ames, Iowa House Price Predictions vs Truth, using Linear Regression');
[ Linear Regression ]
Import Package
# Libraries required for data analysis and preprocessing
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Libraries required for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries required for linear regression model application
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
# Libraries required for residualization
import scipy.stats
Import Data
Bureau of Customs import data for January 2022 from https://customs.gov.ph/import-reports/
org_df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/DailyfiledJan
org_df.head()
Data Preprocessing
To deal meaningfully with the imported data, we will check for and delete rows recorded as NaN (null).
Let's check for missing values first.
org_df.isnull().sum()
hs_code 0
country 1
weight 0
amount_php 0
tax 0
dtype: int64
df = org_df.dropna()
df
df.describe()
From the current data, price ( amount_php ) is the import amount per item and tax is the tax per
item, so we can calculate the tax rate in this way.
df = df.assign(tax_rate = df['tax'] / (df['amount_php']))
df.describe()
A rugplot is a graph that draws a small line segment (rug) for each observation to check the
distribution of the data.
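The rugplot cell itself is not legible in this printout; a minimal sketch of one way to draw it with seaborn (the choice of the amount_php column here is only illustrative):

# Sketch: mark each observation with a small tick along the axis
plt.figure(figsize=(12, 2))
sns.rugplot(x=df['amount_php'])
plt.xlabel('amount_php')
plt.show()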
A. Explore Correlations
Correlation is a measure of the degree of linear relationship of variables.
Because regression techniques represent the relationship in which the independent variable affects
the dependent variable, the weight for that variable in the regression formula may vary depending
on the direction and strength of the correlation.
numeric_df = df[['weight', 'amount_php', 'tax']]
numeric_df.head()
The correlation of each feature can be quantified through a heatmap, and visually seen at the same
time. The bar shown on the right side of the heat map shows what color the correlation is based on
its strength.
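The code cell for this heatmap is not legible in the printout; a minimal sketch of what it likely looked like, following the same pattern used later on the log-transformed data:

# Sketch: correlation heatmap of the numeric features
corr = numeric_df.corr()
sns.heatmap(corr, annot=True)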
<matplotlib.axes._subplots.AxesSubplot at 0x7eff18ebd790>
The formula used to calculate the correlation here is Pearson Correlation coefficient, a
quantification of the linear correlation between two variables X and Y. 1 means perfect positive
linear correlation, 0 means no linear correlation, and -1 means perfect negative correlation. The
graph above shows that the variable that shows the strongest linear correlation with tax is price. It's
a natural result because taxes are levied at a fraction of the price of the product.
However, it is dangerous to identify the correlation only with figures like this. As mentioned, all
Pearson correlation coefficients show is a 'linear correlation'. That is, even if there is a nonlinear
correlation, the Pearson correlation coefficient can come close to zero. Therefore, it is
recommended that you always draw a pair plot together to check the correlation. The pairplot
provided by the seaborn library visualizes and shows the correlation of each variable.
sns.pairplot(numeric_df, markers='x', diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x7eff18ebd190>
The graphs on the diagonal of the pairplot represent the distribution of each variable, and the rest
are scatterplots of pairs of variables. As previously identified by the correlation coefficient, price
and tax show the strongest linear correlation. However, we can see that not all data points lie along
the upward-sloping line. If you look at the third graph from the left in the last row, where the X-axis
is price and the Y-axis is tax , you can see that some taxes are underreported even as prices rise.
B. Log Transformation
If you look at the pairplot you can see that the data is heavily skewed, with most values concentrated
near the low end rather than evenly distributed. Let's visualize the distribution through histograms.
numeric_df.hist(figsize=(15,10))
Log transformation controls the distribution by reducing the skew of the data. We're going to apply
the log() function of the numpy library, but since values between 0 and 1 map to negative values
(and log(0) is undefined), let's add 1 to the whole dataset before applying the function.
numeric_log_df = np.log(numeric_df + 1)
numeric_log_df.hist(figsize=(15,10))
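As a side note, numpy's log1p computes log(x + 1) directly and is more numerically stable for values close to zero; a minimal sketch of the equivalence (the name numeric_log_df_alt is only illustrative):

# Sketch: np.log1p is equivalent to np.log(x + 1)
numeric_log_df_alt = np.log1p(numeric_df)
print(np.allclose(numeric_log_df_alt, np.log(numeric_df + 1)))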
When the histograms are redrawn after the log transformation, we can see that the data is distributed
much more evenly. Let's look at the basic statistics and pairplot graphs again.
numeric_log_df.describe()
sns.pairplot(numeric_log_df, markers='x', diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x7eff1869dc50>
After the log transformation, we can see that the relationship between each pair of variables is much
better visualized.
corr = numeric_log_df.corr()
sns.heatmap(corr, annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7eff183b9bd0>
# create object
scaler = MinMaxScaler()
# Fit -> transform
scaler.fit(numeric_log_df)
scaled = scaler.transform(numeric_log_df)
# Convert to a data frame because it is returned in an array form
numeric_log_df = pd.DataFrame(data = scaled, columns=numeric_log_df.columns)
numeric_log_df
Applying Algorithms
Y = numeric_log_df['tax']
# The cell defining X is not legible in this printout; this reconstruction assumes
# X is every column except the target
X = numeric_log_df.drop('tax', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2)
Defining Variables
x_train = X_train[['weight', 'amount_php']]
y_train = Y_train
The independent variables x are weight and amount_php (price), and the dependent variable y is tax.
multi_fitter = LinearRegression()
multi_fitter.fit(x_train, y_train)
LinearRegression()
Through the fit() function, multi_fitter utilized the training data to model the linear relationship
between the independent and dependent variables.
multi_fitter.coef_
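A small sketch (not part of the original procedure) to pair each coefficient with its feature name and show the intercept, which makes the fitted equation easier to read:

# Sketch: print the fitted coefficients with their feature names
for name, coef in zip(x_train.columns, multi_fitter.coef_):
    print('{}: {:.4f}'.format(name, coef))
print('intercept: {:.4f}'.format(multi_fitter.intercept_))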
y_predict = multi_fitter.predict(x_train)
plt.scatter(y_train, y_predict, alpha=0.4)
plt.xlabel("Actual Tax")
plt.ylabel("Predicted Tax")
multi_fitter.score(x_train, y_train)
0.3003469124609752
MSE, RMSE
We will measure the error of the model through MSE and RMSE.
mse = mean_squared_error(y_train, multi_fitter.predict(x_train))
rmse = np.sqrt(mse)
mse, rmse
(0.03697998323651459, 0.19230180247858986)
The RMSE, which expresses the error in the same (scaled) units as the target, shows that the tax
predicted by the model is off by approximately 0.192 on average.
K-Fold
scores = cross_val_score(multi_fitter, x_train, y_train, cv=4)
scores
scores.mean()
0.3003315436263042
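cross_val_score can also report error metrics rather than R-squared; scikit-learn returns negated losses, so the sign is flipped and the square root taken to get an RMSE per fold. A minimal sketch, assuming multi_fitter, x_train and y_train from above:

# Sketch: 4-fold cross-validated RMSE
neg_mse_scores = cross_val_score(multi_fitter, x_train, y_train, cv=4,
                                 scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-neg_mse_scores)
print(rmse_scores, rmse_scores.mean())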
x_test = X_test[[ 'weight', 'amount_php']]
y_test = Y_test
Let's visualize the actual tax value of the test data and the tax value that the model predicted from
the test data.
The more accurately the model predicts, the closer the distribution of the points will be to a straight
line with a slope of 1.
y_predict = multi_fitter.predict(x_test)
plt.scatter(y_test, y_predict, alpha=0.4)
plt.xlabel("Actual Tax")
plt.ylabel("Predicted Tax")
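A sketch (not part of the original procedure) that redraws the scatter plot with the y = x reference line mentioned above, so deviations from a perfect prediction are easier to see:

# Sketch: predicted vs. actual tax with a perfect-prediction reference line
plt.scatter(y_test, y_predict, alpha=0.4)
lims = [min(y_test.min(), y_predict.min()), max(y_test.max(), y_predict.max())]
plt.plot(lims, lims, linestyle='--', color='red')   # y = x reference line
plt.xlabel("Actual Tax")
plt.ylabel("Predicted Tax")
plt.show()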
B. MSE, RMSE
Let's estimate the error through MSE and RMSE.
mse = mean_squared_error(y_test, multi_fitter.predict(x_test))
rmse = np.sqrt(mse)
mse, rmse
(0.03688809427338566, 0.19206273525435813)
It has an error of about 0.192, similar to the error on the training data.
multi_fitter.score(x_test, y_test)
0.3055014734172349
If the difference between the error and the coefficient of determination obtained on the training data
and those obtained on the test data were large, the model would likely be overfit to the training data
and unsuitable for generalization; here the two sets of figures are very close.
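A small sketch to put the train and test figures side by side for this overfitting check, using the fitted model and splits from above:

# Sketch: compare train and test R^2 and RMSE
print('train R^2: {:.3f}   test R^2: {:.3f}'.format(
    multi_fitter.score(x_train, y_train), multi_fitter.score(x_test, y_test)))
print('train RMSE: {:.3f}  test RMSE: {:.3f}'.format(
    np.sqrt(mean_squared_error(y_train, multi_fitter.predict(x_train))),
    np.sqrt(mean_squared_error(y_test, multi_fitter.predict(x_test)))))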
Interpretation
The model created is unsatisfactory in predicting tax using amount_php and weight as
independent variables, as shown by the low R-squared value of 0.30.