MUNAR - Linear Regression.ipynb - Colaboratory

This document provides instructions for performing linear regression analysis on various datasets using Python. It demonstrates how to use single and multiple features to predict an outcome variable using linear regression. Key steps include importing data, determining correlations, plotting the data, calculating the linear regression line, and using multiple linear regression to predict real estate prices based on size and year features.


Linear Regression
Objective(s):
This activity aims to perform regression analysis using linear regression

Intended Learning Outcomes (ILOs):

Demonstrate how to use python to predict the outcome using linear regression.
Demonstrate how to use single and multiple features to predict the outcome using linear
regression.

Resources:

Jupyter Notebook
stores_dist.csv
real_estate_price_size_year.csv
Ames_Housing_Sales.csv

Procedure:
Code Text

Import Google Drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Import the libraries and the data

#import the libraries

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

#import the stores_dist.csv data

salesDist = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/stores-dist.csv')

#check the imported data


salesDist.head()

   district  annual net sales  number of stores in district
0         1             231.0                            12
1         2             156.0                            13
2         3              10.0                            16
3         4             519.0                             2
4         5             437.0                             6

Rename the annual sales to sales and number of stores in district to stores

#rename the annual sales to sales and the  number of stores in district to stores

salesDist = salesDist.rename(columns={'annual net sales':'sales','number of stores in district':'stores'})

#check the salesDist data to verify if the columns were renamed.

salesDist.head()

   district  sales  stores
0         1  231.0      12
1         2  156.0      13
2         3   10.0      16
3         4  519.0       2
4         5  437.0       6

Determine the correlation

#check the correlation

salesDist.corr()


          district     sales    stores
district  1.000000  0.136103 -0.230617
sales     0.136103  1.000000 -0.912236
stores   -0.230617 -0.912236  1.000000

Interpret the correlation.

There is a negative correlation between sales and stores.

Drop the column with the lowest correlation and verify the dataframe if the column was deleted.

sales = salesDist.drop('district',axis=1)

sales.head()

   sales  stores
0  231.0      12
1  156.0      13
2   10.0      16
3  519.0       2
4  437.0       6

From the correlation coefficient data, what type of correlation did you observe between annual net sales and number of stores in the district?

Negative correlation
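
As a quick cross-check, the same Pearson coefficient reported by .corr() can be computed directly with NumPy. This is a minimal sketch, assuming the sales dataframe defined above.

# Cross-check the Pearson correlation between stores and annual net sales
r = np.corrcoef(sales['stores'], sales['sales'])[0, 1]
print('Pearson r between stores and sales: {:.6f}'.format(r))  # approximately -0.912236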

Create a plot to visualize the data. You will also assign stores as the independent variable x and
sales as the dependent variable y .

# dependent variable for y axis

y = sales['sales']

# independent variable for x axis
x = sales.stores

# Display the plot inline

%matplotlib inline

# Increase the size of the plot

plt.figure(figsize=(20,10))

# Create a scatter plot: Number of stores in the District vs. Annual Net Sales

plt.plot(x,y, 'o', markersize = 15)


# Add axis labels and increase the font size

plt.ylabel('Annual Net Sales', fontsize = 30)

plt.xlabel('Number of Stores in the District', fontsize = 30)

# Increase the font size on the ticks on the x and y axis

plt.xticks(fontsize = 20)

plt.yticks(fontsize = 20)

# Display the scatter plot

plt.show()

Calculate the slope and y-intercept of the linear regression line.

m, b = np.polyfit(x,y,1) 


print ('The slope of line is {:.2f}.'.format(m))

print ('The y-intercept is {:.2f}.'.format(b))

print ('The best fit simple linear regression line is {:.2f}x + {:.2f}.'.format(m,b))

The slope of line is -35.79.

The y-intercept is 599.38.

The best fit simple linear regression line is -35.79x + 599.38.

Using the linear regression line, you can predict the annual net sales based on the number of stores
in the district.

# Function to predict the net sales from the regression line

def predict(query):

    if query >= 1:

        predict = m * query + b

        return predict

    else:

        print ("You must have at least 1 store in the district to predict the annual net sale

# Enter the number of stores in the function to generate the net sales prediction.

predict(4)

456.2313681207654

predict(16)

26.786342565077348
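
Equivalently, NumPy's polyval can evaluate the fitted line without a custom function. A small sketch using the m and b computed above:

# Evaluate the fitted line y = m*x + b for several store counts at once
queries = np.array([4, 10, 16])
print(np.polyval([m, b], queries))  # the first and last values should match predict(4) and predict(16)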

Using statsmodels to perform Multivariable Linear Regression

Import the libraries. Make sure to install all the libraries to avoid errors.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import statsmodels.api as sm

import seaborn as sns

Load the data real_estate_price_size_year.csv and verify the data.

data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/real_estate_price_size_year.csv')
data.head()


        price     size  year
0  234314.144   643.09  2015
1  228581.528   656.22  2009
2  281626.336   487.29  2018
3  401255.608  1504.75  2015
4  458674.256  1275.46  2009

Show the descriptive statistic analysis

data.describe()

               price         size         year
count     100.000000   100.000000   100.000000
mean   292289.470160   853.024200  2012.600000
std     77051.727525   297.941951     4.729021
min    154282.128000   479.750000  2006.000000
25%    234280.148000   643.330000  2009.000000
50%    280590.716000   696.405000  2015.000000
75%    335723.696000  1029.322500  2018.000000
max    500681.128000  1842.510000  2018.000000

Interpret the count, mean, min and std

The average real estate price in the dataset is 292,289.47. The cheapest property, at 154,282.13, is about 52.78% of the average price. The standard deviation is 77,051.73. Dividing the standard deviation by the mean gives a coefficient of variation of about 0.26, well below 1, so the dataset is fairly concentrated around the mean.
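
The coefficient of variation mentioned above can be computed directly from the dataframe. A minimal sketch, assuming the data dataframe loaded above:

# Coefficient of variation = standard deviation / mean
cv = data['price'].std() / data['price'].mean()
print('Coefficient of variation of price: {:.2f}'.format(cv))  # roughly 0.26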

Calculate the multiple linear regression. Set the dependent variable and independent variables

# Following the regression equation, our dependent variable (y) is the price

y = data ['price']

# Similarly, our independent variables (x) are the size and year

x1 = data [['size','year']]


# Add a constant. Essentially, we are adding a new column (equal in length to x) that consists only of 1s
x = sm.add_constant(x1)

# Fit the model, according to the OLS (ordinary least squares) method, with dependent variable y and independent variable x
results = sm.OLS(y,x).fit()

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: I
x = pd.concat(x[::order], 1)

# Print a nice summary of the regression.

results.summary()

OLS Regression Results


Dep. Variable: price R-squared: 0.776
Model: OLS Adj. R-squared: 0.772
Method: Least Squares F-statistic: 168.5
Date: Fri, 18 Feb 2022 Prob (F-statistic): 2.77e-32
Time: 14:54:29 Log-Likelihood: -1191.7
No. Observations: 100 AIC: 2389.
Df Residuals: 97 BIC: 2397.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -5.772e+06 1.58e+06 -3.647 0.000 -8.91e+06 -2.63e+06
size 227.7009 12.474 18.254 0.000 202.943 252.458
year 2916.7853 785.896 3.711 0.000 1357.000 4476.571
Omnibus: 10.083 Durbin-Watson: 2.250
Prob(Omnibus): 0.006 Jarque-Bera (JB): 3.678
Skew: 0.095 Prob(JB): 0.159
Kurtosis: 2.080 Cond. No. 9.41e+05

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 9.41e+05. This might indicate that there are

strong multicollinearity or other numerical problems.

Interpret the result. Explain the adjusted R-squared, coef, and standard error.

The result of the OLS regression shows that size and year are useful predictor variables for price. Both variables are statistically significant, with p-values of 0.000. Each coefficient gives the change in the mean of the dependent variable (price) for a one-unit shift in the corresponding independent variable (size or year) while holding the other variable in the model constant. Of the two independent variables, year has the larger positive coefficient (2,916.79 versus 227.70 for size), so a one-unit change in year shifts the mean price more than a one-unit change in size. The R-squared of the regression is the fraction of the variation in the dependent variable (price) that is accounted for by the independent variables (size and year). Adjusted R-squared, a modified version of R-squared, adds precision and reliability by penalizing additional independent variables that would otherwise inflate the R-squared measurement.
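
With the fitted results object, the regression equation can also be applied to a hypothetical listing; the size of 750 and year of 2015 below are made-up inputs for illustration only. A sketch, assuming the results variable from the OLS fit above:

# Predicted price = const + 227.70 * size + 2916.79 * year (coefficients from the summary above)
params = results.params
predicted_price = params['const'] + params['size'] * 750 + params['year'] * 2015
print('Predicted price: {:,.2f}'.format(predicted_price))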

Using sklearn to perform Linear Regression

Import the data using Pandas. Check the data. Examine the data types and shape of the dataset.

import pandas as pd
import numpy as np

# Import the data using the file path

data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/Ames_Housing_Sales.csv')

print(data.shape)

(1379, 80)

data


      1stFlrSF  2ndFlrSF  3SsnPorch Alley  BedroomAbvGr BldgType BsmtCond  ...
0        856.0     854.0        0.0  None             3     1Fam       TA  ...
1       1262.0       0.0        0.0  None             3     1Fam       TA  ...
2        920.0     866.0        0.0  None             3     1Fam       TA  ...
3        961.0     756.0        0.0  None             3     1Fam       Gd  ...
4       1145.0    1053.0        0.0  None             4     1Fam       TA  ...
...        ...       ...        ...   ...           ...      ...      ...  ...
1374     953.0     694.0        0.0  None             3     1Fam     None  ...
1375    2073.0       0.0        0.0  None             3     1Fam       TA  ...
1376    1188.0    1152.0        0.0  None             4     1Fam       Gd  ...
1377    1078.0       0.0        0.0  None             2     1Fam       TA  ...
1378    1256.0       0.0        0.0  None             3     1Fam       TA  ...

1379 rows × 80 columns

data.dtypes.value_counts()

object     43
float64    21
int64      16
dtype: int64

# Select the object (string) columns

mask = data.dtypes == np.object

categorical_cols = data.columns[mask]

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html

# Determine how many extra columns would be created

num_ohc_cols = (data[categorical_cols]

                .apply(lambda x: x.nunique())

                .sort_values(ascending=False))

# No need to encode if there is only one value

small_num_ohc_cols = num_ohc_cols.loc[num_ohc_cols>1]

# Number of one-hot columns is one less than the number of categories

small_num_ohc_cols -= 1

# This is 215 columns, assuming the original ones are dropped. 

# This is quite a few extra columns!

small_num_ohc_cols.sum()

215

Create a new data set where all of the above categorical features will be one-hot encoded.

Use the dataframe .copy() method to create a completely separate copy of the dataframe
for one-hot encoding
On this new dataframe, one-hot encode each of the appropriate columns and add it back to
the dataframe. Be sure to drop the original column.
For the data that are not one-hot encoded, drop the columns that are string categoricals.


from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Copy of the data

data_ohc = data.copy()

# The encoders

le = LabelEncoder()

ohc = OneHotEncoder()

for col in num_ohc_cols.index:

    

    # Integer encode the string categories

    dat = le.fit_transform(data_ohc[col]).astype(np.int)

    

    # Remove the original column from the dataframe

    data_ohc = data_ohc.drop(col, axis=1)

    # One hot encode the data--this returns a sparse array

    new_dat = ohc.fit_transform(dat.reshape(-1,1))

    # Create unique column names

    n_cols = new_dat.shape[1]

    col_names = ['_'.join([col, str(x)]) for x in range(n_cols)]

    # Create the new dataframe

    new_df = pd.DataFrame(new_dat.toarray(), 

                          index=data_ohc.index, 

                          columns=col_names)

    # Append the new data to the dataframe

    data_ohc = pd.concat([data_ohc, new_df], axis=1)

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:13: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html
  del sys.path[0]

data_ohc


      1stFlrSF  2ndFlrSF  3SsnPorch  BedroomAbvGr  BsmtFinSF1  BsmtFinSF2  BsmtFullBath  ...
0        856.0     854.0        0.0             3       706.0         0.0             1  ...
1       1262.0       0.0        0.0             3       978.0         0.0             0  ...
2        920.0     866.0        0.0             3       486.0         0.0             1  ...
3        961.0     756.0        0.0             3       216.0         0.0             1  ...
4       1145.0    1053.0        0.0             4       655.0         0.0             1  ...
...        ...       ...        ...           ...         ...         ...           ...  ...
1374     953.0     694.0        0.0             3         0.0         0.0             0  ...
1375    2073.0       0.0        0.0             3       790.0       163.0             1  ...
1376    1188.0    1152.0        0.0             4       275.0         0.0             0  ...
1377    1078.0       0.0        0.0             2        49.0      1029.0             1  ...
1378    1256.0       0.0        0.0             3       830.0       290.0             1  ...

1379 rows × 295 columns

# Column difference is as calculated above
data_ohc.shape[1] - data.shape[1]

215

print(data.shape[1])

# Remove the string columns from the dataframe
data = data.drop(num_ohc_cols.index, axis=1)

print(data.shape[1])

80
37
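
As a cross-check on the encoding above, pandas can build an equivalent one-hot encoded frame in a single call. This is only a sketch, not part of the notebook's pipeline; it assumes a fresh copy of the raw CSV and the categorical_cols computed earlier.

# Hypothetical cross-check: re-load the raw data and one-hot encode all string columns at once
raw = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/Ames_Housing_Sales.csv')
data_ohc_alt = pd.get_dummies(raw, columns=list(categorical_cols))
print(data_ohc_alt.shape)  # should also come out to 295 columns, matching data_ohc above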

Create train and test splits of both data sets. To ensure the data gets split the same way, use the same random_state in each of the two splits.
For each data set, fit a basic linear regression model on the training data.
Calculate the mean squared error on both the train and test sets for the respective models.

from sklearn.model_selection import train_test_split

y_col = 'SalePrice'

# Split the data that is not one-hot encoded
feature_cols = [x for x in data.columns if x != y_col]
X_data = data[feature_cols]
y_data = data[y_col]

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, 
                                                    test_size=0.3, random_state=42)
# Split the data that is one-hot encoded
feature_cols = [x for x in data_ohc.columns if x != y_col]
X_data_ohc = data_ohc[feature_cols]
y_data_ohc = data_ohc[y_col]

X_train_ohc, X_test_ohc, y_train_ohc, y_test_ohc = train_test_split(X_data_ohc, y_data_ohc, 
                                                    test_size=0.3, random_state=42)

# Compare the indices to ensure they are identical
(X_train_ohc.index == X_train.index).all()

True

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

LR = LinearRegression()

# Storage for error values

error_df = list()

# Data that have not been one-hot encoded

LR = LR.fit(X_train, y_train)

y_train_pred = LR.predict(X_train)

y_test_pred = LR.predict(X_test)

error_df.append(pd.Series({'train': mean_squared_error(y_train, y_train_pred),

                           'test' : mean_squared_error(y_test,  y_test_pred)},

                           name='no enc'))

# Data that have been one-hot encoded

LR = LR.fit(X_train_ohc, y_train_ohc)

y_train_ohc_pred = LR.predict(X_train_ohc)

y_test_ohc_pred = LR.predict(X_test_ohc)

error_df.append(pd.Series({'train': mean_squared_error(y_train_ohc, y_train_ohc_pred),

                           'test' : mean_squared_error(y_test_ohc,  y_test_ohc_pred)},

                          name='one-hot enc'))

# Assemble the results

error_df = pd.concat(error_df, axis=1)

error_df

             no enc   one-hot enc
train  1.131507e+09  3.177269e+08
test   1.372182e+09  3.664592e+16

For each of the data sets (one-hot encoded and not encoded):

Scale all of the non-one-hot-encoded values using one of the following: StandardScaler, MinMaxScaler, MaxAbsScaler.
Compare the errors calculated on the test sets.


# Mute the SettingWithCopy warnings

pd.options.mode.chained_assignment = None

from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

scalers = {'standard': StandardScaler(),

           'minmax': MinMaxScaler(),

           'maxabs': MaxAbsScaler()}

training_test_sets = {

    'not_encoded': (X_train, y_train, X_test, y_test),

    'one_hot_encoded': (X_train_ohc, y_train_ohc, X_test_ohc, y_test_ohc)}

# Get the list of float columns, and the float data

# so that we don't scale something we already scaled. 

# We're supposed to scale the original data each time

mask = X_train.dtypes == np.float
float_columns = X_train.columns[mask]

# initialize model

LR = LinearRegression()

# iterate over all possible combinations and get the errors

errors = {}

for encoding_label, (_X_train, _y_train, _X_test, _y_test) in training_test_sets.items():

    for scaler_label, scaler in scalers.items():

        trainingset = _X_train.copy()  # copy because we don't want to scale this more than once
        testset = _X_test.copy()

        trainingset[float_columns] = scaler.fit_transform(trainingset[float_columns])

        testset[float_columns] = scaler.transform(testset[float_columns])

        LR.fit(trainingset, _y_train)

        predictions = LR.predict(testset)

        key = encoding_label + ' - ' + scaler_label + 'scaling'

        errors[key] = mean_squared_error(_y_test, predictions)

errors = pd.Series(errors)

print(errors.to_string())

print('-' * 80)

for key, error_val in errors.items():

    print(key, error_val)

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: DeprecationWarning: `np


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/relea
app.launch_new_instance()

not_encoded - standardscaling 1.372182e+09

not_encoded - minmaxscaling 1.372179e+09

not_encoded - maxabsscaling 1.372198e+09


one_hot_encoded - standardscaling 5.029834e+26

one_hot_encoded - minmaxscaling 8.065328e+09

one_hot_encoded - maxabsscaling 8.065328e+09

--------------------------------------------------------------------------------

not_encoded - standardscaling 1372182358.934498

not_encoded - minmaxscaling 1372179001.352261

not_encoded - maxabsscaling 1372198037.9660723

one_hot_encoded - standardscaling 5.0298339033526313e+26

one_hot_encoded - minmaxscaling 8065327607.218111

one_hot_encoded - maxabsscaling 8065327607.199247

Plot predictions vs actual for one of the models.

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

sns.set_context('talk')

sns.set_style('ticks')

sns.set_palette('dark')

ax = plt.axes()

# we are going to use y_test, y_test_pred

ax.scatter(y_test, y_test_pred, alpha=.5)

ax.set(xlabel='Ground truth', 

       ylabel='Predictions',

       title='Ames, Iowa House Price Predictions vs Truth, using Linear Regression');


[ Linear Regression ]

Import Package

# Libraries required for data analysis and preprocessing
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Libraries required for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries required for linear regression model application
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Libraries required for residualization
import scipy.stats

Import Data
Bureau of Customs Import Data for January 2022 from https://customs.gov.ph/import-reports/

org_df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/DailyfiledJan
org_df.head()

       hs_code country   weight  amount_php        tax
0  84381020000      IN    175.0    60119.33       0.00
1  85437090700      CN     75.0    34971.04     349.71
2  39021090000      TW  90000.0  6557462.10  655746.21
3  44152000000      JP  14520.0    48879.11    3421.54
4  39231090200      JP  37312.0  1097936.98  164690.55

Data Preprocessing


Remove Missing Values

To deal meaningfully with the imported data, we will check for and delete rows recorded as NaN or null. Let's check for missing values first.

org_df.isnull().sum()

hs_code       0
country       1
weight        0
amount_php    0
tax           0
dtype: int64

df = org_df.dropna()

df

            hs_code country    weight  amount_php        tax
0       84381020000      IN    175.00    60119.33       0.00
1       85437090700      CN     75.00    34971.04     349.71
2       39021090000      TW  90000.00  6557462.10  655746.21
3       44152000000      JP  14520.00    48879.11    3421.54
4       39231090200      JP  37312.00  1097936.98  164690.55
...             ...     ...       ...         ...        ...
258263  85176299000      CN      1.53    50025.08       0.00
258264  85176299000      CN     10.50   343920.74       0.00
258265  85176299000      CN      9.06   297022.83       0.00
258266  85176299000      CN      0.77    25533.74       0.00
258267  84389029000      US     12.20   151972.67       0.00

258267 rows × 5 columns

df.describe()


            hs_code        weight    amount_php           tax
count  2.582670e+05  2.582670e+05  2.582670e+05  2.582670e+05
mean   7.101635e+10  1.902956e+04  1.445547e+06  2.588446e+04
std    2.298279e+10  7.856476e+05  2.190546e+07  4.577029e+05
min    1.012100e+09  0.000000e+00  0.000000e+00  0.000000e+00
25%    5.807909e+10  1.840000e+00  6.706795e+03  0.000000e+00
50%    8.471302e+10  1.977000e+01  4.110362e+04  1.501700e+02
75%    8.537102e+10  3.298050e+02  2.744186e+05  3.703160e+03
max    9.704000e+10  2.024130e+08  3.445777e+09  1.059243e+08

Calculate the tax rate as follows:

tax_rate = tax / price

From the current data, price is the import amount per item and tax is the tax per item, so we can calculate the tax rate in this way.

df = df.assign(tax_rate = df['tax'] / (df['amount_php']))

df.describe()


            hs_code        weight    amount_php           tax       tax_rate
count  2.582670e+05  2.582670e+05  2.582670e+05  2.582670e+05  258260.000000
mean   7.101635e+10  1.902956e+04  1.445547e+06  2.588446e+04       0.052744
std    2.298279e+10  7.856476e+05  2.190546e+07  4.577029e+05       0.063685
min    1.012100e+09  0.000000e+00  0.000000e+00  0.000000e+00       0.000000
25%    5.807909e+10  1.840000e+00  6.706795e+03  0.000000e+00       0.000000
50%    8.471302e+10  1.977000e+01  4.110362e+04  1.501700e+02       0.030000
75%    8.537102e+10  3.298050e+02  2.744186e+05  3.703160e+03       0.100000
max    9.704000e+10  2.024130e+08  3.445777e+09  1.059243e+08       0.650000

The tax_rate variable is added. Let's draw a rugplot to look at the distribution of tax rates.

A rugplot is a graph that draws a small line segment (a "rug") for every observation to show the distribution of the data.

sns.rugplot(df['tax_rate'])

<matplotlib.axes._subplots.AxesSubplot at 0x7eff217fa590>
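
Note that the count for tax_rate (258,260) is slightly lower than for the other columns (258,267): rows whose amount_php is 0 produce 0/0 and therefore a NaN tax rate. A quick check, as a small sketch assuming the df dataframe above:

# Rows where the import amount is zero cannot yield a meaningful tax rate
print((df['amount_php'] == 0).sum())   # zero-amount rows
print(df['tax_rate'].isnull().sum())   # NaN tax rates produced by 0/0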

Log Transformation

A. Explore Correlations
Correlation is a measure of the degree of linear relationship of variables.

Because regression techniques represent the relationship in which the independent variable affects
the dependent variable, the weight for that variable in the regression formula may vary depending
on the direction and strength of the correlation.

numeric_df = df[['weight', 'amount_php', 'tax']]

numeric_df.head()


    weight  amount_php        tax
0    175.0    60119.33       0.00
1     75.0    34971.04     349.71
2  90000.0  6557462.10  655746.21
3  14520.0    48879.11    3421.54
4  37312.0  1097936.98  164690.55

The correlation of each feature can be quantified through a heatmap and seen visually at the same time. The bar shown on the right side of the heatmap indicates what color corresponds to each correlation strength.

corr = numeric_df.corr()
sns.heatmap(corr, annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7eff18ebd790>

The formula used to calculate the correlation here is Pearson Correlation coefficient, a
quantification of the linear correlation between two variables X and Y. 1 means perfect positive
linear correlation, 0 means no linear correlation, and -1 means perfect negative correlation. The
graph above shows that the variable that shows the strongest linear correlation with tax is price. It's
a natural result because taxes are levied at a fraction of the price of the product.
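
The same coefficient can be obtained for a single pair of variables with scipy.stats, which was imported above. A minimal sketch, assuming numeric_df:

# Pearson correlation between price (amount_php) and tax, with its p-value
r, p_value = scipy.stats.pearsonr(numeric_df['amount_php'], numeric_df['tax'])
print('Pearson r = {:.3f}, p-value = {:.3g}'.format(r, p_value))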

However, it is dangerous to judge the correlation only from figures like this. As mentioned, all the Pearson correlation coefficient shows is a linear correlation. That is, even if there is a strong nonlinear correlation, the Pearson correlation coefficient can come close to zero. Therefore, it is recommended that you always draw a pair plot as well to check the correlation. The pairplot provided by the seaborn library visualizes the pairwise relationship of each variable.

sns.pairplot(numeric_df, markers='x', diag_kind="kde")


<seaborn.axisgrid.PairGrid at 0x7eff18ebd190>

The graphs on the diagonal of the pairplot represent the distribution of each variable, and the rest are scatterplots of pairs of variables. As previously identified by the correlation coefficient, price and tax show the strongest linear correlation. However, we can see that not all data points lie on the upward-sloping line. If you look at the third graph from the left in the last row, where the x-axis is price and the y-axis is tax, you can see that some taxes remain low, possibly underreported, even as prices rise.

B. Log Transformation

If you look at the pairplot, you can see that the data is concentrated at small values, with a long right tail, rather than evenly distributed. Let's visualize the distribution through histograms.

numeric_df.hist(figsize=(15,10))


array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7eff18a2c510>,

<matplotlib.axes._subplots.AxesSubplot object at 0x7eff18a26910>],

[<matplotlib.axes._subplots.AxesSubplot object at 0x7eff189d5d10>,

<matplotlib.axes._subplots.AxesSubplot object at 0x7eff18996350>]],

dtype=object)

Log transformation controls the spread of the distribution by reducing the deviation of the data. We're going to apply the log() function of the NumPy library, but since log of values between 0 and 1 is negative (and log(0) is undefined), let's add 1 to the whole data and then apply the function.
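
For reference, NumPy also has a dedicated helper for this log(x + 1) transform; a one-line equivalent of the cell below, as a sketch:

# np.log1p(x) computes log(1 + x) and is numerically more accurate for small x
numeric_log_df = np.log1p(numeric_df)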

numeric_log_df = np.log(numeric_df + 1)

numeric_log_df.hist(figsize=(15,10))


array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7eff188538d0>,

<matplotlib.axes._subplots.AxesSubplot object at 0x7eff188b38d0>],

[<matplotlib.axes._subplots.AxesSubplot object at 0x7eff1881c350>,

<matplotlib.axes._subplots.AxesSubplot object at 0x7eff187d2850>]],

dtype=object)

When the histogram is redrawn after the log transformation, we can see that the data is much more evenly distributed. Let's look at the basic statistics and pairplot graphs again.

numeric_log_df.describe()

              weight     amount_php            tax
count  258267.000000  258267.000000  258267.000000
mean        3.706474      10.677909       4.585706
std         3.058671       2.726455       4.250274
min         0.000000       0.000000       0.000000
25%         1.043804       8.811026       0.000000
50%         3.033510      10.623876       5.018405
75%         5.801529      12.522413       8.217212
max        19.125821      21.960415      18.478235

sns.pairplot(numeric_log_df, markers='x', diag_kind="kde")


<seaborn.axisgrid.PairGrid at 0x7eff1869dc50>

After the log transformation, we can see that the relationship between each pair of variables is visualized more clearly.

Let's look at the correlation coefficients again.

corr = numeric_log_df.corr()

sns.heatmap(corr, annot=True)


<matplotlib.axes._subplots.AxesSubplot at 0x7eff183b9bd0>

Feature Scaling - MinMax Scaling


Here, we will use the MinMax scaler, which uses the minimum and maximum values of each feature to rescale the data values to a range between 0 and 1. The MinMax scaler subtracts the minimum value of the feature and divides by the difference between the maximum and the minimum value.
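
The same rescaling can be written out by hand, which makes the formula explicit. A minimal sketch, assuming numeric_log_df from above; the MinMaxScaler used below does the equivalent column by column:

# Manual min-max scaling: (x - min) / (max - min), applied column by column
manual_scaled = (numeric_log_df - numeric_log_df.min()) / (numeric_log_df.max() - numeric_log_df.min())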

# create object

scaler = MinMaxScaler()

# Fit -> transform

scaler.fit(numeric_log_df)

scaled = scaler.transform(numeric_log_df)

# Convert to a data frame because it is returned in an array form

numeric_log_df = pd.DataFrame(data = scaled, columns=numeric_log_df.columns)

numeric_log_df

          weight  amount_php       tax
0       0.270341    0.501088  0.000000
1       0.226434    0.476416  0.317128
2       0.596449    0.714746  0.724827
3       0.501069    0.491663  0.440418
4       0.550413    0.633364  0.650053
...          ...         ...       ...
258262  0.048532    0.492718  0.000000
258263  0.127699    0.580507  0.000000
258264  0.120704    0.573831  0.000000
258265  0.029854    0.462095  0.000000
258266  0.134908    0.543317  0.000000

258267 rows × 3 columns

Applying Algorithms


Split Training-Test Datasets


We will use 80% of the total data as training data and the remaining 20% as test data.

X = numeric_log_df[['weight', 'amount_php']]

Y = numeric_log_df['tax']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2)

Multi Linear Regression


Now, let's model linear regression by utilizing multiple independent variables.

Defining Variables

x_train = X_train[['weight', 'amount_php']]

y_train = Y_train

The independent variables x are weight and amount_php (price), and the dependent variable y is tax.

Creating Model and Modeling


Let's create a model.

multi_fitter = LinearRegression()
multi_fitter.fit(x_train, y_train) 

LinearRegression()

Through the fit() function, multi_fitter utilized the training data to model the linear relationship
between the independent and dependent variables.

Checking the Regression Coefficient

multi_fitter.coef_

array([ 0.99175674, -0.47404556])

Visualizing Actual and Predicted Values

y_predict = multi_fitter.predict(x_train)

plt.scatter(y_train, y_predict, alpha=0.4)


plt.xlabel("Actual Tax")

plt.ylabel("Predicted Tax")

Text(0, 0.5, 'Predicted Tax')

Checking R²; Coefficient of Determination

Let us check the coefficient of determination.

multi_fitter.score(x_train, y_train)

0.3003469124609752

The multi-regression model has an explanatory power of about 30%.
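
The same value can be reproduced from the definition R² = 1 - SS_res / SS_tot. A small sketch, assuming the fitted multi_fitter and the training data above:

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
y_hat = multi_fitter.predict(x_train)
ss_res = np.sum((y_train - y_hat) ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
print(1 - ss_res / ss_tot)  # should match multi_fitter.score(x_train, y_train)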

MSE, RMSE
We will measure the error of the model through MSE and RMSE.

mse = mean_squared_error(y_train, multi_fitter.predict(x_train))

rmse = np.sqrt(mse)

mse, rmse

(0.03697998323651459, 0.19230180247858986)

The RMSE value, which converts the indicator of the error into units similar to the actual value,
shows that the tax predicted by the model has an error of approximately 0.192.


K-Fold

scores = cross_val_score(multi_fitter, x_train, y_train, cv=4)

scores

array([0.29782897, 0.29998959, 0.30076604, 0.30274157])

Checking the Result of K-Fold


Let's check the average value of the score.

scores.mean()

0.3003315436263042

Cross-validation results also show approximately 30% of explanatory power.

Applying Test Data


We will evaluate the final model by utilizing the test data we separated before modeling.

x_test = X_test[[ 'weight', 'amount_php']]

y_test = Y_test

A. Visualizing Actual and Predicted Values

Let's visualize the actual tax value of the test data and the tax value that the model predicted from
the test data.

The more accurately the model predicts, the closer the distribution of the points will be to a straight
line with a slope of 1.
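
To make this easier to judge, a dashed y = x reference line can be drawn on the same axes; a small sketch to run together with the scatter code below, assuming the scaled values lie roughly in the range 0 to 1:

# Dashed y = x reference line: points on it are predicted exactly right
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')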

y_predict = multi_fitter.predict(x_test)

plt.scatter(y_test, y_predict, alpha=0.4)

plt.xlabel("Actual Tax")

plt.ylabel("Predicted Tax")


Text(0, 0.5, 'Predicted Tax')

B. MSE, RMSE
Let's estimate the error through MSE and RMSE.

mse = mean_squared_error(y_test, multi_fitter.predict(x_test))

rmse = np.sqrt(mse)

mse, rmse

(0.03688809427338566, 0.19206273525435813)

It has an RMSE of about 0.192, similar to the error on the training data.

C. R²; Coefficient of Determination

multi_fitter.score(x_test, y_test)

0.3055014734172349

We obtained coefficients of determination similar to those in the training data.

If there is a large difference in error and coefficient of determination between validation on the training data and evaluation on the test data, the model may be overfit to the training data and unsuitable for generalization.

Interpretation
The model created is unsatisfactory at predicting tax using amount_php and weight as independent variables, as shown by the low R-squared value of 0.30.

