List of Imported Libraries

The document provides a comprehensive overview of Python libraries and code used for data manipulation, visualization, and machine learning model training. It details the purpose of each library, the code used to import them, and the general workflow from data preprocessing to model evaluation. Additionally, it includes specific code snippets for tasks such as data cleaning, feature scaling, hyperparameter tuning, and model performance evaluation.

List of Imported Libraries

Library / Module | Code Used | Purpose
pandas | import pandas as pd | For handling and manipulating tabular data
numpy | import numpy as np | For numerical operations and arrays
matplotlib | import matplotlib.pyplot as plt | For creating plots and graphs
seaborn | import seaborn as sns; sns.set() | For enhanced, clean-looking visualizations
warnings | import warnings | To control or hide warning messages
time | import time | To measure training and prediction durations
scikit-learn (model selection) | from sklearn.model_selection import train_test_split, GridSearchCV | For splitting data and tuning model parameters
scikit-learn (preprocessing) | from sklearn.preprocessing import MinMaxScaler | For scaling feature values (normalization)
scikit-learn (models) | from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor | Two machine learning models for regression
scikit-learn (metrics) | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score | For evaluating model performance
xgboost | from xgboost import XGBRegressor | Another powerful model for regression
pickle | import pickle | For saving and loading model objects (like .pkl files)

1.CODE:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
import time
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split, GridSearchCV


from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import pickle

General Purpose of the Imports


• To load the necessary tools and libraries for handling data, visualizing it, building machine learning
models, evaluating them, and saving the results.
• These imports support the full workflow: data preprocessing → visualization → model training →
evaluation → saving the best model

Code | Purpose / Explanation
import pandas as pd | Used to read and work with data tables (like spreadsheets).
import numpy as np | Helps with numerical operations and arrays.
import matplotlib.pyplot as plt | Used to create basic graphs and visualizations.
import seaborn as sns; sns.set() | Makes prettier and more detailed plots.
import warnings | Used to manage warnings in the code.
import time | Lets you track or pause time while running the code.
warnings.filterwarnings("ignore") | Hides warning messages so they don't clutter the output.
from sklearn.model_selection import train_test_split | Splits the data into training and testing sets.
from sklearn.model_selection import GridSearchCV | Searches for the best model settings automatically.
from sklearn.preprocessing import MinMaxScaler | Normalizes data between 0 and 1 to help models perform better.
from sklearn.ensemble import RandomForestRegressor | A tree-based machine learning model used for prediction.
from sklearn.ensemble import GradientBoostingRegressor | Another tree-based model that builds predictions in stages.
from sklearn.metrics import mean_squared_error | Calculates the average of squared prediction errors.
from sklearn.metrics import mean_absolute_error | Calculates the average of absolute prediction errors.
from sklearn.metrics import r2_score | Shows how well the model's predictions match actual values (1.0 is perfect).
from xgboost import XGBRegressor | A high-performance, advanced regression model used in many competitions and real projects.
import pickle | Used to save and load models so you don't need to retrain them every time.

2.CODE:

from google.colab import files


uploaded = files.upload()

General Purpose
• Allows the user to upload files (like a CSV) from their local computer into the Google Colab
environment
• This is typically used to bring in the dataset that will be used for analysis or model training

Code | Purpose / Explanation
from google.colab import files | Imports the files module so you can upload files directly into Google Colab.
uploaded = files.upload() | Opens a file upload dialog in Colab so you can select and upload a file from your device.
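
The walkthrough uploads a file and then works with a DataFrame named data, but the step that creates data is not shown. The following is a minimal sketch of one way to bridge that gap, assuming the upload is a CSV file. files.upload() returns a dictionary that maps each file name to its raw bytes, which pandas can read through an in-memory buffer. The helper name load_first_csv, the file name concrete.csv, and the sample values are hypothetical, and the dictionary is a stand-in so the example runs outside Colab.

```python
import io

import pandas as pd


def load_first_csv(uploaded):
    """Build a DataFrame from the first file in a Colab-style upload dict."""
    # The dict maps file name -> file content as bytes.
    filename, content = next(iter(uploaded.items()))
    return pd.read_csv(io.BytesIO(content))


# Stand-in for `uploaded = files.upload()`; file name and values are made up.
uploaded = {
    "concrete.csv": b"Cement,Water,Strength\n540.0,162.0,79.99\n332.5,228.0,40.27\n"
}
data = load_first_csv(uploaded)
print(data.shape)
```

In Colab itself, the stand-in dictionary would be replaced by the real return value of files.upload().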

3.CODE:
# Rename columns
data.columns = [ "cement", "slag", "fly_ash", "water", "superplasticizer",
"coarse_agg", "fine_agg", "age", "csMPa"]

General Purpose
• Renames the columns of the uploaded dataset to clear, consistent, and readable names
• Helps make the data easier to understand and work with during analysis and modeling

Code | Purpose / Explanation
data.columns = [...] | Renames the columns of the dataset to clear, readable names like "cement", "water", etc.

4.CODE:
# Replace infinite values with NaN
data.replace([np.inf, -np.inf], np.nan, inplace=True)

# Pair Plot
sns.pairplot(data)
plt.suptitle("Pairplot of All Features", y=1.02)
plt.show()

# Histograms of each variable
for col in data.columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(data[col], kde=True, color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.tight_layout()
    plt.show()

# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

General Purpose

This section cleans the data, visualizes it, and helps understand relationships between features. It:

• Replaces problematic values (like infinity) with NaN
• Displays pairwise relationships (with a pairplot)
• Shows how each feature is distributed (using histograms)
• Reveals correlations between variables (with a heatmap)

Code | Purpose / Explanation
data.replace([np.inf, -np.inf], np.nan, inplace=True) | Replaces infinite values in the dataset with NaN (missing values) to prevent errors during analysis or modeling.
sns.pairplot(data) | Creates scatterplots for every pair of variables to help visualize relationships and patterns.
plt.suptitle("Pairplot of All Features", y=1.02) | Adds a main title above the pairplot.
plt.show() | Displays the current plot (pairplot in this case).
for col in data.columns: | Starts a loop to go through each column in the dataset one by one.
plt.figure(figsize=(6, 4)) | Sets the size of the figure for each histogram.
sns.histplot(data[col], kde=True, color='skyblue') | Plots the histogram and smooth curve (KDE) for each variable to show its distribution.
plt.title(f'Distribution of {col}') | Sets a title showing the name of the variable.
plt.tight_layout() | Automatically adjusts layout to avoid overlapping text.
plt.show() | Displays the histogram.
plt.figure(figsize=(10, 8)) | Sets the size of the correlation heatmap.
sns.heatmap(data.corr(), annot=True, cmap='coolwarm', fmt=".2f") | Creates a heatmap showing how strongly each variable is related to the others (correlation matrix).
plt.title("Correlation Heatmap") | Adds a title to the heatmap.
plt.tight_layout() | Adjusts spacing in the heatmap.
plt.show() | Displays the heatmap.
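
Replacing infinities with NaN leaves missing values in the table, and the walkthrough does not show how they are handled afterwards. As a hedged sketch, one option is to count the NaN values per column and drop incomplete rows before modeling. The toy values below are made up, and dropping rows is an assumption rather than something the original code does.

```python
import numpy as np
import pandas as pd

# Toy frame with one infinite value; column names follow the renamed dataset.
data = pd.DataFrame({
    "cement": [540.0, np.inf, 332.5],
    "water": [162.0, 162.0, 228.0],
    "csMPa": [79.99, 61.89, 40.27],
})

# Same replacement as in the walkthrough, written without inplace=True.
data = data.replace([np.inf, -np.inf], np.nan)

# Count missing values per column to see what the replacement produced.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Assumption: dropping incomplete rows is acceptable for this dataset.
clean = data.dropna().reset_index(drop=True)
print(len(clean))
```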

5.CODE:
# Features and target
X = data.drop("csMPa", axis=1)
y = data["csMPa"]

General Purpose

• Separates the data into input features and target output
• X will be used to train the model
• y is the actual value the model should predict (compressive strength)

Code | Purpose / Explanation
X = data.drop("csMPa", axis=1) | Takes all columns except csMPa to use as inputs (features).
y = data["csMPa"] | Sets the csMPa column (concrete strength) as the value to predict (target).

6.CODE:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

General Purpose
• Splits the data into training and testing sets
• 75% of the data is used to train the model (X_train, y_train)
• 25% is used to test how well the model performs (X_test, y_test)
• random_state=42 ensures the split is the same every time for consistency

Code | Purpose / Explanation
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42) | Splits the features and target into training and testing sets, using 75% for training and 25% for testing.
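
To make the 75/25 split and the effect of random_state concrete, here is a small sketch on synthetic data. The array sizes and random values are assumptions chosen for illustration, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with 8 features, like the 8 input columns.
rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = rng.random(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)
print(X_train.shape, X_test.shape)

# Calling the function again with the same random_state gives the same split.
X_train_again, _, _, _ = train_test_split(X, y, train_size=0.75, random_state=42)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)
```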

7.CODE:
# Scale
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
pickle.dump(scaler, open("scaler.pkl", "wb"))
General Purpose
• Scales the feature values to a range between 0 and 1 using Min-Max scaling
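• Puts all features on a comparable scale; the tree-based models used here are largely insensitive to feature scaling, so this step matters most if a scale-sensitive model is added later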
• Saves the scaler to a file so it can be reused later (especially for real-world predictions)

Code | Purpose / Explanation
scaler = MinMaxScaler() | Creates a scaler that transforms data to a 0–1 range.
X_train = scaler.fit_transform(X_train) | Fits the scaler to the training data and scales it.
X_test = scaler.transform(X_test) | Applies the same scaling to the test data (without refitting).
pickle.dump(scaler, open("scaler.pkl", "wb")) | Saves the scaler to a .pkl file so it can be loaded later.
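
The saved scaler is only useful if it is loaded and applied to new inputs later. The sketch below shows one possible round trip, using with blocks so the files are closed explicitly. The toy values and the temporary directory are assumptions for illustration. It also shows that a value outside the training range is mapped outside 0–1, because the scaler keeps the minimum and maximum it learned during fitting.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data: column ranges are 100..500 and 1..5.
X_train = np.array([[100.0, 1.0], [300.0, 3.0], [500.0, 5.0]])
# New sample whose second value lies above the training maximum.
X_new = np.array([[400.0, 6.0]])

scaler = MinMaxScaler().fit(X_train)

# Save the fitted scaler; the context manager closes the file handle.
path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")
with open(path, "wb") as f:
    pickle.dump(scaler, f)

# Load it back and reuse the min/max learned from the training data.
with open(path, "rb") as f:
    loaded_scaler = pickle.load(f)

scaled_new = loaded_scaler.transform(X_new)
print(scaled_new)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.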

8.CODE:
# Grid Search: Random Forest
rf_param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    # 'auto' was removed in scikit-learn 1.3; for regressors it meant "all features", i.e. 1.0
    'max_features': [1.0, 'sqrt']
}
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42), rf_param_grid,
                       cv=5, scoring='r2', n_jobs=-1, verbose=1)
rf_grid.fit(X_train, y_train)

# Grid Search: Gradient Boosting
gb_param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'subsample': [0.8, 1.0]
}
gb_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    gb_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)
gb_grid.fit(X_train, y_train)

General Purpose
• Performs hyperparameter tuning for two machine learning models:
   ◦ Random Forest Regressor
   ◦ Gradient Boosting Regressor
• Uses GridSearchCV to test all combinations of parameters and find the best one
• Evaluates model performance using 5-fold cross-validation with R² score as the metric
• Refits each model on the full training set with its best-found parameters (GridSearchCV does this by default, refit=True)

Code | Purpose / Explanation
rf_param_grid = { ... } | Defines the possible values for each Random Forest parameter to try during grid search.
GridSearchCV(RandomForestRegressor(...), rf_param_grid, ...) | Sets up grid search for the Random Forest model with 5-fold cross-validation and R² scoring.
rf_grid.fit(X_train, y_train) | Runs the grid search and trains the Random Forest model with the best combination of parameters.
gb_param_grid = { ... } | Defines the possible values for each Gradient Boosting parameter.
GridSearchCV(GradientBoostingRegressor(...), gb_param_grid, ...) | Sets up grid search for the Gradient Boosting model with 5-fold cross-validation and R² scoring.
gb_grid.fit(X_train, y_train) | Runs the grid search and trains the Gradient Boosting model with the best parameters.
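
Because grid search tries every combination, its cost grows multiplicatively with each parameter list. The sketch below counts the candidates with scikit-learn's ParameterGrid and multiplies by the 5 cross-validation folds. It only counts combinations and trains nothing. The max_features list uses 1.0, which is the equivalent of 'auto' for regressors in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

rf_param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': [1.0, 'sqrt'],
}
gb_param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'subsample': [0.8, 1.0],
}

cv_folds = 5

# Number of parameter combinations in each grid.
rf_candidates = len(ParameterGrid(rf_param_grid))  # 3 * 3 * 2 * 2 * 2
gb_candidates = len(ParameterGrid(gb_param_grid))  # 3 * 3 * 3 * 2 * 2 * 2

# Each candidate is fitted once per fold during the search.
rf_fits = rf_candidates * cv_folds
gb_fits = gb_candidates * cv_folds
print(rf_candidates, rf_fits)
print(gb_candidates, gb_fits)
```

With these grids the search performs 360 Random Forest fits and 1080 Gradient Boosting fits, plus one final refit per model, which explains why this step dominates the runtime.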

9.CODE:
# Models dictionary
models = {
    "Random Forest": rf_grid.best_estimator_,
    "Gradient Boosting": gb_grid.best_estimator_,
    "XGBoost": XGBRegressor(random_state=42)
}

General Purpose
• Creates a dictionary to store the machine learning models
• Includes the best-tuned versions of Random Forest and Gradient Boosting (from GridSearch)
• Adds XGBoost with default settings to also compare its performance later

Code | Purpose / Explanation
models = { ... } | Defines a dictionary that maps model names to their corresponding model objects.
"Random Forest": rf_grid.best_estimator_ | Uses the best Random Forest model found by grid search.
"Gradient Boosting": gb_grid.best_estimator_ | Uses the best Gradient Boosting model found by grid search.
"XGBoost": XGBRegressor(random_state=42) | Adds an XGBoost model with a fixed random seed for consistent results.

10.CODE:
# --- MODEL EVALUATION ---
results = []

for name, model in models.items():
    start_fit = time.time()
    model.fit(X_train, y_train)
    end_fit = time.time()

    start_score = time.time()
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    end_score = time.time()

    fit_time = end_fit - start_fit
    score_time = end_score - start_score

    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)

    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    results.append({
        "Model": name,
        "Fit Time (s)": round(fit_time, 4),
        "Score Time (s)": round(score_time, 4),
        "Train MAE": round(train_mae, 4),
        "Test MAE": round(test_mae, 4),
        "Train MSE": round(train_mse, 4),
        "Test MSE": round(test_mse, 4),
        "Train R² Score": round(train_r2, 4),
        "Test R² Score": round(test_r2, 4)
    })

General Purpose

• Evaluates how well each model performs
• Measures how long each model takes to train and predict
• Calculates key metrics: MAE, MSE, and R² for both training and testing data
• Saves the results for comparison

Code | Purpose / Explanation
results = [] | Initializes an empty list to store evaluation results for each model.
for name, model in models.items(): | Loops through each model in the models dictionary.
start_fit = time.time() | Starts a timer before training the model.
model.fit(X_train, y_train) | Trains the model on the training data.
end_fit = time.time() | Ends the timer after training.
start_score = time.time() | Starts a timer before making predictions.
y_train_pred = model.predict(X_train) | Predicts outcomes for the training data.
y_test_pred = model.predict(X_test) | Predicts outcomes for the testing data.
end_score = time.time() | Ends the timer after predictions.
fit_time = end_fit - start_fit | Calculates total training time.
score_time = end_score - start_score | Calculates total prediction time.
train_mae = mean_absolute_error(...) | Measures mean absolute error on training data.
test_mae = mean_absolute_error(...) | Measures mean absolute error on testing data.
train_mse = mean_squared_error(...) | Measures mean squared error on training data.
test_mse = mean_squared_error(...) | Measures mean squared error on testing data.
train_r2 = r2_score(...) | Measures how well predictions match actual training values.
test_r2 = r2_score(...) | Measures how well predictions match actual testing values.
results.append({...}) | Saves all evaluation metrics and times in the results list.
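
To show what the three metrics measure, the sketch below computes them by hand with NumPy on a tiny made-up example and compares the results with scikit-learn. The numbers are illustrative only and are not taken from the concrete dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred

# MAE: average size of the errors, (1 + 0 + 2) / 3.
mae = np.mean(np.abs(errors))

# MSE: average squared error, (1 + 0 + 4) / 3; large errors weigh more.
mse = np.mean(errors ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares), here 1 - 5/8.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

MSE is in squared units of the target (MPa² here), so taking its square root gives an error in the same unit as the target if that is easier to interpret.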

11.CODE:
# Results DataFrame
results_df = pd.DataFrame(results).sort_values(by="Test R² Score", ascending=False)
print("\n\U0001F4CA Model Performance Comparison:\n")
print(results_df.to_string(index=False))

General Purpose
• Converts the model evaluation results into a clean table
• Sorts the models by their Test R² Score from highest to lowest
• Prints a final summary comparing model performance side by side

Code | Purpose / Explanation
results_df = pd.DataFrame(results) | Converts the list of result dictionaries into a structured DataFrame.
.sort_values(by="Test R² Score", ascending=False) | Sorts the table so the model with the best test performance appears first.
print("\n\U0001F4CA Model Performance Comparison:\n") | Prints a heading for the performance table, prefixed with a bar chart emoji (U+1F4CA).
print(results_df.to_string(index=False)) | Displays the full performance table without row numbers.

12.CODE:
# Save Best Model
best_model_name = results_df.iloc[0]["Model"]
best_model_instance = models[best_model_name]
pickle.dump(best_model_instance, open("best_model.pkl", "wb"))

General Purpose
• Identifies the best-performing model based on the highest Test R² Score
• Retrieves the actual model instance from the dictionary
• Saves this model to a file (best_model.pkl) so it can be used later without retraining

Code | Purpose / Explanation
best_model_name = results_df.iloc[0]["Model"] | Gets the name of the model with the highest Test R² Score.
best_model_instance = models[best_model_name] | Retrieves the corresponding model object from the models dictionary.
pickle.dump(best_model_instance, open("best_model.pkl", "wb")) | Saves the best model to a .pkl file for future use.
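
Using the saved model later means loading both the model and the scaler, then scaling new inputs before predicting. The sketch below illustrates that round trip under stated assumptions: the data is synthetic, the small Random Forest stands in for whichever model scored best, and the files are written to a temporary directory instead of the working directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 8 concrete features and the strength target.
rng = np.random.default_rng(0)
X = rng.random((60, 8)) * 100
y = 0.5 * X[:, 0] + 0.1 * X[:, 7]

# Training session: fit the scaler and a stand-in "best" model, then save both.
scaler = MinMaxScaler().fit(X)
model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(scaler.transform(X), y)

workdir = tempfile.mkdtemp()
for filename, obj in [("scaler.pkl", scaler), ("best_model.pkl", model)]:
    with open(os.path.join(workdir, filename), "wb") as f:
        pickle.dump(obj, f)

# Later session: load both objects and predict for one new mix.
with open(os.path.join(workdir, "scaler.pkl"), "rb") as f:
    loaded_scaler = pickle.load(f)
with open(os.path.join(workdir, "best_model.pkl"), "rb") as f:
    loaded_model = pickle.load(f)

new_mix = X[:1]  # one row with the same 8 columns, in the same order
prediction = loaded_model.predict(loaded_scaler.transform(new_mix))
print(prediction)
```

The new input must have the same columns in the same order as the training data, and it must pass through the same fitted scaler, otherwise the predictions are not meaningful.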

13.CODE:
# Feature Importance Plot
if hasattr(best_model_instance, 'feature_importances_'):
    importances = best_model_instance.feature_importances_
    sorted_indices = np.argsort(importances)[::-1]
    plt.figure(figsize=(10, 6))
    plt.barh(range(len(importances)), importances[sorted_indices], align='center')
    plt.yticks(range(len(importances)), X.columns[sorted_indices])
    plt.title(f'{best_model_name} Feature Importance')
    plt.gca().invert_yaxis()
    plt.show()

General Purpose
• Checks if the best model supports feature importance
• If it does, it shows which features had the biggest impact on the model’s predictions
• Helps understand which inputs (like cement or age) are most influential in predicting concrete
strength

Code | Purpose / Explanation
if hasattr(best_model_instance, 'feature_importances_'): | Checks if the best model has the feature_importances_ attribute (not all models do).
importances = best_model_instance.feature_importances_ | Gets the importance values of each feature from the model.
sorted_indices = np.argsort(importances)[::-1] | Sorts the features from most to least important.
plt.figure(figsize=(10, 6)) | Sets the size of the feature importance plot.
plt.barh(...) | Draws a horizontal bar chart showing the importance of each feature.
plt.yticks(...) | Labels each bar with the corresponding feature name.
plt.title(...) | Adds a title to the chart indicating which model it came from.
plt.gca().invert_yaxis() | Reverses the order of the y-axis so the most important feature is on top.
plt.show() | Displays the plot.
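
The sorting step can also be checked without a plot. The sketch below, built on synthetic data where the target depends almost entirely on the first column, ranks features by importance with the same np.argsort(...)[::-1] idiom. The three feature names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = np.array(["cement", "water", "age"])

# Synthetic data: the target is driven by the first column plus a little noise.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 10 * X[:, 0] + 0.1 * rng.random(200)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Importances are normalized, so they sum to 1 across features.
importances = model.feature_importances_

# argsort is ascending; reversing it orders features from most to least important.
order = np.argsort(importances)[::-1]
ranked = list(zip(feature_names[order], importances[order]))
for feature, value in ranked:
    print(f"{feature}: {value:.3f}")
```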

14.CODE:
print("RF best params:", rf_grid.best_params_)
print("GB best params:", gb_grid.best_params_)

General Purpose

• Displays the best hyperparameter settings found by Grid Search for both models
• Helps you see which combination worked best for Random Forest and Gradient Boosting

Code | Purpose / Explanation
print("RF best params:", rf_grid.best_params_) | Prints the best parameter values chosen for the Random Forest model.
print("GB best params:", gb_grid.best_params_) | Prints the best parameter values chosen for the Gradient Boosting model.
