1.CODE:
# List of Imported Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
import time
# Additional imports used by the code in later sections
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
warnings.filterwarnings("ignore")
2.CODE:
General Purpose
• Allows the user to upload files (such as a CSV) from their local computer into the Google Colab environment
• This is typically used to bring in the dataset that will be used for analysis or model training; the upload code itself is not shown here, so a minimal sketch follows
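A minimal sketch of this step, assuming the notebook runs in Google Colab and the dataset file is named concrete.csv (a hypothetical filename):
from google.colab import files
import io

# Prompt the user to choose a file from their local machine
uploaded = files.upload()

# Read the uploaded CSV into a pandas DataFrame
# ("concrete.csv" is a placeholder for whatever file was uploaded)
data = pd.read_csv(io.BytesIO(uploaded["concrete.csv"]))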
3.CODE:
# Rename columns
data.columns = [ "cement", "slag", "fly_ash", "water", "superplasticizer",
"coarse_agg", "fine_agg", "age", "csMPa"]
General Purpose
• Renames the columns of the uploaded dataset to clear, consistent, and readable names
• Helps make the data easier to understand and work with during analysis and modeling
Code | Purpose / Explanation
data.columns = [...] | Renames the columns of the dataset to clear, readable names like "cement", "water", etc.
4.CODE:
# Replace infinite values with NaN
data.replace([np.inf, -np.inf], np.nan, inplace=True)
# Pair Plot
sns.pairplot(data)
plt.suptitle("Pairplot of All Features", y=1.02)
plt.show()
# Histograms of each variable
for col in data.columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(data[col], kde=True, color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.tight_layout()
    plt.show()
# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()
General Purpose
This section cleans the data and visualizes it to help understand relationships between features. It:
• Replaces infinite values (np.inf, -np.inf) with NaN so they do not break later computations
• Draws a pair plot showing pairwise relationships between all features
• Plots a histogram (with a KDE curve) for each variable to show its distribution
• Draws a correlation heatmap showing how strongly each pair of features is related
5.CODE:
# Features and target
X = data.drop("csMPa", axis=1)
y = data["csMPa"]
General Purpose
• Separates the dataset into input features (X) and the prediction target (y)
• X contains every column except csMPa; y is the concrete compressive strength in MPa
6.CODE:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)
General Purpose
• Splits the data into training and testing sets
• 75% of the data is used to train the model (X_train, y_train)
• 25% is used to test how well the model performs (X_test, y_test)
• random_state=42 ensures the split is the same every time for consistency
Code | Purpose / Explanation
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42) | Splits the features and target into training and testing sets, using 75% for training and 25% for testing.
7.CODE:
# Scale
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
pickle.dump(scaler, open("scaler.pkl", "wb"))
General Purpose
• Scales the feature values to a range between 0 and 1 using Min-Max scaling
• Helps the model train more efficiently and accurately
• Saves the scaler to a file so it can be reused later, especially for real-world predictions (a reuse sketch follows the table below)
Code | Purpose / Explanation
X_train = scaler.fit_transform(X_train) | Fits the scaler to the training data and scales it.
X_test = scaler.transform(X_test) | Applies the same scaling to the test data (without refitting).
pickle.dump(scaler, open("scaler.pkl", "wb")) | Saves the scaler to a .pkl file so it can be loaded later.
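As a sketch of that reuse (new_mix is a hypothetical sample, not from the original notebook; its values follow the column order cement, slag, fly_ash, water, superplasticizer, coarse_agg, fine_agg, age):
# Load the saved scaler and apply it to a new, unseen concrete mix
with open("scaler.pkl", "rb") as f:
    loaded_scaler = pickle.load(f)

new_mix = [[540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28]]  # hypothetical mix
new_mix_scaled = loaded_scaler.transform(new_mix)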
8.CODE:
# Grid Search: Random Forest
rf_param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': [1.0, 'sqrt']  # 'auto' was removed in scikit-learn 1.3; 1.0 is its equivalent for regressors
}
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42), rf_param_grid,
                       cv=5, scoring='r2', n_jobs=-1, verbose=1)
rf_grid.fit(X_train, y_train)
General Purpose
• Performs hyperparameter tuning for two machine learning models:
o Random Forest Regressor
o Gradient Boosting Regressor (its grid is not shown in this excerpt; a sketch follows the table below)
• Uses GridSearchCV to test all combinations of parameters and find the best one
• Evaluates model performance using 5-fold cross-validation with the R² score as the metric
• Trains both models using the best-found parameters
Code | Purpose / Explanation
rf_grid.fit(X_train, y_train) | Runs the grid search and trains the Random Forest model with the best combination of parameters.
gb_grid.fit(X_train, y_train) | Runs the grid search and trains the Gradient Boosting model with the best parameters.
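The Gradient Boosting grid itself does not appear in this excerpt. A minimal sketch, with a hypothetical parameter grid chosen in the same spirit as the Random Forest one (the notebook's actual grid may differ):
# Grid Search: Gradient Boosting (sketch; parameter values are assumptions)
gb_param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}
gb_grid = GridSearchCV(GradientBoostingRegressor(random_state=42), gb_param_grid,
                       cv=5, scoring='r2', n_jobs=-1, verbose=1)
gb_grid.fit(X_train, y_train)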
9.CODE:
# Models dictionary
models = {
    "Random Forest": rf_grid.best_estimator_,
    "Gradient Boosting": gb_grid.best_estimator_,
    "XGBoost": XGBRegressor(random_state=42)
}
General Purpose
• Creates a dictionary to store the machine learning models
• Includes the best-tuned versions of Random Forest and Gradient Boosting (from GridSearch)
• Adds XGBoost with default settings to also compare its performance later
"Random Forest": rf_grid.best_estimator_ Uses the best Random Forest model found by grid search.
"Gradient Boosting": Uses the best Gradient Boosting model found by grid search.
gb_grid.best_estimator_
10.CODE:
# --- MODEL EVALUATION ---
results = []
for name, model in models.items():
    # Fit the model and time the training
    start_fit = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start_fit

    # Predict on both sets and time the scoring step
    start_score = time.time()
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    score_time = time.time() - start_score

    # Compute error metrics on the train and test sets
    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)

    results.append({
        "Model": name,
        "Fit Time (s)": round(fit_time, 4),
        "Score Time (s)": round(score_time, 4),
        "Train MAE": round(train_mae, 4),
        "Test MAE": round(test_mae, 4),
        "Train MSE": round(train_mse, 4),
        "Test MSE": round(test_mse, 4),
        "Train R² Score": round(train_r2, 4),
        "Test R² Score": round(test_r2, 4)
    })
General Purpose
• Loops over each model in the dictionary, fitting it and timing the training
• Predicts on both the training and test sets, timing the scoring step
• Computes MAE, MSE, and R² on both sets to measure accuracy and check for overfitting
• Stores all results in a list for comparison in the next step
11.CODE:
# Results DataFrame
results_df = pd.DataFrame(results).sort_values(by="Test R² Score", ascending=False)
print("\n\U0001F4CA Model Performance Comparison:\n")
print(results_df.to_string(index=False))
General Purpose
• Converts the model evaluation results into a clean table
• Sorts the models by their Test R² Score from highest to lowest
• Prints a final summary comparing model performance side by side
Code | Purpose / Explanation
.sort_values(by="Test R² Score", ascending=False) | Sorts the table so the model with the best test performance appears first.
12.CODE:
# Save Best Model
best_model_name = results_df.iloc[0]["Model"]
best_model_instance = models[best_model_name]
pickle.dump(best_model_instance, open("best_model.pkl", "wb"))
General Purpose
• Identifies the best-performing model based on the highest Test R² Score
• Retrieves the actual model instance from the dictionary
• Saves this model to a file (best_model.pkl) so it can be used later without retraining (a reuse sketch follows the table below)
Code | Purpose / Explanation
best_model_name = results_df.iloc[0]["Model"] | Gets the name of the model with the highest Test R² Score.
pickle.dump(best_model_instance, open("best_model.pkl", "wb")) | Saves the best model to a .pkl file for future use.
13.CODE:
# Feature Importance Plot
if hasattr(best_model_instance, 'feature_importances_'):
    importances = best_model_instance.feature_importances_
    sorted_indices = np.argsort(importances)[::-1]
    plt.figure(figsize=(10, 6))
    plt.barh(range(len(importances)), importances[sorted_indices], align='center')
    plt.yticks(range(len(importances)), X.columns[sorted_indices])
    plt.title(f'{best_model_name} Feature Importance')
    plt.gca().invert_yaxis()
    plt.show()
General Purpose
• Checks if the best model supports feature importance (tree-based models such as Random Forest, Gradient Boosting, and XGBoost all do)
• If it does, it shows which features had the biggest impact on the model's predictions
• Helps understand which inputs (like cement or age) are most influential in predicting concrete strength; for models without feature_importances_, a permutation-based fallback is sketched below
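For an estimator without feature_importances_, scikit-learn's permutation importance is a model-agnostic alternative; a minimal sketch (not part of the original notebook):
from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure how much the R² score drops
perm = permutation_importance(best_model_instance, X_test, y_test,
                              n_repeats=10, random_state=42)
for idx in np.argsort(perm.importances_mean)[::-1]:
    print(f"{X.columns[idx]}: {perm.importances_mean[idx]:.4f}")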
14.CODE:
print("RF best params:", rf_grid.best_params_)
print("GB best params:", gb_grid.best_params_)
General Purpose
• Displays the best hyperparameter settings found by Grid Search for both models
• Helps you see which combination worked best for Random Forest and Gradient Boosting
Code | Purpose / Explanation
print("RF best params:", rf_grid.best_params_) | Prints the best parameter values chosen for the Random Forest model.
print("GB best params:", gb_grid.best_params_) | Prints the best parameter values chosen for the Gradient Boosting model.