Exp 8_LM
To familiarize the students with data visualization using two feature variables.
8.3 Prerequisites
Jupyter Notebook.
Bivariate analysis is a statistical method used to examine the relationship between two
variables. In Python, you can perform bivariate analysis using libraries such as NumPy,
Pandas, and Matplotlib/Seaborn for data manipulation, analysis, and visualization. Here's a
brief outline of the process:
• Data Preparation: Load your dataset into a Pandas DataFrame and clean/preprocess
the data if necessary. Ensure that the two variables of interest are numeric or can be
appropriately converted into numeric format.
• Descriptive Analysis: Compute descriptive statistics for each variable separately using
methods like mean, median, standard deviation, etc. This provides initial insights into
the characteristics of the variables.
• Correlation Analysis: Calculate the correlation coefficient between the two variables
to measure the strength and direction of the linear relationship. Pearson correlation
coefficient is commonly used for this purpose.
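The three steps above can be sketched on a small made-up DataFrame. The column names and all values below are illustrative assumptions and are not taken from the bank data set used in this experiment.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "age": ["25", "32", "47", "51", "38"],      # stored as text on purpose
    "balance": [1200, 1500, 3100, 3300, 2100],
})

# Data preparation: convert the text column to a numeric type.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Descriptive analysis: summary statistics for each variable separately.
summary = df[["age", "balance"]].agg(["mean", "median", "std"])
print(summary)

# Correlation analysis: Pearson correlation between the two variables.
pearson_r = df["age"].corr(df["balance"], method="pearson")
print(f"Pearson r = {pearson_r:.3f}")
```

A value of pearson_r close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one.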
Case Study:
Term deposits, also called fixed deposits, are cash investments made for a specific time
period, ranging from 1 month to 5 years, at a predetermined fixed interest rate. The fixed
interest rates offered for term deposits are higher than the regular interest rates for
savings accounts. The customers receive the total amount (investment plus interest) at the
end of the maturity period. The money is meant to be withdrawn only at the end of the
maturity period. Withdrawing money before that incurs a penalty, and the customer will not
receive any interest returns.
Your target is to perform end-to-end EDA on this bank telemarketing campaign data set to
infer where the bank should put more effort to improve its positive response rate.
Bivariate Analysis
inp0.isnull().sum()
#drop the records with age missing in inp0 and copy the rest into the inp1 dataframe.
#use ~ (not -) to invert a boolean mask.
inp1=inp0[~inp0.age.isnull()].copy()
inp1.shape
In the pdays column, -1 indicates a missing value. Missing values are not always stored
as null. How to handle it:
Objective:
• ignore the missing values in the calculations
• simply mark them as missing: replace -1 with NaN
• all summary statistics (mean, median, etc.) will then ignore the missing values of pdays.
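A minimal sketch of this replacement, using a made-up stand-in for the pdays column (the values are invented, and the frame name demo is hypothetical). In the lab, the same call would presumably be applied to the pdays column of inp1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pdays column; values are invented.
demo = pd.DataFrame({"pdays": [-1, 5, -1, 10, 15]})

# Mean while -1 is still treated as a real value (misleading).
mean_with_sentinel = demo["pdays"].mean()

# Replace the -1 sentinel with NaN so pandas treats it as missing.
demo["pdays"] = demo["pdays"].replace(-1, np.nan)

# pandas skips NaN by default (skipna=True) in mean, median, etc.
mean_ignoring_missing = demo["pdays"].mean()
print(mean_with_sentinel, mean_ignoring_missing)
```

Comparing the two means shows how much the -1 placeholder would distort the summary statistics if it were left in place.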
#plot the pair plot of salary, balance and age in inp1 dataframe.
sns.pairplot(data=inp1, vars=["salary","balance", "age"])
plt.show()
#plot the correlation matrix of salary, balance and age in inp1 dataframe.
sns.heatmap(inp1[["salary","balance", "age"]].corr(), annot=True, cmap="Reds")
plt.show()
#groupby the response to find the mean of the salary with response no & yes separately.
inp1.groupby("response")["salary"].mean()
#groupby the response to find the median of the salary with response no & yes separately.
inp1.groupby("response")["salary"].median()
#groupby the response to find the mean of the balance with response no & yes separately.
inp1.groupby("response")["balance"].mean()
#groupby the response to find the median of the balance with response no & yes separately.
inp1.groupby("response")["balance"].median()
#groupby the education to find the median of the salary for each education category.
inp1.groupby("education")["salary"].median()
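The separate mean and median calls above can also be combined into a single groupby using agg. The sketch below uses a made-up frame with the same column names as the lab data. The values and the name toy are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data reusing the lab's column names; values are invented.
toy = pd.DataFrame({
    "response": ["no", "no", "yes", "yes", "no"],
    "salary": [20000, 20000, 50000, 100000, 110000],
})

# One groupby call returning both statistics for each response group.
salary_stats = toy.groupby("response")["salary"].agg(["mean", "median"])
print(salary_stats)
```

Seeing the mean and median side by side makes it easier to spot groups where a few large values pull the mean away from the median.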
Job vs salary
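#groupby the job to find the mean of the salary for each job category.
inp1.groupby('job')['salary'].mean()
#groupby the job to find the median of the salary for each job category.
inp1.groupby('job')['salary'].median()
#find the proportion of yes and no responses.
inp1.response.value_counts(normalize= True)
#mean of response_flag (assumed to be a 0/1 encoding of response, 1 = yes), i.e. the positive response rate.
inp1.response_flag.mean()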
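The response_flag column is not created in this section. One plausible way to build it, assuming it is a 0/1 encoding of response with 1 for "yes", is sketched below on a made-up frame. The frame name flags_demo and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the lab this would be inp1.
flags_demo = pd.DataFrame({"response": ["no", "yes", "no", "no", "yes"]})

# Assumed encoding: 1 for "yes", 0 for anything else.
flags_demo["response_flag"] = np.where(flags_demo["response"] == "yes", 1, 0)

# The mean of a 0/1 flag equals the share of "yes" responses.
rate_from_flag = flags_demo["response_flag"].mean()
rate_from_counts = flags_demo["response"].value_counts(normalize=True)["yes"]
print(rate_from_flag, rate_from_counts)
```

Because the flag is numeric, it can be averaged within any group (for example per job category) to obtain a response rate for that group.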
Age vs response
#plot the boxplot of age for each response category.
sns.boxplot(data=inp1, x="response",y="age")
plt.show()
#plot the bar graph of job categories with response_flag mean value.
inp1.groupby(['job'])['response_flag'].mean().plot.barh()
plt.show()
• Restart Kernel: If you encounter unexpected behavior or errors, try restarting the
kernel. This clears all the variables and imported modules, essentially resetting
the notebook's state. You can restart the kernel by going to the "Kernel" menu
and selecting "Restart."
• Clear Outputs: To reduce clutter and confusion, consider clearing the outputs of
code cells that are no longer relevant. You can do this by selecting "Clear
Outputs" from the "Edit" menu.
• Readability: Keep your code and comments clear and well-organized to make it
easier to understand and maintain. Use markdown cells for explanations,
headings, and documentation.
• Kernel Selection: Make sure you're using the correct kernel for your notebook.
The kernel determines the programming language and environment in which
your code runs. You can change the kernel by clicking on "Kernel" > "Change
kernel" in the menu.
Troubleshooting:
• Syntax Errors: Check for syntax errors in your code. Python is sensitive to
indentation and syntax, so ensure your code is properly formatted.
• Library Installation: If you encounter ModuleNotFoundError or similar errors,
ensure that the required libraries are installed in your Jupyter environment. You
can install libraries using %pip install <library> or %conda install <library> in a
code cell (these magics install into the environment of the running kernel).
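Before installing anything, it can help to check which libraries the current kernel can actually find. The sketch below uses only the standard library. The list of required names is an assumption based on the libraries mentioned in this experiment.

```python
import importlib.util

# Libraries this experiment is assumed to rely on.
required = ["numpy", "pandas", "matplotlib", "seaborn"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any name reported as missing can then be installed from a code cell and the check re-run.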
8.8 Observations
Observe the results obtained in each operation.
The results obtained from the Jupyter Notebook should be printed and pasted into the laboratory copy.