Report Logistic Regression

Introduction:

Logistic regression is used for classification and predictive analytics. It estimates the probability of an event occurring (e.g., sleep or no sleep after eating) based on a data set of independent variables. Because the predicted outcome is a probability, the dependent variable (e.g., the probability of sleep) is bounded between 0 and 1.

Interpreting logistic regression:

Log odds are difficult to interpret directly in logistic regression output, so the exponent of each beta estimate is taken to transform it into an odds ratio (OR), which is easier to interpret. The OR is the ratio of the odds that an outcome will occur given a particular event to the odds of the outcome occurring in the absence of that event. If the OR is greater than 1, the event is more likely to occur; if the OR is less than 1, the event is less likely to occur.
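For instance, the transformation is just an exponentiation; a minimal Python sketch (the coefficient value here is hypothetical):

```python
import numpy as np

beta = 0.405                 # hypothetical log-odds coefficient from a fitted model
odds_ratio = np.exp(beta)    # exponentiating the beta estimate gives the OR
print(round(odds_ratio, 3))  # ~1.499: the outcome is about 1.5x as likely per unit increase
```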

Types of logistic regression:
There are three types of logistic regression models, which are defined based on the nature of the categorical response:

1-Binary logistic regression:


The response or dependent variable has only two possible outcomes (e.g., 0 or 1). Popular examples include predicting whether an email is spam or not spam, or whether a tumor is malignant or not malignant. This is the most common form of logistic regression and one of the most widely used classifiers for binary classification.
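As an illustration, a minimal scikit-learn sketch of such a binary classifier (the two-feature data here are synthetic and purely illustrative, not from the source):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data; label 1 = "spam", 0 = "not spam" (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X[:3]))   # per-row class probabilities, each summing to 1
```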

2-Multinomial logistic regression:


In this type of logistic regression model, the dependent variable has three or more possible outcomes with no specified order. For example, a movie studio may want to predict what genre of film a moviegoer is likely to see in order to market films more effectively. A multinomial logistic regression model can help the studio determine the strength of influence that a person’s age, gender, and dating status may have on the type of films they prefer. The studio can then orient the advertising campaign for a specific movie toward a group of people likely to see it.

3-Ordinal logistic regression:


This type of logistic regression model is leveraged when the response variable has three or more possible outcomes, but in this case, these values do have a defined order. Examples of ordinal responses include grading scales from A to F or rating scales from 1 to 5.

Use cases of logistic regression:


Logistic regression is commonly used for prediction and classification problems. Some of these
use cases are:
Fraud detection: Logistic regression models can help teams identify data anomalies that are predictive of fraud. Certain behaviors or characteristics may have a higher association with fraudulent activity, which is particularly helpful to banks in protecting their clients.

Disease prediction: In medicine, this analytics approach can be used to predict the likelihood of disease or illness for a given population. Healthcare organizations can set up preventive care for individuals who show a higher propensity for specific illnesses.

Churn prediction: For example, human resources and management teams may want to know whether there are high performers within the company who are at risk of leaving the organization; this type of insight can prompt conversations to understand problem areas within the company, such as culture or compensation. Alternatively, sales organizations may want to learn which of their clients are at risk of taking their business elsewhere. This can prompt teams to set up a retention strategy to avoid lost revenue.

Example of logistic regression to assess credit risk:

Preparing the Data for Analysis


Setting the random seed allows you to replicate the random selection of cases in this analysis.

1. To set the random seed, from the menus choose:

Transform > Random Number Generators...

Figure 1. Random Number Generators dialog box

2. Select Set Starting Point.


3. Select Fixed Value and type 9191972 as the value
4. Click OK.
5. To create the selection variable for validation, from the menus choose:

Transform > Compute Variable...

Figure 2. Compute Variable dialog box

6. Type validate in the Target Variable text box.


7. Type rv.bernoulli(0.7) in the Numeric Expression text box.

This sets the value of validate to be a randomly generated Bernoulli variate with probability parameter 0.7.

You only intend to use validate with cases that could be used to create the model; that is,
previous customers. However, there are 150 cases corresponding to potential customers
in the data file.

8. To perform the computation only for previous customers, click If.

Figure 3. If Cases dialog box
9. Select Include if case satisfies condition.
10. Type MISSING(default) = 0 as the conditional expression.

This ensures that validate is only computed for cases with non-missing values for default;
that is, for customers who previously received loans.

11. Click Continue.


12. Click OK in the Compute Variable dialog box.

Approximately 70 percent of the customers previously given loans will have a validate value of
1. These customers will be used to create the model. The remaining customers who were
previously given loans will be used to validate the model results.
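For readers working outside SPSS, the same split could be sketched in Python with pandas (the toy DataFrame and column names below are illustrative stand-ins for the bank data file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bank data: 'default' is missing (NaN) for potential customers
df = pd.DataFrame({"default": [0, 1, 0, np.nan, 1, np.nan]})

rng = np.random.default_rng(9191972)    # fixed seed, mirroring the SPSS starting point

# Bernoulli(0.7) draws, computed only for cases with a non-missing 'default'
mask = df["default"].notna()
df.loc[mask, "validate"] = (rng.random(mask.sum()) < 0.7).astype(int)
print(df)
```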

Running the Analysis


1. To create the logistic regression model, from the menus choose:

Analyze > Regression > Binary Logistic...

Figure 1. Logistic Regression dialog box


2. Select previously defaulted as the dependent variable.
3. Select Age in years through other debt in thousands as covariates.
4. Select Forward: LR as the method.
5. Select validate as the selection variable and click Rule.

Figure 2. Set Rule dialog box

6. Type 1 as the value for the selection variable.


7. Click Continue.
8. Click Categorical in the Logistic Regression dialog box.

Figure 3. Define Categorical dialog box
9. Select Level of education as a categorical covariate.
10. Click Continue.
11. Click Save in the Logistic Regression dialog box.

Figure 4. Save dialog box

12. Select Probabilities in the Predicted Values group, Cook's in the Influence group,
and Studentized in the Residuals group.
13. Click Continue.
14. Click Options in the Logistic Regression dialog box.

Figure 5. Options dialog box
15. Select Classification plots and Hosmer-Lemeshow goodness-of-fit.
16. Click Continue.
17. Click OK in the Logistic Regression dialog box.
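As a rough Python analogue of this step, here is a statsmodels sketch fitting a binary logistic model on the model-building cases only. The data and covariate names (age, employ) are synthetic stand-ins, not the real bank file, and plain maximum likelihood is used rather than SPSS's Forward: LR selection:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the bank data (illustrative only, not the real file)
rng = np.random.default_rng(9191972)
n = 500
df = pd.DataFrame({"age": rng.integers(20, 60, n), "employ": rng.integers(0, 30, n)})
logit = -1.0 + 0.04 * df["age"] - 0.25 * df["employ"]
df["default"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
df["validate"] = (rng.random(n) < 0.7).astype(int)

train = df[df["validate"] == 1]                     # model-building cases only
X = sm.add_constant(train[["age", "employ"]])
result = sm.Logit(train["default"], X).fit(disp=0)
print(result.summary())                             # coefficients, SEs, z-statistics
```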

Model Diagnostics
After building a model, you need to determine whether it reasonably approximates the behavior
of your data.

Tests of Model Fit. The Binary Logistic Regression procedure reports the Hosmer-Lemeshow
goodness-of-fit statistic.

Residual Plots. Using the variables specified in the Save dialog box, you can construct various diagnostic plots. Two helpful plots are the change in deviance versus predicted probabilities and Cook's distances versus predicted probabilities.

Tests of Model Fit



Figure 1. Hosmer-Lemeshow statistic


Goodness-of-fit statistics help you to determine whether the model adequately describes the data.
The Hosmer-Lemeshow statistic indicates a poor fit if the significance value is less than 0.05.
Here, the model adequately fits the data.

Figure 2. Contingency Table for Hosmer-Lemeshow statistic
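statsmodels does not ship a Hosmer-Lemeshow test, but the usual computation (group cases into deciles of predicted risk, then compare observed and expected event counts with a chi-square statistic) can be sketched as follows; the function below is an illustrative implementation, not SPSS's code:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow chi-square: observed vs. expected events in g risk groups."""
    y, p = np.asarray(y), np.asarray(p)
    groups = np.array_split(np.argsort(p), g)   # ~equal-sized groups by predicted risk
    stat = 0.0
    for idx in groups:
        obs, exp, n = y[idx].sum(), p[idx].sum(), len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return stat, chi2.sf(stat, g - 2)           # p-value on g - 2 degrees of freedom

# e.g., with the earlier statsmodels sketch:
# stat, pval = hosmer_lemeshow(train["default"], result.predict())
```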

Change in Deviance versus Predicted Probabilities

The change in deviance is not an option in the Save dialog box, but it can be estimated by squaring the studentized residuals.

1. To create the change in deviance, from the menus choose:

Transform > Compute Variable...

Figure 1. Compute Variable dialog box


2. Type chgdev as the Target Variable.
3. Type sre_1**2 as the Numeric Expression.
4. Click OK.

The squared studentized residuals have been saved to chgdev.

5. To produce the residual plot, from the menus choose:

Graphs > Chart Builder...

Figure 2. Chart Builder


6. Select the Scatter/Dot gallery and choose Simple Scatter.
7. Select chgdev as the variable to plot on the Y axis.
8. Select Predicted probability as the variable to plot on the X axis.
9. Click OK.
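Continuing the statsmodels sketch from above, a rough Python version of this plot; squared deviance residuals are used here as a stand-in for the squared studentized residuals that SPSS saves:

```python
import matplotlib.pyplot as plt

# Assumes `result` from the earlier statsmodels sketch
p_hat = result.predict()            # model-predicted probabilities
chg_dev = result.resid_dev ** 2     # squared deviance residuals approximate the change in deviance

plt.scatter(p_hat, chg_dev, s=10)
plt.xlabel("Predicted probability")
plt.ylabel("Change in deviance (approximate)")
plt.show()
```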

Figure 3. Change in deviance (squared studentized residuals) vs. predicted probabilities

The change in deviance plot helps you to identify cases that are poorly fit by the model. Larger
changes in deviance indicate poorer fits. There are two distinct patterns in the plot: a curve that
extends from the lower left to the upper right, and a curve that extends from the upper left to the
lower right.

 The curve that extends from the lower left to the upper right corresponds to cases in
which the dependent variable has a value of 0. Thus, non-defaulters who have large
model-predicted probabilities of default are poorly fit by the model.
 The curve that extends from the upper left to the lower right corresponds to cases in which the dependent variable has a value of 1. Thus, defaulters who have small model-predicted probabilities of default are poorly fit by the model.

By identifying the cases that are poorly fit by the model, you can focus on how those customers
are different, and hopefully discover another predictor that will improve the model.

Cook's Distances versus Predicted Probabilities

Figure 1. Chart Builder


1. Recall the Chart Builder.
2. Select Analog of Cook's influence statistics as the variable to plot on the Y axis.
3. Select Predicted probability as the variable to plot on the X axis.
4. Click OK.

Figure 2. Cook's distances vs. predicted probabilities


The shape of the Cook's distances plot generally follows that of the previous figure, with some
minor exceptions. These exceptions are high-leverage points, and can be influential to the
analysis. You should note which cases correspond to these points for further investigation.
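A rough Python analogue, refitting the model from the earlier sketch as a GLM because statsmodels exposes Cook's distances through its GLM influence measures:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Refit the same model as a binomial GLM to access influence diagnostics
glm = sm.GLM(train["default"], X, family=sm.families.Binomial()).fit()
cooks_d = glm.get_influence().cooks_distance[0]   # first element holds the distances

plt.scatter(glm.predict(), cooks_d, s=10)
plt.xlabel("Predicted probability")
plt.ylabel("Cook's distance")
plt.show()
```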

Choosing the Right Model


There are usually several models that pass the diagnostic checks, so you need tools to choose
between them.

 Automated Variable Selection. When constructing a model, you generally want to only include predictors that contribute significantly to the model. The Binary Logistic Regression procedure offers several methods for stepwise selection of the "best" predictors to include in the model.
 Pseudo R-Squared Statistics. The r-squared statistic, which measures the variability in the dependent variable that is explained by a linear regression model, cannot be computed for logistic regression models. The pseudo r-squared statistics are designed to have similar properties to the true r-squared statistic.
 Classification and Validation. Crosstabulating observed response categories with predicted categories helps you to determine how well the model identifies defaulters.

Variable Selection

Figure 1. Variables not in the Equation, block 0

Forward stepwise methods start with a model that doesn't include any of the predictors. At each step, the predictor with the largest score statistic whose significance value is less than a specified value (0.05 by default) is added to the model.

Figure 2. Variables not in the Equation, block 1

The variables left out of the analysis at the last step all have significance values larger than 0.05, so no more are added.

Figure 3. Model if Term Removed

The variables chosen by the forward stepwise method should all have significant changes in -2 log-likelihood. The change in -2 log-likelihood is generally more reliable than the Wald statistic. If the two disagree as to whether a predictor is useful to the model, trust the change in -2 log-likelihood.

As a further check, you can build a model using backward stepwise methods. Backward methods start with a model that includes all of the predictors. At each step, the predictor that contributes the least is removed from the model, until all of the predictors in the model are significant. If the two methods choose the same variables, you can be fairly confident that it's a good model.
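SPSS's Forward: LR method is not directly available in statsmodels, but a simplified forward-selection loop driven by likelihood-ratio p-values can be sketched as follows (illustrative only; SPSS's actual method uses a score statistic for entry):

```python
import numpy as np
from scipy.stats import chi2
import statsmodels.api as sm

def forward_select(data, outcome, candidates, alpha=0.05):
    """Greedy forward selection by likelihood-ratio test (simplified sketch)."""
    selected, remaining = [], list(candidates)
    # Start from the intercept-only model's log-likelihood
    current_llf = sm.Logit(data[outcome], np.ones(len(data))).fit(disp=0).llf
    while remaining:
        best = None
        for c in remaining:
            X = sm.add_constant(data[selected + [c]])
            llf = sm.Logit(data[outcome], X).fit(disp=0).llf
            pval = chi2.sf(2 * (llf - current_llf), df=1)  # LR test, 1 df per term
            if pval < alpha and (best is None or llf > best[1]):
                best = (c, llf)
        if best is None:          # no remaining predictor enters significantly
            break
        selected.append(best[0])
        remaining.remove(best[0])
        current_llf = best[1]
    return selected

# e.g., with the earlier synthetic data: forward_select(train, "default", ["age", "employ"])
```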

R-Squared Statistics
Figure 1. Model Summary

What constitutes a “good” R² value varies between different areas of application. While these statistics can be suggestive on their own, they are most useful when comparing competing models for the same data. The model with the largest R² statistic is “best” according to this measure.
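For example, McFadden's pseudo R-squared compares the fitted model's log-likelihood to that of an intercept-only model; using the result object from the earlier statsmodels sketch:

```python
# McFadden's pseudo R-squared: 1 minus the ratio of the model's log-likelihood
# to the intercept-only (null) log-likelihood
pseudo_r2 = 1 - result.llf / result.llnull
print(pseudo_r2, result.prsquared)   # statsmodels reports the same quantity as prsquared
```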

Logistic Regression Coefficients


Figure 1. Parameter Estimates
The parameter estimates table summarizes the effect of each predictor. The ratio of the
coefficient to its standard error, squared, equals the Wald statistic. If the significance level of the
Wald statistic is small (less than 0.05) then the parameter is useful to the model. The predictors
and coefficient values shown in the last step are used by the procedure to make predictions.

The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. Exp(B) represents the ratio change in the odds of the event of interest for a one-unit change in the predictor. For example, Exp(B) for employ is equal to 0.781, which means that the odds of default for a person who has been employed at their current job for two years are 0.781 times the odds of default for a person who has been employed at their current job for one year, all other things being equal.
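Both quantities are easy to recover from a fitted model; a short sketch using the result object from the earlier statsmodels example:

```python
import numpy as np
import pandas as pd

wald = (result.params / result.bse) ** 2    # squared ratio of coefficient to its SE
exp_b = np.exp(result.params)               # Exp(B): odds ratio for a one-unit change
print(pd.DataFrame({"Wald": wald, "Exp(B)": exp_b}))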

Summary
Using the Logistic Regression Procedure, I have constructed a model for predicting the
probability a given customer will default on their loan.

A critical issue for loan officers is the cost of Type I and Type II errors. That is, what is the cost
of classifying a defaulter as a non-defaulter (Type I)? What is the cost of classifying a non-defaulter as a defaulter (Type II)? If bad debt is the primary concern, then you want to lower
your Type I error and maximize your sensitivity. If growing your customer base is the priority,
then you want to lower your Type II error and maximize your specificity. Usually both are major
concerns, so you have to choose a decision rule for classifying customers that gives the best mix
of sensitivity and specificity.
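As a sketch of that trade-off, using the fitted model from the earlier statsmodels example (evaluated on the training cases here for brevity; in practice you would use the hold-out cases with validate = 0):

```python
import numpy as np

def sens_spec(y, p, cutoff=0.5):
    """Sensitivity and specificity of the classification rule p >= cutoff."""
    y, pred = np.asarray(y), np.asarray(p) >= cutoff
    sensitivity = (pred & (y == 1)).sum() / (y == 1).sum()    # defaulters caught
    specificity = (~pred & (y == 0)).sum() / (y == 0).sum()   # non-defaulters passed
    return sensitivity, specificity

# Lowering the cutoff trades specificity for sensitivity (and vice versa), e.g.:
for c in (0.3, 0.5, 0.7):
    print(c, sens_spec(train["default"], result.predict(), c))
```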

Source: Summary - IBM Documentation
