Report Logistic Regression
Introduction:
Logistic regression is used for classification and predictive analytics. It estimates the probability
of an event occurring (e.g. sleep or no sleep after eating) based on a data set of independent
variables. Because the outcome is a probability, the predicted value of the dependent variable
(e.g. the probability of sleep) is bounded between 0 and 1.
Log odds are difficult to interpret directly in a logistic regression analysis, so the exponent of
each beta estimate is taken to transform it into an odds ratio (OR), which is easier to interpret.
The OR compares the odds that an outcome will occur given a particular event with the odds of
the outcome occurring in the absence of that event. If the OR is greater than 1, the event makes
the outcome more likely; if the OR is less than 1, the event makes the outcome less likely.
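The transformation from a coefficient to an odds ratio can be sketched in a few lines of Python. The coefficient value below is hypothetical, chosen only to illustrate the calculation:

```python
import math

# Hypothetical logistic regression coefficient (log-odds) for one predictor
beta = 0.0885

# The odds ratio is the exponent of the coefficient
odds_ratio = math.exp(beta)

# OR > 1: a one-unit increase in the predictor makes the event more likely;
# OR < 1: it makes the event less likely.
print(round(odds_ratio, 4))
```

Here the odds ratio is slightly above 1, so each one-unit increase in this hypothetical predictor multiplies the odds of the event by about 1.09.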
Types of logistic regression:
There are three types of logistic regression models, which are defined based on the categorical
response: binary (two possible outcomes), multinomial (three or more unordered categories), and
ordinal (three or more ordered categories).
Churn prediction: For example, human resources and management teams may want to know if
there are high performers within the company who are at risk of leaving the organization; this
type of insight can prompt conversations to understand problem areas within the company,
such as culture or compensation. Alternatively, sales organizations may want to learn which
of their clients are at risk of taking their business elsewhere. This can prompt teams to set up a
retention strategy to avoid lost revenue.
This sets the values of validate to be randomly generated Bernoulli variates with
probability parameter 0.7.
You only intend to use validate with cases that could be used to create the model; that is,
previous customers. However, there are 150 cases corresponding to potential customers
in the data file.
8. To perform the computation only for previous customers, click If.
Figure 3. If Cases dialog box
9. Select Include if case satisfies condition.
10. Type MISSING(default) = 0 as the conditional expression.
This ensures that validate is only computed for cases with non-missing values for default;
that is, for customers who previously received loans.
Approximately 70 percent of the customers previously given loans will have a validate value of
1. These customers will be used to create the model. The remaining customers who were
previously given loans will be used to validate the model results.
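The steps above can be sketched outside SPSS as well. This is a minimal illustration with a tiny made-up default column (NaN marks potential customers with no loan history), not the actual data file: validate is drawn as a Bernoulli(0.7) variate, but only for cases where default is non-missing, mirroring the MISSING(default) = 0 condition:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical data: default is NaN for potential customers (no loan
# history) and 0 or 1 for previous customers.
default = np.array([0, 1, 0, np.nan, 1, np.nan, 0, 1])

# validate = 1 with probability 0.7, computed only for previous customers
# (cases with a non-missing default value).
validate = np.where(np.isnan(default),
                    np.nan,
                    rng.binomial(1, 0.7, size=default.size))

print(validate)
```

Cases with validate = 1 (roughly 70 percent of previous customers) would build the model; the rest would validate it.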
12. Select Probabilities in the Predicted Values group, Cook's in the Influence group,
and Studentized in the Residuals group.
13. Click Continue.
14. Click Options in the Logistic Regression dialog box.Figure 5. Options dialog box
15. Select Classification plots and Hosmer-Lemeshow goodness-of-fit.
16. Click Continue.
17. Click OK in the Logistic Regression dialog box.
Model Diagnostics
After building a model, you need to determine whether it reasonably approximates the behavior
of your data.
Tests of Model Fit. The Binary Logistic Regression procedure reports the Hosmer-Lemeshow
goodness-of-fit statistic.
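The Hosmer-Lemeshow statistic groups cases into "deciles of risk" by predicted probability and compares observed to expected event counts with a chi-square test. A minimal sketch of that idea, run on simulated data (since the loan file is not reproduced here), might look like this:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow goodness-of-fit: chi-square over deciles of risk."""
    order = np.argsort(p_hat)
    y_s, p_s = y[order], p_hat[order]
    chi2 = 0.0
    for idx in np.array_split(np.arange(len(y_s)), groups):
        observed = y_s[idx].sum()      # observed events in the group
        expected = p_s[idx].sum()      # expected events in the group
        n_g = len(idx)
        mean_p = expected / n_g
        chi2 += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    # A large p-value means no evidence of lack of fit
    p_value = stats.chi2.sf(chi2, df=groups - 2)
    return chi2, p_value

# Sanity check on simulated data where the probabilities are correct
rng = np.random.default_rng(3)
x = rng.normal(size=1000)
true_p = 1 / (1 + np.exp(-(0.2 + 0.9 * x)))
y = rng.binomial(1, true_p)
chi2, p = hosmer_lemeshow(y, true_p)
print(round(chi2, 2), round(p, 3))
```

Because the simulated probabilities are well specified, the test should typically not reject; with real data, a small p-value would signal that the model fits poorly.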
Residual Plots. Using variables specified in the Save dialog box, you can construct various
diagnostic plots. Two helpful plots are the change in deviance versus predicted probabilities and
Cook's distances versus predicted probabilities.
The curve that extends from the lower left to the upper right corresponds to cases in
which the dependent variable has a value of 0. Thus, non-defaulters who have large
model-predicted probabilities of default are poorly fit by the model.
The curve that extends from the upper left to the lower right corresponds to cases in
which the dependent variable has a value of 1. Thus, defaulters who have small model-
predicted probabilities of default are poorly fit by the model.
By identifying the cases that are poorly fit by the model, you can focus on how those customers
are different, and hopefully discover another predictor that will improve the model.
Variable Selection
Figure 1. Variables not in the Equation, block 0
Forward stepwise methods start with a model that doesn't include any of the predictors.
At each step, the predictor with the largest score statistic whose significance value is less
than a specified value (by default 0.05) is added to the model.
Figure 2. Variables not in the Equation, block 1
The variables left out of the analysis at the last step all have significance values larger
than 0.05, so no more are added.
Figure 3. Model if Term Removed
The variables chosen by the forward stepwise method should all have significant changes
in -2 log-likelihood. The change in -2 log-likelihood is generally more reliable than the
Wald statistic. If the two disagree as to whether a predictor is useful to the model, trust
the change in -2 log-likelihood.
As a further check, you can build a model using backward stepwise methods. Backward
methods start with a model that includes all of the predictors. At each step, the predictor
that contributes the least is removed from the model, until all of the predictors in the
model are significant. If the two methods choose the same variables, you can be fairly
confident that it's a good model.
R-Squared Statistics
Figure 1. Model Summary
What constitutes a “good” R² value varies between different areas of application. While these
statistics can be suggestive on their own, they are most useful when comparing competing
models for the same data. The model with the largest R² statistic is “best” according to this
measure.
Summary
Using the Logistic Regression procedure, I have constructed a model for predicting the
probability that a given customer will default on their loan.
A critical issue for loan officers is the cost of Type I and Type II errors. That is, what is the cost
of classifying a defaulter as a non-defaulter (Type I)? What is the cost of classifying a non-
defaulter as a defaulter (Type II)? If bad debt is the primary concern, then you want to lower
your Type I error and maximize your sensitivity. If growing your customer base is the priority,
then you want to lower your Type II error and maximize your specificity. Usually both are major
concerns, so you have to choose a decision rule for classifying customers that gives the best mix
of sensitivity and specificity.
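The trade-off between the two error types comes down to the classification threshold. The sketch below uses a handful of hypothetical predicted probabilities (not output from the model above) and follows the text's convention that Type I is a missed defaulter and Type II a wrongly flagged non-defaulter:

```python
import numpy as np

# Hypothetical model-predicted default probabilities and true outcomes
p_hat = np.array([0.05, 0.20, 0.35, 0.55, 0.70, 0.90])
actual = np.array([0, 0, 1, 0, 1, 1])

def sens_spec(threshold):
    """Sensitivity and specificity when flagging cases with p >= threshold."""
    pred = (p_hat >= threshold).astype(int)
    tp = np.sum((pred == 1) & (actual == 1))
    tn = np.sum((pred == 0) & (actual == 0))
    fn = np.sum((pred == 0) & (actual == 1))  # Type I in the text's sense
    fp = np.sum((pred == 1) & (actual == 0))  # Type II in the text's sense
    sensitivity = tp / (tp + fn)  # defaulters correctly flagged
    specificity = tn / (tn + fp)  # non-defaulters correctly cleared
    return sensitivity, specificity

for t in (0.3, 0.5, 0.7):
    print(t, sens_spec(t))
```

Lowering the threshold catches more defaulters (higher sensitivity, less bad debt) at the cost of rejecting more good customers; raising it does the opposite. Choosing the decision rule amounts to picking the point on this trade-off that matches the lender's priorities.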