The Wayback Machine - https://web.archive.org/web/20230613161328/https://github.com/scikit-learn/scikit-learn/issues/25982

Add functions for calculating log-likelihood and null log-likelihood #25982

Closed
SinghAnkur28 opened this issue Mar 27, 2023 · 12 comments

@SinghAnkur28
Contributor

Describe the workflow you want to enable

As a user of scikit-learn, I want to calculate McFadden's pseudo R-squared for a binary logistic regression model. For that, I need both the log-likelihood and the null log-likelihood.

Describe your proposed solution

I use the following functions and propose adding them to the library.

For the log-likelihood:

import numpy as np
from sklearn.linear_model import LogisticRegression

def log_likelihood(model, X, y):
    """Calculate the log-likelihood of a binary logistic regression model.

    Parameters
    ----------
    model : sklearn.linear_model.LogisticRegression
        A trained binary logistic regression model.
    X : array-like, shape (n_samples, n_features)
        Feature matrix.
    y : array-like, shape (n_samples,)
        Binary class labels.

    Returns
    -------
    log_likelihood : float
        The log-likelihood of the model.
    """

    # Predicted probability of the positive class
    pred_probs = model.predict_proba(X)[:, 1]

    # Clip to avoid log(0) when a probability underflows to exactly 0 or 1
    eps = np.finfo(pred_probs.dtype).eps
    pred_probs = np.clip(pred_probs, eps, 1 - eps)

    # Sum of per-sample log-likelihood contributions
    y = np.asarray(y)
    log_likelihood = np.sum(y * np.log(pred_probs) + (1 - y) * np.log(1 - pred_probs))

    return log_likelihood

For the null log-likelihood:

import numpy as np
from sklearn.metrics import log_loss

def null_log_likelihood(y):
    """Calculate the null log-likelihood of a binary logistic regression model.

    Parameters
    ----------
    y : array-like, shape (n_samples,)
        Binary class labels.

    Returns
    -------
    null_log_likelihood : float
        The null log-likelihood of the model.
    """

    y = np.asarray(y)

    # Proportion of positive class labels: the prediction of an intercept-only model
    p1 = y.mean()

    # Constant predicted probability equal to that base rate
    probs = np.full(y.shape, p1)

    # The negated, unnormalized log_loss gives the summed null log-likelihood
    null_log_likelihood = -log_loss(y, probs, normalize=False)

    return null_log_likelihood
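Putting the two quantities together, McFadden's pseudo R-squared can be sketched as follows. This is a minimal, self-contained sketch on synthetic data; the data and variable names are illustrative, not part of the proposal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Log-likelihood of the fitted model: the negated, unnormalized log loss
ll_model = -log_loss(y, model.predict_proba(X)[:, 1], normalize=False)

# Null log-likelihood: an intercept-only model predicts the base rate
ll_null = -log_loss(y, np.full_like(y, y.mean(), dtype=float), normalize=False)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
mcfadden_r2 = 1 - ll_model / ll_null
```

Both log-likelihoods are negative; a model that fits better than the base rate yields a ratio below one, so the score lands between 0 and 1.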

Describe alternatives you've considered, if relevant

One alternative to adding these functions to Scikit-learn would be for users to use statsmodels.
However, adding these functions to Scikit-learn would make it easier for users to calculate log-likelihood and null log-likelihood within the Scikit-learn ecosystem and would provide a standardized implementation.

Additional context

No response

@SinghAnkur28 SinghAnkur28 added the Needs Triage and New Feature labels Mar 27, 2023
@SinghAnkur28 SinghAnkur28 changed the title from "Add functions for calculating log-likelihood and null log-likelihood to Scikit-learn" to "Add functions for calculating log-likelihood and null log-likelihood" Mar 27, 2023
@enigdata

Would you like to open a PR on this for review?

@betatim
Member

betatim commented Apr 11, 2023

There already is a log_loss function in the metrics module of scikit-learn. Naively I'd assume this would also work for your case. Could you explain in a bit more detail why you can't use that function instead of adding a new one?

@SinghAnkur28
Contributor Author

Thank you for your comment, @betatim. I want to measure the goodness of fit of a logistic regression model. However, I think the log_loss function measures the difference between the predicted probabilities and the actual outcomes.

Alternatively, can we have a metric that directly calculates the pseudo R-squared?

@betatim
Member

betatim commented Apr 13, 2023

If I look at the code of your log_likelihood(model, X, y) I think you can perform the same computation using the log_loss that already exists.
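Concretely, the hand-rolled sum matches the negated, unnormalized log_loss. Here is a minimal sketch demonstrating the equivalence; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 2))
y = (rng.uniform(size=100) < 0.4).astype(int)

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# Hand-rolled binary log-likelihood, as in the proposed function
manual_ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same quantity via the existing metric: log_loss with normalize=False
# returns the summed negative log-likelihood, so negate it
sklearn_ll = -log_loss(y, p, normalize=False)
```

The two values agree up to floating-point tolerance (log_loss clips probabilities away from exactly 0 and 1, which is negligible here).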

@SinghAnkur28
Contributor Author

Thanks @betatim for your guidance. I'm new to the field, so that was immensely helpful.

I used the following code for my case:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

model = LogisticRegression()
model.fit(X, y)

y_pred_proba = model.predict_proba(X)[:, 1]

# Mean log loss of the null (base-rate) model and of the fitted model;
# the normalization cancels in the ratio below
ll_null = log_loss(y, [y.mean()] * len(y))
ll_model = log_loss(y, y_pred_proba)

pseudo_r2 = 1 - ll_model / ll_null

However, I was wondering if we have a built-in function to compute pseudo R-squared, and if not, what might be the reason for not having it?

@betatim
Member

betatim commented Apr 13, 2023

However, I was wondering if we have a built-in function to compute pseudo R-squared, and if not, what might be the reason for not having it?

I don't know the answer to that. I think there isn't a lot of "goodness of fit" tooling in scikit-learn because there are other libraries that do that.

@SinghAnkur28
Contributor Author

I think the reason for this may be that there are different approaches to computing pseudo R-squared. Moreover, different pseudo R-squared measures can have different interpretations and assumptions, which may not always apply in a particular context.

However, since scikit-learn already has r2_score, IMO the library should also have a pseudo R-squared for logistic regression (allowing the user to choose among the available indices) for consistency.
What do you think?

I'd be more than happy to work on it.

@thomasjpfan thomasjpfan added module:metrics Needs Decision - Include Feature Requires decision regarding including feature module:linear_model and removed Needs Triage Issue requires triage labels May 4, 2023
@thomasjpfan
Member

Metrics in sklearn.metrics must be general enough to work with all models, not just linear models. In practice, I have only seen pseudo R-squared applied to linear models. @AnkurSingh282000 Have you come across literature that applies pseudo R-squared to other models, such as tree-based models?

@lorentzenchr What do you think of providing a pseudo R-squared to the models in sklearn.linear_model?

@lorentzenchr
Member

Such a metric, d2_log_loss, is proposed in #20943. It's only a matter of someone implementing it.

I want to measure the goodness of fit of a logistic regression model. However, I think the log_loss function measures the difference between the predicted probabilities and the actual outcomes.

Note that a "good" GOF measure does exactly that: it measures some kind of "distance" between observations and predictions.
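The idea behind the proposed metric can be sketched from existing pieces: D-squared is the fraction of log-loss deviance explained relative to a base-rate model. This is a hedged sketch with an illustrative function name; no d2_log_loss function existed in scikit-learn at the time of this thread.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def d2_log_loss(y_true, y_prob):
    """Fraction of log-loss deviance explained, relative to a base-rate model."""
    # Null model: constant prediction at the observed positive-class rate
    null_prob = np.full(len(y_true), np.mean(y_true))
    return 1.0 - log_loss(y_true, y_prob) / log_loss(y_true, null_prob)

rng = np.random.RandomState(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
score = d2_log_loss(y, model.predict_proba(X)[:, 1])
```

A model that predicts better than the base rate scores between 0 and 1, mirroring how r2_score relates a regressor to a mean-only baseline.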

I think we can close this, then?

@thomasjpfan
Member

@lorentzenchr Okay that makes sense. I'm closing this issue as a duplicate of #20943.

@SinghAnkur28
Contributor Author

@AnkurSingh282000 Have you come across literature that applies pseudo R-squared to other models such as tree based models?

Thanks @thomasjpfan for bringing this up and for your time. I have also not seen this metric applied to any tree-based model. However, I have a small concern.

As @lorentzenchr suggested-

Such a metric d2_log_loss is proposed in #20943.

I cannot locate any literature about it (I also noticed other people facing the same issue). If you could help me a little, I could take a shot at it.
Thanks!

@lorentzenchr
Member

@SinghAnkur28 There is literature and if you are still interested, I can point you to it.
I would very much like it if you could give it a shot. I can assist in case you get stuck.


5 participants