The Wayback Machine - https://web.archive.org/web/20230613161328/https://github.com/scikit-learn/scikit-learn/issues/25982

Add functions for calculating log-likelihood and null log-likelihood #25982

Closed
SinghAnkur28 opened this issue Mar 27, 2023 · 12 comments

@SinghAnkur28
Contributor

Describe the workflow you want to enable

As a user of scikit-learn, I want to calculate McFadden's pseudo R-squared for a binary logistic regression model. For that, I need both the log-likelihood and the null log-likelihood.

Describe your proposed solution

I use the following functions and propose adding them to the library.

For the log-likelihood:

import numpy as np
from sklearn.linear_model import LogisticRegression

def log_likelihood(model, X, y):
    """Calculate the log-likelihood of a binary logistic regression model.

    Parameters
    ----------
    model : sklearn.linear_model.LogisticRegression
        A trained binary logistic regression model.
    X : array-like, shape (n_samples, n_features)
        Feature matrix.
    y : array-like, shape (n_samples,)
        Binary class labels.

    Returns
    -------
    log_likelihood : float
        The log-likelihood of the model.
    """

    # Predicted probability of the positive class
    pred_probs = model.predict_proba(X)[:, 1]

    # Clip to avoid log(0) when a probability underflows to exactly 0 or 1
    eps = np.finfo(pred_probs.dtype).eps
    pred_probs = np.clip(pred_probs, eps, 1 - eps)

    # Sum of per-sample log-likelihood contributions
    y = np.asarray(y)
    log_likelihood = np.sum(y * np.log(pred_probs) + (1 - y) * np.log(1 - pred_probs))

    return log_likelihood

For the null log-likelihood:

import numpy as np
from sklearn.metrics import log_loss

def null_log_likelihood(y):
    """Calculate the null log-likelihood of a binary logistic regression model.

    Parameters
    ----------
    y : array-like, shape (n_samples,)
        Binary class labels.

    Returns
    -------
    null_log_likelihood : float
        The null log-likelihood of the model.
    """

    y = np.asarray(y)

    # Proportion of positive class labels: the prediction of an intercept-only model
    p1 = y.mean()

    # Constant predicted probability equal to that base rate
    probs = np.full(y.shape, p1)

    # The negated, unnormalized log_loss gives the summed null log-likelihood
    null_log_likelihood = -log_loss(y, probs, normalize=False)

    return null_log_likelihood
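Putting the two quantities together, McFadden's pseudo R-squared can be sketched as follows. This is a minimal, self-contained sketch on synthetic data; the data and variable names are illustrative, not part of the proposal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Log-likelihood of the fitted model: the negated, unnormalized log loss
ll_model = -log_loss(y, model.predict_proba(X)[:, 1], normalize=False)

# Null log-likelihood: an intercept-only model predicts the base rate
ll_null = -log_loss(y, np.full_like(y, y.mean(), dtype=float), normalize=False)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
mcfadden_r2 = 1 - ll_model / ll_null
```

Both log-likelihoods are negative; a model that fits better than the base rate yields a ratio below one, so the score lands between 0 and 1.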

Describe alternatives you've considered, if relevant

One alternative to adding these functions to Scikit-learn would be for users to use statsmodels.
However, adding these functions to Scikit-learn would make it easier for users to calculate log-likelihood and null log-likelihood within the Scikit-learn ecosystem and would provide a standardized implementation.

Additional context

No response

@SinghAnkur28 SinghAnkur28 added the Needs Triage and New Feature labels Mar 27, 2023
@SinghAnkur28 SinghAnkur28 changed the title from "Add functions for calculating log-likelihood and null log-likelihood to Scikit-learn" to "Add functions for calculating log-likelihood and null log-likelihood" Mar 27, 2023
@enigdata

Would you like to open a PR on this for review?

@betatim
Member

betatim commented Apr 11, 2023

There already is a log_loss function in the metrics module of scikit-learn. Naively I'd assume this would also work for your case. Could you explain in a bit more detail why you can't use that function instead of adding a new one?

@SinghAnkur28
Contributor Author

Thank you for your comment, @betatim. I want to measure the goodness of fit of a logistic regression model. However, I think the log_loss function measures the difference between the predicted probabilities and the actual outcomes.

Alternatively, can we have a metric that directly calculates the pseudo R-squared?

@betatim
Member

betatim commented Apr 13, 2023

If I look at the code of your log_likelihood(model, X, y) I think you can perform the same computation using the log_loss that already exists.
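Concretely, the hand-rolled sum matches the negated, unnormalized log_loss. Here is a minimal sketch demonstrating the equivalence; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 2))
y = (rng.uniform(size=100) < 0.4).astype(int)

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# Hand-rolled binary log-likelihood, as in the proposed function
manual_ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same quantity via the existing metric: log_loss with normalize=False
# returns the summed negative log-likelihood, so negate it
sklearn_ll = -log_loss(y, p, normalize=False)
```

The two values agree up to floating-point tolerance (log_loss clips probabilities away from exactly 0 and 1, which is negligible here).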

@SinghAnkur28
Contributor Author

Thanks @betatim for your guidance. I'm new to the field, so that was immensely helpful.

I used the following code for my case:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

model = LogisticRegression()
model.fit(X, y)

y_pred_proba = model.predict_proba(X)[:, 1]

# Mean log loss of the null (base-rate) model and of the fitted model;
# the normalization cancels in the ratio below
ll_null = log_loss(y, [y.mean()] * len(y))
ll_model = log_loss(y, y_pred_proba)

pseudo_r2 = 1 - ll_model / ll_null

However, I was wondering if we have a built-in function to compute pseudo R-squared, and if not, what might be the reason for not having it?

@betatim
Member

betatim commented Apr 13, 2023

However, I was wondering if we have a built-in function to compute pseudo R-squared, and if not, what might be the reason for not having it?

I don't know the answer to that. I think there isn't a lot of "goodness of fit" tooling in scikit-learn because there are other libraries that do that.

@SinghAnkur28
Contributor Author

I think the reason for this may be that there are different approaches to computing pseudo R-squared. Moreover, different pseudo R-squared measures can have different interpretations and assumptions, which may not always apply in a particular context.

However, since scikit-learn already has r2_score, IMO the library should also have a pseudo R-squared for logistic regression (allowing the user to choose among the available indices) for consistency.
What do you think?

I'd be more than happy to work on it.

@thomasjpfan thomasjpfan added module:metrics Needs Decision - Include Feature Requires decision regarding including feature module:linear_model and removed Needs Triage Issue requires triage labels May 4, 2023
@thomasjpfan
Member

Metrics in sklearn.metrics must be general enough to work with all models, not just linear models. In practice, I have only seen pseudo R-squared applied to linear models. @AnkurSingh282000 Have you come across literature that applies pseudo R-squared to other models, such as tree-based models?

@lorentzenchr What do you think of providing a pseudo R-squared to the models in sklearn.linear_model?

@lorentzenchr
Member

Such a metric, d2_log_loss, is proposed in #20943. It's only a matter of someone implementing it.

I want to measure the goodness of fit of a logistic regression model. However, I think the log_loss function measures the difference between the predicted probabilities and the actual outcomes.

Note that a "good" GOF measure does exactly that: it measures some kind of "distance" between observations and predictions.
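The idea behind the proposed metric can be sketched from existing pieces: D-squared is the fraction of log-loss deviance explained relative to a base-rate model. This is a hedged sketch with an illustrative function name; no d2_log_loss function existed in scikit-learn at the time of this thread.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def d2_log_loss(y_true, y_prob):
    """Fraction of log-loss deviance explained, relative to a base-rate model."""
    # Null model: constant prediction at the observed positive-class rate
    null_prob = np.full(len(y_true), np.mean(y_true))
    return 1.0 - log_loss(y_true, y_prob) / log_loss(y_true, null_prob)

rng = np.random.RandomState(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
score = d2_log_loss(y, model.predict_proba(X)[:, 1])
```

A model that predicts better than the base rate scores between 0 and 1, mirroring how r2_score relates a regressor to a mean-only baseline.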

I think we can close this, then?

@thomasjpfan
Member

@lorentzenchr Okay that makes sense. I'm closing this issue as a duplicate of #20943.

@SinghAnkur28
Contributor Author

@AnkurSingh282000 Have you come across literature that applies pseudo R-squared to other models such as tree based models?

Thanks @thomasjpfan for bringing this up and for your time. I have also not seen this metric applied to any tree-based model. However, I have a small concern.

As @lorentzenchr suggested-

Such a metric d2_log_loss is proposed in #20943.

I cannot locate any literature about it (I also noticed other people facing the same issue). If you could help me a little, I could take a shot at it.
Thanks!

@lorentzenchr
Member

@SinghAnkur28 There is literature and if you are still interested, I can point you to it.
I would very much like it if you could give it a shot. I can assist in case you get stuck.


5 participants