The Wayback Machine - https://web.archive.org/web/20220525153606/https://github.com/scikit-learn/scikit-learn/pull/20653

FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data #20653

Merged (24 commits) on Mar 24, 2022

Conversation

@thomasjpfan (Member) commented Aug 1, 2021

Reference Issues/PRs

Fixes #14959
Closes #15385 (supersedes it)

What does this implement/fix? Explain your changes.

Checks the value of the negative log-likelihood and issues a warning. We have common tests and other tests that input this type of data, so I think a warning is okay.

EDIT: this PR now rejects the problematic lambda that would lead to constant transformed data causing the problem.
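The idea behind rejecting the problematic lambda can be sketched with a simplified Yeo-Johnson negative log-likelihood (the function below is illustrative, not sklearn's actual private helper): any lambda whose transform collapses to (near-)constant data gets an infinite cost, so the optimizer can never select it.

```python
import numpy as np
from scipy import stats

def neg_log_likelihood(lmbda, x):
    # Sketch of the fix: reject lambdas that make the transformed
    # data (near-)constant instead of letting np.log(var) blow up.
    x_trans = stats.yeojohnson(x, lmbda)
    n_samples = x.shape[0]
    x_trans_var = x_trans.var()
    if x_trans_var < np.finfo(x.dtype).tiny:
        return np.inf  # this lambda can never be the argmin
    loglike = -n_samples / 2 * np.log(x_trans_var)
    loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
    return -loglike
```

A minimizer such as scipy.optimize.brent then simply skips over the rejected lambdas while searching.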

Any other comments?

The original issue has been bumped a few times; let's see if we can resolve it for 1.0.
CC @NicolasHug

@NicolasHug (Member) left a comment

Thanks @thomasjpfan for the PR

With the proposed changes, we'll still get the ZeroDivisionWarnings, right? I'm wondering if erroring would make more sense.

Also, this new logic seems to assume that we can only get an infinite lambda when we have x_trans full of zeros. Are we sure about this?

(2 resolved review comments on sklearn/preprocessing/_data.py)
@ogrisel (Member) left a comment

Here are a few comments.

In retrospect, I wonder if we should set lmbda = np.nan when the brent optimization finds an infinite nll, instead of using an arbitrary lambda value that depends on optimizer details.

For those columns with np.nan lambdas we could then skip the Yeo-Johnson transformation (but keep the subsequent StandardScaler when standardize=True). StandardScaler should be able to deal with near-constant features in a numerically principled way.

We could also have a constructor parameter to silence the warning, since in practice many users might find it useful to just center constant or near-constant features.
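As a quick check of that last point, StandardScaler already guards against dividing by a vanishing scale for constant columns; a minimal sketch against current scikit-learn behavior:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A strictly constant feature: the fitted scale_ for a zero-variance
# column is clipped to 1, so the transform centers it to exact zeros
# instead of dividing by sqrt(0).
X = np.full((5, 1), 3.0)
scaler = StandardScaler().fit(X)
X_t = scaler.transform(X)
```

This is why only standardizing the nan-lambda columns would be numerically safe.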

(resolved review comments on doc/whats_new/v1.0.rst, sklearn/preprocessing/_data.py, and sklearn/preprocessing/tests/test_data.py)
@ogrisel (Member) commented Aug 4, 2021

Actually there are two different cases to handle:

  • First case: the feature is not actually constant, and is significantly non-Gaussian before the transform, but becomes constant for some values of lambda explored by the optimizer. This is the case for the original dataset reported by the OP in #14959 (comment), and the proposed solution is arguably good in this case: reject those solutions by returning np.inf instead of -np.inf. Based on the histograms reported in the description of #14959, it seems to yield valid, non-constant results that look approximately Gaussian after the transform.

  • Second case: features where x_trans has zero variance for all possible values of lambda explored by brent (most probably because the input feature is constant or near constant anyway). Then we could set lambda to np.nan and only standardize those columns, probably with a silenceable warning message.
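For context, the auto-tuning these two cases concern is a one-dimensional search for lambda. A rough sketch using SciPy's public Yeo-Johnson helpers (sklearn's actual code minimizes its own private log-likelihood over the same brack=(-2, 2)):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.RandomState(0)
x = rng.lognormal(size=200)  # significantly non-Gaussian feature

def nll(lmbda):
    # Negative Yeo-Johnson log-likelihood; brent minimizes it
    return -stats.yeojohnson_llf(lmbda, x)

lmbda = optimize.brent(nll, brack=(-2, 2))
x_trans = stats.yeojohnson(x, lmbda)  # non-constant in the first case
```

In the first case some intermediate lambdas visited by brent yield constant x_trans; in the second case every lambda does.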

@thomasjpfan changed the title from "FIX Adds a warning for PowerTransform and pathological data" to "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" on Aug 16, 2021
@thomasjpfan (Member, Author) commented Aug 16, 2021

Since there are two cases here, I updated this PR to resolve the first case: significantly non-Gaussian data where some lambdas result in constant transformed data.

I will follow up with a PR for the second case:

features where x_trans has zero variance for all possible values of lambda explored by brent,

ogrisel previously approved these changes Aug 20, 2021

@ogrisel (Member) left a comment

LGTM, thanks @thomasjpfan. @NicolasHug do you agree with the analysis and this 2-step strategy?

@ogrisel changed the title from "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" to "FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data" on Dec 5, 2021
@ogrisel (Member) commented Dec 5, 2021

Still +1 for this PR. Maybe ping @adrinjalali @glemaitre @jnothman for a second review?

@jnothman (Member) left a comment

Looks great, apart from wanting confidence that try-except is the best way to do it.

(resolved review comment on sklearn/preprocessing/_data.py)
@thomasjpfan added this to the 1.1 milestone on Mar 10, 2022
    # Reject transformed data that is constant
    if x_trans_var < x_tiny:
        return np.inf
@jeremiedbb (Member) commented Mar 14, 2022


tiny is really small and will probably not catch what should be considered constant data in some cases.
What do you think about using _is_constant_feature (maybe changing the name), which is designed for that?

@thomasjpfan (Member, Author) commented Mar 17, 2022


I updated the comment. This check is more for detecting the runtime warning in np.log on the line below: as long as np.log(variance) can be computed, the likelihood can be computed as well.

It turns out that np.log can handle values below tiny as well:

import numpy as np

np.log(np.finfo(np.float64).tiny * 1e-15)
# -742.83

np.log even works down to the smallest subnormal (smallest_subnormal was introduced in NumPy 1.22)

import numpy as np

finfo = np.finfo(np.float64)
finfo.smallest_subnormal
# 5e-324

np.log(finfo.smallest_subnormal)
# -744.44

np.log(finfo.smallest_subnormal * 0.5)
# -inf

The x_trans_var < x_tiny check is used because of a valid threading concern raised here: #20653 (comment). Originally I caught the runtime warning, but catch_warnings is not thread-safe.
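The thread-safety point is why the fix checks the variance explicitly rather than catching the warning: warnings.catch_warnings swaps the interpreter-wide warning filters, which the Python docs note is not thread-safe. A minimal sketch of the guard (the helper name is hypothetical):

```python
import numpy as np

def log_variance_or_none(x_trans):
    # Explicit up-front guard instead of warnings.catch_warnings:
    # no process-global state is touched, so concurrent fits in
    # other threads cannot race on the warning filters.
    x_trans_var = x_trans.var()
    if x_trans_var < np.finfo(x_trans.dtype).tiny:
        return None  # caller rejects this lambda
    return np.log(x_trans_var)
```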

@jeremiedbb (Member) commented Mar 17, 2022


I updated the comment. This is more for detecting the runtime warning in np.log in the line below. As long as the np.log(variance) can be computed the likelihood can be computed as well.

What I meant is that even if computable, its value would be meaningless. The variance is so small that it lies within the theoretical error bounds, meaning it's indistinguishable from a zero variance.

However, this situation should not appear very often, and even if it does, this lambda would not be the argmin anyway (unless all lambdas lead to constant x), so I'm ok with the tiny solution as well.

(resolved review comments on sklearn/preprocessing/_data.py and sklearn/preprocessing/tests/test_data.py)
@jeremiedbb (Member) left a comment

LGTM

@jeremiedbb dismissed ogrisel's stale review on Mar 18, 2022

This fix is not the same as the initial one. @ogrisel you might want to take another look

@ogrisel (Member) left a comment

Still +1. We might want to add proper support for float32 input data later.

@jeremiedbb (Member) commented Mar 24, 2022

We might want to add proper support for float32 input data later.

numpy.var always uses float64 accumulator

@ogrisel (Member) commented Mar 24, 2022

Ok but we will probably need to increase the test coverage with the global_dtype fixture.

@jeremiedbb merged commit c3f81c1 into scikit-learn:main on Mar 24, 2022
15 checks passed
@jeremiedbb (Member) commented Mar 24, 2022

Thanks @thomasjpfan !

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this issue Apr 6, 2022
@mdhaber commented Apr 10, 2022

@thomasjpfan would you be willing to submit this to SciPy, too, to resolve scipy/scipy#10821?

6 participants