The Wayback Machine - https://web.archive.org/web/20220525153606/https://github.com/scikit-learn/scikit-learn/pull/20653

FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data #20653

Merged (24 commits) on Mar 24, 2022

Conversation

@thomasjpfan (Member) commented Aug 1, 2021

Reference Issues/PRs

Fixes #14959
Closes #15385 (supersedes it)

What does this implement/fix? Explain your changes.

Checks the value of the negative log-likelihood and issues a warning. We have common tests and other tests that input this type of data, so I think a warning is okay.

EDIT: this PR now rejects the problematic lambda that would lead to constant transformed data causing the problem.
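The idea behind rejecting the problematic lambda can be sketched with a simplified Yeo-Johnson negative log-likelihood (the function below is illustrative, not sklearn's actual private helper): any lambda whose transform collapses to (near-)constant data gets an infinite cost, so the optimizer can never select it.

```python
import numpy as np
from scipy import stats

def neg_log_likelihood(lmbda, x):
    # Sketch of the fix: reject lambdas that make the transformed
    # data (near-)constant instead of letting np.log(var) blow up.
    x_trans = stats.yeojohnson(x, lmbda)
    n_samples = x.shape[0]
    x_trans_var = x_trans.var()
    if x_trans_var < np.finfo(x.dtype).tiny:
        return np.inf  # this lambda can never be the argmin
    loglike = -n_samples / 2 * np.log(x_trans_var)
    loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
    return -loglike
```

A minimizer such as scipy.optimize.brent then simply skips over the rejected lambdas while searching.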

Any other comments?

The original issue has been bumped a few times; let's see if we can resolve it for 1.0.
CC @NicolasHug

@NicolasHug (Member) left a comment

Thanks @thomasjpfan for the PR

With the proposed changes, we'll still get the ZeroDivisionWarnings, right? I'm wondering if erroring would make more sense.

Also, this new logic seems to assume that we can only get an infinite lambda when we have x_trans full of zeros. Are we sure about this?

(2 resolved review comments on sklearn/preprocessing/_data.py)
@ogrisel (Member) left a comment

Here are a few comments.

In retrospect, I wonder if we should set lmbda = np.nan when the brent optimization finds an infinite nll, instead of using an arbitrary lambda value that depends on optimizer details.

For those columns with np.nan lambdas we could then skip the Yeo-Johnson transformation (but keep the subsequent StandardScaler when standardize=True). StandardScaler should be able to deal with near-constant features in a numerically principled way.

We could also have a constructor parameter to silence the warning, since in practice many users might find it useful to just center constant or near-constant features.
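As a quick check of that last point, StandardScaler already guards against dividing by a vanishing scale for constant columns; a minimal sketch against current scikit-learn behavior:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A strictly constant feature: the fitted scale_ for a zero-variance
# column is clipped to 1, so the transform centers it to exact zeros
# instead of dividing by sqrt(0).
X = np.full((5, 1), 3.0)
scaler = StandardScaler().fit(X)
X_t = scaler.transform(X)
```

This is why only standardizing the nan-lambda columns would be numerically safe.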

(resolved review comments on doc/whats_new/v1.0.rst, sklearn/preprocessing/_data.py, and sklearn/preprocessing/tests/test_data.py)
@ogrisel (Member) commented Aug 4, 2021

Actually there are two different cases to handle:

  • First case: the feature is not actually constant, and is significantly non-Gaussian before the transform, but becomes constant for some values of lambda explored by the optimizer. This is the case for the original dataset reported by the OP in #14959 (comment), and the proposed solution is arguably good in this case: reject those solutions by returning np.inf instead of -np.inf. Based on the histograms reported in the description of #14959, it seems to yield valid, non-constant results that look approximately Gaussian after the transform.

  • Second case: features where x_trans has zero variance for all possible values of lambda explored by brent (most probably because the input feature is constant or near constant anyway). Then we could set lambda to np.nan and only standardize those columns, probably with a silenceable warning message.
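For context, the auto-tuning these two cases concern is a one-dimensional search for lambda. A rough sketch using SciPy's public Yeo-Johnson helpers (sklearn's actual code minimizes its own private log-likelihood over the same brack=(-2, 2)):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.RandomState(0)
x = rng.lognormal(size=200)  # significantly non-Gaussian feature

def nll(lmbda):
    # Negative Yeo-Johnson log-likelihood; brent minimizes it
    return -stats.yeojohnson_llf(lmbda, x)

lmbda = optimize.brent(nll, brack=(-2, 2))
x_trans = stats.yeojohnson(x, lmbda)  # non-constant in the first case
```

In the first case some intermediate lambdas visited by brent yield constant x_trans; in the second case every lambda does.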

@thomasjpfan changed the title from "FIX Adds a warning for PowerTransform and pathological data" to "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" on Aug 16, 2021
@thomasjpfan (Member, Author) commented Aug 16, 2021

Since there are two cases here, I updated this PR to resolve the first case: significantly non-Gaussian data where some lambdas result in constant transformed data.

I will follow up with a PR for the second case:

features where x_trans has zero variance for all possible values of lambda explored by brent,

ogrisel previously approved these changes Aug 20, 2021

@ogrisel (Member) left a comment

LGTM, thanks @thomasjpfan. @NicolasHug do you agree with the analysis and this 2-step strategy?

@ogrisel changed the title from "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" to "FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data" on Dec 5, 2021
@ogrisel (Member) commented Dec 5, 2021

Still +1 for this PR. Maybe ping @adrinjalali @glemaitre @jnothman for a second review?

@jnothman (Member) left a comment

Looks great, apart from wanting confidence that try-except is the best way to do it.

(resolved review comment on sklearn/preprocessing/_data.py)
@thomasjpfan added this to the 1.1 milestone on Mar 10, 2022
    # Reject transformed data that is constant
    if x_trans_var < x_tiny:
        return np.inf
@jeremiedbb (Member) commented Mar 14, 2022


tiny is really small and will probably not catch what should be considered constant data in some cases.
What do you think about using _is_constant_feature (maybe changing the name), which is designed for that?

@thomasjpfan (Member, Author) commented Mar 17, 2022


I updated the comment. This check is more for detecting the runtime warning in np.log on the line below: as long as np.log(variance) can be computed, the likelihood can be computed as well.

It turns out that np.log can handle values below tiny as well:

import numpy as np

np.log(np.finfo(np.float64).tiny * 1e-15)
# -742.83

np.log even works down to the smallest subnormal (smallest_subnormal was introduced in NumPy 1.22)

import numpy as np

finfo = np.finfo(np.float64)
finfo.smallest_subnormal
# 5e-324

np.log(finfo.smallest_subnormal)
# -744.44

np.log(finfo.smallest_subnormal * 0.5)
# -inf

The x_trans_var < x_tiny check is used because of a valid threading concern raised here: #20653 (comment). Originally I caught the runtime warning, but catch_warnings is not thread-safe.
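The thread-safety point is why the fix checks the variance explicitly rather than catching the warning: warnings.catch_warnings swaps the interpreter-wide warning filters, which the Python docs note is not thread-safe. A minimal sketch of the guard (the helper name is hypothetical):

```python
import numpy as np

def log_variance_or_none(x_trans):
    # Explicit up-front guard instead of warnings.catch_warnings:
    # no process-global state is touched, so concurrent fits in
    # other threads cannot race on the warning filters.
    x_trans_var = x_trans.var()
    if x_trans_var < np.finfo(x_trans.dtype).tiny:
        return None  # caller rejects this lambda
    return np.log(x_trans_var)
```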

@jeremiedbb (Member) commented Mar 17, 2022


I updated the comment. This is more for detecting the runtime warning in np.log in the line below. As long as the np.log(variance) can be computed the likelihood can be computed as well.

What I meant is that even if computable, its value would be meaningless. The variance is so small that it lies within the theoretical error bounds, meaning it's indistinguishable from a zero variance.

However, this situation should not appear very often, and even if it does, this lambda would not be the argmin anyway (unless all lambdas lead to constant x), so I'm ok with the tiny solution as well.

(resolved review comments on sklearn/preprocessing/_data.py and sklearn/preprocessing/tests/test_data.py)
@jeremiedbb (Member) left a comment

LGTM

@jeremiedbb dismissed ogrisel's stale review on Mar 18, 2022

This fix is not the same as the initial one. @ogrisel you might want to take another look

@ogrisel (Member) left a comment

Still +1. We might want to add proper support for float32 input data later.

@jeremiedbb (Member) commented Mar 24, 2022

We might want to add proper support for float32 input data later.

numpy.var always uses float64 accumulator

@ogrisel (Member) commented Mar 24, 2022

Ok but we will probably need to increase the test coverage with the global_dtype fixture.

@jeremiedbb merged commit c3f81c1 into scikit-learn:main on Mar 24, 2022
15 checks passed
@jeremiedbb (Member) commented Mar 24, 2022

Thanks @thomasjpfan !

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this issue Apr 6, 2022
@mdhaber commented Apr 10, 2022

@thomasjpfan would you be willing to submit this to SciPy, too, to resolve scipy/scipy#10821?

6 participants