FIX compute y_std properly with multi-target in GPR #20761

Merged
merged 24 commits into scikit-learn:main on Oct 20, 2021

Conversation

@patrickctrf patrickctrf (Contributor) commented Aug 17, 2021

Reference Issues/PRs

Fixes #17394. Fixes #18065.

What does this implement/fix? Explain your changes.

GaussianProcessRegressor could not retrieve y_std from
predict(X, return_std=True) when n_targets was greater than 1.
This happened because line 415 of
"sklearn/gaussian_process/_gpr.py" computed
y_var * self._y_train_std ** 2 with a plain elementwise multiplication (a1 * a2).

However, that multiplication fails when self._y_train_std has more than one
entry (i.e. when n_targets > 1), so it needs to be implemented with
np.outer instead, which applies the conventional scalar-array
multiplication for EACH output feature (self._y_train_std contains one
normalization scale per output feature).

After this fix, each output feature receives its respective rescaled
variance when return_std=True.

A simple example of how to train a GPR and reproduce the error:

#=====breaking-code-before-fix============================================

import numpy as np
import sklearn.gaussian_process as gp

X_train = np.random.rand(11, 10)
y_train = np.random.rand(11, 6)  # 6 target features == ERROR

X_test = np.random.rand(4, 10)

kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))

model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=True)

model.fit(X_train, y_train)
params = model.kernel_.get_params()

y_pred, std = model.predict(X_test, return_std=True)  # raised before the fix

#=====end-of-breaking-code-before-fix======================================


#=====well-running-code-before-fix============================================

import numpy as np
import sklearn.gaussian_process as gp

X_train = np.random.rand(11, 1)
y_train = np.random.rand(11, 1)  # 1 target feature == FINE

X_test = np.random.rand(4, 1)

kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))

model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=True)

model.fit(X_train, y_train)
params = model.kernel_.get_params()

y_pred, std = model.predict(X_test, return_std=True)

#=====end-of-well-running-code-before-fix======================================

Line 415 in file "sklearn/gaussian_process/_gpr.py":

# before
y_var = y_var * self._y_train_std ** 2

# after
y_var = np.outer(y_var, self._y_train_std ** 2)
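
For reference, here is a minimal standalone illustration of the shape problem (illustrative only, not the scikit-learn source): y_var holds one variance per test sample while self._y_train_std holds one scale per target, so the elementwise product cannot broadcast when n_targets > 1, whereas np.outer produces one scaled variance column per target.

import numpy as np

n_samples, n_targets = 4, 6
y_var = np.random.rand(n_samples)        # shape (4,): one variance per sample
y_train_std = np.random.rand(n_targets)  # shape (6,): one scale per target

# Elementwise product fails: shapes (4,) and (6,) cannot broadcast.
# y_var * y_train_std ** 2  ->  ValueError

# Outer product scales every sample variance by every target's scale.
y_var_scaled = np.outer(y_var, y_train_std ** 2)
print(y_var_scaled.shape)  # (4, 6) == (n_samples, n_targets)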

@glemaitre glemaitre (Contributor) commented Aug 17, 2021

Tests are failing, you need to:

  • reshape to (n_samples,) instead of (n_samples, 1) for a single target (this is the reason why the tests fail; a sketch follows this list)
  • add a non-regression test
  • add an entry in the what's new
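
One way to do the reshape mentioned in the first item above (a standalone sketch; the exact variable handling in _gpr.py may differ):

import numpy as np

# After switching to np.outer, y_var has shape (n_samples, n_targets) even
# for a single target, e.g. (n_samples, 1), while the existing tests expect
# the historical 1-D shape (n_samples,) in that case.
y_var = np.random.rand(4, 1)          # single-target result of np.outer
if y_var.shape[1] == 1:
    y_var = np.squeeze(y_var, axis=1)  # back to (n_samples,)
print(y_var.shape)  # (4,)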

@glemaitre glemaitre changed the title from "fix: GaussianProcessRegressor fails to compute y_std when n_targets > 1" to "FIX compute y_std properly with multi-target in GPR" Aug 17, 2021
@patrickctrf patrickctrf (Contributor, Author) commented Aug 24, 2021

Tests are failing, you need to:

  • reshape to (n_samples,) instead of (n_samples, 1) for a single target (this is the reason why the tests fail)
  • add a non-regression test
  • add an entry in the what's new

Thanks for your review! What is a non-regression test? I'm only editing GPR, not GPC (the classifier).
Also, where is the "what's new" section? I couldn't find it, sorry. This is my first contribution.

@glemaitre glemaitre (Contributor) commented Aug 30, 2021

We are still missing a non-regression test and an entry in the what's new.

@glemaitre glemaitre (Contributor) commented Sep 2, 2021

You can read the following regarding regression testing

Here, the idea is to write a test function that checks that code which was previously failing now gives the proper result. Basically, it corresponds to isolating the code that you put in your first comment (breaking-code-before-fix) and adding some asserts. This function should be implemented in test_gpr.py.
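
For concreteness, a minimal sketch of what such a non-regression test in test_gpr.py might look like (the function name and exact assertions are illustrative, not the final committed test):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor, kernels


def test_predict_std_with_multi_target_and_normalize_y():
    # Non-regression check: predict(..., return_std=True) used to raise
    # when normalize_y=True and y has more than one target.
    rng = np.random.RandomState(42)
    n_samples, n_features, n_targets = 12, 10, 6

    X_train = rng.randn(n_samples, n_features)
    y_train = rng.randn(n_samples, n_targets)
    X_test = rng.randn(n_samples, n_features)

    kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3)) * kernels.RBF(10.0, (1e-3, 1e3))
    model = GaussianProcessRegressor(kernel=kernel, alpha=0.1, normalize_y=True)
    model.fit(X_train, y_train)

    y_pred, y_std = model.predict(X_test, return_std=True)
    assert y_pred.shape == (n_samples, n_targets)
    assert y_std.shape == (n_samples, n_targets)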

Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@glemaitre glemaitre added this to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Sep 6, 2021
@patrickctrf patrickctrf (Contributor, Author) commented Sep 9, 2021

I am getting a linting error after adding an entry in whats_new and the non-regression test.
Can you tell me where the error is happening? I've tried different linting inspections but had no success.

@glemaitre glemaitre (Contributor) commented Sep 9, 2021

You can see the failure there. You can also click on the different "Details" links, which will land on this page.

The issue is that you need to run black on the file such that no style diff is detected by the CI. The following command would do the job:

black -t py37 sklearn/gaussian_process/tests/test_gpr.py

and commit the changes. A handy alternative to making this change manually is to install pre-commit as mentioned in the contributing guide (item 9).

Check if GPR can compute y_std in predict() method when normalize_y==True
in multi-target regression.
"""
X_train = np.random.rand((11, 10))
@glemaitre glemaitre (Contributor) Sep 9, 2021

Your test is failing as well: cf. there

Here it would be best to define a random state and then you can just use randn:

rng = np.random.RandomState(42)

n_samples, n_features, n_targets = 12, 10, 6

X_train = rng.randn(n_samples, n_features)
y_train = rng.randn(n_samples, n_targets)
X_test = rng.randn(n_samples, n_features)

# Generic kernel
kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3))
kernel *= kernels.RBF(10.0, (1e-3, 1e3))
@glemaitre glemaitre (Contributor) Sep 9, 2021

Suggested change
# Generic kernel
kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3))
kernel *= kernels.RBF(10.0, (1e-3, 1e3))
# Generic kernel
kernel = (
    kernels.ConstantKernel(1.0, (1e-1, 1e3))
    * kernels.RBF(10.0, (1e-3, 1e3))
)

in multi-target regression.
"""
X_train = np.random.rand((11, 10))
# 6 target features -> multi-target
@glemaitre glemaitre (Contributor) Sep 9, 2021

No need for this comment. Variable names and the code should be as self-explanatory as possible.

kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3))
kernel *= kernels.RBF(10.0, (1e-3, 1e3))

# normalize_y == True
@glemaitre glemaitre (Contributor) Sep 9, 2021

remove this comment

alpha=0.1,
normalize_y=True)
model.fit(X_train, y_train)
y_pred, std = model.predict(X_test, return_std=True)
@glemaitre glemaitre (Contributor) Sep 9, 2021

Suggested change
y_pred, std = model.predict(X_test, return_std=True)
y_pred, y_std = model.predict(X_test, return_std=True)

@glemaitre glemaitre (Contributor) Sep 9, 2021

It is true that the previous code was failing already. But we could make a couple of assertions regarding the shape of the returned values.

assert y_pred.shape == (n_samples, n_targets)

and something similar for y_std
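
Spelled out, the pair of shape checks could look like this (assuming the n_samples and n_targets names from the suggested test setup above):

assert y_pred.shape == (n_samples, n_targets)
assert y_std.shape == (n_samples, n_targets)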

"""
Regression test for issues #17394 and #18065.
Check if GPR can compute y_std in predict() method when normalize_y==True
in multi-target regression.
"""
@glemaitre glemaitre (Contributor) Sep 9, 2021

Suggested change
"""
Regression test for issues #17394 and #18065.
Check if GPR can compute y_std in predict() method when normalize_y==True
in multi-target regression.
"""
"""Check that `y_std` is properly computed when `normalize_y=True`.
Non-regression test for:
https://github.com/scikit-learn/scikit-learn/issues/17394
https://github.com/scikit-learn/scikit-learn/issues/18065
"""

:mod:`sklearn.gaussian_process`
.........................

- |Fix| Compute 'y_std' properly with multi-target in
@glemaitre glemaitre (Contributor) Sep 9, 2021

Suggested change
- |Fix| Compute 'y_std' properly with multi-target in
- |Fix| Compute `y_std` properly with multi-target in

:mod:`sklearn.gaussian_process`
.........................
@glemaitre glemaitre (Contributor) Sep 9, 2021

Suggested change
:mod:`sklearn.gaussian_process`
.........................
:mod:`sklearn.gaussian_process`
...............................

@glemaitre glemaitre (Contributor) Sep 9, 2021

We will leave this entry here for the moment, but we will most probably have to move it to a 1.0.1 section.

@patrickctrf patrickctrf (Contributor, Author) commented Sep 9, 2021

Thanks for your support, @glemaitre. Just one more question: when I ran black, other functions in test_gpr.py were reformatted as well. Is it ok, or should I change them back to their original state?

@glemaitre glemaitre (Contributor) commented Sep 9, 2021

Is it ok, or should I change them back to their original state?

I think it is fine, because the same command will be run on the CI and should therefore lead to the same changes.

@glemaitre glemaitre (Contributor) commented Sep 9, 2021

I relaunched the failing build, but it is most certainly only a timeout. Everything else should be fine.

@patrickctrf patrickctrf (Contributor, Author) commented Sep 10, 2021

Do I need to do anything else?

@glemaitre glemaitre (Contributor) commented Sep 13, 2021

everything looks fine

@glemaitre glemaitre (Contributor) commented Sep 16, 2021

Tagging this PR for 1.0.1 to include it in the next bug fix release

@glemaitre glemaitre added this to the 1.0.1 milestone Sep 16, 2021
@patrickctrf patrickctrf (Contributor, Author) commented Sep 28, 2021

Ok. It is asking for changes in the whats_new section. Should I change it?

@jjerphan jjerphan (Member) left a comment

Thank you @patrickctrf!

LGTM with a few modifications for the change log and to follow conventions.

sklearn/gaussian_process/tests/test_gpr.py (outdated review thread; resolved)
kernel=kernel,
n_restarts_optimizer=n_restarts_optimizer,
random_state=0,
kernel=kernel, n_restarts_optimizer=n_restarts_optimizer, random_state=0
@jjerphan jjerphan (Member) Sep 30, 2021

This is unrelated to this PR's motives and also comes with no semantic changes.
Has black changed it? If not, could you revert this change, please?

@patrickctrf patrickctrf (Contributor, Author) Oct 9, 2021

Yes, black changed it.

@jjerphan jjerphan (Member) Oct 11, 2021

Can you try to revert those changes? Normally it should be possible to keep the original lines.

@patrickctrf patrickctrf (Contributor, Author) Oct 12, 2021

Sure! Done

theta_opt, func_min = (
initial_theta,
obj_func(initial_theta, eval_gradient=False),
@jjerphan jjerphan (Member) Sep 30, 2021

Same comment as above.

@patrickctrf patrickctrf (Contributor, Author) Oct 9, 2021

Black changed it too.

@jjerphan jjerphan (Member) Oct 11, 2021

Identically, can you try to revert those changes? Normally it should be possible to keep the original lines.

@patrickctrf patrickctrf (Contributor, Author) Oct 12, 2021

Sure! Done

doc/whats_new/v1.0.rst (outdated review thread; resolved)
sklearn/gaussian_process/_gpr.py (outdated review thread; resolved)
@glemaitre glemaitre self-assigned this Oct 14, 2021
@jeremiedbb jeremiedbb (Member) left a comment

Thanks @patrickctrf. The same issue appears in the if return_cov branch: y_cov has shape (n_samples, n_samples) but should have shape (n_samples, n_samples, n_targets). Would you mind fixing that too? np.einsum("ij,k->ijk", y_cov, self._y_train_std ** 2) followed by an np.squeeze will do it.
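
A minimal standalone sketch of the suggested einsum approach (shapes only; the variable names here are illustrative, not the final patch):

import numpy as np

n_samples, n_targets = 4, 6
y_cov = np.random.rand(n_samples, n_samples)  # (n_samples, n_samples)
y_train_std = np.random.rand(n_targets)       # one scale per target

# Scale the covariance by each target's variance: one covariance matrix
# per target, stacked along the last axis.
y_cov_multi = np.einsum("ij,k->ijk", y_cov, y_train_std ** 2)
print(y_cov_multi.shape)  # (4, 4, 6)

# With a single target, np.squeeze drops the trailing axis back to (4, 4).
y_cov_single = np.squeeze(np.einsum("ij,k->ijk", y_cov, np.ones(1)))
print(y_cov_single.shape)  # (4, 4)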

sklearn/gaussian_process/_gpr.py (outdated review thread; resolved)
@jeremiedbb jeremiedbb self-requested a review Oct 20, 2021
@jeremiedbb jeremiedbb (Member) left a comment

LGTM. Thanks @patrickctrf!

@jeremiedbb jeremiedbb merged commit 9b210ae into scikit-learn:main Oct 20, 2021
32 checks passed
@patrickctrf patrickctrf (Contributor, Author) commented Oct 21, 2021

Thanks for your support, guys!
