FIX bagging with SGD and early stopping throws ZeroDivisionError #23275
Conversation
I wonder if we should not change the fit method of the estimator to physically drop rows with null weights prior to taking the validation split and the rest of the fit. However, this might introduce some other unanticipated effects.
At least this PR is minimal, so easier to review.
Could you please also add a changelog entry?
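For illustration, a minimal sketch of the alternative mentioned above (not what this PR does); the helper name drop_zero_weight_rows is made up for this example:

```python
import numpy as np

def drop_zero_weight_rows(X, y, sample_weight):
    # Keep only rows with strictly positive weight, so that neither the
    # validation split nor the rest of the fit ever sees a zero-weight row.
    sample_weight = np.asarray(sample_weight)
    keep = sample_weight > 0
    return X[keep], y[keep], sample_weight[keep]
```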
```diff
@@ -269,14 +269,17 @@ def _allocate_parameter_mem(
             self._standard_intercept.shape, dtype=np.float64, order="C"
         )

-    def _make_validation_split(self, y):
+    def _make_validation_split(self, y, sample_weight):
```
I think the code would be easier to follow if this variable were called sample_mask
and if this method were called as:
self._make_validation_split(y, sample_mask=sample_weight > 0)
Otherwise the caller might expect the sampling of the validation set to have probabilities proportional to the weights, which is not the case in our shuffle-based CV splitters.
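As a rough illustration of this suggestion, a minimal sketch (not the actual scikit-learn implementation; the function name and the validation_fraction parameter are assumptions here):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def make_validation_split(y, sample_mask, validation_fraction=0.1, random_state=0):
    # Only samples with a True mask (i.e. strictly positive weight) are
    # candidates for the validation set; the split itself is a plain
    # shuffle split, so weights do not influence selection probabilities.
    n_samples = y.shape[0]
    idx_nonzero = np.arange(n_samples)[sample_mask]
    splitter = ShuffleSplit(
        n_splits=1, test_size=validation_fraction, random_state=random_state
    )
    _, val_sub = next(splitter.split(idx_nonzero.reshape(-1, 1)))
    validation_mask = np.zeros(n_samples, dtype=bool)
    validation_mask[idx_nonzero[val_sub]] = True
    return validation_mask
```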
Updated as suggested :)
Thank you for the PR!
```python
idx_non_zero = np.arange(n_samples)[sample_mask]
y_ = y[sample_mask]
```
This can change the underlying model compared to main. On main, a subset of the samples in the validation set can have zero weight. As long as not all of them have zero weight, there will still be a score. In this PR, these zero weights are filtered out before the split.
Given that, I prefer to error or warn when this happens and suggest changing the random_state.
f"There are {cnt_zero_weight_val} samples with zero sample weight in" | ||
" the validation set, consider using a different random state." |
I think having some zero sample weights in the validation set is okay. I think the original issue was because all the samples weights were zero.
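For context, the failure mode is analogous to taking a weighted average with all-zero weights (a small illustration only; the actual error in SGD's early stopping comes from the weighted validation score, which may not use np.average exactly like this):

```python
import numpy as np

# Some zero weights are fine: the weighted average is still defined.
np.average([1.0, 2.0, 3.0], weights=[0.0, 1.0, 1.0])   # 2.5

# All-zero weights are not: numpy raises ZeroDivisionError
# ("Weights sum to zero, can't be normalized").
np.average([1.0, 2.0, 3.0], weights=[0.0, 0.0, 0.0])
```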
Hi @thomasjpfan, the warning is added because we're changing the default behaviour, as you suggested in the previous comment:
Given that, I prefer to error or warn when this happens and suggest changing the random_state.
In #23275 (comment), I mentioned:
As long as not all of them have zero weight, then there will still be a score.
To be clear, I meant to warn when all of the weights are zero. In other words:
```python
if not np.any(sample_mask[idx_val]):
    warnings.warn(...)
```
Hi @thomasjpfan, I've updated the code to raise an error instead of a warning, because a ZeroDivisionError will eventually be raised when the sample weights for the validation set are all 0, so I think it's useful to raise a more informative error beforehand.
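A hedged sketch of the kind of guard described here (illustrative only; the names sample_mask and idx_val follow the earlier snippets, and this is not the exact code merged in the PR):

```python
import numpy as np

def check_validation_weights(sample_mask, idx_val):
    # sample_mask is True where sample_weight > 0; idx_val holds the indices
    # chosen for the early-stopping validation set.
    if not np.any(sample_mask[idx_val]):
        raise ValueError(
            "The sample weights for validation set are all zero, consider"
            " using a different random state."
        )
```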
Minor comments, otherwise LGTM
"The sample weights for validation set are all zero, consider using a" | ||
" different random state." |
This is technically backward breaking, but I consider this more of a bug fix. I do not think it makes sense to have a validation set where every sample weight is 0.
I agree with that; it was already failing, so it's clearly a bug fix.
Thanks @MaxwellLZH. LGTM
Reference Issues/PRs
Fixes #17229.
What does this implement/fix? Explain your changes.
Trying to fix the issue by passing sample_weight to the _make_validation_split function, and only choosing samples with positive weight as validation data.
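For context, a minimal sketch of the kind of setup that triggered the original error (whether a given random_state actually hits the all-zero-weight validation split depends on the data and seeds):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Bagging passes bootstrap sample counts to SGDClassifier as sample weights,
# so out-of-bag rows get weight 0; with early stopping enabled, the internal
# validation split can then end up containing only zero-weight samples.
clf = BaggingClassifier(
    SGDClassifier(early_stopping=True, validation_fraction=0.3, random_state=0),
    n_estimators=10,
    random_state=0,
)
clf.fit(X, y)  # could raise ZeroDivisionError before this fix
```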
Other comment
There's another open PR #17435 that raises an error when SGD with early stopping is used inside bagging, which might not be the best choice for user experience in my opinion.