
FIX bagging with SGD and early stopping throws ZeroDivisionError #23275

Merged

Conversation

Contributor

@MaxwellLZH commented May 4, 2022

Reference Issues/PRs

Fixes #17229.
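
For context, a minimal reproduction of the linked issue could look like the sketch below; the exact parameters (n_estimators, max_samples, random_state) are illustrative and may need adjusting to trigger the failure.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Bagging forwards its bootstrap resampling to SGDClassifier as integer
# sample weights, so most rows end up with weight 0 when max_samples is small.
# If the early-stopping validation split then contains only zero-weight rows,
# the weighted validation score divides by zero.
clf = BaggingClassifier(
    SGDClassifier(early_stopping=True, random_state=0),
    n_estimators=10,
    max_samples=0.1,
    random_state=0,
)
clf.fit(X, y)  # could raise ZeroDivisionError before this fix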

What does this implement/fix? Explain your changes.

This PR tries to fix the issue by passing sample_weight to the _make_validation_split function and only choosing samples with positive weight as validation data.
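
A rough sketch of that idea, with illustrative names (this is not the exact code in the diff): restrict the shuffle-based validation split to the rows whose weight is strictly positive, then map the selected indices back to the full sample set.

import numpy as np
from sklearn.model_selection import ShuffleSplit

def make_validation_split(y, sample_weight, validation_fraction=0.1, random_state=None):
    # Illustrative helper: draw the early-stopping validation set only from
    # rows whose sample weight is strictly positive.
    n_samples = y.shape[0]
    idx_positive = np.flatnonzero(sample_weight > 0)

    splitter = ShuffleSplit(
        n_splits=1, test_size=validation_fraction, random_state=random_state
    )
    _, idx_val_local = next(splitter.split(idx_positive.reshape(-1, 1)))

    # Boolean mask over all samples, True for the selected validation rows.
    validation_mask = np.zeros(n_samples, dtype=bool)
    validation_mask[idx_positive[idx_val_local]] = True
    return validation_mask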

Other comment

There's another open PR, #17435, which raises an error when SGD with early stopping is used inside bagging; in my opinion that might not be the best choice for user experience.

Member

@ogrisel left a comment

I wonder if we should instead change the fit method of the estimator to physically drop rows with null weights prior to taking the validation split and running the rest of the fit. However, this might introduce other unanticipated effects.
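
A minimal sketch of that alternative (a hypothetical helper, not what this PR implements): drop the zero-weight rows once at the start of fit, so both the validation split and the SGD updates only ever see rows that actually carry weight.

import numpy as np

def drop_zero_weight_rows(X, y, sample_weight):
    # Keep only the rows with a strictly positive weight.
    keep = np.asarray(sample_weight) > 0
    return X[keep], y[keep], np.asarray(sample_weight)[keep]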

At least this PR is minimal, which makes it easier to review.

Could you please also add a changelog entry?

@@ -269,14 +269,17 @@ def _allocate_parameter_mem(
             self._standard_intercept.shape, dtype=np.float64, order="C"
         )

-    def _make_validation_split(self, y):
+    def _make_validation_split(self, y, sample_weight):
Member

I think the code would be easier to follow if this variable were called sample_mask and this method were called as:

self._make_validation_split(y, sample_mask=sample_weight > 0)

Otherwise, the caller might expect the sampling of the validation set to have probabilities proportional to the weights, which is not the case with our shuffle-based CV splitters.

Contributor Author

Updated as suggested :)

@MaxwellLZH requested a review from ogrisel Jun 22, 2022
Member

@thomasjpfan left a comment

Thank you for the PR!

sklearn/linear_model/_stochastic_gradient.py (outdated review thread, resolved)
idx_non_zero = np.arange(n_samples)[sample_mask]
y_ = y[sample_mask]
Member

This can change the underlying model compared to main. On main, a subset of the samples in the validation set can have zero weight. As long as not all of them have zero weight, then there will still be a score. In this PR, these zero weights are filtered out before the split.

Given that, I prefer to error or warn when this happens and suggest changing the random_state.

f"There are {cnt_zero_weight_val} samples with zero sample weight in"
" the validation set, consider using a different random state."
Member

I think having some zero sample weights in the validation set is okay. I think the original issue was because all the sample weights were zero.
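
A toy illustration, assuming the validation score is a plain weighted average of per-sample correctness: a few zero weights are harmless, but an all-zero weight vector makes the normalisation term zero.

import numpy as np

correct = np.array([1, 0, 1, 1])              # per-sample correctness
some_zero = np.array([0.0, 2.0, 0.0, 1.0])    # a few zero weights: fine
all_zero = np.zeros(4)                        # all zero: score is undefined

np.average(correct, weights=some_zero)        # well defined
# np.average(correct, weights=all_zero)       # raises ZeroDivisionError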

Contributor Author

Hi @thomasjpfan, the warning is added because we're changing the default behaviour, as you suggested in the previous comment:

Given that, I prefer to error or warn when this happens and suggest changing the random_state.

Member

In #23275 (comment), I mentioned:

As long as not all of them have zero weight, then there will still be a score.

To be clear, I meant to warn when all of the weights are zero. In other words:

if not np.any(sample_mask[idx_val]):
    warnings.warn(...)

Contributor Author

Hi @thomasjpfan, I've updated the code to raise an error instead of a warning, because a ZeroDivisionError will eventually be raised anyway when the sample weights for the validation set are all zero, so I think it's useful to raise a more informative error beforehand.
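
In other words, the check combines the condition suggested above with a more explicit message; a sketch of the idea, continuing the earlier snippet's sample_mask and idx_val (names are illustrative and may differ from the merged diff):

if not np.any(sample_mask[idx_val]):
    raise ValueError(
        "The sample weights for validation set are all zero, consider using a"
        " different random state."
    )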

Member

@thomasjpfan left a comment

Minor comments, otherwise LGTM

"The sample weights for validation set are all zero, consider using a"
" different random state."
Member

This is technically backward breaking, but I consider this more of a bug fix. I do not think it makes sense to have a validation set where every sample weight is 0.

Member

I agree with that; it was already failing, so it's clearly a bug fix.

sklearn/linear_model/tests/test_sgd.py (two outdated review threads, resolved)
MaxwellLZH and others added 4 commits Jul 20, 2022
Co-authored-by: Thomas J. Fan <[email protected]>
…MaxwellLZH/scikit-learn into fix/bagging-earlystopping-zero-division
@cmarmo added the Waiting for Second Reviewer label Oct 20, 2022
Member

@jeremiedbb left a comment

Thanks @MaxwellLZH. LGTM

@jeremiedbb merged commit 335586a into scikit-learn:main Oct 28, 2022
25 checks passed
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022
andportnoy pushed a commit to andportnoy/scikit-learn that referenced this pull request Nov 5, 2022
Labels
module:linear_model, Waiting for Second Reviewer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Combining bagging with SGD and early stopping throws ZeroDivisionError
5 participants