FIX bagging with SGD and early stopping throws ZeroDivisionError #23275
Conversation
I wonder if we should not change the fit method of the estimator to physically drop rows with null weights prior to taking the validation split and the rest of the fit. However, this might introduce some other unanticipated effects.
At least this PR is minimal, so easier to review.
Could you please also add a changelog entry?
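For illustration, a minimal sketch of the alternative mentioned above (not what this PR does); the helper name drop_zero_weight_rows is made up for this example:

```python
import numpy as np

def drop_zero_weight_rows(X, y, sample_weight):
    # Keep only rows with strictly positive weight, so that neither the
    # validation split nor the rest of the fit ever sees a zero-weight row.
    sample_weight = np.asarray(sample_weight)
    keep = sample_weight > 0
    return X[keep], y[keep], sample_weight[keep]
```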
```diff
@@ -269,14 +269,17 @@ def _allocate_parameter_mem(
             self._standard_intercept.shape, dtype=np.float64, order="C"
         )

-    def _make_validation_split(self, y):
+    def _make_validation_split(self, y, sample_weight):
```
I think the code would be easier to follow if this variable were called sample_mask
and if this method were called as:
self._make_validation_split(y, sample_mask=sample_weight > 0)
Otherwise the caller might expect the sampling of the validation set to have probabilities proportional to the weights, which is not the case in our shuffle-based CV splitters.
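As a rough illustration of this suggestion, a minimal sketch (not the actual scikit-learn implementation; the function name and the validation_fraction parameter are assumptions here):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def make_validation_split(y, sample_mask, validation_fraction=0.1, random_state=0):
    # Only samples with a True mask (i.e. strictly positive weight) are
    # candidates for the validation set; the split itself is a plain
    # shuffle split, so weights do not influence selection probabilities.
    n_samples = y.shape[0]
    idx_nonzero = np.arange(n_samples)[sample_mask]
    splitter = ShuffleSplit(
        n_splits=1, test_size=validation_fraction, random_state=random_state
    )
    _, val_sub = next(splitter.split(idx_nonzero.reshape(-1, 1)))
    validation_mask = np.zeros(n_samples, dtype=bool)
    validation_mask[idx_nonzero[val_sub]] = True
    return validation_mask
```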
Updated as suggested :)
Thank you for the PR!
```python
idx_non_zero = np.arange(n_samples)[sample_mask]
y_ = y[sample_mask]
```
This can change the underlying model compared to main. On main, a subset of the samples in the validation set can have zero weight. As long as not all of them have zero weight, there will still be a score. In this PR, these zero weights are filtered out before the split.
Given that, I prefer to error or warn when this happens and suggest changing the random_state.
f"There are {cnt_zero_weight_val} samples with zero sample weight in" | ||
" the validation set, consider using a different random state." |
I think having some zero sample weights in the validation set is okay. I think the original issue was because all the samples weights were zero.
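For context, the failure mode is analogous to taking a weighted average with all-zero weights (a small illustration only; the actual error in SGD's early stopping comes from the weighted validation score, which may not use np.average exactly like this):

```python
import numpy as np

# Some zero weights are fine: the weighted average is still defined.
np.average([1.0, 2.0, 3.0], weights=[0.0, 1.0, 1.0])   # 2.5

# All-zero weights are not: numpy raises ZeroDivisionError
# ("Weights sum to zero, can't be normalized").
np.average([1.0, 2.0, 3.0], weights=[0.0, 0.0, 0.0])
```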
Hi @thomasjpfan, the warning is added because we're changing the default behaviour, as you suggested in the previous comment:
Given that, I prefer to error or warn when this happens and suggest changing the random_state.
In #23275 (comment), I mentioned:
As long as not all of them have zero weight, then there will still be a score.
To be clear, I meant to warn when all of the weights are zero. In other words:
```python
if not np.any(sample_mask[idx_val]):
    warnings.warn(...)
```
Hi @thomasjpfan, I've updated the code to raise an error instead of a warning, because a ZeroDivisionError will eventually be raised when the sample weights for the validation set are all 0, so I think it's useful to raise a more informative error beforehand.
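A hedged sketch of the kind of guard described here (illustrative only; the names sample_mask and idx_val follow the earlier snippets, and this is not the exact code merged in the PR):

```python
import numpy as np

def check_validation_weights(sample_mask, idx_val):
    # sample_mask is True where sample_weight > 0; idx_val holds the indices
    # chosen for the early-stopping validation set.
    if not np.any(sample_mask[idx_val]):
        raise ValueError(
            "The sample weights for validation set are all zero, consider"
            " using a different random state."
        )
```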
Minor comments, otherwise LGTM
"The sample weights for validation set are all zero, consider using a" | ||
" different random state." |
This is technically backward breaking, but I consider this more of a bug fix. I do not think it makes sense to have a validation set where every sample weight is 0.
I agree with that; it was already failing, so it's clearly a bug fix.
Thanks @MaxwellLZH. LGTM
Reference Issues/PRs
Fixes #17229.
What does this implement/fix? Explain your changes.
Trying to fix the issue by passing sample_weight to the _make_validation_split function, and only choosing samples with positive weight as validation data.
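For context, a minimal sketch of the kind of setup that triggered the original error (whether a given random_state actually hits the all-zero-weight validation split depends on the data and seeds):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Bagging passes bootstrap sample counts to SGDClassifier as sample weights,
# so out-of-bag rows get weight 0; with early stopping enabled, the internal
# validation split can then end up containing only zero-weight samples.
clf = BaggingClassifier(
    SGDClassifier(early_stopping=True, validation_fraction=0.3, random_state=0),
    n_estimators=10,
    random_state=0,
)
clf.fit(X, y)  # could raise ZeroDivisionError before this fix
```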
Other comment
There's another open PR #17435 that raises an error when SGD with early stopping is used inside bagging, which might not be the best choice for user experience in my opinion.