[MRG+1] EHN Add bootstrap sample size limit to forest ensembles #14682
Conversation
You should also add the parameter to RandomForestRegressor, shouldn't you?
Additional comments regarding the tests.
@@ -1330,3 +1330,93 @@ def test_forest_degenerate_feature_importances():
    gbr = RandomForestRegressor(n_estimators=10).fit(X, y)
    assert_array_equal(gbr.feature_importances_,
                       np.zeros(10, dtype=np.float64))


def test__get_n_bootstrap_samples():
We usually don't test a private function. Instead we would test through the different estimators (RandomForest, ExtraTrees).
You can easily parametrize this case. You can also make a test function where you check that errors are raised properly (you can parametrize that as well); see the sketch below.
The other behavior should be tested by fitting the estimator, to check that we fitted on a subset of data.
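For example, such a parametrized error check could look roughly like the following. This is only a sketch: the invalid values and the exact exception types/messages are illustrative, not necessarily what this PR implements.

import pytest
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

@pytest.mark.parametrize("Forest", [RandomForestClassifier, ExtraTreesClassifier])
@pytest.mark.parametrize("max_samples", [-1, 0, 2.0, "not a number"])
def test_max_samples_raises(Forest, max_samples):
    # Invalid max_samples values should raise when fitting.
    est = Forest(n_estimators=2, max_samples=max_samples)
    with pytest.raises((ValueError, TypeError)):
        est.fit(X, y)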
check that we fitted on a subset of data.
where do you get that info?
As far as I can see, the sample indices for each bootstrap are not stored anywhere, but translated into sample weights and passed to the estimator.
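For context, the mechanism looks roughly like this (a simplified sketch of the idea, not the actual forest.py code): the bootstrap indices are converted to per-sample counts and passed as sample_weight, so they never need to be stored on the estimator.

import numpy as np

def bootstrap_sample_weight(seed, n_samples, n_samples_bootstrap):
    # Draw a bootstrap sample (with replacement) and encode it as counts:
    # samples never drawn get weight 0, duplicates get weight > 1.
    rng = np.random.RandomState(seed)
    indices = rng.randint(0, n_samples, n_samples_bootstrap)
    return np.bincount(indices, minlength=n_samples)

# Each tree is then fit with sample_weight set to these counts, which is why
# the bootstrap indices themselves are not available after fitting.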
(exception check unit tests refactored in: 081b7b7)
where do you get that info?
Yep, I thought we had a private function to get back the indices, but it does not look so simple to do without some hacks. I would need a bit more time to think about how to test this.
sklearn/ensemble/forest.py (outdated)
        unsampled_indices = _generate_unsampled_indices(
-           estimator.random_state, n_samples)
+           estimator.random_state, n_samples, n_bootstrap_samples)
@jnothman here we will test on the left-out samples. Since we used max_samples, it means that we will use a lot of samples for the OOB score. Is that something we want? If so, we should add a test to make sure that we have the right complement when max_samples is not None.
Is there a problem with having a large OOB sample for test? Testing with trees isn't fast, but is a lot faster than training...?
Testing with trees isn't fast, but is a lot faster than training...?
True. Should not be an issue then
If the sample is much smaller than the dataset, I suppose it may be an issue. Maybe we should make a rule that the OOB sample is constrained to be no larger than the training set... but that may confuse users trying different values for this parameter.
If the sample is much smaller than the dataset
I can see this situation easily arising, e.g., your dataset is 10^6 examples and you want to fit with, say, max_samples=1000 and n_estimators=1000.
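To make the concern concrete, here is a rough sketch of the OOB complement (illustrative only: the helper mirrors what _generate_unsampled_indices does conceptually, it is not the PR's code). With max_samples much smaller than the dataset, nearly every sample is out of bag for every tree.

import numpy as np

def unsampled_indices(seed, n_samples, n_samples_bootstrap):
    # Anything never drawn into the bootstrap is out of bag (OOB).
    rng = np.random.RandomState(seed)
    drawn = rng.randint(0, n_samples, n_samples_bootstrap)
    mask = np.ones(n_samples, dtype=bool)
    mask[drawn] = False
    return np.flatnonzero(mask)

# e.g. n_samples=1000000 with n_samples_bootstrap=1000 leaves roughly
# 999000+ OOB samples per tree, so OOB scoring predicts on almost the
# full dataset for every estimator.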
A couple of additional comments
Hello, does this actually limit the sample size given to each base tree? I noticed a few days back that I tried putting a
It is the definition of a bootstrap sample: sampling with replacement, so
Apart from the small pytest.raises comment, LGTM.
FYI: codecov seems to be wrong. All the diff code is covered.
Thanks @notmatthancock, let's wait for a second review before merging. Maybe @jnothman @adrinjalali could have a look.
I'm good with the implementation. I'm wondering if it needs mentioning in the documentation for bootstrap... This is no longer quite a bootstrap, but I think former proposals reused that parameter name. If we are happy with a new parameter, this LGTM.
Otherwise LGTM, thanks @notmatthancock
    if max_samples is None:
        return n_samples

    if isinstance(max_samples, numbers.Integral):
I think we should be consistent w.r.t. how we treat these fractions. For instance, in OPTICS we have:
scikit-learn/sklearn/cluster/optics_.py, lines 280 to 290 in e5be3ad
and there 1 always means 100% of the data, at least in OPTICS. Do we have similar semantics in other places?
With PCA, n_components=1 means 1 component, while n_components < 1 will be a percentage.
Excluding 1 from the float range avoids issues with float comparison as well.
sklearn/ensemble/forest.py (outdated)
        to train each base estimator.
        - If None (default), then draw `X.shape[0]` samples.
        - If int, then draw `max_samples` samples.
        - If float, then draw `max_samples * X.shape[0]` samples.
I'm okay with the behavior of the float, but we need to document that here, i.e. explicitly say that if float, it must be in (0, 1).
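A sketch of validation logic consistent with that documentation (hypothetical helper name and error messages; the merged code may differ):

import numbers

def n_samples_bootstrap(n_samples, max_samples):
    # Resolve max_samples (None, int, or float) into a bootstrap sample size.
    if max_samples is None:
        return n_samples
    if isinstance(max_samples, numbers.Integral):
        if not 1 <= max_samples <= n_samples:
            raise ValueError("`max_samples` must be in range 1 to %d" % n_samples)
        return max_samples
    if isinstance(max_samples, numbers.Real):
        if not 0 < max_samples < 1:
            raise ValueError("`max_samples` must be in range (0, 1)")
        return int(round(n_samples * max_samples))
    raise TypeError("`max_samples` should be int or float, got %r" % type(max_samples))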
Merged commit 746efb5 into scikit-learn:master.
@notmatthancock I made the small change requested by @adrinjalali and merged.
Likewise @glemaitre, thanks to you and @jnothman for the helpful review comments.
- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)
Do not add ccp_alpha to SurvivalTree, because it relies on node_impurity, which is not set for SurvivalTree.
Adds a max_samples kwarg to forest ensembles that limits the size of the bootstrap samples used to train each estimator. This PR is intended to supersede two previous stalled PRs (#5963 and #9645), which have not been touched for a couple of years. This PR adds unit tests for the various functionalities.
Why is this useful?
When training a random forest classifier on a large dataset with correlated/redundant examples, training on the entire dataset is memory intensive and unnecessary. Further, this results in models that occupy an unwieldy amount of memory. As an example, consider training on image patches for a segmentation-as-classification problem. It would be useful in this situation to train only on a subset of the available image patches, because an image patch obtained at one location is expected to be highly correlated with the patch obtained one pixel over. Limiting the size of each bootstrap sample used to train each estimator is useful in this and similar applications.
Pickled model disk space comparison
Here's a simple test script to show the difference in disk space occupied by the pickled model when using the full dataset size for each bootstrap vs. a bootstrap size of just 1 (obviously this is dramatic):
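(The original script is not reproduced in this extract; a minimal sketch along those lines, with an arbitrary synthetic dataset, might look like the following.)

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

for max_samples in (None, 1):  # full-size bootstrap vs. one sample per tree
    clf = RandomForestClassifier(n_estimators=100, max_samples=max_samples,
                                 random_state=0).fit(X, y)
    print("max_samples=%r: pickled size = %d bytes"
          % (max_samples, len(pickle.dumps(clf))))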