The Wayback Machine - https://web.archive.org/web/20230128034352/https://github.com/scikit-learn/scikit-learn/pull/20415

FIX adaboost return nan in feature importance #20415

Merged

Conversation

MaxwellLZH
Contributor

Reference Issues/PRs

This is a fix to #20320. Corresponding test case is also added.

What does this implement/fix? Explain your changes.

As discussed in the original thread, the NaN in the feature importances is caused by extremely small sample weights during the boosting process. A quick fix is to clip the sample weights to be at least epsilon:
sample_weight = np.clip(sample_weight, a_min=epsilon, a_max=1.0)
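A minimal runnable sketch of that clipping (the weight values here are made-up illustrative numbers, not taken from an actual boosting run):

```python
import numpy as np

# Made-up sample weights; two have underflowed to values far below
# machine epsilon during boosting.
sample_weight = np.array([0.3, 1e-300, 0.7, 5e-320])

# Clip to at least machine epsilon so downstream computations
# (e.g. normalizations feeding the feature importances) never see ~0.
epsilon = np.finfo(sample_weight.dtype).eps
sample_weight = np.clip(sample_weight, a_min=epsilon, a_max=1.0)
```

After clipping, every weight lies in [epsilon, 1.0], so later divisions by weight sums cannot produce NaN.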

@MaxwellLZH MaxwellLZH changed the title FIX adaboost return nan in feature importance WIP adaboost return nan in feature importance Jun 28, 2021
@MaxwellLZH MaxwellLZH changed the title WIP adaboost return nan in feature importance FIX adaboost return nan in feature importance Jun 29, 2021
Member

@ogrisel ogrisel left a comment


Let me clarify what I suggested previously:

# avoid extremely small sample weight, detail see issue #20320
sample_weight = np.clip(sample_weight, a_min=epsilon, a_max=None)
# do not clip sample weight when it's exactly 0
sample_weight[zero_loc] = 0.0
Member


My suggestion above would instead have been:

# Make near-zero weights exactly 0 to avoid numerical issues when computing
# feature importances.
sample_weight[sample_weight < epsilon] = 0

However this is just a suggestion and is open to discussion because I am not 100% sure this is the correct fix.
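For comparison, a minimal sketch of this zeroing alternative on made-up weights (an exact zero stays zero, and an underflowed near-zero value is snapped to exactly zero):

```python
import numpy as np

# Made-up weights: one exact zero and one underflowed near-zero value.
sample_weight = np.array([0.5, 0.0, 1e-300, 0.5])

# Snap anything below machine epsilon to exactly 0 so near-zero weights
# are cleanly ignored instead of leaking denormals into later computations.
epsilon = np.finfo(sample_weight.dtype).eps
sample_weight[sample_weight < epsilon] = 0.0
```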

Contributor Author


Setting small sample weights to 0 hurts model performance, as shown below. I suspect the reason is that those small weights can add up to be impactful at certain nodes, so they are not ignorable.

AUC under the original implementation:
[image]

AUC under the proposed fix:
[image]

AUC when setting small sample weights straight to 0:
[image]

Member

@ogrisel ogrisel Jun 30, 2021


Thanks for the report. This is kind of surprising... but so be it.

Contributor

@glemaitre glemaitre Jul 6, 2021


I do find the comparison with 0.0 problematic: it is ugly and might bite us at some point :)

@MaxwellLZH could you provide the code snippet used to build the ROC curve? Is it a training or testing score, and if it is the testing score, do you observe the same for the training score?

Since #20443 was linked mainly to regressors, I am thinking this may be related, but now for a classification problem. It would be nice to check the strategy of resetting weights, because setting them to zero while adding weak learners might lead to a local optimum with a non-diverse ensemble.

Contributor Author


Hi @glemaitre, the curve can be reproduced with the following code, where data_train.csv is the same dataset as in the original issue.

import pandas as pd
import numpy as np

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('~/Downloads/data_train.csv')
model_variables = ['RH', 't2m', 'tp_r5', 'swvl1', 'SM_r20', 'tp', 'cvh', 'vdi', 'SM_r10', 'SM_IDW']

X = train_data[model_variables]  # Features
y = train_data.ignition_no

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1024)

train_auc, test_auc = [], []
for n_est in [1, 3, 5, 10, 20, 50, 100]:
    est = AdaBoostClassifier(
            base_estimator=DecisionTreeClassifier(max_depth=10, random_state=0),
            n_estimators=n_est)
    est.fit(X_train, y_train)
    train_auc.append(roc_auc_score(y_train, est.predict_proba(X_train)[:, 1]))
    test_auc.append(roc_auc_score(y_test, est.predict_proba(X_test)[:, 1]))
    
train_auc

# [0.9408335858626135, 0.9843405707338944, 0.9937636343659565, 0.9972901393075558, 0.9483462433389864, 0.9472818988784156, 0.9391712786487824]

The change I made is adding sample_weight[sample_weight < epsilon] = 0 after line 162, where epsilon = np.finfo(sample_weight.dtype).eps:

random_state = check_random_state(self.random_state)
for iboost in range(self.n_estimators):
    # Boosting step
    sample_weight, estimator_weight, estimator_error = self._boost(
        iboost, X, y, sample_weight, random_state
    )
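To illustrate why such tiny weights appear at all, here is a self-contained toy sketch (not scikit-learn's actual _boost implementation; names and numbers are made up) of a SAMME-style multiplicative reweighting that drives the weights of correctly classified samples below machine epsilon, with the zeroing fix applied each round:

```python
import numpy as np

def update_weights(sample_weight, misclassified, estimator_weight):
    # Toy SAMME-style reweighting: up-weight misclassified samples,
    # down-weight correctly classified ones, then renormalize.
    sample_weight = sample_weight * np.exp(
        estimator_weight * np.where(misclassified, 1.0, -1.0)
    )
    sample_weight = sample_weight / sample_weight.sum()
    # The fix from this PR: weights that have decayed below machine
    # epsilon are set to exactly 0 instead of lingering as denormals.
    eps = np.finfo(sample_weight.dtype).eps
    sample_weight[sample_weight < eps] = 0.0
    return sample_weight

w = np.full(4, 0.25)
misclassified = np.array([True, False, False, False])
for _ in range(20):  # a handful of boosting rounds is enough
    w = update_weights(w, misclassified, estimator_weight=2.0)
```

After a few rounds the correctly classified samples' weights shrink exponentially, cross epsilon, and are snapped to exactly zero, leaving all the mass on the persistently misclassified sample.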

sklearn/ensemble/_weight_boosting.py (outdated review thread, resolved)
Member

@ogrisel ogrisel left a comment


LGTM. Please add an entry to the changelog in doc/whats_new/v1.0.rst.

@ogrisel
Member

ogrisel commented Jul 5, 2021

For reference, I tried to see if this PR could fix the bad training convergence behavior observed in #20443 but unfortunately it does not help.

@MaxwellLZH MaxwellLZH force-pushed the fix/adaboost-nan-feature-importance branch from 336e993 to cc5b899 Compare Feb 23, 2022
@cmarmo
Member

cmarmo commented May 10, 2022

Hi @MaxwellLZH , thank you for your work so far. Do you mind fixing conflicts, if you are still interested in working on this pull request?
This will probably not go into 1.1, but there is some hope for the next bugfix release... Thanks for your patience!

Contributor

@glemaitre glemaitre left a comment


LGTM

@glemaitre
Contributor

It seems that the failure shows we changed the behaviour of the estimator. @MaxwellLZH, would you have time to check why we don't trigger the warning anymore? I assume the implemented trick avoids the degenerate cases. We might still want to trigger the warning in those cases, even if it will not appear in the output.

@MaxwellLZH
Contributor Author

Sure! I will check the failing tests this week.

@cmarmo
Member

cmarmo commented Jul 2, 2022

Hi @MaxwellLZH , two approvals but some failing tests.
Do you mind synchronizing with upstream and checking the failing tests?
Thanks a lot!

Member

@cmarmo cmarmo left a comment


Hi @MaxwellLZH, the changelog contains some relics from the merge conflict markers.
Do you mind fixing them and committing?
The builds are no longer available, so it is difficult to review the failing tests.
Thanks!

doc/whats_new/v1.2.rst (two outdated review threads, resolved)
MaxwellLZH and others added 2 commits Aug 11, 2022
Co-authored-by: Chiara Marmo <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
@cmarmo
Member

cmarmo commented Aug 12, 2022

Hi @MaxwellLZH, I hope you don't mind me stepping in... just thinking that this PR deserves to be merged... :)
I took some time to check the failing test.
Apparently, your modification makes AdaBoostClassifier fail with infinite weight values later than on main.
Increasing the learning rate to values > 22 in the test

clf = AdaBoostClassifier(n_estimators=30, learning_rate=5.0, algorithm="SAMME")

will make the test pass.
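A toy illustration (made-up numbers, unrelated to the actual test data) of why a larger learning rate reaches the degenerate regime in fewer boosting rounds: the multiplicative weight update grows roughly like exp(learning_rate * ...) per round, so overflow to infinite weights arrives sooner as the learning rate increases:

```python
import numpy as np

def rounds_until_nonfinite(learning_rate):
    # A weight that grows by exp(learning_rate) each boosting round;
    # count how many rounds until it overflows to inf.
    w = 1.0
    for n in range(1, 1000):
        w *= np.exp(learning_rate)
        if not np.isfinite(w):
            return n
    return None

slow = rounds_until_nonfinite(5.0)    # learning rate used in the test
fast = rounds_until_nonfinite(23.0)   # a value above the > 22 threshold
```

With the higher learning rate, the blow-up happens within far fewer rounds, which is consistent with the test passing once the learning rate is raised.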

@MaxwellLZH
Contributor Author

Hi @cmarmo, thank you so much for helping with the PR! Really appreciate it ! :)

@lorentzenchr lorentzenchr merged commit b903486 into scikit-learn:main Aug 15, 2022
32 checks passed
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
5 participants