The Wayback Machine - https://web.archive.org/web/20230128034352/https://github.com/scikit-learn/scikit-learn/pull/20415

FIX adaboost return nan in feature importance #20415

Merged

Conversation

MaxwellLZH
Contributor

Reference Issues/PRs

This is a fix to #20320. Corresponding test case is also added.

What does this implement/fix? Explain your changes.

As discussed in the original thread, the NaN in the feature importances is caused by extremely small sample weights during the boosting process. A quick fix is to clip the sample weights to be at least epsilon:
sample_weight = np.clip(sample_weight, a_min=epsilon, a_max=1.0)
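A minimal runnable sketch of that clipping (the weight values here are made-up illustrative numbers, not taken from an actual boosting run):

```python
import numpy as np

# Made-up sample weights; two have underflowed to values far below
# machine epsilon during boosting.
sample_weight = np.array([0.3, 1e-300, 0.7, 5e-320])

# Clip to at least machine epsilon so downstream computations
# (e.g. normalizations feeding the feature importances) never see ~0.
epsilon = np.finfo(sample_weight.dtype).eps
sample_weight = np.clip(sample_weight, a_min=epsilon, a_max=1.0)
```

After clipping, every weight lies in [epsilon, 1.0], so later divisions by weight sums cannot produce NaN.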

@MaxwellLZH MaxwellLZH changed the title FIX adaboost return nan in feature importance WIP adaboost return nan in feature importance Jun 28, 2021
@MaxwellLZH MaxwellLZH changed the title WIP adaboost return nan in feature importance FIX adaboost return nan in feature importance Jun 29, 2021
Member

@ogrisel ogrisel left a comment


Let me clarify what I suggested previously:

# avoid extremely small sample weight, detail see issue #20320
sample_weight = np.clip(sample_weight, a_min=epsilon, a_max=None)
# do not clip sample weight when it's exactly 0
sample_weight[zero_loc] = 0.0
Member


My suggestion above would instead have been:

# Make near-zero weights exactly 0 to avoid numerical issues when computing
# feature importances.
sample_weight[sample_weight < epsilon] = 0

However this is just a suggestion and is open to discussion because I am not 100% sure this is the correct fix.
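For comparison, a minimal sketch of this zeroing alternative on made-up weights (an exact zero stays zero, and an underflowed near-zero value is snapped to exactly zero):

```python
import numpy as np

# Made-up weights: one exact zero and one underflowed near-zero value.
sample_weight = np.array([0.5, 0.0, 1e-300, 0.5])

# Snap anything below machine epsilon to exactly 0 so near-zero weights
# are cleanly ignored instead of leaking denormals into later computations.
epsilon = np.finfo(sample_weight.dtype).eps
sample_weight[sample_weight < epsilon] = 0.0
```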

Contributor Author


Setting small sample weights to 0 hurts model performance, as shown below. I suspect the reason is that those small weights can add up to be impactful at certain nodes, so they are not ignorable.

AUC under the original implementation:
[image]

AUC under the proposed fix:
[image]

AUC when setting small sample weights straight to 0:
[image]

Member

@ogrisel ogrisel Jun 30, 2021


Thanks for the report. This is kind of surprising... but so be it.

Contributor

@glemaitre glemaitre Jul 6, 2021


I do find the comparison with 0.0 problematic: it is ugly and might bite us at some point :)

@MaxwellLZH could you provide the code snippet used to build the ROC curve? Is it a training or testing score, and if it is the testing score, do you observe the same for the training score?

Since #20443 was linked mainly to regressors, I am thinking this may be related, but now for a classification problem. It would be nice to check the strategy of resetting weights, because setting them to zero while adding weak learners might lead to a local optimum with a non-diverse ensemble.

Contributor Author


Hi @glemaitre, the curve can be reproduced with the following code, where data_train.csv is the same dataset as in the original issue.

import pandas as pd
import numpy as np

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('~/Downloads/data_train.csv')
model_variables = ['RH', 't2m', 'tp_r5', 'swvl1', 'SM_r20', 'tp', 'cvh', 'vdi', 'SM_r10', 'SM_IDW']

X = train_data[model_variables]  # Features
y = train_data.ignition_no

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1024)

train_auc, test_auc = [], []
for n_est in [1, 3, 5, 10, 20, 50, 100]:
    est = AdaBoostClassifier(
            base_estimator=DecisionTreeClassifier(max_depth=10, random_state=0),
            n_estimators=n_est)
    est.fit(X_train, y_train)
    train_auc.append(roc_auc_score(y_train, est.predict_proba(X_train)[:, 1]))
    test_auc.append(roc_auc_score(y_test, est.predict_proba(X_test)[:, 1]))
    
train_auc

# [0.9408335858626135, 0.9843405707338944, 0.9937636343659565, 0.9972901393075558, 0.9483462433389864, 0.9472818988784156, 0.9391712786487824]

The change I made is adding sample_weight[sample_weight < epsilon] = 0 after line 162, where epsilon = np.finfo(sample_weight.dtype).eps:

random_state = check_random_state(self.random_state)
for iboost in range(self.n_estimators):
    # Boosting step
    sample_weight, estimator_weight, estimator_error = self._boost(
        iboost, X, y, sample_weight, random_state
    )
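To illustrate why such tiny weights appear at all, here is a self-contained toy sketch (not scikit-learn's actual _boost implementation; names and numbers are made up) of a SAMME-style multiplicative reweighting that drives the weights of correctly classified samples below machine epsilon, with the zeroing fix applied each round:

```python
import numpy as np

def update_weights(sample_weight, misclassified, estimator_weight):
    # Toy SAMME-style reweighting: up-weight misclassified samples,
    # down-weight correctly classified ones, then renormalize.
    sample_weight = sample_weight * np.exp(
        estimator_weight * np.where(misclassified, 1.0, -1.0)
    )
    sample_weight = sample_weight / sample_weight.sum()
    # The fix from this PR: weights that have decayed below machine
    # epsilon are set to exactly 0 instead of lingering as denormals.
    eps = np.finfo(sample_weight.dtype).eps
    sample_weight[sample_weight < eps] = 0.0
    return sample_weight

w = np.full(4, 0.25)
misclassified = np.array([True, False, False, False])
for _ in range(20):  # a handful of boosting rounds is enough
    w = update_weights(w, misclassified, estimator_weight=2.0)
```

After a few rounds the correctly classified samples' weights shrink exponentially, cross epsilon, and are snapped to exactly zero, leaving all the mass on the persistently misclassified sample.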

sklearn/ensemble/_weight_boosting.py (outdated review thread, resolved)
Member

@ogrisel ogrisel left a comment


LGTM. Please add an entry to the changelog in doc/whats_new/v1.0.rst.

@ogrisel
Member

ogrisel commented Jul 5, 2021

For reference, I tried to see if this PR could fix the bad training convergence behavior observed in #20443 but unfortunately it does not help.

@MaxwellLZH MaxwellLZH force-pushed the fix/adaboost-nan-feature-importance branch from 336e993 to cc5b899 Compare Feb 23, 2022
@cmarmo
Member

cmarmo commented May 10, 2022

Hi @MaxwellLZH , thank you for your work so far. Do you mind fixing conflicts, if you are still interested in working on this pull request?
This will probably not go into 1.1, but there is some hope for the next bugfix release... Thanks for your patience!

Contributor

@glemaitre glemaitre left a comment


LGTM

@glemaitre
Contributor

It seems that the failure shows we changed the behaviour of the estimator. @MaxwellLZH, would you have time to check why we don't trigger the warning anymore? I assume the implemented trick avoids the degenerate cases. We might still want to trigger the warning in those cases, even if it will not appear in the output.

@MaxwellLZH
Contributor Author

Sure! I will check the failing tests this week.

@cmarmo
Member

cmarmo commented Jul 2, 2022

Hi @MaxwellLZH , two approvals but some failing tests.
Do you mind synchronizing with upstream and checking the failing tests?
Thanks a lot!

Member

@cmarmo cmarmo left a comment


Hi @MaxwellLZH, the changelog contains some relics from the merge conflict markers.
Do you mind fixing them and committing?
The builds are no longer available, so it is difficult to review the failing tests.
Thanks!

doc/whats_new/v1.2.rst (two outdated review threads, resolved)
MaxwellLZH and others added 2 commits Aug 11, 2022
Co-authored-by: Chiara Marmo <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
@cmarmo
Member

cmarmo commented Aug 12, 2022

Hi @MaxwellLZH, I hope you don't mind me stepping in... just thinking that this PR deserves to be merged... :)
I took some time to check the failing test.
Apparently, your modification makes AdaBoostClassifier fail with infinite weight values later than on main.
Increasing the learning rate to values > 22 in the test

clf = AdaBoostClassifier(n_estimators=30, learning_rate=5.0, algorithm="SAMME")

will make the test pass.
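A toy illustration (made-up numbers, unrelated to the actual test data) of why a larger learning rate reaches the degenerate regime in fewer boosting rounds: the multiplicative weight update grows roughly like exp(learning_rate * ...) per round, so overflow to infinite weights arrives sooner as the learning rate increases:

```python
import numpy as np

def rounds_until_nonfinite(learning_rate):
    # A weight that grows by exp(learning_rate) each boosting round;
    # count how many rounds until it overflows to inf.
    w = 1.0
    for n in range(1, 1000):
        w *= np.exp(learning_rate)
        if not np.isfinite(w):
            return n
    return None

slow = rounds_until_nonfinite(5.0)    # learning rate used in the test
fast = rounds_until_nonfinite(23.0)   # a value above the > 22 threshold
```

With the higher learning rate, the blow-up happens within far fewer rounds, which is consistent with the test passing once the learning rate is raised.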

@MaxwellLZH
Contributor Author

Hi @cmarmo, thank you so much for helping with the PR! Really appreciate it ! :)

@lorentzenchr lorentzenchr merged commit b903486 into scikit-learn:main Aug 15, 2022
32 checks passed
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Chiara Marmo <[email protected]>
5 participants