The Wayback Machine - https://web.archive.org/web/20250520142406/https://github.com/scikit-learn/scikit-learn/pull/19429

EXA improve example of forest feature importances digits #19429


Merged
merged 22 commits into from
May 28, 2021

Conversation

azihna
Contributor

@azihna azihna commented Feb 10, 2021

What does this implement/fix? Explain your changes.

References #14528
Changed the model to RandomForestClassifier from ExtraTreesClassifier.
Changed the formatting to the formatting of the current example with explanations.
Added the permutation importance example, for details please see the issue.
Changed the dataset from faces to digits because the faces dataset has n_features >> n_samples: permutation importance takes over 15 minutes to compute even when the n_rounds parameter is reduced or a much smaller model is used, and the results come back as all zeros. It is much quicker with the digits dataset, which can still demonstrate the usage of the methods in a similar fashion.
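As a rough sketch of the reworked example described above (not the merged code itself; the model size and parameters here are illustrative), the digits dataset can be fit with a `RandomForestClassifier` and the impurity-based (MDI) importances read off directly:

```python
# Sketch: MDI feature importances of a random forest trained on the
# digits dataset (8x8 pixel images, so 64 features).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# n_estimators=50 is an illustrative choice to keep the sketch fast.
forest = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
forest.fit(X, y)

# One MDI importance per pixel; the values are non-negative and sum to 1.
importances = forest.feature_importances_
print(importances.shape)  # (64,)
```

Because each feature is a pixel intensity, the importances can be reshaped to 8x8 and shown as an image, which is what makes digits a convenient dataset for this example.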

Any other comments?

@glemaitre
Member

It might not be worth changing this example. We will not have the low-cardinality issue. However, we do have an issue with the fact that the MDI is computed on training statistics. Maybe we could add a warning but leave the example as is for the rest.

@glemaitre
Member

15 minutes is too long for an example indeed.

@azihna
Contributor Author

azihna commented Feb 11, 2021

The 15 minutes is for a RandomForest with 200 trees, permutation n_rounds=2, and n_jobs=-1 on an 8-core CPU machine, and all of the permutation results come back as zero. With a model that small, the MDI example also didn't look as good.

Do you want me to revert the example back to faces? I think the example is more explanatory and complete like this but I can just provide a link to one of the other examples.

@reshamas
Member

#DataUmbrella
cc: @Mariam-ke

@glemaitre
Member

I summarize some discussions that we had during the meeting in DataUmbrella.

We should keep the example as is and keep the MDI feature importance.
However, we should mention that the limitations of the MDI are not an issue in this specific example: (i) all features are homogeneous and will not suffer from the cardinality bias, and (ii) we are interested in representing the knowledge the forest acquired on the training set.

@ogrisel
Member

ogrisel commented Feb 20, 2021

Also mention that if those two conditions are not met, it is recommended to use permutation_importance instead, and link to the relevant section in the user guide (using a Sphinx reference).
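A minimal sketch of the recommended alternative (parameters such as `n_repeats=5` and the split are illustrative, not taken from the merged example): permutation importance is model-agnostic and, unlike MDI, can be evaluated on a held-out test set:

```python
# Sketch: permutation importance on a held-out test set, the recommended
# alternative when MDI's training-set bias or cardinality bias matter.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

# n_repeats controls how many times each feature is shuffled; more
# repeats give more stable estimates at a higher computational cost.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=5, random_state=0, n_jobs=-1
)
print(result.importances_mean.shape)  # (64,)
```

The runtime scales with `n_repeats` times the number of features times the cost of one scoring pass, which is why the faces dataset (n_features >> n_samples) was so slow here.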

@azihna
Contributor Author

azihna commented Feb 23, 2021

Can you please advise what to fix for the changelog check? Is it better to just create a new PR?
Edit: It seems like it is fine since the last commit. I just needed to commit once more. Thanks!

@Mariam-ke
Contributor

@azihna How is this PR going? Please let us know if we can answer any questions.

cc: @reshamas

@azihna
Contributor Author

azihna commented Mar 11, 2021

@Mariam-ke I implemented all of the requests and am waiting for review. Thanks for following up :)

@reshamas
Member

@ogrisel @glemaitre
This PR is from the #DataUmbrella sprint. It's one of 4 open PRs from the sprint.

Member

@lorentzenchr lorentzenchr left a comment


@azihna This is a clear improvement of this example, thank you. I suggested only a few small changes.

@azihna
Contributor Author

azihna commented May 21, 2021

@lorentzenchr Thanks for the review. I accepted all of your suggestions.

@azihna azihna requested a review from lorentzenchr May 21, 2021 20:43
Member

@lorentzenchr lorentzenchr left a comment


LGTM

@lorentzenchr
Member

@glemaitre @ogrisel In case you want to give it another review pass, I'll wait a little before merging. If not, I'd say let's be efficient and merge as it's a clear improvement.

Member

@thomasjpfan thomasjpfan left a comment


Thank you for the PR @azihna !

Comment on lines 22 to 27
# We use the faces data from datasets submodules when using impurity-based
# feature importance. One drawback of this method is that it cannot be applied
# on a separate test set, but for this example, we are interested
# in representing the information learned from the full dataset.
# Also, we'll set the number of cores to use for the tasks.
Member


I think talking about a test set here while the example does not split on a test set could be confusing to a reader. Here is my suggestion:

First, we load the olivetti faces dataset and limit the dataset to contain only the first
five classes. Then we train a random forest on the dataset and evaluate the
impurity-based feature importance.

Contributor Author


@thomasjpfan Thanks a lot for the review!
I think your explanation is much clearer than the previous one, but I still think it is important to mention the drawback of using this method. Let me know what you think of the new version.

@lorentzenchr
Member

@thomasjpfan If you've time: Is the current state good for you?

Member

@thomasjpfan thomasjpfan left a comment


Thank you for working on this PR @azihna !

LGTM

@thomasjpfan thomasjpfan merged commit deda6e2 into scikit-learn:main May 28, 2021
sakibguy added a commit to sakibguy/scikit-learn that referenced this pull request May 30, 2021
EXA improve example of forest feature importances digits (scikit-learn#19429)