EXA improve example of forest feature importances digits #19429
Conversation
It might not be worth changing this example. We will not have the low-cardinality issue here. However, we do have the issue that the MDI is computed from training statistics. Maybe we could add a warning but otherwise leave the example as is.
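For illustration, here is a minimal sketch of the distinction raised above, with hypothetical hyperparameters (digits dataset, 100 trees) not taken from this PR: MDI comes from training statistics only, while permutation importance can be evaluated on held-out data.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# MDI is accumulated from impurity decreases seen during training only.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
mdi = forest.feature_importances_

# Permutation importance can instead be evaluated on held-out data.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=5, random_state=0, n_jobs=-1
)
perm = result.importances_mean
```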
15 minutes is too long for an example indeed.
The 15 minutes is for a RandomForest with 200 trees, permutation n_repeats=2, and n_jobs=-1 on an 8-core CPU machine, and all of the permutation results come back as zero. With a model that small, the MDI example also didn't look as good. Do you want me to revert the example back to faces? I think the example is more explanatory and complete like this, but I can just provide a link to one of the other examples.
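For context, a rough sketch of the kind of run described above, under assumed settings (200 trees, n_repeats=2); the exact configuration used in the PR may differ. The faces data has many more features than samples, which is what makes permuting each feature expensive.

```python
import time

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# The Olivetti faces have 400 samples and 4096 pixel features,
# so n_features >> n_samples; permuting each feature is costly.
data = fetch_olivetti_faces()  # downloads the dataset on first use
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(data.data, data.target)

start = time.perf_counter()
result = permutation_importance(
    forest, data.data, data.target, n_repeats=2, n_jobs=-1, random_state=0
)
print(f"permutation_importance took {time.perf_counter() - start:.1f}s")
```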
#DataUmbrella |
To summarize some of the discussions we had during the DataUmbrella meeting: we should keep the example as is and keep the MDI feature importance.
Also mention that if those two conditions are not met, it's recommended to instead use permutation importance.
Can you please advise what to fix for the changelog check? Is it better to just create a new PR?
@Mariam-ke I implemented all of the requests and am waiting for review. Thanks for following up :)
@ogrisel @glemaitre |
@azihna This is a clear improvement of this example, thank you. I suggested only a few small changes.
@lorentzenchr Thanks for the review. I accepted all of your suggestions.
LGTM
@glemaitre @ogrisel In case you want to give it another review pass, I'll wait a little before merging. If not, I'd say let's be efficient and merge as it's a clear improvement.
Thank you for the PR @azihna !
# We use the faces data from the datasets submodule when using impurity-based
# feature importance. One drawback of this method is that it cannot be applied
# on a separate test set, but for this example, we are interested in
# representing the information learned from the full dataset.
# Also, we'll set the number of cores to use for the tasks.
I think talking about a test set here while the example does not split on a test set could be confusing to a reader. Here is my suggestion:
First, we load the olivetti faces dataset and limit the dataset to contain only the first
five classes. Then we train a random forest on the dataset and evaluate the
impurity-based feature importance.
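A sketch of the flow this suggestion describes; the tree count and random state are illustrative assumptions, not values taken from the PR.

```python
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import RandomForestClassifier

# Load the Olivetti faces and keep only the first five classes.
data = fetch_olivetti_faces()
mask = data.target < 5
X, y = data.data[mask], data.target[mask]

# Fit a random forest; every pixel of the 64x64 images is one feature.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
forest.fit(X, y)

# Impurity-based (MDI) importances, reshaped back into image form.
img = forest.feature_importances_.reshape(64, 64)
plt.imshow(img, cmap=plt.cm.hot)
plt.title("Pixel importances using impurity values (MDI)")
plt.show()
```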
@thomasjpfan Thanks a lot for the review!
I think your explanation is much clearer than the previous one but I still think it is important to mention the drawback of using this method. Let me know what you think of the new version.
@thomasjpfan If you have time: is the current state good for you?
Thank you for working on this PR @azihna !
LGTM
EXA improve example of forest feature importances digits (scikit-learn#19429)
What does this implement/fix? Explain your changes.

References #14528

- Changed the model to `RandomForestClassifier` from `ExtraTreesClassifier`.
- Changed the formatting to match that of the current example, with explanations.
- Added the permutation importance example; for details please see the issue.
- Changed the dataset from faces to digits because the faces dataset has
  n_features >> n_samples. Permutation importance takes over 15 minutes to
  compute even when the n_repeats parameter is reduced or a much smaller
  model is used, and the results just come back as all zeroes. It is much
  quicker with the digits dataset and can still show the usage of the methods
  in a similar fashion (a rough sketch of this setup follows below).

Any other comments?
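For reference, a minimal sketch of the digits-based setup described above; the model size and n_repeats value are illustrative assumptions, not the PR's actual values.

```python
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Digits: 1797 samples, 64 features (8x8 images), so permutation
# importance runs in seconds rather than minutes.
X, y = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)

result = permutation_importance(
    forest, X, y, n_repeats=5, n_jobs=-1, random_state=0
)

# Both importance vectors have one entry per pixel and reshape to 8x8.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(forest.feature_importances_.reshape(8, 8), cmap=plt.cm.hot)
ax1.set_title("MDI importances")
ax2.imshow(result.importances_mean.reshape(8, 8), cmap=plt.cm.hot)
ax2.set_title("Permutation importances")
plt.show()
```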