[MRG] Support unknown_value=np.nan in OrdinalEncoder #18406
Thank you for working on this!
LGTM
LGTM
Thanks @NicolasHug !
Merged commit 4aada4e into scikit-learn:master
Excellent work. Just to clarify: Will the new options allow to both
If yes, this will be super good news for fitting boosted trees!
This PR supports 2, but 1 is still not supported. An error is raised when NaNs are present in the training data: it's unclear where to map them, as the output of OrdinalEncoder is supposed to be interpreted as ordered quantities.
HistGradientBoostingClassifier and the corresponding regressor now natively support both missing values (as NaNs) and categorical data :) https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting
Yes, with 1 and 2 being inverted.
@NicolasHug: Thanks for clarifying. From a practical perspective, it is not desirable for remaining NaNs to raise an error. If the downstream model cannot natively deal with NaNs, we can simply add an imputer after the encoder and voilà.
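The "imputer after the encoder" idea could look roughly like the following sketch. It only covers categories that are unseen at transform time (the case this PR handles); the toy data and the fill value of -1.0 are arbitrary choices for illustration, not something proposed in the thread.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Unknown categories become NaN, then the imputer replaces NaN with a constant.
pipeline = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
    SimpleImputer(strategy="constant", fill_value=-1.0),
)

X_train = np.array([["a"], ["b"], ["a"]], dtype=object)
X_new = np.array([["a"], ["c"]], dtype=object)  # "c" was not seen during fit

pipeline.fit(X_train)
encoded = pipeline.transform(X_new)
```

Here "a" keeps its learned code, while the unseen "c" passes through the encoder as NaN and is then filled by the imputer.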
@mayer79 Couldn't you just run the encoder on the non-NaN values only to get the desired behavior?
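One way to read this suggestion is sketched below: build a mask of observed entries, encode only those, and leave NaN in the remaining positions. The column contents and the masking trick are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# A single categorical column with one missing entry (toy data).
column = np.array(["a", np.nan, "b", "a"], dtype=object)

# NaN is the only value that is not equal to itself, so this marks observed entries.
observed = np.array([value == value for value in column])

encoder = OrdinalEncoder()
encoded = np.full(column.shape[0], np.nan)
# Fit and transform only the observed rows; missing rows stay NaN.
encoded[observed] = encoder.fit_transform(
    column[observed].reshape(-1, 1)
).ravel()
```

The missing entry never reaches the encoder, so no decision is needed about where it would sit in the ordering.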
This PR adds support for unknown_value=np.nan in OrdinalEncoder. (Parameter was introduced in #17406 by @FelixWick)
CC @thomasjpfan @ogrisel
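A minimal sketch of the behavior described above, assuming a scikit-learn release that includes this change; the category names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# handle_unknown="use_encoded_value" is required for unknown_value to take effect.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(np.array([["cat"], ["dog"]], dtype=object))

# "bird" was not seen during fit, so it is encoded as NaN.
encoded = encoder.transform(np.array([["dog"], ["bird"]], dtype=object))
```

The output dtype must be a float type for NaN to be representable, which is the case with the encoder's default dtype.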