The Wayback Machine - https://web.archive.org/web/20210828160141/https://github.com/scikit-learn/scikit-learn/pull/18406

scikit-learn / scikit-learn

Sponsor

Sponsor scikit-learn/scikit-learn
Notifications
Star 47k
Fork 21.8k

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[MRG] Support unknown_value=np.nan in OrdinalEncoder #18406

Merged

rth merged 5 commits into scikit-learn:master from NicolasHug:oe_nan

Sep 23, 2020

Member

NicolasHug commented Sep 16, 2020

This PR adds support for unknown_value=np.nan in OrdinalEncoder.

(Parameter was introduced in #17406 by @FelixWick)

CC @thomasjpfan @ogrisel


          support np.nan

b495815

NicolasHug mentioned this pull request

ENH Add Categorical support for HistGradientBoosting #18394

Merged

github-actions bot added the module:preprocessing label

thomasjpfan reviewed

Member

thomasjpfan left a comment

Thank you for working on this!

sklearn/preprocessing/_encoders.py Outdated Show resolved Hide resolved

sklearn/preprocessing/_encoders.py Outdated Show resolved Hide resolved

NicolasHug added 3 commits


          use is_scalar_nan

4fc2758


          Merge branch 'master' of github.com:scikit-learn/scikit-learn into oe…

9e59dcd

…_nan


          disallow object dtype

28bf0db

thomasjpfan approved these changes

Member

thomasjpfan left a comment

LGTM

thomasjpfan approved these changes

Member

thomasjpfan left a comment

LGTM

cmarmo added the Waiting for Reviewer label


          whatsnew

acea21c

rth approved these changes

Member

rth left a comment

Thanks @NicolasHug !

rth merged commit 4aada4e into scikit-learn:master

21 checks passed

21 checks passed

LGTM analysis: C/C++ No code changes detected

Details

LGTM analysis: JavaScript No code changes detected

Details

LGTM analysis: Python No new or fixed alerts

Details

ci/circleci: deploy Your tests passed on CircleCI!

Details

ci/circleci: doc Your tests passed on CircleCI!

Details

ci/circleci: doc artifact Link to 0/doc/_changed.html

Details

ci/circleci: doc-min-dependencies Your tests passed on CircleCI!

Details

ci/circleci: lint Your tests passed on CircleCI!

Details

codecov/patch Codecov Report

Details

codecov/project Codecov Report

Details

scikit-learn.scikit-learn Build #20200922.15 succeeded

Details

scikit-learn.scikit-learn (Linting) Linting succeeded

Details

scikit-learn.scikit-learn (Linux py36_conda_openblas) Linux py36_conda_openblas succeeded

Details

scikit-learn.scikit-learn (Linux py36_ubuntu_atlas) Linux py36_ubuntu_atlas succeeded

Details

scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded

Details

scikit-learn.scikit-learn (Linux32 py36_ubuntu_atlas_32bit) Linux32 py36_ubuntu_atlas_32bit succeeded

Details

scikit-learn.scikit-learn (Linux_Runs pylatest_conda_mkl) Linux_Runs pylatest_conda_mkl succeeded

Details

scikit-learn.scikit-learn (Windows py36_pip_openblas_32bit) Windows py36_pip_openblas_32bit succeeded

Details

scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded

Details

scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded

Details

scikit-learn.scikit-learn (macOS pylatest_conda_mkl_no_openmp) macOS pylatest_conda_mkl_no_openmp succeeded

Details

jayzed82 added a commit to jayzed82/scikit-learn that referenced this pull request


          ENH Support unknown_value=np.nan in OrdinalEncoder (scikit-learn#18406)

ed0b81a

mayer79 commented Dec 4, 2020 •

edited

Excellent work. Just to clarify: Will the new options allow to both

keep np.nan as they are and
map new unknown categories to np.nan?

If yes, this will be super good news for fitting boosted trees!

Member Author

NicolasHug commented Dec 4, 2020

this PR supports 2 but 1 is still not supported. An error is raised when nans are present in the training data: it's unclear where to map them, as the output of OrdinalEncoder is supposed to be interpreted as ordered quantities.

Member Author

NicolasHug commented Dec 4, 2020

If yes, this will be super good news for fitting boosted trees

HistGradientBoostingClassifier and the correspoding regressor natively support both missing values (as nans) and categorical data now :)

https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting

Contributor

mfeurer commented Dec 4, 2020

It's great to see all the progress in handling categorical data with native scikit-learn components!

Is what @mayer79 asked in 1. what's proposed in #17123 ?

Member Author

NicolasHug commented Dec 4, 2020

yes, with 1 and 2 being inverted

mayer79 commented Dec 4, 2020

@NicolasHug : Thx for clarifying. From a practical perspective, it is not desirable that remaining nans would raise an error. If my subsequent model algorithm cannot natively deal with nans, we can simply add an imputer after the encoder and voila.

Contributor

FelixWick commented Dec 5, 2020

@mayer79 Couldn’t you just run the encoder for not nans only to get the desired behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment