The Wayback Machine - https://web.archive.org/web/20210828160141/https://github.com/scikit-learn/scikit-learn/pull/18406
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Support unknown_value=np.nan in OrdinalEncoder #18406

Merged
merged 5 commits into from Sep 23, 2020

Conversation

@NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Sep 16, 2020

This PR adds support for unknown_value=np.nan in OrdinalEncoder.

(Parameter was introduced in #17406 by @FelixWick)

CC @thomasjpfan @ogrisel

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Thank you for working on this!

sklearn/preprocessing/_encoders.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_encoders.py Outdated Show resolved Hide resolved
Copy link
Member

@thomasjpfan thomasjpfan left a comment

LGTM

Copy link
Member

@thomasjpfan thomasjpfan left a comment

LGTM

@rth
rth approved these changes Sep 23, 2020
Copy link
Member

@rth rth left a comment

Thanks @NicolasHug !

@rth rth merged commit 4aada4e into scikit-learn:master Sep 23, 2020
21 checks passed
21 checks passed
@lgtm-com
LGTM analysis: C/C++ No code changes detected
Details
@lgtm-com
LGTM analysis: JavaScript No code changes detected
Details
@lgtm-com
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
@circleci-artifacts-redirector
ci/circleci: doc artifact Link to 0/doc/_changed.html
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
@codecov
codecov/patch Codecov Report
Details
@codecov
codecov/project Codecov Report
Details
@azure-pipelines
scikit-learn.scikit-learn Build #20200922.15 succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linting) Linting succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux py36_conda_openblas) Linux py36_conda_openblas succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux py36_ubuntu_atlas) Linux py36_ubuntu_atlas succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux32 py36_ubuntu_atlas_32bit) Linux32 py36_ubuntu_atlas_32bit succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Linux_Runs pylatest_conda_mkl) Linux_Runs pylatest_conda_mkl succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Windows py36_pip_openblas_32bit) Windows py36_pip_openblas_32bit succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
@azure-pipelines
scikit-learn.scikit-learn (macOS pylatest_conda_mkl_no_openmp) macOS pylatest_conda_mkl_no_openmp succeeded
Details
jayzed82 added a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
@mayer79
Copy link

@mayer79 mayer79 commented Dec 4, 2020

Excellent work. Just to clarify: Will the new options allow to both

  1. keep np.nan as they are and
  2. map new unknown categories to np.nan?

If yes, this will be super good news for fitting boosted trees!

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Dec 4, 2020

this PR supports 2 but 1 is still not supported. An error is raised when nans are present in the training data: it's unclear where to map them, as the output of OrdinalEncoder is supposed to be interpreted as ordered quantities.

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Dec 4, 2020

If yes, this will be super good news for fitting boosted trees

HistGradientBoostingClassifier and the correspoding regressor natively support both missing values (as nans) and categorical data now :)

https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting

@mfeurer
Copy link
Contributor

@mfeurer mfeurer commented Dec 4, 2020

It's great to see all the progress in handling categorical data with native scikit-learn components!

Is what @mayer79 asked in 1. what's proposed in #17123 ?

@NicolasHug
Copy link
Member Author

@NicolasHug NicolasHug commented Dec 4, 2020

yes, with 1 and 2 being inverted

@mayer79
Copy link

@mayer79 mayer79 commented Dec 4, 2020

@NicolasHug : Thx for clarifying. From a practical perspective, it is not desirable that remaining nans would raise an error. If my subsequent model algorithm cannot natively deal with nans, we can simply add an imputer after the encoder and voila.

@FelixWick
Copy link
Contributor

@FelixWick FelixWick commented Dec 5, 2020

@mayer79 Couldn’t you just run the encoder for not nans only to get the desired behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

7 participants