Include drop='last' to OneHotEncoder #23436

WittmannF · 2022-05-20T21:19:45Z

Describe the workflow you want to enable

When using SimpleImputer + OneHotEncoder, I am able to add a new constant category for NaN values like the example below:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)

# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])

However, I wanted to have an argument like OneHotEncoder(drop='last') in order to have an output like:

array([[0., 1.],
       [1., 0.],
       [0., 0.]])

This would allow every NaN to be encoded as a row of all zeros.

Describe your proposed solution

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(drop='last'))])
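
For illustration only: drop='last' is the option being requested here, not an existing one. A rough sketch of the intended behaviour using the existing array form of drop might look like the following. It assumes the default fill value that SimpleImputer(strategy='constant') uses for string data ("missing_value") sorts after the real categories, so it ends up as the last category; the names probe and last_categories are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])

# Fill NaN with the imputer's default constant for string data.
imputed = SimpleImputer(strategy='constant').fit_transform(df)

# Fit a throwaway encoder only to discover the sorted categories per feature.
probe = OneHotEncoder().fit(imputed)
last_categories = [cats[-1] for cats in probe.categories_]

# Emulate the requested drop='last' by dropping the last category explicitly.
encoder = OneHotEncoder(drop=last_categories)
encoded = encoder.fit_transform(imputed).toarray()
print(encoded)
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]
```

This needs an extra fit outside the pipeline, which is the inconvenience the requested argument would remove.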

Describe alternatives you've considered, if relevant

There's no good alternative that stays compatible with sklearn's pipelines. I was following issue #11996, which proposed adding a handle_missing option to OneHotEncoder, but it has been passed over in favor of using a "constant" strategy on the categorical columns. However, the constant strategy adds an unnecessary new column that could be dropped in this scenario.

Additional context

No response

lesteve · 2022-05-25T05:47:14Z

There is a drop argument in OneHotEncoder to which you can pass an array (one category to drop for each feature). Can you use this for your use case? Adapting your snippet, something like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])
ohe = OneHotEncoder(drop=[np.nan])
ohe.fit_transform(df).toarray()

Output:

array([[0., 1.],
       [1., 0.],
       [0., 0.]])
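
The same idea can probably be carried over to the ColumnTransformer setup from the original post. The sketch below is an assumption about how that would look, not something tested in this thread: it leaves out the imputer so that NaN stays a category of its own, and sets sparse_threshold=0 only so that the result comes back as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])

preprocessor = ColumnTransformer(
    transformers=[
        # No imputer here: NaN is kept as its own category and then dropped.
        ('cat', OneHotEncoder(drop=[np.nan]), [0]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocessor.fit_transform(df)
print(encoded)
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]
```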

WittmannF added Needs Triage New Feature labels May 20, 2022