The Wayback Machine - https://web.archive.org/web/20201208095418/https://github.com/scikit-learn/scikit-learn/issues/18498
Document on how to use custom bin edges in KBinsDiscretizer #18498

Open
rachittoshniwal opened this issue Sep 30, 2020 · 11 comments · May be fixed by #18929
Comments

Contributor

@rachittoshniwal rachittoshniwal commented Sep 30, 2020

Describe the workflow you want to enable

A parameter accepting custom bin edges as an array.

Describe your proposed solution

Use pd.cut() under the hood or any other computationally efficient method.

Describe alternatives you've considered, if relevant

pandas' cut function.

Additional context

Say we need to bin ages into 'infant', 'kid', 'teen', 'adult', and 'senior citizen' using the edges [0, 1, 13, 20, 60, np.inf].

We currently can't do this with KBinsDiscretizer.

Is there a reason it hasn't been implemented?
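
For reference, here is a minimal sketch of the pandas alternative mentioned above. The sample ages are made up for illustration, and `right=False` is an assumption about the intended boundary behaviour (each bin includes its lower edge).

```python
import numpy as np
import pandas as pd

# Illustrative ages (made up), one per intended bin.
ages = pd.Series([0.5, 7, 15, 35, 72])

edges = [0, 1, 13, 20, 60, np.inf]
labels = ["infant", "kid", "teen", "adult", "senior citizen"]

# right=False makes each bin left-closed: [0, 1), [1, 13), [13, 20), ...
binned = pd.cut(ages, bins=edges, labels=labels, right=False)
print(binned.tolist())
```

Note that `pd.cut` is stateless: the same `edges` list has to be passed again by hand for every new dataset.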

@titigmr titigmr commented Sep 30, 2020

I think it is a good idea to create bins from np.array values, but we need to wait for an answer from a core developer on this issue. I can work on it if it is accepted.

Also, should the bin edges use a strictly-less-than or a less-than-or-equal comparison for the discretization?
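
To make the boundary question concrete, the sketch below uses `np.digitize`, whose `right` flag switches between the two conventions. The edge values come from the example earlier in the thread; the choice of `np.digitize` is only for illustration and says nothing about how a KBinsDiscretizer option would be implemented.

```python
import numpy as np

edges = np.array([0, 1, 13, 20, 60, np.inf])
ages = np.array([1, 13, 20, 60])  # values that sit exactly on an edge

# right=False (default): edges[i-1] <= x < edges[i],
# so a value on an edge falls into the bin that starts there.
left_closed = np.digitize(ages, edges, right=False) - 1

# right=True: edges[i-1] < x <= edges[i],
# so a value on an edge falls into the bin that ends there.
right_closed = np.digitize(ages, edges, right=True) - 1

print(left_closed)
print(right_closed)
```

With left-closed bins a 13-year-old counts as a 'teen'; with right-closed bins the same value counts as a 'kid'.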

Contributor

@glemaitre glemaitre commented Sep 30, 2020

One issue is that you will need to pass a list of arrays of length n_features. I don't know if it will be easy to deal with a high number of features.

ping @ogrisel

Contributor Author

@rachittoshniwal rachittoshniwal commented Sep 30, 2020

> One issue is that you will need to pass a list of arrays of length n_features. I don't know if it will be easy to deal with a high number of features.

Doesn't the same issue pop up with OrdinalEncoder when we pass in lists of ordered categories for each feature?
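
As an illustration of the comparison, this is how `OrdinalEncoder` already takes one list per feature through its `categories` parameter. The feature values below are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Two categorical features (invented values).
X = np.array([["low", "S"], ["high", "L"], ["medium", "M"]], dtype=object)

# One ordered list of categories per feature, in column order.
categories = [["low", "medium", "high"], ["S", "M", "L"]]

enc = OrdinalEncoder(categories=categories)
codes = enc.fit_transform(X)
print(codes)
```

A per-feature list of bin edges would follow the same shape: a list of length n_features, one array per column.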

Secondly, the user was going to use pd.cut() anyway, passing in the bin edges one feature at a time and doing so manually for the train and test sets separately.

A built-in option would simplify things, because the transform step could then be automated for all the binned features of the test set.

scikit-learn would remember the bin edges from fit and apply them to the test set in transform.
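
The fit-then-transform behaviour described here already exists for edges that KBinsDiscretizer computes itself. The sketch below, on made-up data, shows that the fitted `bin_edges_` attribute holds one array per feature and is reused on new data. Only the ability to supply those edges by hand is missing.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up training and test data with two features.
X_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0], [6.0, 40.0], [8.0, 50.0]])
X_test = np.array([[1.0, 45.0], [7.0, 12.0]])

disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
disc.fit(X_train)

# One array of edges per feature, learned from the training data only.
for i, feature_edges in enumerate(disc.bin_edges_):
    print(i, feature_edges)

# The stored edges are applied to the test set; nothing is re-estimated.
codes = disc.transform(X_test)
print(codes)
```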

Contributor

@lorentzenchr lorentzenchr commented Oct 2, 2020

I'm in favour of offering a manual setup for bins. In my experience, this is important for use cases where you want to bin just one or a few features yourself, often together with a ColumnTransformer.

The SplineTransformer of PR #18368 supports this, i.e. manually specifying bins (there it is knot positions). Note that SplineTransformer(degree=0, n_knots=n_bins+1) is equivalent to KBinsDiscretizer(n_bins=n_bins, encode='onehot-dense').

Member

@jnothman jnothman commented Oct 4, 2020

Contributor

@lorentzenchr lorentzenchr commented Oct 6, 2020

It would just be an additional argument. If we decide against it, it would be good to give an example with the FunctionTransformer.
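
One possible shape for such an example is sketched below. The helper name `digitize_ages` and the sample data are hypothetical, and using `np.digitize` is a choice made here rather than something decided in the thread.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

edges = np.array([0, 1, 13, 20, 60, np.inf])


def digitize_ages(X, bins):
    """Hypothetical helper: map each value to the index of its left-closed bin."""
    return np.digitize(X, bins, right=False) - 1


# kw_args carries the custom edges, so train and test use the same bins.
binner = FunctionTransformer(digitize_ages, kw_args={"bins": edges})

X_train = np.array([[0.5], [7.0], [35.0]])
X_test = np.array([[15.0], [72.0]])

train_codes = binner.fit_transform(X_train)
test_codes = binner.transform(X_test)
print(train_codes.ravel(), test_codes.ravel())
```

Because the edges live inside the transformer, it can be placed in a Pipeline and reused at prediction time without repeating the bin definition.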

Contributor

@glemaitre glemaitre commented Oct 22, 2020

I would be in favour of using the FunctionTransformer. An example seems the best way. I think that it should go to the user guide.
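
A user-guide example along these lines might combine FunctionTransformer with a ColumnTransformer so that only selected columns are binned. The sketch below is one possible version, not the wording of any eventual documentation. The column names, the data, and the helper `cut_ages` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

AGE_EDGES = [0, 1, 13, 20, 60, np.inf]


def cut_ages(X):
    """Hypothetical helper: X is a one-column DataFrame; return integer bin codes."""
    return X.apply(
        lambda col: pd.cut(col, bins=AGE_EDGES, labels=False, right=False)
    )


# Made-up data: only "age" is binned, "income" is passed through unchanged.
df = pd.DataFrame({"age": [0.5, 7, 15, 35, 72], "income": [0, 0, 5, 50, 30]})

ct = ColumnTransformer(
    [("age_bins", FunctionTransformer(cut_ages), ["age"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)
print(out)
```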

@glemaitre glemaitre added Documentation and removed New Feature labels Oct 22, 2020
@glemaitre glemaitre changed the title Custom bin edges in KBinsDiscretizer Document on how to use custom bin edges in KBinsDiscretizer Oct 22, 2020
@bhargavasomya bhargavasomya commented Nov 1, 2020

Hello @glemaitre! I am pretty new to this and would like to contribute. Is this issue still open, or is someone already working on it? Thanks.

Contributor

@glemaitre glemaitre commented Nov 10, 2020

@bhargavasomya sorry for the delayed answer. The issue is still open and you can submit a pull request.

@hitesh9116 hitesh9116 commented Nov 23, 2020

I would also like to work on this issue. If no one is working on it, can I?

Contributor

@glemaitre glemaitre commented Nov 23, 2020

@bhargavasomya are you working on the issue?

8 participants