Document on how to use custom bin edges in KBinsDiscretizer #18498
Comments
I think it is a good idea to create bins from Also, should the bin edges use a strictly-less-than or a less-than-or-equal comparison for the discretization?
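On the edge question, NumPy's `np.digitize` exposes both conventions through its `right` flag. A minimal sketch, assuming the age edges from the issue description; the variable names are illustrative only, and this is not what KBinsDiscretizer necessarily does internally:

```python
import numpy as np

# Edges taken from the age example in the issue description.
edges = np.array([0, 1, 13, 20, 60, np.inf])
ages = np.array([0.5, 13.0, 20.0, 59.9])

# right=False: a value x lands in bin i when edges[i] <= x < edges[i + 1].
left_closed = np.digitize(ages, edges, right=False) - 1

# right=True: a value x lands in bin i when edges[i] < x <= edges[i + 1].
right_closed = np.digitize(ages, edges, right=True) - 1

print(left_closed)   # a value equal to an edge goes to the upper bin
print(right_closed)  # a value equal to an edge goes to the lower bin
```

The two results differ only for values that sit exactly on an edge (13 and 20 here), which is the case the question is about.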
One issue is that you will need to pass a list of arrays of length ping @ogrisel
Doesn't the same issue pop up with OrdinalEncoder when we pass in lists of ordered categories for each feature? Secondly, the user was going to use pd.cut() anyway, passing in a list of arrays one feature at a time and doing it manually for the train and test sets separately. It will simplify things for them as they can now automate the scikit-learn will remember the bin edges from the
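To make the "one array of edges per feature, remembered across train and test" idea concrete, here is a rough sketch of a small custom transformer. The class name `FixedEdgesDiscretizer` and its `bin_edges` parameter are hypothetical and not part of scikit-learn; the only library pieces assumed are `BaseEstimator`, `TransformerMixin` and `np.digitize`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FixedEdgesDiscretizer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: bins each feature with user-supplied edges."""

    def __init__(self, bin_edges):
        # bin_edges: list with one sorted sequence of edges per feature.
        self.bin_edges = bin_edges

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if len(self.bin_edges) != X.shape[1]:
            raise ValueError("bin_edges needs one array of edges per feature")
        # Nothing is learned from the data; the edges are only validated and stored.
        self.bin_edges_ = [np.asarray(e, dtype=float) for e in self.bin_edges]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        columns = []
        for j, edges in enumerate(self.bin_edges_):
            # Only interior edges are used, so out-of-range values are
            # clipped into the first or last bin (a design assumption).
            columns.append(np.digitize(X[:, j], edges[1:-1], right=False))
        return np.column_stack(columns)


X_train = np.array([[0.5, 10.0], [7.0, 75.0], [35.0, 60.0]])
X_test = np.array([[13.0, 49.0], [70.0, 50.0]])

disc = FixedEdgesDiscretizer(bin_edges=[[0, 1, 13, 20, 60, np.inf], [0, 50, 100]])
train_codes = disc.fit_transform(X_train)
test_codes = disc.transform(X_test)  # same edges reused on the test set
```

Because the edges live on the fitted object, the same instance can sit in a Pipeline and be applied to train and test data without repeating the binning by hand.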
I'm in favour of offering a manual setup for bins. In my experience, this is important for use cases where there are just one or a few features that you want to bin yourself, often together with a The
One reason we have not supported manual bins is that it should be equivalently available using FunctionTransformer with something like pd.cut. Do we need to add a transformer for this purpose?
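For reference, the FunctionTransformer route could look roughly like the following. This is a sketch under assumptions: the helper `cut_columns` and the constant `AGE_EDGES` are made-up names, the same edges are applied to every column, and bins are treated as left-closed (`right=False`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Edges from the age example in the issue description.
AGE_EDGES = [0, 1, 13, 20, 60, np.inf]


def cut_columns(X):
    """Replace each column of X with integer bin codes computed by pd.cut."""
    X = np.asarray(X, dtype=float)
    codes = [
        # labels=False makes pd.cut return bin indices instead of intervals.
        pd.cut(X[:, j], bins=AGE_EDGES, right=False, labels=False)
        for j in range(X.shape[1])
    ]
    return np.column_stack(codes)


binner = FunctionTransformer(cut_columns)

ages = np.array([[0.5], [7.0], [15.0], [35.0], [70.0]])
binned = binner.fit_transform(ages)
```

Since the edges are fixed up front, nothing is learned in `fit`, and the same `binner` gives identical bins on train and test data.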
It would just be an additional argument. If we decide against it, it would be good to give an example with the
I would be in favour of using the
Hello @glemaitre! I am pretty new to this and would like to contribute. Is this issue still open, or is someone already working on it? Thanks.
@bhargavasomya sorry for the delay in answering. The issue is still open and you can submit a pull request.
Sir, I also want to work on this issue. If no one is working on it, can I?
@bhargavasomya are you working on the issue?
Describe the workflow you want to enable
A parameter accepting custom bin edges as an array.
Describe your proposed solution
Use pd.cut() under the hood or any other computationally efficient method.
Describe alternatives you've considered, if relevant
pandas' cut function.
Additional context
Say we need to bin ages into 'infant', 'kid', 'teen', 'adult', 'senior citizen' using the edges [0, 1, 13, 20, 60, np.inf].
We can't do it using KBinsDiscretizer currently.
Is there any reason behind not implementing it?
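For illustration, this is roughly what the pandas alternative looks like for the age example. The sample ages are made up, and `right=False` (left-closed bins) is an assumption about how the edges should be read:

```python
import numpy as np
import pandas as pd

edges = [0, 1, 13, 20, 60, np.inf]
names = ["infant", "kid", "teen", "adult", "senior citizen"]

# Made-up sample ages, one per intended group.
ages = np.array([0.5, 7, 15, 35, 70])

# right=False reads the edges as [0, 1), [1, 13), [13, 20), [20, 60), [60, inf).
groups = pd.cut(ages, bins=edges, labels=names, right=False)
print(list(groups))
```

The request is to get the same result from a scikit-learn transformer so that it can be used inside a Pipeline.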