The Wayback Machine - https://web.archive.org/web/20220530203545/https://github.com/scikit-learn/scikit-learn/issues/14257
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Group aware Time-based cross validation #14257

Open
ogrisel opened this issue Jul 4, 2019 · 36 comments · May be fixed by #16236
Open

Feature request: Group aware Time-based cross validation #14257

ogrisel opened this issue Jul 4, 2019 · 36 comments · May be fixed by #16236
Assignees
Labels
Enhancement good first issue Moderate module:model_selection New Feature

Comments

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Jul 4, 2019

Basically combining TimeSeriesSplit with the Group awareness of other CV strategies such as GroupKFold.

I think it's a good first issue for first time contributors that are already familiar with the existing cross validation tools in scikit-learn:

https://scikit-learn.org/stable/modules/cross_validation.html

Source code is here:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py

@ogrisel ogrisel added Enhancement good first issue help wanted Moderate New Feature Sprint labels Jul 4, 2019
@souravsingh
Copy link
Contributor

@souravsingh souravsingh commented Jul 4, 2019

@ogrisel I am interested in working on this

@ogrisel
Copy link
Member Author

@ogrisel ogrisel commented Jul 5, 2019

Feel free to give it a try:

  • start by reading the documentation and the code of the existing CV split strategies,
  • then start by writing the tests for the new strategy and issue and early [WIP] PR to get feedback on the tests before writing the implementation itself.

Good luck!

@aditya1702
Copy link
Contributor

@aditya1702 aditya1702 commented Jul 11, 2019

@souravsingh Are you working on this?

@souravsingh
Copy link
Contributor

@souravsingh souravsingh commented Jul 11, 2019

@aditya1702 Yes, I am. I will be putting up a WIP PR.

@abhishek-jana
Copy link
Contributor

@abhishek-jana abhishek-jana commented Jul 12, 2019

@ogrisel I'd like to work on this issue.

@amueller
Copy link
Member

@amueller amueller commented Jul 12, 2019

I think this is way more complex than what we usually tag as "good first issue" which are usually much much smaller.

@ManishAradwad
Copy link
Contributor

@ManishAradwad ManishAradwad commented Aug 5, 2019

Is anyone working on this one?? I'd like to help..

@Pimpwhippa
Copy link

@Pimpwhippa Pimpwhippa commented Jan 26, 2020

@getgaurav2 ah, you already put up the new WIP with the implementation code. Actually I am working on this (after an ok from @mfcabrera a few days ago in the pull request linked). And I took time to understand the feature requirements, which I still would like to have it confirmed here. So I'll ask for your suggestions anyway.

So, it has to take the Group effect onto the TimeSeriesSplit. For example, if my data has groups A, B, and C, for the first split, say only group C will be in the test set, and those C samples also have to be in the future of the samples from group A and B that are in training set. (Here, does it mean part of group A and B that are not in the past of C test set will have to be excluded?)

For the second split, say group B is the test set, and as time increases, the size of group A and C in the training set will be larger, because the nature of TimeSeriesSplit. (Here, if the number of folds are not more than number of groups, will it mean the splits will give more weights to the groups that will be in the training set later?) maybe your code answer this already. I will go read it carefully.

Since you already have the implementation code written, I don't know what's next now. I would still like to do it, but if I'm too slow and you would like to have this feature done soon please let me know how should we proceed. There's also the test code that should be fixed. @ogrisel

@getgaurav2
Copy link
Contributor

@getgaurav2 getgaurav2 commented Jan 26, 2020

@Pimpwhippa - this PR was just a way to save my work . I realized later that you were already on it. Apologies for the confusion .
My 2 cents on your questions:

  • Training set shall include Group A and B and test shall include Group C . Test C shall always be at a later time than A and B . To answer your ques more directly - "does it mean part of group A and B that are not in the past of C test set will have to be excluded" - No .
  • Yes
  • I would be happy to work with you if want to come at it from the testing side.

@Pimpwhippa
Copy link

@Pimpwhippa Pimpwhippa commented Jan 26, 2020

@getgaurav2 Thank you for your answers. Yes I will be happy to work with you looking at the testing side. Can I make changes from mfcabrera s version and try to pass the test without taking into account your implementation code? Did his code fail the checks just because there was no implementation code? So I can make changes of the testing side only. Sorry for my newbie question. Just want to know what to do next.

@getgaurav2
Copy link
Contributor

@getgaurav2 getgaurav2 commented Jan 26, 2020

@Pimpwhippa yes - his test was failing because of no implementation . I agree with his test cases otherwise ( just by looking at the code so far )
To begin purely on the test side , You can hard code an expected o/p of the split function and make sure it passes in local .

Eventually , the split function from my implementation should replace your hard coded expected o/p and pass the test cases .

@Pimpwhippa
Copy link

@Pimpwhippa Pimpwhippa commented Jan 26, 2020

@getgaurav2 Alright. Will try so.

@Pimpwhippa
Copy link

@Pimpwhippa Pimpwhippa commented Feb 18, 2020

@getgaurav2 if you have any comments on the PR please let me know. Should I add/change anything? I'll continue working on linting and documentation in the meantime.

getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 18, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 18, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 19, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 19, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 24, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 24, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Feb 24, 2020
@getgaurav2
Copy link
Contributor

@getgaurav2 getgaurav2 commented Jun 5, 2020

@getgaurav2 All checks have passed also on my PR #16464!
So now, besides waiting for the core-dev to come see, I can try to call the function in my test code? What else can I do?

@Pimpwhippa Let's call the function in your test cases and check for any failures - Thanks .

@Pimpwhippa
Copy link

@Pimpwhippa Pimpwhippa commented Jun 13, 2020

Hi @getgaurav2,

The first two tests failed because ValueError: Cannot have more folds than groups. The third test passes because that’s what it supposes to say. The fourth one got AssertionError, not sure yet why. The last test on max_train_size also failed on the same ValueError.

I summarize all my points so that I check my understanding, and for us to discuss them further.

  1. let’s say groups is as in your docstring;

groups = [‘a’, ‘b’, ‘c’, ‘a’, ‘b’, ‘f’, ‘g’, ‘d’, ‘f’]
n_splits = 2

your result = Train: [ 0 1 2 3 4 5], Test [6 7]

my expected result = Train [0 2 3], Test [4]
Train [0 1 2 3 4 6 7], Test [8]

(I did not look at your code at all until I finished mine.)

  1. Should it fail or should it work if groups is None? Looking at your code, it seems to be the requirement. Don’t you think it could still work without groups? That's what we said at the beginning.

  2. In your docstring, both examples your n_splits = 2. shouldn’t at least one example n_splits = 5 as it is the default?

  3. I don’t understand the difference between n_folds and n_splits. Why n_folds = n_splits + 1? Could you please explain?

  4. I am not sure how the relationship between n_folds (or n_splits) and n_groups should be. Hence, all the tests are failing because of this 'Cannot have numbers of folds greater than numbers of groups'.

If we think of a real use case, let’s say default n_splits = 5, and let’s say in your definition n_folds = 6, doesn’t that mean to use this feature, your dataset has to have at least 6 groups? What if a dataset has 2-3 groups, which could be the case, maybe even more likely than a dataset with 6 groups?

Should we divide the cases to deal with n_folds (or n_splits) in relation to n_groups?
e.g. if n_splits = 5 is the default, we’ll deal with n_groups in different ways
5.1 n_groups = 2 (n_splits > n_groups)
5.2 n_groups = 5 ( n_splits = n_groups)
5.3 n_groups = 7 (n_splits < n_groups)
(same way TimeSeriesSplit uses to check max_train_size)

  1. my test ‘each group has to be test group at least once’ should pass when n_folds <= n_groups as it is now, but I still have AssertionError. I'll have to see in details again why, or I can write new tests that work according to test cases in 5. if we agree.

  2. I would just start like this
    test_size = (n_samples // n_splits)
    test_starts = range(test_size, n_samples, test_size)
    and then you have to take the group of test set out from the training set

Please let me know if I missed something obvious. Thank you.

@jnothman
Copy link
Member

@jnothman jnothman commented Jun 13, 2020

@Pimpwhippa
Copy link

@Pimpwhippa Pimpwhippa commented Jun 15, 2020

@jnothman Thank you for dropping by to give the guidance.

getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 5, 2020
… same number of splits as in n-split argument.
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 6, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 6, 2020
getgaurav2 pushed a commit to getgaurav2/scikit-learn that referenced this issue Sep 29, 2020
@jorijnsmit
Copy link

@jorijnsmit jorijnsmit commented Dec 3, 2020

Hey guys, can I ask what the progress is on this? I having written a working iterator which gets through all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need some proper docstringing(, linting?) and some critical review of course but maybe I could contribute?

@getgaurav2
Copy link
Contributor

@getgaurav2 getgaurav2 commented Dec 3, 2020

Hey guys, can I ask what the progress is on this? I having written a working iterator which gets through all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need some proper docstringing(, linting?) and some critical review of course but maybe I could contribute?

#16236 is waiting on a reviewer at this point.

@jorijnsmit
Copy link

@jorijnsmit jorijnsmit commented Dec 3, 2020

Ah it moved! Thanks for the #, that code looks great! Do you think it'll make 0.24?

@getgaurav2
Copy link
Contributor

@getgaurav2 getgaurav2 commented Dec 4, 2020

Ah it moved! Thanks for the #, that code looks great! Do you think it'll make 0.24?

I Hope it makes it into 0.24 . @albertvillanova might be able to share some more info .

@labdmitriy
Copy link

@labdmitriy labdmitriy commented Dec 16, 2020

Hello,
Is this feature completed and will be included in future releases?
I have no experience in contribution yet, but I wrote from scratch a class for grouped time series split with parameters, like test size, train size, number of splits, gap, shift size and window modes, and also different arguments checking and calculations.
Also it is pretty fast, because it uses only sklearn helper functions like indexable and _num_samples and numpy.
So if the current implementation will be included, I will try to write an article about that, but if not - I would happy to try to contribute my implementation in sklearn, if it is possible.
Thank you.

@devin-moonrise
Copy link

@devin-moonrise devin-moonrise commented Dec 16, 2020

@labdmitriy When you say window mode, are you talking about rolling equal length windows for timeseries CV? I was thinking about starting to work on a PR for that, but if it's already in the pipe-ish I won't.

Hello,
Is this feature completed and will be included in future releases?
I have no experience in contribution yet, but I wrote from scratch a class for grouped time series split with parameters, like test size, train size, number of splits, gap, shift size and window modes, and also different arguments checking and calculations.
Also it is pretty fast, because it uses only sklearn helper functions like indexable and _num_samples and numpy.
So if the current implementation will be included, I will try to write an article about that, but if not - I would happy to try to contribute my implementation in sklearn, if it is possible.
Thank you.

@labdmitriy
Copy link

@labdmitriy labdmitriy commented Dec 17, 2020

Yes, I added option to choose rolling or expanding window, also It works with sklearn GridSearchCV.
But I am not sure that my implementation will be interesting due to current implementation of this Feature Request, so your PR probably may be relevant.

@labdmitriy
Copy link

@labdmitriy labdmitriy commented Dec 24, 2020

Hello @ogrisel @jnothman ,
Could you please comment the question above (about alternative implementation of this feature)? Just to make sure that it is not relevant now due to the current implementation by @getgaurav2 and if not - I will plan to publish my implementaion somewhere else.
Thank you.

@jnothman
Copy link
Member

@jnothman jnothman commented Dec 29, 2020

Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time-series as within it primary scope...

@jnothman
Copy link
Member

@jnothman jnothman commented Dec 29, 2020

I think when we first discussed this, I did not appreciate that a primary need was to handle non-overlapping groups; #16236 assumes that each group is contiguous in the sample ordering. Was this your intention @ogrisel?

@labdmitriy
Copy link

@labdmitriy labdmitriy commented Dec 29, 2020

Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time-series as within it primary scope...

Hi @jnothman,
Thank you for your answer.
I mean that it is possible to create more general grouped time series, but if there is no need - no problem, I will share my implementation in the article later.
Thank you.

@labdmitriy
Copy link

@labdmitriy labdmitriy commented Nov 21, 2021

Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time-series as within it primary scope...

Hi @jnothman, Thank you for your answer. I mean that it is possible to create more general grouped time series, but if there is no need - no problem, I will share my implementation in the article later. Thank you.

Hi @jnothman,

I've shared my implementation of group time series cross-validation which is compatible with sklearn.
Implementation contains ideas from different libraries and also additional functionality.
Implemented using Python standard library + numpy + sklearn helper functions.
I hope that it will be useful for implementation in sklearn, and I will be glad to contribute it to sklearn if it would be possible.

Article: https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc
GitHub repository with source code: https://github.com/labdmitriy/ml-lab

Update (2022-05-27): enhanced version of this implementation now is the part of the mlxtend library.

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement good first issue Moderate module:model_selection New Feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.