Feature request: Group-aware time-based cross-validation #14257
Comments
@ogrisel I am interested in working on this.
Feel free to give it a try. Good luck!
@souravsingh Are you working on this?
@aditya1702 Yes, I am. I will be putting up a WIP PR.
@ogrisel I'd like to work on this issue.
I think this is way more complex than what we usually tag as "good first issue", which are usually much, much smaller.
Is anyone working on this one? I'd like to help.
@getgaurav2 Ah, you already put up a new WIP PR with the implementation code. I am actually working on this (after an OK from @mfcabrera a few days ago in the linked pull request), and I took time to understand the feature requirements, which I would still like to have confirmed here, so I'll ask for your suggestions anyway.

It has to add group awareness to TimeSeriesSplit. For example, if my data has groups A, B, and C, then for the first split only group C might be in the test set, and those C samples also have to be in the future of the A and B samples that are in the training set. (Does that mean the parts of groups A and B that are not in the past of the C test set have to be excluded?) For the second split, say group B is the test set; as time increases, the number of A and C samples in the training set grows, because of the nature of TimeSeriesSplit. (If the number of folds is not more than the number of groups, does that mean the splits give more weight to the groups that enter the training set later?) Maybe your code answers this already; I will go read it carefully.

Since you already have the implementation code written, I don't know what's next. I would still like to do it, but if I'm too slow and you would like to have this feature done soon, please let me know how we should proceed. There's also the test code that should be fixed. @ogrisel
@Pimpwhippa This PR was just a way to save my work. I realized later that you were already on it. Apologies for the confusion.
@getgaurav2 Thank you for your answers. Yes, I will be happy to work with you on the testing side. Can I make changes from @mfcabrera's version and try to pass the tests without taking your implementation code into account? Did his code fail the checks just because there was no implementation code? Then I could make changes on the testing side only. Sorry for my newbie question; I just want to know what to do next.
@Pimpwhippa Yes, his tests were failing because there was no implementation. I agree with his test cases otherwise (just from looking at the code so far). Eventually, the split function from my implementation should replace your hard-coded expected output and pass the test cases.
@getgaurav2 Alright, I will try that.
@getgaurav2 If you have any comments on the PR please let me know. Should I add/change anything? I'll continue working on linting and documentation in the meantime.
@Pimpwhippa Let's call the function in your test cases and check for any failures. Thanks.
Hi @getgaurav2, the first two tests failed with ValueError: Cannot have more folds than groups. The third test passes because that's what it is supposed to raise. The fourth one got an AssertionError; I'm not sure yet why. The last test, on max_train_size, also failed with the same ValueError. I'll summarize all my points so that I can check my understanding and so we can discuss them further.
groups = ['a', 'b', 'c', 'a', 'b', 'f', 'g', 'd', 'f']; your result: Train [0 1 2 3 4 5], Test [6 7]; my expected result: Train [0 2 3], Test [4]. (I did not look at your code at all until I finished mine.)
If we think of a real use case: say the default n_splits = 5, which in your definition means n_folds = 6. Doesn't that mean that to use this feature, a dataset has to have at least 6 groups? What if a dataset has only 2-3 groups, which could be the case, maybe even more likely than a dataset with 6 groups? Should we handle the different cases of n_folds (or n_splits) relative to n_groups separately?
Please let me know if I missed something obvious. Thank you.
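To make the constraint behind that error message concrete, here is a tiny sketch assuming the TimeSeriesSplit-style convention that n_folds = n_splits + 1; this convention is an assumption inferred from the discussion above, not a confirmed API:

```python
import numpy as np

# Groups from the example above; there are 6 unique groups: a, b, c, d, f, g.
groups = np.array(['a', 'b', 'c', 'a', 'b', 'f', 'g', 'd', 'f'])
n_groups = len(np.unique(groups))   # 6

n_splits = 5                        # assumed default
n_folds = n_splits + 1              # 6, under the assumed convention

if n_folds > n_groups:
    raise ValueError("Cannot have more folds than groups.")

# With 6 unique groups, n_splits = 5 is the largest value that does not raise;
# a dataset with only 2-3 groups would be limited to n_splits of 1 or 2.
```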
In time series cross-validation, some training data cannot be used elsewhere as test data, since we require that all test data follows (though not necessarily immediately) its corresponding training data in time. n_folds is really counting the number of data subsets used across the various splits, including training data that is never used for testing.
You should understand the need for this algorithm in terms of what requirements there are on each split, and then understand the parameters and behaviours in terms of them:

* CV: every test set must not overlap with its corresponding training set
* ?KFold-style CV: each test set must share no samples with another test set (this criterion may be removed as long as there is the possibility to use and reuse every test sample, as in ShuffleSplit)
* Time: every test sample must not precede any train sample in the dataset
* Groups: if a sample from a group is in the training set, no sample from that group can be in the test set

Does the algorithm satisfy these criteria? Are these criteria reasonably well tested? Is it parameterised in a reasonable way for users who need these criteria upheld for their task?
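A minimal sketch of one way a splitter could satisfy all four criteria above, assuming samples are already ordered in time and a TimeSeriesSplit-style n_folds = n_splits + 1 convention; this is only an illustration of the requirements, not the implementation proposed in the linked PRs:

```python
import numpy as np

def group_time_series_split(groups, n_splits):
    """Illustrative group-aware time series splitter (sketch only)."""
    groups = np.asarray(groups)
    # Unique groups ordered by first appearance, i.e. by time.
    _, first_idx = np.unique(groups, return_index=True)
    ordered_groups = groups[np.sort(first_idx)]
    n_groups = len(ordered_groups)

    # Assumed convention: n_folds = n_splits + 1 group blocks.
    if n_splits + 1 > n_groups:
        raise ValueError("Cannot have more folds than groups.")

    for i in range(n_groups - n_splits, n_groups):
        test_group = ordered_groups[i]
        train_groups = ordered_groups[:i]

        # Groups criterion: the test group never appears in its training set.
        test_idx = np.where(groups == test_group)[0]
        train_idx = np.where(np.isin(groups, train_groups))[0]

        # Time criterion: keep only training samples that strictly precede
        # the earliest test sample of this fold.
        train_idx = train_idx[train_idx < test_idx.min()]
        yield train_idx, test_idx
```

Each test set in this sketch is a single group, so test sets never share samples (the KFold-style criterion); whether the discarded "future" samples of earlier groups should instead be reused, and whether rolling windows should be supported, are exactly the open questions in this thread.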
@jnothman Thank you for dropping by to give us guidance.
Hey guys, can I ask what the progress is on this? I have written a working iterator which passes all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need proper docstrings (and linting?) and some critical review of course, but maybe I could contribute?
#16236 is waiting on a reviewer at this point. |
Ah, it moved! Thanks for the #, that code looks great! Do you think it'll make the next release?
I hope it makes it into 0.24. @albertvillanova might be able to share some more info.
Hello,
@labdmitriy When you say window mode, are you talking about rolling equal-length windows for time series CV? I was thinking about starting to work on a PR for that, but if it's already in the pipeline I won't.
Yes, I added an option to choose a rolling or expanding window; it also works with sklearn's GridSearchCV.
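For readers unfamiliar with the distinction, here is a small illustration using the existing TimeSeriesSplit (no groups involved): by default the training window expands with each split, while capping max_train_size gives an approximately fixed-length rolling window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# Expanding window: each training set contains all earlier samples.
expanding = TimeSeriesSplit(n_splits=3)

# Rolling window: max_train_size caps the length of the training window.
rolling = TimeSeriesSplit(n_splits=3, max_train_size=4)

for (tr_e, te_e), (tr_r, te_r) in zip(expanding.split(X), rolling.split(X)):
    print("expanding train:", tr_e, "test:", te_e)
    print("rolling   train:", tr_r, "test:", te_r)
```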
Hello @ogrisel @jnothman,
Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time series as within its primary scope...
Hi @jnothman, I've shared my implementation of group time series cross-validation, which is compatible with sklearn. Article: https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc Update (2022-05-27): an enhanced version of this implementation is now part of the mlxtend library. Thank you.
Basically combining TimeSeriesSplit with the group awareness of other CV strategies such as GroupKFold.

I think it's a good first issue for first time contributors that are already familiar with the existing cross validation tools in scikit-learn:
https://scikit-learn.org/stable/modules/cross_validation.html

Source code is here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py
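To make the gap concrete, a small sketch (the toy data and group labels are illustrative) of the two existing tools the request wants to combine: GroupKFold keeps whole groups together but ignores temporal order, while TimeSeriesSplit respects temporal order but is unaware of groups, so samples from one group can end up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data, assumed to be ordered in time; group labels are illustrative.
X = np.arange(8).reshape(-1, 1)
groups = np.array(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'])

# GroupKFold: each group appears in exactly one test fold and never leaks
# into that fold's training set, but time order is ignored.
for train, test in GroupKFold(n_splits=3).split(X, groups=groups):
    print("GroupKFold      train:", train, "test:", test)

# TimeSeriesSplit: training data always precedes test data, but group 'a'
# (for example) can appear in both the training and the test set of a fold.
for train, test in TimeSeriesSplit(n_splits=3).split(X):
    print("TimeSeriesSplit train:", train, "test:", test)
```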