Feature request: Group-aware time-based cross-validation #14257
Comments
@ogrisel I am interested in working on this.
Feel free to give it a try. Good luck!
@souravsingh Are you working on this?
@aditya1702 Yes, I am. I will be putting up a WIP PR.
@ogrisel I'd like to work on this issue.
I think this is way more complex than what we usually tag as "good first issue", which are usually much, much smaller.
Is anyone working on this one? I'd like to help.
@getgaurav2 Ah, you already put up a new WIP PR with the implementation code. I am actually working on this (after an OK from @mfcabrera a few days ago in the linked pull request), and I took time to understand the feature requirements, which I would still like to have confirmed here, so I'll ask for your suggestions anyway.

It has to add group awareness to TimeSeriesSplit. For example, if my data has groups A, B, and C, then for the first split only group C might be in the test set, and those C samples also have to be in the future of the A and B samples that are in the training set. (Does that mean the parts of groups A and B that are not in the past of the C test set have to be excluded?) For the second split, say group B is the test set; as time increases, the number of A and C samples in the training set grows, because of the nature of TimeSeriesSplit. (If the number of folds is not more than the number of groups, does that mean the splits give more weight to the groups that enter the training set later?) Maybe your code answers this already; I will go read it carefully.

Since you already have the implementation code written, I don't know what's next. I would still like to do it, but if I'm too slow and you would like to have this feature done soon, please let me know how we should proceed. There's also the test code that should be fixed. @ogrisel
@Pimpwhippa This PR was just a way to save my work. I realized later that you were already on it. Apologies for the confusion.
@getgaurav2 Thank you for your answers. Yes, I will be happy to work with you on the testing side. Can I make changes from @mfcabrera's version and try to pass the tests without taking your implementation code into account? Did his code fail the checks just because there was no implementation code? Then I could make changes on the testing side only. Sorry for my newbie question; I just want to know what to do next.
@Pimpwhippa Yes, his tests were failing because there was no implementation. I agree with his test cases otherwise (just from looking at the code so far). Eventually, the split function from my implementation should replace your hard-coded expected output and pass the test cases.
@getgaurav2 Alright, I will try that.
@getgaurav2 If you have any comments on the PR please let me know. Should I add/change anything? I'll continue working on linting and documentation in the meantime.
@Pimpwhippa Let's call the function in your test cases and check for any failures. Thanks.
Hi @getgaurav2, the first two tests failed with ValueError: Cannot have more folds than groups. The third test passes because that's what it is supposed to raise. The fourth one got an AssertionError; I'm not sure yet why. The last test, on max_train_size, also failed with the same ValueError. I'll summarize all my points so that I can check my understanding and so we can discuss them further.
groups = ['a', 'b', 'c', 'a', 'b', 'f', 'g', 'd', 'f']; your result: Train [0 1 2 3 4 5], Test [6 7]; my expected result: Train [0 2 3], Test [4]. (I did not look at your code at all until I finished mine.)
If we think of a real use case: say the default n_splits = 5, which in your definition means n_folds = 6. Doesn't that mean that to use this feature, a dataset has to have at least 6 groups? What if a dataset has only 2-3 groups, which could be the case, maybe even more likely than a dataset with 6 groups? Should we handle the different cases of n_folds (or n_splits) relative to n_groups separately?
Please let me know if I missed something obvious. Thank you.
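To make the constraint behind that error message concrete, here is a tiny sketch assuming the TimeSeriesSplit-style convention that n_folds = n_splits + 1; this convention is an assumption inferred from the discussion above, not a confirmed API:

```python
import numpy as np

# Groups from the example above; there are 6 unique groups: a, b, c, d, f, g.
groups = np.array(['a', 'b', 'c', 'a', 'b', 'f', 'g', 'd', 'f'])
n_groups = len(np.unique(groups))   # 6

n_splits = 5                        # assumed default
n_folds = n_splits + 1              # 6, under the assumed convention

if n_folds > n_groups:
    raise ValueError("Cannot have more folds than groups.")

# With 6 unique groups, n_splits = 5 is the largest value that does not raise;
# a dataset with only 2-3 groups would be limited to n_splits of 1 or 2.
```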
In time series cross-validation, some training data cannot be used elsewhere as test data, since we require that all test data follows (though not necessarily immediately) its corresponding training data in time. n_folds is really counting the number of data subsets used across the various splits, including training data that is never used for testing.
You should understand the need for this algorithm in terms of what requirements there are on each split, and then understand the parameters and behaviours in terms of them:

* CV: every test set must not overlap with its corresponding training set
* ?KFold-style CV: each test set must share no samples with another test set (this criterion may be removed as long as there is the possibility to use and reuse every test sample, as in ShuffleSplit)
* Time: every test sample must not precede any train sample in the dataset
* Groups: if a sample from a group is in the training set, no sample from that group can be in the test set

Does the algorithm satisfy these criteria? Are these criteria reasonably well tested? Is it parameterised in a reasonable way for users who need these criteria upheld for their task?
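A minimal sketch of one way a splitter could satisfy all four criteria above, assuming samples are already ordered in time and a TimeSeriesSplit-style n_folds = n_splits + 1 convention; this is only an illustration of the requirements, not the implementation proposed in the linked PRs:

```python
import numpy as np

def group_time_series_split(groups, n_splits):
    """Illustrative group-aware time series splitter (sketch only)."""
    groups = np.asarray(groups)
    # Unique groups ordered by first appearance, i.e. by time.
    _, first_idx = np.unique(groups, return_index=True)
    ordered_groups = groups[np.sort(first_idx)]
    n_groups = len(ordered_groups)

    # Assumed convention: n_folds = n_splits + 1 group blocks.
    if n_splits + 1 > n_groups:
        raise ValueError("Cannot have more folds than groups.")

    for i in range(n_groups - n_splits, n_groups):
        test_group = ordered_groups[i]
        train_groups = ordered_groups[:i]

        # Groups criterion: the test group never appears in its training set.
        test_idx = np.where(groups == test_group)[0]
        train_idx = np.where(np.isin(groups, train_groups))[0]

        # Time criterion: keep only training samples that strictly precede
        # the earliest test sample of this fold.
        train_idx = train_idx[train_idx < test_idx.min()]
        yield train_idx, test_idx
```

Each test set in this sketch is a single group, so test sets never share samples (the KFold-style criterion); whether the discarded "future" samples of earlier groups should instead be reused, and whether rolling windows should be supported, are exactly the open questions in this thread.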
@jnothman Thank you for dropping by to give us guidance.
Hey guys, can I ask what the progress is on this? I have written a working iterator which passes all of @Pimpwhippa's tests I could find (https://github.com/Pimpwhippa/scikit-learn/blob/afd31b1a337d3c6d02491fd6bff67b51b1a05e91/sklearn/model_selection/tests/test_split.py). It would need proper docstrings (and linting?) and some critical review of course, but maybe I could contribute?
#16236 is waiting on a reviewer at this point. |
Ah, it moved! Thanks for the #, that code looks great! Do you think it'll make the next release?
I hope it makes it into 0.24. @albertvillanova might be able to share some more info.
Hello,
@labdmitriy When you say window mode, are you talking about rolling equal-length windows for time series CV? I was thinking about starting to work on a PR for that, but if it's already in the pipeline I won't.
Yes, I added an option to choose a rolling or expanding window; it also works with sklearn's GridSearchCV.
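For readers unfamiliar with the distinction, here is a small illustration using the existing TimeSeriesSplit (no groups involved): by default the training window expands with each split, while capping max_train_size gives an approximately fixed-length rolling window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# Expanding window: each training set contains all earlier samples.
expanding = TimeSeriesSplit(n_splits=3)

# Rolling window: max_train_size caps the length of the training window.
rolling = TimeSeriesSplit(n_splits=3, max_train_size=4)

for (tr_e, te_e), (tr_r, te_r) in zip(expanding.split(X), rolling.split(X)):
    print("expanding train:", tr_e, "test:", te_e)
    print("rolling   train:", tr_r, "test:", te_r)
```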
Hello @ogrisel @jnothman,
Hi @labdmitriy, I don't really understand what the alternative proposal is here. I don't think our TimeSeriesSplit is going to fulfil every possible need around time series splitting, but we should support some key use cases to avoid users falling into traps. Scikit-learn does not generally see time series as within its primary scope...
Hi @jnothman, I've shared my implementation of group time series cross-validation, which is compatible with sklearn. Article: https://medium.com/@labdmitriy/advanced-group-time-series-validation-bb00d4a74bcc Update (2022-05-27): an enhanced version of this implementation is now part of the mlxtend library. Thank you.
Basically combining TimeSeriesSplit with the group awareness of other CV strategies such as GroupKFold.

I think it's a good first issue for first time contributors that are already familiar with the existing cross validation tools in scikit-learn:
https://scikit-learn.org/stable/modules/cross_validation.html

Source code is here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py
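To make the gap concrete, a small sketch (the toy data and group labels are illustrative) of the two existing tools the request wants to combine: GroupKFold keeps whole groups together but ignores temporal order, while TimeSeriesSplit respects temporal order but is unaware of groups, so samples from one group can end up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data, assumed to be ordered in time; group labels are illustrative.
X = np.arange(8).reshape(-1, 1)
groups = np.array(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'])

# GroupKFold: each group appears in exactly one test fold and never leaks
# into that fold's training set, but time order is ignored.
for train, test in GroupKFold(n_splits=3).split(X, groups=groups):
    print("GroupKFold      train:", train, "test:", test)

# TimeSeriesSplit: training data always precedes test data, but group 'a'
# (for example) can appear in both the training and the test set of a fold.
for train, test in TimeSeriesSplit(n_splits=3).split(X):
    print("TimeSeriesSplit train:", train, "test:", test)
```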