[MRG] Feature: Additional TimeSeriesSplit Functionality #13204
thomasjpfan merged 26 commits into scikit-learn:master
Conversation
FWIW some of this was discussed in #6322 (which could perhaps be closed as
it predated TimeSeriesSplit)
Thanks, I'll add that to the pull request in case relevant.
I found this PR after I opened issue #13666. I think the gap is an interesting idea.
Yes, this pull request adds the functionality you are requesting without changing any existing API. Perhaps, as you suggest, more documentation could be written around this, and I would be happy to write some code examples to accompany this change if it is accepted.
jnothman
left a comment
Sorry for the slow review. I like your work.
Let's consider caret's approach (https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series) for comparison (thanks for the reference @amueller) and consider the use cases of cross_validate and learning_curve.
- Training set min size: this PR assumes >= test_size; caret requires a minimum to be specified.
- Fixed/max-size training set: specified as a number of samples here; specified by the boolean fixed in caret.
- Test set overlap: non-overlapping in this PR; consecutive test sets overlap by (test_size - 1) in caret.
- Gap: here removed from the train set; no support there.
- n_splits: here it must be specified; there it is implied by test_size and min_train_size.
For learning curve I can see the benefit of having overlapping test sets. I don't really see it as a great idea in cross-validation, though it may not hurt much, and would benefit when the dataset is small.
Except for the required test-set overlap and the lack of Gap, I quite like Caret's parametrisation. If I was designing this from scratch, I might have (min_train_size, max_train_size (int or 'fixed' or None), test_size, gap).
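To make that parametrisation concrete, here is a rough sketch of a split generator along those lines; the function name and exact boundary semantics are my own assumptions, not an agreed design:

```python
import numpy as np

def walk_forward_splits(n_samples, min_train_size, test_size, gap=0,
                        max_train_size=None):
    # Non-overlapping test sets; the train set grows from min_train_size,
    # optionally capped at max_train_size, with `gap` samples dropped
    # between the end of each train set and the start of its test set.
    test_start = min_train_size + gap
    while test_start + test_size <= n_samples:
        train_end = test_start - gap
        train_start = (0 if max_train_size is None
                       else max(0, train_end - max_train_size))
        yield (np.arange(train_start, train_end),
               np.arange(test_start, test_start + test_size))
        test_start += test_size
```

Note that n_splits is implied here by test_size and min_train_size, as in caret.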
Questions:
- Should we allow n_splits=None if test_size is specified? This would generate as many test sets as possible... does this get too confusing?
- train_size >= test_size seems reasonable, but should we allow the user to specify a larger min train size?
- Should we support max_train_size='fixed'?
- If we support n_splits=None, should we support test_overlap, where a value of -1 would correspond to caret's version? This all feels a bit too complex. Should we consider deprecating n_splits and moving to the parametrisation I suggest above?
sklearn/model_selection/_split.py
Outdated
```python
n_splits = self.n_splits
n_folds = n_splits + 1
gap_size = self.gap_size
test_size = self.test_size if self.test_size else n_samples // n_folds
```
perhaps this should be if self.test_size is not None so that 0 and None are not treated the same.
perhaps the else should be (n_samples - gap) // n_folds so that train_size remains > gap
I've updated it to self.test_size is not None. I'm not convinced that train_size necessarily needs to be constrained to be greater than the gap; should that be a decision left to the user?
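For reference, a minimal sketch of the updated guard, so that 0 and None are no longer treated the same:

```python
# Fall back to the default only when test_size was left unspecified;
# an explicit test_size=0 is not silently replaced.
test_size = (self.test_size if self.test_size is not None
             else n_samples // n_folds)
```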
Edit: I might have misunderstood, but is max_train_size supposed to implement a sliding window? If that's the case, please clarify and let's share opinions. I think we can keep it simpler with just train_size and test_size and more string options for n_splits.
My opinion is that I like the idea of having sensible defaults and as much control for the user as possible:
- Add option n_splits='walk_fw'; this should walk forward with specified min_train_size and test_size --> keep the functionality of n_splits as it is. Idea: also implement walk forward as a sliding window if possible. See more on that below.
- What I would love to see added from caret as well is the idea of a sliding window; please consider including this too. Caret calls it fixedWindow, though. The idea is to include only examples inside a fixed time window, then walk forward and move the window along --> could be implemented along the lines of if n_splits == 'window' and min_train_size > 1 and test_size > 1. Here, test_size is the prediction horizon of the window, and min_train_size is the train size for the window in each fold, which gets moved up by 1 each time.
- Allow the user to specify min_train_size smaller than gap, if they so desire (it's a parameter to be explored, imo). I agree with you that a reasonable default is min_train_size >= gap if gap > 0.
- Allow the user to specify a min_train_size both smaller and larger than test_size, but keep the default >= test_size.
- max_train_size='fixed' --> I'm missing something: what is the use of a max_train_size when doing walk forward while growing the train set with the previous test set? Should this implement a sliding window?
- What is the idea behind the test set overlap from caret for walk forward in our case? We would have overlap if we walk forward by 1 time unit and our test_size spans more than 1 time unit. Is that the idea? I think it's methodically OK; I may be missing something though. On that note, also consider the sliding window: we certainly have overlap here if we have a horizon or test_size > 1.
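To make the sliding-window idea concrete, here is a minimal sketch; the function name, the step parameter, and the boundary handling are my own assumptions rather than caret's actual behaviour:

```python
import numpy as np

def sliding_window_split(n_samples, train_size, test_size, step=1):
    # Fixed-width training window moved forward by `step` samples per fold,
    # in the spirit of caret's fixedWindow; consecutive test sets overlap
    # whenever step < test_size.
    start = 0
    while start + train_size + test_size <= n_samples:
        yield (np.arange(start, start + train_size),
               np.arange(start + train_size, start + train_size + test_size))
        start += step

for train, test in sliding_window_split(10, train_size=4, test_size=2, step=2):
    print(train, test)
```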
@jnothman @amueller As some time has elapsed since last discussing this change, these changes implement the basic necessary functionality for walk-forward cross validation (#14376). However, I do agree with the idea that…
I suspect n_splits in the sense of "number of train/test pairs to generate" remains a useful parameter for the user to control. But if it gets in the way of usability, we could consider deprecating it.
Requested changes made. I'm not sure what to do about the codecov checks failing after merging master into this branch... EDIT: the latest master commit fixed this, apparently.
Any news on this? I wanted to contribute exactly this feature, but it was a pleasant surprise to see it already written in a PR :)
Yeah, it's just kind of been hanging in limbo for a year now. Hopefully we can get around to merging it, as I keep copy-pasting this class into a lot of projects I work on 😃
An elegant and much-needed addition! I was also looking to contribute, like @pepicello, but the problem seems to have been solved. Any idea when this will be merged?
jnothman
left a comment
Yes, I am happy with this as an immediate enhancement. Sorry for the slow reviews.
Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.
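For example, an entry along these lines (the wording is only a sketch, following the format of the other v0.24 entries):

```rst
- |Enhancement| :class:`model_selection.TimeSeriesSplit` supports the new
  ``test_size`` and ``gap`` parameters for more flexible walk-forward
  splits. :pr:`13204` by :user:`Kyle Kosic <kykosic>`.
```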
```python
next(splits)


def test_time_series_gap():
```
The test_time_series_test_size and test_time_series_gap tests can be combined with pytest.mark.parametrize. The two error cases can go in their own test as well. This is not a blocker and can be done in a follow-up PR.
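Something along these lines could work (a sketch only; the test name is hypothetical and the expected values assume the splitting arithmetic in this PR):

```python
import numpy as np
import pytest
from sklearn.model_selection import TimeSeriesSplit

@pytest.mark.parametrize(
    "kwargs, expected_n_splits, expected_test_size",
    [
        (dict(n_splits=3, test_size=2), 3, 2),
        (dict(n_splits=3, test_size=2, gap=1), 3, 2),
    ],
)
def test_time_series_split_sizes(kwargs, expected_n_splits, expected_test_size):
    X = np.arange(10).reshape(-1, 1)
    splits = list(TimeSeriesSplit(**kwargs).split(X))
    assert len(splits) == expected_n_splits
    assert all(len(test) == expected_test_size for _, test in splits)
```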
I would be happy to implement this suggestion in a follow-up PR.
thomasjpfan
left a comment
A few more comments. Otherwise LGTM.
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Thank you @kykosic!
Almost exactly 10,000 issues after the first walk-forward issue in 2014 (#3202)! 🍾 Thank you for contributing! 🙏
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
Related to #6322
Resolves #14376
What does this implement/fix? Explain your changes.
There are two parameters being added to TimeSeriesSplit which make it more flexible, particularly when working with financial data.

1. test_size: int, optional

By default, TimeSeriesSplit divides the data into n_splits + 1 folds, and the size of each split's test set is n_samples / (n_splits + 1). However, it is sometimes useful to shift the balance between train and test data when doing cross-validation, depending on the domain use case.

Example use case: I have 4 years (4 * 252 trading days) of stock price data. I wish to run cross-validation with 4 splits. In each split I want to train on 2 years (504 samples) and test on 6 months (126 samples). Currently, using TimeSeriesSplit(n_splits=4, max_train_size=504) will only yield test sets of exactly 201 samples, and train sets starting at 204 samples and growing to 504.

The optional test_size parameter allows me to control the number of samples used for each test split and use all remaining samples for training. See the code docstring for a more visual example. test_size=None by default preserves current functionality.
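With this change, the use case above could look something like this (a sketch using the new parameter):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(4 * 252).reshape(-1, 1)  # 4 years of daily samples

# Train on 2 years (504 samples) and test on 6 months (126 samples)
# in every one of the 4 splits.
tscv = TimeSeriesSplit(n_splits=4, max_train_size=504, test_size=126)
for train_idx, test_idx in tscv.split(X):
    print(len(train_idx), len(test_idx))  # 504 126 on every split
```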
2. gap: int, default=0

Currently each train/test pair is adjacent in sequence. However, for some use cases it may be important to remove "gap samples" between each train set and test set, when look-ahead bias can result from labeling the data.

Example use case: I am building a classifier to buy/sell stocks. I have some input features for each day, and I label my data by looking ahead 2 days and seeing whether the stock price increases or decreases. Since I am looking ahead 2 days to create my labels, I need to insert a 2-day "gap" between my train and test sets during cross-validation to prevent look-ahead bias in my validation score.

If I label my data by looking ahead 2 days, this is not representative of real-life training: I would not have those 2 extra days of labels if I were training a model today. I need to remove the last two elements of each training set to get an authentic cross-validation score.
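A small sketch of the resulting behaviour (the printed indices make the 2-sample gap visible):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)

# Drop 2 samples between each train set and its test set to avoid
# look-ahead bias from labels that peek 2 days into the future.
tscv = TimeSeriesSplit(n_splits=3, test_size=4, gap=2)
for train_idx, test_idx in tscv.split(X):
    print(train_idx[-1], test_idx[0])  # e.g. "5 8": samples 6 and 7 are skipped
```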
The default value of gap=0 maintains current functionality.

Any other comments?