
[MRG] Feature: Additional TimeSeriesSplit Functionality #13204

Merged
thomasjpfan merged 26 commits into scikit-learn:master from kykosic:improve_tssplit
May 12, 2020

Conversation

@kykosic (Contributor) commented Feb 20, 2019

Reference Issues/PRs

Related to #6322
Resolves #14376

What does this implement/fix? Explain your changes.

This adds two parameters to TimeSeriesSplit that make it more flexible, particularly when working with financial data.

1. test_size : int, optional

By default, TimeSeriesSplit divides the data into n_splits + 1 folds, and the size of each split's test set is n_samples / (n_splits + 1). However, it is sometimes useful to shift the balance between train and test data depending on the domain use case.

Example use case: I have 4 years (4 * 252 trading days) of stock price data. I wish to run cross-validation with 4 splits. In each split I want to train on 2 years (504 samples) and test on 6 months (126 samples). Currently, TimeSeriesSplit(n_splits=4, max_train_size=504) yields test sets of exactly 201 samples and train sets that start at 204 samples and grow to 504.

>>> x = np.arange(252 * 4)
>>> cv = TimeSeriesSplit(n_splits=4, max_train_size=504)
>>> for train_index, test_index in cv.split(x):
...     print("Train Size:", len(train_index), "Test Size:", len(test_index))
Train Size: 204 Test Size: 201
Train Size: 405 Test Size: 201
Train Size: 504 Test Size: 201
Train Size: 504 Test Size: 201

The optional test_size parameter allows me to control the number of samples used for each test split and use all remaining samples for training.

>>> x = np.arange(252 * 4)
>>> cv = TimeSeriesSplit(n_splits=4, max_train_size=504, test_size=126)
>>> for train_index, test_index in cv.split(x):
...     print("Train Size:", len(train_index), "Test Size:", len(test_index))
Train Size: 504 Test Size: 126
Train Size: 504 Test Size: 126
Train Size: 504 Test Size: 126
Train Size: 504 Test Size: 126

See the docstring for a more visual example. The default test_size=None preserves the current behavior.

2. gap : int, default=0

Currently, each train/test pair is adjacent in sequence. However, for some use cases it may be important to remove "gap" samples between each train set and test set when labeling the data can introduce look-ahead bias.

Example use case: I am building a classifier to buy/sell stocks. I have some input features for each day, and I label my data by looking ahead 2 days and seeing if the stock price increases or decreases. Since I am looking ahead 2 days to create my data labels, I need to insert a 2 day "gap" between my train and test sets for cross validation to prevent look-ahead bias in my validation score.

>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14]

If I label my data by looking ahead 2 days, then this is not representative of real-life training. I would not have those 2 extra days to label my data if training a model today. I need to remove the last two elements of each training data set to get an authentic cross validation score.

>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3, gap=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6] TEST: [ 9 10 11]
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [12 13 14]

The default value of gap=0 maintains current functionality.
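
For illustration, the two parameters compose; a minimal sketch combining test_size and gap on the same toy data (output assuming the implementation in this PR):

>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3, test_size=2, gap=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3 4 5 6] TEST: [ 9 10]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [11 12]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10] TEST: [13 14]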

Any other comments?

@jnothman (Member) commented Feb 21, 2019 via email

@kykosic changed the title from "[WIP] Feature: Additional TimeSeriesSplit Functionality" to "[MRG] Feature: Additional TimeSeriesSplit Functionality" on Feb 21, 2019
@kykosic (Contributor, Author) commented Feb 21, 2019

> FWIW some of this was discussed in #6322 (which could perhaps be closed as it predated TimeSeriesSplit)

Thanks, I'll add that to the pull request in case relevant.

@mitar (Contributor) commented Apr 18, 2019

I found this PR after I filed issue #13666. I think test_size is trying to accomplish a similar thing, just with a different interface. What I argue in #13666 is that controlling the test size should maybe be done through the number of folds used for testing, not the number of samples. Similarly, max_train_size should also be a number of folds to use for the training data, not a sample count.
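
(For illustration only, a sketch of the fold-based sizing I mean; n_test_folds is a hypothetical parameter name, not this PR's API:)

>>> n_samples, n_splits = 1008, 4
>>> fold_size = n_samples // (n_splits + 1)  # 201 samples per fold
>>> n_test_folds = 1                         # hypothetical parameter
>>> n_test_folds * fold_size                 # test size in samples
201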

The gap is an interesting idea.

@kykosic (Contributor, Author) commented Apr 19, 2019

Yes, this pull request adds the functionality you are requesting without changing any existing API. Perhaps, as you suggest, more documentation could be written around this; I would be happy to add code examples to accompany this change if it is accepted.

@jnothman (Member) left a comment

Sorry for the slow review. I like your work.

Let's consider caret's approach (https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series) for comparison (thanks for the reference @amueller) and consider the use cases of cross_validate and learning_curve.

  • Training set min size: this PR assumes >= test_size; caret requires a minimum to be specified.
  • Fixed/max size training set: specified as a number of samples here; specified by a boolean fixed in caret.
  • Test set overlap: non-overlapping in this PR; consecutive test sets overlap by (test_size - 1) in caret.
  • Gap: here removed from the train set; there no support.
  • n_splits: here must be specified; there implied by test_size and min_train_size.

For learning curve I can see the benefit of having overlapping test sets. I don't really see it as a great idea in cross-validation, though it may not hurt much, and would benefit when the dataset is small.

Except for the required test-set overlap and the lack of gap, I quite like caret's parametrisation. If I were designing this from scratch, I might have (min_train_size, max_train_size (int or 'fixed' or None), test_size, gap).
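
For reference, a rough Python sketch of caret-style slicing (the names initial_window / horizon / fixed_window mirror caret's arguments; this only illustrates the overlap behaviour and is not proposed API):

>>> def create_time_slices(n_samples, initial_window, horizon, fixed_window=True):
...     # advance the split point one sample at a time, caret-style
...     for end in range(initial_window, n_samples - horizon + 1):
...         start = end - initial_window if fixed_window else 0
...         yield list(range(start, end)), list(range(end, end + horizon))
>>> for train, test in create_time_slices(8, initial_window=4, horizon=2):
...     print("TRAIN:", train, "TEST:", test)
TRAIN: [0, 1, 2, 3] TEST: [4, 5]
TRAIN: [1, 2, 3, 4] TEST: [5, 6]
TRAIN: [2, 3, 4, 5] TEST: [6, 7]

Consecutive test sets overlap by horizon - 1 = 1 sample here.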

Questions:

  • should we allow n_splits=None if test_size is specified? this would generate as many test sets as possible... does this get too confusing?
  • train_size >= test_size seems reasonable, but should we allow the user to specify a larger min train size?
  • should we support max_train_size='fixed'?
  • if we support n_splits=None, should we support test_overlap, where a value of -1 would correspond to caret's version? This all feels a bit too complex. Should we consider deprecating n_splits and moving to the parametrisation I suggest above?

Code under review:

n_splits = self.n_splits
n_folds = n_splits + 1
gap_size = self.gap_size
test_size = self.test_size if self.test_size else n_samples // n_folds
Member:

perhaps this should be if self.test_size is not None so that 0 and None are not treated the same.
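
(A short illustration of the difference, assuming a user passes test_size=0 explicitly:)

>>> test_size, n_samples, n_folds = 0, 100, 5
>>> test_size if test_size else n_samples // n_folds
20
>>> test_size if test_size is not None else n_samples // n_folds
0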

Member:

perhaps the else should be (n_samples - gap) // n_folds so that train_size remains > gap

Contributor (Author):

I've updated it to use self.test_size is not None. I'm not convinced that the train size necessarily needs to be constrained to be greater than the gap; should that be a decision left to the user?

@svenstehle (Contributor) commented Dec 29, 2019

Edit: I might have misunderstood, but is max_train_size supposed to implement a sliding window? If that's the case, please clarify and let's share opinions. I think we can keep it simpler with just train_size and test_size plus more string options for n_splits.


My opinion: I like the idea of sensible defaults with as much control for the user as possible:

  • Add an option n_splits='walk_fw' that walks forward with a specified min_train_size and test_size, keeping the current functionality of n_splits. Idea: also implement walk-forward as a sliding window; see more on that below.
  • What I would love to see added from caret is the idea of a sliding window (caret calls it fixedWindow). The idea is to include only samples inside a fixed time window, then walk forward, moving the window along. This could be implemented along the lines of n_splits=='window' with min_train_size > 1 and test_size > 1, where test_size is the prediction horizon of the window and min_train_size is the per-fold training window that moves up by 1 each time.
  • Allow the user to specify min_train_size smaller than gap, if they so desire (it's a parameter to be explored, imo). I agree with you that a reasonable default is min_train_size >= gap when gap > 0.
  • Allow the user to specify a min_train_size both smaller and larger than test_size, but keep the default >= test_size.
  • max_train_size='fixed': I'm missing something; what is the use of a max_train_size when doing walk-forward while growing the train set with the previous test set? Should this implement a sliding window? (See the sketch after this list.)
  • What is the idea behind caret's test-set overlap for walk-forward in our case? We would have overlap if we walk forward one time unit while our test_size spans more than one time unit; is that the idea? I think it is methodically OK, though I may be missing something. Note that with a sliding window we certainly have overlap whenever the horizon (test_size) is greater than 1.
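
For illustration (assuming this PR's implementation), max_train_size already behaves like a fixed-width sliding training window once enough history has accumulated:

>>> x = np.arange(10)
>>> cv = TimeSeriesSplit(n_splits=4, max_train_size=3, test_size=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [4 5]
TRAIN: [3 4 5] TEST: [6 7]
TRAIN: [5 6 7] TEST: [8 9]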

@kykosic (Contributor, Author) commented Dec 26, 2019

@jnothman @amueller As some time has elapsed since we last discussed this change: these changes implement the basic functionality necessary for walk-forward cross-validation (#14376).

However, I do agree that n_splits is not a very useful parameter when defining these sorts of splits. These changes express the functionality without breaking the current API; if we were to remove n_splits as proposed in earlier comments, usage would probably be more intuitive.

@jnothman (Member) commented Jan 5, 2020

I suspect n_splits in the sense of "number of train/test pairs to generate" remains a useful parameter for the user to control. But if it gets in the way of usability, we could consider deprecating it.

@jnothman (Member) left a comment

Thanks @kykosic, I do appreciate this as a minimal change, and am more or less ready to approve it. I just want to ensure that it's enough of a change to be useful.

@kykosic (Contributor, Author) commented Jan 6, 2020

Requested changes made. I'm not sure what to do about the codecov checks failing after merging master into this branch...

EDIT: The latest master commit apparently fixed this.

@pepicello commented:

Any news on this? I wanted to contribute exactly this feature, so it was a pleasant surprise to see it already written in a PR :)

@kykosic (Contributor, Author) commented Feb 18, 2020

Yeah, it's just kind of been hanging in limbo for a year now. Hopefully we can get around to merging it, as I keep copy-pasting this class into a lot of projects I work on 😃

@tuomijal commented May 4, 2020

An elegant and much-needed addition! Like @pepicello, I was also looking to contribute, but the problem seems to have been solved. Any idea when this will be merged?

@kykosic kykosic requested a review from jnothman May 8, 2020 12:31
@jnothman (Member) left a comment

Yes, I am happy with this as an immediate enhancement. Sorry for the slow reviews.

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with the :pr: role and credit yourself (and other contributors if applicable) with the :user: role.

@jnothman jnothman added this to the 0.24 milestone May 10, 2020
Code under review:

next(splits)


def test_time_series_gap():
Member:

These two tests, test_time_series_test_size and test_time_series_gap, could be combined with pytest.mark.parametrize.

The two error cases can go in their own test as well. This is not a blocker and can be done in a follow-up PR.
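
A hypothetical sketch of what that parametrization could look like (the test name and expected index lists are illustrative, computed for n_samples=6 under this PR's semantics):

import numpy as np
import pytest
from sklearn.model_selection import TimeSeriesSplit

@pytest.mark.parametrize(
    "kwargs, expected",
    [
        # test_size pins each test fold at 2 samples
        ({"n_splits": 2, "test_size": 2},
         [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]),
        # gap drops 1 sample between each train set and test set
        ({"n_splits": 2, "gap": 1},
         [([0], [2, 3]), ([0, 1, 2], [4, 5])]),
    ],
)
def test_time_series_split_parametrized(kwargs, expected):
    splits = TimeSeriesSplit(**kwargs).split(np.arange(6))
    for (train, test), (exp_train, exp_test) in zip(splits, expected):
        assert train.tolist() == exp_train
        assert test.tolist() == exp_test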

Contributor (Author):

I would be happy to implement this suggestion in a follow-up PR.

@thomasjpfan (Member) left a comment

A few more comments; otherwise LGTM.

kykosic and others added 2 commits May 10, 2020 21:41
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
@thomasjpfan thomasjpfan merged commit b4e215c into scikit-learn:master May 12, 2020
@thomasjpfan (Member) commented:

Thank you @kykosic !

@mjbommar (Contributor) commented:

Almost exactly 10,000 issues after the first walk-forward issue in 2014 (#3202)!

🍾 thank you for contributing! 🙏

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Development

Successfully merging this pull request may close these issues:

Implement WalkForward cross-validator for time series data.