
[MRG] Feature: Additional TimeSeriesSplit Functionality #13204

Merged
thomasjpfan merged 26 commits into scikit-learn:master from kykosic:improve_tssplit
May 12, 2020

Conversation

@kykosic (Contributor) commented Feb 20, 2019

Reference Issues/PRs

Related to #6322
Resolves #14376

What does this implement/fix? Explain your changes.

This adds two parameters to TimeSeriesSplit that make it more flexible, particularly when working with financial data.

1. test_size : int, optional

By default, TimeSeriesSplit divides the data into n_splits + 1 folds, and the size of each split's test set is n_samples / (n_splits + 1). However, it is sometimes useful to shift the balance between train and test data depending on the domain use case.

Example use case: I have 4 years (4 * 252 trading days) of stock price data. I wish to run cross-validation with 4 splits. In each split I want to train on 2 years (504 samples) and test on 6 months (126 samples). Currently, TimeSeriesSplit(n_splits=4, max_train_size=504) yields test sets of exactly 201 samples and train sets that start at 204 samples and grow to 504.

>>> x = np.arange(252 * 4)
>>> cv = TimeSeriesSplit(n_splits=4, max_train_size=504)
>>> for train_index, test_index in cv.split(x):
...     print("Train Size:", len(train_index), "Test Size:", len(test_index))
Train Size: 204 Test Size: 201
Train Size: 405 Test Size: 201
Train Size: 504 Test Size: 201
Train Size: 504 Test Size: 201

The optional test_size parameter allows me to control the number of samples used for each test split and use all remaining samples for training.

>>> x = np.arange(252 * 4)
>>> cv = TimeSeriesSplit(n_splits=4, max_train_size=504, test_size=126)
>>> for train_index, test_index in cv.split(x):
...     print("Train Size:", len(train_index), "Test Size:", len(test_index))
Train Size: 504 Test Size: 126
Train Size: 504 Test Size: 126
Train Size: 504 Test Size: 126
Train Size: 504 Test Size: 126

See the docstring for a more visual example. The default test_size=None preserves the current behavior.

2. gap : int, default=0

Currently, each train/test pair is adjacent in sequence. However, for some use cases it may be important to remove "gap" samples between each train set and test set when labeling the data can introduce look-ahead bias.

Example use case: I am building a classifier to buy/sell stocks. I have some input features for each day, and I label my data by looking ahead 2 days and seeing if the stock price increases or decreases. Since I am looking ahead 2 days to create my data labels, I need to insert a 2 day "gap" between my train and test sets for cross validation to prevent look-ahead bias in my validation score.

>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14]

If I label my data by looking ahead 2 days, then this is not representative of real-life training. I would not have those 2 extra days to label my data if training a model today. I need to remove the last two elements of each training data set to get an authentic cross validation score.

>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3, gap=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6] TEST: [ 9 10 11]
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [12 13 14]

The default value of gap=0 maintains current functionality.
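
For illustration, the two parameters compose; a minimal sketch combining test_size and gap on the same toy data (output assuming the implementation in this PR):

>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3, test_size=2, gap=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3 4 5 6] TEST: [ 9 10]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [11 12]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10] TEST: [13 14]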

Any other comments?

@jnothman (Member) commented Feb 21, 2019 via email

@kykosic changed the title from "[WIP] Feature: Additional TimeSeriesSplit Functionality" to "[MRG] Feature: Additional TimeSeriesSplit Functionality" on Feb 21, 2019
@kykosic (Contributor, Author) commented Feb 21, 2019

> FWIW some of this was discussed in #6322 (which could perhaps be closed as it predated TimeSeriesSplit)

Thanks, I'll add that to the pull request in case relevant.

@mitar (Contributor) commented Apr 18, 2019

I found this PR after I filed issue #13666. I think test_size is trying to accomplish a similar thing, just with a different interface. What I argue in #13666 is that controlling the test size should maybe be done through the number of folds used for testing, not the number of samples. Similarly, max_train_size should also be a number of folds to use for the training data, not a sample count.
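
(For illustration only, a sketch of the fold-based sizing I mean; n_test_folds is a hypothetical parameter name, not this PR's API:)

>>> n_samples, n_splits = 1008, 4
>>> fold_size = n_samples // (n_splits + 1)  # 201 samples per fold
>>> n_test_folds = 1                         # hypothetical parameter
>>> n_test_folds * fold_size                 # test size in samples
201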

The gap is an interesting idea.

@kykosic (Contributor, Author) commented Apr 19, 2019

Yes, this pull request adds the functionality you are requesting without changing any existing API. Perhaps, as you suggest, more documentation could be written around this; I would be happy to add code examples to accompany this change if it is accepted.

@jnothman (Member) left a comment

Sorry for the slow review. I like your work.

Let's consider caret's approach (https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series) for comparison (thanks for the reference @amueller) and consider the use cases of cross_validate and learning_curve.

  • Training set min size: this PR assumes >= test_size; caret requires a minimum to be specified.
  • Fixed/max size training set: specified as a number of samples here; specified by a boolean fixed in caret.
  • Test set overlap: non-overlapping in this PR; consecutive test sets overlap by (test_size - 1) in caret.
  • Gap: here removed from the train set; there no support.
  • n_splits: here must be specified; there implied by test_size and min_train_size.

For learning curve I can see the benefit of having overlapping test sets. I don't really see it as a great idea in cross-validation, though it may not hurt much, and would benefit when the dataset is small.

Except for the required test-set overlap and the lack of gap, I quite like caret's parametrisation. If I were designing this from scratch, I might have (min_train_size, max_train_size (int or 'fixed' or None), test_size, gap).
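
For reference, a rough Python sketch of caret-style slicing (the names initial_window / horizon / fixed_window mirror caret's arguments; this only illustrates the overlap behaviour and is not proposed API):

>>> def create_time_slices(n_samples, initial_window, horizon, fixed_window=True):
...     # advance the split point one sample at a time, caret-style
...     for end in range(initial_window, n_samples - horizon + 1):
...         start = end - initial_window if fixed_window else 0
...         yield list(range(start, end)), list(range(end, end + horizon))
>>> for train, test in create_time_slices(8, initial_window=4, horizon=2):
...     print("TRAIN:", train, "TEST:", test)
TRAIN: [0, 1, 2, 3] TEST: [4, 5]
TRAIN: [1, 2, 3, 4] TEST: [5, 6]
TRAIN: [2, 3, 4, 5] TEST: [6, 7]

Consecutive test sets overlap by horizon - 1 = 1 sample here.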

Questions:

  • should we allow n_splits=None if test_size is specified? this would generate as many test sets as possible... does this get too confusing?
  • train_size >= test_size seems reasonable, but should we allow the user to specify a larger min train size?
  • should we support max_train_size='fixed'?
  • if we support n_splits=None, should we support test_overlap, where a value of -1 would correspond to caret's version? This all feels a bit too complex. Should we consider deprecating n_splits and moving to the parametrisation I suggest above?

Code under review:

n_splits = self.n_splits
n_folds = n_splits + 1
gap_size = self.gap_size
test_size = self.test_size if self.test_size else n_samples // n_folds
Member:

perhaps this should be if self.test_size is not None so that 0 and None are not treated the same.
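
(A short illustration of the difference, assuming a user passes test_size=0 explicitly:)

>>> test_size, n_samples, n_folds = 0, 100, 5
>>> test_size if test_size else n_samples // n_folds
20
>>> test_size if test_size is not None else n_samples // n_folds
0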

Member:

perhaps the else should be (n_samples - gap) // n_folds so that train_size remains > gap

Contributor (Author):

I've updated it to use self.test_size is not None. I'm not convinced that the train size necessarily needs to be constrained to be greater than the gap; should that be a decision left to the user?

@svenstehle (Contributor) commented Dec 29, 2019

Edit: I might have misunderstood, but is max_train_size supposed to implement a sliding window? If that's the case, please clarify and let's share opinions. I think we can keep it simpler with just train_size and test_size plus more string options for n_splits.


My opinion: I like the idea of sensible defaults with as much control for the user as possible:

  • Add an option n_splits='walk_fw' that walks forward with a specified min_train_size and test_size, keeping the current functionality of n_splits. Idea: also implement walk-forward as a sliding window; see more on that below.
  • What I would love to see added from caret is the idea of a sliding window (caret calls it fixedWindow). The idea is to include only samples inside a fixed time window, then walk forward, moving the window along. This could be implemented along the lines of n_splits=='window' with min_train_size > 1 and test_size > 1, where test_size is the prediction horizon of the window and min_train_size is the per-fold training window that moves up by 1 each time.
  • Allow the user to specify min_train_size smaller than gap, if they so desire (it's a parameter to be explored, imo). I agree with you that a reasonable default is min_train_size >= gap when gap > 0.
  • Allow the user to specify a min_train_size both smaller and larger than test_size, but keep the default >= test_size.
  • max_train_size='fixed': I'm missing something; what is the use of a max_train_size when doing walk-forward while growing the train set with the previous test set? Should this implement a sliding window? (See the sketch after this list.)
  • What is the idea behind caret's test-set overlap for walk-forward in our case? We would have overlap if we walk forward one time unit while our test_size spans more than one time unit; is that the idea? I think it is methodically OK, though I may be missing something. Note that with a sliding window we certainly have overlap whenever the horizon (test_size) is greater than 1.
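
For illustration (assuming this PR's implementation), max_train_size already behaves like a fixed-width sliding training window once enough history has accumulated:

>>> x = np.arange(10)
>>> cv = TimeSeriesSplit(n_splits=4, max_train_size=3, test_size=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [4 5]
TRAIN: [3 4 5] TEST: [6 7]
TRAIN: [5 6 7] TEST: [8 9]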

@kykosic (Contributor, Author) commented Dec 26, 2019

@jnothman @amueller As some time has elapsed since we last discussed this change: these changes implement the basic functionality necessary for walk-forward cross-validation (#14376).

However, I do agree that n_splits is not a very useful parameter when defining these sorts of splits. These changes express the functionality without breaking the current API; if we were to remove n_splits as proposed in earlier comments, usage would probably be more intuitive.

@jnothman (Member) commented Jan 5, 2020

I suspect n_splits in the sense of "number of train/test pairs to generate" remains a useful parameter for the user to control. But if it gets in the way of usability, we could consider deprecating it.

@jnothman (Member) left a comment

Thanks @kykosic, I do appreciate this as a minimal change, and am more or less ready to approve it. I just want to ensure that it's enough of a change to be useful.

@kykosic (Contributor, Author) commented Jan 6, 2020

Requested changes made. I'm not sure what to do about the codecov checks failing after merging master into this branch...

EDIT: The latest master commit apparently fixed this.

@pepicello commented:

Any news on this? I wanted to contribute exactly this feature, so it was a pleasant surprise to see it already written in a PR :)

@kykosic (Contributor, Author) commented Feb 18, 2020

Yeah, it's just kind of been hanging in limbo for a year now. Hopefully we can get around to merging it, as I keep copy-pasting this class into a lot of projects I work on 😃

@tuomijal commented May 4, 2020

An elegant and much-needed addition! Like @pepicello, I was also looking to contribute, but the problem seems to have been solved. Any idea when this will be merged?

@kykosic kykosic requested a review from jnothman May 8, 2020 12:31
@jnothman (Member) left a comment

Yes, I am happy with this as an immediate enhancement. Sorry for the slow reviews.

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with the :pr: role and credit yourself (and other contributors if applicable) with the :user: role.

@jnothman jnothman added this to the 0.24 milestone May 10, 2020
Code under review:

next(splits)


def test_time_series_gap():
Member:

These two tests, test_time_series_test_size and test_time_series_gap, could be combined with pytest.mark.parametrize.

The two error cases can go in their own test as well. This is not a blocker and can be done in a follow-up PR.
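
A hypothetical sketch of what that parametrization could look like (the test name and expected index lists are illustrative, computed for n_samples=6 under this PR's semantics):

import numpy as np
import pytest
from sklearn.model_selection import TimeSeriesSplit

@pytest.mark.parametrize(
    "kwargs, expected",
    [
        # test_size pins each test fold at 2 samples
        ({"n_splits": 2, "test_size": 2},
         [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]),
        # gap drops 1 sample between each train set and test set
        ({"n_splits": 2, "gap": 1},
         [([0], [2, 3]), ([0, 1, 2], [4, 5])]),
    ],
)
def test_time_series_split_parametrized(kwargs, expected):
    splits = TimeSeriesSplit(**kwargs).split(np.arange(6))
    for (train, test), (exp_train, exp_test) in zip(splits, expected):
        assert train.tolist() == exp_train
        assert test.tolist() == exp_test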

Contributor (Author):

I would be happy to implement this suggestion in a follow-up PR.

@thomasjpfan (Member) left a comment

A few more comments; otherwise LGTM.

kykosic and others added 2 commits May 10, 2020 21:41
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
@thomasjpfan thomasjpfan merged commit b4e215c into scikit-learn:master May 12, 2020
@thomasjpfan (Member) commented:

Thank you @kykosic !

@mjbommar (Contributor) commented:

Almost exactly 10,000 issues after the first walk-forward issue in 2014 (#3202)!

🍾 thank you for contributing! 🙏

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Kyle Kosic <kylekosic@Kyles-MacBook-Pro.local>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Development

Successfully merging this pull request may close these issues:

Implement WalkForward cross-validator for time series data.