[MRG+3] edit train/test_size default behavior #7459
jnothman merged 26 commits into scikit-learn:master
Conversation
Force-pushed from 83a7a05 to 7ea9a8e
Please don't modify cross_validation.

ok, thanks.
This needs a rebase on master to fix the conflicts.
Force-pushed from f77bdc6 to faa3124
Done, thanks for letting me know. I didn't notice that someone else was working on the file as well :)
This is ready to be looked at, though it's missing a whatsnew entry (I'd like to verify the current behavior is correct before writing it). Among the changes discussed earlier,
I expect the docs of
Right, sorry for omitting that @amueller. I've actually never "deprecated" by changing the default parameter of a function before; would you mind pointing me to an example or explaining how to do it? It seems to be missing in the contributing docs as well, so I'd be happy to add it there.
I'm not sure we should actually change the default from 0.2 to 0.1 in that case. For reference, the deprecation documentation is here:
I'm always kind of confused by the fact that the "Contributing" link on the help page doesn't go to contributing, but to developer...
hmm, would be clear? I feel like there's a better way to word it...
Usually we just say "The default value of (..) will change from .. to .. in version 0.20" and the
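That wording convention can be paired with a runtime warning. Below is a minimal sketch of the pattern, assuming a hypothetical function `make_split` (not a real scikit-learn API) and the 0.1/0.2 values discussed in this thread:

```python
import warnings

def make_split(test_size=None):
    """Hypothetical function whose test_size default is scheduled to change."""
    if test_size is None:
        # Warn only when the caller relies on the default value.
        warnings.warn(
            "The default value of test_size will change from 0.1 to 0.2 "
            "in version 0.20. Specify test_size explicitly to silence "
            "this warning.",
            FutureWarning,
        )
        test_size = 0.1
    return test_size
```

Callers who pass `test_size` explicitly see no warning, so only code relying on the soon-to-change default is notified.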
@amueller we also need a deprecation for also, by
Also, I edited the default parameters taken by the
We should not change the defaults of
There are a couple of errors in your issue description:
- you mean the opposite
- I think you mean less than

Now that

Makes sense, I've reverted the change and updated the description accordingly.

Indeed, thanks for catching that.
To clarify: we should not implement the changes to the behavior when setting train_size without test_size, but should instead throw a warning saying that it will change in a future release (sort of like a deprecation)? Also, how is there a case that
I'm now not sure about the details of my comment. The point is that all behaviours need to remain identical, except perhaps for a warning or extra user-specified parameters/values being handled. I need to take a proper look at what's changing here to give more precise feedback, but there's a huge load of issues for me to skim right now...
I hope to look at this soon, @nelson-liu. In the meantime, resolving merge conflicts with master wouldn't hurt.
@jnothman feel free to hassle me about this, too ;) |
Force-pushed from 0c53fda to 898ab03
Rebased on master to fix conflicts.
jnothman left a comment:
Unfortunately, we need to retain the behaviour that train_test_split(train_size=.5) will sample 10% (or whatever the default is) for test and 50% for train. This behaviour can change in version 0.21. We achieve this by setting test_size to some otherwise inappropriate sentinel by default, such as 'default', which behaves exactly like the current default value (0.1). In 0.21, it can change to None and behave like you specify. In the case that test_size='default' and train_size is a number, we warn that behaviour will change to one where test_size will always complement train_size unless both are specified or both are unspecified.
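The sentinel approach described here can be sketched as follows. This is a simplified illustration, not the actual scikit-learn implementation; the function name `split_sizes` and the 0.1 default are stand-ins:

```python
import warnings

def split_sizes(n_samples, test_size="default", train_size=None):
    # 'default' is a sentinel that behaves like the old default (0.1)
    # but lets us detect whether the user set test_size themselves.
    if test_size == "default":
        if train_size is not None:
            # Behaviour slated to change: in a future version, test_size
            # would become the complement of train_size instead of 0.1.
            warnings.warn(
                "From version 0.21, test_size will default to the "
                "complement of train_size. Specify test_size to silence "
                "this warning.",
                FutureWarning,
            )
        test_size = 0.1
    n_test = round(n_samples * test_size)
    n_train = (n_samples - n_test if train_size is None
               else round(n_samples * train_size))
    return n_train, n_test
```

With this scheme, `split_sizes(100, train_size=0.5)` still returns the legacy (50, 10) split but warns that the complement behaviour is coming.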
sklearn/model_selection/_split.py:

    test_size : float, int, or None, default None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, and `train_size` is None,
        the value is set to 0.1. If None and `train_size` is not None, the
sklearn/model_selection/_split.py:

                 random_state=None):
        if test_size is None and train_size is None:
            warnings.warn("The default value of the test_size parameter "
                          "will change from 0.1 to 0.2 in version 0.21.",

0.2 is already the default, is it not??
@jnothman - let me know if I understood what to do correctly. The idea is that the current behavior shouldn't change, and thus this PR should just add the functions that detect when a user invokes behavior that will change in the future ( If this is the case, I reverted everything to ensure that it maintains the current behavior and just warns the user appropriately.
Maybe you can put a short snippet and mention the values of test_size/train_size in both 0.20 and 0.21. By the way, it may be worth creating an issue to make sure we remove the warnings in 0.21 and add an entry in the 0.21 changelog in the "API changes" section...
Sufficient what's new: In version 0.21, the default behavior of splitters that use the |
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #7459     +/-   ##
    ==========================================
    + Coverage   95.48%   95.48%   +<.01%
    ==========================================
      Files         342      342
      Lines       61013    61033      +20
    ==========================================
    + Hits        58259    58279      +20
      Misses       2754     2754
Thanks @nelson-liu
Sorry for letting this languish @jnothman, and thanks for taking it into your own hands to patch it up.
No worries! I'm glad it is in and we can stop having people frustrated by it.
So what actually changed here, in practice? If the user did nothing they would have 0.75/0.25 before, and it looks like that is still the case.
No breaking behavior has been implemented in version

Yes, that is correct.
Right, so what I'm confused by is that it seems like that's already the behavior. From the doc: "If None, the value is set to the complement of the train size."
ah, Does that help clarify?
That helps, although I would have thought the default was the float 0.25, since the docs also say, "By default, the value is set to 0.25." Was it 0.1 only in version 0.18?
Defaults and their documentation were inconsistent in 0.18, IIRC.
Also, note that the defaults are different for different splitting functions/classes (e.g. edit: note that the
But you're saying it was 0.1 for one version?
The default

As far as I can remember, the default

In addition: I'm not sure what objects you're referring to by "it" --- the default behavior of the Splitter classes and
OK, thanks. I just wanted to make sure the defaults for train_test_split hadn't changed recently, because if so it would affect my models. It doesn't sound like it, so I'm satisfied. I guess this is a good example of the wisdom of always specifying parameters rather than relying on defaults.
Reference Issue
Fixes #5948, #4618
What does this implement/fix? Explain your changes.
Changes the default behavior of train_size and test_size in splitters. The defaults for both parameters are now None.
- If train_size and test_size are both None, train_size is set to 0.9 and test_size is set to 0.1.
- If only one of train_size or test_size is set, then the value of the unset parameter is n_samples - set_parameter (the complement).
- If both are set, allow their sum to be less than n_samples and respect that the user wants to subsample the dataset.
Any other comments?
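The defaulting rules proposed in this description can be sketched in a few lines. This is an illustration only, not scikit-learn's actual code; the name `resolve_sizes` is hypothetical and the 0.9/0.1 values come from the description above:

```python
def resolve_sizes(n_samples, train_size=None, test_size=None):
    # Floats are proportions of the dataset; ints are absolute counts.
    def as_count(size):
        return round(n_samples * size) if isinstance(size, float) else size

    if train_size is None and test_size is None:
        train_size, test_size = 0.9, 0.1  # proposed defaults
    n_train = None if train_size is None else as_count(train_size)
    n_test = None if test_size is None else as_count(test_size)
    if n_train is None:
        n_train = n_samples - n_test   # complement of test_size
    elif n_test is None:
        n_test = n_samples - n_train   # complement of train_size
    # If both were given, keep them as-is: the user may deliberately
    # subsample, so n_train + n_test can be less than n_samples.
    return n_train, n_test
```

For example, setting only `train_size=0.6` on 100 samples would yield a 60/40 split, while setting both sizes leaves any leftover samples unused.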
I'm aware that the cross_validation module is being deprecated; does this mean that I shouldn't add these changes to it, and only to model_selection?
TODO