[MRG+3] edit train/test_size default behavior #7459
jnothman merged 26 commits into scikit-learn:master
Conversation
Force-pushed from 83a7a05 to 7ea9a8e
Please don't modify cross_validation.

ok, thanks.
This needs a rebase on master to fix the conflicts.
Force-pushed from f77bdc6 to faa3124
Done, thanks for letting me know. I didn't notice that someone else was working on the file as well :)
This is ready to be looked at, though it's missing a whatsnew entry (I'd like to verify the current behavior is correct before writing it). Among the changes discussed earlier,
I expect the docs of
Right, sorry for omitting that @amueller. I've actually never "deprecated" by changing the default parameter of a function before; would you mind pointing me to an example or explaining how to do it? It seems to be missing in the contributing docs as well, so I'd be happy to add it there.
I'm not sure we should actually change the default from 0.2 to 0.1 in that case. For reference, the deprecation documentation is here:
I'm always kind of confused by the fact that the "Contributing" link on the help page doesn't go to contributing, but to developer...
hmm, would be clear? I feel like there's a better way to word it...
Usually we just say "The default value of (..) will change from .. to .. in version 0.20" and the
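That wording convention can be paired with a runtime warning. Below is a minimal sketch of the pattern, assuming a hypothetical function `make_split` (not a real scikit-learn API) and the 0.1/0.2 values discussed in this thread:

```python
import warnings

def make_split(test_size=None):
    """Hypothetical function whose test_size default is scheduled to change."""
    if test_size is None:
        # Warn only when the caller relies on the default value.
        warnings.warn(
            "The default value of test_size will change from 0.1 to 0.2 "
            "in version 0.20. Specify test_size explicitly to silence "
            "this warning.",
            FutureWarning,
        )
        test_size = 0.1
    return test_size
```

Callers who pass `test_size` explicitly see no warning, so only code relying on the soon-to-change default is notified.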
@amueller we also need a deprecation for also, by
Also, I edited the default parameters taken by the
We should not change the defaults of
There are a couple of errors in your issue description:
- you mean the opposite
- I think you mean less than

Now that

Makes sense, I've reverted the change and updated the description accordingly.

Indeed, thanks for catching that.
To clarify: we should not implement the changes to the behavior when setting train_size without test_size, but should instead throw a warning saying that it will change in a future release (sort of like a deprecation)? Also, how is there a case that
I'm now not sure about the details of my comment. The point is that all behaviours need to remain identical, except perhaps for a warning or extra user-specified parameters/values being handled. I need to take a proper look at what's changing here to give more precise feedback, but there's a huge load of issues for me to skim right now...
I hope to look at this soon, @nelson-liu. In the meantime, resolving merge conflicts with master wouldn't hurt.
@jnothman feel free to hassle me about this, too ;) |
Force-pushed from 0c53fda to 898ab03
Rebased on master to fix conflicts.
jnothman left a comment:
Unfortunately, we need to retain the behaviour that train_test_split(train_size=.5) will sample 10% (or whatever the default is) for test and 50% for train. This behaviour can change in version 0.21. We achieve this by setting test_size to some otherwise inappropriate sentinel by default, such as 'default', which behaves exactly like the current default value (0.1). In 0.21, it can change to None and behave like you specify. In the case that test_size='default' and train_size is a number, we warn that behaviour will change to one where test_size will always complement train_size unless both are specified or both are unspecified.
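The sentinel approach described here can be sketched as follows. This is a simplified illustration, not the actual scikit-learn implementation; the function name `split_sizes` and the 0.1 default are stand-ins:

```python
import warnings

def split_sizes(n_samples, test_size="default", train_size=None):
    # 'default' is a sentinel that behaves like the old default (0.1)
    # but lets us detect whether the user set test_size themselves.
    if test_size == "default":
        if train_size is not None:
            # Behaviour slated to change: in a future version, test_size
            # would become the complement of train_size instead of 0.1.
            warnings.warn(
                "From version 0.21, test_size will default to the "
                "complement of train_size. Specify test_size to silence "
                "this warning.",
                FutureWarning,
            )
        test_size = 0.1
    n_test = round(n_samples * test_size)
    n_train = (n_samples - n_test if train_size is None
               else round(n_samples * train_size))
    return n_train, n_test
```

With this scheme, `split_sizes(100, train_size=0.5)` still returns the legacy (50, 10) split but warns that the complement behaviour is coming.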
sklearn/model_selection/_split.py:

    test_size : float, int, or None, default None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, and `train_size` is None,
        the value is set to 0.1. If None and `train_size` is not None, the
sklearn/model_selection/_split.py:

                 random_state=None):
        if test_size is None and train_size is None:
            warnings.warn("The default value of the test_size parameter "
                          "will change from 0.1 to 0.2 in version 0.21.",

0.2 is already the default, is it not??
@jnothman - let me know if I understood what to do correctly. The idea is that the current behavior shouldn't change, and thus this PR should just add the functions that detect when a user invokes behavior that will change in the future ( If this is the case, I reverted everything to ensure that it maintains the current behavior and just warns the user appropriately.
Maybe you can put a short snippet and mention the values of test_size/train_size in both 0.20 and 0.21. By the way, it may be worth creating an issue to make sure we remove the warnings in 0.21 and add an entry in the 0.21 changelog in the "API changes" section...
Sufficient what's new: In version 0.21, the default behavior of splitters that use the |
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #7459     +/-   ##
    ==========================================
    + Coverage   95.48%   95.48%   +<.01%
    ==========================================
      Files         342      342
      Lines       61013    61033      +20
    ==========================================
    + Hits        58259    58279      +20
      Misses       2754     2754
Thanks @nelson-liu
Sorry for letting this languish @jnothman, and thanks for taking it into your own hands to patch it up.
No worries! I'm glad it is in and we can stop having people frustrated by it.
So what actually changed here, in practice? If the user did nothing they would have 0.75/0.25 before, and it looks like that is still the case.
No breaking behavior has been implemented in version

Yes, that is correct.
Right, so what I'm confused by is that it seems like that's already the behavior. From the doc: "If None, the value is set to the complement of the train size."
ah, Does that help clarify?
That helps, although I would have thought the default was the float 0.25, since the docs also say, "By default, the value is set to 0.25." Was it 0.1 only in version 0.18?
Defaults and their documentation were inconsistent in 0.18, IIRC.
Also, note that the defaults are different for different splitting functions/classes (e.g. edit: note that the
But you're saying it was 0.1 for one version?
The default

As far as I can remember, the default

In addition: I'm not sure what objects you're referring to by "it" --- the default behavior of the Splitter classes and
OK, thanks. I just wanted to make sure the defaults for train_test_split hadn't changed recently, because if so it would affect my models. It doesn't sound like it, so I'm satisfied. I guess this is a good example of the wisdom of always specifying parameters rather than relying on defaults.
Reference Issue
Fixes #5948, #4618
What does this implement/fix? Explain your changes.
Changes the default behavior of train_size and test_size in splitters. The defaults for both parameters are now None.
- If train_size and test_size are both None, train_size is set to 0.9 and test_size is set to 0.1.
- If only one of train_size or test_size is set, then the value of the unset parameter is n_samples - set_parameter (the complement).
- If both are set, allow their sum to be less than n_samples and respect that the user wants to subsample the dataset.
Any other comments?
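The defaulting rules proposed in this description can be sketched in a few lines. This is an illustration only, not scikit-learn's actual code; the name `resolve_sizes` is hypothetical and the 0.9/0.1 values come from the description above:

```python
def resolve_sizes(n_samples, train_size=None, test_size=None):
    # Floats are proportions of the dataset; ints are absolute counts.
    def as_count(size):
        return round(n_samples * size) if isinstance(size, float) else size

    if train_size is None and test_size is None:
        train_size, test_size = 0.9, 0.1  # proposed defaults
    n_train = None if train_size is None else as_count(train_size)
    n_test = None if test_size is None else as_count(test_size)
    if n_train is None:
        n_train = n_samples - n_test   # complement of test_size
    elif n_test is None:
        n_test = n_samples - n_train   # complement of train_size
    # If both were given, keep them as-is: the user may deliberately
    # subsample, so n_train + n_test can be less than n_samples.
    return n_train, n_test
```

For example, setting only `train_size=0.6` on 100 samples would yield a 60/40 split, while setting both sizes leaves any leftover samples unused.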
I'm aware that the cross_validation module is being deprecated; does this mean that I shouldn't add these changes to it, and only to model_selection?
TODO