[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes #7506
Conversation
Why not just pass a …

Thank you, @amueller, for the prompt response. `from sklearn.cross_validation import LabelKFold` …

You are right, there's no shuffle in …

I see, do you mean adding `shuffle=False` here?

I was going to say that I don't mind adding a … (Also, the …)

Yes, adding shuffling to the equivalent in model_selection/_split.py.
Great, thanks @NarineK :)

I hope it's okay to start a new PR, thanks @NarineK.

I think what you need here is actually a …

@jnothman, or leave the subsetting to the cv object and use … What would a stratified learning curve look like? Make sure that for each … And do we then also add a group option to learning_curve? Maybe …
Thinking about it more, I feel that the way we are doing the learning curves is weird. @agramfort, do you have literature on cross-validation with learning curves?

We don't need to add a group option: the nice thing about the learning curve implementation is that any questions of dependency between training and test samples are handled by …

I prefer this option too. Should …
@jnothman, that's true about correlations between the training and the test set, but that might make for very strange learning curves. I guess we can do the shuffle here. I'm not entirely convinced by the CV-based approach, but I guess it's too late to change anyhow.
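To make the proposal concrete, here is a minimal sketch of shuffling one CV split's training indices before taking a size prefix. The helper name `shuffled_prefix` is hypothetical and is not the actual patch, just an illustration of the idea:

```python
import numpy as np

def shuffled_prefix(train, n_train_samples, rng):
    # Shuffle the training indices of one CV split before taking a
    # prefix, so that small training sizes still see a mix of labels.
    train = rng.permutation(train)
    return train[:n_train_samples]

rng = np.random.RandomState(0)
train = np.arange(10)  # worst case: indices sorted by label
prefix = shuffled_prefix(train, 4, rng)
print(sorted(prefix))  # four indices drawn from across the whole split
```

Without the `permutation` call this reduces to the old behaviour, `train[:n_train_samples]`, which always takes the first indices in order.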
I assume the change is needed in sklearn/model_selection/_validation.py instead of sklearn/learning_curve.py.

Only the model_selection one, I think.

The overall patch is now nothing…?
I merged master to fix the conflicts. I'll push my changes in model selection soon.
```python
def test_learning_curve_with_shuffle():
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [11, 12], [13, 14], [15, 16], …
```
Add a note here about why the test is designed this way, or perhaps just reference the issue number.

Even just making clear that you assert in the test that the shuffle=False case breaks will help the maintainer.
I'd rather the contrast between failure and success. Why are you looking at …
I'll add the test score too.

SGDClassifier itself gives non-deterministic scores; it is hard to write test cases for it.

MultinomialNB doesn't fail if I set all labels the same. I'm not sure whether this is by design, but it is not consistent with other algorithms, and the output isn't helpful either.

I couldn't find an estimator for which (shuffle=False and exploit_incremental_learning=True) would fail. Tried: …
I've realised why: it gets the list of … We still have an underlying problem: the metrics are being calculated incorrectly (assuming fewer classes than there should be), but that's a somewhat different issue, closely related to #6231.
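For context, the incremental-learning path relies on `partial_fit`, which needs the full set of classes declared on the first call so that later batches (and scoring) can handle classes that a given batch happens to miss. A minimal sketch of that pattern, with made-up data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = np.abs(rng.randn(6, 3))  # MultinomialNB needs non-negative features
y = np.array([0, 0, 1, 1, 2, 2])

clf = MultinomialNB()
# Declaring all classes on the first call lets later batches omit some of them.
clf.partial_fit(X[:2], y[:2], classes=np.unique(y))
clf.partial_fit(X[2:], y[2:])
print(clf.classes_)  # all three classes are known to the model
```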
jnothman
left a comment
Otherwise LGTM. Please add an entry to what's new (under 0.19/enhancements).
```python
    return train_sizes_abs, out[0], out[1]
```

```python
def _shuffle_train_indices(cv_iter, shuffle, random_state):
```
I think we had a little misunderstanding in the creation of this function. Can you please put it back inline? Thanks.
… cases and added new entry under 0.19/enhancements
doc/whats_new.rst
Outdated
```rst
- Added ``shuffle`` and ``random_state`` parameters to shuffle training
  data before taking prefixes of it based on training sizes in
  ``model_selection``s ``learning_curve``.
```
This should be :func:`model_selection.learning_curve`.
```rst
  data before taking prefixes of it based on training sizes in
  ``model_selection``s ``learning_curve``.
  (`#7506` <https://github.com/scikit-learn/scikit-learn/pull/7506>_) by
  `Narine Kokhlikyan`_.
```
Need to add the link target at the bottom of the file.
Added the modifications in doc/whats_new.rst.
```python
def test_learning_curve_with_shuffle():
    """Following test case was designed this way to verify the code
```
Please use a comment, not a docstring, for the test; that makes it easier to find out which test is run. Also, I'm not sure I understand the test. Can you please add an explanation here?

After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?
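The failure mode the reviewers describe can be illustrated with hypothetical data (not the exact arrays from the test): with group-based splitting and label 4 placed at the end of each group, an unshuffled training prefix misses that class entirely:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: label 4 sits at the end of each group, so a small
# unshuffled prefix of the sorted training indices never contains it.
y = np.array([1, 1, 1, 2, 3, 4, 1, 1, 2, 3, 4, 4])
groups = np.array([1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3])
X = np.arange(len(y)).reshape(-1, 1)

cv = GroupKFold(n_splits=2)
for train, test in cv.split(X, y, groups):
    prefix = train[:4]           # small training size, no shuffle
    print(np.unique(y[prefix]))  # label 4 is absent from the prefix
```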
amueller
left a comment
LGTM apart from explaining the test.
```python
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
        groups=groups, shuffle=True, random_state=2,
        exploit_incremental_learning=True)
    assert_array_almost_equal(train_scores_inc.mean(axis=1),
```
Any reason to use the mean here instead of everything?

Thank you for the review, @amueller. I used the mean instead of everything in order to be consistent with the other test cases for learning curves.
Thanks, @NarineK.
Oh, I haven't addressed this point yet, @amueller: "After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?" Is it too late now? I see you merged it.
Damn, too quick. I'll add the comment in master. I'm right about the intent, though?
Yes, you're right. Thank you.
Either way, thanks, @NarineK, for raising the issue and solving it, even when we told you to solve it the wrong way at first!

No problem, my pleasure!
…different training sizes (scikit-learn#7506)

* Chooses randomly the indices for different training sizes
* Bring back deleted line
* Rewrote the description of 'shuffle' attribute
* use random.sample instead of np.random.choice
* replace tabs with spaces
* merge to master
* Added shuffle in model-selection's learning_curve method
* Added shuffle for incremental learning + addressed Joel's comment
* Shorten long lines
* Add 2 blank spaces between test cases
* Addressed Joel's review comments
* Added 2 blank lines between methods
* Added non-regression test for learning_curve with shuffle
* Fixed indentations
* Fixed space issues
* Modified test cases + small code improvements
* Fix some style issues
* Addressed Joel's comments - removed _shuffle_train_indices, more test cases and added new entry under 0.19/enhancements
* Added some modifications in whats_new.rst
Currently, training sizes are chosen sequentially from 0 to n_train_samples:

```python
train[:n_train_samples]
```

If the training data is sorted by the target variable, then for small training sizes it will pick samples of only one label, and model fitting will always fail.

For example, the following always fails with `ValueError: The number of classes has to be greater than one; got 1`:

```python
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3)
```

The following runs successfully for most tries:

```python
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3, shuffle=True)
```

If we had an option to shuffle the indices of the training data before taking `n_train_samples`, that would increase our chances of not fitting data with a single label into the learner and give more label variety.

In this pull request I made a small modification and added an option to shuffle and choose the indices randomly. We could do the same for incremental learning.

Let me know what you think. Thanks!
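A self-contained version of the fixed call can be sketched as follows, using synthetic data (the arrays and seed are made up for illustration, so the exact scores are meaningless):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

# Synthetic data; with shuffle=True the training prefixes for each CV
# split are drawn from a shuffled copy of the training indices, so even
# label-sorted data yields more than one class per prefix.
X = np.random.RandomState(0).randn(30, 2)
y = np.repeat([0, 1, 2], 10)

sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y,
    train_sizes=[0.7, 1.0], cv=3, shuffle=True, random_state=0)
print(sizes)  # absolute training sizes actually used
```

The fractional `train_sizes` are resolved to absolute sample counts relative to the largest training split, and scores come back with shape `(n_ticks, n_cv_folds)`.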