[MRG+1] Learning curve: Add an option to randomly choose indices for different training sizes #7506
Conversation
Why not just pass a …

Thank you, @amueller, for the prompt response. `from sklearn.cross_validation import LabelKFold` …

You are right, there's no shuffle in …

I see, do you mean adding `shuffle=False` here?

I was going to say that I don't mind adding a … (Also, the …)

Yes, adding shuffling to the equivalent in model_selection/_split.py.
Great, thanks @NarineK :)

I hope it's okay to start a new PR, thanks @NarineK.

I think what you need here is actually a …

@jnothman, or leave the subsetting to the cv object and use … What would a stratified learning curve look like? Make sure that for each … And do we then also add a group option to learning_curve? Maybe …
Thinking about it more, I feel that the way we are doing the learning curves is weird. @agramfort, do you have literature on cross-validation with learning curves?

We don't need to add a group option: the nice thing about the learning curve implementation is that any questions of dependency between training and test samples are handled by …

I prefer this option too. Should …
@jnothman, that's true about correlations between the training and the test set, but that might make for very strange learning curves. I guess we can do the shuffle here. I'm not entirely convinced by the CV-based approach, but I guess it's too late to change anyhow.
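To make the proposal concrete, here is a minimal sketch of shuffling one CV split's training indices before taking a size prefix. The helper name `shuffled_prefix` is hypothetical and is not the actual patch, just an illustration of the idea:

```python
import numpy as np

def shuffled_prefix(train, n_train_samples, rng):
    # Shuffle the training indices of one CV split before taking a
    # prefix, so that small training sizes still see a mix of labels.
    train = rng.permutation(train)
    return train[:n_train_samples]

rng = np.random.RandomState(0)
train = np.arange(10)  # worst case: indices sorted by label
prefix = shuffled_prefix(train, 4, rng)
print(sorted(prefix))  # four indices drawn from across the whole split
```

Without the `permutation` call this reduces to the old behaviour, `train[:n_train_samples]`, which always takes the first indices in order.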
I assume the change is needed in sklearn/model_selection/_validation.py instead of sklearn/learning_curve.py.

Only the model_selection one, I think.

The overall patch is now nothing…?
I merged master to fix the conflicts. I'll push my changes in model selection soon.
```python
def test_learning_curve_with_shuffle():
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [11, 12], [13, 14], [15, 16], …
```
Add a note here about why the test is designed this way, or perhaps just reference the issue number.

Even just making clear that you assert in the test that the shuffle=False case breaks will help the maintainer.
I'd rather the contrast between failure and success. Why are you looking at …
I'll add the test score too.

SGDClassifier itself gives non-deterministic scores; it is hard to write test cases for it.

MultinomialNB doesn't fail if I set all labels the same. I'm not sure whether this is by design, but it is not consistent with other algorithms, and the output isn't helpful either.

I couldn't find an estimator for which (shuffle=False and exploit_incremental_learning=True) would fail. Tried: …
I've realised why: it gets the list of … We still have an underlying problem: the metrics are being calculated incorrectly (assuming fewer classes than there should be), but that's a somewhat different issue, closely related to #6231.
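For context, the incremental-learning path relies on `partial_fit`, which needs the full set of classes declared on the first call so that later batches (and scoring) can handle classes that a given batch happens to miss. A minimal sketch of that pattern, with made-up data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = np.abs(rng.randn(6, 3))  # MultinomialNB needs non-negative features
y = np.array([0, 0, 1, 1, 2, 2])

clf = MultinomialNB()
# Declaring all classes on the first call lets later batches omit some of them.
clf.partial_fit(X[:2], y[:2], classes=np.unique(y))
clf.partial_fit(X[2:], y[2:])
print(clf.classes_)  # all three classes are known to the model
```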
jnothman
left a comment
Otherwise LGTM. Please add an entry to what's new (under 0.19/enhancements).
```python
    return train_sizes_abs, out[0], out[1]
```

```python
def _shuffle_train_indices(cv_iter, shuffle, random_state):
```
I think we had a little misunderstanding in the creation of this function. Can you please put it back inline? Thanks.
… cases and added new entry under 0.19/enhancements
doc/whats_new.rst
Outdated
```rst
- Added ``shuffle`` and ``random_state`` parameters to shuffle training
  data before taking prefixes of it based on training sizes in
  ``model_selection``s ``learning_curve``.
```
This should be :func:`model_selection.learning_curve`.
```rst
  data before taking prefixes of it based on training sizes in
  ``model_selection``s ``learning_curve``.
  (`#7506` <https://github.com/scikit-learn/scikit-learn/pull/7506>_) by
  `Narine Kokhlikyan`_.
```
Need to add the link target at the bottom of the file.
Added the modifications in doc/whats_new.rst.
```python
def test_learning_curve_with_shuffle():
    """Following test case was designed this way to verify the code
```
Please use a comment, not a docstring, for the test; that makes it easier to find out which test is run. Also, I'm not sure I understand the test. Can you please add an explanation here?

After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?
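The failure mode the reviewers describe can be illustrated with hypothetical data (not the exact arrays from the test): with group-based splitting and label 4 placed at the end of each group, an unshuffled training prefix misses that class entirely:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: label 4 sits at the end of each group, so a small
# unshuffled prefix of the sorted training indices never contains it.
y = np.array([1, 1, 1, 2, 3, 4, 1, 1, 2, 3, 4, 4])
groups = np.array([1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3])
X = np.arange(len(y)).reshape(-1, 1)

cv = GroupKFold(n_splits=2)
for train, test in cv.split(X, y, groups):
    prefix = train[:4]           # small training size, no shuffle
    print(np.unique(y[prefix]))  # label 4 is absent from the prefix
```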
amueller
left a comment
LGTM apart from explaining the test.
```python
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=np.linspace(0.3, 1.0, 3),
        groups=groups, shuffle=True, random_state=2,
        exploit_incremental_learning=True)
    assert_array_almost_equal(train_scores_inc.mean(axis=1),
```
Any reason to use the mean here instead of everything?

Thank you for the review, @amueller. I used the mean instead of everything in order to be consistent with the other test cases for learning curves.
Thanks, @NarineK.
Oh, I haven't addressed this point yet, @amueller: "After reading the discussion again, the point of the test is that it would fail without shuffling, because the first split doesn't contain label 4. Can you please just add that here?" Is it too late now? I see you merged it.
Damn, too quick. I'll add the comment in master. I'm right about the intent, though?
Yes, you're right. Thank you.
Either way, thanks, @NarineK, for raising the issue and solving it, even when we told you to solve it the wrong way at first!

No problem, my pleasure!
…different training sizes (scikit-learn#7506)

* Chooses randomly the indices for different training sizes
* Bring back deleted line
* Rewrote the description of 'shuffle' attribute
* use random.sample instead of np.random.choice
* replace tabs with spaces
* merge to master
* Added shuffle in model-selection's learning_curve method
* Added shuffle for incremental learning + addressed Joel's comment
* Shorten long lines
* Add 2 blank spaces between test cases
* Addressed Joel's review comments
* Added 2 blank lines between methods
* Added non-regression test for learning_curve with shuffle
* Fixed indentations
* Fixed space issues
* Modified test cases + small code improvements
* Fix some style issues
* Addressed Joel's comments - removed _shuffle_train_indices, more test cases and added new entry under 0.19/enhancements
* Added some modifications in whats_new.rst
Currently, training sizes are chosen sequentially from 0 to n_train_samples:

```python
train[:n_train_samples]
```

If the training data is sorted by the target variable, then for small training sizes it will pick samples of only one label, and model fitting will always fail.

For example, the following always fails with `ValueError: The number of classes has to be greater than one; got 1`:

```python
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3)
```

The following runs successfully for most tries:

```python
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y, train_sizes=[0.7, 1.0], cv=3, shuffle=True)
```

If we had an option to shuffle the indices of the training data before taking `n_train_samples`, that would increase our chances of not fitting data with a single label into the learner and give more label variety.

In this pull request I made a small modification and added an option to shuffle and choose the indices randomly. We could do the same for incremental learning.

Let me know what you think. Thanks!
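A self-contained version of the fixed call can be sketched as follows, using synthetic data (the arrays and seed are made up for illustration, so the exact scores are meaningless):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

# Synthetic data; with shuffle=True the training prefixes for each CV
# split are drawn from a shuffled copy of the training indices, so even
# label-sorted data yields more than one class per prefix.
X = np.random.RandomState(0).randn(30, 2)
y = np.repeat([0, 1, 2], 10)

sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X, y,
    train_sizes=[0.7, 1.0], cv=3, shuffle=True, random_state=0)
print(sizes)  # absolute training sizes actually used
```

The fractional `train_sizes` are resolved to absolute sample counts relative to the largest training split, and scores come back with shape `(n_ticks, n_cv_folds)`.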