[MRG+1] fix StratifiedShuffleSplit train and test overlap#6379
Conversation
| minlength=len(self.y)) == 0, | ||
| )[0] | ||
| missing_idx = rng.permutation(missing_idx) | ||
| train.extend(missing_idx[:(self.n_train - len(train))]) |
There was a problem hiding this comment.
The crux of the problem is that self.n_train - len(train) can be negative if you are unlucky with the rounding, which causes a problem in the slicing below.
|
LGTM |
|
I think this should be included in the |
|
good point. sklearn/model_selection/_split.py |
I wasn't aware of that. Just curious, are model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffle split supposed to be just a copy and paste of each other? |
|
For the most part yes! |
Everything in cross_validation is deprecated. As we changed API we had to |
32b99c0 to
115517e
Compare
|
OK I just pushed a commit with the fix for both model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit.
This is slightly surprising as manually maintaining consistency between the two for two more major releases may be a bit burdensome. I guess this was discussed in details though and it was deemed a better solution than, say, sharing the common code between the two. |
115517e to
95c3a29
Compare
Indeed ;( An attempt to reuse the code from model selection has been made at #5568 Reviews would be welcome 😁 |
|
And thanks for extending the fix to |
sklearn/model_selection/_split.py
Outdated
| >>> sss = StratifiedShuffleSplit(n_iter=3, test_size=0.5, random_state=0) | ||
| >>> sss.get_n_splits(X, y) | ||
| 3 | ||
| 3n |
There was a problem hiding this comment.
I wonder why the docstring tests pass...
There was a problem hiding this comment.
I fixed the typo and I'll take a closer look at why the docstring didn't seem to be run.
There was a problem hiding this comment.
why the docstring didn't seem to be run.
Seems like nosetests ignores files starting with an underscore. Maybe we should use ignore-files=DONTIGNOREANYFILES as was mentioned here
There was a problem hiding this comment.
For some reason you have to use ignore-files=setup.py because trying to import sklearn/cluster/setup.py confuses the hell out of nose not sure why.
|
Besides the typo in the doctest and the fact that travis is still happy about it, +1 for merge as well. |
95c3a29 to
7470513
Compare
Well, there will be no changes to the deprecated code, and hopefully there are not too many bugs to fix ;) |
|
whatsnew entry? |
|
the logic is kinda tricky now. Can you add comments about the one-letter variables please? |
I'll add a whatsnew entry. Just for the record this overlap between train and test happens only in some unlikely edge cases. You have to be pretty unlucky to hit it. Basically you have to
You also need 1. and 2. to coordinate a in a specific fashion as well.
don't think the PR makes the logic trickier than it already was.
I was thinking about whether there was a way to simplify the code, in particular by getting rid of the unassigned samples logic, but I wasn't able to find one right away. I can try to look at it in more details and open a separate PR if I managed to get anywhere. |
in some edge cases and added test. Fix was applied in both sklearn.model_selection and sklearn.cross_validation.
7470513 to
07728d9
Compare
whatsnew entry added. |
Which variables? |
…ain-test-overlap [MRG+1] fix StratifiedShuffleSplit train and test overlap
|
I did mean |
|
Maybe we can make that a programming challenge lol |
I'll try to take a closer look at this in the next few days. |
|
I think we could try to fix
|
Fix #6121.