[MRG + 1] Added DisjointLabelKFold to perform K-Fold cv on sets with disjoint labels. #4444
JeanKossaifi wants to merge 27 commits into scikit-learn:master from
Conversation
This is the same as LeavePLabelOut, right?
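For readers comparing the two schemes: LeavePLabelOut enumerates every combination of p held-out labels, whereas the proposed K-fold variant partitions the labels into a fixed number of folds, so the split counts differ sharply. A small sketch of the difference (the sizes here are illustrative, not from the PR):

```python
from itertools import combinations

# LeavePLabelOut with p=2 over 6 distinct labels holds out every
# pair of labels once, giving C(6, 2) = 15 train/test splits.
n_labels, p = 6, 2
lplo_splits = list(combinations(range(n_labels), p))
assert len(lplo_splits) == 15

# A label-aware K-fold with n_folds=3 instead partitions the same
# 6 labels into 3 disjoint folds, giving only 3 splits.
n_folds = 3
folds = [list(range(n_labels))[i::n_folds] for i in range(n_folds)]
assert len(folds) == 3
assert sorted(sum(folds, [])) == list(range(n_labels))  # a true partition
```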
SubjectIndependent is not a great name either, though, as it is very domain-specific.
Sorry, I just now saw the discussion on the mailing list. If it is very similar to LeavePLabelOut, we indeed should either add a
Hi Andreas, thanks for the feedback. Cheers, Jean
I think adding what you want seems like a good idea. I just don't have a good name. Maybe GroupIndependentKFold?

Sorry, misclick.
To avoid side effects between tests, please avoid seeding the global PRNG instance; instead use:
rng = np.random.RandomState(0)
...
subjects = rng.randint(0, n_subjects, n_samples)
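The distinction being drawn is between seeding NumPy's global PRNG (which leaks state into any other test that draws from `np.random`) and drawing from a dedicated `RandomState` instance. A minimal sketch of the suggested pattern (the sizes are illustrative, not the PR's actual test values):

```python
import numpy as np

n_subjects, n_samples = 10, 100  # illustrative sizes

# A local RandomState keeps the test deterministic without touching
# the global np.random stream that other tests may rely on:
rng = np.random.RandomState(0)
subjects = rng.randint(0, n_subjects, n_samples)

assert subjects.shape == (n_samples,)
assert 0 <= subjects.min() and subjects.max() < n_subjects
```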
I agree with @amueller with respect to the name. Maybe we could use
Cosmetic changes to tests (fix seed correctly, use assert_equal for meaningful error messages).
Thanks @ogrisel for the review! Made the corrections and changed the name to DisjointGroupKFold.
I don't like the use of Group when elsewhere in CV Label actually means the same thing. Group might be the better name, but we should be consistent.
I don't like the use of Group when elsewhere in CV Label actually means the
same thing. Group might be the better name, but we should be consistent.
+1
@jnothman, @GaelVaroquaux +1 for consistency
sklearn/cross_validation.py
Outdated
Could you add a nice Examples section like we have for LeavePLabelOut here?
Thanks for adding the example :) (There is a whitespace-related failure at https://travis-ci.org/scikit-learn/scikit-learn/jobs/55942000#L1336)
Could you also convert this to a comment following #4432? (Sorry for the trouble.)
Thanks for the quick response!
Thanks, I didn't see that issue. Just changed the docstring into a comment.
You could remove the for loop by constructing a coo_matrix instead, but I'm not sure it's worth it. This one might be cleaner.
I agree, to the reader it is probably easier to have the loopy version.
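For context, the two alternatives under discussion might look like this; the variable names and data here are assumptions for illustration, not the PR's actual code. coo_matrix sums duplicate (row, col) entries on conversion, which is what makes it usable as a counter:

```python
import numpy as np
from scipy.sparse import coo_matrix

labels = np.array([0, 1, 1, 2, 2, 2])  # illustrative data
n_labels = 3

# Loopy version, kept in the PR for readability:
samples_per_label = np.zeros(n_labels, dtype=int)
for label in labels:
    samples_per_label[label] += 1

# coo_matrix alternative: a column of ones indexed by label; the
# duplicate entries are summed, yielding the same per-label counts.
data = np.ones(len(labels), dtype=int)
cols = np.zeros(len(labels), dtype=int)
counts = np.asarray(coo_matrix((data, (labels, cols)),
                               shape=(n_labels, 1)).sum(axis=1)).ravel()

assert np.array_equal(samples_per_label, counts)  # both are [1, 2, 3]
```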
sklearn/cross_validation.py
Outdated
I wonder if we should not use a stable sort algorithm like np.argsort(samples_per_label, kind="mergesort") here.
@larsmans any idea if a non-stable sort could cause reproducibility issues in this case?
If the NumPy implementation of quicksort changes, you can get a different ordering for tied labels. From a quick glance at the code, I don't see how that would not affect the output.
Use mergesort.
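The reproducibility concern is easiest to see with tied counts: a stable sort keeps tied entries in their original order, while the default quicksort makes no such guarantee. A small illustration:

```python
import numpy as np

samples_per_label = np.array([3, 1, 3, 1])  # two pairs of tied values

# mergesort is stable: among equal values, lower original indices come
# first, so the result is fully determined by the input.
order = np.argsort(samples_per_label, kind="mergesort")
assert list(order) == [1, 3, 0, 2]

# With the default kind="quicksort", ties may come out in either order,
# and that order can change between NumPy versions, which is exactly
# the reproducibility issue raised in the review.
```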
sklearn/cross_validation.py
Outdated
I'm sure I missed discussion somewhere, but why did this get called y?
We could call it labels; y is simply for consistency with other cross-validation methods (e.g. StratifiedKFold).
But y means something different there; there it is the target of prediction over which samples are stratified. This variable is much more like label in LeaveOneLabelOut
+1, you cannot predict something you have never seen on the training set. y in scikit-learn is always the target variable for supervised prediction. I agree, it's a bad name but it's too late to change.
I'm trying to think of degenerate cases. We should raise an error if

@jnothman good point, I added a test.
sklearn/cross_validation.py
Outdated
I would not put emphasis on balancing but rather on mutual label exclusion:
class DisjointLabelKFold(_BaseKFold):
    """K-fold iterator variant with non-overlapping labels.

    The same label will not appear in two different folds (the number of
    labels has to be at least equal to the number of folds).

    The folds are approximately balanced in the sense that the number of
    distinct labels is approximately the same in each fold.
    """
On 07/01/2015 12:46 PM, Olivier Grisel wrote:
> I would not put emphasis on balancing but rather on mutual label exclusion.

We also have that in #4583 though, right?
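To make the docstring's two properties concrete, here is a minimal, hypothetical sketch of a greedy assignment that keeps labels disjoint across folds while roughly balancing fold sizes; the function name and details are illustrative, not the PR's actual implementation:

```python
import numpy as np

def disjoint_label_kfold_assignment(labels, n_folds=3):
    # Sort labels by decreasing sample count, then greedily put each
    # label into the currently smallest fold.  Every sample of a given
    # label therefore lands in exactly one fold.
    labels = np.asarray(labels)
    unique_labels, counts = np.unique(labels, return_counts=True)
    order = np.argsort(-counts, kind="mergesort")  # stable sort
    fold_sizes = np.zeros(n_folds, dtype=int)
    label_to_fold = {}
    for i in order:
        fold = int(np.argmin(fold_sizes))
        label_to_fold[unique_labels[i]] = fold
        fold_sizes[fold] += counts[i]
    return np.array([label_to_fold[l] for l in labels])

labels = np.array([0, 0, 1, 1, 1, 2, 3, 3])
folds = disjoint_label_kfold_assignment(labels, n_folds=2)
# Mutual exclusion: each label maps to exactly one fold.
for label in np.unique(labels):
    assert len(set(folds[labels == label])) == 1
```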
Please mention this new class in the "see also" section of the
Please also introduce this tool in the user guide in a new section in
You also need to add a new entry for this class at the appropriate location in
sklearn/cross_validation.py
Outdated
I think this should be np.zeros(len(unique_labels), dtype=np.uintp) instead.
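As a small aside on the suggested dtype: np.uintp is an unsigned integer wide enough to hold any in-memory array index, which makes it a natural choice for a non-negative count buffer. A trivial sketch with illustrative names:

```python
import numpy as np

unique_labels = np.array([0, 1, 2])  # illustrative
samples_per_label = np.zeros(len(unique_labels), dtype=np.uintp)
samples_per_label[1] += 5  # counts stay unsigned integers

assert samples_per_label.dtype == np.uintp
assert samples_per_label.tolist() == [0, 5, 0]
```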
Thanks @ogrisel, I added the documentation and addressed the other issues.
I know there was already some discussion and changes regarding the naming of this iterator, but what about simply LabelKFold? (We just added
> I know there was already some discussion and changes regarding the naming of this iterator, but what about simply LabelKFold?

I would be happy with that.
I know we're stuck with "Label" for now, but "GroupedKFold" is much clearer!
I like LabelKFold :)
I am taking care of rebasing and renaming.
Added a SubjectIndependentKFold class to create subject independent folds.