[MRG] Add a stratify option to utils.resample#13549
[MRG] Add a stratify option to utils.resample#13549glemaitre merged 10 commits intoscikit-learn:masterfrom
Conversation
| return bool(isinstance(x, numbers.Real) and np.isnan(x)) | ||
|
|
||
|
|
||
| def _approximate_mode(class_counts, n_draws, rng): |
There was a problem hiding this comment.
I wonder whether this deserves a clearer name
There was a problem hiding this comment.
draw_from_class_counts?
There was a problem hiding this comment.
It's not a random draw, since we actually want the mode. Sorry... not coming up w good names here either
sklearn/utils/__init__.py
Outdated
| random_state.shuffle(indices) | ||
| indices = indices[:max_n_samples] | ||
| # Code adapted from StratifiedShuffleSplit() | ||
| y = stratify |
There was a problem hiding this comment.
I wonder whether there is a better way to share the code/logic with StratifiedShuffleSplit. Am I right to think the difficulty stems from the use of permutation + slice in ShuffleSplit, which we don't want here?
There was a problem hiding this comment.
Not really, I removed the permutation + slice logic because it's simpler to use np.random.choice, but could have kept it.
The real need for this is that we want to avoid the checks for train / test set sizes that are in StratifiedShuffleSplit()
jnothman
left a comment
There was a problem hiding this comment.
I'm okay with this. Please add to what's new.
sklearn/utils/tests/test_utils.py
Outdated
| n_samples = 100 | ||
| X = rng.normal(size=(n_samples, 1)) | ||
| y = rng.randint(0, 2, size=(n_samples, 2)) | ||
| resample(X, y, n_samples=50, random_state=rng, stratify=y) |
There was a problem hiding this comment.
We should probably check the shape of y.
There was a problem hiding this comment.
I'm not sure if we can have a better test
|
@NicolasHug Thanks!!! Going forward for SuccessiveHalving :) |
…)" This reverts commit be9bbc6.
…)" This reverts commit be9bbc6.
Reference Issues/PRs
Closes #13507
What does this implement/fix? Explain your changes.
This PR adds a
stratifyoption toutils.resample. The issue withtrain_test_splitis that it will (rightfully) complain if train or testsets are empty.The code is based on that of
StratifiedShuffleSplit.Any other comments?
I personally need this to properly implement SuccessiveHalving #12538