[MRG] Add a stratify option to utils.resample by NicolasHug · Pull Request #13549 · scikit-learn/scikit-learn

NicolasHug · 2019-03-31T16:22:16Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR adds a stratify option to utils.resample. The issue with train_test_split is that it will (rightfully) complain if the train or test set is empty.

The code is based on that of StratifiedShuffleSplit.
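A minimal usage sketch, assuming the `stratify` keyword of `resample` as proposed in this PR. The toy data and variable names are illustrative only and do not come from the PR.

```python
import numpy as np
from sklearn.utils import resample

# Toy, imbalanced data: 80 samples of class 0 and 20 of class 1.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 80 + [1] * 20)

# Draw 10 samples without replacement while keeping the 80/20 class ratio.
X_sub, y_sub = resample(
    X, y, n_samples=10, replace=False, random_state=0, stratify=y
)

# Per-class counts in the subsample.
class_counts = np.bincount(y_sub)
print(class_counts)
```

Unlike train_test_split, nothing here requires a non-empty complementary set, which is the point made above.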

Any other comments?

I personally need this to properly implement SuccessiveHalving #12538

jnothman · 2019-04-03T12:31:50Z

sklearn/utils/__init__.py

    return bool(isinstance(x, numbers.Real) and np.isnan(x))
+
+
+def _approximate_mode(class_counts, n_draws, rng):


I wonder whether this deserves a clearer name

draw_from_class_counts?

It's not a random draw, since we actually want the mode. Sorry... not coming up with good names here either.
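For context on why "draw" is a misleading name, here is a rough sketch of the idea: allocate `n_draws` across classes in proportion to their counts, round down, then hand out the leftover draws by largest remainder. Randomness only enters when remainders tie. This is an illustrative simplification, not the PR's implementation; `approximate_mode_sketch` and its `seed` parameter are hypothetical names.

```python
import numpy as np


def approximate_mode_sketch(class_counts, n_draws, seed=0):
    """Split n_draws across classes roughly in proportion to class_counts."""
    rng = np.random.RandomState(seed)
    class_counts = np.asarray(class_counts)

    # Ideal (fractional) number of draws per class.
    continuous = n_draws * class_counts / class_counts.sum()

    # Rounding down never overshoots n_draws, but may undershoot.
    allocated = np.floor(continuous).astype(int)
    need_to_add = int(n_draws - allocated.sum())

    # Give the leftover draws to the classes with the largest remainders.
    # Shuffling first, then stable-sorting, breaks ties at random.
    remainder = continuous - allocated
    order = rng.permutation(len(class_counts))
    order = order[np.argsort(-remainder[order], kind="stable")]
    allocated[order[:need_to_add]] += 1
    return allocated


balanced = approximate_mode_sketch([4, 2], n_draws=3)
skewed = approximate_mode_sketch([5, 2], n_draws=4)
tied = approximate_mode_sketch([2, 2, 2, 1], n_draws=2)
```

With `[4, 2]` and three draws the proportional split is already integral, so no random tie-breaking happens at all.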

jnothman · 2019-04-03T12:32:23Z

sklearn/utils/__init__.py

-        random_state.shuffle(indices)
-        indices = indices[:max_n_samples]
+        # Code adapted from StratifiedShuffleSplit()
+        y = stratify


I wonder whether there is a better way to share the code/logic with StratifiedShuffleSplit. Am I right to think the difficulty stems from the use of permutation + slice in ShuffleSplit, which we don't want here?

Not really. I removed the permutation + slice logic because it's simpler to use np.random.choice, but I could have kept it.

The real need for this is that we want to avoid the checks for train / test set sizes that are in StratifiedShuffleSplit()
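The two approaches discussed in this thread can be compared side by side. This is a small sketch with made-up indices, not code from the PR; it only illustrates that slicing a permutation and calling `choice` without replacement both select a subset of distinct indices.

```python
import numpy as np

rng = np.random.RandomState(0)
indices = np.arange(10)
max_n_samples = 4

# Permutation + slice: shuffle a copy of the indices, keep the first few.
shuffled = rng.permutation(indices)
subset_perm = shuffled[:max_n_samples]

# Direct draw without replacement: the same kind of result in one call.
subset_choice = rng.choice(indices, size=max_n_samples, replace=False)

# choice() also handles sampling with replacement,
# which slicing a permutation cannot express.
bootstrap = rng.choice(indices, size=max_n_samples, replace=True)
```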

jnothman

I'm okay with this. Please add to what's new.

glemaitre · 2019-04-24T13:08:43Z

sklearn/utils/tests/test_utils.py

+    n_samples = 100
+    X = rng.normal(size=(n_samples, 1))
+    y = rng.randint(0, 2, size=(n_samples, 2))
+    resample(X, y, n_samples=50, random_state=rng, stratify=y)


We should probably check the shape of y.

I'm not sure if we can have a better test
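One possible way to add the suggested shape check is sketched below. It assumes the `stratify` keyword from this PR accepts a 2-D `y`, as the test hunk above implies; the function name `check_stratified_resample_2d_y` is hypothetical and not the name used in the test suite.

```python
import numpy as np
from sklearn.utils import resample


def check_stratified_resample_2d_y():
    # Same setup as the hunk under review, with a fixed seed.
    rng = np.random.RandomState(0)
    n_samples = 100
    X = rng.normal(size=(n_samples, 1))
    y = rng.randint(0, 2, size=(n_samples, 2))

    X_res, y_res = resample(X, y, n_samples=50, random_state=rng, stratify=y)

    # Both outputs should have 50 rows and keep their column counts.
    assert X_res.shape == (50, 1)
    assert y_res.shape == (50, 2)
    return X_res, y_res


X_res, y_res = check_stratified_resample_2d_y()
```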

…sample_strat

glemaitre · 2019-04-24T21:17:18Z

@NicolasHug Thanks!!! Going forward for SuccessiveHalving :)

…)" This reverts commit be9bbc6.

NicolasHug added 4 commits March 26, 2019 15:58

first quick hack

e4d5d05

WIP

e16f999

Merge branch 'master' into resample_strat

950e1b3

Cleaner implem + tests

a0294a0

jnothman reviewed Apr 1, 2019

sklearn/utils/__init__.py (outdated)

moved _approximte_mode in utils

8ffd7ee

jnothman reviewed Apr 3, 2019

jnothman approved these changes Apr 4, 2019

NicolasHug added 2 commits April 4, 2019 08:20

Merge branch 'master' into resample_strat

e17f664

Added whatsnew entry

cac5896

NicolasHug mentioned this pull request Apr 5, 2019

[MRG+2] Faster Gradient Boosting Decision Trees with binned features #12807

Merged

glemaitre self-requested a review April 24, 2019 08:26