[MRG+2] FIX adaboost estimators not randomising correctly#7411
[MRG+2] FIX adaboost estimators not randomising correctly#7411jnothman merged 5 commits intoscikit-learn:masterfrom
Conversation
0561679 to
8e9e0c1
Compare
|
hm looks good. |
sklearn/utils/__init__.py
Outdated
| random_state = check_random_state(random_state) | ||
| to_set = {} | ||
| for key in sorted(estimator.get_params(deep=True)): | ||
| if key == 'random_state' or key.endswith('_random_state'): |
There was a problem hiding this comment.
Pipelines have random states too! This isn't necessarily the only way to do this, but having integer random_states in each component means that unit can be replicated.
There was a problem hiding this comment.
There the application would be trying to make an estimator deterministic. That is not really related to the issue we're trying to fix here, is it?
There was a problem hiding this comment.
The random ensembles currently initialise each estimator with an integer random state derived from their local random state. However, currently they only use the random_state param. They should be setting every random_state param.
There was a problem hiding this comment.
hm... true from a reproducibility perspective. The question is a bit how much of this can / should we put on the user.... Or on the Pipeline? The contract could also be "setting random_state on an object makes it deterministic". Then it would be the Pipelines problem. But that might be too much magic? It would be a pretty clear contract, though.
There was a problem hiding this comment.
There was a problem hiding this comment.
If an object depends on something that we don't control, then that object breaks the contract and it's not our fault ;)
There was a problem hiding this comment.
Too much magic and highly impractical. It means every meta-estimator or wrapper needs a random_state param and needs to manage it to do just the same as this... Here we manage things like dict iteration order invariance; you're going to need a shared helper function anyway. We already define nested parameter management and the random_state convention. Let's use it.
That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs
There was a problem hiding this comment.
That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs
That already happens.
Hm but shouldn't this only match __random_state if we want to match the sub-estimator random state?
There was a problem hiding this comment.
I thought there would be no harm in allowing something to have multiple distinct parameters ending _random_state, but also including __random_state. No, it's not tested; yes, I can change it.
|
@amueller and @GaelVaroquaux , let me know if you think I should make this more conservative, i.e. to only effect the ensembles. |
|
Pushing more conservative version. Adding what's new in case you want to include the fix in 0.18-final. |
7d4bca5 to
5e275f6
Compare
ogrisel
left a comment
There was a problem hiding this comment.
Besides the following minor comments, LGTM.
sklearn/ensemble/base.py
Outdated
| random_state = check_random_state(random_state) | ||
| to_set = {} | ||
| for key in sorted(estimator.get_params(deep=True)): | ||
| if key == 'random_state' or key.endswith('_random_state'): |
There was a problem hiding this comment.
Souldn't it be double _: key.endswith('__random_state') instead?
There was a problem hiding this comment.
@amueller (I think) also questioned this. I wanted, but did not test, this to be applicable to the hypothetical case that an estimator had multiple random_states for multiple purposes. Is that silly?
I'll suppose I'll change it so it doesn't look wrong...
There was a problem hiding this comment.
The multiple random_states case sounds like a YAGNI to me.
|
|
||
| assert_true(isinstance(ensemble[0], Perceptron)) | ||
| assert_equal(ensemble[0].random_state, None) | ||
| assert_true(isinstance(ensemble[1].random_state, int)) |
There was a problem hiding this comment.
For completeness you could add:
assert_true(isinstance(ensemble[2].random_state, int))| (alg, score) | ||
|
|
||
| # Check we used multiple estimators | ||
| assert_true(len(clf.estimators_) > 1) |
There was a problem hiding this comment.
Better user assert_greater(len(clf.estimators_), 1) to get more informative error messages (when using nosetests).
| # Check we used multiple estimators | ||
| assert_true(len(clf.estimators_) > 1) | ||
| # Check for distinct random states (see issue #7408) | ||
| assert_true(len(set(est.random_state for est in clf.estimators_)) > 1) |
There was a problem hiding this comment.
assert_greater(len(set(est.random_state for est in clf.estimators_)), 1)| # Check we used multiple estimators | ||
| assert_true(len(reg.estimators_) > 1) | ||
| # Check for distinct random states (see issue #7408) | ||
| assert_true(len(set(est.random_state for est in reg.estimators_)) > 1) |
|
Changes made, thanks @ogrisel. Am I leaving the what's new in 0.18 or has that train passed? |
b3cc749 to
4751362
Compare
(fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles
|
rebasing and marking MRG+1 due to @ogrisel's LGTM. |
4751362 to
3a1ba65
Compare
| # Check we used multiple estimators | ||
| assert_greater(len(clf.estimators_), 1) | ||
| # Check for distinct random states (see issue #7408) | ||
| assert_greater(len(set(est.random_state |
There was a problem hiding this comment.
shouldn't they be equal to len(clf.estimators_) (with high probability)?
| # Check we used multiple estimators | ||
| assert_true(len(reg.estimators_) > 1) | ||
| # Check for distinct random states (see issue #7408) | ||
| assert_greater(len(set(est.random_state for est in reg.estimators_)), 1) |
|
LGTM apart from minor nitpick in tests. |
|
LGTM. Waiting for a couple of comments by @amueller in the tests before merging (on the len of unique random_state). But everything is for me. +1 for merge |
|
Tests improved. Thanks for the reviews. |
|
Merging. @amueller, please backport. |
|
(Actually, now I'm wondering if I should have merged before CIs were green... I checked that those more specific tests passed locally. Please excuse my unjustified hurry.) |
|
would have probably been better to wait, but if master fails we'll know ;) |
|
I'll backport everything all at once, when we're ready to release, I think. |
* FIX adaboost estimators not randomising correctly (fixes #7408) FIX ensure nested random_state is set in ensembles * DOC add what's new * Only affect *__random_state, not *_random_state for now * TST More informative assertions for ensemble tests * More specific testing of different random_states
…rn#7411) * FIX adaboost estimators not randomising correctly (fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles * DOC add what's new * Only affect *__random_state, not *_random_state for now * TST More informative assertions for ensemble tests * More specific testing of different random_states
* tag '0.18': (1286 commits) [MRG + 1] More versionadded everywhere! (scikit-learn#7403) minor doc fixes fix lbfgs rename (scikit-learn#7503) minor fixes to whatsnew fix scoring function table fix rebase messup DOC more what's new subdivision DOC Attempt to impose some order on What's New 0.18 no fixed width within bold REL changes for release in 0.18.X branch (scikit-learn#7414) [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325) DOC: Added Nested Cross Validation Example (scikit-learn#7111) Sync docstring and definition default argument in kneighbors (scikit-learn#7476) added contributors for 0.18, minor formatting fixes. Fix typo in whats_new.rst [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411) Addressing issue scikit-learn#7468. (scikit-learn#7472) Reorganize README clean up deprecation warning stuff in common tests [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438) ...
* releases: (1286 commits) [MRG + 1] More versionadded everywhere! (scikit-learn#7403) minor doc fixes fix lbfgs rename (scikit-learn#7503) minor fixes to whatsnew fix scoring function table fix rebase messup DOC more what's new subdivision DOC Attempt to impose some order on What's New 0.18 no fixed width within bold REL changes for release in 0.18.X branch (scikit-learn#7414) [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325) DOC: Added Nested Cross Validation Example (scikit-learn#7111) Sync docstring and definition default argument in kneighbors (scikit-learn#7476) added contributors for 0.18, minor formatting fixes. Fix typo in whats_new.rst [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411) Addressing issue scikit-learn#7468. (scikit-learn#7472) Reorganize README clean up deprecation warning stuff in common tests [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438) ...
* dfsg: (1286 commits) [MRG + 1] More versionadded everywhere! (scikit-learn#7403) minor doc fixes fix lbfgs rename (scikit-learn#7503) minor fixes to whatsnew fix scoring function table fix rebase messup DOC more what's new subdivision DOC Attempt to impose some order on What's New 0.18 no fixed width within bold REL changes for release in 0.18.X branch (scikit-learn#7414) [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325) DOC: Added Nested Cross Validation Example (scikit-learn#7111) Sync docstring and definition default argument in kneighbors (scikit-learn#7476) added contributors for 0.18, minor formatting fixes. Fix typo in whats_new.rst [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411) Addressing issue scikit-learn#7468. (scikit-learn#7472) Reorganize README clean up deprecation warning stuff in common tests [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438) ...
…rn#7411) * FIX adaboost estimators not randomising correctly (fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles * DOC add what's new * Only affect *__random_state, not *_random_state for now * TST More informative assertions for ensemble tests * More specific testing of different random_states
…rn#7411) * FIX adaboost estimators not randomising correctly (fixes scikit-learn#7408) FIX ensure nested random_state is set in ensembles * DOC add what's new * Only affect *__random_state, not *_random_state for now * TST More informative assertions for ensemble tests * More specific testing of different random_states
Fixes #7408 , where adaboost estimators were given the same random state initialisation, resulting in poor performance.
At the same time, I realised the need for setting nested
random_states. I think this should be a public utility. This PR also ensures nested random_state is set in ensembles and in common tests.Caveats:
set_random_state) will differ from previouslysklearn.utils.randomize_estimatoris a bad name for the new utility to set nestedrandom_state. Suggestions welcome!sklearn.utils.testing.set_random_statehas a better name, but is subtly and valuably different from the new function:random_state=0rather than system-wide random state{get,set}_params()if they are to be affected by the new utility. (If and when they do get this support, it could change the states set for other parts of the estimator by the new utility, due to parameter iteration order.) PR welcome, IMOrandom_statebut without{get,set}_paramsare not affected by the new utility; this is noted in its docstring.