
MRG clone parameters in gridsearch etc #15096

Merged
adrinjalali merged 5 commits into scikit-learn:master from amueller:grid_search_extra_clone
Oct 29, 2019

Conversation

@amueller (Member)

Fixed #10063 without going through the pain of #8350.

I don't see a case where this could change behavior, as we immediately call fit after setting the steps, but maybe I'm missing something.

This not only removes possible user confusion (see #8350 for how hard the stored estimators are to interpret); it also potentially saves us a lot of memory (imagine grid-searching a neural net and storing the weights for every parameter setting).
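
For context, a minimal sketch (toy data; not code from this PR) of the effect being fixed: before this change, the estimator objects stored in cv_results_['params'] were the very instances that got fitted, so every candidate kept its fitted state alive; with it, they stay unfitted.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
pipe = Pipeline([('reduce', PCA()), ('clf', LogisticRegression())])
grid = GridSearchCV(pipe, {'reduce': [PCA(n_components=5), SelectKBest(k=5)]}, cv=2)
grid.fit(X, y)

# With the parameters cloned, the objects stored in cv_results_['params']
# carry no fitted attributes (no PCA.components_, no SelectKBest.scores_).
for params in grid.cv_results_['params']:
    step = params['reduce']
    print(step, hasattr(step, 'components_'), hasattr(step, 'scores_'))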

@amueller added this to the 0.22 milestone Sep 26, 2019
# clone after setting parameters in case any parameters
# are estimators (like pipeline steps)
# because pipeline doesn't clone steps in fit
estimator = clone(estimator.set_params(**parameters))

Member:

There's a problem if the parameter is assumed to be a fitted estimator? But does it ever happen?

Member Author:

It could, in third-party code, if a meta-estimator doesn't call fit on an estimator.
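
As background for the "pipeline doesn't clone steps in fit" comment in the snippet above, here is a minimal sketch (not code from this PR) showing that Pipeline fits its steps in place, which is why an estimator passed as a search parameter would otherwise come back fitted:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
pca = PCA(n_components=3)
pipe = Pipeline([('reduce', pca), ('clf', LogisticRegression())])
pipe.fit(X, y)

print(pipe.named_steps['reduce'] is pca)  # True: the step was fitted in place
print(hasattr(pca, 'components_'))        # True: fitted state leaked back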

@amueller (Member Author)

@jnothman @glemaitre might have thoughts?

# we clone again after setting params in case some
# of the params are estimators as well.
self.best_estimator_ = clone(clone(base_estimator).set_params(
    **self.best_params_))

Member:

Is the following good enough?

clone(base_estimator.set_params(**self.best_params_))

Member Author:

I don't think so?
We're changing base_estimator then, right?

Member:

The test_grid_search_pipeline_steps test passes without the double clone. Given

base_estimator = clone(self.estimator)

isn't it already a clone?

Member:

We could indeed clone only once, as @thomasjpfan suggested, since base_estimator is just a local variable that isn't used later.

I guess cloning twice is fine too: no surprises.

@NicolasHug (Member)

NicolasHug commented Sep 27, 2019

Cloning the estimator in _fit_and_score() will make the implementation of #8230 (gridsearch + warm start) impossible now :(

Maybe we could just clone the parameters?

@jnothman (Member) left a comment:

Nice hack! I'm not sure that this fixes #10063 which is about cv_results_, though... it only fixes the best_estimator_ part of that problem.

@amueller (Member Author)

amueller commented Oct 4, 2019

@jnothman no, it definitely fixes that one. It's tested pretty extensively.

@amueller (Member Author)

amueller commented Oct 4, 2019

@NicolasHug good point; now only the parameters are cloned. Note that if a parameter is not an estimator, this does a deep copy, so if the parameters are large arrays we keep copying them. On the other hand, if you pass a mutable structure you're asking for trouble, and copying it is probably a good idea.
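
A small illustration (toy values; not code from this PR) of the clone(v, safe=False) semantics described above: estimators get a fresh unfitted clone, while non-estimators fall back to a deep copy instead of raising:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

est = LogisticRegression()
arr = np.arange(5)

est_clone = clone(est, safe=False)  # estimator: fresh unfitted clone
arr_copy = clone(arr, safe=False)   # non-estimator: copied via copy.deepcopy

print(est_clone is est)                                # False
print(arr_copy is arr, np.array_equal(arr, arr_copy))  # False True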

@amueller (Member Author)

amueller commented Oct 4, 2019

@jnothman do you think the current solution is still a hack? Why?

@amueller changed the title from "MRG clone estimator again after setting parameters in gridsearch etc" to "MRG clone parameters in gridsearch etc" Oct 4, 2019

train_scores = {}
if parameters is not None:
    estimator.set_params(**parameters)
    # clone after setting parameters in case any parameters

Member:

I wonder if someone had code relying on the existing behaviour.

Add a test for this wrt cross_validate??

Member Author:

Possibly, but I'm not sure what to do about that.
What do you want tested?

Member:

No, parameters is not set by cross_validate. Could add a test for validation_curve. But I'm okay without.
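
A sketch of the kind of validation_curve test being floated here (hypothetical; not the test added in this PR), checking that estimator-valued parameters passed via param_range stay unfitted:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
pipe = Pipeline([('reduce', PCA()), ('clf', LogisticRegression())])
candidates = [PCA(n_components=5), SelectKBest(k=5)]
validation_curve(pipe, X, y, param_name='reduce',
                 param_range=candidates, cv=2)

# With parameters cloned in _fit_and_score, the originals stay unfitted.
assert not hasattr(candidates[0], 'components_')
assert not hasattr(candidates[1], 'scores_')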

@adrinjalali (Member)

I was curious about a more nested case, but the good news is that this also passes the tests:

#%%
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
rng = np.random.RandomState(41)
X['random_cat'] = rng.randint(3, size=X.shape[0])
X['random_num'] = rng.randn(X.shape[0])

categorical_columns = ['pclass', 'sex', 'embarked', 'random_cat']
numerical_columns = ['age', 'sibsp', 'parch', 'fare', 'random_num']

X = X[categorical_columns + numerical_columns]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('selector', PCA())
])

preprocessing = ColumnTransformer(
    [('cat', categorical_pipe, categorical_columns),
     ('num', numerical_pipe, numerical_columns)])

pipe = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])

# grid-search over a step nested inside a ColumnTransformer sub-pipeline
param_grid = {'preprocess__num__selector': [PCA(), SelectKBest(k=3)]}
grid_search = GridSearchCV(pipe, param_grid, cv=2)
grid_search.fit(X, y)

@NicolasHug (Member) left a comment:

LGTM!

Comment on lines +494 to +496
cloned_parameters = {}
for k, v in parameters.items():
    cloned_parameters[k] = clone(v, safe=False)

Member:

nit: dict comprehension?
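
The suggested comprehension would presumably read (a sketch; not necessarily the merged code):

cloned_parameters = {k: clone(v, safe=False) for k, v in parameters.items()}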

@jnothman (Member) left a comment:

LGTM besides the validation_curve comment, which may not be essential.

@jnothman (Member)

This needs a what's new entry.

@adrinjalali merged commit 3d606cf into scikit-learn:master Oct 29, 2019

@amueller (Member Author)

amueller commented Nov 1, 2019

Thanks folks!


Development

Successfully merging this pull request may close these issues.

GridSearchCV saves all fitted estimator in cv_results['params'] when params are estimators
