-
-
Notifications
You must be signed in to change notification settings - Fork 260
BaseEstimator tokenization may cause issues for fitted models #658
Copy link
Copy link
Closed
Description
In https://github.com/dask/dask-ml/blob/master/dask_ml/model_selection/_normalize.py#L18-L24, we register a tokenizer for scikit-learn models that inherit from BaseEstimator. This can cause issues when we have multiple instances of the same model that have been fitted to different pieces of data (as in #657).
In [8]: import dask
...: import sklearn.datasets
...: import sklearn.linear_model
...: import dask_ml.model_selection._normalize
...:
...: m1 = sklearn.linear_model.LogisticRegression(random_state=0)
...: m2 = sklearn.linear_model.LogisticRegression(random_state=0)
...:
...: # Not fitted, fine for these to match
...: assert dask.base.tokenize(m1) == dask.base.tokenize(m2)
...:
...: X1, y1 = sklearn.datasets.make_classification()
...: X2, y2 = sklearn.datasets.make_classification()
...: m1.fit(X1, y1)
...: m2.fit(X2, y2)
...:
...: # After fitting, these should be different!
...: assert dask.base.tokenize(m1) != dask.base.tokenize(m2)
...:
...:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-8-590215541e34> in <module>
16
17 # After fitting, these should be different!
---> 18 assert dask.base.tokenize(m1) != dask.base.tokenize(m2)
19
AssertionError:I think that normalize_estimator should also (try to) hash things like coef_. Since this is supposed to work for arbitrary estimators, the safest thing to do is tokenize any attribute ending in _. We might exclude selected attributes like GridSearchCV.cv_results_, since those include things that are incidental (like the time it took to train a specific split).
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels