In #196, we came across an issue with the scoring API. When training an estimator on small NumPy arrays, a regular

```python
import joblib
import sklearn.linear_model
from sklearn.model_selection import GridSearchCV

clf = sklearn.linear_model.SGDClassifier()
param_grid = {'alpha': [0.1, 10]}
grid_search = GridSearchCV(clf, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)
```

will induce calls to `grid_search.estimator.score(X_test, y_test)`. This is fine, since `X` and `y` are small.
Now, we'd like to do something similar with `Incremental` (the new wrapper that passes blocks of a dask array to the underlying `estimator.partial_fit`), which is itself a meta-estimator.
```python
from dask_ml.wrappers import Incremental

clf = SGDClassifier()
inc = Incremental(clf)
grid_search = GridSearchCV(inc, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)  # X and y are *large* dask arrays
```

As written, this will end up calling a scikit-learn scoring function with a CV split, which will potentially cause a memory error. The current workaround is to pass a `scoring` to `Incremental` or `GridSearchCV`:
```python
from dask_ml.metrics import get_scorer

dask_scorer = get_scorer('accuracy')
inc = Incremental(clf, scoring=dask_scorer)
grid_search = GridSearchCV(inc, param_grid)
```

Then things will be fine, but that may be difficult to discover. A couple of issues:
- (less important) It's a bit unclear how `param_grid` should behave when `estimator` is a meta-estimator. I hacked up `Incremental.get/set_params` to pass things through. Maybe a syntax like `param_grid={'estimator.alpha': [0.1, 10]}` would make sense for scikit-learn? Has this come up before?
- (more important) The string `'accuracy'` will get scikit-learn's accuracy scorer. This is unfortunate since `X_test` and `y_test` will be large dask arrays, which may cause a memory error when passed to the scikit-learn version. A few potential solutions:
  a.) Do nothing, just write better documentation. Maybe make `scorer` a required argument to `Incremental`?
  b.) Rewrite the scikit-learn scorers to be compatible with both ndarrays and dask arrays (is that possible?)
  c.) A way to introspect an estimator to get the scorer used by `Estimator.score`. Something like
```python
from sklearn.metrics import get_estimator_scorer  # or just put in get_scorer

scorer = get_estimator_scorer(sklearn.linear_model.SGDClassifier)
# <function sklearn.metrics.classification.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)>
```

Then in `Incremental`, we can detect that the default scorer should be `dask_ml.metrics.classification.accuracy_score`. I haven't checked many estimators to see if this is feasible.
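One hedged sketch of what option (c) could look like: walk the estimator class's MRO and map the mixin that defines `score` to a dask-friendly metric. Everything below (`get_estimator_scorer`, the registry, the stand-in classes) is hypothetical, not an existing scikit-learn or dask-ml API:

```python
# Hypothetical sketch of option (c). Nothing here exists in
# scikit-learn or dask-ml. The idea: ClassifierMixin.score is defined
# in terms of accuracy, so any estimator inheriting from it could be
# mapped to a dask-friendly accuracy implementation.

class ClassifierMixin:
    """Stand-in for sklearn.base.ClassifierMixin."""

def dask_accuracy_score(y_true, y_pred):
    """Stand-in for dask_ml.metrics.accuracy_score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Registry from the mixin that defines .score to the metric behind it.
_DEFAULT_SCORERS = {ClassifierMixin: dask_accuracy_score}

def get_estimator_scorer(estimator_cls):
    """Walk the MRO to find the metric backing estimator_cls.score."""
    for base in estimator_cls.__mro__:
        if base in _DEFAULT_SCORERS:
            return _DEFAULT_SCORERS[base]
    raise TypeError("No default scorer known for %r" % estimator_cls)

class SGDClassifier(ClassifierMixin):
    """Stand-in for sklearn.linear_model.SGDClassifier."""

scorer = get_estimator_scorer(SGDClassifier)
```

The MRO walk mirrors how the real `score` method is inherited: whichever mixin supplies `score` also determines which metric it computes, so the lookup stays consistent with the estimator's actual default.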
cc @ogrisel, @GaelVaroquaux if they have thoughts on this