
How to handle scoring for meta-estimators #200

@TomAugspurger

Description


In #196, we came across an issue with the scoring API. When training an estimator on small NumPy arrays, a regular setup like

import joblib
import sklearn.linear_model
from sklearn.model_selection import GridSearchCV

clf = sklearn.linear_model.SGDClassifier()
param_grid = {'alpha': [0.1, 10]}
grid_search = GridSearchCV(clf, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)

will induce calls to grid_search.estimator.score(X_test, y_test). This is fine since X and y are small.

Now we'd like to do something similar with Incremental (the new wrapper that passes blocks of a dask array to the underlying estimator.partial_fit), which is itself a meta-estimator.

clf = SGDClassifier()
inc = Incremental(clf)

grid_search = GridSearchCV(inc, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)  # X and y are *large* Dask arrays

As written, this will end up calling a scikit-learn scoring function on a CV split, which can cause a memory error since the splits are large dask arrays. The current workaround is to pass a dask-aware scorer to Incremental or GridSearchCV:

from dask_ml.metrics import get_scorer

dask_scorer = get_scorer('accuracy')
inc = Incremental(clf, scoring=dask_scorer)
grid_search = GridSearchCV(inc, param_grid)

Then things will be fine, but that workaround may be difficult to discover. A couple of issues:

  1. (less important) It's a bit unclear how param_grid should behave when estimator is a meta-estimator. I hacked up Incremental.get/set_params to pass things through. Maybe a syntax like param_grid={'estimator.alpha': [0.1, 10]} would make sense for scikit-learn? Has this come up before?
  2. (more important): The string 'accuracy' will get scikit-learn's accuracy scorer. This is unfortunate since X_test and y_test will be large dask Arrays, which may cause a memory error when passed to the scikit-learn version. A few potential solutions:
    a.) Do nothing, just write better documentation. Maybe make scoring a required argument to Incremental?
    b.) Rewrite the scikit-learn scorers to be compatible with both ndarrays and dask arrays (is that possible?)
    c.) A way to introspect an estimator to get the scorer used by Estimator.score. Something like
from sklearn.metrics import get_estimator_scorer  # or just put in get_scorer
scorer = get_estimator_scorer(sklearn.linear_model.SGDClassifier)
# <function sklearn.metrics.classification.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)>

Then in Incremental, we can detect that the default scorer should be dask_ml.metrics.classification.accuracy_score. I haven't checked many estimators to see if this is feasible.
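On point 1, scikit-learn's existing double-underscore convention may already cover this: GridSearchCV routes a key like estimator__alpha through get_params/set_params to the wrapped estimator, as long as the meta-estimator exposes it as a constructor parameter and follows the BaseEstimator contract. A minimal sketch (WrappedEstimator here is a stand-in for Incremental, not the real class):

```python
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV


class WrappedEstimator(BaseEstimator):
    """Minimal stand-in for a meta-estimator like Incremental."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a clone so repeated fits don't mutate the constructor argument.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def score(self, X, y):
        return self.estimator_.score(X, y)


# 'estimator__alpha' is routed to SGDClassifier.alpha automatically,
# with no custom get_params/set_params pass-through needed.
param_grid = {'estimator__alpha': [0.1, 10]}
grid_search = GridSearchCV(WrappedEstimator(SGDClassifier()), param_grid, cv=3)
```

Since BaseEstimator.get_params(deep=True) recurses into parameters that are themselves estimators, the nested key shows up without any hacking, which suggests the pass-through in Incremental could be dropped in favor of the standard convention.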
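For option (c), one rough way to sketch the introspection: the default score methods come from the standard mixins (ClassifierMixin.score is accuracy, RegressorMixin.score is R²), so a dispatch on the mixin recovers the underlying metric without calling score. Note get_estimator_scorer does not exist in scikit-learn today; this is only an illustration of what it might return:

```python
from sklearn.base import ClassifierMixin, RegressorMixin
from sklearn.metrics import accuracy_score, r2_score


def get_estimator_scorer(estimator):
    """Hypothetical helper: return the metric behind estimator.score.

    Accepts an estimator class or instance. Only covers estimators that
    inherit the stock mixins; estimators overriding score would need
    their own entry.
    """
    cls = estimator if isinstance(estimator, type) else type(estimator)
    if issubclass(cls, ClassifierMixin):
        return accuracy_score  # ClassifierMixin.score computes accuracy
    if issubclass(cls, RegressorMixin):
        return r2_score        # RegressorMixin.score computes R^2
    raise TypeError(f"No default scorer known for {cls.__name__}")
```

Incremental could then map the returned scikit-learn metric to its dask_ml.metrics counterpart (e.g. accuracy_score to dask_ml.metrics.classification.accuracy_score). Estimators that override score (e.g. with a custom metric) are exactly the cases where this breaks down, which is why it would need checking across many estimators.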

cc @ogrisel, @GaelVaroquaux if they have thoughts on this
