
How to handle scoring for meta-estimators #200

@TomAugspurger

Description


In #196, we came across an issue with the scoring API. When training an estimator on small NumPy arrays, a regular setup like

import joblib
import sklearn.linear_model
from sklearn.model_selection import GridSearchCV

clf = sklearn.linear_model.SGDClassifier()
param_grid = {'alpha': [0.1, 10]}
grid_search = GridSearchCV(clf, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)

will induce calls to grid_search.estimator.score(X_test, y_test). This is fine since X and y are small.

Now we'd like to do something similar with Incremental (the new wrapper that passes blocks of a dask array to the underlying estimator.partial_fit), which is itself a meta-estimator.

clf = SGDClassifier()
inc = Incremental(clf)

grid_search = GridSearchCV(inc, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)  # X and y are *large* Dask arrays

As written, this will end up calling a scikit-learn scoring function on a CV split, which can cause a memory error since the splits are large dask arrays. The current workaround is to pass a dask-aware scorer to Incremental or GridSearchCV:

from dask_ml.metrics import get_scorer

dask_scorer = get_scorer('accuracy')
inc = Incremental(clf, scoring=dask_scorer)
grid_search = GridSearchCV(inc, param_grid)

Then things will be fine, but that workaround may be difficult to discover. A couple of issues:

  1. (less important) It's a bit unclear how param_grid should behave when estimator is a meta-estimator. I hacked up Incremental.get/set_params to pass things through. Maybe a syntax like param_grid={'estimator.alpha': [0.1, 10]} would make sense for scikit-learn? Has this come up before?
  2. (more important): The string 'accuracy' will get scikit-learn's accuracy scorer. This is unfortunate since X_test and y_test will be large dask Arrays, which may cause a memory error when passed to the scikit-learn version. A few potential solutions:
    a.) Do nothing, just write better documentation. Maybe make scoring a required argument to Incremental?
    b.) Rewrite the scikit-learn scorers to be compatible with both ndarrays and dask arrays (is that possible?)
    c.) A way to introspect an estimator to get the scorer used by Estimator.score. Something like
from sklearn.metrics import get_estimator_scorer  # or just put in get_scorer
scorer = get_estimator_scorer(sklearn.linear_model.SGDClassifier)
# <function sklearn.metrics.classification.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)>

Then in Incremental, we can detect that the default scorer should be dask_ml.metrics.classification.accuracy_score. I haven't checked many estimators to see if this is feasible.
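On point 1, scikit-learn's existing double-underscore convention may already cover this: GridSearchCV routes a key like estimator__alpha through get_params/set_params to the wrapped estimator, as long as the meta-estimator exposes it as a constructor parameter and follows the BaseEstimator contract. A minimal sketch (WrappedEstimator here is a stand-in for Incremental, not the real class):

```python
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV


class WrappedEstimator(BaseEstimator):
    """Minimal stand-in for a meta-estimator like Incremental."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a clone so repeated fits don't mutate the constructor argument.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def score(self, X, y):
        return self.estimator_.score(X, y)


# 'estimator__alpha' is routed to SGDClassifier.alpha automatically,
# with no custom get_params/set_params pass-through needed.
param_grid = {'estimator__alpha': [0.1, 10]}
grid_search = GridSearchCV(WrappedEstimator(SGDClassifier()), param_grid, cv=3)
```

Since BaseEstimator.get_params(deep=True) recurses into parameters that are themselves estimators, the nested key shows up without any hacking, which suggests the pass-through in Incremental could be dropped in favor of the standard convention.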
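For option (c), one rough way to sketch the introspection: the default score methods come from the standard mixins (ClassifierMixin.score is accuracy, RegressorMixin.score is R²), so a dispatch on the mixin recovers the underlying metric without calling score. Note get_estimator_scorer does not exist in scikit-learn today; this is only an illustration of what it might return:

```python
from sklearn.base import ClassifierMixin, RegressorMixin
from sklearn.metrics import accuracy_score, r2_score


def get_estimator_scorer(estimator):
    """Hypothetical helper: return the metric behind estimator.score.

    Accepts an estimator class or instance. Only covers estimators that
    inherit the stock mixins; estimators overriding score would need
    their own entry.
    """
    cls = estimator if isinstance(estimator, type) else type(estimator)
    if issubclass(cls, ClassifierMixin):
        return accuracy_score  # ClassifierMixin.score computes accuracy
    if issubclass(cls, RegressorMixin):
        return r2_score        # RegressorMixin.score computes R^2
    raise TypeError(f"No default scorer known for {cls.__name__}")
```

Incremental could then map the returned scikit-learn metric to its dask_ml.metrics counterpart (e.g. accuracy_score to dask_ml.metrics.classification.accuracy_score). Estimators that override score (e.g. with a custom metric) are exactly the cases where this breaks down, which is why it would need checking across many estimators.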

cc @ogrisel, @GaelVaroquaux if they have thoughts on this
