
ENH: Make Incremental work in grid search#196

Merged
TomAugspurger merged 6 commits into dask:master from TomAugspurger:incremental-params
Jun 6, 2018

Conversation

@TomAugspurger
Member

Closes #195

Summary of changes

  1. `**kwargs` in the `Incremental` constructor now refers to hyperparameters
     of `estimator`, rather than `fit_kwargs`.
  2. `fit_kwargs` are now passed in `fit`.

This seemed to work just fine with sklearn.model_selection.GridSearchCV.
Incremental.fit is correctly getting Dask (not NumPy) arrays.
Will try it out on a larger test now.

    >>> clf.fit(X, y, classes=[0, 1])
    """
    def __init__(self, estimator, **kwargs):
        estimator.set_params(**kwargs)
Member Author


I think that calling estimator.set_params is the least surprising thing here. In general I don't expect users to pass additional kwargs here; mostly it will be scikit-learn doing so.
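The constructor behavior described above can be sketched in isolation. This is a minimal stand-alone illustration, not dask-ml code: `ToyEstimator` and `IncrementalSketch` are hypothetical stand-ins.

```python
# Minimal sketch: constructor kwargs are hyperparameters of the
# wrapped estimator and are forwarded via set_params, rather than
# being stored as fit kwargs.

class ToyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def set_params(self, **params):
        for k, v in params.items():
            setattr(self, k, v)
        return self


class IncrementalSketch:
    def __init__(self, estimator, **kwargs):
        # kwargs configure the estimator; fit kwargs go to fit() instead
        estimator.set_params(**kwargs)
        self.estimator = estimator


clf = IncrementalSketch(ToyEstimator(), alpha=0.01)
print(clf.estimator.alpha)  # 0.01
```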


    def get_params(self, deep=True):
        out = self.estimator.get_params(deep=deep)
        out['estimator'] = self.estimator
Member Author


Awkward API: what if the wrapped estimator itself has an `estimator` parameter? Raise an error?
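The collision worried about here can be seen with plain-Python stand-ins (again hypothetical classes, not dask-ml code): `get_params` writes an `'estimator'` key that would shadow a same-named parameter on the wrapped estimator.

```python
# Sketch of the get_params behavior in the diff above. If
# ToyEstimator also had an 'estimator' hyperparameter, the assignment
# below would silently overwrite it -- the awkward case in question.

class ToyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        return {"alpha": self.alpha}


class IncrementalSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def get_params(self, deep=True):
        out = self.estimator.get_params(deep=deep)
        out["estimator"] = self.estimator  # potential key collision
        return out


params = IncrementalSketch(ToyEstimator()).get_params()
print(sorted(params))  # ['alpha', 'estimator']
```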

@TomAugspurger
Member Author

This isn't quite working with the distributed scheduler yet.

Closes dask#195
* Scorer module
* Additional logging in training
@TomAugspurger
Member Author

This grew quite a bit, so here's a summary

  1. ParallelPostFit.score and Incremental.score now use dask-aware scorers. Previously they materialized a single ndarray. This exposed a difficulty with the scikit-learn API that I haven't resolved yet: given an estimator like sklearn.linear_model.SGDClassifier, there's no way to inspect which underlying metric SGDClassifier actually uses. I happen to know that it uses accuracy_score, but I would like to detect that in code, so that we can swap the scorer out for a dask-aware one. Writing metrics that work on either NumPy or dask arrays would also work (and may be doable). I'll think on it more and raise a scikit-learn issue. For now we ask users to provide a scoring parameter to Incremental and ParallelPostFit.
  2. I'm seeing some strange errors with the distributed scheduler that I haven't narrowed down yet. When running GridSearchCV(Incremental(SGDClassifier)) I get exceptions in the scheduler like
distributed.client - WARNING - Couldn't gather 2 keys, rescheduling {"('getitem-bd5f254a6b4dabd84f4bf11f7b61edd7', 1)": [], "('getitem-bd5f254a6b4dabd84f4bf11f7b61edd7', 0)": []}
distributed.core - ERROR - "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 375, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1355, in update_graph
    ts = self.tasks[key]
KeyError: "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
distributed.core - ERROR - "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 321, in handle_comm
    result = yield result
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1922, in add_client
    yield self.handle_stream(comm=comm, extra={'client': client})
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 375, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1355, in update_graph
    ts = self.tasks[key]
KeyError: "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/client.py", line 1429, in _gather
    st = self.futures[key]
KeyError: "('getitem-4f528eaacb9ddb9447dec5cd09677027', 2)"
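The "metrics that work on either NumPy or dask arrays" idea from point 1 can be sketched with duck typing (this `accuracy_score` is a hypothetical stand-in, not the scikit-learn function, and assumes NumPy is available):

```python
# A metric written against array-library-agnostic operations works on
# NumPy arrays eagerly and on dask arrays lazily; only at the end do
# we compute if the backend supports it.
import numpy as np


def accuracy_score(y_true, y_pred):
    # elementwise equality + mean is valid for both NumPy and dask
    result = (y_true == y_pred).mean()
    # dask arrays expose .compute(); NumPy scalars do not
    return result.compute() if hasattr(result, "compute") else result


print(accuracy_score(np.array([0, 1, 1]), np.array([0, 1, 0])))  # 0.666...
```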

https://gist.github.com/f65705bd1605bee283741c18eef34b10 reliably has failures, but I haven't worked to minimize the example yet.

@mrocklin if I'm still struggling with the distributed issues in an hour or so, could I bother you for a debugging session?

@TomAugspurger
Member Author

Here's the GridSearchCV(Incremental(SGDClassifier)) on an 8GB dataset stored in Zarr, run locally.

https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260

The GridSearch n_jobs is set to 1. We still get some parallelism, I think in loading data and scoring. Each call to the underlying .partial_fit is still sequential, since n_jobs=1.

https://streamable.com/l6k82

@TomAugspurger
Member Author

@jakirkham Once I get the GridSearchCV working with the distributed joblib backend, I'm going to turn to the neuroimaging data.

Do you have any simple scikit-image operations to apply to the dataset that would reduce some of the noise we saw? This would make for a nice recap blogpost from the sprint.

* Added compute to scorers API, make eager by default
* Use fixture to avoid Py2 distributed test hang
@TomAugspurger
Member Author

Some notes on the distributed issue. Running https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260, I see the following transitions for the key that eventually errors

2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - released - waiting
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - waiting - processing
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - processing - memory
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - memory - released
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - released - forgotten
2018-06-06 10:01:45 0465-taugspurger distributed.core[75993] ERROR "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"
KeyError: "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"
2018-06-06 10:01:45 0465-taugspurger distributed.core[75993] ERROR "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"
KeyError: "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"

Full output

2018-06-06 10:01:44 0465-taugspurger distributed.core[76055] INFO Starting established connection

@TomAugspurger
Member Author

The array that astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f is for is a slice of y. It's generated inside sklearn.model_selection._validation, which is inside a joblib.Parallel call. If I persist the collections after they're created (by modifying scikit-learn), then there's no exception.
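The persist workaround described above looks roughly like the following (a sketch assuming dask is installed; the shapes and slice are illustrative, not from the actual scikit-learn code path):

```python
# Persisting a sliced collection pins its results in (possibly
# distributed) memory, so the scheduler keeps the keys alive instead
# of forgetting them -- avoiding the KeyError seen in the logs.
import dask.array as da

y = da.random.random(1000, chunks=100)
y_train = y[:800].persist()  # keys stay resident until released
```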

@TomAugspurger
Member Author

There was a bug in the previous implementation: the scorer was reverting to the underlying estimator's score, because scoring wasn't being passed through in get_params / set_params.

I've fixed things so that https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260 runs successfully, though the larger zarr-based example still fails.
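The bug pattern here is worth spelling out with a plain-Python stand-in (hypothetical class, not the dask-ml implementation): scikit-learn clones estimators by round-tripping get_params into the constructor, so any attribute omitted from get_params is silently dropped on the clone.

```python
# If get_params did NOT report 'scoring', the clone below would be
# built without it and fall back to the estimator's default score.

class WrapperSketch:
    def __init__(self, estimator=None, scoring=None):
        self.estimator = estimator
        self.scoring = scoring

    def get_params(self, deep=True):
        # the fix: report 'scoring' so clones round-trip it
        return {"estimator": self.estimator, "scoring": self.scoring}


original = WrapperSketch(scoring="accuracy")
clone = WrapperSketch(**original.get_params())  # how sklearn.clone works, roughly
print(clone.scoring)  # accuracy
```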

@mrocklin
Member

mrocklin commented Jun 6, 2018 via email

@TomAugspurger
Member Author

No rush at all.

Given that this works fine for datasets that are persisted in (maybe distributed) memory, I'm going to merge this and follow up on the distributed issues later.
