ENH: Make Incremental work in grid search (#196)
Conversation
```python
>>> clf.fit(X, y, classes=[0, 1])
"""
def __init__(self, estimator, **kwargs):
    estimator.set_params(**kwargs)
```

I think that calling estimator.set_params is the least surprising thing here. In general, I don't expect users to be passing additional kwargs here, just scikit-learn.
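A minimal sketch of the constructor behavior under discussion (the class name and wiring here are illustrative, not the actual dask-ml source): extra keyword arguments are treated as hyperparameters and forwarded straight to the wrapped estimator's `set_params`, rather than being stored for fit time.

```python
from sklearn.linear_model import SGDClassifier

class IncrementalSketch:
    """Toy wrapper: **kwargs are hyperparameters of `estimator`."""
    def __init__(self, estimator, **kwargs):
        # Forward kwargs to the wrapped estimator instead of keeping
        # them around as fit-time arguments.
        estimator.set_params(**kwargs)
        self.estimator = estimator

clf = IncrementalSketch(SGDClassifier(), alpha=0.01)
print(clf.estimator.alpha)  # 0.01
```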
```python
def get_params(self, deep=True):
    out = self.estimator.get_params(deep=deep)
    out['estimator'] = self.estimator
```

Awkward API: what if the estimator parameter itself has an `estimator` parameter? Raise an error?
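To make that concern concrete, here is a hedged sketch (simplified, not the real implementation): `get_params` flattens the wrapped estimator's hyperparameters into the wrapper's namespace and then adds an `estimator` key, so a wrapped meta-estimator that itself exposes an `estimator` parameter would be silently shadowed.

```python
from sklearn.linear_model import SGDClassifier

class IncrementalSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def get_params(self, deep=True):
        # Report the wrapped estimator's hyperparameters directly...
        out = self.estimator.get_params(deep=deep)
        # ...then add the estimator itself. If `estimator` already had
        # an 'estimator' hyperparameter, this line would shadow it.
        out['estimator'] = self.estimator
        return out

params = IncrementalSketch(SGDClassifier(alpha=0.01)).get_params()
print(params['alpha'])  # 0.01
print(type(params['estimator']).__name__)  # SGDClassifier
```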
This isn't quite working with the distributed scheduler yet.
Closes dask#195

Summary of changes:

1. `**kwargs` in the `Incremental` constructor now refers to hyperparameters of `estimator`, rather than `fit_kwargs`.
2. `fit_kwargs` are now passed in `fit`.

This *seemed* to work just fine with `sklearn.model_selection.GridSearchCV`. `Incremental.fit` is correctly getting Dask (not NumPy) arrays. Will try it out on a larger test now.
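A hedged sketch of the split just described (the wrapper below is a toy stand-in, not the dask-ml class): constructor kwargs configure the estimator, while fit-time arguments such as `classes` travel through `fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class IncrementalSketch:
    def __init__(self, estimator, **kwargs):
        # Hyperparameters go to the constructor...
        estimator.set_params(**kwargs)
        self.estimator = estimator

    def fit(self, X, y, **fit_kwargs):
        # ...while fit_kwargs (e.g. classes= for partial_fit) are
        # passed at fit time, as in the new API.
        self.estimator.partial_fit(X, y, **fit_kwargs)
        return self

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = IncrementalSketch(SGDClassifier(), alpha=0.01).fit(X, y, classes=[0, 1])
print(clf.estimator.alpha)  # 0.01
```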
* Scorer module
* Additional logging in training
This grew quite a bit, so here's a summary
```
distributed.client - WARNING - Couldn't gather 2 keys, rescheduling {"('getitem-bd5f254a6b4dabd84f4bf11f7b61edd7', 1)": [], "('getitem-bd5f254a6b4dabd84f4bf11f7b61edd7', 0)": []}
distributed.core - ERROR - "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 375, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1355, in update_graph
    ts = self.tasks[key]
KeyError: "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
distributed.core - ERROR - "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 321, in handle_comm
    result = yield result
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1922, in add_client
    yield self.handle_stream(comm=comm, extra={'client': client})
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 375, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1355, in update_graph
    ts = self.tasks[key]
KeyError: "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/client.py", line 1429, in _gather
    st = self.futures[key]
KeyError: "('getitem-4f528eaacb9ddb9447dec5cd09677027', 2)"
```

https://gist.github.com/f65705bd1605bee283741c18eef34b10 reliably has failures, but I haven't worked to minimize the example yet. @mrocklin if I'm still struggling with the distributed issues in an hour or so, could I bother you for a debugging session?
Here's the `GridSearchCV(Incremental(SGDClassifier))` on an 8GB dataset stored in Zarr, run locally: https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260 The GridSearch
@jakirkham Once I get the GridSearchCV working with the distributed joblib backend, I'm going to turn to the neuroimaging data. Do you have any simple scikit-image operations to apply to the dataset that would reduce some of the noise we saw? This would make for a nice recap blogpost from the sprint.
Some notes on the distributed issue. Running https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260, I see the following transitions for the key that eventually errors.
The array that
There was a bug in the previous implementation: the scorer was reverting back to the underlying estimator's score (`scoring` wasn't being passed through in get / set params). I've fixed things so that https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260 runs successfully, though the larger zarr-based example still fails.
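The failure mode can be sketched like this (toy wrapper, illustrative names): scikit-learn's `clone`, which `GridSearchCV` uses internally, rebuilds an estimator from `get_params`, so an attribute like `scoring` survives the rebuild only if it round-trips through `get_params`/`set_params`.

```python
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

class ScoringWrapper:
    """Toy wrapper that round-trips `scoring` through get/set params."""
    def __init__(self, estimator, scoring=None):
        self.estimator = estimator
        self.scoring = scoring

    def get_params(self, deep=True):
        # Including 'scoring' here is what the fix amounts to: clone()
        # reconstructs the object from exactly these parameters.
        return {'estimator': self.estimator, 'scoring': self.scoring}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

w = ScoringWrapper(SGDClassifier(), scoring='accuracy')
w2 = clone(w)  # rebuilt from get_params; scoring is preserved
print(w2.scoring)  # accuracy
```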
I'd be happy to take a look at this, but will unfortunately be inaccessible for most of today. I may have an hour or two in the late afternoon. If this is the case then I'll reach out to you off-list @TomAugspurger. Alternatively I have much more time tomorrow.
No rush at all. Given that this works fine for datasets that are persisted in (maybe distributed) memory, I'm going to merge this and follow up on the distributed issues later.
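For illustration, a minimal sketch of the working configuration described here (the array shapes and names are made up): persist the Dask arrays into memory before handing them to the estimator.

```python
import dask
import dask.array as da

# Toy data standing in for a real dataset; in the failing case this
# would instead be a lazily-loaded (e.g. zarr-backed) array.
X = da.random.random((1000, 10), chunks=(100, 10))
y = (X[:, 0] > 0.5).astype(int)

# Materialize the chunks in (possibly distributed) memory first;
# this is the configuration that currently works.
X, y = dask.persist(X, y)
print(X.shape, y.shape)  # (1000, 10) (1000,)
```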