
ENH: Make Incremental work in grid search#196

Merged
TomAugspurger merged 6 commits into dask:master from TomAugspurger:incremental-params
Jun 6, 2018

Conversation

@TomAugspurger
Member

Closes #195

Summary of changes

  1. `**kwargs` in the `Incremental` constructor now refers to hyperparameters
     of `estimator`, rather than `fit_kwargs`.
  2. `fit_kwargs` are now passed in `fit`.

This seemed to work just fine with sklearn.model_selection.GridSearchCV.
Incremental.fit is correctly getting Dask (not NumPy) arrays.
Will try it out on a larger test now.

    >>> clf.fit(X, y, classes=[0, 1])
    """
    def __init__(self, estimator, **kwargs):
        estimator.set_params(**kwargs)
Member Author


I think that calling estimator.set_params is the least surprising thing here. In general I don't expect users to pass additional kwargs here; mostly it will be scikit-learn doing so.
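The constructor behavior described above can be sketched in isolation. This is a minimal stand-alone illustration, not dask-ml code: `ToyEstimator` and `IncrementalSketch` are hypothetical stand-ins.

```python
# Minimal sketch: constructor kwargs are hyperparameters of the
# wrapped estimator and are forwarded via set_params, rather than
# being stored as fit kwargs.

class ToyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def set_params(self, **params):
        for k, v in params.items():
            setattr(self, k, v)
        return self


class IncrementalSketch:
    def __init__(self, estimator, **kwargs):
        # kwargs configure the estimator; fit kwargs go to fit() instead
        estimator.set_params(**kwargs)
        self.estimator = estimator


clf = IncrementalSketch(ToyEstimator(), alpha=0.01)
print(clf.estimator.alpha)  # 0.01
```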


    def get_params(self, deep=True):
        out = self.estimator.get_params(deep=deep)
        out['estimator'] = self.estimator
Member Author


Awkward API: what if the wrapped estimator itself has an `estimator` parameter? Raise an error?
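The collision worried about here can be seen with plain-Python stand-ins (again hypothetical classes, not dask-ml code): `get_params` writes an `'estimator'` key that would shadow a same-named parameter on the wrapped estimator.

```python
# Sketch of the get_params behavior in the diff above. If
# ToyEstimator also had an 'estimator' hyperparameter, the assignment
# below would silently overwrite it -- the awkward case in question.

class ToyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self, deep=True):
        return {"alpha": self.alpha}


class IncrementalSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def get_params(self, deep=True):
        out = self.estimator.get_params(deep=deep)
        out["estimator"] = self.estimator  # potential key collision
        return out


params = IncrementalSketch(ToyEstimator()).get_params()
print(sorted(params))  # ['alpha', 'estimator']
```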

@TomAugspurger
Member Author

This isn't quite working with the distributed scheduler yet.

Closes dask#195
* Scorer module
* Additional logging in training
@TomAugspurger
Member Author

This grew quite a bit, so here's a summary

  1. ParallelPostFit.score and Incremental.score now use dask-aware scorers. Previously they materialized a single ndarray. This exposed a difficulty with the scikit-learn API that I haven't resolved yet: given an estimator like sklearn.linear_model.SGDClassifier, there's no way to inspect which underlying metric SGDClassifier actually uses. I happen to know that it uses accuracy_score, but I would like to detect that in code, so that we can swap the scorer out for a dask-aware one. Writing metrics that work on either NumPy or dask arrays would also work (and may be doable). I'll think on it more and raise a scikit-learn issue. For now we ask users to provide a scoring parameter to Incremental and ParallelPostFit.
  2. I'm seeing some strange errors with the distributed scheduler that I haven't narrowed down yet. When running GridSearchCV(Incremental(SGDClassifier)) I get exceptions in the scheduler like
distributed.client - WARNING - Couldn't gather 2 keys, rescheduling {"('getitem-bd5f254a6b4dabd84f4bf11f7b61edd7', 1)": [], "('getitem-bd5f254a6b4dabd84f4bf11f7b61edd7', 0)": []}
distributed.core - ERROR - "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 375, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1355, in update_graph
    ts = self.tasks[key]
KeyError: "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
distributed.core - ERROR - "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 321, in handle_comm
    result = yield result
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1922, in add_client
    yield self.handle_stream(comm=comm, extra={'client': client})
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/Users/taugspurger/sandbox/distributed/distributed/core.py", line 375, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/taugspurger/sandbox/distributed/distributed/scheduler.py", line 1355, in update_graph
    ts = self.tasks[key]
KeyError: "('astype-lt-getitem-4f528eaacb9ddb9447dec5cd09677027', 1)"
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/distributed/distributed/client.py", line 1429, in _gather
    st = self.futures[key]
KeyError: "('getitem-4f528eaacb9ddb9447dec5cd09677027', 2)"
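The "metrics that work on either NumPy or dask arrays" idea from point 1 can be sketched with duck typing (this `accuracy_score` is a hypothetical stand-in, not the scikit-learn function, and assumes NumPy is available):

```python
# A metric written against array-library-agnostic operations works on
# NumPy arrays eagerly and on dask arrays lazily; only at the end do
# we compute if the backend supports it.
import numpy as np


def accuracy_score(y_true, y_pred):
    # elementwise equality + mean is valid for both NumPy and dask
    result = (y_true == y_pred).mean()
    # dask arrays expose .compute(); NumPy scalars do not
    return result.compute() if hasattr(result, "compute") else result


print(accuracy_score(np.array([0, 1, 1]), np.array([0, 1, 0])))  # 0.666...
```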

https://gist.github.com/f65705bd1605bee283741c18eef34b10 reliably has failures, but I haven't worked to minimize the example yet.

@mrocklin if I'm still struggling with the distributed issues in an hour or so, could I bother you for a debugging session?

@TomAugspurger
Member Author

Here's the GridSearchCV(Incremental(SGDClassifier)) on an 8GB dataset stored in Zarr, run locally.

https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260

The GridSearch n_jobs is set to 1. We still get some parallelism, I think in loading data and scoring. Each call to the underlying .partial_fit is still sequential, since n_jobs=1.

https://streamable.com/l6k82

@TomAugspurger
Member Author

@jakirkham Once I get the GridSearchCV working with the distributed joblib backend, I'm going to turn to the neuroimaging data.

Do you have any simple scikit-image operations to apply to the dataset that would reduce some of the noise we saw? This would make for a nice recap blogpost from the sprint.

* Added compute to scorers API, make eager by default
* Use fixture to avoid Py2 distributed test hang
@TomAugspurger
Member Author

Some notes on the distributed issue. Running https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260, I see the following transitions for the key that eventually errors

2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - released - waiting
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - waiting - processing
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - processing - memory
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - memory - released
2018-06-06 10:01:45 0465-taugspurger __main__[75993] INFO ('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1) - released - forgotten
2018-06-06 10:01:45 0465-taugspurger distributed.core[75993] ERROR "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"
KeyError: "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"
2018-06-06 10:01:45 0465-taugspurger distributed.core[75993] ERROR "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"
KeyError: "('astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f', 1)"

Full output

2018-06-06 10:01:44 0465-taugspurger distributed.core[76055] INFO Starting established connection

@TomAugspurger
Member Author

The array that astype-lt-getitem-46b51822fff487bc1df0fb2e4d96323f is for is a slice of y. It's generated inside sklearn.model_selection._validation, which is inside a joblib.Parallel call. If I persist the collections after they're created (by modifying scikit-learn), then there's no exception.
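The persist workaround described above looks roughly like the following (a sketch assuming dask is installed; the shapes and slice are illustrative, not from the actual scikit-learn code path):

```python
# Persisting a sliced collection pins its results in (possibly
# distributed) memory, so the scheduler keeps the keys alive instead
# of forgetting them -- avoiding the KeyError seen in the logs.
import dask.array as da

y = da.random.random(1000, chunks=100)
y_train = y[:800].persist()  # keys stay resident until released
```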

@TomAugspurger
Member Author

There was a bug in the previous implementation: the scorer was reverting to the underlying estimator's score, because scoring wasn't being passed through in get_params / set_params.

I've fixed things so that https://gist.github.com/8ef23ae0861037637c03bbdaf56d0260 runs successfully, though the larger zarr-based example still fails.
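The bug pattern here is worth spelling out with a plain-Python stand-in (hypothetical class, not the dask-ml implementation): scikit-learn clones estimators by round-tripping get_params into the constructor, so any attribute omitted from get_params is silently dropped on the clone.

```python
# If get_params did NOT report 'scoring', the clone below would be
# built without it and fall back to the estimator's default score.

class WrapperSketch:
    def __init__(self, estimator=None, scoring=None):
        self.estimator = estimator
        self.scoring = scoring

    def get_params(self, deep=True):
        # the fix: report 'scoring' so clones round-trip it
        return {"estimator": self.estimator, "scoring": self.scoring}


original = WrapperSketch(scoring="accuracy")
clone = WrapperSketch(**original.get_params())  # how sklearn.clone works, roughly
print(clone.scoring)  # accuracy
```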

@mrocklin
Member

mrocklin commented Jun 6, 2018 via email

@TomAugspurger
Member Author

No rush at all.

Given that this works fine for datasets that are persisted in (maybe distributed) memory, I'm going to merge this and follow up on the distributed issues later.
