dask-searchcv does build one graph statically; with any adaptive algorithm, we will have to build graphs dynamically. Any adaptive algorithm requires two functions, as described in dask/dask-ml#161 (comment). That is, the general framework (regardless of where it goes) will be something like

```python
alg = Hyperband()
models, losses = [], None
while True:
    configs = alg.get_configs(models, losses)
    models = [clone(base_model).set_params(**config) for config in configs]
    losses = [delayed(validation_loss)(model) for model in models]
    models = alg.filter(models, losses)
    if alg.stop:
        return alg.best_model, alg.best_config
```

I think this would belong in … Right now, …
I'm not sure how difficult it will be to integrate adaptive algorithms into dask-searchcv. I think I remember someone proposing to include this algorithm in dask-ml without any other integration into sklearn pipelines. This integration barrier exists for all adaptive algorithms (i.e., the other algorithm choices in dask/dask-ml#161). I'm inclined to solve it once with an adaptive-algorithm base class, then have any other algorithm (including Hyperband) inherit from that base class.
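To make the base-class idea concrete: a minimal sketch of what such an adaptive-search base class could look like, following the `get_configs` / `filter` / `stop` loop sketched earlier in this thread. All names here (`AdaptiveSearch`, `KeepHalf`) are hypothetical, not an existing dask-ml API.

```python
import random


class AdaptiveSearch:
    """Hypothetical base class for adaptive hyperparameter searches.

    Subclasses (Hyperband, successive halving, ...) implement the hooks;
    the driver loop and any dask integration would be written once here.
    """

    def get_configs(self, models, losses):
        """Return the next batch of parameter configurations to try."""
        raise NotImplementedError

    def filter(self, models, losses):
        """Drop the models the algorithm no longer wants to train."""
        raise NotImplementedError

    @property
    def stop(self):
        raise NotImplementedError


class KeepHalf(AdaptiveSearch):
    """Toy subclass: sample random configs, keep the better half each round."""

    def __init__(self, n_configs=4, n_rounds=2):
        self.n_configs = n_configs
        self.n_rounds = n_rounds
        self.rounds = 0

    def get_configs(self, models, losses):
        self.rounds += 1
        return [{"lr": random.uniform(0, 1)} for _ in range(self.n_configs)]

    def filter(self, models, losses):
        # Sort model indices by loss (lower is better), keep the best half.
        order = sorted(range(len(models)), key=lambda i: losses[i])
        return [models[i] for i in order[: len(models) // 2]]

    @property
    def stop(self):
        return self.rounds >= self.n_rounds
```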
Right now I would worry less about the class / inheritance structure. If we see an opportunity for code reuse we can merge things later. I do care a bit more about following the scikit-learn API: specifying hyperparameters as parameters to the class, and passing …
On `def accuracy(x, y):`, TomAugspurger left a comment:
(reading through the paper now to comment on the actual algorithm, but is the learning entirely synchronous?)
On these lines:

```python
class Hyperband:
    def __init__(self, params, model, R, X, y, eta=3):
```
So IIUC, we could restructure this as

```python
def __init__(self, estimator, parameters, R, eta=3):
    ...

def fit(self, X, y=None):
    ...
```

What is `R`? We'll want a more descriptive name.
On these lines:

```python
self.model = model
self.R = R
self.eta = eta
self.best_val_loss = np.inf
```
And if we're following the scikit-learn style (which I think we should), "learned" attributes like this would have a trailing underscore, and wouldn't exist until `.fit` is called.
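A minimal sketch of the convention being described here. `ToyEstimator` is a hypothetical example, not code from this PR: hyperparameters are stored unchanged in `__init__`, while learned attributes get a trailing underscore and are created only inside `fit`.

```python
import numpy as np


class ToyEstimator:
    def __init__(self, eta=3):
        self.eta = eta  # hyperparameter: stored as-is, no underscore

    def fit(self, X, y=None):
        X = np.asarray(X)
        # "Learned" state: trailing underscore, set here rather than in __init__.
        self.n_features_ = X.shape[1]
        self.best_val_loss_ = np.inf
        return self


est = ToyEstimator()
# est.n_features_ does not exist yet; it appears after fit:
est.fit([[0, 1], [2, 3]])
```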
On these lines:

```python
self.classes = np.unique(y)

n, d = X.shape
train, val = train_test_split(range(n))
```
This would be moved to `.fit`.
On these lines:

```python
R, eta = self.R, self.eta
s_max = math.floor(math.log(self.R, self.eta))
B = (s_max + 1) * self.R
for s in reversed(range(s_max + 1)):
```
Python is probably smart about this, but this could also be `range(s_max, -1, -1)`.
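For reference, a worked example of the bracket arithmetic in the snippet above, with illustrative values `R = 100` and `eta = 3` (not values from this PR):

```python
import math

R, eta = 100, 3                        # example values; R = max resource per config
s_max = math.floor(math.log(R, eta))   # floor(log_3 100) == 4
B = (s_max + 1) * R                    # total budget across brackets: 500
brackets = list(range(s_max, -1, -1))  # the bracket indices: [4, 3, 2, 1, 0]

# range(s_max, -1, -1) and reversed(range(s_max + 1)) are equivalent:
assert brackets == list(reversed(range(s_max + 1)))
```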
Regarding parallelization, the authors suggest a few approaches. Do you have a sense of how those would adapt to Dask? One thing we haven't really discussed is what user-API to use for getting the client that should be used for the async stuff. We could …
I should have explained myself better in the original comment. I see two paths forward for this code:

Option 2 will be quicker to implement, but will not integrate well into sklearn pipelines and will require mirroring the sklearn API manually. Option 1 will be slower to implement, if it's possible at all, but will provide better integration and will fit nicely into sklearn pipelines; this is the approach the two other model selection algorithms (grid and random search) use.

I'm inclined to go with option 2: it's easier and is a minimal solution that provides the functionality. A future PR can take this choice and extend it toward option 1, especially if we inherit from an adaptive base class.

I had been deciding between these options, which is why I would have preferred to submit this PR after starting to integrate dask.distributed. This PR is very rough and far from reasonable code, so I have decided to close it until it's more reasonable and ready for review.
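Whichever path is taken, the user-facing API could still mirror scikit-learn's search estimators (constructor takes the search space, `fit(X, y)` runs the search, learned results get trailing underscores). A hypothetical sketch — `HyperbandSearch` and `score_fn` are illustrative stand-ins, not code from this PR, and the exhaustive scoring loop is a placeholder for the real bracket logic:

```python
from itertools import product


class HyperbandSearch:
    """Hypothetical standalone search class mirroring the sklearn API."""

    def __init__(self, score_fn, parameters, R, eta=3):
        # score_fn stands in for "train this estimator, return a score";
        # a real implementation would take an estimator instead.
        self.score_fn = score_fn
        self.parameters = parameters  # dict of lists, as in GridSearchCV
        self.R = R
        self.eta = eta

    def fit(self, X, y=None):
        # Placeholder for the bracket logic: score every configuration once,
        # just to show where the learned attributes are set.
        keys = sorted(self.parameters)
        best = None
        for values in product(*(self.parameters[k] for k in keys)):
            params = dict(zip(keys, values))
            score = self.score_fn(params, X, y)
            if best is None or score > best[0]:
                best = (score, params)
        self.best_score_, self.best_params_ = best
        return self
```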
I think this is the right algorithm for dask – most of this algorithm is embarrassingly parallel, and in section 6.1 they mention "a more sophisticated job priority queue must be managed". Most of the benefits would come from implementing this diff:

```diff
- models = {k: train(model, ...) for k, model in models.items()}
+ futures = {k: client.submit(train, model, ...) for k, model in models.items()}
+ models = client.gather(futures)
```

We can likewise have another diff to launch the successive halving brackets simultaneously:

```python
def fit(self):  # the main hyperband alg
    # ...
    args = [...]
    futures = client.map(self.successive_halving, args)
    client.gather(futures)
    return self.best_config, self.best_model
```

This takes care of "distribute individual bracket" and "parallelize a single bracket".
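The `client.submit` / `client.gather` shape in that diff is the standard futures pattern; dask's `Client` API deliberately mirrors `concurrent.futures`, so the same structure can be sketched with the stdlib. The `train` function below is a toy stand-in, not the PR's:

```python
from concurrent.futures import ThreadPoolExecutor


def train(model, n_iter):
    # Toy stand-in for partial training: just accumulate iterations.
    return model + n_iter


models = {"a": 0, "b": 10}
with ThreadPoolExecutor() as pool:
    # Same shape as: futures = {k: client.submit(train, model, ...) ...}
    futures = {k: pool.submit(train, m, 5) for k, m in models.items()}
    # Same shape as: models = client.gather(futures)
    models = {k: f.result() for k, f in futures.items()}
```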
Sounds good. LMK if you want additional comments on the implementation here, or if you think the future PR will be different enough that it's not worthwhile.
Let me integrate with dask.distributed, then ask for comments (I think the updated PR will be different enough). I'll also try to improve the user-facing API more, now that I've decided not to inherit from …

I think this can be made to work with sklearn pipelines, as long as exactly one model supports …
This implements Hyperband, a hyperparameter optimization algorithm. I describe it more in dask/dask-ml#161 (comment). This algorithm is adaptive: it has to make choices based on previous evaluations.

This code works (though it is very much a WIP). Each "bracket" evaluates many different models, then kills off about two-thirds of them, keeping the best performers. Here's a graph of the s=3 bracket:

[graph of the s=3 bracket omitted]

We spend more energy on better-performing models.
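The "kill off about two-thirds" step is successive halving: with `eta = 3`, each round keeps the best `1/eta` of the configurations and triples the training budget. A toy sketch under those assumptions — `evaluate(config, budget)` is a hypothetical stand-in for partial training plus a validation loss, not this PR's code:

```python
def successive_halving(configs, evaluate, r, eta=3):
    """Toy sketch of one bracket: score every config on the current budget,
    keep the best 1/eta fraction (dropping ~2/3rds when eta=3), then repeat
    with an eta-times larger budget until one config remains."""
    budget = r
    while len(configs) > 1:
        losses = [evaluate(config, budget) for config in configs]
        order = sorted(range(len(configs)), key=lambda i: losses[i])
        keep = max(1, len(configs) // eta)
        configs = [configs[i] for i in order[:keep]]
        budget *= eta
    return configs[0]
```

For example, 9 configurations with `eta = 3` shrink to 3 after one round and to a single survivor after two.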
This is a work in progress. I am opening this PR to get thoughts/ideas for integration of adaptive algorithms in dask-searchcv.