This repository was archived by the owner on Oct 14, 2018. It is now read-only.

WIP: ENH: implement hyperband#71

Closed
stsievert wants to merge 1 commit into dask:master from stsievert:hyperband

Conversation

@stsievert
Member

This implements Hyperband, a hyperparameter optimization algorithm. I describe it more in dask/dask-ml#161 (comment). This algorithm is adaptive: it makes choices based on previous evaluations.

This code works, though it's very much a WIP. Each "bracket" evaluates many different models, then kills off about two thirds of them, keeping the best performers. Here's a graph of the s=3 bracket:

[screenshot, 2018-05-24: graph of model evaluations being pruned across rounds of the s=3 bracket]

We spend more energy on better performing models.
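The pruning described above can be sketched as follows. This is a toy sketch of one successive-halving bracket, not the dask-searchcv API: `train` and `score` are placeholder callables, and configs here are plain hashable values.

```python
def successive_halving(configs, train, score, r0=1, eta=3):
    # Toy sketch of one bracket: train every candidate with a small
    # resource budget, keep the best ~1/eta of them, and repeat with
    # eta times more resource for the survivors.
    r = r0
    while len(configs) > 1:
        scores = {c: score(train(c, resource=r)) for c in configs}
        k = max(1, len(configs) // eta)   # keep roughly a third for eta=3
        configs = sorted(configs, key=scores.get, reverse=True)[:k]
        r *= eta                          # survivors get more resource
    return configs[0]
```

With eta=3, a bracket of 9 candidates shrinks to 3, then to 1, which is the "kill off about two thirds" behavior in the graph above.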

This is a work in progress. I am opening this PR to get thoughts/ideas for integration of adaptive algorithms in dask-searchcv.

@stsievert
Member Author

dask-searchcv does build one graph statically:

dsk, keys, n_splits = build_graph(estimator, self.cv, self.scorer_,
With any adaptive function, we will have to build graphs dynamically.

Any adaptive algorithm requires two functions, as described in dask/dask-ml#161 (comment). That is, the general framework (regardless of where it lives) will be something like

alg = Hyperband()

models, losses = [], None  # no information before the first round
while True:
    configs = alg.get_configs(models, losses)
    models = [clone(model).set_params(**config) for config in configs]
    losses = [delayed(validation_loss)(model) for model in models]
    models = alg.filter(models, losses)
    if alg.stop:
        return alg.best_model, alg.best_config

I think this would belong in fit (or should call fit as a subroutine, maybe moving it to _fit).

Right now, RandomizedSearchCV and GridSearchCV only provide alg.get_configs through _get_param_iterator.
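As an illustration of that interface (names hypothetical, not dask-searchcv code), a non-adaptive search fits the same `get_configs` shape by simply ignoring the feedback arguments, which is essentially what `_get_param_iterator` supplies to random search:

```python
import random

class RandomConfigs:
    """Hypothetical adapter: random search expressed as an 'adaptive'
    algorithm whose get_configs ignores past models and losses."""

    def __init__(self, param_lists, n_iter, seed=None):
        self.param_lists = param_lists  # e.g. {"alpha": [0.1, 1.0, 10.0]}
        self.n_iter = n_iter
        self.rng = random.Random(seed)

    def get_configs(self, models=None, losses=None):
        # models/losses are ignored: random search is not adaptive
        return [{k: self.rng.choice(v) for k, v in self.param_lists.items()}
                for _ in range(self.n_iter)]
```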

@stsievert
Member Author

I'm not sure how difficult it will be to integrate adaptive algorithms into dask-searchcv. I think I remember someone proposing to include this algorithm in dask-ml without any other integration into sklearn pipelines.

This integration barrier exists for all adaptive algorithms (i.e., other algorithm choices in dask/dask-ml#161). I'm inclined to solve it once for some adaptive algorithm base class, then have any other algorithm (including Hyperband) inherit from that base class.

cc @mrocklin @TomAugspurger

@TomAugspurger
Member

Right now I would worry less about the class / inheritance structure. If we see an opportunity for code reuse we can merge things later.

I do care a bit more about following the scikit-learn API of specifying hyperparameters as parameters to the class, and passing X and y to the fit.

from algs import Hyperband


def accuracy(x, y):
Member


FYI: implementing this in dask/dask-ml#177

Member

@TomAugspurger TomAugspurger left a comment


(reading through the paper now to comment on the actual algorithm, but is the learning entirely synchronous?)



class Hyperband:
    def __init__(self, params, model, R, X, y, eta=3):
Member


So IIUC, we could restructure this as

def __init__(self, estimator, parameters, R, eta=3):
    ...
def fit(self, X, y=None):
    ...

What is R? We'll want a more descriptive name.

self.model = model
self.R = R
self.eta = eta
self.best_val_loss = np.inf
Member


And if we're following the scikit-learn style (which I think we should), "learned" attributes like this would have a trailing underscore, and wouldn't exist till .fit is called.
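A minimal illustration of that convention (a toy class, not part of this PR): hyperparameters are stored unchanged in `__init__`, and learned attributes get a trailing underscore and only exist after `fit` has been called.

```python
class ToyEstimator:
    def __init__(self, eta=3):
        self.eta = eta  # hyperparameter: stored as-is in __init__

    def fit(self, X, y=None):
        # "learned" attributes: trailing underscore, created only here
        self.n_features_ = len(X[0])
        self.best_val_loss_ = float("inf")
        return self
```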

self.classes = np.unique(y)

n, d = X.shape
train, val = train_test_split(range(n))
Member


this would be moved to .fit

R, eta = self.R, self.eta
s_max = math.floor(math.log(self.R, self.eta))
B = (s_max + 1) * self.R
for s in reversed(range(s_max + 1)):
Member


Python is probably smart about this, but this could also be range(s_max, -1, -1)
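The two spellings produce the same sequence:

```python
s_max = 4
a = list(reversed(range(s_max + 1)))  # reversed iterator over range(5)
b = list(range(s_max, -1, -1))        # a single descending range
# both give [4, 3, 2, 1, 0]
```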

@TomAugspurger
Member

Regarding parallelization, the authors suggest a few:

  • distribute individual brackets of SuccessiveHalving to different machines. This can be done asynchronously and as machines free up, new brackets can be launched with a different set of arms.
  • parallelize a single bracket so that each round of SuccessiveHalving runs faster.
  • one could also parallelize each individual execution of R.

Do you have a sense of how that would adapt to Dask?

One thing we haven't really discussed is what user-API to use for getting the client that should be used for the async stuff. We could

  1. Rely on a global Client with distributed.get_client()
  2. Accept a client parameter to classifiers that need it. This could be either a Client or a string representing the scheduler-address. I suspect that the string would be preferred, since then the classifier is serializable.
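A sketch of how an estimator might resolve either option (`resolve_client` is a hypothetical helper, not existing API; imports are deferred into the branches so the fallback logic is visible):

```python
def resolve_client(client_or_address=None):
    # Hypothetical helper sketching the two options above.
    if client_or_address is None:
        from distributed import get_client
        return get_client()               # option 1: rely on the global Client
    if isinstance(client_or_address, str):
        from distributed import Client
        return Client(client_or_address)  # option 2: scheduler-address string
    return client_or_address              # already a Client instance
```

Accepting the address string keeps the classifier serializable, since a live `Client` never needs to be pickled with it.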

@stsievert
Member Author

Right now I would worry less about the class / inheritance structure. If we see an opportunity for code reuse we can merge things later.

I should have explained myself better in the original comment. I see two paths forward for this code:

  1. integrate into dask-searchcv by inheriting from the base class (DaskBaseSearchCV)
  2. adapt this code to Dask directly, relying on the distributed scheduler

Option 2 will be quicker to implement, but will not integrate well into sklearn pipelines and will require mirroring the sklearn API manually.

Option 1 will be slower to implement if it's possible, but will provide better integration and will fit nicely into sklearn pipelines. This is the approach that the two other model selection algorithms (grid and random search) use.

I'm inclined to go with option 2. It's easier and is a minimal solution to provide the functionality. A future PR can take this choice and extend it toward option 1, especially if we introduce a DaskAdaptiveSearchCV base class to inherit from.

I had been deciding between these options, and I should have submitted this PR after I started integrating dask.distributed. This PR is very rough and far from reasonable code, so I've decided to close it until it's more polished and ready for review.

Do you have a sense of how that would adapt to Dask?

I think this is the right algorithm for dask – most of this algorithm is embarrassingly parallel, and in section 6.1 they mention "a more sophisticated job priority queue must be managed". Most of the benefits would come from implementing this diff:

- models = {k: train(model, ...) for k, model in models.items()}
+ futures = {k: client.submit(train, model, ...) for k, model in models.items()}
+ models = client.gather(futures)
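The same submit/gather pattern can be shown with the standard-library executor; with dask, `client.submit`/`client.gather` play these roles across the cluster (`train` here is a stand-in, not real training code):

```python
from concurrent.futures import ThreadPoolExecutor

def train(model, n_steps):
    return model + n_steps  # stand-in for a few partial_fit steps

models = {"a": 0, "b": 10}
with ThreadPoolExecutor() as pool:
    # submit returns futures immediately; result() gathers them
    futures = {k: pool.submit(train, m, 3) for k, m in models.items()}
    models = {k: f.result() for k, f in futures.items()}
# models is now {"a": 3, "b": 13}
```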

We can likewise have another diff to launch the successive halving brackets simultaneously:

def fit():  # the main hyperband alg
    # ...
    args = [...]
    client.map(self.successive_halving, args)
    return self.best_config, self.best_model

This takes care of "distribute individual brackets" and "parallelize a single bracket".

@stsievert stsievert closed this May 26, 2018
@TomAugspurger
Member

Sounds good. LMK if you want additional comments on the implementation here, or if you think the future PR will be different enough that it's not worthwhile.

@stsievert
Member Author

stsievert commented May 27, 2018

Let me integrate with dask.distributed, then ask for comments (I think the updated PR will be different enough). I'll also try to improve the user-facing API more, now that I've decided not to inherit from DaskBaseSearchCV. I'll probably open another PR – this algorithm could use a better introduction.

I think this can be made to work with sklearn pipelines, as long as exactly one model supports partial_fit.
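A sketch of that constraint, using toy stand-ins rather than real sklearn classes: given a pipeline-like list of `(name, estimator)` pairs, exactly one step should support incremental training via `partial_fit`.

```python
def adaptive_steps(steps):
    # Names of the steps that can be trained incrementally.
    return [name for name, est in steps if hasattr(est, "partial_fit")]

class Scaler:            # toy transformer: fit only
    def fit(self, X, y=None):
        return self

class SGDLike:           # toy model: supports incremental training
    def partial_fit(self, X, y=None):
        return self

steps = [("scale", Scaler()), ("sgd", SGDLike())]
# Hyperband can drive this pipeline iff exactly one step is adaptive:
assert len(adaptive_steps(steps)) == 1
```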

@stsievert stsievert mentioned this pull request Jun 3, 2018
