dask-searchcv does build one graph statically; with any adaptive algorithm, we will have to build graphs dynamically. Any adaptive algorithm requires two functions, as described in dask/dask-ml#161 (comment). That is, the general framework (regardless of where it goes) will be something like

```python
alg = Hyperband()
models, losses = [], None
while True:
    configs = alg.get_configs(models, losses)
    models = [clone(base_model).set_params(**config) for config in configs]
    losses = [delayed(validation_loss)(model) for model in models]
    models = alg.filter(models, losses)
    if alg.stop:
        return alg.best_model, alg.best_config
```

I think this would belong in … Right now, …
I'm not sure how difficult it will be to integrate adaptive algorithms into dask-searchcv. I think I remember someone proposing to include this algorithm in dask-ml without any other integration into sklearn pipelines. This integration barrier exists for all adaptive algorithms (i.e., the other algorithm choices in dask/dask-ml#161). I'm inclined to solve it once with an adaptive-algorithm base class, then have any other algorithm (including Hyperband) inherit from that base class.
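To make the base-class idea concrete: a minimal sketch of what such an adaptive-search base class could look like, following the `get_configs` / `filter` / `stop` loop sketched earlier in this thread. All names here (`AdaptiveSearch`, `KeepHalf`) are hypothetical, not an existing dask-ml API.

```python
import random


class AdaptiveSearch:
    """Hypothetical base class for adaptive hyperparameter searches.

    Subclasses (Hyperband, successive halving, ...) implement the hooks;
    the driver loop and any dask integration would be written once here.
    """

    def get_configs(self, models, losses):
        """Return the next batch of parameter configurations to try."""
        raise NotImplementedError

    def filter(self, models, losses):
        """Drop the models the algorithm no longer wants to train."""
        raise NotImplementedError

    @property
    def stop(self):
        raise NotImplementedError


class KeepHalf(AdaptiveSearch):
    """Toy subclass: sample random configs, keep the better half each round."""

    def __init__(self, n_configs=4, n_rounds=2):
        self.n_configs = n_configs
        self.n_rounds = n_rounds
        self.rounds = 0

    def get_configs(self, models, losses):
        self.rounds += 1
        return [{"lr": random.uniform(0, 1)} for _ in range(self.n_configs)]

    def filter(self, models, losses):
        # Sort model indices by loss (lower is better), keep the best half.
        order = sorted(range(len(models)), key=lambda i: losses[i])
        return [models[i] for i in order[: len(models) // 2]]

    @property
    def stop(self):
        return self.rounds >= self.n_rounds
```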
Right now I would worry less about the class / inheritance structure. If we see an opportunity for code reuse we can merge things later. I do care a bit more about following the scikit-learn API: specifying hyperparameters as parameters to the class, and passing …
On `def accuracy(x, y):`, TomAugspurger left a comment:
(reading through the paper now to comment on the actual algorithm, but is the learning entirely synchronous?)
On these lines:

```python
class Hyperband:
    def __init__(self, params, model, R, X, y, eta=3):
```
So IIUC, we could restructure this as

```python
def __init__(self, estimator, parameters, R, eta=3):
    ...

def fit(self, X, y=None):
    ...
```

What is `R`? We'll want a more descriptive name.
On these lines:

```python
self.model = model
self.R = R
self.eta = eta
self.best_val_loss = np.inf
```
And if we're following the scikit-learn style (which I think we should), "learned" attributes like this would have a trailing underscore, and wouldn't exist until `.fit` is called.
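A minimal sketch of the convention being described here. `ToyEstimator` is a hypothetical example, not code from this PR: hyperparameters are stored unchanged in `__init__`, while learned attributes get a trailing underscore and are created only inside `fit`.

```python
import numpy as np


class ToyEstimator:
    def __init__(self, eta=3):
        self.eta = eta  # hyperparameter: stored as-is, no underscore

    def fit(self, X, y=None):
        X = np.asarray(X)
        # "Learned" state: trailing underscore, set here rather than in __init__.
        self.n_features_ = X.shape[1]
        self.best_val_loss_ = np.inf
        return self


est = ToyEstimator()
# est.n_features_ does not exist yet; it appears after fit:
est.fit([[0, 1], [2, 3]])
```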
On these lines:

```python
self.classes = np.unique(y)

n, d = X.shape
train, val = train_test_split(range(n))
```
This would be moved to `.fit`.
On these lines:

```python
R, eta = self.R, self.eta
s_max = math.floor(math.log(self.R, self.eta))
B = (s_max + 1) * self.R
for s in reversed(range(s_max + 1)):
```
Python is probably smart about this, but this could also be `range(s_max, -1, -1)`.
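For reference, a worked example of the bracket arithmetic in the snippet above, with illustrative values `R = 100` and `eta = 3` (not values from this PR):

```python
import math

R, eta = 100, 3                        # example values; R = max resource per config
s_max = math.floor(math.log(R, eta))   # floor(log_3 100) == 4
B = (s_max + 1) * R                    # total budget across brackets: 500
brackets = list(range(s_max, -1, -1))  # the bracket indices: [4, 3, 2, 1, 0]

# range(s_max, -1, -1) and reversed(range(s_max + 1)) are equivalent:
assert brackets == list(reversed(range(s_max + 1)))
```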
Regarding parallelization, the authors suggest a few approaches. Do you have a sense of how those would adapt to Dask? One thing we haven't really discussed is what user-API to use for getting the client that should be used for the async stuff. We could …
I should have explained myself better in the original comment. I see two paths forward for this code:

Option 2 will be quicker to implement, but will not integrate well into sklearn pipelines and will require mirroring the sklearn API manually. Option 1 will be slower to implement, if it's possible at all, but will provide better integration and will fit nicely into sklearn pipelines; this is the approach the two other model selection algorithms (grid and random search) use.

I'm inclined to go with option 2: it's easier and is a minimal solution that provides the functionality. A future PR can take this choice and extend it toward option 1, especially if we inherit from an adaptive base class.

I had been deciding between these options, which is why I would have preferred to submit this PR after starting to integrate dask.distributed. This PR is very rough and far from reasonable code, so I have decided to close it until it's more reasonable and ready for review.
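Whichever path is taken, the user-facing API could still mirror scikit-learn's search estimators (constructor takes the search space, `fit(X, y)` runs the search, learned results get trailing underscores). A hypothetical sketch — `HyperbandSearch` and `score_fn` are illustrative stand-ins, not code from this PR, and the exhaustive scoring loop is a placeholder for the real bracket logic:

```python
from itertools import product


class HyperbandSearch:
    """Hypothetical standalone search class mirroring the sklearn API."""

    def __init__(self, score_fn, parameters, R, eta=3):
        # score_fn stands in for "train this estimator, return a score";
        # a real implementation would take an estimator instead.
        self.score_fn = score_fn
        self.parameters = parameters  # dict of lists, as in GridSearchCV
        self.R = R
        self.eta = eta

    def fit(self, X, y=None):
        # Placeholder for the bracket logic: score every configuration once,
        # just to show where the learned attributes are set.
        keys = sorted(self.parameters)
        best = None
        for values in product(*(self.parameters[k] for k in keys)):
            params = dict(zip(keys, values))
            score = self.score_fn(params, X, y)
            if best is None or score > best[0]:
                best = (score, params)
        self.best_score_, self.best_params_ = best
        return self
```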
I think this is the right algorithm for dask – most of this algorithm is embarrassingly parallel, and in section 6.1 they mention "a more sophisticated job priority queue must be managed". Most of the benefits would come from implementing this diff:

```diff
- models = {k: train(model, ...) for k, model in models.items()}
+ futures = {k: client.submit(train, model, ...) for k, model in models.items()}
+ models = client.gather(futures)
```

We can likewise have another diff to launch the successive halving brackets simultaneously:

```python
def fit(self):  # the main hyperband alg
    # ...
    args = [...]
    futures = client.map(self.successive_halving, args)
    client.gather(futures)
    return self.best_config, self.best_model
```

This takes care of "distribute individual bracket" and "parallelize a single bracket".
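The `client.submit` / `client.gather` shape in that diff is the standard futures pattern; dask's `Client` API deliberately mirrors `concurrent.futures`, so the same structure can be sketched with the stdlib. The `train` function below is a toy stand-in, not the PR's:

```python
from concurrent.futures import ThreadPoolExecutor


def train(model, n_iter):
    # Toy stand-in for partial training: just accumulate iterations.
    return model + n_iter


models = {"a": 0, "b": 10}
with ThreadPoolExecutor() as pool:
    # Same shape as: futures = {k: client.submit(train, model, ...) ...}
    futures = {k: pool.submit(train, m, 5) for k, m in models.items()}
    # Same shape as: models = client.gather(futures)
    models = {k: f.result() for k, f in futures.items()}
```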
Sounds good. LMK if you want additional comments on the implementation here, or if you think the future PR will be different enough that it's not worthwhile.
Let me integrate with dask.distributed, then ask for comments (I think the updated PR will be different enough). I'll also try to improve the user-facing API more, now that I've decided not to inherit from …

I think this can be made to work with sklearn pipelines, as long as exactly one model supports …
This implements Hyperband, a hyperparameter optimization algorithm. I describe it more in dask/dask-ml#161 (comment). This algorithm is adaptive: it has to make choices based on previous evaluations.

This code works (though it is very much a WIP). Each "bracket" evaluates many different models, then kills off about two-thirds of them, keeping the best performers. Here's a graph of the s=3 bracket:

[graph of the s=3 bracket omitted]

We spend more energy on better-performing models.
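The "kill off about two-thirds" step is successive halving: with `eta = 3`, each round keeps the best `1/eta` of the configurations and triples the training budget. A toy sketch under those assumptions — `evaluate(config, budget)` is a hypothetical stand-in for partial training plus a validation loss, not this PR's code:

```python
def successive_halving(configs, evaluate, r, eta=3):
    """Toy sketch of one bracket: score every config on the current budget,
    keep the best 1/eta fraction (dropping ~2/3rds when eta=3), then repeat
    with an eta-times larger budget until one config remains."""
    budget = r
    while len(configs) > 1:
        losses = [evaluate(config, budget) for config in configs]
        order = sorted(range(len(configs)), key=lambda i: losses[i])
        keep = max(1, len(configs) // eta)
        configs = [configs[i] for i in order[:keep]]
        budget *= eta
    return configs[0]
```

For example, 9 configurations with `eta = 3` shrink to 3 after one round and to a single survivor after two.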
This is a work in progress. I am opening this PR to get thoughts/ideas for integration of adaptive algorithms in dask-searchcv.