Base class for estimators using _incremental.fit #356

TomAugspurger merged 44 commits into dask:master from
Conversation
```
* _get_ids_and_args : Build the dict of
  ``{model_id : (param_list, additional_calls)}``
  that are eventually passed to `fit`
* _get_cv_results : Build the dict containing the CV results.
```
I would expect this to be shared among implementations. I think that the history output of the fit function should have all of the information here. Are there choices that are likely to change between implementations?
Yes, I think so. Initially, the structure of my IncrementalRandomizedSearchCV differed from Hyperband, as my results were simply results = fit(...), and not results = {model_id: fit(...) for ...}.
Now that both have the consistent results type of {model_id: (info, models, history)}, more of this should be sharable.
same applies to #356 (comment)
In my implementation of _get_cv_results, the bracket key (model_id here I think) was only required to make it so the model_ids were unique across brackets. This was important for downstream processing; I wanted unique model IDs for each bracket, not shared (and isn't the cleanest code).
This solution can be used with the "consistent results type of `{model_id: (info, models, history)}`". The result I came up with at _hyperband.py#L315-L318 is something like

```python
for loop_id, (info, models, hist) in results.items():  # results defined in comment above
    for h in hist:
        h["model_id"] = "loop{}-{}".format(loop_id, h["model_id"])
```
```
* _get_ids_and_args : Build the dict of
  ``{model_id : (param_list, additional_calls)}``
  that are eventually passed to `fit`
```
I would expect the additional_calls function to be a method of this class that developers would override.
I think that we probably don't need users to define model_ids. The fit function will do this for us.
I do think that we need a way to decide initial parameter lists. This might be an overrideable method.
@mrocklin does _get_ids_and_args make sense now? Initially the building of param_lists and additional_calls was two separate methods, but it seems like they may be dependent on each other. So I figured having a single method to override the things passed into fit made the most sense.
This is the only mandatory method that has to be implemented by subclasses now.
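For context, the shape being suggested here, and roughly where the draft later in this thread ends up, would be something like the sketch below. The method and attribute names (`_get_params`, `_additional_calls`, `self.parameters`, `self.n_iter`) are taken from that later draft; this is only an illustration, not the code under review.

```python
from sklearn.model_selection import ParameterSampler


class IncrementalRandomizedSearchCV(BaseIncrementalSearchCV):
    # BaseIncrementalSearchCV is the base class proposed in this PR; this
    # sketch just shows the two hooks a subclass would override.

    def _get_params(self):
        # decide the initial parameter list: random draws from the distributions
        return ParameterSampler(self.parameters, self.n_iter, self.random_state)

    def _additional_calls(self, info):
        # per model_id, how many more partial_fit calls to schedule
        # (0 means "stop training this model")
        return {ident: 1 for ident in info}
```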
```
  that are eventually passed to `fit`
* _get_cv_results : Build the dict containing the CV results.
* _get_best_estimator : Select the best estimator from the set of
  fitted estimators
```
I think that this should be easy to make a default for. I would pass over all of the models in the resulting `models` dict and then pick the one with the best score coming out of `info`.
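A default along those lines might be as simple as the following sketch. It assumes `info` maps model_id to a list of records carrying a `"score"` key (as elsewhere in this thread), and it glosses over the fact that the entries of `models` may be futures that still need gathering.

```python
def _get_best_estimator(self, info, models):
    # pick the model whose most recent score is highest, then look it up
    # in the models dict returned by fit
    best_id = max(info, key=lambda ident: info[ident][-1]["score"])
    return models[best_id]
```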
```
* _get_best_estimator : Select the best estimator from the set of
  fitted estimators
* _update_results : Update self with attributes from the fitted
  estimators.
```
I would also expect this to be somewhat standardized among different implementations. What does this need to do?
I'm not really sure. For Hyperband, it's something like

```python
self.history_ = sum([SHA._history for SHA in SHAs.values()], [])
self.metadata_ = {
    "models": sum(m["models"] for m in meta),
    "partial_fit_calls": sum(m["partial_fit_calls"] for m in meta),
    "brackets": meta,
}
```

This could probably have a default implementation that extracts the results that will likely be useful to all subclasses (I'm just not sure what those items are at the moment).
I think that my objective here is for users not to have to write too much to make things work. Some of these hooks should probably be optional.
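One possible default, sketched from the Hyperband snippet above and assuming the single-`fit` case where the results unpack to `(info, models, history)`; exactly which attributes belong here is still open.

```python
def _update_results(self, results):
    # stash the combined history plus a couple of summary counts; anything
    # strategy-specific (e.g. Hyperband's brackets) stays in the subclass
    info, models, history = results
    self.history_ = list(history)
    self.metadata_ = {
        "models": len(models),
        "partial_fit_calls": sum(len(records) for records in info.values()),
    }
```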
h["model_id"] = "loop{}-{}".format(loop_id, h["model_id"])| return self | ||
|
|
||
|
|
||
| class IncrementalRandomizedSearchCV(BaseIncrementalSearchCV): |
There are two parts to this class: the method of choosing parameters, and the incremental/adaptive method implemented (currently remove_worst). I'd rather the name reflect the incremental/adaptive component: maybe something like KillWorstModelCV?
I think that this is only up here as a strawman to see what things look like when extending the class. I don't think that the remove_worst function is a particularly pragmatic approach. It was just something simple that was small enough for me to put into a docstring.
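For readers following along, the strawman is roughly this kind of thing (a sketch, not the exact docstring code): each round, stop the single worst-scoring model and give every other model one more `partial_fit` call. It relies only on the `additional_calls` contract used elsewhere in this thread, where the function maps model_id to the number of further calls to schedule and 0 means "stop".

```python
def remove_worst(info):
    # info maps model_id -> list of score records; returning 0 for a model
    # tells fit() to stop training it
    if len(info) <= 1:
        return {ident: 0 for ident in info}  # one model left: we're done
    worst = min(info, key=lambda ident: info[ident][-1]["score"])
    return {ident: 0 if ident == worst else 1 for ident in info}
```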
I think this is reasonably close to being ready. May require some tuning tomorrow once hyperband is ready. As a preview, https://github.com/TomAugspurger/dask-ml/blob/stsievert-hyperband-2/dask_ml/model_selection/_hyperband.py has the trimmed-down implementation on top of this. Once we're good here, I'll make a PR against your branch @stsievert from https://github.com/TomAugspurger/dask-ml/tree/stsievert-hyperband-2, and then we should be close to ready. I still have a few failing tests to fix up.
```
* additional_fit_calls : Union[Callable, object]
    Objects with a ``fit`` method have the bound
    ``additional_fit_calls.fit`` method passed through to `fit`.
    (Other) callables are simply passed through.
```
I would design things so that the IncrementalRandomizedSearchCV class has remove_worst as a method, rather than as a return value of _get_ids_and_args. I think that the remove_worst method is likely to be the main point of variance between different classes, and so is something that people will probably expect to specify at the class level.
(also, just to be clear, we shouldn't actually publish a class with the remove_worst function as a genuine solution. It's not a good solution)
I also don't understand why this returns a dictionary.
```python
        random_state=self.random_state,
    )
    for model_id, (param_list, additional_calls) in ids_and_args.items()
}
```
The Hyperband implementation is weird in that they'll want to call fit several times rather than just once (which I think will be more common). Additionally, Hyperband will want to call fit many times concurrently, so we'll need to use coroutines for this.
Short term I suggest that we implement classes for fitting one model, RandomSearch, and SuccessiveHalving using the functions found in this notebook. Then we look at Hyperband as a special case that has to call fit a few times and see what extra methods might have to be pulled out.
I'm happy to take a look at this later today as well if that would be useful.
In short, I think that this approach is more complicated than necessary specifically because it is targetting Hyperband first and because Hyperband is a bit of a special case. I think that we should target some of the simpler situations first, build conventions around them, and then see how we can add release valves for Hyperband to also work.
My reply seems to have been lost.
tldr: that seems sensible. I agree that Hyperband's multiple calls to fit complicated this PR, and I don't know how representative that will be of other strategies (I assume not).
@mrocklin I think this is mostly done for now. I may do a bit of restructuring to split out the CV-specific parts of DaskBaseSearchCV into a mixin. We currently inherit some things that don't make so much sense. Follow-up items are to try out random search and successive halving from https://gist.github.com/mrocklin/4c95bd26d15281d82e0bf2d27632e294 on this. Do you want to do that or should I?
> Do you want to do that or should I?

I can take a look. It'll force me to better understand the system you've put in place here.
OK, just merged in master to hopefully fix the CI failure. If you want to "check your work", here's a solution :)

```diff
diff --git a/dask_ml/model_selection/_incremental.py b/dask_ml/model_selection/_incremental.py
index c2983ae..b208ae9 100644
--- a/dask_ml/model_selection/_incremental.py
+++ b/dask_ml/model_selection/_incremental.py
@@ -618,3 +618,68 @@ class RandomizedWorstIncrementalSearch(BaseIncrementalSearch):
     def _get_params(self):
         return ParameterSampler(self.parameters, self.n_iter, self.random_state)
+
+
+class RandomIncrementalSearch(BaseIncrementalSearch):
+    def __init__(
+        self,
+        estimator,
+        param_distribution,
+        n_iter=10,
+        max_iter=100,
+        patience=10,
+        tol=0.001,
+        test_size=0.15,
+        random_state=None,
+        scoring=None,
+        iid=True,
+        refit=True,
+        error_score="raise",
+        return_train_score=_RETURN_TRAIN_SCORE_DEFAULT,
+        scheduler=None,
+        n_jobs=-1,
+        cache_cv=True,
+    ):
+        self.max_iter = max_iter
+        self.n_iter = n_iter
+        self.patience = patience
+        self.tol = tol
+        super(RandomIncrementalSearch, self).__init__(
+            estimator,
+            param_distribution,
+            test_size,
+            random_state,
+            scoring,
+            iid,
+            refit,
+            error_score,
+            return_train_score,
+            scheduler,
+            n_jobs,
+            cache_cv
+        )
+
+    def _get_params(self):
+        return ParameterSampler(self.parameters, self.n_iter)
+
+    def _additional_calls(self, info):
+        out = {}
+        max_iter = self.max_iter
+        patience = self.patience
+        tol = self.tol
+
+        for ident, records in info.items():
+            if max_iter is not None and len(records) > max_iter:
+                out[ident] = 0
+
+            elif len(records) > patience:
+                old = records[-patience]["score"]
+                if all(d["score"] < old + tol for d in records[-patience:]):
+                    out[ident] = 0
+                else:
+                    out[ident] = 1
+
+            else:
+                out[ident] = 1
+        print(out)
+        return out
```
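Usage of that class would presumably look something like the sketch below. It assumes a `dask.distributed` `Client` is running (since `fit` runs on the scheduler's event loop), and that `RandomIncrementalSearch` is importable from `dask_ml.model_selection`; that import path is an assumption based on where the diff above puts the class.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.model_selection import RandomIncrementalSearch  # class from the diff above

client = Client()  # fit() needs an active distributed client

X, y = make_classification(n_samples=1000, n_features=20, chunks=(100, 20))
model = SGDClassifier(tol=1e-3, penalty="elasticnet")
params = {"alpha": np.logspace(-4, 0, 30), "l1_ratio": np.linspace(0, 1, 20)}

search = RandomIncrementalSearch(model, params, n_iter=10, max_iter=100, patience=10)
search.fit(X, y, classes=[0, 1])
print(search.best_estimator_)
```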
I think some of these parameters should be removed:

- iid
- refit
- error_score
- return_train_score
- scheduler
- n_jobs
- cache_cv

But need to double check first.
Care to push up your RandomSearch implementation?
NM, I'm on it
How do we feel about the following?

- Exponential decay with patience on randomly selected parameters
- 100 partial_fit calls with no decay on a full grid

cc @ogrisel, your feedback on API and functionality here would be welcome.
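Concretely, the first option could be an `additional_calls` function along these lines. This is only a sketch: the halving schedule and parameter names are invented for illustration, but the plateau check reuses the patience logic from the draft above.

```python
def exponential_decay_with_patience(info, keep_fraction=0.5, patience=10, tol=0.001):
    # Each round keep only the top ``keep_fraction`` of models, so the
    # population shrinks exponentially, and also stop any model whose score
    # hasn't improved by ``tol`` over its last ``patience`` partial_fit calls.
    ranked = sorted(info, key=lambda ident: info[ident][-1]["score"], reverse=True)
    keep = set(ranked[: max(1, int(len(info) * keep_fraction))])

    out = {}
    for ident, records in info.items():
        plateaued = len(records) > patience and all(
            d["score"] < records[-patience]["score"] + tol
            for d in records[-patience:]
        )
        out[ident] = 1 if ident in keep and not plateaued else 0
    return out
```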
That seems reasonable, with the caveat that we shouldn't have the CV suffix if we're just doing a single train-test split. Now that
OK, done. @TomAugspurger I'm curious about your thoughts on the
I think
For the default size of the test data, how about using the size of a single chunk? For example, if the input data has ten chunks then we use 10%. That provides some guarantee that the result will fit into memory. This would be the default value if
Yes, that's a very good idea.
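A rough sketch of that default, assuming `X` is a dask array chunked along the first axis (the helper name is illustrative, not the eventual implementation):

```python
import dask.array as da


def default_test_size(X, test_size=None):
    # If test_size isn't given, hold out roughly one chunk's worth of rows,
    # so the test set is about the size of a single block.
    if test_size is not None:
        return test_size
    first_chunk = X.chunks[0][0]  # rows in the first block along axis 0
    return first_chunk / X.shape[0]


X = da.ones((1000, 20), chunks=(100, 20))
print(default_test_size(X))  # 0.1 -> hold out ~10% of the data
```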
```python
@if_delegate_has_method(delegate=("best_estimator_", "estimator"))
def inverse_transform(self, Xt):
    self._check_is_fitted("inverse_transform")
    return self.best_estimator_.transform(Xt)
```
It looks like the methods above are untested.
I can put some together quick.
Edge case when there's a single parameter:

```python
X, y = make_classification(n_samples=100, n_features=5, chunks=(10, 5))
model = SGDClassifier(tol=1e-3)
params = {"alpha": np.logspace(-2, 10, 3)}
search = IncrementalSearch(model, params, n_initial_parameters="grid")
search.fit(X, y, classes=[0, 1])
```

```
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
~/sandbox/dask-ml/dask_ml/model_selection/_incremental.py in _fit(self, X, y, **fit_params)
533 history_results = self._get_history_results(results)
--> 534 best_estimator, best_index = self._get_best(results, history_results)
535 best_estimator = yield best_estimator
~/sandbox/dask-ml/dask_ml/model_selection/_incremental.py in _get_best(self, results, history_results)
486 """Select the best estimator from the set of estimators."""
--> 487 best_model_id = first(results.info)
488 key = operator.itemgetter("model_id")
~/Envs/dask-dev/lib/python3.6/site-packages/toolz/itertoolz.py in first(seq)
367 """
--> 368 return next(iter(seq))
369
StopIteration:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-7-048117aa2d7a> in <module>()
----> 1 search.fit(X, y, classes=[0, 1])
~/sandbox/dask-ml/dask_ml/model_selection/_incremental.py in fit(self, X, y, **fit_params)
559 Additional partial fit keyword arguments for the estimator.
560 """
--> 561 return default_client().sync(self._fit, X, y, **fit_params)
562
563 @if_delegate_has_method(delegate=("best_estimator_", "estimator"))
~/Envs/dask-dev/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
645 return future
646 else:
--> 647 return sync(self.loop, func, *args, **kwargs)
648
649 def __repr__(self):
~/Envs/dask-dev/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]
~/Envs/dask-dev/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
~/Envs/dask-dev/lib/python3.6/site-packages/distributed/utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()
~/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True
~/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
1145 exc_info = None
1146 else:
-> 1147 yielded = self.gen.send(value)
1148
1149 if stack_context._state.contexts is not orig_stack_contexts:
RuntimeError: generator raised StopIteration
```
On it
Added small tests for predict_proba, predict_log_proba, transform, and decision_function.
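Those tests presumably look something along these lines. This is a sketch only: it assumes a running `dask.distributed` client, the `IncrementalSearch` name used earlier in this thread, and `loss="log"` so that `predict_proba` exists on the underlying estimator.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

from dask_ml.datasets import make_classification
from dask_ml.model_selection import IncrementalSearch  # class added in this PR


def test_search_delegates_to_best_estimator():
    X, y = make_classification(n_samples=100, n_features=5, chunks=(10, 5))
    model = SGDClassifier(tol=1e-3, loss="log")
    params = {"alpha": np.logspace(-3, 0, 5)}

    search = IncrementalSearch(model, params)
    search.fit(X, y, classes=[0, 1])

    X_local = X.compute()  # call the delegated methods on plain numpy data
    assert search.predict_proba(X_local).shape == (100, 2)
    assert search.predict_log_proba(X_local).shape == (100, 2)
    assert search.decision_function(X_local).shape == (100,)
```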
Sometimes the shape of the result in test_transform differs. Is this expected?
Sorry, fixed.
+1
Here's a rough draft of what a base class calling `_incremental.fit` may look like. I've just been stealing pieces from #221 and making it work for the random sampler with the `remove_worst` strategy.

Looking at #221, there were a few pieces that will be common to each subclass.

That leaves a few things that must be done by each class:

- Build the `{model_id: (param_list, additional_calls)}` dict that is passed to `fit`
- Collect the results of `fit` on each of those into a common form

In the end, I'm not sure how much code reuse we'll actually get, but I think it's worth exploring a bit further. A next step would be to make a branch of #221 that builds on top of this, and see how well the hyperband structure maps on to this base class. From there we can decide whether the base class can be adjusted to fit hyperband's needs, or whether it should just be scrapped. I'll try to get to that today.
cc @stsievert