[MRG] More flexible grid search interface#13145
NicolasHug wants to merge 7 commits into scikit-learn:master from
Conversation
|
Hmmmm... Overall it is good to have this guided by realistic applications.
I'm not sure the change to the return value provides much benefit, but
you're welcome to disagree.
The part where I was surprised by this was in recognising that cv.split
gets called in each evaluation... I thought we had ensured the cv split
would be the same across evaluations.
This can, btw, be achieved more hackily by wrapping the base estimator with
something that has a parameter to tell it which samples to take.
|
but it is the same across evaluations... as long as the input data is the same, right? For the return value, in my situation I agree I could get by with the current behavior. And only returning the result of the last call might make other implementations harder. Maybe we can even return both? @janvanrijn you've been building custom SearchCV objects, what do you think about this? |
|
Hmm... not the same across evaluations if KFold(shuffle=True,
random_state=None)
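To make the point above concrete, here is a quick sketch (my own, not from the PR) showing that a pinned `random_state` is what makes repeated `split()` calls reproducible:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# With random_state=None, every call to split() reshuffles from the global
# RNG, so two evaluations of the same candidate can see different folds.
cv = KFold(n_splits=5, shuffle=True, random_state=None)
splits_1 = [test.tolist() for _, test in cv.split(X)]
splits_2 = [test.tolist() for _, test in cv.split(X)]  # almost surely != splits_1

# Pinning random_state makes the folds identical across calls.
cv_fixed = KFold(n_splits=5, shuffle=True, random_state=0)
assert ([test.tolist() for _, test in cv_fixed.split(X)]
        == [test.tolist() for _, test in cv_fixed.split(X)])
```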
|
|
@jnothman thinking more about whether returning the results from all the calls to
WDYT? |
|
Another solution we discussed with @amueller would be to pass another parameter
I pushed 80963f3 and will write tests/checks if you're OK with this design? |
|
I've not really got why info goes into the results...? I haven't understood how this solves the problem of selecting samples. Maybe it should be called more_results or something. Also, it might be a good idea to generally include the evaluation number in results where multiple evaluations are performed... |
|
Also should this PR just go ahead and introduce a HalvingSearchCV? |
It goes into the results dict so that in HalvingSearchCV I can properly overload
It is unrelated to the addition of X and y to
I'm currently implementing |
|
So including an evaluation_idx (or just iteration) in results by default
would be sufficient here?
|
|
Yes, I have modified my implementation of SuccessiveHalving like so:

```python
...
info = {'iter': [iter_i] * n_candidates,
        'n_samples': [n_samples_iter] * n_candidates}
out = evaluate_candidates(candidate_params, X_iter, y_iter,
                          info=info)
...
```

Passing n_samples is not really needed here. Now in |
|
@jnothman with the current version of this PR, here is how we can implement SuccessiveHalving:

```python
def _run_search(self, evaluate_candidates, X, y):
    rng = check_random_state(self.random_state)
    candidate_params = list(self._generate_candidate_params())
    n_iterations = int(ceil(log2(len(candidate_params))))
    n_samples_total = X.shape[0]
    for iter_i in range(n_iterations):
        # randomly sample training samples
        n_candidates = len(candidate_params)
        n_samples_iter = floor(n_samples_total /
                               (n_candidates * n_iterations))
        indices = rng.choice(n_samples_total, n_samples_iter,
                             replace=False)
        X_iter, y_iter = X[indices], y[indices]
        more_results = {'iter': [iter_i] * n_candidates,
                        'n_samples': [n_samples_iter] * n_candidates}
        out = evaluate_candidates(candidate_params, X_iter, y_iter,
                                  more_results=more_results)
        # Select the best half of the candidates for the next iteration.
        # We need to filter out candidates from the previous iterations.
        n_candidates_to_keep = ceil(n_candidates / 2)
        best_candidates_indices = np.argsort(out['mean_test_score'])[::-1]
        best_candidates_indices = [i for i in best_candidates_indices
                                   if out['iter'][i] == iter_i]
        best_candidates_indices = \
            best_candidates_indices[:n_candidates_to_keep]
        candidate_params = [out['params'][i]
                            for i in best_candidates_indices]
    assert len(candidate_params) == n_candidates_to_keep == 1


def _refit_callable(results):
    # Custom refit callable to return the index of the best candidate. We
    # want the best candidate out of the last iteration. By default
    # BaseSearchCV would return the best candidate out of all iterations.
    last_iter = np.max(results['iter'])
    sorted_indices = np.argsort(results['mean_test_score'])[::-1]
    best_index = next(i for i in sorted_indices
                      if results['iter'][i] == last_iter)
    return best_index
```

Example of a dataframe, originally with n_samples = 1000: (dataframe collapsed in the original comment)

`best_index_` is 5. The proposed design makes it easy to implement SuccessiveHalving and will allow any other kind of search estimator. If you're OK with the design I'll start to write tests and checks |
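For concreteness, here is the schedule the snippet above implies. The numbers are my own arithmetic, assuming 8 initial candidates and `n_samples_total = 1000`:

```python
from math import ceil, floor, log2

n_candidates = 8
n_samples_total = 1000
n_iterations = int(ceil(log2(n_candidates)))  # 3 iterations

schedule = []  # (iteration, n_candidates, n_samples in that iteration)
for iter_i in range(n_iterations):
    # same per-iteration sample budget as in the _run_search snippet above
    n_samples_iter = floor(n_samples_total / (n_candidates * n_iterations))
    schedule.append((iter_i, n_candidates, n_samples_iter))
    n_candidates = ceil(n_candidates / 2)  # keep the best half

# schedule == [(0, 8, 41), (1, 4, 83), (2, 2, 166)]
assert n_candidates == 1  # a single candidate remains at the end
```

This also makes the shape of `cv_results_` predictable: one row per candidate per iteration, i.e. 8 + 4 + 2 = 14 rows here.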
|
@jnothman shall we discuss this during the sprint? |
|
I'm in favor of adding SuccessiveHalvingCV btw. I'm happy to put it into |
|
We agreed with Joel during the sprint that ideally only the CV object could be passed, instead of passing X and y. We can do the subsampling inside the CV. And yes I think it could be a good fit for the experimental module |
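A minimal sketch of what "subsampling inside the CV" could look like. `SubsampledCV` is a hypothetical name, not part of this PR: it defers splitting (and hence any stratification) to the base splitter and only thins out each training fold:

```python
import numpy as np
from sklearn.model_selection import KFold

class SubsampledCV:
    """Hypothetical wrapper: yields the base CV's splits unchanged except
    that each training fold is randomly thinned to n_train_samples rows."""

    def __init__(self, base_cv, n_train_samples, random_state=0):
        self.base_cv = base_cv
        self.n_train_samples = n_train_samples
        self.rng = np.random.RandomState(random_state)

    def split(self, X, y=None, groups=None):
        for train, test in self.base_cv.split(X, y, groups):
            keep = self.rng.choice(len(train), self.n_train_samples,
                                   replace=False)
            yield train[keep], test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.base_cv.get_n_splits(X, y, groups)

# 12 samples, 3 folds: each base training fold has 8 rows, thinned to 4.
X = np.zeros((12, 2))
cv = SubsampledCV(KFold(n_splits=3), n_train_samples=4)
splits = list(cv.split(X))
assert all(len(train) == 4 for train, _ in splits)
```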
|
@jnothman, I've been trying to only pass … We still have to pass … So basically this only replaces … |
|
By saying "we need y to stratify" do you mean that you need to subsample
from the training set in a stratified way? You can handle that when you
wrap split()... Or that you need to tell check_cv that the base estimator
is a classifier so that it selects the right splitter? I don't see why that
requires y.
|
|
What I mean is that currently in …
We can't write the same thing in … I don't think that's much better than passing … |
|
I forgot that check_cv takes y. Hmmmm. I'd still rather we pass indices
over the training set to evaluate_candidates. But I might be wrong. This
certainly makes the interface more elegant, but it's worth it?
|
|
Hmm, maybe passing indices is a nice alternative since, unlike X and y, indices can have default values (None). I think we should pass both row indices and column indices. This way we could also implement feature subsampling: in successive halving the budget can also be the number of features.

I'm not sure I entirely understand what you said before about joblib being able to use a numpy memmap. It seems to me that the memmapping only happens over a single call to … |
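To illustrate the indices idea, a small sketch (the helper name and signature are mine, not the PR's API): `None` defaults keep the current behavior, row indices shrink the training set, and column indices would enable feature subsampling:

```python
import numpy as np

def select(X, y, row_indices=None, col_indices=None):
    """Hypothetical helper: None means 'use everything', so the default call
    is a no-op; row indices implement sample budgets, column indices
    feature budgets."""
    if row_indices is not None:
        X, y = X[row_indices], y[row_indices]
    if col_indices is not None:
        X = X[:, col_indices]
    return X, y

X = np.arange(12).reshape(4, 3)
y = np.array([0, 1, 0, 1])

# Keep rows 0 and 2, and only feature column 1.
X_sub, y_sub = select(X, y, row_indices=[0, 2], col_indices=[1])
assert X_sub.shape == (2, 1) and y_sub.tolist() == [0, 0]
```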
|
I think you're right about Parallel in the current implementation.
You make me wonder if we are better off with prepending some
SuccessiveHalvingResampler onto the estimator somehow, rather than doing
the resampling in _run_search... But maybe your solution of passing X and y
around is reasonable.
|
|
Could you please expand on prepending SuccessiveHalvingResampler? I'm not following.
Would you agree to keep the proposed changes as they are? If yes, I'll start writing tests |
|
Also It'd be useful to introduce a |
well it can also be done in |
I mean it would be nice to have a … Right now, for successive halving, I am forced to do the input checking in … |
|
why is it the only method you can overwrite?
|
|
Because apart from |
|
It would be very normal when inheriting to override fit and call
super().fit()
|
|
Right, I was thinking of a hook that could be called anywhere in |
~~Still very WIP, need feedback on the interface :)~~ I'm now overwriting `_run_search`, taking advantage of the proposed changes to `BaseSearchCV` from scikit-learn/scikit-learn#13145. Note that this forces us to remove the `refit` parameter, which has to be set to a custom callable. `cv_results_` looks good and it's still useful: for e.g. 4 candidates it's going to have 4 (first iter) + 2 (second and last iter) rows. If you're OK with this new design I'll start to write tests :)
|
Could you please post a summary and what needs to be done here @NicolasHug ? |
|
@adrinjalali This PR is actually superseded by #13900 |
|
I see, you said "This builds upon #13145, whose changes are required." on that PR, so I assumed you need this one first. Closing then. |

Reference Issues/PRs
What does this implement/fix? Explain your changes.
To make a custom GridSearch class, one can overwrite the (private) `_run_search` method and call `evaluate_candidates` with specific candidate parameters. However, it is not possible to specify the data on which we want to evaluate the candidates, which makes it impossible to implement schemes such as SuccessiveHalving, where N candidates compete over K iterations: iteration i + 1 uses twice as much data as iteration i, and only keeps the best half of the candidates. At the end of the process, only one candidate remains.

With the proposed changes, successive halving can be easily implemented as such (+ a custom refit callable to properly determine the best index):
(collapsed code snippet: the same `_run_search` example shown in the conversation above)
Proposed changes:

- X and y are now passed to `_run_search` and `evaluate_candidates`.
- `evaluate_candidates` returns the formatted results of the current call, not the formatted results of all the calls since the beginning. This is IMO more useful, as illustrated in the above code snippet.

Any other comments?
Please note that I also integrated the small changes proposed in #13144
ping @jnothman who implemented the current design
ping @janvanrijn you might be interested in this?
ping @amueller