[MRG] More flexible grid search interface#13145

Closed
NicolasHug wants to merge 7 commits intoscikit-learn:masterfrom
NicolasHug:grid_search_pass_X

Conversation

@NicolasHug
Member

Reference Issues/PRs

What does this implement/fix? Explain your changes.

To make a custom GridSearch class, one can overwrite the (private) _run_search method and call evaluate_candidates with specific candidate parameter. However, it is not possible to specify the data on which we want to evaluate the results, which makes it impossible to implement schemes such as SuccessiveHalving where N candidates compete over K iterations: iteration i + 1 uses twice as much data as iteration i, and only uses the best half of the candidates. At the end of the process, only one candidate remains.
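
For concreteness, the budget arithmetic behind this scheme can be sketched as follows (a toy illustration using the same n_samples_total / (n_candidates * n_iterations) allocation as in the snippet; halving_schedule is a hypothetical helper, not part of the PR):

```python
from math import ceil, log2

# Toy sketch: halve the candidates at each iteration while the
# per-candidate sample budget grows (hypothetical helper).
def halving_schedule(n_candidates, n_samples_total):
    n_iterations = int(ceil(log2(n_candidates)))
    schedule = []
    for _ in range(n_iterations):
        n_samples_iter = n_samples_total // (n_candidates * n_iterations)
        schedule.append((n_candidates, n_samples_iter))
        n_candidates = ceil(n_candidates / 2)  # keep the best half
    return schedule

# 16 candidates, 1000 samples: each iteration halves the field and
# roughly doubles the samples seen per remaining candidate.
print(halving_schedule(16, 1000))
```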

With the proposed changes, successive halving can be implemented as follows (plus a custom refit callable to properly determine the best index):

Details
    # imports assumed by this snippet
    from math import ceil, floor, log2
    import numpy as np
    from sklearn.utils import check_random_state

    def _run_search(self, evaluate_candidates, X, y):
        rng = check_random_state(self.random_state)

        candidate_params = list(self._generate_candidate_params())
        n_iterations = int(ceil(log2(len(candidate_params))))
        n_samples_total = X.shape[0]

        for _ in range(n_iterations):
            # randomly sample training samples
            n_candidates = len(candidate_params)
            n_samples_iter = floor(n_samples_total /
                                   (n_candidates * n_iterations))
            indices = rng.choice(n_samples_total, n_samples_iter,
                                 replace=False)
            X_iter, y_iter = X[indices], y[indices]

            out = evaluate_candidates(candidate_params, X_iter, y_iter)

            # Select the best half of the candidates for the next iteration
            n_candidates_to_keep = ceil(n_candidates / 2)
            best_candidates_indices = np.argsort(out['mean_test_score'])
            best_candidates_indices = \
                best_candidates_indices[-n_candidates_to_keep:]
            candidate_params = [candidate_params[i]
                                for i in best_candidates_indices]
        assert len(candidate_params) == 1

Proposed changes:

  • Force X and y to be passed to _run_search and evaluate_candidates
  • evaluate_candidates returns the formatted results of the current call only, not the cumulative results of all calls since the beginning. This is IMO more useful, as illustrated in the code snippet above.

Any other comments?

Please note that I also integrated the small changes proposed in #13144

ping @jnothman who implemented the current design
ping @janvanrijn you might be interested in this?
ping @amueller

@NicolasHug NicolasHug changed the title More flexible grid search interface [MRG] More flexible grid search interface Feb 12, 2019
@jnothman
Member

jnothman commented Feb 12, 2019 via email

@NicolasHug
Member Author

The part where I was surprised by this was in recognising that cv.split gets called in each evaluation... I thought we had ensured the cv split would be the same across evaluations.

but it is the same across evaluations... as long as the input data is the same, right?
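
For the built-in splitters this does hold: without shuffling, the folds are a deterministic function of the data, so repeated calls to split on the same input give identical folds. A quick check:

```python
import numpy as np
from sklearn.model_selection import KFold

# Without shuffle, KFold's folds depend only on the number of samples,
# so calling split twice on the same data yields the same indices.
X = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5)
first = [test.tolist() for _, test in cv.split(X)]
second = [test.tolist() for _, test in cv.split(X)]
assert first == second
```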

For the return value, in my situation I agree I could get by with the current behavior. And only returning the result of the last call might make other implementations harder. Maybe we can even return both? @janvanrijn you've been building custom SearchCV objects, what do you think about this?

@jnothman
Member

jnothman commented Feb 12, 2019 via email

@NicolasHug
Member Author

NicolasHug commented Feb 13, 2019

@jnothman, thinking more about whether to return the results from all the calls to evaluate_candidates or just the last one:

  • returning only the last one makes it a bit hard to reconstruct all of the results in a custom _run_search. I personally don't need that for successive halving, but other implementations may.
  • returning results from all the calls makes it possible to return this dict directly from _run_search: that would allow removing the nonlocal variable, which I think is a good thing. However, this forces the results dict to be ordered, else there is no way to know which entries correspond to the last call to evaluate_candidates in _run_search. If the dict is not ordered, I cannot implement the successive halving search.
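
To make the second option concrete: if the cumulative dict preserves insertion order, the last call's rows can be sliced off the end (last_call_results is a hypothetical helper, not part of the PR):

```python
# Hypothetical helper: slice the rows of the most recent call out of a
# cumulative, order-preserving results dict (one entry per candidate).
def last_call_results(all_results, n_last):
    return {key: values[-n_last:] for key, values in all_results.items()}

all_results = {'params': [{'C': 1}, {'C': 10}, {'C': 10}],
               'mean_test_score': [0.84, 0.92, 0.90]}
# if the last call evaluated a single candidate:
print(last_call_results(all_results, 1))
```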

WDYT?

@NicolasHug
Member Author

Another solution @amueller and I discussed would be to pass an extra parameter info: evaluate_candidates(candidates, X, y, info=None)

info would be a dict whose keys would be turned into columns of cv_results_. This way, in SuccessiveHalving I could record which candidates correspond to which iteration, and find the best candidate accordingly.
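
In other words, each key of info would contribute one column to the results, aligned with the candidates of that call. A rough sketch of the merging step (illustrative only, not the actual BaseSearchCV code):

```python
# Illustrative merge: each info key becomes a results column with one
# value per candidate evaluated in this call.
def merge_info(results, info, n_candidates):
    for key, column in info.items():
        assert len(column) == n_candidates  # one value per candidate
        results.setdefault(key, []).extend(column)
    return results

results = {'mean_test_score': [0.84, 0.92]}
merge_info(results, {'iter': [0, 0]}, n_candidates=2)
print(results['iter'])
```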

I pushed 80963f3 and will write tests/checks if you're OK with this design?

@jnothman
Member

I've not really got why info goes into the results...? I haven't understood how this solves the problem of selecting samples. Maybe it should be called more_results or something. Also, it might be a good idea to generally include the evaluation number in results where multiple evaluations are performed...

@jnothman
Member

Also should this PR just go ahead and introduce a HalvingSearchCV?

@NicolasHug
Member Author

I've not really got why info goes into the results

It goes into the results dict so that in HalvingSearchCV I can properly overload refit to only keep the best candidates out of the last iteration (i.e. the last call to evaluate). And I agree more_results is a better name.

I haven't understood how this solves the problem of selecting samples

It is unrelated to the addition of X and y to evaluate, if that's what you mean.

Also should this PR just go ahead and introduce a HalvingSearchCV?

I'm currently implementing HalvingSearchCV in another repo to see how it goes, but yes, I'm planning to submit a PR eventually.

@jnothman
Member

jnothman commented Feb 13, 2019 via email

@NicolasHug
Member Author

NicolasHug commented Feb 14, 2019

Yes, I have modified my implementation of SuccessiveHalving like so:

...
            info = {'iter': [iter_i] * n_candidates,
                    'n_samples': [n_samples_iter] * n_candidates}
            out = evaluate_candidates(candidate_params, X_iter, y_iter,
                                      info=info)

...

Passing n_samples is not really needed here.

Now results has new columns indicating the iteration (i.e. when evaluate_candidates was called) and the number of samples used at that iteration.

@NicolasHug
Member Author

NicolasHug commented Feb 14, 2019

@jnothman with the current version of this PR, here is how we can implement SuccessiveHalving:

Details
    # imports assumed by this snippet
    from math import ceil, floor, log2
    import numpy as np
    from sklearn.utils import check_random_state

    def _run_search(self, evaluate_candidates, X, y):
        rng = check_random_state(self.random_state)

        candidate_params = list(self._generate_candidate_params())
        n_iterations = int(ceil(log2(len(candidate_params))))
        n_samples_total = X.shape[0]

        for iter_i in range(n_iterations):
            # randomly sample training samples
            n_candidates = len(candidate_params)
            n_samples_iter = floor(n_samples_total /
                                   (n_candidates * n_iterations))
            indices = rng.choice(n_samples_total, n_samples_iter,
                                 replace=False)
            X_iter, y_iter = X[indices], y[indices]

            more_results = {'iter': [iter_i] * n_candidates,
                            'n_samples': [n_samples_iter] * n_candidates}
            out = evaluate_candidates(candidate_params, X_iter, y_iter,
                                      more_results=more_results)

            # Select the best half of the candidates for the next iteration.
            # We need to filter out candidates from the previous iterations.
            n_candidates_to_keep = ceil(n_candidates / 2)
            best_candidates_indices = np.argsort(out['mean_test_score'])[::-1]
            best_candidates_indices = [i for i in best_candidates_indices
                                       if out['iter'][i] == iter_i]
            best_candidates_indices = \
                best_candidates_indices[:n_candidates_to_keep]
            candidate_params = [out['params'][i]
                                for i in best_candidates_indices]

        assert len(candidate_params) == n_candidates_to_keep == 1

def _refit_callable(results):
    # Custom refit callable returning the index of the best candidate.
    # We want the best candidate out of the last iteration; by default
    # BaseSearchCV would return the best candidate out of all iterations.
    last_iter = np.max(results['iter'])
    sorted_indices = np.argsort(results['mean_test_score'])[::-1]
    best_index = next(i for i in sorted_indices
                      if results['iter'][i] == last_iter)
    return best_index
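
Since scikit-learn already accepts a callable for refit (it receives cv_results_ and must return best_index_), a callable in this spirit can be wired into a plain GridSearchCV. The fallback branch below is an illustrative addition for results that have no 'iter' column:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def best_index_last_iter(results):
    # Pick the best candidate among those evaluated in the last iteration;
    # fall back to a plain argmax when there is no 'iter' column.
    scores = np.asarray(results['mean_test_score'])
    if 'iter' not in results:
        return int(np.argmax(scores))
    iters = np.asarray(results['iter'])
    last_iter = iters.max()
    candidates = np.flatnonzero(iters == last_iter)
    return int(candidates[np.argmax(scores[candidates])])

X, y = make_classification(random_state=0)
search = GridSearchCV(SVC(), {'C': [1, 10]}, cv=3,
                      refit=best_index_last_iter)
search.fit(X, y)
print(search.best_index_)
```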

Example of a dataframe, originally with n_samples = 1000:

Details
iter n_samples mean_fit_time std_fit_time mean_score_time std_score_time param_C param_kernel params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 0 125 0.00114207 0.00066063 0.000282717 4.80541e-05 1 linear {'C': 1, 'kernel': 'linear'} 0.807692 0.846154 0.84 0.791667 0.916667 0.840436 0.0431152 6 0.969697 0.969697 0.96 0.970297 0.970297 0.967998 0.00400779
1 0 125 0.000829029 3.47029e-05 0.000310659 5.15117e-06 1 rbf {'C': 1, 'kernel': 'rbf'} 0.923077 0.923077 0.88 0.875 0.958333 0.911897 0.0309358 2 1 1 1 0.990099 1 0.99802 0.0039604
2 0 125 0.00140367 0.000458439 0.000246429 3.36026e-06 10 linear {'C': 10, 'kernel': 'linear'} 0.923077 0.923077 0.8 0.75 0.833333 0.845897 0.0683726 5 0.989899 0.989899 1 0.990099 0.970297 0.988039 0.0096851
3 0 125 0.000910473 8.3377e-06 0.000307178 5.0881e-06 10 rbf {'C': 10, 'kernel': 'rbf'} 0.923077 0.923077 0.92 0.875 0.958333 0.919897 0.0265078 1 1 1 1 1 1 1 0
4 1 250 0.00220547 0.000168366 0.00052824 6.1141e-05 10 rbf {'C': 10, 'kernel': 'rbf'} 0.921569 0.86 0.92 0.92 0.857143 0.895742 0.0303687 4 1 1 1 1 1 1 0
5 1 250 0.00166616 5.47791e-05 0.000470543 5.46514e-06 1 rbf {'C': 1, 'kernel': 'rbf'} 0.901961 0.92 0.94 0.96 0.836735 0.911739 0.0422311 3 0.979899 0.985 0.98 0.98 0.975124 0.980005 0.00312351

best_index_ is 5

The proposed design makes it easy to implement SuccessiveHalving and would allow any other kind of search estimator. If you're OK with the design, I'll start writing tests and checks.

@NicolasHug
Member Author

@jnothman shall we discuss this during the sprint?

@amueller
Member

amueller commented Mar 7, 2019

I'm in favor of adding SuccessiveHalvingCV btw. I'm happy to put it into experimental if we decide we want that module.

@NicolasHug
Member Author

Joel and I agreed during the sprint that ideally only the CV object would be passed, instead of passing X and y; the subsampling can then be done inside the CV object.
I'll get to it soon and report back.

And yes I think it could be a good fit for the experimental module

@NicolasHug
Member Author

@jnothman , I've been trying to only pass cv instead of X, y.

We still have to pass cv into both _run_search and evaluate_candidates. We cannot avoid passing cv into _run_search: using something like cv = check_cv(self.cv) in _run_search is not possible since we need y to stratify.

So basically this only replaces X and y with cv everywhere. Do you still think it's worth it?

@jnothman
Member

jnothman commented Mar 17, 2019 via email

@NicolasHug
Member Author

What I mean is that currently in fit, the cv object is set as such:

cv = check_cv(self.cv, y, classifier=is_classifier(estimator))

We can't write the same thing in _run_search unless we pass y. So the more natural way is just to pass cv to _run_search.

I don't think that's much better than passing X and y.
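
The dependence on y is visible in check_cv itself: it only returns a stratified splitter when it can see the target, e.g.:

```python
import numpy as np
from sklearn.model_selection import check_cv

# check_cv stratifies only when it is given y (and the task is
# classification); without y it falls back to plain KFold.
y = np.array([0, 0, 0, 1, 1, 1])
cv_with_y = check_cv(3, y, classifier=True)
cv_without_y = check_cv(3, None, classifier=True)
print(type(cv_with_y).__name__, type(cv_without_y).__name__)
```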

@jnothman
Member

jnothman commented Mar 18, 2019 via email

@NicolasHug
Member Author

Hmm, maybe passing indices is a nice alternative, since unlike X and y, indices can have a default value (None).

I think we should pass both row indices and column indices. This way we could also implement feature subsampling: in successive halving the budget can also be the number of features.
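
A sketch of what that subsampling could look like, with None meaning "use everything" (subsample is a hypothetical helper, not part of the PR):

```python
import numpy as np

# Hypothetical helper: restrict rows and/or columns, defaulting to the
# full data when the corresponding indices are None.
def subsample(X, y, sample_indices=None, feature_indices=None):
    if sample_indices is not None:
        X, y = X[sample_indices], y[sample_indices]
    if feature_indices is not None:
        X = X[:, feature_indices]
    return X, y

X = np.arange(12).reshape(4, 3)
y = np.array([0, 1, 0, 1])
X_sub, y_sub = subsample(X, y, sample_indices=[0, 2], feature_indices=[1])
print(X_sub.shape)
```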


I'm not sure I entirely understand what you said before about joblib being able to use a numpy memmap. It seems to me that the memmapping only happens over a single call to Parallel, so multiple calls to evaluate_candidates can't leverage the same memmapping anyway, right?

@jnothman
Member

jnothman commented Mar 19, 2019 via email

@NicolasHug
Member Author

Could you please expand on prepending a SuccessiveHalvingResampler? I'm not following.


  • I've implemented SH by passing indices to evaluate_candidates instead of X and y. It works fine, but doesn't make the code simpler IMO. It requires changing evaluate_candidates like so:

    def evaluate_candidates(candidate_params, sample_indices=None):
      # ...
      if sample_indices is not None:
          X_, y_ = X[sample_indices], y[sample_indices]
      else:
          X_, y_ = X, y
      # ...
      # use X_ and y_ from now on, since using nonlocal isn't an option

    Also, we still need to pass X and y to _run_search anyway, else there is no way we can do the resampling.

    So passing X and y seems to be the most natural way here.

  • It is also possible to remove the more_results parameter. But it is honestly quite helpful and makes the SuccessiveHalving code cleaner and less magical. In particular, @amueller came up with this nice plot, which is only possible with an 'iter' column:

    import pandas as pd

    # sh is a fitted successive-halving search instance
    res = pd.DataFrame(sh.cv_results_)
    res['params_str'] = res.params.apply(str)
    reshape = res.pivot(index='iter', columns='params_str',
                        values='mean_test_score')
    reshape.plot(legend=False, alpha=.4, c='k')

[plot: mean_test_score of each candidate across successive halving iterations]

Would you agree to keep the proposed changes as they are? If yes, I'll start writing tests.

@NicolasHug
Member Author

Also, it'd be useful to introduce a _check_input_parameters method. Right now input checking can only be done in _run_search. WDYT?

@jnothman
Member

Also, it'd be useful to introduce a _check_input_parameters method. Right now input checking can only be done in _run_search. WDYT?

well it can also be done in fit... why would it be better done in _run_search?

@NicolasHug
Member Author

well it can also be done in fit... why would it be better done in _run_search?

I mean it would be nice to have a _check_input_parameters method that would be called in fit, before _run_search.

Right now for successive halving I am forced to do the input checking in _run_search since this is the only method that I can override.

@jnothman
Member

jnothman commented Mar 27, 2019 via email

@NicolasHug
Member Author

Because apart from _run_search, fit doesn't call any method that can be overridden, and I don't want to override the whole fit method.

@jnothman
Member

jnothman commented Mar 27, 2019 via email

@NicolasHug
Member Author

Right, I was thinking of a hook that could be called anywhere in fit, but indeed for input validation this would make sense.

amueller pushed a commit to dabl/dabl that referenced this pull request Mar 27, 2019
~~Still very WIP, need feedback on the interface :)~~

I'm now overwriting `_run_search`, taking advantage of the proposed changes in `BaseGridSearch` from scikit-learn/scikit-learn#13145

Note that this forces to remove the `refit` parameter, which has to be set to a custom callable.

`cv_results_` looks good and it's still useful. For e.g. 4 candidates it's going to have 4 (first iter) + 2 (second and last iter) rows.

If you're OK with this new design I'll start to write tests :)
@adrinjalali
Member

Could you please post a summary of this PR's status and what remains to be done here, @NicolasHug?

@NicolasHug
Member Author

@adrinjalali This PR is actually superseded by #13900

@NicolasHug NicolasHug added the Superseded label (PR has been replaced by a newer PR) and removed the Needs Decision (requires decision) and Waiting for Reviewer labels Aug 7, 2019
@adrinjalali
Member

I see, you said "This builds upon #13145, whose changes are required." on that PR, so I assumed you need this one first. Closing then.

@adrinjalali adrinjalali closed this Aug 7, 2019