[MRG] More flexible grid search interface#13145

Closed
NicolasHug wants to merge 7 commits intoscikit-learn:masterfrom
NicolasHug:grid_search_pass_X

Conversation

@NicolasHug
Member

Reference Issues/PRs

What does this implement/fix? Explain your changes.

To make a custom GridSearch class, one can overwrite the (private) _run_search method and call evaluate_candidates with specific candidate parameter. However, it is not possible to specify the data on which we want to evaluate the results, which makes it impossible to implement schemes such as SuccessiveHalving where N candidates compete over K iterations: iteration i + 1 uses twice as much data as iteration i, and only uses the best half of the candidates. At the end of the process, only one candidate remains.
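
For concreteness, the budget arithmetic behind this scheme can be sketched as follows (a toy illustration using the same n_samples_total / (n_candidates * n_iterations) allocation as in the snippet; halving_schedule is a hypothetical helper, not part of the PR):

```python
from math import ceil, log2

# Toy sketch: halve the candidates at each iteration while the
# per-candidate sample budget grows (hypothetical helper).
def halving_schedule(n_candidates, n_samples_total):
    n_iterations = int(ceil(log2(n_candidates)))
    schedule = []
    for _ in range(n_iterations):
        n_samples_iter = n_samples_total // (n_candidates * n_iterations)
        schedule.append((n_candidates, n_samples_iter))
        n_candidates = ceil(n_candidates / 2)  # keep the best half
    return schedule

# 16 candidates, 1000 samples: each iteration halves the field and
# roughly doubles the samples seen per remaining candidate.
print(halving_schedule(16, 1000))
```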

With the proposed changes, successive halving can be implemented as follows (plus a custom refit callable to properly determine the best index):

Details
    # imports assumed by this snippet
    from math import ceil, floor, log2
    import numpy as np
    from sklearn.utils import check_random_state

    def _run_search(self, evaluate_candidates, X, y):
        rng = check_random_state(self.random_state)

        candidate_params = list(self._generate_candidate_params())
        n_iterations = int(ceil(log2(len(candidate_params))))
        n_samples_total = X.shape[0]

        for _ in range(n_iterations):
            # randomly sample training samples
            n_candidates = len(candidate_params)
            n_samples_iter = floor(n_samples_total /
                                   (n_candidates * n_iterations))
            indices = rng.choice(n_samples_total, n_samples_iter,
                                 replace=False)
            X_iter, y_iter = X[indices], y[indices]

            out = evaluate_candidates(candidate_params, X_iter, y_iter)

            # Select the best half of the candidates for the next iteration
            n_candidates_to_keep = ceil(n_candidates / 2)
            best_candidates_indices = np.argsort(out['mean_test_score'])
            best_candidates_indices = \
                best_candidates_indices[-n_candidates_to_keep:]
            candidate_params = [candidate_params[i]
                                for i in best_candidates_indices]
        assert len(candidate_params) == 1

Proposed changes:

  • Force X and y to be passed to _run_search and evaluate_candidates
  • evaluate_candidates returns the formatted results of the current call only, not the cumulative results of all calls since the beginning. This is IMO more useful, as illustrated in the code snippet above.

Any other comments?

Please note that I also integrated the small changes proposed in #13144

ping @jnothman who implemented the current design
ping @janvanrijn you might be interested in this?
ping @amueller

@NicolasHug NicolasHug changed the title More flexible grid search interface [MRG] More flexible grid search interface Feb 12, 2019
@jnothman
Member

jnothman commented Feb 12, 2019 via email

@NicolasHug
Member Author

The part where I was surprised by this was in recognising that cv.split gets called in each evaluation... I thought we had ensured the cv split would be the same across evaluations.

but it is the same across evaluations... as long as the input data is the same, right?
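
For the built-in splitters this does hold: without shuffling, the folds are a deterministic function of the data, so repeated calls to split on the same input give identical folds. A quick check:

```python
import numpy as np
from sklearn.model_selection import KFold

# Without shuffle, KFold's folds depend only on the number of samples,
# so calling split twice on the same data yields the same indices.
X = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5)
first = [test.tolist() for _, test in cv.split(X)]
second = [test.tolist() for _, test in cv.split(X)]
assert first == second
```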

For the return value, in my situation I agree I could get by with the current behavior. And only returning the result of the last call might make other implementations harder. Maybe we can even return both? @janvanrijn you've been building custom SearchCV objects, what do you think about this?

@jnothman
Member

jnothman commented Feb 12, 2019 via email

@NicolasHug
Member Author

NicolasHug commented Feb 13, 2019

@jnothman, thinking more about whether to return the results from all the calls to evaluate_candidates or just the last one:

  • returning only the last one makes it a bit hard to reconstruct all of the results in a custom _run_search. I personally don't need that for successive halving, but other implementations may.
  • returning results from all the calls makes it possible to return this dict directly from _run_search: that would allow removing the nonlocal variable, which I think is a good thing. However, this forces the results dict to be ordered, else there is no way to know which entries correspond to the last call to evaluate_candidates in _run_search. If the dict is not ordered, I cannot implement the successive halving search.
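
To make the second option concrete: if the cumulative dict preserves insertion order, the last call's rows can be sliced off the end (last_call_results is a hypothetical helper, not part of the PR):

```python
# Hypothetical helper: slice the rows of the most recent call out of a
# cumulative, order-preserving results dict (one entry per candidate).
def last_call_results(all_results, n_last):
    return {key: values[-n_last:] for key, values in all_results.items()}

all_results = {'params': [{'C': 1}, {'C': 10}, {'C': 10}],
               'mean_test_score': [0.84, 0.92, 0.90]}
# if the last call evaluated a single candidate:
print(last_call_results(all_results, 1))
```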

WDYT?

@NicolasHug
Member Author

Another solution @amueller and I discussed would be to pass an extra parameter info: evaluate_candidates(candidates, X, y, info=None)

info would be a dict whose keys would be turned into columns of cv_results_. This way, in SuccessiveHalving I could record which candidates correspond to which iteration, and find the best candidate accordingly.
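
In other words, each key of info would contribute one column to the results, aligned with the candidates of that call. A rough sketch of the merging step (illustrative only, not the actual BaseSearchCV code):

```python
# Illustrative merge: each info key becomes a results column with one
# value per candidate evaluated in this call.
def merge_info(results, info, n_candidates):
    for key, column in info.items():
        assert len(column) == n_candidates  # one value per candidate
        results.setdefault(key, []).extend(column)
    return results

results = {'mean_test_score': [0.84, 0.92]}
merge_info(results, {'iter': [0, 0]}, n_candidates=2)
print(results['iter'])
```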

I pushed 80963f3 and will write tests/checks if you're OK with this design?

@jnothman
Member

I've not really got why info goes into the results...? I haven't understood how this solves the problem of selecting samples. Maybe it should be called more_results or something. Also, it might be a good idea to generally include the evaluation number in results where multiple evaluations are performed...

@jnothman
Member

Also should this PR just go ahead and introduce a HalvingSearchCV?

@NicolasHug
Member Author

I've not really got why info goes into the results

It goes into the results dict so that in HalvingSearchCV I can properly overload refit to only keep the best candidates out of the last iteration (i.e. the last call to evaluate). And I agree more_results is a better name.

I haven't understood how this solves the problem of selecting samples

It is unrelated to the addition of X and y to evaluate, if that's what you mean.

Also should this PR just go ahead and introduce a HalvingSearchCV?

I'm currently implementing HalvingSearchCV in another repo to see how it goes, but yes, I'm planning to submit a PR eventually.

@jnothman
Member

jnothman commented Feb 13, 2019 via email

@NicolasHug
Member Author

NicolasHug commented Feb 14, 2019

Yes, I have modified my implementation of SuccessiveHalving like so:

...
            info = {'iter': [iter_i] * n_candidates,
                    'n_samples': [n_samples_iter] * n_candidates}
            out = evaluate_candidates(candidate_params, X_iter, y_iter,
                                      info=info)

...

Passing n_samples is not really needed here.

Now results has new columns indicating the iteration (i.e. when evaluate_candidates was called) and the number of samples used at that iteration.

@NicolasHug
Member Author

NicolasHug commented Feb 14, 2019

@jnothman with the current version of this PR, here is how we can implement SuccessiveHalving:

Details
    # imports assumed by this snippet
    from math import ceil, floor, log2
    import numpy as np
    from sklearn.utils import check_random_state

    def _run_search(self, evaluate_candidates, X, y):
        rng = check_random_state(self.random_state)

        candidate_params = list(self._generate_candidate_params())
        n_iterations = int(ceil(log2(len(candidate_params))))
        n_samples_total = X.shape[0]

        for iter_i in range(n_iterations):
            # randomly sample training samples
            n_candidates = len(candidate_params)
            n_samples_iter = floor(n_samples_total /
                                   (n_candidates * n_iterations))
            indices = rng.choice(n_samples_total, n_samples_iter,
                                 replace=False)
            X_iter, y_iter = X[indices], y[indices]

            more_results = {'iter': [iter_i] * n_candidates,
                            'n_samples': [n_samples_iter] * n_candidates}
            out = evaluate_candidates(candidate_params, X_iter, y_iter,
                                      more_results=more_results)

            # Select the best half of the candidates for the next iteration.
            # We need to filter out candidates from the previous iterations.
            n_candidates_to_keep = ceil(n_candidates / 2)
            best_candidates_indices = np.argsort(out['mean_test_score'])[::-1]
            best_candidates_indices = [i for i in best_candidates_indices
                                       if out['iter'][i] == iter_i]
            best_candidates_indices = \
                best_candidates_indices[:n_candidates_to_keep]
            candidate_params = [out['params'][i]
                                for i in best_candidates_indices]

        assert len(candidate_params) == n_candidates_to_keep == 1

def _refit_callable(results):
    # Custom refit callable returning the index of the best candidate.
    # We want the best candidate out of the last iteration; by default
    # BaseSearchCV would return the best candidate out of all iterations.
    last_iter = np.max(results['iter'])
    sorted_indices = np.argsort(results['mean_test_score'])[::-1]
    best_index = next(i for i in sorted_indices
                      if results['iter'][i] == last_iter)
    return best_index
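
Since scikit-learn already accepts a callable for refit (it receives cv_results_ and must return best_index_), a callable in this spirit can be wired into a plain GridSearchCV. The fallback branch below is an illustrative addition for results that have no 'iter' column:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def best_index_last_iter(results):
    # Pick the best candidate among those evaluated in the last iteration;
    # fall back to a plain argmax when there is no 'iter' column.
    scores = np.asarray(results['mean_test_score'])
    if 'iter' not in results:
        return int(np.argmax(scores))
    iters = np.asarray(results['iter'])
    last_iter = iters.max()
    candidates = np.flatnonzero(iters == last_iter)
    return int(candidates[np.argmax(scores[candidates])])

X, y = make_classification(random_state=0)
search = GridSearchCV(SVC(), {'C': [1, 10]}, cv=3,
                      refit=best_index_last_iter)
search.fit(X, y)
print(search.best_index_)
```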

Example of a dataframe, originally with n_samples = 1000:

Details
iter n_samples mean_fit_time std_fit_time mean_score_time std_score_time param_C param_kernel params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 0 125 0.00114207 0.00066063 0.000282717 4.80541e-05 1 linear {'C': 1, 'kernel': 'linear'} 0.807692 0.846154 0.84 0.791667 0.916667 0.840436 0.0431152 6 0.969697 0.969697 0.96 0.970297 0.970297 0.967998 0.00400779
1 0 125 0.000829029 3.47029e-05 0.000310659 5.15117e-06 1 rbf {'C': 1, 'kernel': 'rbf'} 0.923077 0.923077 0.88 0.875 0.958333 0.911897 0.0309358 2 1 1 1 0.990099 1 0.99802 0.0039604
2 0 125 0.00140367 0.000458439 0.000246429 3.36026e-06 10 linear {'C': 10, 'kernel': 'linear'} 0.923077 0.923077 0.8 0.75 0.833333 0.845897 0.0683726 5 0.989899 0.989899 1 0.990099 0.970297 0.988039 0.0096851
3 0 125 0.000910473 8.3377e-06 0.000307178 5.0881e-06 10 rbf {'C': 10, 'kernel': 'rbf'} 0.923077 0.923077 0.92 0.875 0.958333 0.919897 0.0265078 1 1 1 1 1 1 1 0
4 1 250 0.00220547 0.000168366 0.00052824 6.1141e-05 10 rbf {'C': 10, 'kernel': 'rbf'} 0.921569 0.86 0.92 0.92 0.857143 0.895742 0.0303687 4 1 1 1 1 1 1 0
5 1 250 0.00166616 5.47791e-05 0.000470543 5.46514e-06 1 rbf {'C': 1, 'kernel': 'rbf'} 0.901961 0.92 0.94 0.96 0.836735 0.911739 0.0422311 3 0.979899 0.985 0.98 0.98 0.975124 0.980005 0.00312351

best_index_ is 5

The proposed design makes it easy to implement SuccessiveHalving and would allow any other kind of search estimator. If you're OK with the design, I'll start writing tests and checks.

@NicolasHug
Member Author

@jnothman shall we discuss this during the sprint?

@amueller
Member

amueller commented Mar 7, 2019

I'm in favor of adding SuccessiveHalvingCV btw. I'm happy to put it into experimental if we decide we want that module.

@NicolasHug
Member Author

Joel and I agreed during the sprint that ideally only the CV object would be passed, instead of passing X and y; the subsampling can then be done inside the CV object.
I'll get to it soon and report back.

And yes I think it could be a good fit for the experimental module

@NicolasHug
Member Author

@jnothman , I've been trying to only pass cv instead of X, y.

We still have to pass cv into both _run_search and evaluate_candidates. We cannot avoid passing cv into _run_search: using something like cv = check_cv(self.cv) in _run_search is not possible since we need y to stratify.

So basically this only replaces X and y with cv everywhere. Do you still think it's worth it?

@jnothman
Member

jnothman commented Mar 17, 2019 via email

@NicolasHug
Member Author

What I mean is that currently in fit, the cv object is set as such:

cv = check_cv(self.cv, y, classifier=is_classifier(estimator))

We can't write the same thing in _run_search unless we pass y. So the more natural way is just to pass cv to _run_search.

I don't think that's much better than passing X and y.
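
The dependence on y is visible in check_cv itself: it only returns a stratified splitter when it can see the target, e.g.:

```python
import numpy as np
from sklearn.model_selection import check_cv

# check_cv stratifies only when it is given y (and the task is
# classification); without y it falls back to plain KFold.
y = np.array([0, 0, 0, 1, 1, 1])
cv_with_y = check_cv(3, y, classifier=True)
cv_without_y = check_cv(3, None, classifier=True)
print(type(cv_with_y).__name__, type(cv_without_y).__name__)
```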

@jnothman
Member

jnothman commented Mar 18, 2019 via email

@NicolasHug
Member Author

Hmm, maybe passing indices is a nice alternative, since unlike X and y, indices can have a default value (None).

I think we should pass both row indices and column indices. This way we could also implement feature subsampling: in successive halving the budget can also be the number of features.
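
A sketch of what that subsampling could look like, with None meaning "use everything" (subsample is a hypothetical helper, not part of the PR):

```python
import numpy as np

# Hypothetical helper: restrict rows and/or columns, defaulting to the
# full data when the corresponding indices are None.
def subsample(X, y, sample_indices=None, feature_indices=None):
    if sample_indices is not None:
        X, y = X[sample_indices], y[sample_indices]
    if feature_indices is not None:
        X = X[:, feature_indices]
    return X, y

X = np.arange(12).reshape(4, 3)
y = np.array([0, 1, 0, 1])
X_sub, y_sub = subsample(X, y, sample_indices=[0, 2], feature_indices=[1])
print(X_sub.shape)
```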


I'm not sure I entirely understand what you said before about joblib being able to use a numpy memmap. It seems to me that the memmapping only happens over a single call to Parallel, so multiple calls to evaluate_candidates can't leverage the same memmapping anyway, right?

@jnothman
Member

jnothman commented Mar 19, 2019 via email

@NicolasHug
Member Author

Could you please expand on prepending a SuccessiveHalvingResampler? I'm not following.


  • I've implemented SH by passing indices to evaluate_candidates instead of X and y. It works fine, but doesn't make the code simpler IMO. It requires changing evaluate_candidates like so:

    def evaluate_candidates(candidate_params, sample_indices=None):
      # ...
      if sample_indices is not None:
          X_, y_ = X[sample_indices], y[sample_indices]
      else:
          X_, y_ = X, y
      # ...
      # use X_ and y_ from now on, since using nonlocal isn't an option

    Also, we still need to pass X and y to _run_search anyway, else there is no way we can do the resampling.

    So passing X and y seems to be the most natural way here.

  • It is also possible to remove the more_results parameter. But it is honestly quite helpful and makes the SuccessiveHalving code cleaner and less magical. In particular, @amueller came up with this nice plot, which is only possible with an 'iter' column:

    import pandas as pd

    # sh is a fitted successive-halving search instance
    res = pd.DataFrame(sh.cv_results_)
    res['params_str'] = res.params.apply(str)
    reshape = res.pivot(index='iter', columns='params_str',
                        values='mean_test_score')
    reshape.plot(legend=False, alpha=.4, c='k')

[plot: mean_test_score of each candidate across successive halving iterations]

Would you agree to keep the proposed changes as they are? If yes, I'll start writing tests.

@NicolasHug
Member Author

Also, it'd be useful to introduce a _check_input_parameters method. Right now input checking can only be done in _run_search. WDYT?

@jnothman
Member

Also, it'd be useful to introduce a _check_input_parameters method. Right now input checking can only be done in _run_search. WDYT?

well it can also be done in fit... why would it be better done in _run_search?

@NicolasHug
Member Author

well it can also be done in fit... why would it be better done in _run_search?

I mean it would be nice to have a _check_input_parameters method that would be called in fit, before _run_search.

Right now for successive halving I am forced to do the input checking in _run_search since this is the only method that I can override.

@jnothman
Member

jnothman commented Mar 27, 2019 via email

@NicolasHug
Member Author

Because apart from _run_search, fit doesn't call any method that can be overridden, and I don't want to override the whole fit method.

@jnothman
Member

jnothman commented Mar 27, 2019 via email

@NicolasHug
Member Author

Right, I was thinking of a hook that could be called anywhere in fit, but indeed for input validation this would make sense.

amueller pushed a commit to dabl/dabl that referenced this pull request Mar 27, 2019
~~Still very WIP, need feedback on the interface :)~~

I'm now overwriting `_run_search`, taking advantage of the proposed changes in `BaseGridSearch` from scikit-learn/scikit-learn#13145

Note that this forces to remove the `refit` parameter, which has to be set to a custom callable.

`cv_results_` looks good and it's still useful. For e.g. 4 candidates it's going to have 4 (first iter) + 2 (second and last iter) rows.

If you're OK with this new design I'll start to write tests :)
@adrinjalali
Member

Could you please post a summary of this PR's status and what remains to be done here, @NicolasHug?

@NicolasHug
Member Author

@adrinjalali This PR is actually superseded by #13900

@NicolasHug NicolasHug added the Superseded label (PR has been replaced by a newer PR) and removed the Needs Decision (requires decision) and Waiting for Reviewer labels Aug 7, 2019
@adrinjalali
Member

I see, you said "This builds upon #13145, whose changes are required." on that PR, so I assumed you need this one first. Closing then.

@adrinjalali adrinjalali closed this Aug 7, 2019