Add constant liar strategy to GPSampler (#6430)
Conversation
Benchmark
Benchmark setup
(This benchmark follows the evaluation procedure used in #2964.) I compared several different strategies to fix #6392:
Tested settings
Evaluation
For each (benchmark problem, n_trials, batch_size) setting, I compared the strategies pairwise. If strategy i is significantly better than strategy j (j != i), I count it as a “win” for i. Total wins are then aggregated over all (n_trials, batch_size, problems).
NOTE: The benchmark script is available here. The Constant Liar (CL) results were collected using this implementation, and the Kriging Believer (KB) results were collected using this implementation.
Results (total wins)
NOTE: Most of the wins for the master branch come from BBOB function 1. Since this function has a very simple landscape, the sampler likely reaches near-optimal values within the first few trials. After that, continuing to sample locally around the current best tends to produce slightly better values, which likely explains the advantage observed here. Based on these results, I propose using CL(best) as the default strategy. This result also aligns with existing research on parallel BO (e.g., Ginsbourger et al., "Kriging is well-suited to parallelize optimization." Computational Intelligence in Expensive Optimization Problems. Berlin, Heidelberg: Springer, 2010. 131-162.)
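For readers unfamiliar with the compared strategies, here is a minimal sketch of how the pseudo objective value ("liar" value) for running trials could be chosen. The function name, the strategy labels, and the maximization convention are my assumptions for illustration, not the benchmark code itself:

```python
import numpy as np

def liar_value(y_completed: np.ndarray, strategy: str) -> float:
    """Pseudo objective value assigned to each RUNNING trial.

    Assumes the internal objective is maximized (Optuna's _gp module
    maximizes the negated loss), so CL(best) takes the max.
    """
    if strategy == "cl_best":
        return float(np.max(y_completed))
    if strategy == "cl_worst":
        return float(np.min(y_completed))
    if strategy == "cl_mean":
        return float(np.mean(y_completed))
    raise ValueError(f"unknown strategy: {strategy}")

y = np.array([0.2, 0.8, 0.5])
assert liar_value(y, "cl_best") == 0.8
```

CL(best) is optimistic: pending points are assumed to succeed, which pushes the acquisition function away from them and encourages exploration elsewhere in the batch.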
@y0z Could you review this PR?
| self.noise_var = self.noise_var.detach() |
| self.noise_var.grad = None |
| def append_running_data(self, X_running: torch.Tensor, y_running: torch.Tensor) -> None: |
I think we can compute this more efficiently by reusing self._cov_Y_Y_chol, which we already have.
This would reduce the computational cost from the current $O((n_\mathrm{train} + n_\mathrm{running})^3)$ to $O(n_\mathrm{train}^2\, n_\mathrm{running} + n_\mathrm{train}\, n_\mathrm{running}^2 + n_\mathrm{running}^3)$.
Considering the fact that $n_\mathrm{running}$ is typically much smaller than $n_\mathrm{train}$, this is a substantial saving.
This will also eliminate the need for storing self._cov_Y_Y.
However, it is also acceptable to leave the current one and address this in a follow-up PR, since the current implementation itself is not incorrect.
We can calculate this as follows:
Let $K$ be the train-train kernel matrix (self._cov_Y_Y) and define its Cholesky decomposition as $K = L L^\top$ (self._cov_Y_Y_chol). Let the augmented kernel matrix be

$$K' = \begin{pmatrix} K & B^\top \\ B & C \end{pmatrix} \quad (1)$$

(in this case, $B$ is self.kernel(X_running, X_train) and $C$ is self.kernel(X_running, X_running)).
Then, the Cholesky decomposition of $K'$ is

$$L' = \begin{pmatrix} L & 0 \\ M & N \end{pmatrix},$$

where $M = B L^{-\top}$ and $N$ is the Cholesky factor of $C - M M^\top$.
(Note that we can compute $M = B L^{-\top}$ with a triangular solve, e.g. scipy.linalg.solve_triangular.)
Proof.
Assume that $L' = \begin{pmatrix} L & 0 \\ M & N \end{pmatrix}$. Then

$$L' L'^\top = \begin{pmatrix} L L^\top & L M^\top \\ M L^\top & M M^\top + N N^\top \end{pmatrix}.$$

Comparing this to the blocks of $K'$ in Equation (1) gives $M L^\top = B$.
Thus, it follows that $M = B L^{-\top}$.
Substituting this into the Equation (1) yields $N N^\top = C - M M^\top = C - B K^{-1} B^\top$.
Since $C - B K^{-1} B^\top$ is the Schur complement of $K$ in $K'$ and hence positive definite, its Cholesky factor $N$ exists.
This completes the proof. □
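This block update can be checked numerically with NumPy/SciPy. In the sketch below, a random SPD matrix stands in for the kernel matrix, and the variable names mirror the discussion ($K$, $B$, $C$, $L$, $M$, $N$):

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
n_train, n_running = 5, 2
n = n_train + n_running
A = rng.standard_normal((n, n))
K_all = A @ A.T + n * np.eye(n)          # SPD stand-in for the full kernel matrix

K = K_all[:n_train, :n_train]            # cov(Y, Y), cached as self._cov_Y_Y_chol's source
B = K_all[n_train:, :n_train]            # kernel(X_running, X_train)
C = K_all[n_train:, n_train:]            # kernel(X_running, X_running)

L = np.linalg.cholesky(K)                # cached Cholesky factor of the train block
# M = B L^{-T}, computed with a triangular solve instead of forming L^{-T}
M = scipy.linalg.solve_triangular(L, B.T, lower=True).T
N = np.linalg.cholesky(C - M @ M.T)      # Cholesky of the Schur complement

L_all = np.zeros_like(K_all)
L_all[:n_train, :n_train] = L
L_all[n_train:, :n_train] = M
L_all[n_train:, n_train:] = N

# The assembled block factor reproduces the full Cholesky decomposition.
assert np.allclose(L_all @ L_all.T, K_all)
```

Only the $M$ and $N$ blocks are computed per update, which is where the cost reduction comes from.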
Thank you so much for the precise explanation! I made the update accordingly.
| normalized_params_of_running_trials = torch.from_numpy( |
| normalized_params_of_running_trials |
| ) |
| constant_liar_value = self._gpr._y_train.max() |
Could you add an inline comment about this design choice?
Namely, we could use the mean (posterior mean), median, min, or any other operator, right?
Plus, as you may already know, we often integrate out the value, because we can calculate the posterior distribution for each running trial.
Thank you for the comment. I have added a few inline comments. I also left some notes on issue #6392 regarding the future implementation plan.
optuna/_gp/gp.py
Outdated
| self._squared_X_diff[..., self._is_categorical] = ( |
| self._squared_X_diff[..., self._is_categorical] > 0.0 |
| ).type(torch.float64) |
| self._cov_Y_Y: np.ndarray | None = None |
What about renaming it to one of the following, because the other data is stored as tensors?
- self._cov_Y_Y: np.ndarray | None = None
+ self._cov_Y_Y_array: np.ndarray | None = None
Or
- self._cov_Y_Y: np.ndarray | None = None
+ self._cov_Y_Y_np: np.ndarray | None = None
optuna/_gp/acqf.py
Outdated
| gpr: GPRegressor, |
| search_space: SearchSpace, |
| threshold: float, |
| normalized_params_of_running_trials: np.ndarray = np.array([]), |
This syntax is standard in Python.
- normalized_params_of_running_trials: np.ndarray = np.array([]),
+ normalized_params_of_running_trials: np.ndarray | None = None,
The response from Gemini follows:
Gemini
Why use None for default arguments?
Avoids Unwanted Behavior: Using None as a sentinel value ensures that a fresh, independent list (or other mutable object) is created each time the function is called without a specific argument.
Encapsulation and Predictability: It helps in writing more stable, predictable, and testable code by ensuring function calls are isolated and do not interact with a shared, mutable default object.
Follows Official Style Guides: This approach aligns with recommendations from the official Python style guide, PEP 8, for handling mutable default arguments.
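A minimal illustration of the pitfall Gemini describes (hypothetical helper functions, not code from this PR):

```python
def bad_append(x, acc=[]):        # one shared list is created once, at definition time
    acc.append(x)
    return acc

def good_append(x, acc=None):     # None sentinel: a fresh list per call
    if acc is None:
        acc = []
    acc.append(x)
    return acc

assert bad_append(1) == [1]
assert bad_append(2) == [1, 2]    # surprising: state leaks between calls
assert good_append(1) == [1]
assert good_append(2) == [2]      # independent calls
```

The `np.array([])` default in the original signature is evaluated once in the same way, so the `None` sentinel is the safer, more conventional choice.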
I was thinking about the same thing
optuna/samplers/_gp/sampler.py
Outdated
| trials = study._get_trials(deepcopy=False, states=states, use_cache=True) |
| states = (TrialState.COMPLETE, TrialState.RUNNING) |
| trials = study._get_trials(deepcopy=False, states=states, use_cache=False) |
| complete_trials = [t for t in trials if t.state == TrialState.COMPLETE] |
nit: completed_trials might be more natural.
- complete_trials = [t for t in trials if t.state == TrialState.COMPLETE]
+ completed_trials = [t for t in trials if t.state == TrialState.COMPLETE]
optuna/_gp/gp.py
Outdated
| # NOTE(nabenabe): Here we use NumPy to guarantee the reproducibility from the past. |
| self._cov_Y_Y_chol = torch.from_numpy(cov_Y_Y_chol) |
| self._cov_Y_Y_inv_Y = torch.from_numpy(cov_Y_Y_inv_Y) |
| self._X_train = torch.cat([self._X_train, X_running], dim=0) |
I don't think we should update the training dataset, because we didn't use X_running in the training, i.e., in fit_kernel_params.
- self._X_train = torch.cat([self._X_train, X_running], dim=0)
Maybe, it's even better to remove X_train and y_train after the call of _cache_matrix.
In other words, define X_all or something and use it instead of X_train.
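The suggested split could look roughly like this (a sketch with hypothetical attribute names, using NumPy arrays in place of torch tensors): the fitted training data stays untouched, while posterior computations read from the augmented `*_all` arrays.

```python
import numpy as np

class GPRegressorSketch:
    """Sketch: keep the data used for fitting separate from the augmented data."""

    def __init__(self, X_train: np.ndarray, y_train: np.ndarray) -> None:
        self._X_train = X_train   # consumed only by kernel-parameter fitting
        self._y_train = y_train
        self._X_all = X_train     # consumed by posterior computations
        self._y_all = y_train

    def append_running_data(self, X_running: np.ndarray, y_running: np.ndarray) -> None:
        # Augment only the posterior-side data; the fitted data is untouched.
        self._X_all = np.concatenate([self._X_all, X_running], axis=0)
        self._y_all = np.concatenate([self._y_all, y_running], axis=0)

gpr = GPRegressorSketch(np.zeros((3, 2)), np.zeros(3))
gpr.append_running_data(np.ones((2, 2)), np.ones(2))
assert gpr._X_train.shape == (3, 2) and gpr._X_all.shape == (5, 2)
```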
optuna/_gp/gp.py
Outdated
| self._cov_Y_Y_chol = torch.from_numpy(cov_Y_Y_chol) |
| self._cov_Y_Y_inv_Y = torch.from_numpy(cov_Y_Y_inv_Y) |
| self._X_train = torch.cat([self._X_train, X_running], dim=0) |
| self._y_train = torch.cat([self._y_train, y_running], dim=0) |
- self._y_train = torch.cat([self._y_train, y_running], dim=0)

Thank you all for taking the time to review this PR. Your feedback was very helpful, and I’ve updated the implementation accordingly.
optuna/_gp/gp.py
Outdated
| ) |
| with torch.no_grad(): |
| cov_Y_Y = self.kernel().detach().cpu().numpy() |
| _cov_Y_Y = self.kernel().detach().cpu().numpy() |
nit: Now that it is no longer stored as a member variable, there is no need to prefix it with an underscore.
- _cov_Y_Y = self.kernel().detach().cpu().numpy()
+ cov_Y_Y = self.kernel().detach().cpu().numpy()
optuna/_gp/gp.py
Outdated
| cov_Y_Y_chol = np.zeros((n_total, n_total), dtype=np.float64) |
| cov_Y_Y_chol[:n_train, :n_train] = self._cov_Y_Y_chol.numpy() |
| with torch.no_grad(): |
| kernel_train_running = self.kernel(X_running).detach().cpu().numpy() |
Wouldn't kernel_running_train be more natural, since it is kernel(X_running, X_train)?
- kernel_train_running = self.kernel(X_running).detach().cpu().numpy()
+ kernel_running_train = self.kernel(X_running).detach().cpu().numpy()
| - cov_Y_Y_chol[n_train:, :n_train] @ cov_Y_Y_chol[n_train:, :n_train].T |
| ) |
| self._y_all = torch.cat([self._y_train, y_running], dim=0) |
| cov_Y_Y_inv_Y = scipy.linalg.solve_triangular( |
I think we can also compute this a little more efficiently by reusing self._cov_Y_Y_inv_Y.
However, this is not very important, and we can leave it as it is, since the overall complexity of append_running_data is bottlenecked by computing cov_Y_Y_chol.
Thank you for the helpful suggestion! I agree that the forward substitution, scipy.linalg.solve_triangular(cov_Y_Y_chol, self._y_all.cpu().numpy(), lower=True) can likely be reduced to O(n_train * n_running + n_running**2) by caching the corresponding train-only term,
scipy.linalg.solve_triangular(cov_Y_Y_chol, self._y_train.cpu().numpy(), lower=True) in _cache_matrix.
What I was less certain about was whether the backward substitution could also be reduced to the same complexity. As you mentioned, I think this would be a good optimization to consider in a future improvement.
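The caching idea above can be sketched numerically (names are illustrative): for the block triangular factor, the forward substitution splits into a train-only part, which can be cached in `_cache_matrix`, and a cheap running part computed per update.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(1)
n_train, n_running = 6, 2
n = n_train + n_running
A = rng.standard_normal((n, n))
L_all = np.linalg.cholesky(A @ A.T + n * np.eye(n))  # stand-in for cov_Y_Y_chol

L = L_all[:n_train, :n_train]   # train-train block
M = L_all[n_train:, :n_train]   # running-train block
N = L_all[n_train:, n_train:]   # running-running block
y_train = rng.standard_normal(n_train)
y_running = rng.standard_normal(n_running)

# Cacheable once per fit: O(n_train^2).
z1 = scipy.linalg.solve_triangular(L, y_train, lower=True)
# Per append_running_data: O(n_train * n_running + n_running^2).
z2 = scipy.linalg.solve_triangular(N, y_running - M @ z1, lower=True)

# Matches the full forward substitution over the whole factor.
z_full = scipy.linalg.solve_triangular(
    L_all, np.concatenate([y_train, y_running]), lower=True
)
assert np.allclose(np.concatenate([z1, z2]), z_full)
```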
Thank you for the reply. Indeed, append_running_data is already bottlenecked by the computation of cov_Y_Y_chol, which we cannot improve further.
Therefore, we can achieve only a constant-factor improvement here, which is why I felt this is less important than the one I mentioned above. Also, I'm not certain how much this constant-factor improvement would contribute to the overall runtime without benchmarking it.
optuna/samplers/_gp/sampler.py
Outdated
| return constraint_vals, is_feasible |
| def get_params(trial: FrozenTrial) -> dict[str, Any]: |
Now that this function is only used in this module, we should make this internal.
- def get_params(trial: FrozenTrial) -> dict[str, Any]:
+ def _get_params(trial: FrozenTrial) -> dict[str, Any]:
Thank you for the update! It almost looks good to me, with only some minor naming issues.
I’ve updated the PR to address the comments so far. I’d appreciate it if you could take another look.
kAIto47802
left a comment
Thank you for the update! LGTM
I confirmed the improvement in the BBOB benchmark. LGTM except for the above comments.
Thank you for the check. I’ve updated the code accordingly. PTAL.
Motivation
Fixes #6392.
This PR adds a constant liar strategy to GPSampler. While some trials are still RUNNING, the sampler assigns them a pseudo objective value, namely the best value among completed trials, and includes them in the GP update. This helps batch optimization by reducing the likelihood of proposing the same (or very similar) parameters when multiple evaluations are pending.
At this stage, the feature is limited to single-objective, unconstrained optimization.
Description of the changes
Added a constant liar strategy (cf. TPESampler with multivariate and constant_liar, #6189).
Note