
Add constant liar strategy to GPSampler#6430

Merged
not522 merged 22 commits into optuna:master from sawa3030:add-constant-liar-in-gpsampler
Mar 4, 2026

Conversation

@sawa3030
Collaborator

Motivation

Fixes #6392.

This PR adds a constant liar strategy to GPSampler. While some trials are still RUNNING, the sampler assigns each of them a pseudo objective value (the best value among completed trials) and includes them in the GP update. This helps batch optimization by reducing the likelihood of proposing the same (or very similar) parameters when multiple evaluations are pending.

At this stage, the feature is limited to single-objective, unconstrained optimization.

Description of the changes

Note

  • Since this behavior has been requested by many users, this PR resolves #6392 (Constant Liar strategy for GPSampler) by adding a constant liar strategy.
  • Instead of adding Monte Carlo–based acquisition functions here, we opted for constant liar as a lightweight solution.
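The pseudo-value assignment described above can be sketched as follows. This is an illustrative sketch of the idea only, not the actual GPSampler internals; the function name `augment_with_constant_liar` is hypothetical. Internally Optuna maximizes a normalized objective, so "best" here means the maximum:

```python
# Illustrative sketch of the constant liar idea: before proposing the next
# batch candidate, running trials are added to the GP training data with a
# pseudo objective value equal to the best completed value (maximization).
import numpy as np


def augment_with_constant_liar(
    X_train: np.ndarray, y_train: np.ndarray, X_running: np.ndarray
) -> tuple[np.ndarray, np.ndarray]:
    """Assign each running trial the best completed value ("best" variant)."""
    if len(X_running) == 0:
        return X_train, y_train
    liar_value = y_train.max()  # mean or worst are the other variants benchmarked
    y_running = np.full(len(X_running), liar_value)
    X_all = np.concatenate([X_train, X_running], axis=0)
    y_all = np.concatenate([y_train, y_running], axis=0)
    return X_all, y_all


X_train = np.array([[0.1], [0.5], [0.9]])
y_train = np.array([1.0, 3.0, 2.0])
X_running = np.array([[0.4], [0.6]])  # two pending evaluations
X_all, y_all = augment_with_constant_liar(X_train, y_train, X_running)
print(y_all)  # [1. 3. 2. 3. 3.]
```

The augmented data then flows into the usual GP fit, which pushes the acquisition function away from the pending points.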

@sawa3030
Collaborator Author

Benchmark

Benchmark setup

(This benchmark follows the evaluation procedure used in #2964.)

I compared several different strategies to fix #6392:

  • Constant Liar (pseudo values = best / mean / worst)
  • Kriging Believer
  • Baseline: current master implementation

Tested settings

  • Benchmark problems: 5×BBOB + 3×HPOBench
  • n_trials: [25, 50, 75, 100]
  • batch_size: [5, 10, 50]
  • Repeats: 100 independent runs per (problem, n_trials, batch_size, strategy), using different random seeds

Evaluation

For each (benchmark problem, n_trials, batch_size) combination, I compared best_value between every pair of strategies using a one-sided Mann–Whitney U test (α = 0.05).

If strategy i is significantly better than strategy j (j != i), I count it as a “win” for i. Total wins are then aggregated over all (n_trials, batch_size, problems).
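The pairwise win-counting procedure can be sketched as follows. This is a hedged sketch on synthetic data, not the actual benchmark script; `count_wins` is a hypothetical helper, and minimization of best_value is assumed:

```python
# Sketch of the win counting described above: strategy i "wins" against j
# when its best values are significantly better (smaller, assuming
# minimization) under a one-sided Mann-Whitney U test at alpha = 0.05.
import numpy as np
from scipy.stats import mannwhitneyu


def count_wins(results: dict[str, np.ndarray], alpha: float = 0.05) -> dict[str, int]:
    """results maps strategy name -> array of best_value over repeated runs."""
    wins = {name: 0 for name in results}
    for i in results:
        for j in results:
            if i == j:
                continue
            # alternative="less": is strategy i stochastically smaller than j?
            _, p = mannwhitneyu(results[i], results[j], alternative="less")
            if p < alpha:
                wins[i] += 1
    return wins


rng = np.random.default_rng(0)
results = {
    "CL(best)": rng.normal(0.0, 1.0, 100),  # clearly lower best values
    "master": rng.normal(3.0, 1.0, 100),
}
print(count_wins(results))  # {'CL(best)': 1, 'master': 0}
```

In the real benchmark these win counts are then summed over all (problem, n_trials, batch_size) settings.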

NOTE: The benchmark script is available here. The Constant Liar (CL) results were collected using this implementation, and the Kriging Believer (KB) results were collected using this implementation.

Results (total wins)

| benchmark_type | wins: CL(best) | wins: CL(mean) | wins: CL(worst) | wins: master branch | wins: KB |
|---|---|---|---|---|---|
| bbob | 153 | 109 | 47 | 12 | 84 |
| hpo | 44 | 35 | 18 | 0 | 27 |

NOTE: Most of the wins for the master branch come from BBOB function1. Since this function has a very simple landscape, the sampler likely reaches near-optimal values within the first few trials. After that, continuing to sample locally around the current best tends to produce slightly better values, which likely explains the advantage observed here.

Based on these results, I propose using CL(best) as the default strategy.

This result also aligns with existing research on parallel BO (e.g., Ginsbourger et al., "Kriging is well-suited to parallelize optimization." Computational intelligence in expensive optimization problems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. 131-162.)

@not522
Copy link
Copy Markdown
Member

not522 commented Feb 12, 2026

@y0z Could you review this PR?

self.noise_var = self.noise_var.detach()
self.noise_var.grad = None

def append_running_data(self, X_running: torch.Tensor, y_running: torch.Tensor) -> None:
Collaborator

@kAIto47802 kAIto47802 Feb 13, 2026


I think we can compute this more efficiently by reusing self._cov_Y_Y_chol, which we already have.
This would reduce the computational cost from the current
   $$O((\text{n\_train} + \text{n\_running})^3)$$
   $$= O(\text{n\_train}^3 + 3\,\text{n\_train}^2\:\text{n\_running} + 3\,\text{n\_train}\:\text{n\_running}^2 + \text{n\_running}^3),$$
to
   $$O(\text{n\_train}^2\:\text{n\_running} + \text{n\_train}\:\text{n\_running}^2 + \text{n\_running}^3).$$
Considering that $\text{n\_train}$ is the dominant size ($\text{n\_train} \gg \text{n\_running}$), reducing the order of $\text{n\_train}$ from 3 to 2 will improve runtime.
This will also eliminate the need for storing self._cov_Y_Y.
However, it is also acceptable to leave the current one and address this in a follow-up PR, since the current implementation itself is not incorrect.


We can calculate this as follows:

Let $A$ be a symmetric positive definite matrix (in this case, self._cov_Y_Y) and define its Cholesky decomposition as $LL^\top = A$ (in this case, $L$ corresponds to self._cov_Y_Y_chol). Let $M$ be a symmetric positive definite matrix partitioned as:
   $$M = \begin{pmatrix} A & X \\ X^\top & Y \end{pmatrix}$$
(in this case, $X$ corresponds to self.kernel(X_running), and $Y$ corresponds to self.kernel(X_running, X_running)).
Then, the Cholesky decomposition of $M$ is given as follows:
   $$\mathrm{chol}(M) = \begin{pmatrix} L & 0 \\ Z^\top & C \end{pmatrix},$$
where $Z := L^{-1} X$ and $C := \mathrm{chol}(Y - Z^\top Z)$.
Note that we can compute $Z$ efficiently by solving $LZ = X$, since $L$ is a triangular matrix. (Use scipy.linalg.solve_triangular.)

Proof.
Assume that $M$ admits a block Cholesky factorization of the form:
   $$M = \begin{pmatrix} L & 0 \\ B & C \end{pmatrix} \begin{pmatrix} L^\top & B^\top \\ 0 & C^\top \end{pmatrix} = \begin{pmatrix} LL^\top & LB^\top \\ BL^\top & BB^\top + CC^\top \end{pmatrix}.$$
Comparing this to $M = \begin{pmatrix} A & X \\ X^\top & Y \end{pmatrix}$, we obtain:
   $$LB^\top = X \quad \text{and} \quad BB^\top + CC^\top = Y. \tag{1}$$
Thus, it follows that $B^\top = L^{-1}X$. Defining $Z := L^{-1}X$, we have $B = Z^\top$, and hence $BB^\top = Z^\top Z$.
Substituting this into Equation (1) yields:
   $$CC^\top = Y - Z^\top Z = Y - (L^{-1}X)^\top L^{-1}X = Y - X^\top (L^{-1})^\top L^{-1}X = Y - X^\top A^{-1} X.$$
Since $M$ is symmetric positive definite, the Schur complement $Y - X^\top A^{-1} X = Y - Z^\top Z$ is also symmetric positive definite. Therefore, its Cholesky factor exists and $C$ can be chosen as $C = \mathrm{chol}(Y - Z^\top Z)$.
This completes the proof. □
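The block-Cholesky update above can be checked numerically with a self-contained NumPy/SciPy sketch, independent of the Optuna code (matrix sizes are arbitrary test values):

```python
# Numerical check of the block Cholesky update: given chol(A) = L, the
# Cholesky factor of M = [[A, X], [X^T, Y]] is [[L, 0], [Z^T, C]] with
# Z = L^{-1} X (via triangular solve) and C = chol(Y - Z^T Z).
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n_train, n_running = 6, 2
n = n_train + n_running
Q = rng.normal(size=(n, n))
M = Q @ Q.T + n * np.eye(n)  # symmetric positive definite test matrix

A = M[:n_train, :n_train]
X = M[:n_train, n_train:]
Y = M[n_train:, n_train:]

L = np.linalg.cholesky(A)               # cached at fit time in practice
Z = solve_triangular(L, X, lower=True)  # O(n_train^2 * n_running)
C = np.linalg.cholesky(Y - Z.T @ Z)     # Schur complement, O(n_running^3)

chol_M = np.zeros_like(M)
chol_M[:n_train, :n_train] = L
chol_M[n_train:, :n_train] = Z.T
chol_M[n_train:, n_train:] = C

# Matches the from-scratch factorization, avoiding the O((n_train+n_running)^3) cost.
assert np.allclose(chol_M, np.linalg.cholesky(M))
```

By uniqueness of the Cholesky factor (with positive diagonal), the assembled block factor and the full factorization agree exactly up to floating-point error.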

Collaborator Author


Thank you so much for the precise explanation! I made the update accordingly.

normalized_params_of_running_trials = torch.from_numpy(
    normalized_params_of_running_trials
)
constant_liar_value = self._gpr._y_train.max()
Contributor

@nabenabe0928 nabenabe0928 Feb 14, 2026


Could you add an inline comment about this design choice?
Namely, we could use mean (posterior mean), median, min, or any kind of operators, right?
Plus, as you may already know, we often integrate out the value because we can calculate the posterior distributions for each running trial.

Collaborator Author


Thank you for the comment. I have added a few inline comments. I also left some notes on issue #6392 regarding the future implementation plan.

optuna/_gp/gp.py Outdated
self._squared_X_diff[..., self._is_categorical] = (
    self._squared_X_diff[..., self._is_categorical] > 0.0
).type(torch.float64)
self._cov_Y_Y: np.ndarray | None = None
Contributor


What about renaming it to one of the following, since the other data is stored as tensors?

Suggested change
self._cov_Y_Y: np.ndarray | None = None
self._cov_Y_Y_array: np.ndarray | None = None

Or

Suggested change
self._cov_Y_Y: np.ndarray | None = None
self._cov_Y_Y_np: np.ndarray | None = None

gpr: GPRegressor,
search_space: SearchSpace,
threshold: float,
normalized_params_of_running_trials: np.ndarray = np.array([]),
Contributor


Using None as the default is the standard idiom in Python.

Suggested change
normalized_params_of_running_trials: np.ndarray = np.array([]),
normalized_params_of_running_trials: np.ndarray | None = None,

The response from Gemini follows:

Gemini

Why use None for default arguments?
Avoids Unwanted Behavior: Using None as a sentinel value ensures that a fresh, independent list (or other mutable object) is created each time the function is called without a specific argument.
Encapsulation and Predictability: It helps in writing more stable, predictable, and testable code by ensuring function calls are isolated and do not interact with a shared, mutable default object.
Follows Official Style Guides: This approach aligns with recommendations from the official Python style guide, PEP 8, for handling mutable default arguments.
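The pitfall behind this recommendation can be demonstrated in a few lines (function names are illustrative):

```python
# A mutable default argument is evaluated once at function definition and
# shared across every call; the None-sentinel idiom creates a fresh object
# per call instead.
def buggy(x, acc=[]):  # the default list is created exactly once
    acc.append(x)
    return acc


def safe(x, acc=None):  # PEP 8-style sentinel
    if acc is None:
        acc = []  # fresh, independent list per call
    acc.append(x)
    return acc


print(buggy(1), buggy(2))  # [1, 2] [1, 2] -- both calls share one list
print(safe(1), safe(2))    # [1] [2]
```

A default of `np.array([])` is less dangerous as long as it is never mutated, but the `None` sentinel avoids the question entirely and keeps the signature honest about "no running trials".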

Collaborator


I was thinking about the same thing

trials = study._get_trials(deepcopy=False, states=states, use_cache=True)
states = (TrialState.COMPLETE, TrialState.RUNNING)
trials = study._get_trials(deepcopy=False, states=states, use_cache=False)
complete_trials = [t for t in trials if t.state == TrialState.COMPLETE]
Collaborator


nit: completed_trials might be more natural.

Suggested change
complete_trials = [t for t in trials if t.state == TrialState.COMPLETE]
completed_trials = [t for t in trials if t.state == TrialState.COMPLETE]

optuna/_gp/gp.py Outdated
# NOTE(nabenabe): Here we use NumPy to guarantee the reproducibility from the past.
self._cov_Y_Y_chol = torch.from_numpy(cov_Y_Y_chol)
self._cov_Y_Y_inv_Y = torch.from_numpy(cov_Y_Y_inv_Y)
self._X_train = torch.cat([self._X_train, X_running], dim=0)
Contributor

@nabenabe0928 nabenabe0928 Feb 14, 2026


I don't think we should update the training dataset because we didn't use X_running in the training, which is fit_kernel_params.

Suggested change
self._X_train = torch.cat([self._X_train, X_running], dim=0)

Maybe, it's even better to remove X_train and y_train after the call of _cache_matrix.
In other words, define X_all or something and use it instead of X_train.

optuna/_gp/gp.py Outdated
self._cov_Y_Y_chol = torch.from_numpy(cov_Y_Y_chol)
self._cov_Y_Y_inv_Y = torch.from_numpy(cov_Y_Y_inv_Y)
self._X_train = torch.cat([self._X_train, X_running], dim=0)
self._y_train = torch.cat([self._y_train, y_running], dim=0)
Contributor


Suggested change
self._y_train = torch.cat([self._y_train, y_running], dim=0)

@sawa3030
Collaborator Author

Thank you all for taking the time to review this PR. Your feedback was very helpful, and I’ve updated the implementation accordingly.

optuna/_gp/gp.py Outdated
)
with torch.no_grad():
cov_Y_Y = self.kernel().detach().cpu().numpy()
_cov_Y_Y = self.kernel().detach().cpu().numpy()
Collaborator


nit: Now that it is no longer stored as a member variable, there is no need to prefix it with an underscore.

Suggested change
_cov_Y_Y = self.kernel().detach().cpu().numpy()
cov_Y_Y = self.kernel().detach().cpu().numpy()

optuna/_gp/gp.py Outdated
cov_Y_Y_chol = np.zeros((n_total, n_total), dtype=np.float64)
cov_Y_Y_chol[:n_train, :n_train] = self._cov_Y_Y_chol.numpy()
with torch.no_grad():
kernel_train_running = self.kernel(X_running).detach().cpu().numpy()
Collaborator


Wouldn't kernel_running_train be more natural, since it is kernel(X_running, X_train)?

Suggested change
kernel_train_running = self.kernel(X_running).detach().cpu().numpy()
kernel_running_train = self.kernel(X_running).detach().cpu().numpy()

- cov_Y_Y_chol[n_train:, :n_train] @ cov_Y_Y_chol[n_train:, :n_train].T
)
self._y_all = torch.cat([self._y_train, y_running], dim=0)
cov_Y_Y_inv_Y = scipy.linalg.solve_triangular(
Collaborator


I think we can also compute this a little more efficiently by reusing self._cov_Y_Y_inv_Y, reducing $O((\text{n\_train} + \text{n\_running})^2)$ to $O(\text{n\_train}\:\text{n\_running} + \text{n\_running}^2)$.
However, this is not very important and we can leave it as it is, since the overall complexity of append_running_data is bottlenecked by computing cov_Y_Y_chol.

Collaborator Author


Thank you for the helpful suggestion! I agree that the forward substitution, scipy.linalg.solve_triangular(cov_Y_Y_chol, self._y_all.cpu().numpy(), lower=True) can likely be reduced to O(n_train * n_running + n_running**2) by caching the corresponding train-only term,
scipy.linalg.solve_triangular(cov_Y_Y_chol, self._y_train.cpu().numpy(), lower=True) in _cache_matrix.

What I was less certain about was whether the backward substitution could also be reduced to the same complexity. As you mentioned, I think this would be a good optimization to consider in a future improvement.
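The split forward substitution discussed here can be sketched as follows. This is an assumption about the caching idea, not the merged code; variable names are illustrative:

```python
# With the block-lower-triangular factor [[L, 0], [Z^T, C]], the forward
# substitution for [y_train; y_running] splits into a train-only part
# (cacheable at fit time) and a cheap running-only correction:
#   L a = y_train                 (cache a in _cache_matrix)
#   C b = y_running - Z^T a       (O(n_train*n_running + n_running^2))
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n_train, n_running = 5, 2
n = n_train + n_running
Q = rng.normal(size=(n, n))
chol = np.linalg.cholesky(Q @ Q.T + n * np.eye(n))  # full factor [[L,0],[Z^T,C]]
L = chol[:n_train, :n_train]
Zt = chol[n_train:, :n_train]
C = chol[n_train:, n_train:]
y_train = rng.normal(size=n_train)
y_running = rng.normal(size=n_running)

a = solve_triangular(L, y_train, lower=True)              # cacheable term
b = solve_triangular(C, y_running - Zt @ a, lower=True)   # running-only correction
blockwise = np.concatenate([a, b])

full = solve_triangular(chol, np.concatenate([y_train, y_running]), lower=True)
assert np.allclose(blockwise, full)
```

The open question in the thread is whether the subsequent backward substitution admits a similar split, since its triangular factor couples the train and running blocks the other way around.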

Collaborator

@kAIto47802 kAIto47802 Feb 25, 2026


Thank you for the reply. Indeed, yes. We need $W := A^{-1}X$, obtained by solving $L^\top W = Z$, which takes $O(\text{n\_train}^2\:\text{n\_running})$. Also, the overall complexity of append_running_data is already bottlenecked by the computation of cov_Y_Y_chol, which we cannot improve further.
Therefore, we can achieve only a constant-factor improvement here, which is why I felt this is not very important in contrast to the one I mentioned here. Also, I'm not very certain how this constant-factor improvement contributes to the overall runtime without conducting benchmarking on it.

return constraint_vals, is_feasible


def get_params(trial: FrozenTrial) -> dict[str, Any]:
Collaborator


Now that this function is only used in this module, we should make this internal.

Suggested change
def get_params(trial: FrozenTrial) -> dict[str, Any]:
def _get_params(trial: FrozenTrial) -> dict[str, Any]:

@kAIto47802
Collaborator

Thank you for the update! It almost looks good to me, with only some minor naming issues.

@c-bata c-bata unassigned y0z Feb 25, 2026
@c-bata c-bata added the feature Change that does not break compatibility, but affects the public interfaces. label Feb 25, 2026
@sawa3030
Collaborator Author

I’ve updated the PR to address the comments so far. I’d appreciate it if you could take another look.

Collaborator

@kAIto47802 kAIto47802 left a comment


Thank you for the update! LGTM

@not522
Member

not522 commented Feb 27, 2026

I confirmed the improvement in the BBOB benchmark. LGTM except for the above comments.

@sawa3030
Collaborator Author

sawa3030 commented Mar 3, 2026

Thank you for the check. I’ve updated the code accordingly. PTAL.

Member

@not522 not522 left a comment


LGTM!

@not522 not522 added this to the v4.8.0 milestone Mar 4, 2026
@not522 not522 changed the title Add constant liar strategy to GPSampler Mar 4, 2026
@not522 not522 merged commit 41618e2 into optuna:master Mar 4, 2026
13 checks passed
