Speed up GPSampler by Batching Acquisition Function Evaluations #6268

Merged
nabenabe0928 merged 71 commits into optuna:master from Kaichi-Irie:add-batched-acqf-eval-greenlet-for-PR
Sep 11, 2025

Conversation

@Kaichi-Irie (Contributor) commented Sep 8, 2025

Motivation

The multi-start optimization of the acquisition function in the GPSampler is a significant performance bottleneck, accounting for 50-60% of its total execution time. The current implementation uses 10 initial points (n_local_search=10) and performs optimization from each point sequentially.

This pull request aims to significantly speed up this multi-start optimization by batching the evaluation of the acquisition function and its gradient. Benchmark results show that this change can reduce the execution time by approximately 40-50% without compromising the optimization accuracy of the objective function.

Description of the Changes

Previous Sequential Approach

The previous implementation runs an optimization loop for each starting point individually.

# Previous pseudo-code
def multistart_optimize(acqf, x0_list, ...):
    ...
    min_fvals = ...
    min_xs = ...
    for i, x0 in enumerate(x0_list):
        # optimize() performs iterative optimization for a single initial point x0
        min_fval, min_x = optimize(acqf, x0, ...)
        min_fvals[i] = min_fval
        min_xs[i] = min_x
    best_idx = np.argmin(min_fvals)
    return min_xs[best_idx], min_fvals[best_idx]
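To make the baseline concrete, here is a minimal, self-contained sketch of the sequential pattern above. The quadratic `acqf`, the plain gradient-descent `optimize`, and all names are illustrative stand-ins; the real code minimizes the GP-based acquisition function with L-BFGS-B:

```python
import numpy as np

def acqf(x):
    # Stand-in acquisition function to minimize; the real one is the
    # GP-based acquisition value.
    return float((x ** 2).sum())

def acqf_grad(x):
    # Gradient of the stand-in quadratic.
    return 2.0 * x

def optimize(x0, lr=0.1, n_steps=100):
    # One local search from a single starting point (the real code uses
    # L-BFGS-B rather than plain gradient descent).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * acqf_grad(x)
    return acqf(x), x

def multistart_optimize(x0_list):
    # Sequential multi-start: one full local search per starting point.
    min_fvals = np.empty(len(x0_list))
    min_xs = np.empty((len(x0_list), len(x0_list[0])))
    for i, x0 in enumerate(x0_list):
        min_fvals[i], min_xs[i] = optimize(x0)
    best_idx = int(np.argmin(min_fvals))
    return min_xs[best_idx], min_fvals[best_idx]
```

Each call to `optimize` runs to completion before the next start is touched, which is exactly the serialization the PR removes.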

Proposed Batched Approach

The proposed approach executes each optimization step for all initial points in a single batch.

Evaluations of the acquisition function (acqf) and its gradient (acqf_grad) for all starting points are performed at once using vectorized operations on torch.Tensor. The iterative optimization from each starting point is managed cooperatively using greenlet: in each iteration, the function values for all starting points are evaluated as a single batch, then each local search advances by one step to determine its next candidate point, and those points form the input for the following batch evaluation.
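The batched evaluation itself is straightforward with autograd: stack the candidate points into a (batch, dim) tensor, run one forward pass, and a single backward pass recovers the gradient at every point. A minimal sketch with a stand-in quadratic acquisition function (the real acqf is the GP-based acquisition value):

```python
import torch

def acqf_batched(x: torch.Tensor) -> torch.Tensor:
    # Stand-in acquisition function; x has shape (batch, dim),
    # the output has shape (batch,).
    return (x ** 2).sum(dim=1)

# Ten starting points in five dimensions, evaluated in one forward pass.
x = torch.linspace(-1.0, 1.0, 50).reshape(10, 5).requires_grad_(True)
fvals = acqf_batched(x)

# Each fvals[i] depends only on row i of x, so summing before backward()
# yields the per-point gradients in a single backward pass.
fvals.sum().backward()
grads = x.grad  # shape (10, 5): the gradient of acqf at every start point
```

Because each output element depends only on its own row, the gradient of the sum decomposes row-wise, so `x.grad` holds the per-start-point gradients with no loop.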

# New pseudo-code
def multistart_optimize(acqf, x0_list, ...):
    ...
    # optimize_batched manages optimization for multiple initial points x0_list
    min_xs, min_fvals = optimize_batched(acqf, x0_list, ...)
    best_idx = np.argmin(min_fvals)
    return min_xs[best_idx], min_fvals[best_idx]
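The cooperative scheduling behind `optimize_batched` can be sketched without Optuna's internals. The version below uses plain Python generators in place of greenlet to show the same yield-and-resume pattern (the actual PR needs greenlet, presumably because the real per-start optimizer, scipy's L-BFGS-B, evaluates the objective from inside its own loop and cannot yield); the toy update rule and all names are illustrative:

```python
import numpy as np

def f_batched(xs):
    # Stand-in for a vectorized acquisition function: evaluates every
    # candidate point in a single call (shape (batch, dim) -> (batch,)).
    return (np.asarray(xs, dtype=float) ** 2).sum(axis=1)

def local_search(x0, n_steps):
    # One local search written as a generator: it yields the point it wants
    # evaluated and is resumed with the batched function value.
    x = np.asarray(x0, dtype=float)
    fval = yield x                 # request evaluation of the starting point
    for _ in range(n_steps):
        x = 0.5 * x                # toy update standing in for an optimizer step
        fval = yield x             # request evaluation of the new point
    return x, fval

def multistart_optimize(x0_list, n_steps=3):
    searches = [local_search(x0, n_steps) for x0 in x0_list]
    pending = [s.send(None) for s in searches]   # prime: first candidate points
    results, active = {}, list(range(len(searches)))
    while active:
        # ONE batched evaluation per round, covering all active searches.
        fvals = f_batched([pending[i] for i in active])
        still = []
        for fv, i in zip(fvals, active):
            try:
                pending[i] = searches[i].send(fv)  # resume with its own value
                still.append(i)
            except StopIteration as done:
                results[i] = done.value            # search finished: (x, fval)
        active = still
    return min(results.values(), key=lambda r: r[1])
```

Each round performs exactly one batched objective call regardless of how many searches are still running, which is where the speedup comes from; searches that converge early simply drop out of the batch.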

For backward compatibility, a fallback mechanism has been implemented. If greenlet fails to import, the logic will revert to the original sequential execution.
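The guard itself can be as simple as a try/except around the import. The flag and helper names below are illustrative, not Optuna's actual internals, and the helper bodies are placeholders:

```python
try:
    import greenlet  # noqa: F401  -- ships with sqlalchemy, so usually present
    _BATCHED_AVAILABLE = True
except ImportError:
    _BATCHED_AVAILABLE = False

def _optimize_sequential(acqf, x0_list):
    # Placeholder for the original per-start loop.
    results = [(acqf(x0), x0) for x0 in x0_list]
    return min(results)

def _optimize_batched(acqf, x0_list):
    # Placeholder for the greenlet-driven batched path.
    return _optimize_sequential(acqf, x0_list)

def multistart_optimize(acqf, x0_list):
    if _BATCHED_AVAILABLE:
        return _optimize_batched(acqf, x0_list)
    return _optimize_sequential(acqf, x0_list)  # sequential fallback
```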

Dependencies

This change utilizes greenlet, which is already included as a dependency of sqlalchemy, a core Optuna dependency. Therefore, this PR does not introduce any new library dependencies.

Performance Evaluation (Benchmark)

To validate the effectiveness of the proposed method, we compared the performance in the following three modes:

  1. This PR (Batched): The implementation in this PR with batch evaluation enabled.
  2. original: The current implementation on the master branch.
  3. This PR (Fallback): The implementation in this PR with batch evaluation intentionally disabled to activate the sequential fallback logic. This helps confirm that the new implementation's overhead is acceptable and that the implementation is functioning correctly.

Experimental Setup

  • CPU: Apple M4 (10-core)
  • Trials: 300 trials per study
  • Seed: 42, 43, 44 (results are averaged over three runs)
  • Objective Functions:
    • rotated_ellipsoid (10 dimensions)
    • f6 from optunahub/benchmarks/bbob (20 dimensions)
    • wfg (2 objectives, 3 dimensions, k=1) from optunahub/benchmarks/wfg

Results

| Objective Function (Dimension) | Mode | Avg. Execution Time (s) | Avg. Best Objective Value |
| --- | --- | --- | --- |
| f6 (D=20) | This PR (Batched) | 19.0 | 133.2 |
| f6 (D=20) | original | 35.7 | 155.9 |
| f6 (D=20) | This PR (Fallback) | 31.9 | 159.2 |
| rotated_ellipsoid (D=10) | This PR (Batched) | 19.4 | 3.64e-4 |
| rotated_ellipsoid (D=10) | original | 39.6 | 3.75e-4 |
| rotated_ellipsoid (D=10) | This PR (Fallback) | 34.0 | 2.28e-4 |
| wfg (D=3) | This PR (Batched) | 35.3 | - |
| wfg (D=3) | original | 74.4 | - |
| wfg (D=3) | This PR (Fallback) | 66.9 | - |

(Figure: speed_comparison_plot)

The following plot shows that the hypervolume values remain consistent across all modes, confirming that the change does not negatively impact search performance.

(Figure: hypervolume_history)

Analysis

  • The implementation with batched evaluation is approximately 1.8x to 2.0x faster than the current implementation.
  • The best objective values obtained are comparable across all modes. The observed variations are likely due to changes in the order of floating-point operations and do not indicate a degradation in the sampler's search performance.
  • The fallback mode is roughly as fast as the original implementation (slightly faster in these runs), suggesting that the structural overhead of the proposed changes is minimal.

Objective Value Progression:

f6 (D=20, seed=42)

The following tables show that the objective values remain consistent across all modes, confirming that the change does not negatively impact search performance.

| mode \ Trial | Trial 9 | Trial 10 | Trial 11 | Trial 12 | Trial 13 | Trial 14 |
| --- | --- | --- | --- | --- | --- | --- |
| This PR (Batched) | 1193964.5576430012 | 435258.7193614113 | 429479.3338986842 | 328199.63425374887 | 541828.1312516633 | 571560.8492535071 |
| This PR (Fallback) | 1193964.5576430012 | 435258.7193614113 | 429479.33389868634 | 328199.63425374887 | 541828.1312516204 | 571560.8492535099 |
| original | 1193964.5576430012 | 435258.7193614113 | 429479.3338986842 | 328199.6342537497 | 541828.1312516493 | 571560.8492535143 |
rotated_ellipsoid (D=10, seed=42)

| mode \ Trial | Trial 9 | Trial 10 | Trial 11 | Trial 12 | Trial 13 | Trial 14 |
| --- | --- | --- | --- | --- | --- | --- |
| This PR (Batched) | 12.130935341373624 | 6.497476295603841 | 9.159600880097619 | 11.184451388408986 | 13.400053021276653 | 10.290407711547488 |
| This PR (Fallback) | 12.130935341373624 | 6.497476295603834 | 9.159600880097601 | 11.184451388408993 | 13.400053021279028 | 10.290407711541484 |
| original | 12.130935341373624 | 6.497476295603843 | 9.159600880097617 | 11.18445138840904 | 13.40005302128083 | 10.290407711537135 |
Benchmark Code
import json
import os
from itertools import product

import numpy as np
import optunahub

import optuna

np.random.seed(42)

NUM_CONTINUOUS_PARAMS = 10
ROTATION_MATRIX, _ = np.linalg.qr(np.random.rand(NUM_CONTINUOUS_PARAMS, NUM_CONTINUOUS_PARAMS))

CONDITIONING = 10
WEIGHTS = np.array([CONDITIONING**(i / (NUM_CONTINUOUS_PARAMS - 1)) for i in range(NUM_CONTINUOUS_PARAMS)])

def rotated_ellipsoid(trial):
    xs = np.array([trial.suggest_float(f"x_{i}", -1, 1) for i in range(NUM_CONTINUOUS_PARAMS)])
    rotated_xs = ROTATION_MATRIX @ xs
    return np.sum(WEIGHTS * (rotated_xs**2))

bbob = optunahub.load_module("benchmarks/bbob")
f6 = bbob.Problem(function_id=6, dimension=20, instance_id=1)

def execute_benchmark(
    mode: str,
    objective_type: str,
    n_trials: int,
    seed: int,
    results_file="results.jsonl",
    output_dir="output",
):
    sampler = optuna.samplers.GPSampler(seed=seed)
    name = f"{objective_type}_{seed}_{mode}_{n_trials}"
    study = optuna.create_study(study_name=name, sampler=sampler)

    study.optimize(func=f6 if objective_type == "f6" else rotated_ellipsoid, n_trials=n_trials)
    print(study.best_trial.params, study.best_trial.value)
    elapsed = (study.trials[-1].datetime_complete - study.trials[0].datetime_start).total_seconds() # type: ignore
    print(f"{mode} took {elapsed:f} seconds. ")

    result = {
        "objective_type": objective_type,
        "seed": seed,
        "mode": mode,
        "elapsed": round(elapsed, 2),
        "n_trials": n_trials,
        "best_value": study.best_trial.value,
    }

    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, results_file), "a") as f:
        f.write(json.dumps(result) + "\n")

seeds = [42,43,44]
n_trials = 300

# To test different modes, you need to edit the source code of the GPSampler directly.
# This script runs with the batched evaluation enabled by default.
modes: list[str] = [
    "This PR (Batched)",
    # "This PR (Fallback)",
    # "original",
]

objective_types = [
    "rotated_ellipsoid",
    "f6",
]

for seed, mode, objective_type in product(seeds, modes, objective_types):
    execute_benchmark(
        mode=mode,
        objective_type=objective_type,
        n_trials=n_trials,
        seed=seed,
    )

Tests

Unit tests must be added.

@not522 (Member) left a comment:

Thank you for your PR! Could you check my small comments?

@not522 (Member) left a comment:
Thank you for your update. LGTM!
I confirmed the speed improvement in my environment.

Comment on lines +60 to +61
normalized_params[: len(scaled_x), continuous_indices] = scaled_x * lengthscales
x_tensor = torch.from_numpy(normalized_params[: len(scaled_x), :]).requires_grad_(True)
A contributor commented:

This slicing is incorrect when there are discrete parameters.
For example, consider the case where normalized_params = np.array([[1.0, 0.0], [0.0, 1.0]]) and normalized_params[:, 1] is the discrete dimension.
Say the first batch has converged and the second is still in progress; the current slicing then uses the discrete parameter value 0.0, which actually belongs to the first batch.
But we want the discrete parameter from the second batch, so we need the unconverged batch indices from batched_l_bfgs_b in order to assign the discrete parameter values correctly.
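The indexing problem can be reproduced in a few lines. The array names follow the snippet above, and active_indices stands in for the unconverged-batch indices the fix passes around:

```python
import numpy as np

# One row per local search; column 1 is the discrete dimension.
normalized_params = np.array([[1.0, 0.0],
                              [0.0, 1.0]])
# Search 0 has converged; only search 1 is still active.
active_indices = np.array([1])
scaled_x = np.array([[0.3]])  # continuous values for the active searches only

# Buggy slicing: rows 0..len(scaled_x) are the *first* searches, not the
# active ones, so the discrete value 0.0 from converged search 0 is used.
wrong = normalized_params[: len(scaled_x), :].copy()
wrong[:, 0] = scaled_x[:, 0]

# Fixed: select rows by the still-active indices, so the discrete value 1.0
# from search 1 is paired with its own continuous value.
right = normalized_params[active_indices, :].copy()
right[:, 0] = scaled_x[:, 0]
```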

@Kaichi-Irie (Contributor, Author) replied on Sep 11, 2025:
Thanks for pointing out the bug! You are right.
I added batch indices as an argument to the fun_and_grad function, as well as the corresponding code. This change appears to be working; the runtime and accuracy seem fine.

codecov bot commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 96.36364% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.15%. Comparing base (57fade1) to head (6624800).
⚠️ Report is 536 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| optuna/_gp/batched_lbfgsb.py | 96.29% | 2 Missing ⚠️ |
| optuna/_gp/optim_mixed.py | 96.42% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6268      +/-   ##
==========================================
- Coverage   89.24%   89.15%   -0.10%     
==========================================
  Files         208      209       +1     
  Lines       13821    13908      +87     
==========================================
+ Hits        12334    12399      +65     
- Misses       1487     1509      +22     


@nabenabe0928 (Contributor) left a comment:
LGTM, let's work on followups in (an)other PR(s)


Labels

enhancement Change that does not break compatibility and not affect public interfaces, but improves performance.
