
Fix shared callback state in parallel OptunaSearchCV with LightGBM#260

Merged
not522 merged 2 commits into optuna:main from Quant-Quasar:fix-issue-240-shared-callbacks
Jan 22, 2026

Conversation

@Quant-Quasar
Contributor

Motivation

When using OptunaSearchCV with n_jobs > 1, users may pass stateful callback objects
(e.g., LightGBM early stopping callbacks) via fit_params.

Because fit_params is shared across parallel workers, callback objects inside it can
be shared by reference, leading to unintended cross-trial interference.
In practice, this can cause multiple parallel trials to observe and mutate the same
callback state (such as best_iteration), resulting in incorrect or identical early
stopping behavior across trials.

This issue was reported in #240 and can silently affect optimization results when running
parallel cross-validation with callbacks.

Description of the changes

This PR ensures proper isolation of callback state across parallel trials by:

  • Creating a shallow copy of fit_params at objective execution time.
  • Deep-copying only the callbacks entry (if present) to provide each trial with an
    independent callback instance.
  • Leaving all other entries in fit_params unchanged to avoid unnecessary duplication
    of large or expensive objects.

The change is localized to the objective execution path and does not affect existing
behavior for single-threaded runs or workflows that do not use callbacks.

This prevents shared mutable state between parallel trials while keeping memory and
performance overhead minimal.

@Quant-Quasar Quant-Quasar force-pushed the fix-issue-240-shared-callbacks branch 2 times, most recently from 76e5c15 to 05662a1 on December 26, 2025 08:06
@Quant-Quasar Quant-Quasar force-pushed the fix-issue-240-shared-callbacks branch from 05662a1 to 6a15eac on December 26, 2025 08:18
@not522 not522 self-assigned this Dec 26, 2025
@not522
Member

not522 commented Dec 26, 2025

Thank you for your PR! Since we're approaching the year-end and New Year holidays, I'll review the details after New Year's Day.

@Quant-Quasar
Contributor Author

Thank you for the update! I completely understand.
Please take your time, and I’m happy to follow up after New Year’s.

@github-actions

github-actions bot commented Jan 4, 2026

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 4, 2026
@c-bata c-bata removed the stale Exempt from stale bot labeling. label Jan 5, 2026
@not522
Member

not522 commented Jan 8, 2026

Thank you for waiting.
I haven't been able to reproduce the original bug report. Did you manage to reproduce it in your environment?

@Quant-Quasar
Contributor Author

Thanks for checking!

Yes, I was able to reproduce the issue locally under parallel execution. The key requirement is running OptunaSearchCV with n_jobs > 1 and passing a stateful callback (e.g., LightGBM early stopping) via fit_params.

I used a setup similar to the following (simplified):

import optuna
import lightgbm as lgb
from optuna.integration import OptunaSearchCV

# X_train, X_test, y_train, y_test prepared beforehand

early_stop = lgb.early_stopping(stopping_rounds=10)

oscv = OptunaSearchCV(
    lgb.LGBMClassifier(n_estimators=100, verbose=-1),
    param_distributions={
        "learning_rate": optuna.distributions.FloatDistribution(0.01, 0.2),
        "num_leaves": optuna.distributions.IntDistribution(10, 20),
    },
    cv=3,
    n_jobs=2,
    n_trials=4,
)

oscv.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="auc",
    callbacks=[early_stop],  # shared callback instance
)

In this configuration, the callback object inside fit_params["callbacks"] is shared across parallel workers. As a result, multiple trials mutate the same callback state (e.g., best_iteration_), which causes cross-trial interference. This typically manifests as identical or prematurely stopped iteration counts across different trials rather than a hard crash.

The behavior is timing- and backend-dependent, so it may not reproduce deterministically in all environments, but the underlying issue is the shared mutable callback state.

The proposed fix isolates the callback instances per worker by shallow-copying fit_params and deep-copying only the callbacks list at execution time. This prevents shared-state corruption without unnecessarily duplicating large objects.
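The failure mode can be illustrated without LightGBM at all. The class below is a stand-in for a stateful early-stopping callback, not LightGBM's actual internals:

```python
import copy

class StatefulCallback:
    """Stand-in for a stateful early-stopping callback."""
    def __init__(self):
        self.best_iteration = None

shared = StatefulCallback()

# Two "trials" handed the same object by reference: the second trial's
# write clobbers the first trial's state (cross-trial interference).
trial_a_callbacks = [shared]
trial_b_callbacks = [shared]
trial_a_callbacks[0].best_iteration = 2
trial_b_callbacks[0].best_iteration = 8
assert trial_a_callbacks[0].best_iteration == 8  # trial A's state was overwritten

# With a deep copy per trial, each trial keeps its own independent state.
isolated_a = copy.deepcopy([shared])
isolated_b = copy.deepcopy([shared])
isolated_a[0].best_iteration = 2
isolated_b[0].best_iteration = 8
assert isolated_a[0].best_iteration == 2
assert isolated_b[0].best_iteration == 8
```

In the real scenario the "writes" come from LightGBM's early-stopping logic racing across workers, which is why the symptom is timing-dependent rather than a deterministic crash.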

If helpful, I can also add this reproducer directly to the PR description or expand it further.

@not522
Member

not522 commented Jan 8, 2026

Thank you! Could you provide a reproducible example?

@Quant-Quasar
Contributor Author

import optuna
from optuna.integration import OptunaSearchCV
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Reproducer for shared callback state in parallel OptunaSearchCV + LightGBM
# Key requirements:
#   - n_jobs > 1
#   - stateful callback (LightGBM early stopping)
# The issue manifests as cross-trial interference (e.g., identical or
# prematurely stopped iteration counts), not necessarily a hard crash.

# 1. Setup Data
X, y = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Setup Search Space
search_space = {
    "learning_rate": optuna.distributions.FloatDistribution(0.01, 0.2),
    "num_leaves": optuna.distributions.IntDistribution(10, 20),
}

# 3. Setup Estimator
clf = lgb.LGBMClassifier(verbose=-1, n_estimators=100)

# 4. Setup OptunaSearchCV with Parallelism
oscv = OptunaSearchCV(
    clf,
    param_distributions=search_space,
    cv=3,
    n_jobs=2, # PARALLELISM IS KEY
    n_trials=4,
    random_state=42,
    verbose=1
)

# 5. Create a Stateful Callback (Early Stopping)
# This object stores 'best_iteration'. If shared, it corrupts parallel runs.
early_stop = lgb.early_stopping(stopping_rounds=10)

print("--- STARTING PARALLEL FIT ---")
try:
    oscv.fit(
        X_train, y_train, 
        eval_set=[(X_test, y_test)], 
        eval_metric='auc',
        callbacks=[early_stop] # Passing the SAME list/object
    )
    print("✅ Finished fit (check logs for identical iteration counts, which would indicate the bug).")
except Exception as e:
    print(f"❌ Crash: {e}")

# The user report says it runs but produces wrong results.

@not522
Member

not522 commented Jan 8, 2026

Thank you. Could you also share the console log? The original bug report mentions that "sometimes the two trials have the same best iteration," but I haven't seen this happen very often in my environment.

@Quant-Quasar
Contributor Author

Agreed that this is intermittent and environment-dependent.

Below is a representative excerpt from one of my runs with n_jobs=2. One observable signal is that parallel fits report identical best_iteration values, despite being interleaved and using different hyperparameters:

Training until validation scores don't improve for 10 rounds
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[2]     valid_0's auc: 0.974949
Early stopping, best iteration is:
[2]     valid_0's auc: 0.974949

Training until validation scores don't improve for 10 rounds
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[8]     valid_0's auc: 0.967273
Early stopping, best iteration is:
[8]     valid_0's auc: 0.967273

Training until validation scores don't improve for 10 rounds
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[15]    valid_0's auc: 0.97697
Early stopping, best iteration is:
[15]    valid_0's auc: 0.97697

In the affected runs, parallel trials with different sampled hyperparameters occasionally reported identical best_iteration values, which is consistent with shared state in the early stopping callback.

In other runs, the issue manifests as premature stopping rather than identical iteration numbers, which is why it may not reproduce consistently.

Isolating the callback instances per worker removes this behavior entirely.

Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
Member

@not522 not522 left a comment

Thank you! LGTM!

@not522 not522 merged commit 33d5678 into optuna:main Jan 22, 2026
31 checks passed
@not522 not522 added this to the v4.8.0 milestone Jan 22, 2026
@not522 not522 added the bug Something isn't working label Feb 13, 2026