
Fix shared callback state in parallel OptunaSearchCV with LightGBM#260

Merged
not522 merged 2 commits into optuna:main from Quant-Quasar:fix-issue-240-shared-callbacks
Jan 22, 2026

Conversation

@Quant-Quasar
Contributor

Motivation

When using OptunaSearchCV with n_jobs > 1, users may pass stateful callback objects
(e.g., LightGBM early stopping callbacks) via fit_params.

Because fit_params is shared across parallel workers, callback objects inside it can
be shared by reference, leading to unintended cross-trial interference.
In practice, this can cause multiple parallel trials to observe and mutate the same
callback state (such as best_iteration), resulting in incorrect or identical early
stopping behavior across trials.

This issue was reported in #240 and can silently affect optimization results when running
parallel cross-validation with callbacks.

Description of the changes

This PR ensures proper isolation of callback state across parallel trials by:

  • Creating a shallow copy of fit_params at objective execution time.
  • Deep-copying only the callbacks entry (if present) to provide each trial with an
    independent callback instance.
  • Leaving all other entries in fit_params unchanged to avoid unnecessary duplication
    of large or expensive objects.

The change is localized to the objective execution path and does not affect existing
behavior for single-threaded runs or workflows that do not use callbacks.

This prevents shared mutable state between parallel trials while keeping memory and
performance overhead minimal.

@Quant-Quasar Quant-Quasar force-pushed the fix-issue-240-shared-callbacks branch 2 times, most recently from 76e5c15 to 05662a1 on December 26, 2025 08:06
@Quant-Quasar Quant-Quasar force-pushed the fix-issue-240-shared-callbacks branch from 05662a1 to 6a15eac on December 26, 2025 08:18
@not522 not522 self-assigned this Dec 26, 2025
@not522
Member

not522 commented Dec 26, 2025

Thank you for your PR! Since we're approaching the year-end and New Year holidays, I'll review the details after New Year's Day.

@Quant-Quasar
Contributor Author

Thank you for the update! I completely understand.
Please take your time, and I’m happy to follow up after New Year’s.

@github-actions

github-actions bot commented Jan 4, 2026

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 4, 2026
@c-bata c-bata removed the stale Exempt from stale bot labeling. label Jan 5, 2026
@not522
Member

not522 commented Jan 8, 2026

Thank you for waiting.
I haven't been able to reproduce the original bug report. Did you manage to reproduce it in your environment?

@Quant-Quasar
Contributor Author

Thanks for checking!

Yes, I was able to reproduce the issue locally under parallel execution. The key requirement is running OptunaSearchCV with n_jobs > 1 and passing a stateful callback (e.g., LightGBM early stopping) via fit_params.

I used a setup similar to the following (simplified):

import optuna
import lightgbm as lgb
from optuna.integration import OptunaSearchCV

# X_train, X_test, y_train, y_test prepared beforehand

early_stop = lgb.early_stopping(stopping_rounds=10)

oscv = OptunaSearchCV(
    lgb.LGBMClassifier(n_estimators=100, verbose=-1),
    param_distributions={
        "learning_rate": optuna.distributions.FloatDistribution(0.01, 0.2),
        "num_leaves": optuna.distributions.IntDistribution(10, 20),
    },
    cv=3,
    n_jobs=2,
    n_trials=4,
)

oscv.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="auc",
    callbacks=[early_stop],  # shared callback instance
)

In this configuration, the callback object inside fit_params["callbacks"] is shared across parallel workers. As a result, multiple trials mutate the same callback state (e.g., best_iteration_), which causes cross-trial interference. This typically manifests as identical or prematurely stopped iteration counts across different trials rather than a hard crash.

The behavior is timing- and backend-dependent, so it may not reproduce deterministically in all environments, but the underlying issue is the shared mutable callback state.

The proposed fix isolates the callback instances per worker by shallow-copying fit_params and deep-copying only the callbacks list at execution time. This prevents shared-state corruption without unnecessarily duplicating large objects.
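The failure mode can be illustrated without LightGBM at all. The class below is a stand-in for a stateful early-stopping callback, not LightGBM's actual internals:

```python
import copy

class StatefulCallback:
    """Stand-in for a stateful early-stopping callback."""
    def __init__(self):
        self.best_iteration = None

shared = StatefulCallback()

# Two "trials" handed the same object by reference: the second trial's
# write clobbers the first trial's state (cross-trial interference).
trial_a_callbacks = [shared]
trial_b_callbacks = [shared]
trial_a_callbacks[0].best_iteration = 2
trial_b_callbacks[0].best_iteration = 8
assert trial_a_callbacks[0].best_iteration == 8  # trial A's state was overwritten

# With a deep copy per trial, each trial keeps its own independent state.
isolated_a = copy.deepcopy([shared])
isolated_b = copy.deepcopy([shared])
isolated_a[0].best_iteration = 2
isolated_b[0].best_iteration = 8
assert isolated_a[0].best_iteration == 2
assert isolated_b[0].best_iteration == 8
```

In the real scenario the "writes" come from LightGBM's early-stopping logic racing across workers, which is why the symptom is timing-dependent rather than a deterministic crash.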

If helpful, I can also add this reproducer directly to the PR description or expand it further.

@not522
Member

not522 commented Jan 8, 2026

Thank you! Could you provide a reproducible example?

@Quant-Quasar
Contributor Author

import optuna
from optuna.integration import OptunaSearchCV
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Reproducer for shared callback state in parallel OptunaSearchCV + LightGBM
# Key requirements:
#   - n_jobs > 1
#   - stateful callback (LightGBM early stopping)
# The issue manifests as cross-trial interference (e.g., identical or
# prematurely stopped iteration counts), not necessarily a hard crash.

# 1. Setup Data
X, y = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Setup Search Space
search_space = {
    "learning_rate": optuna.distributions.FloatDistribution(0.01, 0.2),
    "num_leaves": optuna.distributions.IntDistribution(10, 20),
}

# 3. Setup Estimator
clf = lgb.LGBMClassifier(verbose=-1, n_estimators=100)

# 4. Setup OptunaSearchCV with Parallelism
oscv = OptunaSearchCV(
    clf,
    param_distributions=search_space,
    cv=3,
    n_jobs=2, # PARALLELISM IS KEY
    n_trials=4,
    random_state=42,
    verbose=1
)

# 5. Create a Stateful Callback (Early Stopping)
# This object stores 'best_iteration'. If shared, it corrupts parallel runs.
early_stop = lgb.early_stopping(stopping_rounds=10)

print("--- STARTING PARALLEL FIT ---")
try:
    oscv.fit(
        X_train, y_train, 
        eval_set=[(X_test, y_test)], 
        eval_metric='auc',
        callbacks=[early_stop] # Passing the SAME list/object
    )
    print("✅ Finished fit (check logs for identical iteration counts, which would indicate the bug).")
except Exception as e:
    print(f"❌ Crash: {e}")

# The user report says it runs but produces wrong results.

@not522
Member

not522 commented Jan 8, 2026

Thank you. Could you also share the console log? The original bug report mentions that "sometimes the two trials have the same best iteration," but I haven't seen this happen very often in my environment.

@Quant-Quasar
Contributor Author

Agreed that this is intermittent and environment-dependent.

Below is a representative excerpt from one of my runs with n_jobs=2. One observable signal is that parallel fits report identical best_iteration values, despite being interleaved and using different hyperparameters:

Training until validation scores don't improve for 10 rounds
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[2]     valid_0's auc: 0.974949
Early stopping, best iteration is:
[2]     valid_0's auc: 0.974949

Training until validation scores don't improve for 10 rounds
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[8]     valid_0's auc: 0.967273
Early stopping, best iteration is:
[8]     valid_0's auc: 0.967273

Training until validation scores don't improve for 10 rounds
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[15]    valid_0's auc: 0.97697
Early stopping, best iteration is:
[15]    valid_0's auc: 0.97697

In the affected runs, parallel trials with different sampled hyperparameters occasionally reported identical best_iteration values, which is consistent with shared state in the early stopping callback.

In other runs, the issue manifests as premature stopping rather than identical iteration numbers, which is why it may not reproduce consistently.

Isolating the callback instances per worker removes this behavior entirely.

Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
Member

@not522 not522 left a comment

Thank you! LGTM!

@not522 not522 merged commit 33d5678 into optuna:main Jan 22, 2026
31 checks passed
@not522 not522 added this to the v4.8.0 milestone Jan 22, 2026
@not522 not522 added the bug Something isn't working label Feb 13, 2026