Fix shared callback state in parallel OptunaSearchCV with LightGBM#260
Conversation
|
Thank you for your PR! Since we're approaching the year-end and New Year holidays, I'll review the details after New Year's Day. |
|
Thank you for the update! I completely understand. |
|
This pull request has not seen any recent activity. |
|
Thank you for waiting. |
|
Thanks for checking! Yes, I was able to reproduce the issue locally under parallel execution. The key requirement is running with n_jobs > 1 while passing a stateful callback. I used a setup similar to the following (simplified):

early_stop = lgb.early_stopping(stopping_rounds=10)
oscv = OptunaSearchCV(
    lgb.LGBMClassifier(n_estimators=100, verbose=-1),
    param_distributions={
        "learning_rate": optuna.distributions.FloatDistribution(0.01, 0.2),
        "num_leaves": optuna.distributions.IntDistribution(10, 20),
    },
    cv=3,
    n_jobs=2,
    n_trials=4,
)
oscv.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="auc",
    callbacks=[early_stop],  # shared callback instance
)

In this configuration, the callback object inside fit_params is shared by reference across the parallel workers. The behavior is timing- and backend-dependent, so it may not reproduce deterministically in all environments, but the underlying issue is the shared mutable callback state. The proposed fix isolates the callback instances per trial. If helpful, I can also add this reproducer directly to the PR description or expand it further. |
|
Thank you! Could you provide a reproducible example? |
import optuna
from optuna.integration import OptunaSearchCV
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Reproducer for shared callback state in parallel OptunaSearchCV + LightGBM
# Key requirements:
# - n_jobs > 1
# - stateful callback (LightGBM early stopping)
# The issue manifests as cross-trial interference (e.g., identical or
# prematurely stopped iteration counts), not necessarily a hard crash.

# 1. Setup Data
X, y = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Setup Search Space
search_space = {
    "learning_rate": optuna.distributions.FloatDistribution(0.01, 0.2),
    "num_leaves": optuna.distributions.IntDistribution(10, 20),
}

# 3. Setup Estimator
clf = lgb.LGBMClassifier(verbose=-1, n_estimators=100)

# 4. Setup OptunaSearchCV with Parallelism
oscv = OptunaSearchCV(
    clf,
    param_distributions=search_space,
    cv=3,
    n_jobs=2,  # PARALLELISM IS KEY
    n_trials=4,
    random_state=42,
    verbose=1,
)

# 5. Create a Stateful Callback (Early Stopping)
# This object stores 'best_iteration'. If shared, it corrupts parallel runs.
early_stop = lgb.early_stopping(stopping_rounds=10)

print("--- STARTING PARALLEL FIT ---")
try:
    oscv.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric="auc",
        callbacks=[early_stop],  # passing the SAME list/object
    )
    print("✅ Finished fit (check logs for identical iteration counts, which implies the bug).")
except Exception as e:
    print(f"❌ Crash: {e}")
# The user report says it runs but produces wrong results. |
|
Thank you. Could you also share the console log? The original bug report mentions that "sometimes the two trials have the same best iteration," but I haven't seen this happen very often in my environment. |
|
Agreed that this is intermittent and environment-dependent. In the affected runs with n_jobs=2, parallel trials with different sampled hyperparameters occasionally reported identical best_iteration values, which is consistent with shared state in the early stopping callback. In other runs, the issue manifests as premature stopping rather than identical iteration numbers, which is why it may not reproduce consistently. Isolating the callback instances per worker removes this behavior entirely. |
Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
Motivation
When using OptunaSearchCV with n_jobs > 1, users may pass stateful callback objects (e.g., LightGBM early stopping callbacks) via fit_params. Because fit_params is shared across parallel workers, callback objects inside it can be shared by reference, leading to unintended cross-trial interference.
In practice, this can cause multiple parallel trials to observe and mutate the same callback state (such as best_iteration), resulting in incorrect or identical early stopping behavior across trials.
This issue was reported in #240 and can silently affect optimization results when running parallel cross-validation with callbacks.
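The shared-by-reference failure mode can be demonstrated without LightGBM at all. In the sketch below, the StatefulCallback class is a hypothetical stand-in for a stateful early stopping callback; it shows that shallow-copying a fit_params dict still leaves every "copy" pointing at the same callback object:

```python
import copy


class StatefulCallback:
    """Hypothetical stand-in for a stateful callback such as LightGBM early stopping."""

    def __init__(self):
        self.best_iteration = None


fit_params = {"callbacks": [StatefulCallback()]}

# Two "per-trial" shallow copies of fit_params still reference
# the same callback object inside the list.
trial_a = dict(fit_params)
trial_b = dict(fit_params)

trial_a["callbacks"][0].best_iteration = 7
print(trial_b["callbacks"][0].best_iteration)  # 7 -- state leaked across trials

# Deep-copying the callbacks entry gives a trial its own instance,
# so its mutations no longer reach the shared original.
trial_c = dict(fit_params)
trial_c["callbacks"] = copy.deepcopy(trial_c["callbacks"])
trial_c["callbacks"][0].best_iteration = 3
print(fit_params["callbacks"][0].best_iteration)  # still 7
```

This is exactly the interference described above, minus the timing dependence: with real workers the mutation happens mid-training, which is why the symptom varies between identical and prematurely stopped iteration counts.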
Description of the changes
This PR ensures proper isolation of callback state across parallel trials by:
- Copying fit_params at objective execution time.
- Copying the callbacks entry (if present) to provide each trial with an independent callback instance.
- Leaving the rest of fit_params unchanged to avoid unnecessary duplication of large or expensive objects.
The change is localized to the objective execution path and does not affect existing
behavior for single-threaded runs or workflows that do not use callbacks.
This prevents shared mutable state between parallel trials while keeping memory and
performance overhead minimal.
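The copy-on-execute idea can be sketched as a small helper. The function name isolate_fit_params is illustrative, not the PR's actual implementation:

```python
import copy


def isolate_fit_params(fit_params):
    # Shallow-copy the mapping so the caller's dict is untouched,
    # then deep-copy only the "callbacks" entry so each trial gets
    # independent callback instances. All other entries (e.g. large
    # eval_set arrays) remain shared by reference.
    params = dict(fit_params)
    if "callbacks" in params:
        params["callbacks"] = copy.deepcopy(params["callbacks"])
    return params
```

Each parallel objective call would then receive isolate_fit_params(fit_params) instead of the shared mapping, which keeps the memory overhead limited to the (typically tiny) callback objects.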