Fix ill-combination of journal and gRPC by nabenabe0928 · Pull Request #6175 · optuna/optuna

nabenabe0928 · 2025-06-23T04:57:39Z

Motivation

This PR resolves the following issues:

JournalStorage fails frequently in distributed optimization setups in combination with GrpcProxyStorage #6084
Suspicious JournalFileStorage behavior with GrpcProxyStorage in a multi-threading setup #6172

Description of the changes

Fix the logic for update check in set_trial_state_values
Modify an existing unit test so that the current master branch fails much more often
Add a unit test to verify whether JournalStorage works with gRPC on multi-processing setups

Copilot

Pull Request Overview

This PR fixes an issue with updating trial state values when using gRPC by refining the logic in set_trial_state_values and related helper functions.

Introduces a new error message constant for finished trials.
Adds a synchronized check within a lock to ensure that trial updates occur only when valid.
Refactors the return logic in set_trial_state_values and updates error handling in _trial_exists_and_updatable.
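The "check within a lock" idea summarized above can be illustrated with a toy model. This is only a minimal sketch and not Optuna's implementation: `ToyStorage`, `claim`, `count_winners`, and the `State` enum are hypothetical names, and an in-memory dict stands in for the journal. It shows why the state check and the state update have to happen under the same lock so that only one worker moves a trial from WAITING to RUNNING.

```python
import enum
import threading
from concurrent.futures import ThreadPoolExecutor


class State(enum.Enum):
    WAITING = 0
    RUNNING = 1
    COMPLETE = 2


class ToyStorage:
    """Hypothetical in-memory stand-in for a storage; not an Optuna class."""

    def __init__(self, n_trials: int) -> None:
        self._lock = threading.Lock()
        self._states = {i: State.WAITING for i in range(n_trials)}

    def claim(self, trial_id: int) -> bool:
        """Try to move a trial from WAITING to RUNNING; True means this caller won."""
        with self._lock:
            # The check and the update share one critical section, so no other
            # thread can slip in between them.
            state = self._states[trial_id]
            if state is State.COMPLETE:
                raise RuntimeError(f"Trial#{trial_id} has already finished.")
            if state is not State.WAITING:
                return False  # Another worker is already running this trial.
            self._states[trial_id] = State.RUNNING
            return True


def count_winners(n_workers: int = 8, n_trials: int = 30) -> list:
    """Let every worker try to claim every trial; return winners per trial."""
    storage = ToyStorage(n_trials)

    def worker() -> list:
        return [storage.claim(t) for t in range(n_trials)]

    with ThreadPoolExecutor(n_workers) as pool:
        futures = [pool.submit(worker) for _ in range(n_workers)]
        results = [f.result() for f in futures]
    return [sum(r[t] for r in results) for t in range(n_trials)]
```

With the lock in place, each trial should have exactly one winner regardless of how the threads interleave.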

Comments suppressed due to low confidence (1)

optuna/storages/journal/_storage.py:34

[nitpick] Consider changing 'can not' to 'cannot' for improved readability and consistency in the error message.

UNUPDATABLE_MSG = "Trial#{trial_number} has already finished and can not be updated."

nabenabe0928 · 2025-06-23T04:58:15Z

@c-bata @gen740
Could you review this PR?

github-actions · 2025-06-30T23:06:32Z

This pull request has not seen any recent activity.

nabenabe0928 · 2025-07-01T08:06:25Z

tests/study_tests/test_study.py

+    num_enqueued = 30
+    # NOTE(nabenabe): Fewer threads in gRPC increases the probability of thread collision on the
+    # proxy side. See https://github.com/optuna/optuna/issues/6084
+    storage_kwargs = (
+        {"thread_pool": ThreadPoolExecutor(2)} if storage_mode == "grpc_journal_file" else {}
+    )
+    with StorageSupplier(storage_mode, **storage_kwargs) as storage:


The master branch with this change yielded 80 failures out of 100 runs on Ubuntu 20.04.

$ (for _ in `seq 0 99`; do python -m pytest tests/study_tests/test_study.py::test_pop_waiting_trial_thread_safe[grpc_journal_file] | grep "1 failed"; done) | wc -l
>>> 80

Note
Meanwhile, I got no failures with the changes in this PR.

nabenabe0928 · 2025-07-01T08:22:58Z

tests/study_tests/test_study.py


-    num_enqueued = 10
-    with StorageSupplier(storage_mode) as storage:
+    num_enqueued = 30


#6175 (comment)

nabenabe0928 · 2025-07-01T09:45:07Z

@c-bata @gen740
I added a unit test and modified an existing one accordingly:)

nabenabe0928 · 2025-07-01T11:36:39Z

tests/storages_tests/journal_tests/test_combination_with_grpc.py

This unit test also induces quite frequent failures in the master branch.

nabenabe0928 · 2025-07-02T06:53:35Z

I tested the new unit tests on the master branch:

$ (for _ in `seq 0 99`; do python -m pytest tests/storages_tests/journal_tests/test_combination_with_grpc.py::test_pop_waiting_trial_multiprocess_safe | grep "1 failed"; done) | wc -l
>>> 45

$ (for _ in `seq 0 99`; do python -m pytest tests/storages_tests/journal_tests/test_combination_with_grpc.py::test_pop_waiting_trial_thread_safe | grep "1 failed"; done) | wc -l
>>> 8
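The shell loops above estimate how often a flaky test fails. A rough Python equivalent is sketched below, under stated assumptions: `count_failures` and `PYTEST` are hypothetical helpers, not part of this PR, and the sketch counts any non-zero exit code as a failure. That is slightly broader than grepping for "1 failed", because pytest also exits non-zero on collection or usage errors.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical convenience prefix: run pytest with the current interpreter.
PYTEST = [sys.executable, "-m", "pytest"]


def count_failures(cmd: Sequence[str], runs: int = 100) -> int:
    """Run `cmd` repeatedly and count how many runs exit with a non-zero code."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if completed.returncode != 0:
            failures += 1
    return failures


# Example call (not executed here), using a test ID quoted in this thread:
# count_failures(PYTEST + [
#     "tests/storages_tests/journal_tests/test_combination_with_grpc.py"
#     "::test_pop_waiting_trial_thread_safe"
# ])
```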

nabenabe0928 · 2025-07-02T09:40:03Z

This PR addresses the concern here:

https://github.com/optuna/optuna/pull/6170/files#r2176367263

nabenabe0928 · 2025-07-02T09:50:10Z

@gen740 @c-bata

It seems the thread problem in the unit test also went away:

Add pytest-xdist to speed up the CI #6170

c-bata · 2025-07-08T04:27:33Z

optuna/storages/journal/_storage.py

+                existing_trial = self._replay_result._trials.get(trial_id)
+                if existing_trial is not None and existing_trial.state != TrialState.WAITING:
+                    if existing_trial.state.is_finished():
+                        raise UpdateFinishedTrialError(
+                            UNUPDATABLE_MSG.format(trial_number=existing_trial.number)
+                        )
+                    return False


I have two questions.

Can we use an assert statement to check that existing_trial is not None, so we simplify the condition in the if statement?

Since existing_trial.state.is_finished() implies that existing_trial.state != TrialState.WAITING is always true, can we reduce the nesting like this?

existing_trial = self._replay_result._trials.get(trial_id)
assert existing_trial is not None, (
    "This must be True. Please file a bug report on GitHub if this line raises AssertionError."
)
if existing_trial.state.is_finished():
    raise UpdateFinishedTrialError(
        UNUPDATABLE_MSG.format(trial_number=existing_trial.number)
    )
if existing_trial.state != TrialState.WAITING:
    # this line is equivalent to `existing_trial.state == TrialState.RUNNING`.
    return False
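The claim that the flattened version behaves like the nested one can be checked exhaustively on a small model. The sketch below is an illustration only: `ToyState` is a hypothetical enum that mirrors the state names used in this thread, with `is_finished()` assumed to mean "neither RUNNING nor WAITING", and the two functions return a label for the branch taken instead of raising or returning. The one intended difference is the `None` case, which the nested form lets fall through and the flattened form rejects with an assertion.

```python
import enum
from typing import Optional


class ToyState(enum.Enum):
    """Hypothetical mirror of the trial states discussed above."""

    RUNNING = 0
    COMPLETE = 1
    PRUNED = 2
    FAIL = 3
    WAITING = 4

    def is_finished(self) -> bool:
        # Assumption: a trial is finished when it is neither RUNNING nor WAITING.
        return self not in (ToyState.RUNNING, ToyState.WAITING)


def nested(state: Optional[ToyState]) -> str:
    """Branch structure of the original diff."""
    if state is not None and state != ToyState.WAITING:
        if state.is_finished():
            return "raise"
        return "return False"
    return "fall through"


def flattened(state: Optional[ToyState]) -> str:
    """Branch structure of the suggested rewrite."""
    assert state is not None
    if state.is_finished():
        return "raise"
    if state != ToyState.WAITING:
        # Only RUNNING can reach this branch.
        return "return False"
    return "fall through"


def equivalent_for_all_states() -> bool:
    """Compare both forms on every non-None state."""
    return all(nested(s) == flattened(s) for s in ToyState)
```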

I confirmed that this PR works nicely on a toy problem even with your change!
Let me confirm with a bigger task:)
I am gonna get back to you as soon as the experiment finishes!

c-bata · 2025-07-11T04:15:14Z

I also checked JournalRedisBackend works as expected with this code, so just leaving it here for reference.

docker run -d --name redis -p 127.0.0.1:6379:6379 redis

from concurrent.futures import ProcessPoolExecutor

import optuna
from optuna.storages import JournalStorage
from optuna.storages.journal import JournalFileBackend
from optuna.storages.journal import JournalRedisBackend

storage = JournalStorage(JournalRedisBackend(url="redis://127.0.0.1:6379", prefix="pr-6175"))


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return x ** 2


if __name__ == "__main__":
    study = optuna.create_study(storage=storage)
    for i in range(1000):
        study.enqueue_trial({"x": -5.0 + float(i) / 100})

    with ProcessPoolExecutor(max_workers=20) as pool:
        for i in range(100):
            pool.submit(study.optimize, objective, n_trials=10)

    print(f"{len(study.trials)=}")
    print(f"{len({trial.number for trial in study.trials})=}")
    print(f"{len({trial._trial_id for trial in study.trials})=}")

c-bata

Changes look good to me. this PR can be merged after:

Confirming it works as expected with the larger task by @nabenabe0928
Receiving approval from the second reviewer

gen740 · 2025-07-11T07:19:39Z

optuna/storages/journal/_storage.py

+                # return statement of trial_id == _replay_result.owned_trial_id. To eliminate false
+                # positives, we verify whether another process is already evaluating the trial with
+                # trial_id. If True, it means this query does not update the trial state.
+                existing_trial = self._replay_result._trials.get(trial_id)


Isn't it necessary to call self._sync_with_backend() here, since this line uses self._replay_result?

Thanks, will do it:)
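Why a sync is needed before reading the replayed state can be shown with a toy journal. This is a hedged sketch and not Optuna code: `ToyJournal`, `ToyReplica`, and `demo` are hypothetical names, and the log is a plain list shared by two replicas in one process. Each replica only sees entries it has replayed, so a read without a prior sync can return a state that another worker has already changed.

```python
from typing import Dict, List, Tuple


class ToyJournal:
    """Shared append-only log; stands in for a journal backend."""

    def __init__(self) -> None:
        self.logs: List[Tuple[int, str]] = []


class ToyReplica:
    """One worker's view of the journal, rebuilt by replaying log entries."""

    def __init__(self, journal: ToyJournal) -> None:
        self._journal = journal
        self._cursor = 0  # Number of log entries already replayed.
        self.states: Dict[int, str] = {}

    def write(self, trial_id: int, state: str) -> None:
        self._journal.logs.append((trial_id, state))

    def sync(self) -> None:
        """Replay every log entry appended since the last sync."""
        for trial_id, state in self._journal.logs[self._cursor:]:
            self.states[trial_id] = state
        self._cursor = len(self._journal.logs)


def demo() -> Tuple[str, str]:
    journal = ToyJournal()
    a, b = ToyReplica(journal), ToyReplica(journal)
    a.write(0, "WAITING")
    a.sync()
    b.sync()
    a.write(0, "RUNNING")  # Worker A claims the trial.
    stale = b.states[0]  # B reads its cache without syncing first.
    b.sync()
    fresh = b.states[0]  # After syncing, B sees A's update.
    return stale, fresh
```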

Co-authored-by: Gen <54583542+gen740@users.noreply.github.com>

gen740

LGTM!

Fix ill-combination of journal and gRPC

6ba8a5e

nabenabe0928 requested a review from Copilot June 23, 2025 04:57

nabenabe0928 assigned c-bata and gen740 Jun 23, 2025

nabenabe0928 added the bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. label Jun 23, 2025

Copilot AI reviewed Jun 23, 2025

nabenabe0928 added this to the v4.5.0 milestone Jun 23, 2025

nabenabe0928 marked this pull request as ready for review June 23, 2025 04:58

Apply formatter

cd77c91

github-actions bot added the stale Exempt from stale bot labeling. label Jun 30, 2025

nabenabe0928 added 4 commits July 1, 2025 06:10

Enhance the comment

3e19908

Update the comment

42c978f

Increase the failure probability of journal x gRPC

a26ee68

Apply formatter

b3dd39d

nabenabe0928 commented Jul 1, 2025

Enhance comment

a5968c9

nabenabe0928 commented Jul 1, 2025

Add a unit test using multiprocessing

038d2fa

nabenabe0928 commented Jul 1, 2025

github-actions bot removed the stale Exempt from stale bot labeling. label Jul 1, 2025

c-bata reviewed Jul 2, 2025

nabenabe0928 added 6 commits July 2, 2025 07:45

Address c-bata's comment

f6b40bc

Clean unit tests

9c60544

Revert test_study.py

fcd1595

Separate the unit tests

bf0e473

Merge remote-tracking branch 'upstream/master' into fix-journal-grpc-…

a5ff90e

…cleaner

Fix

be7651c

nabenabe0928 added 4 commits July 2, 2025 09:04

Refactor

ccfbe8c

Refactor

fbb20bc

Merge upstream/master

5c74b30

Support thread-safe checking for journal grpc

65b5232

c-bata reviewed Jul 8, 2025

nabenabe0928 added 2 commits July 9, 2025 09:34

Apply c-bata's comment

e967cf2

Apply formatter

7e54561

c-bata reviewed Jul 11, 2025

c-bata approved these changes Jul 11, 2025

gen740 reviewed Jul 11, 2025

nabenabe0928 unassigned c-bata and gen740 Jul 11, 2025

Add an inline comment for sync

4733d00

Co-authored-by: Gen <54583542+gen740@users.noreply.github.com>

nabenabe0928 force-pushed the fix-journal-grpc-cleaner branch from ab09c1e to 4733d00 Compare July 11, 2025 07:22

gen740 approved these changes Jul 11, 2025

nabenabe0928 enabled auto-merge July 11, 2025 07:24

nabenabe0928 merged commit 1406cb3 into optuna:master Jul 11, 2025
11 of 14 checks passed

nabenabe0928 mentioned this pull request Oct 28, 2025

Towards More Robust Optuna Backend nabenabe0928/my-skills#5

Open
