Skip to content

Fix ill-combination of journal and gRPC#6175

Merged
nabenabe0928 merged 21 commits intooptuna:masterfrom
nabenabe0928:fix-journal-grpc-cleaner
Jul 11, 2025
Merged

Fix ill-combination of journal and gRPC#6175
nabenabe0928 merged 21 commits intooptuna:masterfrom
nabenabe0928:fix-journal-grpc-cleaner

Conversation

@nabenabe0928
Copy link
Copy Markdown
Contributor

@nabenabe0928 nabenabe0928 commented Jun 23, 2025

Motivation

This PR resolves the following issues:

Description of the changes

  • Fix the logic for update check in set_trial_state_values
  • Modify an existing unit test so that the current master branch fails much more often
  • Add a unit test to verify whether JournalStorage works with gRPC on multi-processing setups

@nabenabe0928 nabenabe0928 requested a review from Copilot June 23, 2025 04:57
@nabenabe0928 nabenabe0928 added the bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. label Jun 23, 2025
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue with updating trial state values when using gRPC by refining the logic in set_trial_state_values and related helper functions.

  • Introduces a new error message constant for finished trials.
  • Adds a synchronized check within a lock to ensure that trial updates occur only when valid.
  • Refactors the return logic in set_trial_state_values and updates error handling in _trial_exists_and_updatable.
Comments suppressed due to low confidence (1)

optuna/storages/journal/_storage.py:34

  • [nitpick] Consider changing 'can not' to 'cannot' for improved readability and consistency in the error message.
UNUPDATABLE_MSG = "Trial#{trial_number} has already finished and can not be updated."

@nabenabe0928 nabenabe0928 added this to the v4.5.0 milestone Jun 23, 2025
@nabenabe0928 nabenabe0928 marked this pull request as ready for review June 23, 2025 04:58
@nabenabe0928
Copy link
Copy Markdown
Contributor Author

@c-bata @gen740
Could you review this PR?

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jun 30, 2025
Comment on lines +1676 to +1682
num_enqueued = 30
# NOTE(nabenabe): Fewer threads in gRPC increases the probability of thread collision on the
# proxy side. See https://github.com/optuna/optuna/issues/6084
storage_kwargs = (
{"thread_pool": ThreadPoolExecutor(2)} if storage_mode == "grpc_journal_file" else {}
)
with StorageSupplier(storage_mode, **storage_kwargs) as storage:
Copy link
Copy Markdown
Contributor Author

@nabenabe0928 nabenabe0928 Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The master branch with this change yielded 80 failures out of 100 runs on Ubuntu 20.04.

$ (for _ in `seq 0 99`; do python -m pytest tests/study_tests/test_study.py::test_pop_waiting_trial_thread_safe[grpc_journal_file] | grep "1 f
ailed"; done) | wc -l

>>> 80

Note

Meanwhile, I got no failures with the changes in this PR.


num_enqueued = 10
with StorageSupplier(storage_mode) as storage:
num_enqueued = 30
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nabenabe0928
Copy link
Copy Markdown
Contributor Author

@c-bata @gen740
I added a unit test and modified an existing one accordingly:)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test also induces quite frequent failures in the master branch.

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Jul 1, 2025
@nabenabe0928
Copy link
Copy Markdown
Contributor Author

nabenabe0928 commented Jul 2, 2025

I tested the new unit tests on the master branch:

$ (for _ in `seq 0 99`; do python -m pytest tests/storages_tests/journal_tests/test_combination_with_grpc.py::test_pop_waiting_trial_multiprocess_safe | grep "1 failed"; done) | wc -l
>>> 45

$ (for _ in `seq 0 99`; do python -m pytest tests/storages_tests/journal_tests/test_combination_with_grpc.py::test_pop_waiting_trial_thread_safe | grep "1 failed"; done) | wc -l
>>> 8

@nabenabe0928
Copy link
Copy Markdown
Contributor Author

This PR addresses the concern here:

@nabenabe0928
Copy link
Copy Markdown
Contributor Author

@gen740 @c-bata

It seems the thread problem in the unit test also went away:

Comment on lines +333 to +339
existing_trial = self._replay_result._trials.get(trial_id)
if existing_trial is not None and existing_trial.state != TrialState.WAITING:
if existing_trial.state.is_finished():
raise UpdateFinishedTrialError(
UNUPDATABLE_MSG.format(trial_number=existing_trial.number)
)
return False
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two questions.

  1. Can we use an assert statement to check that existing_trial is not None, so we simplify the condition in the if statement?
  2. Since existing_trial.state.is_finished() implies that existing_trial.state != TrialState.WAITING is always true, can we reduce the nesting like this?
                existing_trial = self._replay_result._trials.get(trial_id)
                assert existing_trial is not None, (
                    "This must be True. Please file a bug report on GitHub if this line raises AssertionError."
                )
                if existing_trial.state.is_finished():
                    raise UpdateFinishedTrialError(
                        UNUPDATABLE_MSG.format(trial_number=existing_trial.number)
                    )
                if existing_trial.state != TrialState.WAITING:  # this line is equivalent to `existing_trial.state == TrialState.RUNNING`.
                    return False

Copy link
Copy Markdown
Contributor Author

@nabenabe0928 nabenabe0928 Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that this PR works nicely on a toy problem even with your change!
Let me confirm with a bigger task:)
I am gonna get back to you as soon as the experiment finishes!

@c-bata
Copy link
Copy Markdown
Member

c-bata commented Jul 11, 2025

I also checked JournalRedisBackend works as expected with this code, so just leaving it here for reference.

docker run -d --name redis -p 127.0.0.1:6379:6379 redis
from concurrent.futures import ProcessPoolExecutor

import optuna
from optuna.storages import JournalStorage
from optuna.storages.journal import JournalFileBackend
from optuna.storages.journal import JournalRedisBackend

storage = JournalStorage(JournalRedisBackend(url="redis://127.0.0.1:6379", prefix="pr-6175"))


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return x ** 2


if __name__ == "__main__":
    study = optuna.create_study(storage=storage)
    for i in range(1000):
        study.enqueue_trial({"x": -5.0 + float(i) / 100})

    with ProcessPoolExecutor(max_workers=20) as pool:
        for i in range(100):
            pool.submit(study.optimize, objective, n_trials=10)

    print(f"{len(study.trials)=}")    
    print(f"{len({trial.number for trial in study.trials})=}")    
    print(f"{len({trial._trial_id for trial in study.trials})=}")

Copy link
Copy Markdown
Member

@c-bata c-bata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. this PR can be merged after:

  • Confirming it works as expected with the larger task by @nabenabe0928
  • Receiving approval from the second reviewer

# return statement of trial_id == _replay_result.owned_trial_id. To eliminate false
# positives, we verify whether another process is already evaluating the trial with
# trial_id. If True, it means this query does not update the trial state.
existing_trial = self._replay_result._trials.get(trial_id)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it necessary to call self._sync_with_backend()? Since the thin line uses self._reply_result.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do it:)

Co-authored-by: Gen <54583542+gen740@users.noreply.github.com>
@nabenabe0928 nabenabe0928 force-pushed the fix-journal-grpc-cleaner branch from ab09c1e to 4733d00 Compare July 11, 2025 07:22
Copy link
Copy Markdown
Member

@gen740 gen740 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@nabenabe0928 nabenabe0928 enabled auto-merge July 11, 2025 07:24
@nabenabe0928 nabenabe0928 merged commit 1406cb3 into optuna:master Jul 11, 2025
11 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants