JournalStorage fails frequently in distributed optimization setups in combination with GrpcStorageProxy #6084

@nabenabe0928

Expected behavior

Expected: every worker process finishes its trial without error. Actual: some worker processes crash nondeterministically with "ValueError: Cannot tell a COMPLETE trial." when JournalStorage is used behind GrpcStorageProxy.
The likely cause is a design assumption in JournalStorage: a single thread in a single process handles each trial from start to finish.
GrpcStorageProxy breaks this assumption in principle, because the process that samples a trial is separate from the process that evaluates it.

Note

This issue typically occurs with enqueue_trial, but other operations may be affected as well.

Environment

  • Optuna version: 4.3
  • Python version: 3.11
  • OS: Ubuntu 20.04

Error messages, stack traces, or logs

[I 2025-05-20 07:15:00,216] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,222] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,227] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,231] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,242] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,245] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,246] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,254] Using an existing study with name 'b99b661c-4229-4a54-92bf-f29b0ed7db25' instead of creating a new one.
[I 2025-05-20 07:15:00,294] Trial 0 finished with value: 114.0 and parameters: {'y': 3.8116090154495232, 'x': 0.08687552782907204}. Best is trial 0 with value: 114.0.
[I 2025-05-20 07:15:00,298] Trial 1 finished with value: 205.0 and parameters: {'y': -1.2209212453908358, 'x': -2.021884130213263}. Best is trial 0 with value: 114.0.
[I 2025-05-20 07:15:00,308] Trial 2 finished with value: 311.0 and parameters: {'y': 0.7764554661270058, 'x': 3.346412592844274}. Best is trial 0 with value: 114.0.
[I 2025-05-20 07:15:00,314] Trial 2 finished with value: 311.0 and parameters: {'y': 0.7764554661270058, 'x': 3.346412592844274}. Best is trial 0 with value: 114.0.
Process Process-7:
[I 2025-05-20 07:15:00,315] Trial 3 finished with value: 534.0 and parameters: {'y': -3.0491387551257243, 'x': 4.990802171581327}. Best is trial 0 with value: 114.0.
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shuhei/pfn-work/optuna-dev/optuna/simple_grpc.py", line 27, in main
    study.optimize(objective, n_trials=1)
  File "/home/shuhei/pfn-work/optuna-dev/optuna/optuna/study/study.py", line 475, in optimize
    _optimize(
  File "/home/shuhei/pfn-work/optuna-dev/optuna/optuna/study/_optimize.py", line 63, in _optimize
    _optimize_sequential(
  File "/home/shuhei/pfn-work/optuna-dev/optuna/optuna/study/_optimize.py", line 160, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuhei/pfn-work/optuna-dev/optuna/optuna/study/_optimize.py", line 209, in _run_trial
    frozen_trial = _tell_with_warning(
                   ^^^^^^^^^^^^^^^^^^^
  File "/home/shuhei/pfn-work/optuna-dev/optuna/optuna/study/_tell.py", line 120, in _tell_with_warning
    raise ValueError(f"Cannot tell a {frozen_trial.state.name} trial.")
ValueError: Cannot tell a COMPLETE trial.
[I 2025-05-20 07:15:00,320] Trial 4 finished with value: 426.0 and parameters: {'y': -4.9408864997561945, 'x': 1.5103378665562222}. Best is trial 0 with value: 114.0.
[I 2025-05-20 07:15:00,322] Trial 5 finished with value: 629.0 and parameters: {'y': -4.889657514236255, 'x': -2.4055138468929904}. Best is trial 0 with value: 114.0.
[I 2025-05-20 07:15:00,324] Trial 6 finished with value: 826.0 and parameters: {'y': 4.624483520664185, 'x': 2.162965026496132}. Best is trial 0 with value: 114.0.
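
Note that "Trial 2 finished" appears twice in the log above, which suggests two workers evaluated the same trial. A small helper along the following lines could flag such duplicates in a captured log. It is a sketch that assumes the "Trial N finished with value" message format shown above; duplicated_trials is a hypothetical name, not part of Optuna.

```python
import re
from collections import Counter

# Matches the "Trial N finished with value" lines in the log format shown above.
FINISHED = re.compile(r"Trial (\d+) finished with value")


def duplicated_trials(log_text: str) -> list[int]:
    """Return the sorted trial numbers that are reported as finished more than once."""
    counts = Counter(int(m.group(1)) for m in FINISHED.finditer(log_text))
    return sorted(number for number, count in counts.items() if count > 1)
```

A non-empty result means at least one trial was completed by more than one worker.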

Steps to reproduce

Install the dependencies:

$ pip install optuna grpcio protobuf

Then build a proxy server with the following code:

import os

from optuna.storages import run_grpc_proxy_server
from optuna.storages.journal import JournalFileBackend
from optuna.storages.journal import JournalStorage


try:
    os.remove("test-grpc.log")
except FileNotFoundError:
    pass
storage = JournalStorage(JournalFileBackend("test-grpc.log"))
run_grpc_proxy_server(storage, host="localhost", port=13000)

Launch another process and run the following code:

from collections.abc import Callable
import multiprocessing
import os
import time
import uuid

import numpy as np
import optuna


def load_study(study_name: str, storage_builder: Callable[[], optuna.storages.BaseStorage]) -> optuna.Study:
    sampler = optuna.samplers.RandomSampler()
    return optuna.create_study(
        study_name=study_name, sampler=sampler, storage=storage_builder(), load_if_exists=True
    )


def main(study_name: str, worker_id: int, storage_builder: Callable[[], optuna.storages.BaseStorage]) -> None:
    def objective(trial: optuna.Trial) -> float:
        time.sleep(0.01)
        x = trial.suggest_float("x", -5, 5)
        y = trial.suggest_float("y", -5, 5)
        return float(int((worker_id + 1) * 100 + x**2 + y**2))


    study = load_study(study_name, storage_builder)
    study.optimize(objective, n_trials=1)


def enqueue(study_name: str, storage_builder: Callable[[], optuna.storages.BaseStorage]) -> None:
    study = load_study(study_name, storage_builder)
    XY = np.random.random((50, 2)) * 10 - 5
    for xy in XY:
        study.enqueue_trial({"x": float(xy[0]), "y": float(xy[1])})


def execute(storage_builder: Callable[[], optuna.storages.BaseStorage]) -> None:
    study_name = str(uuid.uuid4())
    enqueue(study_name, storage_builder)
    procs = []
    for i in range(8):
        proc = multiprocessing.Process(target=main, args=(study_name, i, storage_builder))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()


if __name__ == "__main__":
    execute(lambda: optuna.storages.GrpcStorageProxy(host="localhost", port=13000))
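
Because the failure is stochastic, a single run may pass. The sketch below reruns a command several times and counts the runs whose output contains the error message. It assumes the client script is saved as simple_grpc.py (the name that appears in the traceback) and that the proxy server is already running; count_failing_runs is a hypothetical helper. It checks the output rather than the exit code because the parent only joins its children, so a crashed child does not change the parent's exit status.

```python
import subprocess
import sys

MARKER = "Cannot tell a COMPLETE trial"


def count_failing_runs(cmd: list[str], runs: int, marker: str = MARKER) -> int:
    """Run cmd `runs` times and count the runs whose output contains marker."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Tracebacks go to stderr; stdout is checked as well in case output is redirected.
        if marker in result.stderr or marker in result.stdout:
            failures += 1
    return failures


if __name__ == "__main__":
    # Assumed script name; adjust to wherever the client code above is saved.
    n_failed = count_failing_runs([sys.executable, "simple_grpc.py"], runs=20)
    print(f"{n_failed} of 20 runs hit the error")
```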

Additional context (optional)

No response

Labels: bug, needs-discussion