Add wait_server_ready method in GrpcStorageProxy #6010

Merged

HideakiImamura merged 28 commits into optuna:master from hitsgub:wait-grpc-server on Mar 28, 2025

Conversation

@hitsgub (Contributor) commented Mar 13, 2025

Motivation

The GrpcStorageProxy class introduced in v4.2.0 has no mechanism to wait for the gRPC server to start, so users have to write that waiting logic themselves. This PR changes the GrpcStorageProxy class to wait for the server during initialization.

Description of the changes

Use the grpc.channel_ready_future function to wait for the server to become ready.
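Before this change, users had to hand-roll the readiness wait themselves. A minimal stdlib-only sketch of that pattern (wait_for_port is hypothetical, not part of Optuna; it polls a raw TCP port rather than using gRPC's channel_ready_future):

```python
# A stdlib-only sketch of the readiness wait users previously wrote by hand;
# wait_for_port is hypothetical and polls a raw TCP port rather than using
# gRPC's channel_ready_future.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> None:
    """Block until a TCP server accepts connections on (host, port)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return  # The server accepted a connection, so it is up.
        except OSError:
            if time.monotonic() >= deadline:
                raise ConnectionError(f"server at {host}:{port} not ready")
            time.sleep(0.1)  # Brief backoff before retrying.
```

This is essentially the retry loop the PR moves into GrpcStorageProxy itself, so callers no longer need it.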

@y0z added the enhancement label (Change that does not break compatibility and does not affect public interfaces, but improves performance.) on Mar 14, 2025
@y0z (Member) commented Mar 14, 2025

@HideakiImamura @c-bata
Could you review this PR?

@HideakiImamura (Member) commented:

I confirmed that the failed test cases pass in my local environment (Apple M2 chip, macOS 14.6.1). I used the same Python version as the CI (Python 3.12). The CI uses an Apple M1 chip and macOS 14 according to https://github.com/actions/runner-images/tree/main. I would like to investigate the reason for the CI failure.

@toshihikoyanase (Member) commented:

I reproduced the issue on my M1 MacBook. When running tests, the gRPC Storage Proxy is repeatedly started on the same port, which leads to a port unavailability error. This seems to be due to the server not shutting down gracefully.

I verified that implementing a graceful shutdown (by waiting for the shutdown event) resolves the problem on my machine. Please review the changes and let me know your thoughts.

diff --git a/optuna/testing/storages.py b/optuna/testing/storages.py
index 1893994f2..819339e61 100644
--- a/optuna/testing/storages.py
+++ b/optuna/testing/storages.py
@@ -121,7 +121,8 @@ class StorageSupplier:

         if self.server:
             assert self.thread is not None
-            self.server.stop(None)
+            shutdown_event = self.server.stop(0.1)
+            shutdown_event.wait()
             self.thread.join()
             self.server = None
             self.thread = None
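The reason the diff waits on the returned event: grpc.Server.stop(grace) only initiates shutdown and returns a threading.Event that is set once shutdown actually completes. A stdlib-only mimic of those semantics (FakeServer is hypothetical, for illustration only):

```python
# Hypothetical stand-in for grpc.Server, illustrating why the diff waits on
# the Event returned by stop(): stop() only *initiates* shutdown, and the
# Event is set once shutdown has actually completed.
from __future__ import annotations

import threading
import time

class FakeServer:
    def __init__(self) -> None:
        self._done = threading.Event()

    def stop(self, grace: float | None) -> threading.Event:
        def _shutdown() -> None:
            if grace:
                time.sleep(grace)  # Let in-flight RPCs drain.
            # A real server would release its listening port here.
            self._done.set()

        threading.Thread(target=_shutdown, daemon=True).start()
        return self._done

server = FakeServer()
shutdown_event = server.stop(0.1)
shutdown_event.wait()  # Without this wait, the port may still be bound.
```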

@toshihikoyanase (Member) commented Mar 19, 2025

I reconsidered my approach and examined the gRPC clients rather than the gRPC servers. I tested the change in my forked repository and found that the mac-test job ran as expected:
https://github.com/toshihikoyanase/optuna/actions/runs/13943424624/job/39025049106?pr=3

The main change: it appears that GrpcStorageProxy._cache and GrpcStorageProxy._stub were retaining a gRPC client. I removed them because I suspect that the timing of their deletion was causing issues, possibly by keeping connections open. Consequently, a graceful shutdown is not necessary, and port 13000 can be reused continuously.

toshihikoyanase@819737b

diff --git a/optuna/testing/storages.py b/optuna/testing/storages.py
index 21ad2edf6..8985b64e9 100644
--- a/optuna/testing/storages.py
+++ b/optuna/testing/storages.py
@@ -49,6 +49,7 @@ class StorageSupplier:
         self.tempfile: IO[Any] | None = None
         self.server: grpc.Server | None = None
         self.thread: threading.Thread | None = None
+        self.proxy: GrpcStorageProxy | None = None
 
     def __enter__(
         self,
@@ -100,13 +101,13 @@ class StorageSupplier:
             self.thread = threading.Thread(target=self.server.start)
             self.thread.start()
 
-            proxy = GrpcStorageProxy(host="localhost", port=port)
+            self.proxy = GrpcStorageProxy(host="localhost", port=port)
 
             # Wait until the server is ready.
             while True:
                 try:
-                    proxy.get_all_studies()
-                    return proxy
+                    self.proxy.get_all_studies()
+                    return self.proxy
                 except grpc.RpcError:
                     time.sleep(1)
                     continue
@@ -119,18 +120,24 @@ class StorageSupplier:
         if self.tempfile:
             self.tempfile.close()
 
+        if self.proxy:
+            del self.proxy._cache
+            del self.proxy._stub
+            # del self.proxy  # In my environment, no need to delete this instance.
+            self.proxy = None
+
         if self.server:
             assert self.thread is not None
-            self.server.stop(5).wait()
+            self.server.stop(None)
             self.thread.join()
             self.server = None
             self.thread = None

@toshihikoyanase (Member) commented:

It seems a bit strange that only the test failed, so I reviewed the main code and noticed that the channel might not be closing properly.

In this PR's changes, insecure_channel is used within a with statement, so the channel is closed when the scope is exited:

class GrpcStorageProxy(BaseStorage):
    def __init__(self, *, host: str = "localhost", port: int = 13000, timeout: int = 900) -> None:
        def wait_server(host: str, port: int, timeout: int) -> None:
            try:
                with grpc.insecure_channel(
                    f"{host}:{port}", options=[("grpc.max_receive_message_length", -1)]
                ) as channel:
                    grpc.channel_ready_future(channel).result(timeout=timeout)
            except grpc.FutureTimeoutError as e:
                raise ConnectionError("GRPC connection timeout") from e

However, the channel specified for _stub below does not seem to be closed:

self._stub = api_pb2_grpc.StorageServiceStub(
    grpc.insecure_channel(
        f"{host}:{port}",
        options=[("grpc.max_receive_message_length", -1)],
    )
)  # type: ignore

Typically, there might not be issues since we don't reconnect many times to the same port in a short period, and it might be acceptable to leave it to garbage collection. However, it might be problematic depending on the implementation, for example, in the case of mac-test CI jobs.

@hitsgub (Contributor, Author) commented Mar 21, 2025

Thank you. To close the channel more reliably, I changed the code to use a with statement at each point where the stub is used.
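A sketch of that per-call pattern, assuming grpcio is installed (call_with_fresh_channel and make_request are illustrative names, not Optuna's actual code):

```python
# Sketch of the "channel per call" pattern: the with statement guarantees
# the channel is closed even if the RPC raises. call_with_fresh_channel and
# make_request are illustrative, not Optuna's actual code.
try:
    import grpc
except ImportError:  # Keep this sketch importable without grpcio installed.
    grpc = None

def call_with_fresh_channel(host: str, port: int, make_request):
    """Open a channel, run one request against it, and close the channel."""
    if grpc is None:
        raise RuntimeError("grpcio is required to run this sketch")
    with grpc.insecure_channel(
        f"{host}:{port}",
        options=[("grpc.max_receive_message_length", -1)],
    ) as channel:
        # Real code would wrap the channel in a generated stub, e.g.
        # api_pb2_grpc.StorageServiceStub(channel), and call an RPC on it.
        return make_request(channel)
```

Opening a channel per call trades some connection-setup overhead for deterministic cleanup, which is the concern raised above.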

@HideakiImamura (Member) left a comment:

Thanks for the PR and great efforts. I have a minor comment. PTAL.

@HideakiImamura (Member) left a comment:

Thanks for your swift actions! LGTM!

@HideakiImamura removed their assignment on Mar 21, 2025
@hitsgub (Contributor, Author) commented Mar 25, 2025

trying to re-use stubs and channels

https://grpc.io/docs/guides/performance/
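The linked guide recommends keeping channels and stubs long-lived rather than opening one per RPC. A minimal sketch of that reuse pattern (ReusableChannelClient is an illustrative name, not Optuna's actual class; assumes grpcio):

```python
# Sketch of the long-lived "reuse channels and stubs" pattern recommended by
# the gRPC performance guide. ReusableChannelClient is an illustrative name,
# not Optuna's actual class.
try:
    import grpc
except ImportError:  # Keep this sketch importable without grpcio installed.
    grpc = None

class ReusableChannelClient:
    """Keep one channel (and, in real code, one stub) for the client's
    lifetime, and close the channel explicitly instead of relying on GC."""

    def __init__(self, host: str = "localhost", port: int = 13000) -> None:
        if grpc is None:
            raise RuntimeError("grpcio is required to run this sketch")
        self._channel = grpc.insecure_channel(
            f"{host}:{port}",
            options=[("grpc.max_receive_message_length", -1)],
        )
        # Real code: self._stub = api_pb2_grpc.StorageServiceStub(self._channel)

    def close(self) -> None:
        self._channel.close()  # Release the connection deterministically.
```

Reuse avoids repeated connection setup; the explicit close() still addresses the earlier concern about channels being left to garbage collection.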

@c-bata (Member) left a comment:

Thank you for your pull request. I left some early feedback comments.

@c-bata c-bata mentioned this pull request Mar 26, 2025
@c-bata (Member) left a comment:

Changes almost look great to me 💯 I left some minor suggestions though.

hitsgub and others added 4 commits March 26, 2025 17:27
Co-authored-by: c-bata <contact@c-bata.link>
Co-authored-by: c-bata <contact@c-bata.link>
@c-bata (Member) left a comment:

LGTM after CI passes!

@hitsgub Could you update the PR title as well? "Add wait_server_ready method in GrpcStorageProxy" might be appropriate.

@hitsgub changed the title from "wait until the grpc server is up in GrpcStorageProxy.__init__()" to "Add wait_server_ready method in GrpcStorageProxy" on Mar 26, 2025
@c-bata removed their assignment on Mar 26, 2025
@HideakiImamura (Member) left a comment:

Thanks for the update. LGTM!

@HideakiImamura merged commit d26a1d2 into optuna:master on Mar 28, 2025
14 checks passed
@HideakiImamura added this to the v4.3.0 milestone on Mar 28, 2025
@hitsgub deleted the wait-grpc-server branch on March 28, 2025 09:04