[core] Sync wait for metrics exporter init to avoid getenv/setenv race#61281
Conversation
Code Review
This pull request addresses a getenv/setenv race condition by synchronously waiting for the metrics exporter to initialize. The use of std::promise and std::future is a good approach for this. The implementation is correct, but I've identified a potential issue where an exception during exporter initialization could cause the main thread to hang. I've added a review comment with a suggestion to make the code more robust.
Signed-off-by: yicheng <yicheng@anyscale.com>
Force-pushed from 352a058 to 64fd4f2
```cpp
auto metrics_init_promise = std::make_shared<std::promise<void>>();
metrics_init_future = metrics_init_promise->get_future();
metrics_agent_client_->WaitForServerReady(
    [this, metrics_init_promise](const Status &server_status) {
      if (server_status.ok()) {
        stats::ConnectOpenCensusExporter(options_.metrics_agent_port);
        stats::InitOpenTelemetryExporter(options_.metrics_agent_port);
      } else {
        RAY_LOG(ERROR)
            << "Failed to establish connection to the metrics exporter agent. "
               "Metrics will not be exported. "
            << "Exporter agent status: " << server_status.ToString();
      }
      metrics_init_promise->set_value();
    });
```
Can you make WaitForServerReady synchronous instead? It's unnecessary complexity to call an async method and then block.
Considered this, but the underlying RPC (HealthCheck) is async with callback-based retries, so making it synchronous would be a big change. This async + promise/future pattern seems to be the standard way to bridge async RPCs to sync callers in Ray; see InitializeSystemConfig, GlobalStateAccessor::GetAllJobInfo, NodeInfoAccessor::CheckAlive, etc.
We have removed WaitForServerReady and made it fully synchronous! See #61281 (comment)
So, I have found in many tests that metrics actually failed to export. This is an issue because we now wait for a long time until retries are exhausted. Investigating.
```python
# same event loop, and ray.init() needs to health
# check that server. Running it on the event loop
# thread would self deadlock.
await asyncio.to_thread(
```
Found that the dashboard agent hosts both its gRPC server and the JobAgent in the same event loop. When `ray job submit` arrives, JobAgent calls ray.init() (to create the JobSupervisor actor), which blocks the event loop waiting for the HealthCheck to succeed, but the gRPC server can't respond because the event loop is blocked. This is a self-deadlock.
Fixed by running ray.init() in a thread pool and releasing the GIL during CoreWorkerProcess::Initialize() so the event loop thread can process the HealthCheck.
…rker init and running ray.init in a thread pool Signed-off-by: yicheng <yicheng@anyscale.com>
Force-pushed from 6140536 to 7496dd7
…orker Signed-off-by: yicheng <yicheng@anyscale.com>
Summary of discussion with @rueian offline: The root issue is a getenv/setenv race condition (POSIX setenv is MT-Unsafe). The fix is to initialize the metrics exporter synchronously. However, after #55152, CoreWorker first tries to connect to the metrics agent via `WaitForServerReady`, which issues an async HealthCheck RPC with callback-based retries. In practice, with or without the HealthCheck, metrics will be dropped at startup either way. gRPC channels have built-in reconnect, so the exporter will eventually connect on its own. The HealthCheck only reduces some OTel warning logs during that brief window. So the simplest fix is to just skip the HealthCheck.
Signed-off-by: yicheng <yicheng@anyscale.com>
```cpp
});
// Initialize exporters synchronously to avoid a getenv/setenv race condition.
// POSIX setenv is MT-Unsafe.
stats::ConnectOpenCensusExporter(options_.metrics_agent_port);
```
If the server is not OK, OTel itself will pop up warning messages.
Ran the following perf test and compared to master; didn't see a regression: https://buildkite.com/ray-project/release/builds/81809/steps/canvas
#61281)

## Description

Workers can SIGSEGV due to a `getenv`/`setenv` race. The `io_thread_` calls `getenv` during OTel/gRPC exporter init (`WaitForServerReady` callback), while the main thread calls `setenv` in accelerator managers (CUDA, Neuron, TPU). POSIX `setenv` is MT-Unsafe, so concurrent access crashes. Previously the callback was fire-and-forget, so the constructor could return before init finished.

#61034 and #61195 tried to fix specific `getenv` call sites, but gRPC keeps calling `getenv` internally (`HttpProxyMapper` reads proxy env vars on every channel creation), so the race kept showing up in new coredumps. This PR just makes the CoreWorker call site synchronously wait for the callback to finish (via `std::promise`/`std::future`), so all `getenv` completes before any `setenv` can happen.

## Performance

**This PR (sync wait):**

```
task latency (includes driver + 1 worker startup)
Trial 1: first task latency = 0.1907s
Trial 2: first task latency = 0.1669s
Trial 3: first task latency = 0.1726s
Trial 4: first task latency = 0.1737s
Trial 5: first task latency = 0.1731s
Mean: 0.1754s, Std: 0.0081s
```

**Master (baseline):**

```
task latency (includes driver + 1 worker startup)
Trial 1: first task latency = 0.1781s
Trial 2: first task latency = 0.1601s
Trial 3: first task latency = 0.1578s
Trial 4: first task latency = 0.1685s
Trial 5: first task latency = 0.1676s
Mean: 0.1664s, Std: 0.0072s
```

This level of overhead is acceptable to fix the crash.

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
…(LDMP-4900)

The async WaitForServerReady callback ran InitOpenTelemetryExporter on the io_service thread, where gRPC's HttpProxyMapper calls getenv() for proxy settings. This races with setenv() calls from Python runtime_env on the main thread, causing SIGSEGV in getenv() (glibc getenv is MT-Unsafe when setenv reallocates the environ array).

Fix: call InitOpenTelemetryExporter synchronously on the main thread before worker threads start, eliminating the race. Mirrors upstream ray-project/ray PR ray-project#61281.

Linear: LDMP-4900
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>