
[core] Sync wait for metrics exporter init to avoid getenv/setenv race #61281

Merged
edoakes merged 5 commits into ray-project:master from Yicheng-Lu-llll:fix-otel-getenv-race-condition on Mar 2, 2026

Conversation

Member

@Yicheng-Lu-llll commented Feb 24, 2026

Description

Workers can SIGSEGV due to a getenv/setenv race. The io_thread_ calls getenv during OTel/gRPC exporter init (WaitForServerReady callback), while the main thread calls setenv in accelerator managers (CUDA, Neuron, TPU). POSIX setenv is MT-Unsafe, so concurrent access crashes.

Previously the callback was fire-and-forget, so the constructor could return before init finished. #61034 and #61195 tried to fix specific getenv call sites, but gRPC keeps calling getenv internally (HttpProxyMapper reads proxy env vars on every channel creation), so the race kept showing up in new coredumps.

This PR just makes the CoreWorker call site synchronously wait for the callback to finish (via std::promise/std::future), so all getenv completes before any setenv can happen.

Performance

This PR (sync wait):

task latency (includes driver + 1 worker startup) 
  Trial 1: first task latency = 0.1907s
  Trial 2: first task latency = 0.1669s
  Trial 3: first task latency = 0.1726s
  Trial 4: first task latency = 0.1737s
  Trial 5: first task latency = 0.1731s
  Mean: 0.1754s, Std: 0.0081s

Master (baseline):

task latency (includes driver + 1 worker startup)
  Trial 1: first task latency = 0.1781s
  Trial 2: first task latency = 0.1601s
  Trial 3: first task latency = 0.1578s
  Trial 4: first task latency = 0.1685s
  Trial 5: first task latency = 0.1676s
  Mean: 0.1664s, Std: 0.0072s

This level of overhead is acceptable to fix the crash.

@Yicheng-Lu-llll requested a review from a team as a code owner on February 24, 2026 11:17
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request addresses a getenv/setenv race condition by synchronously waiting for the metrics exporter to initialize. The use of std::promise and std::future is a good approach for this. The implementation is correct, but I've identified a potential issue where an exception during exporter initialization could cause the main thread to hang. I've added a review comment with a suggestion to make the code more robust.

Signed-off-by: yicheng <yicheng@anyscale.com>
@Yicheng-Lu-llll force-pushed the fix-otel-getenv-race-condition branch from 352a058 to 64fd4f2 on February 24, 2026 11:38
@ray-gardener bot added the community-contribution (Contributed by the community) label Feb 24, 2026
@Yicheng-Lu-llll added the core (Issues that should be addressed in Ray Core) label and removed the community-contribution label Feb 24, 2026
Comment on lines +862 to +876
```cpp
auto metrics_init_promise = std::make_shared<std::promise<void>>();
metrics_init_future = metrics_init_promise->get_future();
metrics_agent_client_->WaitForServerReady(
    [this, metrics_init_promise](const Status &server_status) {
      if (server_status.ok()) {
        stats::ConnectOpenCensusExporter(options_.metrics_agent_port);
        stats::InitOpenTelemetryExporter(options_.metrics_agent_port);
      } else {
        RAY_LOG(ERROR)
            << "Failed to establish connection to the metrics exporter agent. "
               "Metrics will not be exported. "
            << "Exporter agent status: " << server_status.ToString();
      }
      metrics_init_promise->set_value();
    });
```
Contributor


Can you make WaitForServerReady synchronous instead? It's unnecessary complexity to call an async method and then block.

Member Author


Considered this, but the underlying RPC (HealthCheck) is async with callback-based retries, so making it synchronous would be a big change. This async + promise/future pattern seems to be a standard way to bridge async RPCs to sync callers in Ray; see InitializeSystemConfig, GlobalStateAccessor::GetAllJobInfo, NodeInfoAccessor::CheckAlive, etc.

Member Author


We have removed WaitForServerReady and made it fully synchronous! See #61281 (comment)

@Yicheng-Lu-llll
Member Author

I have found in many tests that metrics actually fail to export. This is an issue because we now wait a long time until retries are exhausted. Investigating.

```python
# same event loop, and ray.init() needs to health
# check that server. Running it on the event loop
# thread would self deadlock.
await asyncio.to_thread(
```
Member Author


Found that the dashboard agent hosts both its gRPC server and the JobAgent in the same event loop. When ray job submit arrives, JobAgent calls ray.init() (to create the JobSupervisor actor), which blocks the event loop waiting for the HealthCheck to succeed, but the gRPC server can't respond because the event loop is blocked. This is a self-deadlock.

Fixed by running ray.init() in a thread pool and releasing the GIL during CoreWorkerProcess::Initialize() so the event loop thread can process the HealthCheck.

…rker init and running ray.init in a thread pool

Signed-off-by: yicheng <yicheng@anyscale.com>
@Yicheng-Lu-llll force-pushed the fix-otel-getenv-race-condition branch from 6140536 to 7496dd7 on February 25, 2026 03:25
…orker

Signed-off-by: yicheng <yicheng@anyscale.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@Yicheng-Lu-llll
Member Author

Summary of discussion with @rueian offline:

The root issue is a getenv/setenv race condition (POSIX setenv is MT-Unsafe). The fix is to initialize the metrics exporter synchronously.

However, after #55152, CoreWorker first tries to connect to the metrics agent via HealthCheck before initializing the exporter. The purpose was to avoid initializing the OTel exporter too early, which would produce noisy OTel warning logs (metrics were still lost, just without warning logs from OTel). That said, that PR itself noted this wasn't a problem for CoreWorker but mainly for GCS.

The HealthCheck introduced a new problem: Dashboard Agent hosts both its gRPC server and the JobAgent in the same asyncio event loop. When ray job submit arrives, JobAgent calls ray.init() to create the JobSupervisor actor, which blocks the event loop waiting for the HealthCheck to succeed. But the gRPC server can't respond to the HealthCheck because the event loop is already blocked. This is a self-deadlock.

In practice, with or without HealthCheck, metrics will be dropped at startup either way. gRPC channels have built-in reconnect, so the exporter will eventually connect on its own. The HealthCheck only reduces some OTel warning logs during that brief window.

So the simplest fix is to just skip the HealthCheck in CoreWorker and initialize the exporter synchronously and directly.

Signed-off-by: yicheng <yicheng@anyscale.com>
```cpp
});
// Initialize exporters synchronously to avoid a getenv/setenv race condition.
// POSIX setenv is MT-Unsafe.
stats::ConnectOpenCensusExporter(options_.metrics_agent_port);
```
Member Author


If the server is not OK, OTel itself will emit warning messages.

@Yicheng-Lu-llll added the go (add ONLY when ready to merge, run all tests) label Feb 25, 2026
@Yicheng-Lu-llll
Member Author

Ran the following perf tests and compared to master; didn't see a regression: https://buildkite.com/ray-project/release/builds/81809/steps/canvas

name:^microbenchmark\.aws$
name:^many_actors\.aws$
name:^many_tasks\.aws$

@edoakes edoakes merged commit c73d04e into ray-project:master Mar 2, 2026
7 checks passed
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026
ayushk7102 pushed a commit to ayushk7102/ray that referenced this pull request Mar 6, 2026
michelle-tan-ai added a commit to Applied-Shared/ray that referenced this pull request Mar 19, 2026
…(LDMP-4900)

The async WaitForServerReady callback ran InitOpenTelemetryExporter on the
io_service thread, where gRPC's HttpProxyMapper calls getenv() for proxy
settings. This races with setenv() calls from Python runtime_env on the
main thread, causing SIGSEGV in getenv() (glibc getenv is MT-Unsafe when
setenv reallocs the environ array).

Fix: call InitOpenTelemetryExporter synchronously on the main thread before
worker threads start, eliminating the race. Mirrors upstream ray-project/ray
PR ray-project#61281.

Linear: LDMP-4900

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
michelle-tan-ai added a commit to Applied-Shared/ray that referenced this pull request Mar 19, 2026

Labels

core Issues that should be addressed in Ray Core
go add ONLY when ready to merge, run all tests

4 participants