
[core] Sync wait for metrics exporter init to avoid getenv/setenv race #61281

Merged
edoakes merged 5 commits into ray-project:master from Yicheng-Lu-llll:fix-otel-getenv-race-condition on Mar 2, 2026

Conversation

Member

@Yicheng-Lu-llll commented Feb 24, 2026

Description

Workers can SIGSEGV due to a getenv/setenv race. The io_thread_ calls getenv during OTel/gRPC exporter init (WaitForServerReady callback), while the main thread calls setenv in accelerator managers (CUDA, Neuron, TPU). POSIX setenv is MT-Unsafe, so concurrent access crashes.

Previously the callback was fire-and-forget, so the constructor could return before init finished. #61034 and #61195 tried to fix specific getenv call sites, but gRPC keeps calling getenv internally (HttpProxyMapper reads proxy env vars on every channel creation), so the race kept showing up in new coredumps.

This PR just makes the CoreWorker call site synchronously wait for the callback to finish (via std::promise/std::future), so all getenv completes before any setenv can happen.

Performance

This PR (sync wait):

task latency (includes driver + 1 worker startup) 
  Trial 1: first task latency = 0.1907s
  Trial 2: first task latency = 0.1669s
  Trial 3: first task latency = 0.1726s
  Trial 4: first task latency = 0.1737s
  Trial 5: first task latency = 0.1731s
  Mean: 0.1754s, Std: 0.0081s

Master (baseline):

task latency (includes driver + 1 worker startup)
  Trial 1: first task latency = 0.1781s
  Trial 2: first task latency = 0.1601s
  Trial 3: first task latency = 0.1578s
  Trial 4: first task latency = 0.1685s
  Trial 5: first task latency = 0.1676s
  Mean: 0.1664s, Std: 0.0072s

This level of overhead is acceptable to fix the crash.

@Yicheng-Lu-llll requested a review from a team as a code owner on February 24, 2026 11:17
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request addresses a getenv/setenv race condition by synchronously waiting for the metrics exporter to initialize. The use of std::promise and std::future is a good approach for this. The implementation is correct, but I've identified a potential issue where an exception during exporter initialization could cause the main thread to hang. I've added a review comment with a suggestion to make the code more robust.

Signed-off-by: yicheng <yicheng@anyscale.com>
@Yicheng-Lu-llll force-pushed the fix-otel-getenv-race-condition branch from 352a058 to 64fd4f2 on February 24, 2026 11:38
@ray-gardener bot added the community-contribution (Contributed by the community) label Feb 24, 2026
@Yicheng-Lu-llll added the core (Issues that should be addressed in Ray Core) label and removed the community-contribution label Feb 24, 2026
Comment on lines +862 to +876
```cpp
auto metrics_init_promise = std::make_shared<std::promise<void>>();
metrics_init_future = metrics_init_promise->get_future();
metrics_agent_client_->WaitForServerReady(
    [this, metrics_init_promise](const Status &server_status) {
      if (server_status.ok()) {
        stats::ConnectOpenCensusExporter(options_.metrics_agent_port);
        stats::InitOpenTelemetryExporter(options_.metrics_agent_port);
      } else {
        RAY_LOG(ERROR)
            << "Failed to establish connection to the metrics exporter agent. "
               "Metrics will not be exported. "
            << "Exporter agent status: " << server_status.ToString();
      }
      metrics_init_promise->set_value();
    });
```
Contributor


Can you make WaitForServerReady synchronous instead? It's unnecessary complexity to call an async method and then block.

Member Author


Considered this, but the underlying RPC (HealthCheck) is async with callback-based retries, so making it synchronous would be a big change. This async + promise/future pattern seems to be a standard way to bridge async RPCs to sync callers in Ray; see InitializeSystemConfig, GlobalStateAccessor::GetAllJobInfo, NodeInfoAccessor::CheckAlive, etc.

Member Author


We have removed WaitForServerReady and made it fully synchronous! See #61281 (comment)

@Yicheng-Lu-llll
Member Author

I have found in many tests that metrics actually fail to export. This is an issue because we now wait a long time until retries are exhausted. Investigating.

```python
# same event loop, and ray.init() needs to health
# check that server. Running it on the event loop
# thread would self deadlock.
await asyncio.to_thread(
```
Member Author


Found that the dashboard agent hosts both its gRPC server and the JobAgent in the same event loop. When ray job submit arrives, JobAgent calls ray.init() (to create the JobSupervisor actor), which blocks the event loop waiting for the HealthCheck to succeed, but the gRPC server can't respond because the event loop is blocked. This is a self-deadlock.

Fixed by running ray.init() in a thread pool and releasing the GIL during CoreWorkerProcess::Initialize() so the event loop thread can process the HealthCheck.

…rker init and running ray.init in a thread pool

Signed-off-by: yicheng <yicheng@anyscale.com>
@Yicheng-Lu-llll force-pushed the fix-otel-getenv-race-condition branch from 6140536 to 7496dd7 on February 25, 2026 03:25
…orker

Signed-off-by: yicheng <yicheng@anyscale.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@Yicheng-Lu-llll
Member Author

Summary of discussion with @rueian offline:

The root issue is a getenv/setenv race condition (POSIX setenv is MT-Unsafe). The fix is to initialize the metrics exporter synchronously.

However, after #55152, CoreWorker first tries to connect to the metrics agent via HealthCheck before initializing the exporter. The purpose was to avoid initializing the OTel exporter too early, which would produce noisy OTel warning logs (metrics were still lost, just without warning logs from OTel). That said, that PR itself noted this wasn't a problem for CoreWorker but mainly for GCS.

The HealthCheck introduced a new problem: Dashboard Agent hosts both its gRPC server and the JobAgent in the same asyncio event loop. When ray job submit arrives, JobAgent calls ray.init() to create the JobSupervisor actor, which blocks the event loop waiting for the HealthCheck to succeed. But the gRPC server can't respond to the HealthCheck because the event loop is already blocked. This is a self-deadlock.

In practice, with or without HealthCheck, metrics will be dropped at startup either way. gRPC channels have built-in reconnect, so the exporter will eventually connect on its own. The HealthCheck only reduces some OTel warning logs during that brief window.

So the simplest fix is to just skip the HealthCheck in CoreWorker and initialize the exporter synchronously and directly.

Signed-off-by: yicheng <yicheng@anyscale.com>
```cpp
});
// Initialize exporters synchronously to avoid a getenv/setenv race condition.
// POSIX setenv is MT-Unsafe.
stats::ConnectOpenCensusExporter(options_.metrics_agent_port);
```
Member Author


If the server is not OK, OTel itself will emit warning messages.

@Yicheng-Lu-llll added the go (add ONLY when ready to merge, run all tests) label Feb 25, 2026
@Yicheng-Lu-llll
Member Author

Ran the following perf tests and compared to master; didn't see a regression: https://buildkite.com/ray-project/release/builds/81809/steps/canvas

name:^microbenchmark\.aws$
name:^many_actors\.aws$
name:^many_tasks\.aws$

@edoakes edoakes merged commit c73d04e into ray-project:master Mar 2, 2026
7 checks passed
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026
ayushk7102 pushed a commit to ayushk7102/ray that referenced this pull request Mar 6, 2026
michelle-tan-ai added a commit to Applied-Shared/ray that referenced this pull request Mar 19, 2026
…(LDMP-4900)

The async WaitForServerReady callback ran InitOpenTelemetryExporter on the
io_service thread, where gRPC's HttpProxyMapper calls getenv() for proxy
settings. This races with setenv() calls from Python runtime_env on the
main thread, causing SIGSEGV in getenv() (glibc getenv is MT-Unsafe when
setenv reallocs the environ array).

Fix: call InitOpenTelemetryExporter synchronously on the main thread before
worker threads start, eliminating the race. Mirrors upstream ray-project/ray
PR ray-project#61281.

Linear: LDMP-4900

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
michelle-tan-ai added a commit to Applied-Shared/ray that referenced this pull request Mar 19, 2026

Labels

core Issues that should be addressed in Ray Core
go add ONLY when ready to merge, run all tests

4 participants