[core] upgrade grpc to 1.58.0 to fix getenv races#61195
MengjinYan merged 2 commits into ray-project:master
Conversation
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Code Review
This pull request correctly upgrades gRPC to version 1.58.0 and its boringssl dependency to address getenv race conditions. The changes in bazel/ray_deps_setup.bzl are accurate. For better long-term maintenance, I recommend two things. First, please update the pull request description with details and links regarding the getenv race fix in gRPC. Second, consider a follow-up change to update the thirdparty/patches/grpc-configurable-thread-count.patch file. This patch still uses std::getenv, and should be modified to use a thread-safe gRPC alternative like gpr_getenv to fully align with the goal of this PR.
A new core dump appeared with grpc 1.58.0: grpc still calls getenv on every client channel creation. This is a bit unfortunate.

The Windows build failures in CI don't seem to be related to this upgrade.

Related thread: open-telemetry/opentelemetry-cpp#3883
I'm confused about this. If we still see getenv on 1.58.0, is there any benefit to upgrading for this bug?
Sorry for the confusion, but yes, we still need to upgrade grpc to solve the first core dump. The second core dump can be fixed by #61281. |
Kicking off the Windows and macOS jobs just in case, since this could have an impact.

Unfortunately, Windows premerge is broken across the board 💀

Windows and macOS tests passed. cc @israbbani and @dayshah
#61281

## Description

Workers can SIGSEGV due to a `getenv`/`setenv` race. The `io_thread_` calls `getenv` during OTel/gRPC exporter init (the `WaitForServerReady` callback), while the main thread calls `setenv` in the accelerator managers (CUDA, Neuron, TPU). POSIX `setenv` is MT-Unsafe, so concurrent access crashes. Previously the callback was fire-and-forget, so the constructor could return before init finished. #61034 and #61195 tried to fix specific `getenv` call sites, but gRPC keeps calling `getenv` internally (`HttpProxyMapper` reads proxy env vars on every channel creation), so the race kept showing up in new coredumps. This PR makes the CoreWorker call site synchronously wait for the callback to finish (via `std::promise`/`std::future`), so all `getenv` completes before any `setenv` can happen.

## Performance

**This PR (sync wait):**
```
task latency (includes driver + 1 worker startup)
Trial 1: first task latency = 0.1907s
Trial 2: first task latency = 0.1669s
Trial 3: first task latency = 0.1726s
Trial 4: first task latency = 0.1737s
Trial 5: first task latency = 0.1731s
Mean: 0.1754s, Std: 0.0081s
```

**Master (baseline):**
```
task latency (includes driver + 1 worker startup)
Trial 1: first task latency = 0.1781s
Trial 2: first task latency = 0.1601s
Trial 3: first task latency = 0.1578s
Trial 4: first task latency = 0.1685s
Trial 5: first task latency = 0.1676s
Mean: 0.1664s, Std: 0.0072s
```

This level of overhead is acceptable to fix the crash.

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
## Description
grpc 1.57.1 calls `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")`
on every grpc channel establishment to parse the load-balancing policy.
Because user tasks are allowed to call setenv at any time, this races
with them. This PR upgrades the grpc lib to 1.58.0, which removes the
`GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")` call.
```
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3 0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4 <signal handler called>
#5 __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44
#6 __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78
#7 __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#8 0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#9 <signal handler called>
#10 __GI_getenv (name=0x7580a6a078c2 "PC_EXPERIMENTAL_PICKFIRST_LB_CONFIG") at ./stdlib/getenv.c:84
#11 0x00007580a67e8b8a in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#12 0x00007580a649601f in grpc_core::ShufflePickFirstEnabled() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#13 0x00007580a64960ed in grpc_core::json_detail::FinishedJsonObjectLoader<grpc_core::(anonymous namespace)::PickFirstConfig, 1ul, void>::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#14 0x00007580a6787384 in grpc_core::json_detail::LoadWrapped::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#15 0x00007580a6497b07 in grpc_core::(anonymous namespace)::PickFirstFactory::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#16 0x00007580a67c18a7 in grpc_core::LoadBalancingPolicyRegistry::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#17 0x00007580a66ad9b8 in grpc_core::ClientChannel::OnResolverResultChangedLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#18 0x00007580a66ae452 in grpc_core::ClientChannel::ResolverResultHandler::ReportResult(grpc_core::Resolver::Result) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#19 0x00007580a63bc603 in grpc_core::PollingResolver::OnRequestCompleteLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#20 0x00007580a63bcb2d in std::_Function_handler<void (), grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
() from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#21 0x00007580a67cbf46 in grpc_core::WorkSerializer::WorkSerializerImpl::Run(std::function<void ()>, grpc_core::DebugLocation const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#22 0x00007580a67cc0ea in grpc_core::WorkSerializer::Run(std::function<void ()>, grpc_core::DebugLocation const&) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#23 0x00007580a63bd117 in grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#24 0x00007580a63b3f86 in grpc_core::(anonymous namespace)::AresClientChannelDNSResolver::AresRequestWrapper::OnHostnameResolved(void*, absl::lts_20230802::Status) ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#25 0x00007580a67c44c4 in grpc_core::ExecCtx::Flush() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#26 0x00007580a63408a2 in grpc_core::ExecCtx::~ExecCtx() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#27 0x00007580a6740343 in grpc_call_start_batch () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#28 0x00007580a5e281e9 in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpRecvInitialMetadata, grpc::internal::CallOpRecvMessage<google::protobuf::MessageLite>, grpc::internal::CallOpClientSendClose, grpc::internal::CallOpClientRecvStatus>::ContinueFillOpsAfterInterception() ()
from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#29 0x00007580a5e2d809 in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#30 0x00007580a62d76ea in opentelemetry::proto::collector::metrics::v1::MetricsService::Stub::Export(grpc::ClientContext*, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest const&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#31 0x00007580a62ca40c in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::DelegateExport(opentelemetry::proto::collector::metrics::v1::MetricsService::StubInterface*, std::unique_ptr<grpc::ClientContext, std::default_delete<grpc::ClientContext> >&&, std::unique_ptr<google::protobuf::Arena, std::default_delete<google::protobuf::Arena> >&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#32 0x00007580a62c23ed in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#33 0x00007580a62c0334 in (anonymous namespace)::OpenTelemetryMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#34 0x00007580a62e5fdf in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}::operator()() const::{lambda(opentelemetry::v1::sdk::metrics::ResourceMetrics&)#1}::operator()(opentelemetry::v1::sdk::metrics::ResourceMetrics&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#35 0x00007580a62ee7a6 in opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics&)>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#36 0x00007580a62e5085 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}> > >::_M_run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#37 0x00007580a6997be0 in execute_native_thread_routine () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#38 0x00007580a7597ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#39 0x00007580a76298d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
This reverts commit d85ed28. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Reverts #61195 to check if cpp ubsan tests pass in postmerge Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ct#61449) Reverts ray-project#61195 to check if cpp ubsan tests pass in postmerge Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
## Description

Reintroduce the grpc upgrade that fixes the setenv/getenv races. Postmerge and premerge tests, including tsan, ubsan, macOS, and Windows, all passed.

## Related issues

#61195

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>