
[core] upgrade grpc to 1.58.0 to fix getenv races#61195

Merged
MengjinYan merged 2 commits into ray-project:master from rueian:grpc1580
Feb 28, 2026

Conversation

rueian (Contributor) commented Feb 20, 2026

Description

grpc 1.57.1 calls `GetEnv("GRPC_EXPERIMENTAL_PICKFIRST_LB_CONFIG")` on every gRPC channel establishment while parsing the load-balancing policy. Because user tasks may call `setenv` at any time, and POSIX `getenv`/`setenv` are unsafe to run concurrently, this races with them and can crash the worker. This PR upgrades the gRPC lib to 1.58.0, which no longer makes that `GetEnv` call.

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __pthread_kill_implementation (no_tid=0, signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:44
#6  __pthread_kill_internal (signo=11, threadid=129183804413504) at ./nptl/pthread_kill.c:78
#7  __GI___pthread_kill (threadid=129183804413504, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#8  0x00007580a7545476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#9  <signal handler called>
#10 __GI_getenv (name=0x7580a6a078c2 "PC_EXPERIMENTAL_PICKFIRST_LB_CONFIG") at ./stdlib/getenv.c:84
#11 0x00007580a67e8b8a in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#12 0x00007580a649601f in grpc_core::ShufflePickFirstEnabled() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#13 0x00007580a64960ed in grpc_core::json_detail::FinishedJsonObjectLoader<grpc_core::(anonymous namespace)::PickFirstConfig, 1ul, void>::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#14 0x00007580a6787384 in grpc_core::json_detail::LoadWrapped::LoadInto(grpc_core::experimental::Json const&, grpc_core::JsonArgs const&, void*, grpc_core::ValidationErrors*) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#15 0x00007580a6497b07 in grpc_core::(anonymous namespace)::PickFirstFactory::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#16 0x00007580a67c18a7 in grpc_core::LoadBalancingPolicyRegistry::ParseLoadBalancingConfig(grpc_core::experimental::Json const&) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#17 0x00007580a66ad9b8 in grpc_core::ClientChannel::OnResolverResultChangedLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#18 0x00007580a66ae452 in grpc_core::ClientChannel::ResolverResultHandler::ReportResult(grpc_core::Resolver::Result) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#19 0x00007580a63bc603 in grpc_core::PollingResolver::OnRequestCompleteLocked(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#20 0x00007580a63bcb2d in std::_Function_handler<void (), grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
    () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#21 0x00007580a67cbf46 in grpc_core::WorkSerializer::WorkSerializerImpl::Run(std::function<void ()>, grpc_core::DebugLocation const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#22 0x00007580a67cc0ea in grpc_core::WorkSerializer::Run(std::function<void ()>, grpc_core::DebugLocation const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#23 0x00007580a63bd117 in grpc_core::PollingResolver::OnRequestComplete(grpc_core::Resolver::Result) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#24 0x00007580a63b3f86 in grpc_core::(anonymous namespace)::AresClientChannelDNSResolver::AresRequestWrapper::OnHostnameResolved(void*, absl::lts_20230802::Status) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#25 0x00007580a67c44c4 in grpc_core::ExecCtx::Flush() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#26 0x00007580a63408a2 in grpc_core::ExecCtx::~ExecCtx() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#27 0x00007580a6740343 in grpc_call_start_batch () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#28 0x00007580a5e281e9 in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpRecvInitialMetadata, grpc::internal::CallOpRecvMessage<google::protobuf::MessageLite>, grpc::internal::CallOpClientSendClose, grpc::internal::CallOpClientRecvStatus>::ContinueFillOpsAfterInterception() ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#29 0x00007580a5e2d809 in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::inte--Type <RET> for more, q to quit, c to continue without paging--c
rnal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#30 0x00007580a62d76ea in opentelemetry::proto::collector::metrics::v1::MetricsService::Stub::Export(grpc::ClientContext*, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest const&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#31 0x00007580a62ca40c in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::DelegateExport(opentelemetry::proto::collector::metrics::v1::MetricsService::StubInterface*, std::unique_ptr<grpc::ClientContext, std::default_delete<grpc::ClientContext> >&&, std::unique_ptr<google::protobuf::Arena, std::default_delete<google::protobuf::Arena> >&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#32 0x00007580a62c23ed in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#33 0x00007580a62c0334 in (anonymous namespace)::OpenTelemetryMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#34 0x00007580a62e5fdf in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}::operator()() const::{lambda(opentelemetry::v1::sdk::metrics::ResourceMetrics&)#1}::operator()(opentelemetry::v1::sdk::metrics::ResourceMetrics&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#35 0x00007580a62ee7a6 in opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics&)>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#36 0x00007580a62e5085 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}> > >::_M_run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#37 0x00007580a6997be0 in execute_native_thread_routine () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#38 0x00007580a7597ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#39 0x00007580a76298d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
gemini-code-assist bot left a comment:

Code Review

This pull request correctly upgrades gRPC to version 1.58.0 and its boringssl dependency to address getenv race conditions. The changes in bazel/ray_deps_setup.bzl are accurate. For better long-term maintenance, I recommend two things. First, please update the pull request description with details and links regarding the getenv race fix in gRPC. Second, consider a follow-up change to update the thirdparty/patches/grpc-configurable-thread-count.patch file. This patch still uses std::getenv, and should be modified to use a thread-safe gRPC alternative like gpr_getenv to fully align with the goal of this PR.

rueian added the core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) labels on Feb 20, 2026.
rueian marked this pull request as ready for review on Feb 20, 2026 at 08:41.
rueian (Contributor, Author) commented Feb 21, 2026

A new core dump with gRPC 1.58.0: gRPC still calls getenv on every client channel creation. This is a bit unfortunate.

Core was generated by `ray::IDLE                                                                     '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x76c5f0ff9640 (LWP 401864))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=130592523916864, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x000076c618f9d476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __pthread_kill_implementation (no_tid=0, signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:44
#6  __pthread_kill_internal (signo=11, threadid=130592523916864) at ./nptl/pthread_kill.c:78
#7  __GI___pthread_kill (threadid=130592523916864, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#8  0x000076c618f9d476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#9  <signal handler called>
#10 __GI_getenv (name=0x76c61840a863 "pc_proxy") at ./stdlib/getenv.c:84
#11 0x000076c61823e9ea in grpc_core::GetEnv(char const*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#12 0x000076c6180fa0cf in grpc_core::HttpProxyMapper::MapName(std::basic_string_view<char, std::char_traits<char> >, grpc_core::ChannelArgs*) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#13 0x000076c61822315b in grpc_core::ProxyMapperRegistry::MapName(std::basic_string_view<char, std::char_traits<char> >, grpc_core::ChannelArgs*) const ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#14 0x000076c6180e8e1a in grpc_core::ClientChannel::ClientChannel(grpc_channel_element_args*, absl::lts_20230802::Status*) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#15 0x000076c6180e9707 in grpc_core::ClientChannel::Init(grpc_channel_element*, grpc_channel_element_args*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#16 0x000076c618134b01 in grpc_channel_stack_init(int, void (*)(void*, absl::lts_20230802::Status), void*, grpc_channel_filter const**, unsigned long, grpc_core::ChannelArgs const&, char const*, grpc_channel_stack*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#17 0x000076c6181369be in grpc_core::ChannelStackBuilderImpl::Build() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#18 0x000076c618164e28 in grpc_core::Channel::CreateWithBuilder(grpc_core::ChannelStackBuilder*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#19 0x000076c6181656df in grpc_core::Channel::Create(char const*, grpc_core::ChannelArgs, grpc_channel_stack_type, grpc_transport*) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#20 0x000076c617ef8277 in grpc_channel_create () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#21 0x000076c617d8815e in grpc::(anonymous namespace)::InsecureChannelCredentialsImpl::CreateChannelWithInterceptors(std::string const&, grpc::ChannelArguments const&, std::vector<std::unique_ptr<grpc::experimental::ClientInterceptorFactoryInterface, std::default_delete<grpc::experimental::ClientInterceptorFactoryInterface> >, std::allocator<std::unique_ptr<grpc::experimental::ClientInterceptorFactoryInterface, std::default_delete<grpc::experimental::ClientInterceptorFactoryInterface> > > >) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#22 0x000076c617d88274 in grpc::(anonymous namespace)::InsecureChannelCredentialsImpl::CreateChannelImpl(std::string const&, grpc::ChannelArguments const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#23 0x000076c617d863a9 in grpc::CreateCustomChannel(std::string const&, std::shared_ptr<grpc::ChannelCredentials> const&, grpc::ChannelArguments const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#24 0x000076c617d12dbe in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::MakeChannel(opentelemetry::v1::exporter::otlp::OtlpGrpcClientOptions const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#25 0x000076c617d135bf in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::OtlpGrpcClient(opentelemetry::v1::exporter::otlp::OtlpGrpcClientOptions const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#26 0x000076c617d13923 in opentelemetry::v1::exporter::otlp::OtlpGrpcClientFactory::Create(opentelemetry::v1::exporter::otlp::OtlpGrpcClientOptions const&) ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#27 0x000076c617d0a1c9 in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::OtlpGrpcMetricExporter(opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporterOptions const&) ()
--Type <RET> for more, q to quit, c to continue without paging--c
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#28 0x000076c617d00f8f in ray::observability::OpenTelemetryMetricRecorder::Start(std::string const&, std::chrono::duration<long, std::ratio<1l, 1000l> >, std::chrono::duration<long, std::ratio<1l, 1000l> >) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#29 0x000076c61772a5a0 in ray::core::CoreWorkerProcessImpl::CoreWorkerProcessImpl(ray::core::CoreWorkerOptions const&)::{lambda(ray::Status const&)#2}::operator()(ray::Status const&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#30 0x000076c617b247ca in auto ray::rpc::MetricsAgentClientImpl::WaitForServerReadyWithRetry(std::function<void (ray::Status const&)>, int, int, int)::{lambda(auto:1&, auto:2&&)#1}::operator()<ray::Status const, ray::rpc::HealthCheckReply>(ray::Status const&, ray::rpc::HealthCheckReply&&) const [clone .constprop.0] () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#31 0x000076c617b26a35 in ray::rpc::ClientCallImpl<ray::rpc::HealthCheckReply>::OnReplyReceived() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#32 0x000076c61772c2b5 in std::_Function_handler<void (), ray::rpc::ClientCallManager::PollEventsFromCompletionQueue(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#33 0x000076c617cf906b in EventTracker::RecordExecution(std::function<void ()> const&, std::shared_ptr<StatsHandle>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#34 0x000076c617cefb8b in std::_Function_handler<void (), instrumented_io_context::post(std::function<void ()>, std::string, long)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#35 0x000076c61782d08b in boost::asio::detail::executor_op<boost::asio::detail::binder0<std::function<void ()> >, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#36 0x000076c61825462b in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#37 0x000076c618255fc9 in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#38 0x000076c6182566d2 in boost::asio::io_context::run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#39 0x000076c617729994 in ray::core::CoreWorkerProcessImpl::CreateCoreWorker(ray::core::CoreWorkerOptions, ray::WorkerID const&)::{lambda()#1}::operator()() const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#40 0x000076c617860270 in thread_proxy () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#41 0x000076c618fefac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#42 0x000076c6190818d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

rueian (Contributor, Author) commented Feb 23, 2026

The Windows build failures in CI don't seem related to this upgrade.

rueian (Contributor, Author) commented Feb 24, 2026

Related thread: open-telemetry/opentelemetry-cpp#3883

israbbani (Contributor):

> A new core dump with grpc 1.58.0. grpc will still use getenv on every client channel creation. This is a bit unfortunate.
>
> [full backtrace quoted above]

Confused about this. If we still see getenv on 1.58.0, then is there any benefit to upgrading for this bug?

@rueian
Copy link
Copy Markdown
Contributor Author

rueian commented Feb 24, 2026

> Confused about this. If we still see getenv on 1.58.0, then is there any benefit to upgrading for this bug?

Sorry for the confusion, but yes, we still need to upgrade grpc to solve the first core dump.

The second core dump can be fixed by #61281.

dayshah (Contributor) commented Feb 24, 2026

Kicking off Windows and Mac CI just in case, since this could have an impact.

israbbani (Contributor):

Unfortunately, Windows premerge is broken across the board 💀

rueian (Contributor, Author) commented Feb 27, 2026

Windows and macOS tests passed. cc @israbbani and @dayshah

@MengjinYan MengjinYan merged commit d85ed28 into ray-project:master Feb 28, 2026
6 checks passed
edoakes pushed a commit that referenced this pull request Mar 2, 2026
(#61281)

## Description

Workers can SIGSEGV due to a `getenv`/`setenv` race. The `io_thread_`
calls `getenv` during OTel/gRPC exporter init (`WaitForServerReady`
callback), while the main thread calls `setenv` in accelerator managers
(CUDA, Neuron, TPU). POSIX `setenv` is MT-Unsafe, so concurrent access
crashes.

Previously the callback was fire-and-forget, so the constructor could
return before init finished. #61034 and #61195 tried to fix specific
`getenv` call sites, but gRPC keeps calling `getenv` internally
(`HttpProxyMapper` reads proxy env vars on every channel creation), so
the race kept showing up in new core dumps.

This PR just makes the CoreWorker call site synchronously wait for the
callback to finish (via `std::promise/std::future`), so all `getenv`
completes before any `setenv` can happen.



## Performance
**This PR (sync wait):**
```
task latency (includes driver + 1 worker startup) 
  Trial 1: first task latency = 0.1907s
  Trial 2: first task latency = 0.1669s
  Trial 3: first task latency = 0.1726s
  Trial 4: first task latency = 0.1737s
  Trial 5: first task latency = 0.1731s
  Mean: 0.1754s, Std: 0.0081s
```

**Master (baseline):**
```
task latency (includes driver + 1 worker startup)
  Trial 1: first task latency = 0.1781s
  Trial 2: first task latency = 0.1601s
  Trial 3: first task latency = 0.1578s
  Trial 4: first task latency = 0.1685s
  Trial 5: first task latency = 0.1676s
  Mean: 0.1664s, Std: 0.0072s
```
The ~9 ms of added mean startup latency (0.1754 s vs 0.1664 s) is acceptable to fix the crash.

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026
(commit body repeats this PR's description and backtrace above)
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#25 0x00007580a67c44c4 in grpc_core::ExecCtx::Flush() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#26 0x00007580a63408a2 in grpc_core::ExecCtx::~ExecCtx() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#27 0x00007580a6740343 in grpc_call_start_batch () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#28 0x00007580a5e281e9 in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpRecvInitialMetadata, grpc::internal::CallOpRecvMessage<google::protobuf::MessageLite>, grpc::internal::CallOpClientSendClose, grpc::internal::CallOpClientRecvStatus>::ContinueFillOpsAfterInterception() ()
   from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#29 0x00007580a5e2d809 in grpc::internal::BlockingUnaryCallImpl<google::protobuf::MessageLite, google::protobuf::MessageLite>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, google::protobuf::MessageLite const&, google::protobuf::MessageLite*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#30 0x00007580a62d76ea in opentelemetry::proto::collector::metrics::v1::MetricsService::Stub::Export(grpc::ClientContext*, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest const&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#31 0x00007580a62ca40c in opentelemetry::v1::exporter::otlp::OtlpGrpcClient::DelegateExport(opentelemetry::proto::collector::metrics::v1::MetricsService::StubInterface*, std::unique_ptr<grpc::ClientContext, std::default_delete<grpc::ClientContext> >&&, std::unique_ptr<google::protobuf::Arena, std::default_delete<google::protobuf::Arena> >&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceRequest&&, opentelemetry::proto::collector::metrics::v1::ExportMetricsServiceResponse*) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#32 0x00007580a62c23ed in opentelemetry::v1::exporter::otlp::OtlpGrpcMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#33 0x00007580a62c0334 in (anonymous namespace)::OpenTelemetryMetricExporter::Export(opentelemetry::v1::sdk::metrics::ResourceMetrics const&) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#34 0x00007580a62e5fdf in opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}::operator()() const::{lambda(opentelemetry::v1::sdk::metrics::ResourceMetrics&)#1}::operator()(opentelemetry::v1::sdk::metrics::ResourceMetrics&) const () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#35 0x00007580a62ee7a6 in opentelemetry::v1::sdk::metrics::MetricReader::Collect(opentelemetry::v1::nostd::function_ref<bool (opentelemetry::v1::sdk::metrics::ResourceMetrics&)>) () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#36 0x00007580a62e5085 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()::{lambda()#1}> > >::_M_run() () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#37 0x00007580a6997be0 in execute_native_thread_routine () from /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so
#38 0x00007580a7597ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#39 0x00007580a76298d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026
(#61281)

## Description

Workers can SIGSEGV due to a `getenv`/`setenv` race. The `io_thread_`
calls `getenv` during OTel/gRPC exporter init (`WaitForServerReady`
callback), while the main thread calls `setenv` in accelerator managers
(CUDA, Neuron, TPU). POSIX `setenv` is MT-Unsafe, so concurrent access
crashes.

Previously the callback was fire-and-forget, so the constructor could
return before init finished. #61034 and #61195 tried to fix specific
`getenv` call sites, but gRPC keeps calling `getenv` internally
(`HttpProxyMapper` reads proxy environment variables on every channel
creation), so the race kept showing up in new coredumps.

This PR instead makes the CoreWorker call site synchronously wait for
the callback to finish (via `std::promise`/`std::future`), so all
`getenv` calls complete before any `setenv` can happen.



## Performance
**This PR (sync wait):**
```
task latency (includes driver + 1 worker startup) 
  Trial 1: first task latency = 0.1907s
  Trial 2: first task latency = 0.1669s
  Trial 3: first task latency = 0.1726s
  Trial 4: first task latency = 0.1737s
  Trial 5: first task latency = 0.1731s
  Mean: 0.1754s, Std: 0.0081s
```

**Master (baseline):**
```
task latency (includes driver + 1 worker startup)
  Trial 1: first task latency = 0.1781s
  Trial 2: first task latency = 0.1601s
  Trial 3: first task latency = 0.1578s
  Trial 4: first task latency = 0.1685s
  Trial 5: first task latency = 0.1676s
  Mean: 0.1664s, Std: 0.0072s
```
The ~9 ms (~5%) increase in mean first-task latency is an acceptable cost for fixing the crash.

---------

Signed-off-by: yicheng <yicheng@anyscale.com>
Co-authored-by: yicheng <yicheng@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
codope added a commit that referenced this pull request Mar 3, 2026
This reverts commit d85ed28.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
israbbani pushed a commit that referenced this pull request Mar 3, 2026
Reverts #61195 to check whether the C++ UBSan tests pass in postmerge.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
MengjinYan pushed a commit that referenced this pull request Mar 12, 2026
## Description

Reintroduce the gRPC upgrade to fix the setenv/getenv races. Postmerge
and premerge tests, including TSan, UBSan, macOS, and Windows, passed.

## Related issues
#61195

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>