Skip to content

[export] [PH2] Fix crash with error flattener#41930

Closed
copybara-service[bot] wants to merge 1 commit into
masterfrom
test_887957316
Closed

[export] [PH2] Fix crash with error flattener#41930
copybara-service[bot] wants to merge 1 commit into
masterfrom
test_887957316

Conversation

@copybara-service

@copybara-service copybara-service Bot commented Mar 24, 2026

Copy link
Copy Markdown

[export] [PH2] Fix crash with error flattener
Root cause

  1. status_details_recv_ is not initialized at time of creation :
    grpc_slice status_details_recv_;
  2. It is then used to initialize an op, and is passed by reference via the op
    op->data.recv_status_on_client.status_details = &status_details_recv_;
  3. This op goes all over the gRPC call space taking our status_details_recv_ , and many components assign stuff to it.
  4. One such path misses assignment when error flattener is enabled :
    auto out_status_details = op->data.recv_status_on_client.status_details;
    and
    *out_status_details = message_slice.TakeCSlice();
  5. And then we try to CSliceUnref(status_details_recv_) which was not initialized. And crash.
    CSliceUnref(status_details_recv_);
I0123 21:12:16.698753      24 dual_ref_counted.h:216] ClientCall:0x7fab7402e660 weak_unref 1 -> 0 (refs=1)
*** SIGSEGV received at time=1769202736 on cpu 1 ***
PC: @     0x7faba2478c57  (unknown)  grpc_slice_refcount::Unref()
    @     0x7fab9899041c         64  absl::lts_20250512::WriteFailureInfo()
    @     0x7fab989900bb        304  absl::lts_20250512::AbslFailureSignalHandler()
    @     0x7fab93da5420       1744  (unknown)
    @     0x7faba2478b37         64  grpc_core::CSliceUnref()
    @     0x7faba0e9ff14       1120  grpc_core::(anonymous namespace)::RlsLb::RlsRequest::OnRlsCallCompleteLocked()
    @     0x7faba0e9f7d9         96  grpc_core::(anonymous namespace)::RlsLb::RlsRequest::OnRlsCallComplete()::$_0::operator()()
    @     0x7faba0e9f795         32  std::__invoke_impl<>()
    @     0x7faba0e9f775         32  std::__invoke<>()
    @     0x7faba0e9f755         32  std::invoke<>()
    @     0x7faba0e9f725         32  absl::lts_20250512::internal_any_invocable::InvokeR<>()
    @     0x7faba0e9f652         32  absl::lts_20250512::internal_any_invocable::LocalInvoker<>()
    @     0x7fab9c426f72         32  absl::lts_20250512::internal_any_invocable::Impl<>::operator()()
    @     0x7fab95bd3957        528  grpc_core::WorkSerializer::WorkSerializerImpl::Run()
    @     0x7fab974fef54        160  grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::Step()
    @     0x7fab974fe8da        176  grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::ThreadBody()
    @     0x7fab974ff591         48  grpc_event_engine::experimental::WorkStealingThreadPool::WorkStealingThreadPoolImpl::StartThread()::$_0::operator()()
    @     0x7fab974ff569         32  grpc_event_engine::experimental::WorkStealingThreadPool::WorkStealingThreadPoolImpl::StartThread()::$_0::__invoke()
    @     0x7fab95766d25        160  grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::operator()()
    @     0x7fab95766bf9         32  grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::__invoke()
    @     0x7fab93d99609  (unknown)  start_thread

Fix option 1 : Initialize status_details_recv_
Fix option 2 : Modify the

*out_status_details = message_slice.TakeCSlice();

PR with the experiment enabled AND the fix for reference : #41920


DO NOT SUBMIT. This PR is for testing purposes only. cl/887957316

@copybara-service copybara-service Bot requested a review from markdroth as a code owner March 24, 2026 09:02
@copybara-service copybara-service Bot added exported release notes: no Indicates if PR should not be in release notes labels Mar 24, 2026
@copybara-service copybara-service Bot force-pushed the test_887957316 branch 4 times, most recently from 266000c to 62eb0ee Compare March 24, 2026 09:07
Root cause

1. status_details_recv_ is not initialized at time of creation : https://github.com/grpc/grpc/blob/173b262da788de397ea3f7d788156c59272f3a89/src/core/load_balancing/rls/rls.cc#L684
2. It is then used to initialize an op, and is passed by reference via the op https://github.com/grpc/grpc/blob/173b262da788de397ea3f7d788156c59272f3a89/src/core/load_balancing/rls/rls.cc#L1705
3. This op goes all over the gRPC call space taking our status_details_recv_ , and many components assign stuff to it.
4. One such path misses assignment when error flattener is enabled : https://github.com/grpc/grpc/blob/173b262da788de397ea3f7d788156c59272f3a89/src/core/call/client_call.cc#L392 and https://github.com/grpc/grpc/blob/173b262da788de397ea3f7d788156c59272f3a89/src/core/call/client_call.cc#L451
5. And then we try to CSliceUnref(status_details_recv_) which was not initialized. And crash. https://github.com/grpc/grpc/blob/c0728bdae796f214a9fc56a617b21075c2ed3b40/src/core/load_balancing/rls/rls.cc#L1749

```
I0123 21:12:16.698753      24 dual_ref_counted.h:216] ClientCall:0x7fab7402e660 weak_unref 1 -> 0 (refs=1)
*** SIGSEGV received at time=1769202736 on cpu 1 ***
PC: @     0x7faba2478c57  (unknown)  grpc_slice_refcount::Unref()
    @     0x7fab9899041c         64  absl::lts_20250512::WriteFailureInfo()
    @     0x7fab989900bb        304  absl::lts_20250512::AbslFailureSignalHandler()
    @     0x7fab93da5420       1744  (unknown)
    @     0x7faba2478b37         64  grpc_core::CSliceUnref()
    @     0x7faba0e9ff14       1120  grpc_core::(anonymous namespace)::RlsLb::RlsRequest::OnRlsCallCompleteLocked()
    @     0x7faba0e9f7d9         96  grpc_core::(anonymous namespace)::RlsLb::RlsRequest::OnRlsCallComplete()::$_0::operator()()
    @     0x7faba0e9f795         32  std::__invoke_impl<>()
    @     0x7faba0e9f775         32  std::__invoke<>()
    @     0x7faba0e9f755         32  std::invoke<>()
    @     0x7faba0e9f725         32  absl::lts_20250512::internal_any_invocable::InvokeR<>()
    @     0x7faba0e9f652         32  absl::lts_20250512::internal_any_invocable::LocalInvoker<>()
    @     0x7fab9c426f72         32  absl::lts_20250512::internal_any_invocable::Impl<>::operator()()
    @     0x7fab95bd3957        528  grpc_core::WorkSerializer::WorkSerializerImpl::Run()
    @     0x7fab974fef54        160  grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::Step()
    @     0x7fab974fe8da        176  grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::ThreadBody()
    @     0x7fab974ff591         48  grpc_event_engine::experimental::WorkStealingThreadPool::WorkStealingThreadPoolImpl::StartThread()::$_0::operator()()
    @     0x7fab974ff569         32  grpc_event_engine::experimental::WorkStealingThreadPool::WorkStealingThreadPoolImpl::StartThread()::$_0::__invoke()
    @     0x7fab95766d25        160  grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::operator()()
    @     0x7fab95766bf9         32  grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::__invoke()
    @     0x7fab93d99609  (unknown)  start_thread
```

Fix option 1 : Initialize status_details_recv_
Fix option 2 : Modify the https://github.com/grpc/grpc/blob/173b262da788de397ea3f7d788156c59272f3a89/src/core/call/client_call.cc#L451

PR with the experiment enabled AND the fix for reference : #41920

----
DO NOT SUBMIT. This PR is for testing purposes only. [cl/887957316](http://cl/887957316)

PiperOrigin-RevId: 887957316
@markdroth

Copy link
Copy Markdown
Member

This doesn't look like the right fix to me. I think this would mask the symptom but not solve the underlying problem.

The C-core API guarantees that when the recv_status_on_client op completes, the status and details will be set, so there should be no requirement that the caller initialize these two variables. If we're completing the op without setting these two variables, then we're violating the API contract. Initializing the variables in RLS may work around the immediate problem, but it will just crop up again later in other contexts.

I think we need to figure out what code-path is completing the recv_status_on_client op without setting these two variables and fix it to make sure the variables are set.

@tanvi-jagtap

tanvi-jagtap commented Mar 25, 2026

Copy link
Copy Markdown
Collaborator

This doesn't look like the right fix to me. I think this would mask the symptom but not solve the underlying problem.

The C-core API guarantees that when the recv_status_on_client op completes, the status and details will be set, so there should be no requirement that the caller initialize these two variables. If we're completing the op without setting these two variables, then we're violating the API contract. Initializing the variables in RLS may work around the immediate problem, but it will just crop up again later in other contexts.

I think we need to figure out what code-path is completing the recv_status_on_client op without setting these two variables and fix it to make sure the variables are set.

In that case, would it make sense to add an else case to this

if (!IsErrorFlattenEnabled() || status != GRPC_STATUS_OK) {

and assign the required variables?

Or would there be a better way?

cc : @markdroth

@markdroth

Copy link
Copy Markdown
Member

Good catch, Tanvi! You're right, that's where the bug is. I've sent you #41940 to fix it.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

exported lang/core release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants