Skip to content

[Core] Fix completion queue shutdown race on weak memory models (ARM)#41510

Closed
zarinn3pal wants to merge 2 commits into
grpc:masterfrom
zarinn3pal:fix/cc-queue-shutdown
Closed

[Core] Fix completion queue shutdown race on weak memory models (ARM)#41510
zarinn3pal wants to merge 2 commits into
grpc:masterfrom
zarinn3pal:fix/cc-queue-shutdown

Conversation

@zarinn3pal

Copy link
Copy Markdown
Contributor

On ARM64, server shutdown could hang for 20+ minutes due to a memory visibility issue in the C-core completion queue. The shutdown_called flag lacks memory barriers, causing blocked threads to never wake up on ARM's weak memory model.

A workaround fix was created for ruby that sent a dummy RPC before shutdown to unblock the completion queue from the I/O side. 41223.

This PR addresses the issue in the core; such that all wrapped languages can reap the benefit; as well as the root cause is addressed. Converted the shutdown_called flag from bool to std::atomic<bool> in all internal completion queue data structures. This guarantees that the shutdown state transition is atomically visible across threads, preventing race conditions and ensuring the completion queue drains and shuts down correctly on all architectures.

The PR addresses the issue skipped in 40770

@emil10001

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly identifies a data race on the shutdown_called flag and addresses it by changing the type to std::atomic<bool>. This is a good step towards fixing the shutdown hang on weak memory model architectures like ARM. However, the use of std::memory_order_relaxed for all atomic operations is insufficient as it does not create the necessary memory barriers to guarantee visibility across threads. I've left several comments suggesting the use of std::memory_order_release for stores and std::memory_order_acquire for loads to ensure the shutdown signal is correctly propagated, making the fix robust and correct.

Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
Comment thread src/core/lib/surface/completion_queue.cc Outdated
@zarinn3pal

Copy link
Copy Markdown
Contributor Author

/gemini review

@zarinn3pal zarinn3pal added the release notes: yes Indicates if PR needs to be in release notes label Feb 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a memory visibility issue on weak memory model architectures like ARM by changing the shutdown_called flag from a bool to a std::atomic<bool> across various completion queue data structures. The use of appropriate memory ordering (acquire for loads and release for stores) ensures proper synchronization between threads, preventing the shutdown hangs that were observed. The changes are consistent and well-commented. I have one minor suggestion to improve the consistency of the comments.

Comment thread src/core/lib/surface/completion_queue.cc

@asheshvidyut asheshvidyut left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asheshvidyut

Copy link
Copy Markdown
Member

Please check why the CI is failing.

@murgatroid99

Copy link
Copy Markdown
Member

Can you merge in the master branch? I'd like to get some up to date test results with all of the test changes we've made in the fixit, and the kokoro labels don't quite seem to be doing it.

@zarinn3pal zarinn3pal force-pushed the fix/cc-queue-shutdown branch from 2370441 to 5e23512 Compare April 24, 2026 11:02
@murgatroid99

Copy link
Copy Markdown
Member

Thank you, that looks better

asheshvidyut pushed a commit to a-detiste/grpc that referenced this pull request Jun 10, 2026
…grpc#41510)

On ARM64, server shutdown could hang for 20+ minutes due to a memory visibility issue in the C-core completion queue. The shutdown_called flag lacks memory barriers, causing blocked threads to never wake up on ARM's weak memory model.

A  workaround fix was created for ruby that sent a dummy RPC before shutdown to unblock the completion queue from the I/O side. [41223](grpc#41223).

This PR addresses the issue in the core; such that all wrapped languages can reap the benefit; as well as the root cause is addressed.  Converted the `shutdown_called` flag from bool to `std::atomic<bool>` in all internal completion queue data structures. This guarantees that the shutdown state transition is atomically visible across threads, preventing race conditions and ensuring the completion queue drains and shuts down correctly on all architectures.

The PR addresses the issue skipped in [40770](grpc#40770)

Closes grpc#41510

COPYBARA_INTEGRATE_REVIEW=grpc#41510 from zarinn3pal:fix/cc-queue-shutdown 5e23512
PiperOrigin-RevId: 905116782
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants