[Fix] Skip flaky active_call_spec.rb on linux arm64#40770

Closed
sergiitk wants to merge 1 commit into
grpc:masterfrom
sergiitk:fix/ci/ruby/skip-active-call-spec

Conversation

@sergiitk

@sergiitk sergiitk commented Sep 25, 2025

Member

On Linux arm64, the test sometimes hangs until it hits the 20-minute timeout. Normally it passes in just a few seconds.

Fail: https://source.cloud.google.com/results/invocations/e0038f3c-5d32-405e-9e99-c28c9328a3b7

TIMEOUT: src/ruby/spec/generic/active_call_spec.rb [pid=23381, time=1201.4sec]

Pass: https://source.cloud.google.com/results/invocations/7be4b597-a7bd-48e2-a631-83695020fef3

PASSED: src/ruby/spec/generic/active_call_spec.rb [time=2.9sec, retries=0:0; cpu_cost=1.0; estimated=1.0]

This PR selectively skips this test on Linux arm64 until the root cause is determined.

@sergiitk sergiitk self-assigned this Sep 25, 2025
@sergiitk sergiitk added release notes: no and removed lang/ruby labels Sep 25, 2025
@sergiitk

Member Author

I do have a passing run for this one, but the job is flaky, and a pass doesn't mean much without logs confirming the skip:
https://source.cloud.google.com/results/invocations/e23f9c82-e17d-424a-9117-a38f417c6640.

I'm now running a new test with the code modified to force a failure: https://source.cloud.google.com/results/invocations/ac179b76-0301-4ec6-b023-6d45374f4a62.

@sergiitk

Member Author

Confirmed that the tests are skipped as expected by forcing a failure at the end of the suite: https://source.cloud.google.com/results/invocations/ac179b76-0301-4ec6-b023-6d45374f4a62

GRPC::ActiveCall
  restricted view methods
    #multi_req_view
      exposes a fixed subset of the ActiveCall.methods (PENDING: Flaky: this tests times out randomly when running on arm64 linux)
    #single_req_view
      exposes a fixed subset of the ActiveCall.methods (PENDING: Flaky: this tests times out randomly when running on arm64 linux)
...
Finished in 0.26265 seconds (files took 5.06 seconds to load)
29 examples, 0 failures, 29 pending, 1 error occurred outside of examples

The tests are indeed skipped on Linux arm64 (marked as pending).

@sergiitk

sergiitk commented Sep 25, 2025

Member Author

Now just double-checking that the tests are not skipped on non-arm Linux: https://source.cloud.google.com/results/invocations/dc65bcf2-192d-4391-bf13-868b30748d1d

@sergiitk sergiitk marked this pull request as ready for review September 25, 2025 06:42
@sergiitk sergiitk requested review from asheshvidyut and removed request for stanley-cheung November 4, 2025 19:26
@sergiitk

sergiitk commented Nov 4, 2025

Member Author

Not planning to merge this; we'll keep the failure visible until it's fixed.

@sergiitk sergiitk closed this Nov 4, 2025
@zarinn3pal zarinn3pal self-assigned this Dec 4, 2025
copybara-service Bot pushed a commit that referenced this pull request Apr 24, 2026
…#41510)

On ARM64, server shutdown could hang for 20+ minutes due to a memory visibility issue in the C-core completion queue. The `shutdown_called` flag lacks memory barriers, so blocked threads can fail to wake up under ARM's weak memory model.

A workaround was created for Ruby that sent a dummy RPC before shutdown to unblock the completion queue from the I/O side: [41223](#41223).

This PR addresses the root cause in the core, so that all wrapped languages benefit. It converts the `shutdown_called` flag from `bool` to `std::atomic<bool>` in all internal completion queue data structures. This guarantees that the shutdown state transition is atomically visible across threads, preventing race conditions and ensuring the completion queue drains and shuts down correctly on all architectures.

This PR addresses the issue that [40770](#40770) skipped.

Closes #41510

COPYBARA_INTEGRATE_REVIEW=#41510 from zarinn3pal:fix/cc-queue-shutdown 5e23512
PiperOrigin-RevId: 905116782
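
To illustrate the kind of change the commit message above describes, here is a minimal sketch of a shutdown flag stored as `std::atomic<bool>` and read by a waiting thread. It is not gRPC's code: `ToyQueue`, `Shutdown`, `WaitForShutdown`, and `RunDemo` are hypothetical names invented for this example, the polling loop stands in for whatever blocking the real completion queue does, and the explicit acquire/release orderings are an assumption (the commit message only states that the flag became atomic).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a completion queue; not gRPC's real data structure.
struct ToyQueue {
  // A plain `bool` written by one thread and read by another without
  // synchronization is a data race; the atomic makes the access well defined.
  std::atomic<bool> shutdown_called{false};
};

// Marks the queue as shut down. The release store publishes the flag
// (and any writes made before it) to threads that acquire-load it.
void Shutdown(ToyQueue& q) {
  q.shutdown_called.store(true, std::memory_order_release);
}

// Polls until the shutdown flag is observed or `timeout` elapses.
// Returns true if the shutdown was observed before the deadline.
bool WaitForShutdown(const ToyQueue& q, std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!q.shutdown_called.load(std::memory_order_acquire)) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::yield();
  }
  return true;
}

// Runs one waiter thread against one shutdown call and reports
// whether the waiter saw the flag change.
bool RunDemo() {
  ToyQueue q;
  assert(!q.shutdown_called.load());  // the queue starts open
  bool observed = false;
  std::thread waiter(
      [&] { observed = WaitForShutdown(q, std::chrono::seconds(5)); });
  Shutdown(q);
  waiter.join();  // join orders the waiter's write to `observed` before this read
  return observed;
}
```

Calling `RunDemo()` from a test or `main` exercises the handoff between the two threads. The timeout in `WaitForShutdown` is there only so that a missed flag shows up as a `false` return rather than as a hang like the 20-minute timeouts reported in this PR.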
asheshvidyut pushed a commit to a-detiste/grpc that referenced this pull request Jun 10, 2026
Labels

lang/ruby, release notes: no

2 participants