Skip to content

[ruby] improve the way completion queue pluck operations handle signals and process shutdown#36903

Closed
apolcyn wants to merge 9 commits intogrpc:masterfrom
apolcyn:fix_ruby_interrupt
Closed

[ruby] improve the way completion queue pluck operations handle signals and process shutdown#36903
apolcyn wants to merge 9 commits intogrpc:masterfrom
apolcyn:fix_ruby_interrupt

Conversation

@apolcyn
Copy link
Copy Markdown
Contributor

@apolcyn apolcyn commented Jun 12, 2024

Fixes the CBF of src/ruby/end2end/killed_client_thread_test.rb (failure mode is a hang of the child process that receives the SIGTERM) that has been happening since #36724

So far grpc-ruby CQ pluck operations have so far used a 20ms-interval busy poll to check interrupts in case we've received a signal, handle process shutdown, etc. This means ongoing RPCs will not terminate their CQ operations if we need to terminate the process (the loop simply exits without waiting for the CQ op to finish), causing a leak. Those RPCs can leave refs over their corresponding channels preventing this from terminating (the channels don't reach state SHUTDOWN after being destroyed).

Fix is to unblock CQ pluck operations by cancelling calls, and thus allowing the CQ pluck to actually complete its operation. For server listening CQ operations, we unblock them by shutting down the server.

A side win here is to remove the 20ms-interval busy poll on CQ operations, which was only needed to handle shutdown.

@apolcyn apolcyn added the release notes: yes Indicates if PR needs to be in release notes label Jun 12, 2024
@apolcyn apolcyn changed the title Fix ruby interrupt [ruby] improve the way completion queue pluck operations handle signals and process shutdown Jun 12, 2024
@apolcyn apolcyn marked this pull request as ready for review June 12, 2024 23:21
Copy link
Copy Markdown
Contributor

@stanley-cheung stanley-cheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Any tests affected?

@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Jun 13, 2024

LGTM. Any tests affected?

This doesn't require any test changes

It fixes a CBF in src/ruby/end2end/killed_client_thread_test.rb though - that has been failing since a ~week ago as noted:

Timed out waiting for client process. It likely freezes when killed while in the middle of an rpc (RuntimeError)

apolcyn added a commit to apolcyn/grpc that referenced this pull request Jun 13, 2024
…ls and process shutdown (grpc#36903)

Fixes the CBF of `src/ruby/end2end/killed_client_thread_test.rb` (failure mode is a hang of the child process that receives the SIGTERM) that has been happening since grpc#36724

So far grpc-ruby CQ pluck operations have so far used a 20ms-interval busy poll to check interrupts in case we've received a signal, handle process shutdown, etc. This means ongoing RPCs will not terminate their CQ operations if we need to terminate the process (the loop simply exits without waiting for the CQ op to finish), causing a leak. Those RPCs can leave refs over their corresponding channels preventing [this](https://github.com/grpc/grpc/blob/8564f72e8e0334c25c480e0aec1df75bdc1fce14/src/ruby/ext/grpc/rb_channel.c#L653) from terminating (the channels don't reach state SHUTDOWN after being destroyed).

Fix is to unblock CQ pluck operations by cancelling calls, and thus allowing the CQ pluck to actually complete its operation. For server listening CQ operations, we unblock them by shutting down the server.

A side win here is to remove the [20ms-interval busy poll](https://github.com/grpc/grpc/blob/8564f72e8e0334c25c480e0aec1df75bdc1fce14/src/ruby/ext/grpc/rb_completion_queue.c#L44) on CQ operations, which was only needed to handle shutdown.

Closes grpc#36903

COPYBARA_INTEGRATE_REVIEW=grpc#36903 from apolcyn:fix_ruby_interrupt bed1ee2
PiperOrigin-RevId: 643046465
XuanWang-Amos pushed a commit that referenced this pull request Jun 13, 2024
alto-ruby added a commit to alto-ruby/grpc that referenced this pull request Jul 11, 2024
…k operations handle signals and process shutdown grpc#36903" (grpc#36916)"

This reverts commit 3469d7b.
copybara-service bot pushed a commit that referenced this pull request Aug 7, 2024
Fixes #37234

Following up on the problem described in #36903, there are a number of paths in `client_server_spec.rb` and a few other tests where client call objects can leak due to RPC lifecycles not being properly completed, leading to a thread not terminating.

Some of the tests, which don't use the surface-level APIs, are changed to manually close calls (and not rely on GC which might not happen before shutdown of ruby threads). `client_server_spec.rb` is updated to use surface level APIs, which manages call lifecycles correctly (this also improves the test's fidelity).

While we're here: expose `cancel_with_status` on call operations. This was only accidentally private so far. The test refactoring caught it.

Closes #37410

COPYBARA_INTEGRATE_REVIEW=#37410 from apolcyn:fix_call_leak b230472
PiperOrigin-RevId: 660430463
paulosjca pushed a commit to paulosjca/grpc that referenced this pull request Nov 25, 2024
…ls and process shutdown (grpc#36903)

Fixes the CBF of `src/ruby/end2end/killed_client_thread_test.rb` (failure mode is a hang of the child process that receives the SIGTERM) that has been happening since grpc#36724

So far grpc-ruby CQ pluck operations have so far used a 20ms-interval busy poll to check interrupts in case we've received a signal, handle process shutdown, etc. This means ongoing RPCs will not terminate their CQ operations if we need to terminate the process (the loop simply exits without waiting for the CQ op to finish), causing a leak. Those RPCs can leave refs over their corresponding channels preventing [this](https://github.com/grpc/grpc/blob/8564f72e8e0334c25c480e0aec1df75bdc1fce14/src/ruby/ext/grpc/rb_channel.c#L653) from terminating (the channels don't reach state SHUTDOWN after being destroyed).

Fix is to unblock CQ pluck operations by cancelling calls, and thus allowing the CQ pluck to actually complete its operation. For server listening CQ operations, we unblock them by shutting down the server.

A side win here is to remove the [20ms-interval busy poll](https://github.com/grpc/grpc/blob/8564f72e8e0334c25c480e0aec1df75bdc1fce14/src/ruby/ext/grpc/rb_completion_queue.c#L44) on CQ operations, which was only needed to handle shutdown.

Closes grpc#36903

COPYBARA_INTEGRATE_REVIEW=grpc#36903 from apolcyn:fix_ruby_interrupt bed1ee2
PiperOrigin-RevId: 643046465
paulosjca pushed a commit to paulosjca/grpc that referenced this pull request Nov 25, 2024
Fixes grpc#37234

Following up on the problem described in grpc#36903, there are a number of paths in `client_server_spec.rb` and a few other tests where client call objects can leak due to RPC lifecycles not being properly completed, leading to a thread not terminating.

Some of the tests, which don't use the surface-level APIs, are changed to manually close calls (and not rely on GC which might not happen before shutdown of ruby threads). `client_server_spec.rb` is updated to use surface level APIs, which manages call lifecycles correctly (this also improves the test's fidelity).

While we're here: expose `cancel_with_status` on call operations. This was only accidentally private so far. The test refactoring caught it.

Closes grpc#37410

COPYBARA_INTEGRATE_REVIEW=grpc#37410 from apolcyn:fix_call_leak b230472
PiperOrigin-RevId: 660430463
apolcyn added a commit to apolcyn/grpc that referenced this pull request Apr 28, 2025
…le signals and process shutdown (grpc#36903)"

This reverts commit d4b5e12.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lang/ruby release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants