[WIP] GCS client test failure flakiness #34656

Merged
pcmoritz merged 2 commits into ray-project:master from rkooo567:gcs-client-flaky
Apr 22, 2023
@rkooo567 (Contributor) commented Apr 21, 2023

Why are these changes needed?

Right now the theory is as follows.

  1. The pubsub io service is created and run inside the GcsServer. That means that if the pubsub io service is accessed after the GcsServer has been destroyed, it will segfault.
  2. Right now, upon teardown, calling rpc::DrainAndResetExecutor recreates the Executor thread pool.
  3. Upon teardown, the following sequence segfaults: DrainAndResetExecutor -> GcsServer's internal pubsub posts a new SendReply to the newly created thread pool -> GcsServer.reset -> pubsub io service destroyed -> SendReply invoked from the newly created thread pool.
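
The ordering in step 3 can be replayed with a toy model. This is only a sketch under stated assumptions: `ToyExecutor`, `ToyServer`, and `ReplayTeardown` are hypothetical names, not Ray APIs, and the executor is single-threaded for determinism. A `std::weak_ptr` stands in for the reference to the pubsub io service so that the dangling access is detected rather than executed.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the executor thread pool: tasks are queued and run later.
struct ToyExecutor {
  std::vector<std::function<void()>> queue;

  void Post(std::function<void()> task) { queue.push_back(std::move(task)); }

  void RunAll() {
    // Move the queue out first so tasks posted while running do not invalidate the loop.
    auto tasks = std::move(queue);
    queue.clear();
    for (auto &task : tasks) task();
  }
};

// Hypothetical stand-in for state owned by the server (e.g. the pubsub io service).
struct ToyServer {
  std::string name = "pubsub";
};

// Replays the teardown ordering from the list above and reports what the late
// task observes. In the real failure the task would dereference freed memory.
std::string ReplayTeardown() {
  ToyExecutor executor;
  auto server = std::make_shared<ToyServer>();
  std::weak_ptr<ToyServer> weak = server;
  std::string observed;

  executor.RunAll();  // drain; in the theory the pool is also recreated here
  executor.Post([weak, &observed] {  // a reply is posted to the fresh pool
    auto alive = weak.lock();
    observed = alive ? "server alive" : "server already destroyed";
  });
  server.reset();     // the server, and with it the io service, is destroyed
  executor.RunAll();  // the late task finally runs
  return observed;
}
```

In this model the late task always finds the server gone, which corresponds to the use-after-free that the theory blames for the segfault.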

NOTE: if you look at the failure, the segfault comes from the pubsub service:

#2 0x7f92034d9129 in ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberPollRequest, ray::rpc::GcsSubscriberPollReply>::HandleRequestImpl()::'lambda'(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)::operator()(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>) const::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/grpc_common_lib/ray/rpc/server_call.h:212:48

As a fix, I only drain the thread pool, and then reset it after all operations are fully cleaned up (only in tests). I think there's no need to reset on regular process termination for the raylet, GCS, or core workers.
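
A minimal sketch of the drain-first, reset-last ordering, again with hypothetical names (`ToyPool`, `TeardownResult`, `TeardownDrainThenReset`) rather than the actual Ray functions. It assumes a drained pool rejects new work until it is explicitly reset, so nothing posted during teardown can run after the server is gone.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical pool where draining and resetting are separate steps.
struct ToyPool {
  std::vector<std::function<void()>> pending;
  bool accepting = true;

  // Returns false when the pool has been drained and not yet reset.
  bool Post(std::function<void()> task) {
    if (!accepting) return false;
    pending.push_back(std::move(task));
    return true;
  }

  // Drain only: stop accepting work and run what is queued; do not recreate the pool.
  void Drain() {
    accepting = false;
    auto tasks = std::move(pending);
    pending.clear();
    for (auto &task : tasks) task();
  }

  // Reset: make the pool usable again (in the PR, only tests need this).
  void Reset() {
    pending.clear();
    accepting = true;
  }
};

struct TeardownResult {
  int ran_while_alive;      // tasks that ran before the server was destroyed
  bool late_post_accepted;  // whether a post after the drain was accepted
  bool usable_after_reset;  // whether the pool accepts work after the reset
};

TeardownResult TeardownDrainThenReset() {
  ToyPool pool;
  auto server = std::make_unique<int>(0);  // stands in for server-owned state
  int *state = server.get();

  pool.Post([state] { ++*state; });
  pool.Drain();                                     // 1. drain only
  int ran = *server;
  bool late = pool.Post([state] { ++*state; });     // rejected, so it never runs
  server.reset();                                   // 2. destroy the server
  pool.Reset();                                     // 3. reset after cleanup
  bool usable = pool.Post([] {});
  return {ran, late, usable};
}
```

Because the reset happens only after the server is destroyed, there is no window in which a fresh pool can pick up a task that still points at server state.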

Related issue number

Closes #34344

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
@rkooo567 rkooo567 requested a review from a team as a code owner April 21, 2023 05:43
@pcmoritz pcmoritz merged commit 26a9201 into ray-project:master Apr 22, 2023
@pcmoritz (Contributor) commented

I'm merging this now since I was debugging a PR that, I think, ran into the same issue; let's see if this fixes it :)

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Signed-off-by: Jack He <jackhe2345@gmail.com>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023

Development

Successfully merging this pull request may close these issues.

[CI] linux://:gcs_client_test is failing/flaky on master.
