Assert that pthread_join succeeds. #32584
Merged
yashykt merged 4 commits into grpc:master from laramiel:patch-1 on Mar 21, 2023
Conversation
Long story here: CallbackAlternativeCQ operates a thread pool which processes a completion queue and then directly invokes the completion functions in the same thread. This thread pool is initialized on the first Ref() and deallocated on the last Unref().

When running an in-process synchronous server (as we do for tests, using https://github.com/google/tensorstore/blob/master/tensorstore/internal/grpc/grpc_mock.h) that is called through the async() interface, if the async() callback happens to drop the last reference to the grpc Channel, then the channel shutdown will attempt to run in one of the CallbackAlternativeCQ threads.

This will cause a deadlock/race condition, as CallbackAlternativeCQ is not designed to shut itself down. When this deadlock happens, pthread_join(pthread_id_) will return EDEADLK and the thread will keep running. However, EDEADLK is silently ignored by Join(), so CallbackAlternativeCQ will continue to delete the underlying grpc_completion_queue, leading to a SIGSEGV later in the process.

It's useful in this case to assert that pthread_join succeeded. Alternatively, the thread implementation could gpr_log the error code before asserting.
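The self-join failure described above can be reproduced outside gRPC. What follows is a minimal sketch, not gRPC code: `SelfJoin` and `SelfJoinResult` are hypothetical names, and it assumes a platform whose `pthread_join` detects a thread joining itself (glibc does, and this PR reports the same on macOS). POSIX only lists `EDEADLK` as an error that may be reported, so other implementations could behave differently.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>

// Thread body (hypothetical): tries to join itself and stores the error code
// returned by pthread_join in the int pointed to by `out`.
static void* SelfJoin(void* out) {
  int rc = pthread_join(pthread_self(), nullptr);
  *static_cast<int*>(out) = rc;
  return nullptr;  // the thread keeps running after the failed join
}

// Starts a thread that attempts a self-join and returns the code it observed.
int SelfJoinResult() {
  int rc = -1;
  pthread_t tid;
  int create_rc = pthread_create(&tid, nullptr, SelfJoin, &rc);
  assert(create_rc == 0);
  (void)create_rc;  // silence unused-variable warnings when NDEBUG is set
  // Joining from a different thread is the supported case and succeeds.
  pthread_join(tid, nullptr);
  return rc;
}
```

On such a platform `SelfJoinResult()` is expected to return `EDEADLK`. The failed join does not stop the thread, which is why discarding the return value lets the caller go on to free resources the thread still uses.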
Contributor
Author
I think that my specific completion queue issue can happen whenever there is (1) a channel which is owned by (2) an object which is deleted by an async callback:
Contributor
Author
And it seems likely that this resource management pattern is used elsewhere in grpc. Here are a few candidates: src/core/lib/iomgr/executor.c
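The ownership pattern discussed in these comments can be sketched in standard C++, independent of gRPC. This is an illustration under stated assumptions, not the code from the comments or from gRPC: `Report`, `Pool`, and `DestructorRunsOnWorkerThread` are hypothetical names. A callback running on a pool's own thread drops the last reference to the pool, so the destructor executes on the very thread it would normally join. The sketch detects that case and detaches instead of joining, purely to make the hazard observable; it is not a proposed fix.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Shared between the driver and the pool so it outlives both (hypothetical).
struct Report {
  std::mutex mu;
  std::condition_variable cv;
  bool start = false;          // set by the driver to release the callback
  bool destroyed = false;      // set by ~Pool
  bool on_own_thread = false;  // true if ~Pool ran on the pool's own thread
};

// Hypothetical stand-in for an object that owns a worker thread and joins it
// on destruction.
struct Pool {
  explicit Pool(std::shared_ptr<Report> r) : report(std::move(r)) {}
  ~Pool() {
    const bool self = worker.get_id() == std::this_thread::get_id();
    if (self) {
      worker.detach();  // join() would fail here: a thread cannot join itself
    } else {
      worker.join();
    }
    std::lock_guard<std::mutex> lock(report->mu);
    report->on_own_thread = self;
    report->destroyed = true;
    report->cv.notify_all();
  }
  std::shared_ptr<Report> report;
  std::thread worker;
};

// Hands the only reference to the pool to a callback on the pool's thread and
// reports whether the destructor then ran on that thread.
bool DestructorRunsOnWorkerThread() {
  auto report = std::make_shared<Report>();
  auto pool = std::make_shared<Pool>(report);
  Pool* raw = pool.get();
  raw->worker = std::thread([report, owner = std::move(pool)]() mutable {
    std::unique_lock<std::mutex> lock(report->mu);
    report->cv.wait(lock, [&] { return report->start; });
    lock.unlock();
    owner.reset();  // last reference dropped inside the callback
  });
  {
    std::lock_guard<std::mutex> lock(report->mu);
    report->start = true;  // the worker member is assigned; release the callback
  }
  report->cv.notify_all();
  std::unique_lock<std::mutex> lock(report->mu);
  report->cv.wait(lock, [&] { return report->destroyed; });
  return report->on_own_thread;
}
```

`DestructorRunsOnWorkerThread()` returns true, showing that nothing in the ownership model prevents the destructor from landing on the worker thread. Without the thread-id check, `worker.join()` would be a self-join.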
copybara-service bot pushed a commit to google/tensorstore that referenced this pull request on Mar 10, 2023
Reworked several methods from grpc_kvstore, as I discovered a deadlock in grpc that I needed to work around. Described more here: grpc/grpc#32584
PiperOrigin-RevId: 515472854
Change-Id: I0f0909929680bd6d26d1e240360d739f10401773
yashykt approved these changes on Mar 13, 2023
Update Join() to log a failure prior to crashing.
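A rough sketch of the log-then-crash idea named in this commit title, written against plain pthreads. It is not the merged gRPC change: `JoinAndLog`, `JoinOrDie`, `Noop`, and `JoinTrivialThread` are hypothetical names, and `stderr` stands in for whatever logging facility the real code uses.

```cpp
#include <pthread.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Joins `tid` and logs the error code if the join fails. Returns the
// pthread_join result so the caller decides how to react.
int JoinAndLog(pthread_t tid) {
  int rc = pthread_join(tid, nullptr);
  if (rc != 0) {
    std::fprintf(stderr, "pthread_join failed: %s (%d)\n", std::strerror(rc),
                 rc);
  }
  return rc;
}

// Logs first, then aborts, so the crash report carries the error code.
void JoinOrDie(pthread_t tid) {
  if (JoinAndLog(tid) != 0) std::abort();
}

// Trivial thread body used by the demo helper below.
static void* Noop(void*) { return nullptr; }

// Demo helper: joins a thread that exits immediately; returns the join result,
// or -1 if the thread could not be created.
int JoinTrivialThread() {
  pthread_t tid;
  if (pthread_create(&tid, nullptr, Noop, nullptr) != 0) return -1;
  return JoinAndLog(tid);
}
```

In the normal case `JoinTrivialThread()` returns 0 and nothing is logged. Splitting the logging from the abort keeps the error code visible in the output even when the process is about to terminate.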
Member
Please run
Member
Re-running the tests. Will merge once the tests go green.
XuanWang-Amos pushed a commit to XuanWang-Amos/grpc that referenced this pull request on May 1, 2023
Long story here: CallbackAlternativeCQ operates a thread pool which processes a completion queue and then directly invokes the completion function in the same thread. This thread pool is initialized on the first Ref() and deallocated on the last Unref().

When running an in-process synchronous server (as we do for tests, using https://github.com/google/tensorstore/blob/master/tensorstore/internal/grpc/grpc_mock.h) that is called through the async() interface, if the async() callback happens to drop the last reference to the grpc Channel, then the channel shutdown will attempt to run in one of the CallbackAlternativeCQ threads.

This will cause a deadlock/race condition, as `CallbackAlternativeCQ` is not designed to shut itself down. When this deadlock happens, `pthread_join(pthread_id_)` will return `EDEADLK` and the thread will keep running. However, `EDEADLK` is silently ignored by Join(), so `CallbackAlternativeCQ::Unref` will continue to delete the underlying grpc_completion_queue, leading to a `SIGSEGV` later in the process. https://github.com/grpc/grpc/blob/97ba9871324cb68b93f22fd1860934392cd476ee/src/cpp/common/completion_queue_cc.cc#L115

This adds an assert that pthread_join succeeded, which is useful because the process fails at the point of the bad join instead of hitting a later SIGSEGV. Alternatively, the thread implementation could gpr_log the error code before asserting.
Example backtrace of crash:
frame #0: 0x0000000194f1e868 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x0000000194f55cec libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x0000000194e8e2c8 libsystem_c.dylib`abort + 180
frame #3: 0x0000000194e8d620 libsystem_c.dylib`__assert_rtn + 272
frame #4: 0x0000000100a64f50 grpc_kvstore_test`grpc_core::(anonymous namespace)::ThreadInternalsPosix::Join() + 188
frame #5: 0x00000001009c5dd0 grpc_kvstore_test`grpc_core::Thread::Join() + 56
frame #6: 0x0000000100154474 grpc_kvstore_test`grpc::(anonymous namespace)::CallbackAlternativeCQ::Unref() + 216
frame #7: 0x0000000100154390 grpc_kvstore_test`grpc::CompletionQueue::ReleaseCallbackAlternativeCQ(grpc::CompletionQueue*) + 120
frame #8: 0x000000010014130c grpc_kvstore_test`grpc::Channel::~Channel() + 220
frame #9: 0x00000001001413c8 grpc_kvstore_test`grpc::Channel::~Channel() + 28
frame #10: 0x000000010014d678 grpc_kvstore_test`std::__1::default_delete<grpc::Channel>::operator()(grpc::Channel*) const + 44
frame #11: 0x000000010014d358 grpc_kvstore_test`std::__1::__shared_ptr_pointer<grpc::Channel*, std::__1::shared_ptr<grpc::Channel>::__shared_ptr_default_delete<grpc::Channel, grpc::Channel>, std::__1::allocator<grpc::Channel> >::__on_zero_shared() + 72
frame #12: 0x000000010002ab5c grpc_kvstore_test`std::__1::__shared_count::__release_shared() + 60
frame #13: 0x000000010002ab00 grpc_kvstore_test`std::__1::__shared_weak_count::__release_shared() + 28
frame #14: 0x000000010002aad0 grpc_kvstore_test`std::__1::shared_ptr<grpc::ServerCredentials>::~shared_ptr() + 56
frame #15: 0x00000001000053ec grpc_kvstore_test`std::__1::shared_ptr<tensorstore_grpc::kvstore::grpc_gen::KvStoreService::Stub>::~shared_ptr() + 28
frame #16: 0x000000010014653c grpc_kvstore_test`grpc::ClientContext::~ClientContext() + 356
frame #17: 0x0000000100146570 grpc_kvstore_test`grpc::ClientContext::~ClientContext() + 28
frame #18: 0x00000001000ab000 grpc_kvstore_test`tensorstore::(anonymous namespace)::ReadTask::~ReadTask() + 68
frame #19: 0x00000001000aae90 grpc_kvstore_test`tensorstore::(anonymous namespace)::ReadTask::~ReadTask() + 28
frame #20: 0x00000001000aae18 grpc_kvstore_test`tensorstore::internal::intrusive_ptr_decrement(tensorstore::internal::AtomicReferenceCount<tensorstore::(anonymous namespace)::ReadTask> const*) + 68
frame #21: 0x00000001000aadc8 grpc_kvstore_test`void tensorstore::internal::DefaultIntrusivePtrTraits::decrement<tensorstore::(anonymous namespace)::ReadTask*>(tensorstore::(anonymous namespace)::ReadTask*) + 24
frame #22: 0x00000001000aad9c grpc_kvstore_test`tensorstore::internal::IntrusivePtr<tensorstore::(anonymous namespace)::ReadTask, tensorstore::internal::DefaultIntrusivePtrTraits>::~IntrusivePtr() + 52
frame #23: 0x00000001000a5994 grpc_kvstore_test`tensorstore::internal::IntrusivePtr<tensorstore::(anonymous namespace)::ReadTask, tensorstore::internal::DefaultIntrusivePtrTraits>::~IntrusivePtr() + 28
frame #24: 0x00000001000aac24 grpc_kvstore_test`tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status)::~() + 40
frame #25: 0x00000001000a6280 grpc_kvstore_test`tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status)::~() + 28
frame #26: 0x00000001000a84ac grpc_kvstore_test`std::__1::__compressed_pair_elem<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status), 0, false>::~__compressed_pair_elem() + 28
frame #27: 0x00000001000a86c0 grpc_kvstore_test`std::__1::__compressed_pair<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status), std::__1::allocator<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status)> >::~__compressed_pair() + 28
frame #28: 0x00000001000a8694 grpc_kvstore_test`std::__1::__compressed_pair<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status), std::__1::allocator<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status)> >::~__compressed_pair() + 28
frame #29: 0x00000001000a990c grpc_kvstore_test`std::__1::__function::__alloc_func<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status), std::__1::allocator<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status)>, void (grpc::Status)>::destroy() + 24
frame #30: 0x00000001000a7ea0 grpc_kvstore_test`std::__1::__function::__func<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status), std::__1::allocator<tensorstore::(anonymous namespace)::ReadTask::Start(tensorstore_grpc::kvstore::grpc_gen::KvStoreService::StubInterface*, absl::Time)::'lambda'(grpc::Status)>, void (grpc::Status)>::destroy() + 28
frame #31: 0x00000001000aabbc grpc_kvstore_test`std::__1::__function::__value_func<void (grpc::Status)>::~__value_func() + 68
frame #32: 0x00000001000aab68 grpc_kvstore_test`std::__1::__function::__value_func<void (grpc::Status)>::~__value_func() + 28
frame #33: 0x00000001000aab3c grpc_kvstore_test`std::__1::function<void (grpc::Status)>::~function() + 28
frame #34: 0x00000001000a6254 grpc_kvstore_test`std::__1::function<void (grpc::Status)>::~function() + 28
frame #35: 0x0000000100108ae0 grpc_kvstore_test`grpc::internal::CallbackWithStatusTag::Run(bool) + 368
frame #36: 0x0000000100108964 grpc_kvstore_test`grpc::internal::CallbackWithStatusTag::StaticRun(grpc_completion_queue_functor*, int) + 44
frame #37: 0x0000000100154cb0 grpc_kvstore_test`grpc::(anonymous namespace)::CallbackAlternativeCQ::ThreadLoop(void*) + 356
frame #38: 0x0000000100a650b8 grpc_kvstore_test`grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::operator()(void*) const + 240
frame #39: 0x0000000100a64fbc grpc_kvstore_test`grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) + 28
frame #40: 0x0000000194f5606c libsystem_pthread.dylib`_pthread_start + 148
Hi, guys. We hit the same issue when running unit tests with ASAN enabled. This should be a bug, and this PR only provides a nicer crash log, right? Do we have any plans to fix it?
wanlin31 pushed a commit that referenced this pull request on May 18, 2023