Skip to content

[OTel] Fix deadlocks when adding and removing callbacks#37425

Closed
yashykt wants to merge 4 commits intogrpc:masterfrom
yashykt:FixDeadlocks1
Closed

[OTel] Fix deadlocks when adding and removing callbacks#37425
yashykt wants to merge 4 commits intogrpc:masterfrom
yashykt:FixDeadlocks1

Conversation

@yashykt
Copy link
Copy Markdown
Member

@yashykt yashykt commented Aug 8, 2024

Internal bug: b/357864682

There are two bugs being fixed here -

Bug 1 - A lock ordering inversion was noticed with the following stacks -

[mutex.cc : 1418] RAW: Potential Mutex deadlock:                                                         
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()                                                 
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()                                        
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()                        
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()                 
        @ 0x564f4c1f1216 std::default_delete<>::operator()()                                              
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()                                                  
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()                                                        
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()                                            
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()                                             
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()                                               
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()                           
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()                           
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()                                           
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()                                                 
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()                                     
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()            
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()            
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()                                           
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()                                         
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()                                             
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()                                     
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()          
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()          
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()                                           
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()                                         
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()                                             
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()                                     
        @ 0x564f4c26bafc std::pair<>::~pair()                                                             
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()                                          
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()                                             
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()                                               
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()                                                  
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()                                                      
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()                                                     
        @ 0x564f4c26a62a std::map<>::~map()                                                               
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()                                                                                                     
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()                                                                                                     
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()                                           
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
                                                                                                                                                                                                                   
[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed                                                          
[mutex.cc : 1432] RAW: Cycle:                       
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:                                                                                                                                                                    
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()                                    
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()                                               
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()            
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()         
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()                   
        @ 0x564f4c1f111b std::make_unique<>()                                                            
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()                                       
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()                                               
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()                     
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()                  
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()                                                                                                               
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()                                                          
        @ 0x564f4c2f80f6 std::__invoke_impl<>()                                                                                                                                                                    
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()                                           
        @ 0x564f4b8ad682 std::function<>::operator()()                                                                                                                                                                                                                                                                                                                                                                                  
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()                                                                                     
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()                                                                                                                                                
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()                       
        @ 0x564f4cd75a72 exec_ctx_run()                                                                    
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()          
        @ 0x564f4cc8ee1d end_worker()  
        @ 0x564f4cc8f304 pollset_work()                                                                    
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()    
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck                                                     
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()                                                  
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject() 
                                                                                                           
[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:                                                         
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()                                    
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()                                               
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()  
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()                               
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()                  
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()                                                                                                    
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()                                                                                                                                            
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()                                                                                                   
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()                                                                                                          
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()                                                                                                                   
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()                                                                                                                                       
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()             
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()     
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()                    
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()                     
        @ 0x564f4bf1cd8b CivetServer::requestHandler()       
        @ 0x564f4bf35e7b handle_request                                                                    
        @ 0x564f4bf29534 handle_request_stat_log                                                           
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run      
        @ 0x564f4bf3a57f worker_thread         
        @ 0x7f93e9137ea7 start_thread     
                                                                                                           
[mutex.cc : 1454] RAW: dying due to potential deadlock                                                   
Aborted    

From the stack, it looks like we are ending up holding a lock to the RlsLB policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The expected order is OpenTelemetry -> gRPC OpenTelemetry plugin -> gRPC Component like RLS/xDSClient.

Bug 2 - We are adding and removing callbacks on the OpenTelemetry Async Instruments without synchronization. This opens us to races where we have an AddCallback and RemoveCallback operation happening at the same time. The correct result after these operations is to still have a callback registered with OpenTelemetry at the end, but the two operations could race and we could just decide to remove the OpenTelemetry callback.

Changes -

  • We add and remove callbacks from the gRPC OpenTelemetry plugin asynchronously using EventEngine. This allows to be safe and avoid deadlocks even if the callback owner adds/removes a callback while holding a lock.
  • To make this async operation work, we now need RegisteredMetricCallback to be internally refcounted so that we can hold on to the callback even after the owner has released it. Additionally, to make sure that the callbacks are valid while we actually remove the callback, we need the callbacks to hold a ref to the callback owner.
  • To fix Bug 2, we are just deciding to not remove OpenTelemetry callbacks immediately, and instead delay this to plugin destruction time.

Verified that the deadlock goes away with this fix.

In an upcoming PR, I am also going to change the gRPC OpenTelemetry plugin to release its mutex before the callbacks are invoked. That would save a lot of contention.

@yashykt yashykt added the release notes: yes Indicates if PR needs to be in release notes label Aug 9, 2024
@yashykt yashykt changed the title Fix deadlocks1 [OTel] Fix deadlocks when adding and removing callbacks Aug 9, 2024
Copy link
Copy Markdown
Member

@yijiem yijiem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Yash! This is such a tricky issue...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should add a test case where e.g. we register callback1 and callback2 for the same gauge. Then we destroy callback1 (internally calls RemoveCallback). Then we should still expect callback2 to work. I think this test should also work before this change, but we don't have it now.

// callback per OpenTelemetry instrument which is a small number. If we decide
// to remove the callback immediately at this point, we need to make sure that
// 1) the callback is removed without holding mu_ and 2) we make sure that
// this does not race against a possible `AddCallback` operation. A potential
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the hypothetical race is when AddCallback() and RemoveCallback() get called concurrently from different threads within the same instance of a component (e.g. one instance of RlsLb) and passing in the same RegisteredMetricCallback* pointer which we can say is not supported or undefined behavior.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we may not need to make the change in RemoveCallback() if we are okay to put that restriction in our API. But we probably should increase our test coverage for these cases.

Copy link
Copy Markdown
Member Author

@yashykt yashykt Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the race is when AddCallback() and RemoveCallback() get called concurrently from different threads with different instances of the same component, since all instances share the same OTel instrument.

Copy link
Copy Markdown
Member

@yijiem yijiem Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that is a great catch! I think I was wrong about this. Consider the following scenario with the original code where Thread1 has already added a callback on a gauge and then Thread1 calls RemoveCallback and Thread2 calls AddCallback for the same gauge:

Scenario1 where Thread1 passed the critical region first followed by Thread2 passed the critical region, and then if Thread2 added the otel callback first (double added) and Thread1 removed the callback last. This will cause the otel callback being removed which is wrong.

        Thread1                           Thread2
  RemoveCallback                        AddCallback
             |                               |
   critical region                           |
             |                         critical region
             |                               |
             |                         add otel callback
remove otel callback

But Scenario2 where Thread2 passed the critical region first (increase gauge cache's reference count) followed by Thread1 seems fine.

        Thread1                           Thread2
  RemoveCallback                        AddCallback
             |                               |
             |                         critical region
    critical region                          |
             |                        will not add otel callback
will not remove otel callback

Scenario3 will be fine:

        Thread1                           Thread2
  RemoveCallback                        AddCallback
             |                               |
             |                        critical region
             |                               |
             |                        will not add otel callback
    critical region
             |
 will not remove otel calback

Scenario4 will also be fine:

        Thread1                           Thread2
  RemoveCallback                       AddCallback
             |                              |
   critical region                          |
             |                              |
  remove otel callback                      |
                                      critical region
                                            |
                                    add back otel callback

void AddCallback(grpc_core::RegisteredMetricCallback* callback)
void AddCallback(
grpc_core::RefCountedPtr<grpc_core::RegisteredMetricCallback> callback)
ABSL_LOCKS_EXCLUDED(mu_) override;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably remove the ABSL_LOCKS_EXCLUDED(mu_) annotation now. Same for RemoveCallback().

copybara-service bot pushed a commit that referenced this pull request Aug 15, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in #37425.
Verified that it fixes the bug.

Closes #37459

COPYBARA_INTEGRATE_REVIEW=#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
yashykt added a commit to yashykt/grpc that referenced this pull request Aug 15, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in grpc#37425.
Verified that it fixes the bug.

Closes grpc#37459

COPYBARA_INTEGRATE_REVIEW=grpc#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
yashykt added a commit to yashykt/grpc that referenced this pull request Aug 15, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in grpc#37425.
Verified that it fixes the bug.

Closes grpc#37459

COPYBARA_INTEGRATE_REVIEW=grpc#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
yashykt added a commit to yashykt/grpc that referenced this pull request Aug 15, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in grpc#37425.
Verified that it fixes the bug.

Closes grpc#37459

COPYBARA_INTEGRATE_REVIEW=grpc#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
yashykt added a commit to yashykt/grpc that referenced this pull request Aug 15, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in grpc#37425.
Verified that it fixes the bug.

Closes grpc#37459

COPYBARA_INTEGRATE_REVIEW=grpc#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
drfloob pushed a commit that referenced this pull request Aug 15, 2024
Backport #37459 to 1.66
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the
`RlsLB` policy while removing a callback from the gRPC OpenTelemetry
plugin, which is a lock ordering inversion. The correct order is
`OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like
RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be
unregistered when the corresponding component object is
orphaned/destroyed (unreffing). Also, note that removing callbacks
requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we
remove the callback inside `RlsLb` from outside the critical region, but
`RlsLb` owns refs to child policies which in turn hold refs to
`XdsClient`. The lock ordering inversion occurred due to unreffing child
policies within the critical region.

This PR is an alternative fix to this problem. Original fix in
#37425. Verified that it fixes the bug.
copybara-service bot pushed a commit that referenced this pull request Aug 15, 2024
Split off from #37425

We are adding and removing callbacks on the OpenTelemetry Async Instruments without synchronization. This opens us to races where we have an AddCallback and RemoveCallback operation happening at the same time. The correct result after these operations is to still have a callback registered with OpenTelemetry at the end, but the two operations could race and we could just decide to remove the OpenTelemetry callback.

The fix delays removing OpenTelemetry callbacks to plugin destruction time.

Closes #37485

COPYBARA_INTEGRATE_REVIEW=#37485 from yashykt:FixRaceOTelGauge 016b0a4
PiperOrigin-RevId: 663492598
yashykt added a commit to yashykt/grpc that referenced this pull request Aug 16, 2024
Split off from grpc#37425

We are adding and removing callbacks on the OpenTelemetry Async Instruments without synchronization. This opens us to races where we have an AddCallback and RemoveCallback operation happening at the same time. The correct result after these operations is to still have a callback registered with OpenTelemetry at the end, but the two operations could race and we could just decide to remove the OpenTelemetry callback.

The fix delays removing OpenTelemetry callbacks to plugin destruction time.

Closes grpc#37485

COPYBARA_INTEGRATE_REVIEW=grpc#37485 from yashykt:FixRaceOTelGauge 016b0a4
PiperOrigin-RevId: 663492598
yashykt added a commit to yashykt/grpc that referenced this pull request Aug 16, 2024
Split off from grpc#37425

We are adding and removing callbacks on the OpenTelemetry Async Instruments without synchronization. This opens us to races where we have an AddCallback and RemoveCallback operation happening at the same time. The correct result after these operations is to still have a callback registered with OpenTelemetry at the end, but the two operations could race and we could just decide to remove the OpenTelemetry callback.

The fix delays removing OpenTelemetry callbacks to plugin destruction time.

Closes grpc#37485

COPYBARA_INTEGRATE_REVIEW=grpc#37485 from yashykt:FixRaceOTelGauge 016b0a4
PiperOrigin-RevId: 663492598
yashykt added a commit that referenced this pull request Aug 16, 2024
Backport #37459 to v1.65
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the
`RlsLB` policy while removing a callback from the gRPC OpenTelemetry
plugin, which is a lock ordering inversion. The correct order is
`OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like
RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be
unregistered when the corresponding component object is
orphaned/destroyed (unreffing). Also, note that removing callbacks
requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we
remove the callback inside `RlsLb` from outside the critical region, but
`RlsLb` owns refs to child policies which in turn hold refs to
`XdsClient`. The lock ordering inversion occurred due to unreffing child
policies within the critical region.

This PR is an alternative fix to this problem. Original fix in
#37425. Verified that it fixes the bug.

Closes #37459

COPYBARA_INTEGRATE_REVIEW=#37459 from
yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427




<!--

If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.

If your pull request is for a specific language, please add the
appropriate
lang label.

-->
yashykt added a commit that referenced this pull request Aug 16, 2024
Backport #37459 to 1.64.x
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the
`RlsLB` policy while removing a callback from the gRPC OpenTelemetry
plugin, which is a lock ordering inversion. The correct order is
`OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like
RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be
unregistered when the corresponding component object is
orphaned/destroyed (unreffing). Also, note that removing callbacks
requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we
remove the callback inside `RlsLb` from outside the critical region, but
`RlsLb` owns refs to child policies which in turn hold refs to
`XdsClient`. The lock ordering inversion occurred due to unreffing child
policies within the critical region.

This PR is an alternative fix to this problem. Original fix in
#37425. Verified that it fixes the bug.
yashykt added a commit that referenced this pull request Aug 16, 2024
Backport #37459 to 1.63
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the
`RlsLB` policy while removing a callback from the gRPC OpenTelemetry
plugin, which is a lock ordering inversion. The correct order is
`OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like
RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be
unregistered when the corresponding component object is
orphaned/destroyed (unreffing). Also, note that removing callbacks
requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we
remove the callback inside `RlsLb` from outside the critical region, but
`RlsLb` owns refs to child policies which in turn hold refs to
`XdsClient`. The lock ordering inversion occurred due to unreffing child
policies within the critical region.

This PR is an alternative fix to this problem. Original fix in
#37425. Verified that it fixes the bug.
yashykt added a commit that referenced this pull request Aug 16, 2024
Backport #37485 to v1.65.x
Split off from #37425

We are adding and removing callbacks on the OpenTelemetry Async
Instruments without synchronization. This opens us to races where we
have an AddCallback and RemoveCallback operation happening at the same
time. The correct result after these operations is to still have a
callback registered with OpenTelemetry at the end, but the two
operations could race and we could just decide to remove the
OpenTelemetry callback.

The fix delays removing OpenTelemetry callbacks to plugin destruction
time.
yashykt added a commit that referenced this pull request Aug 16, 2024
Backport #37485 to v1.66
Split off from #37425

We are adding and removing callbacks on the OpenTelemetry Async
Instruments without synchronization. This opens us to races where we
have an AddCallback and RemoveCallback operation happening at the same
time. The correct result after these operations is to still have a
callback registered with OpenTelemetry at the end, but the two
operations could race and we could just decide to remove the
OpenTelemetry callback.

The fix delays removing OpenTelemetry callbacks to plugin destruction
time.
@yashykt yashykt closed this Aug 20, 2024
paulosjca pushed a commit to paulosjca/grpc that referenced this pull request Nov 25, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -
```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
        @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
        @ 0x564f4c1f1216 std::default_delete<>::operator()()
        @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
        @ 0x564f4c1ee967 std::unique_ptr<>::reset()
        @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
        @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
        @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
        @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
        @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
        @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
        @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
        @ 0x564f4c26bafc std::pair<>::~pair()
        @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
        @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
        @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
        @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
        @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
        @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
        @ 0x564f4c26a62a std::map<>::~map()
        @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
        @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
        @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()

[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding  0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
        @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
        @ 0x564f4c1f111b std::make_unique<>()
        @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
        @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
        @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
        @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
        @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
        @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
        @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
        @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
        @ 0x564f4c2f80f6 std::__invoke_impl<>()
        @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
        @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
        @ 0x564f4b8ad682 std::function<>::operator()()
        @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
        @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
        @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
        @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
        @ 0x564f4cd75a72 exec_ctx_run()
        @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
        @ 0x564f4cc8ee1d end_worker()
        @ 0x564f4cc8f304 pollset_work()
        @ 0x564f4cc5dcaf pollset_work()
        @ 0x564f4cc69220 grpc_pollset_work()
        @ 0x564f4cbe7733 cq_pluck()
        @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
        @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
        @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
        @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
        @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
        @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()

[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
        @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
        @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
        @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
        @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
        @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
        @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
        @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
        @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
        @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
        @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
        @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
        @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
        @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
        @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
        @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
        @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
        @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
        @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
        @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
        @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
        @ 0x564f4bef004f prometheus::detail::CollectMetrics()
        @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
        @ 0x564f4bf1cd8b CivetServer::requestHandler()
        @ 0x564f4bf35e7b handle_request
        @ 0x564f4bf29534 handle_request_stat_log
        @ 0x564f4bf39b3f process_new_connection
        @ 0x564f4bf3a448 worker_thread_run
        @ 0x564f4bf3a57f worker_thread
        @ 0x7f93e9137ea7 start_thread

[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in grpc#37425.
Verified that it fixes the bug.

Closes grpc#37459

COPYBARA_INTEGRATE_REVIEW=grpc#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
paulosjca pushed a commit to paulosjca/grpc that referenced this pull request Nov 25, 2024
Split off from grpc#37425

We are adding and removing callbacks on the OpenTelemetry Async Instruments without synchronization. This opens us to races where we have an AddCallback and RemoveCallback operation happening at the same time. The correct result after these operations is to still have a callback registered with OpenTelemetry at the end, but the two operations could race and we could just decide to remove the OpenTelemetry callback.

The fix delays removing OpenTelemetry callbacks to plugin destruction time.

Closes grpc#37485

COPYBARA_INTEGRATE_REVIEW=grpc#37485 from yashykt:FixRaceOTelGauge 016b0a4
PiperOrigin-RevId: 663492598
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants