
Creating thread local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crash #7500

@l8huang

Title: Creating thread local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crash

Description:
In our k8s+Istio environment, Envoy sometimes crashes while constructing a ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry whose load balancer type is LoadBalancerType::OriginalDst. The constructor looks up its ClusterData in active_clusters_ with std::map's at() method, and std::out_of_range is thrown because the element does not exist:

    case LoadBalancerType::OriginalDst: {
      ASSERT(lb_factory_ == nullptr);
      lb_ = std::make_unique<OriginalDstCluster::LoadBalancer>(
          priority_set_, parent.parent_.active_clusters_.at(cluster->name())->cluster_,
          cluster->lbOriginalDstConfig());
      break;
    }

In the normal case this should not happen:

  • CdsApiImpl::onConfigUpdate() keeps to_add_repeated and to_remove_repeated mutually exclusive.
  • ClusterManagerImpl::addOrUpdateCluster() makes sure active_clusters_ contains the needed ClusterData.

So this might be a concurrency issue. I suspect there were two config updates: the first one created a cluster, and the second one removed it before ThreadLocalClusterManagerImpl::ClusterEntry was constructed on the worker threads.
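To make the suspected ordering concrete, here is a minimal single-threaded sketch of it. Everything in it is a hypothetical stand-in rather than Envoy code: `ClusterData`, `ClusterMap`, the `posted` vector (standing in for a worker's post-callback queue) and `deferredLookupThrows()` are names made up for illustration. It only shows that a callback which looks the cluster up by name when it runs will hit std::out_of_range if the entry was erased after the callback was queued.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-cluster data held by the cluster manager.
struct ClusterData {
  std::string name;
};

using ClusterMap = std::map<std::string, std::shared_ptr<ClusterData>>;

// Replays the suspected ordering and reports whether the deferred lookup threw.
bool deferredLookupThrows() {
  ClusterMap active_clusters;
  std::vector<std::function<void()>> posted; // stand-in for a worker's post queue

  // Update 1: add the cluster and queue a callback that looks it up by name
  // only when the callback eventually runs.
  active_clusters.emplace("c1", std::make_shared<ClusterData>(ClusterData{"c1"}));
  posted.push_back([&active_clusters] { (void)active_clusters.at("c1"); });

  // Update 2: remove the cluster before the queued callback has run.
  active_clusters.erase("c1");

  // The "worker" drains its queue; at() throws because the key is gone.
  try {
    for (auto& callback : posted) {
      callback();
    }
  } catch (const std::out_of_range&) {
    return true;
  }
  return false;
}
```

In the real crash nothing catches the exception, which matches the std::terminate() frames in the call stack below.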

The Envoy version is:

version: 4f5b5e101a081e05924990b1903d9d46553558d4/1.11.0-dev/Clean/RELEASE/BoringSSL

Considering that Envoy guarantees eventual consistency, I guess the solution could be to add defensive code that handles the missing entry gracefully instead of crashing. But from a design point of view, is there a rule that worker threads should not access active_clusters_ directly? I just want to figure out the right way to resolve this kind of issue ;)
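As a rough sketch of the two directions I have in mind (not a proposed patch, and not Envoy's actual types): `Cluster`, `ClusterSharedPtr`, `ClusterMap`, `findCluster()` and `makeCallback()` below are all hypothetical names. Option 1 replaces at() with find() so a missing key becomes a null result that the caller can handle. Option 2 resolves the pointer on the thread that owns the map and captures it by value, so the callback never reads the map at all.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins; these are not Envoy's real types.
struct Cluster {
  std::string name;
};
using ClusterSharedPtr = std::shared_ptr<Cluster>;
using ClusterMap = std::map<std::string, ClusterSharedPtr>;

// Option 1: look up with find(), so a missing key yields nullptr instead of
// the std::out_of_range that std::map::at() throws.
ClusterSharedPtr findCluster(const ClusterMap& clusters, const std::string& name) {
  const auto it = clusters.find(name);
  return it == clusters.end() ? nullptr : it->second;
}

// Option 2: resolve the pointer when the callback is created (on the thread
// that owns the map) and capture it by value, so running the callback later
// does not depend on the map's contents.
std::function<std::string()> makeCallback(const ClusterMap& clusters,
                                          const std::string& name) {
  ClusterSharedPtr cluster = findCluster(clusters, name);
  return [cluster] { return cluster ? cluster->name : std::string("<missing>"); };
}
```

Option 1 alone only avoids the exception; if the map is still read from a worker thread while the main thread mutates it, that access is unsynchronized. Option 2 avoids the cross-thread read, at the cost of keeping the captured cluster alive until the callback has run.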

If a similar issue was fixed before, could you please let me know the PR or issue number?

Repro steps:
We don't have a reliable way to reproduce this issue right now. Some cluster info related to the crash:

(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->type_
$6 = envoy::api::v2::Cluster_DiscoveryType_ORIGINAL_DST   

(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->added_via_api_
$7 = true  

(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->lb_type_   
$1 = Envoy::Upstream::LoadBalancerType::OriginalDst

(gdb) p cluster_manager.parent_.init_helper_.state_
$12 = Envoy::Upstream::ClusterManagerInitHelper::State::AllClustersInitialized

(gdb) p cluster_manager.parent_.warming_clusters_ 
$21 = std::map with 0 elements

Call Stack:

(gdb) bt
#0  0x00007f004e85c428 in __GI_raise (sig=sig@entry=6) 
     at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f004e85e02a in __GI_abort () 
     at abort.c:89
#2  0x0000000001063375 in Envoy::TerminateHandler::logOnTerminate() const::$_0::operator()() const (this=<optimized out>) 
     at source/exe/terminate_handler.cc:15
#3  0x0000000001063279 in Envoy::TerminateHandler::logOnTerminate() const::$_0::__invoke() () 
     at source/exe/terminate_handler.cc:12
#4  0x00000000011d55a6 in __cxxabiv1::__terminate(void (*)()) ()
#5  0x00000000011d55f1 in std::terminate() ()
#6  0x00000000011e1794 in __cxa_throw ()
#7  0x00000000011eb9ef in std::__throw_out_of_range(char const*) ()
#8  0x0000000000b54f27 in std::map<
            std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, 
            std::unique_ptr<Envoy::Upstream::ClusterManagerImpl::ClusterData, std::default_delete<Envoy::Upstream::ClusterManagerImpl::ClusterData> >, 
            std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, 
            std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > 
            const, std::unique_ptr<Envoy::Upstream::ClusterManagerImpl::ClusterData, std::default_delete<Envoy::Upstream::ClusterManagerImpl::ClusterData> > > > >
            ::at (this=<optimized out>, __k=...) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/stl_map.h:533
#9  0x0000000000b5c8db in Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry (this=0x438b4000, parent=..., cluster=..., lb_factory=...) 
     at source/common/upstream/cluster_manager_impl.cc:1099
#10 0x0000000000b60c8c in Envoy::Upstream::ClusterManagerImpl::createOrUpdateThreadLocalCluster(Envoy::Upstream::ClusterManagerImpl::ClusterData&)::$_10::operator()() const (this=0x1ecabd10) 
     at source/common/upstream/cluster_manager_impl.cc:530
#11 std::_Function_handler<void (), Envoy::Upstream::ClusterManagerImpl::createOrUpdateThreadLocalCluster(Envoy::Upstream::ClusterManagerImpl::ClusterData&)::$_10>::_M_invoke(std::_Any_data const&) (__functor=...) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:316
#12 0x0000000000b1f10d in std::function<void ()>::operator()() const (this=<optimized out>) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#13 Envoy::Event::DispatcherImpl::runPostCallbacks (this=0x2bf1200) 
     at source/common/event/dispatcher_impl.cc:198
#14 0x0000000000b1efbe in Envoy::Event::DispatcherImpl::run (this=0x2bf1200, type=Envoy::Event::Dispatcher::RunType::Block) 
     at source/common/event/dispatcher_impl.cc:177
#15 0x0000000000b19442 in Envoy::Server::WorkerImpl::threadRoutine (this=0x2c3f2c0, guard_dog=...) 
     at source/server/worker_impl.cc:104
#16 0x00000000010639e5 in std::function<void ()>::operator()() const (this=0x18) 
     at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#17 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::operator()(void*) const (this=<optimized out>, arg=0x0) 
     at source/common/common/posix/thread_impl.cc:39
#18 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::__invoke(void*) (arg=0x0) 
     at source/common/common/posix/thread_impl.cc:38
#19 0x00007f004ef016ba in start_thread (arg=0x7f0030feb700) 
     at pthread_create.c:333
#20 0x00007f004e92e41d in clone () 
     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
