Title: Creating a thread-local cluster with LoadBalancerType::OriginalDst can't find its ClusterData in active_clusters_ and crashes
Description:
In our k8s + Istio environment, Envoy sometimes crashes while constructing a ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry whose load balancer type is LoadBalancerType::OriginalDst. The constructor looks up its ClusterData in active_clusters_ with std::map's at() method, and std::out_of_range is thrown because the element does not exist:
case LoadBalancerType::OriginalDst: {
ASSERT(lb_factory_ == nullptr);
lb_ = std::make_unique<OriginalDstCluster::LoadBalancer>(
priority_set_, parent.parent_.active_clusters_.at(cluster->name())->cluster_,
cluster->lbOriginalDstConfig());
break;
}
It looks like this can't happen in the normal case:
- CdsApiImpl::onConfigUpdate() makes to_add_repeated and to_remove_repeated mutually exclusive.
- ClusterManagerImpl::addOrUpdateCluster() makes sure active_clusters_ has the needed ClusterData.
So this might be a concurrency issue. I suspect there were two config updates: the first created a cluster, and the second removed it before ThreadLocalClusterManagerImpl::ClusterEntry was constructed on the worker threads.
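To illustrate the sequence I suspect, here is a minimal single-threaded sketch. It is not Envoy code: `ClusterData`, `ClusterMap`, the `posted` queue, the cluster name, and `simulateAddThenRemove()` are hypothetical stand-ins, and the queue only imitates a callback posted to a worker dispatcher. It shows that if the entry is erased between posting the callback and running it, `at()` throws.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <queue>
#include <stdexcept>
#include <string>

namespace race_sketch {

// Hypothetical stand-ins, not Envoy's real types.
struct ClusterData {
  std::string name;
};
using ClusterMap = std::map<std::string, std::unique_ptr<ClusterData>>;

// Returns true if the deferred "worker" callback hit std::out_of_range.
bool simulateAddThenRemove() {
  ClusterMap active_clusters;                // owned by the "main thread"
  std::queue<std::function<void()>> posted;  // imitates a worker's post queue
  bool threw = false;
  const std::string name = "example_original_dst_cluster";  // made-up name

  // Update 1: add the cluster and post the thread-local creation callback.
  active_clusters.emplace(name, std::make_unique<ClusterData>(ClusterData{name}));
  posted.push([&active_clusters, &threw, name]() {
    try {
      (void)active_clusters.at(name);  // same style of lookup as the crashing line
    } catch (const std::out_of_range&) {
      threw = true;  // in the real crash nothing catches this, so terminate() runs
    }
  });

  // Update 2: remove the cluster before the posted callback has run.
  active_clusters.erase(name);

  // The "worker" finally drains its queue.
  while (!posted.empty()) {
    posted.front()();
    posted.pop();
  }
  return threw;
}

}  // namespace race_sketch
```

Calling `race_sketch::simulateAddThenRemove()` returns `true` in this model. Whether the real add and remove can actually interleave this way in Envoy is exactly what I am unsure about.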
The Envoy version is:
version: 4f5b5e101a081e05924990b1903d9d46553558d4/1.11.0-dev/Clean/RELEASE/BoringSSL
Considering that Envoy guarantees eventual consistency, I guess the solution could be to add defensive code that handles the missing entry gracefully instead of crashing. But from a design point of view, is there a rule that worker threads should not access active_clusters_ directly? I just want to figure out the suitable way to resolve this kind of issue ;)
If a similar issue has been fixed before, could you please let me know the PR or issue number?
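As a rough sketch of the defensive lookup I have in mind, the snippet below uses `find()` instead of `at()` so that a missing entry becomes a null result rather than an exception. The types and the `findCluster()` helper are hypothetical stand-ins, not Envoy's API, and this only avoids the crash; it does not answer whether a worker thread should be reading the map at all.

```cpp
#include <map>
#include <memory>
#include <string>

namespace lookup_sketch {

// Hypothetical stand-ins, not Envoy's real types.
struct ClusterData {
  std::string name;
};
using ClusterMap = std::map<std::string, std::unique_ptr<ClusterData>>;

// Returns nullptr when the cluster is absent, instead of throwing like at().
const ClusterData* findCluster(const ClusterMap& clusters, const std::string& name) {
  const auto it = clusters.find(name);
  return it == clusters.end() ? nullptr : it->second.get();
}

}  // namespace lookup_sketch
```

A caller would check the returned pointer and skip creating the thread-local entry (or log and return) when it is null, on the assumption that a later update will bring the worker back in sync.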
Repro steps:
We don't have a stable way to reproduce this issue right now. Some cluster info related to the crash:
(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->type_
$6 = envoy::api::v2::Cluster_DiscoveryType_ORIGINAL_DST
(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->added_via_api_
$7 = true
(gdb) p ((Envoy::Upstream::ClusterInfoImpl*)cluster_info_._M_ptr)->lb_type_
$1 = Envoy::Upstream::LoadBalancerType::OriginalDst
(gdb) p cluster_manager.parent_.init_helper_.state_
$12 = Envoy::Upstream::ClusterManagerInitHelper::State::AllClustersInitialized
(gdb) p cluster_manager.parent_.warming_clusters_
$21 = std::map with 0 elements
Call Stack:
(gdb) bt
#0 0x00007f004e85c428 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f004e85e02a in __GI_abort ()
at abort.c:89
#2 0x0000000001063375 in Envoy::TerminateHandler::logOnTerminate() const::$_0::operator()() const (this=<optimized out>)
at source/exe/terminate_handler.cc:15
#3 0x0000000001063279 in Envoy::TerminateHandler::logOnTerminate() const::$_0::__invoke() ()
at source/exe/terminate_handler.cc:12
#4 0x00000000011d55a6 in __cxxabiv1::__terminate(void (*)()) ()
#5 0x00000000011d55f1 in std::terminate() ()
#6 0x00000000011e1794 in __cxa_throw ()
#7 0x00000000011eb9ef in std::__throw_out_of_range(char const*) ()
#8 0x0000000000b54f27 in std::map<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >,
std::unique_ptr<Envoy::Upstream::ClusterManagerImpl::ClusterData, std::default_delete<Envoy::Upstream::ClusterManagerImpl::ClusterData> >,
std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >,
std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
const, std::unique_ptr<Envoy::Upstream::ClusterManagerImpl::ClusterData, std::default_delete<Envoy::Upstream::ClusterManagerImpl::ClusterData> > > > >
::at (this=<optimized out>, __k=...)
at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/stl_map.h:533
#9 0x0000000000b5c8db in Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry (this=0x438b4000, parent=..., cluster=..., lb_factory=...)
at source/common/upstream/cluster_manager_impl.cc:1099
#10 0x0000000000b60c8c in Envoy::Upstream::ClusterManagerImpl::createOrUpdateThreadLocalCluster(Envoy::Upstream::ClusterManagerImpl::ClusterData&)::$_10::operator()() const (this=0x1ecabd10)
at source/common/upstream/cluster_manager_impl.cc:530
#11 std::_Function_handler<void (), Envoy::Upstream::ClusterManagerImpl::createOrUpdateThreadLocalCluster(Envoy::Upstream::ClusterManagerImpl::ClusterData&)::$_10>::_M_invoke(std::_Any_data const&) (__functor=...)
at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:316
#12 0x0000000000b1f10d in std::function<void ()>::operator()() const (this=<optimized out>)
at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#13 Envoy::Event::DispatcherImpl::runPostCallbacks (this=0x2bf1200)
at source/common/event/dispatcher_impl.cc:198
#14 0x0000000000b1efbe in Envoy::Event::DispatcherImpl::run (this=0x2bf1200, type=Envoy::Event::Dispatcher::RunType::Block)
at source/common/event/dispatcher_impl.cc:177
#15 0x0000000000b19442 in Envoy::Server::WorkerImpl::threadRoutine (this=0x2c3f2c0, guard_dog=...)
at source/server/worker_impl.cc:104
#16 0x00000000010639e5 in std::function<void ()>::operator()() const (this=0x18)
at /usr/lib/gcc/x86_64-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/std_function.h:706
#17 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::operator()(void*) const (this=<optimized out>, arg=0x0)
at source/common/common/posix/thread_impl.cc:39
#18 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>)::$_0::__invoke(void*) (arg=0x0)
at source/common/common/posix/thread_impl.cc:38
#19 0x00007f004ef016ba in start_thread (arg=0x7f0030feb700)
at pthread_create.c:333
#20 0x00007f004e92e41d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109