Listener Crash on Start after enabling exact_balance
NOTE: I emailed envoy-security@ and was asked by @mattklein123 to create an issue. I'm still extracting more information, but wanted to kick off the discussion in the interim.
Envoy crashes on startup with exact_balance configuration enabled.
"connection_balance_config": {
"exact_balance": {}
}
Envoy version:
http_archive(
    name = "envoy",
    sha256 = "38bd41e5229532abbeccba7b87d80c8664e915ec6f780a60a6b7ef1818458b5a",
    strip_prefix = "envoy-1.15.0",
    urls = ["https://github.com/envoyproxy/envoy/archive/v1.15.0.tar.gz"],
)
backtrace:
(gdb) bt
#0 Envoy::Network::ExactConnectionBalancerImpl::pickTargetHandler (this=0x556dcafbe900) at external/envoy/source/common/network/connection_balancer_impl.cc:31
#1 0x0000556dc59eb6a3 in Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::onAcceptWorker (this=0x556dca71d300, socket=..., hand_off_restored_destination_connections=<optimized out>, rebalanced=<optimized out>) at external/envoy/source/server/connection_handler_impl.cc:347
#2 0x0000556dc5a0618d in Envoy::Network::ListenerImpl::listenCallback (fd=<optimized out>, remote_addr=<optimized out>, remote_addr_len=<optimized out>, arg=<optimized out>) at external/envoy/source/common/network/listener_impl.cc:78
#3 0x0000556dc5e6216b in listener_read_cb (fd=55, what=<optimized out>, p=0x556dcc113ad0) at /root/.cache/bazel/b570b5ccd0454dc9af9f65ab1833764d/sandbox/processwrapper-sandbox/32/execroot/gateway/external/com_github_libevent_libevent/listener.c:437
#4 0x0000556dc5e5ee80 in event_persist_closure (ev=<optimized out>, base=0x556dc8d6f340) at /root/.cache/bazel/b570b5ccd0454dc9af9f65ab1833764d/sandbox/processwrapper-sandbox/32/execroot/gateway/external/com_github_libevent_libevent/event.c:1639
#5 event_process_active_single_queue (base=base@entry=0x556dc8d6f340, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0, activeq=<optimized out>) at /root/.cache/bazel/b570b5ccd0454dc9af9f65ab1833764d/sandbox/processwrapper-sandbox/32/execroot/gateway/external/com_github_libevent_libevent/event.c:1698
#6 0x0000556dc5e5f4ff in event_process_active (base=0x556dc8d6f340) at /root/.cache/bazel/b570b5ccd0454dc9af9f65ab1833764d/sandbox/processwrapper-sandbox/32/execroot/gateway/external/com_github_libevent_libevent/event.c:1799
#7 event_base_loop (base=0x556dc8d6f340, flags=0) at /root/.cache/bazel/b570b5ccd0454dc9af9f65ab1833764d/sandbox/processwrapper-sandbox/32/execroot/gateway/external/com_github_libevent_libevent/event.c:2041
#8 0x0000556dc59e70cd in Envoy::Server::WorkerImpl::threadRoutine (this=0x556dc8db5310, guard_dog=...) at external/envoy/source/server/worker_impl.cc:133
#9 0x0000556dc5f20475 in std::function<void ()>::operator()() const (this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
#10 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>, absl::optional<Envoy::Thread::Options> const&)::{lambda(void*)#1}::operator()(void*) const (arg=<optimized out>, __closure=0x0) at external/envoy/source/common/common/posix/thread_impl.cc:49
#11 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::function<void ()>, absl::optional<Envoy::Thread::Options> const&)::{lambda(void*)#1}::_FUN(void*) () at external/envoy/source/common/common/posix/thread_impl.cc:51
#12 0x00007f33559246db in start_thread (arg=0x7f333650f700) at pthread_create.c:463
#13 0x00007f335564da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
min_connection_handler is a nullptr because no BalancedConnectionHandler handlers are registered when this balancing occurs.
// printout of handlers_ (empty list)
handlers_ = {
<std::_Vector_base<Envoy::Network::BalancedConnectionHandler*, std::allocator<Envoy::Network::BalancedConnectionHandler*> >> = {
_M_impl = {
<std::allocator<Envoy::Network::BalancedConnectionHandler*>> = {
<__gnu_cxx::new_allocator<Envoy::Network::BalancedConnectionHandler*>> = {<No data fields>}, <No data fields>},
members of std::_Vector_base<Envoy::Network::BalancedConnectionHandler*, std::allocator<Envoy::Network::BalancedConnectionHandler*> >::_Vector_impl:
_M_start = 0x0,
_M_finish = 0x0,
_M_end_of_storage = 0x0
}
}, <No data fields>}
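To illustrate the failure mode, here is a minimal standalone sketch, not Envoy's actual code. `Handler`, `pickMin`, and `pickTargetGuarded` are hypothetical names introduced for illustration. It models a "pick the handler with the fewest connections" loop over an empty list, which yields a nullptr, plus one possible guard that falls back to the accepting worker's handler. That the fallback applies only when the list is empty is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a per-worker connection handler; not Envoy's class.
struct Handler {
  uint64_t num_connections = 0;
};

// Returns the registered handler with the fewest connections, or nullptr
// when nothing has been registered yet (the state shown in the gdb printout).
inline Handler* pickMin(const std::vector<Handler*>& handlers) {
  Handler* min_handler = nullptr;
  for (Handler* h : handlers) {
    if (min_handler == nullptr ||
        h->num_connections < min_handler->num_connections) {
      min_handler = h;
    }
  }
  return min_handler;
}

// Guard sketch (assumption): fall back to the accepting worker's handler
// when the registered list is empty, instead of dereferencing nullptr.
inline Handler& pickTargetGuarded(const std::vector<Handler*>& handlers,
                                  Handler& current) {
  Handler* chosen = pickMin(handlers);
  if (chosen == nullptr) {
    chosen = &current;
  }
  assert(chosen != nullptr);
  ++chosen->num_connections;
  return *chosen;
}
```

In this sketch, any connection accepted while the list is empty stays on the worker that accepted it, so those connections are not balanced across workers.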
To see whether we could mitigate the crash, we changed ExactConnectionBalancerImpl::pickTargetHandler to return the handler passed as its sole argument. This resolves the crash on startup, but we then don't achieve exact balancing consistently. Below are stats dumped for the workers on two of our staging clusters deployed in different zones.
example:
a case where it was working after the patch
listener.0.0.0.0_8080.worker_0.downstream_cx_active: 810
listener.0.0.0.0_8080.worker_0.downstream_cx_total: 1092
listener.0.0.0.0_8080.worker_1.downstream_cx_active: 811
listener.0.0.0.0_8080.worker_1.downstream_cx_total: 1072
listener.0.0.0.0_8080.worker_2.downstream_cx_active: 810
listener.0.0.0.0_8080.worker_2.downstream_cx_total: 1112
listener.0.0.0.0_8080.worker_3.downstream_cx_active: 811
listener.0.0.0.0_8080.worker_3.downstream_cx_total: 1086
listener.0.0.0.0_8080.worker_4.downstream_cx_active: 811
listener.0.0.0.0_8080.worker_4.downstream_cx_total: 1096
a case where it was not working (different cluster)
listener.0.0.0.0_8080.worker_0.downstream_cx_active: 1030
listener.0.0.0.0_8080.worker_0.downstream_cx_total: 2236
listener.0.0.0.0_8080.worker_1.downstream_cx_active: 452
listener.0.0.0.0_8080.worker_1.downstream_cx_total: 1381
listener.0.0.0.0_8080.worker_2.downstream_cx_active: 125
listener.0.0.0.0_8080.worker_2.downstream_cx_total: 875
listener.0.0.0.0_8080.worker_3.downstream_cx_active: 252
listener.0.0.0.0_8080.worker_3.downstream_cx_total: 1054
listener.0.0.0.0_8080.worker_4.downstream_cx_active: 1812
listener.0.0.0.0_8080.worker_4.downstream_cx_total: 3744
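One rough way to quantify the imbalance in the dumps above is the spread between the busiest and least busy worker's downstream_cx_active value. The helper below is a hypothetical sketch (`activeSpread` is not an Envoy API) that takes the per-worker values copied by hand from the stats output.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Difference between the largest and smallest per-worker active connection
// count. Returns 0 for an empty input.
inline uint64_t activeSpread(const std::vector<uint64_t>& active) {
  if (active.empty()) {
    return 0;
  }
  const auto mm = std::minmax_element(active.begin(), active.end());
  return *mm.second - *mm.first;
}
```

For the first cluster (810, 811, 810, 811, 811) the spread is 1. For the second (1030, 452, 125, 252, 1812) it is 1687.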