Skip to content

eds: during startup Envoy is sending too many ClusterLoadAssignment request to Istio pilot #7955

@l8huang

Description

@l8huang

Title: during initializing Envoy is sending too many ClusterLoadAssignment requests to pilot-discovery

Description:
In our env, during Envoy startup and initializing cds, Istio pilot pushes ~18k clusters to Envoy,
then Envoy sends one ClusterLoadAssignment discovery request for each resource by appending its name to request's resource_names list, that's, Envoy may send ~18K ClusterLoadAssignment request, and each request's resource_names list keeps growing.

This has several problems:

  • GrpcStream<>::sendMessage() is called, but messages are not sent though TCP connection, memory are accumulated on Envoy side, Envoy's RES memory grows more than 4g.
  • Envoy takes more than 2 min to send these message and init cluster manager, during this period, Istio pilot may send other messages to Envoy, but Envoy doesn't receive them, that causes TCP connection has "TCP Window Full" issue and being RST by Istio pilot.
  • After grpc stream disconnected, Envoy will create a new grpc stream and send a Listener request to pilot to continue the startup procedure. But when will grpc stream be disconnected is not predicable, Envoy takes random long time to be ready.

The Envoy version is https://github.com/istio/envoy/tree/release-1.1
Istio proxy version is https://github.com/istio/proxy/tree/release-1.1

Is it a known issue? If not, I guess the Envoy::Upstream::EdsClusterImpl::startPreInit() should be modified to postpone sending ClusterLoadAssignment request.

Logs:
The log is extremely long, basically they are repeated with a growing resource_names list, and each time a new resource is added into it.

[2019-08-16 22:28:09.978][1638][info][config] [bazel-out/k8-dbg/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:61] Send request node {
    id: "router~10.148.74.41~istio-ingressgateway-test-579747687d-jb9gc.istio-system~istio-system.svc.33.tess.io"
    cluster: "istio-ingressgateway-test"
    metadata {
 ...
    build_version: "f77d9088d44642decee764c3c9ee0cf13e14f873/1.11.0-dev/Modified/DEBUG/BoringSSL"
  }
  resource_names: "outbound|8080||ner171485deployment.ns-b-ner171485.svc.33.tess.io"
...
  type_url: "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"

Call Stack:

(gdb) bt
#0  Envoy::Config::GrpcStream<envoy::api::v2::DiscoveryRequest, envoy::api::v2::DiscoveryResponse>::sendMessage (this=0x37563d0, request=...) 
        at bazel-out/k8-dbg/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:62
#1  0x000000000102baa5 in Envoy::Config::GrpcMuxImpl::sendDiscoveryRequest (this=0x37563c0, type_url=...) 
        at external/envoy/source/common/config/grpc_mux_impl.cc:61
#2  0x000000000102cfba in Envoy::Config::GrpcMuxImpl::drainRequests (this=0x37563c0) 
        at external/envoy/source/common/config/grpc_mux_impl.cc:229
#3  0x000000000102cef6 in Envoy::Config::GrpcMuxImpl::queueDiscoveryRequest (this=0x37563c0, queue_item=...) 
        at external/envoy/source/common/config/grpc_mux_impl.cc:215
#4  0x000000000102bd11 in Envoy::Config::GrpcMuxImpl::subscribe (this=0x37563c0, type_url=..., resources=..., callbacks=...) 
        at external/envoy/source/common/config/grpc_mux_impl.cc:91
#5  0x0000000000d53874 in Envoy::Config::GrpcMuxSubscriptionImpl::start (this=0x71d25a0, resources=..., callbacks=...) 
        at bazel-out/k8-dbg/bin/external/envoy/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:41
#6  0x00000000010248ef in Envoy::Upstream::EdsClusterImpl::startPreInit (this=0x35e5b00) 
        at external/envoy/source/common/upstream/eds.cc:34
#7  0x0000000000ffa094 in Envoy::Upstream::ClusterImplBase::initialize(std::function<void ()>) (this=0x35e5b00, callback=...) 
        at external/envoy/source/common/upstream/upstream_impl.cc:711
#8  0x0000000000e297cc in Envoy::Upstream::ClusterManagerInitHelper::maybeFinishInitialize (this=0x372f038) 
        at external/envoy/source/common/upstream/cluster_manager_impl.cc:122
#9  0x0000000000e2946b in Envoy::Upstream::ClusterManagerInitHelper::removeCluster (this=0x372f038, cluster=...) 
        at external/envoy/source/common/upstream/cluster_manager_impl.cc:93
#10 0x0000000000e291ff in Envoy::Upstream::ClusterManagerInitHelper::onClusterInit (this=0x372f038, cluster=...) 
        at external/envoy/source/common/upstream/cluster_manager_impl.cc:70
#11 0x0000000000e28c74 in Envoy::Upstream::ClusterManagerInitHelper::<lambda()>::operator()(void) const (__closure=0x7fffffff4ee0) 
        at external/envoy/source/common/upstream/cluster_manager_impl.cc:49
#12 0x0000000000e3508d in std::_Function_handler<void(), Envoy::Upstream::ClusterManagerInitHelper::addCluster(Envoy::Upstream::Cluster&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) 
        at /usr/include/c++/7/bits/std_function.h:316
#13 0x00000000004f82a4 in std::function<void ()>::operator()() const (this=0x7fffffff4ee0) 
        at /usr/include/c++/7/bits/std_function.h:706
#14 0x0000000000ffa64b in Envoy::Upstream::ClusterImplBase::finishInitialization (this=0x39f1860) 
        at external/envoy/source/common/upstream/upstream_impl.cc:758
#15 0x0000000000ffa3e9 in Envoy::Upstream::ClusterImplBase::onInitDone (this=0x39f1860) 
        at external/envoy/source/common/upstream/upstream_impl.cc:740
#16 0x0000000000ff9461 in Envoy::Upstream::ClusterImplBase::<lambda()>::operator()(void) const (__closure=0x1b3f57e0) 
        at external/envoy/source/common/upstream/upstream_impl.cc:652
#17 0x0000000001000e13 in std::_Function_handler<void(), Envoy::Upstream::ClusterImplBase::ClusterImplBase(const envoy::api::v2::Cluster&, Envoy::Runtime::Loader&, Envoy::Server::Configuration::TransportSocketFactoryContext&, Envoy::Stats::ScopePtr&&, bool)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) 
        at /usr/include/c++/7/bits/std_function.h:316
#18 0x00000000004f82a4 in std::function<void ()>::operator()() const (this=0x1b3f57e0) 
        at /usr/include/c++/7/bits/std_function.h:706
#19 0x0000000001081bd7 in Envoy::Init::WatcherHandleImpl::ready (this=0x9126f60) 
        at external/envoy/source/common/init/watcher_impl.cc:15
#20 0x00000000010801cc in Envoy::Init::ManagerImpl::ready (this=0x39f1868) 
        at external/envoy/source/common/init/manager_impl.cc:75
#21 0x000000000107ff5a in Envoy::Init::ManagerImpl::initialize (this=0x39f1868, watcher=...) 
        at external/envoy/source/common/init/manager_impl.cc:46
#22 0x0000000000ffa1d9 in Envoy::Upstream::ClusterImplBase::onPreInitComplete (this=0x39f1860) 
        at external/envoy/source/common/upstream/upstream_impl.cc:722
#23 0x0000000001000387 in Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget::<lambda(const std::__cxx11::list<std::shared_ptr<const Envoy::Network::Address::Instance>, std::allocator<std::shared_ptr<const Envoy::Network::Address::Instance> > >&&)>::operator()(const std::__cxx11::list<std::shared_ptr<Envoy::Network::Address::Instance const>, std::allocator<std::shared_ptr<Envoy::Network::Address::Instance const> > > &&) const (__closure=0x1b3f7aa8, address_list=...) 
        at external/envoy/source/common/upstream/upstream_impl.cc:1373
#24 0x0000000001001862 in std::_Function_handler<void(const std::__cxx11::list<std::shared_ptr<const Envoy::Network::Address::Instance>, std::allocator<std::shared_ptr<const Envoy::Network::Address::Instance> > >&&), Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve()::<lambda(const std::__cxx11::list<std::shared_ptr<const Envoy::Network::Address::Instance>, std::allocator<std::shared_ptr<const Envoy::Network::Address::Instance> > >&&)> >::_M_invoke(const std::_Any_data &, const std::__cxx11::list<std::shared_ptr<Envoy::Network::Address::Instance const>, std::allocator<std::shared_ptr<Envoy::Network::Address::Instance const> > > &&) (__functor=..., __args#0=...) 
        at /usr/include/c++/7/bits/std_function.h:316
#25 0x0000000000ce3277 in std::function<void (std::__cxx11::list<std::shared_ptr<Envoy::Network::Address::Instance const>, std::allocator<std::shared_ptr<Envoy::Network::Address::Instance const> > > const&&)>::operator()(std::__cxx11::list<std::shared_ptr<Envoy::Network::Address::Instance const>, std::allocator<std::shared_ptr<Envoy::Network::Address::Instance const> > > const&&) const (this=0x1b3f7aa8, __args#0=...) 
        at /usr/include/c++/7/bits/std_function.h:706
#26 0x0000000000dbc6ac in Envoy::Network::DnsResolverImpl::PendingResolution::onAresHostCallback (this=0x1b3f7aa0, status=0, timeouts=1, hostent=0x3a49fe0) 
        at external/envoy/source/common/network/dns_impl.cc:124
#27 0x0000000000dbd112 in Envoy::Network::DnsResolverImpl::PendingResolution::<lambda(void*, int, int, hostent*)>::operator()(void *, int, int, hostent *) const (__closure=0x0, arg=0x1b3f7aa0, status=0, timeouts=1, hostent=0x3a49fe0) 
        at external/envoy/source/common/network/dns_impl.cc:228
#28 0x0000000000dbd149 in Envoy::Network::DnsResolverImpl::PendingResolution::<lambda(void*, int, int, hostent*)>::_FUN(void *, int, int, hostent *) () 
        at external/envoy/source/common/network/dns_impl.cc:230
#29 0x0000000000ef495a in end_hquery (hquery=0x1b3e1180, status=<optimized out>, host=0x3a49fe0) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_gethostbyname.c:234
#30 0x0000000000ef4ce1 in host_callback (arg=0x1b3e1180, status=<optimized out>, timeouts=<optimized out>, abuf=<optimized out>, alen=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_gethostbyname.c:214
#31 0x0000000000efbd6c in end_squery (squery=0x1b3e11c0, status=<optimized out>, abuf=<optimized out>, alen=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_search.c:210
#32 0x0000000000efbea1 in search_callback (arg=0x1b3e11c0, status=<optimized out>, timeouts=<optimized out>, abuf=<optimized out>, alen=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_search.c:157
#33 0x0000000000efbab9 in qcallback (arg=0x35a53f0, status=<optimized out>, timeouts=<optimized out>, abuf=<optimized out>, alen=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_query.c:183
#34 0x0000000000efa2a7 in end_query (channel=channel@entry=0x367c000, query=query@entry=0x71dfd40, status=status@entry=0, abuf=abuf@entry=0x7fffffff5780 "\220\324\201\200", alen=alen@entry=502) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_process.c:1437
#35 0x0000000000efb1eb in process_answer (channel=channel@entry=0x367c000, abuf=abuf@entry=0x7fffffff5780 "\220\324\201\200", alen=502, whichserver=whichserver@entry=0, tcp=tcp@entry=0, now=now@entry=0x7fffffff6800) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_process.c:668
#36 0x0000000000efb42e in process_answer (now=0x7fffffff6800, tcp=0, whichserver=0, alen=<optimized out>, abuf=0x7fffffff5780 "\220\324\201\200", channel=0x367c000) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_process.c:168
#37 read_udp_packets (channel=channel@entry=0x367c000, read_fds=read_fds@entry=0x0, read_fd=read_fd@entry=116, now=now@entry=0x7fffffff6800) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_process.c:532
#38 0x0000000000efb510 in processfds (channel=0x367c000, read_fds=<optimized out>, read_fd=116, write_fds=<optimized out>, write_fd=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/325/execroot/__main__/external/com_github_c_ares_c_ares/ares_process.c:131
#39 0x0000000000dbccd2 in Envoy::Network::DnsResolverImpl::onEventCallback (this=0x35c40c0, fd=116, events=1) 
        at external/envoy/source/common/network/dns_impl.cc:168
#40 0x0000000000dbcd0c in Envoy::Network::DnsResolverImpl::<lambda(uint32_t)>::operator()(uint32_t) const (__closure=0x71f7b50, events=1) 
        at external/envoy/source/common/network/dns_impl.cc:186
#41 0x0000000000dbd9c6 in std::_Function_handler<void(unsigned int), Envoy::Network::DnsResolverImpl::onAresSocketStateChange(int, int, int)::<lambda(uint32_t)> >::_M_invoke(const std::_Any_data &, unsigned int &&) (__functor=..., __args#0=@0x7fffffff68d4: 1) 
        at /usr/include/c++/7/bits/std_function.h:316
#42 0x0000000000da775a in std::function<void (unsigned int)>::operator()(unsigned int) const (this=0x71f7b50, __args#0=1) 
        at /usr/include/c++/7/bits/std_function.h:706
#43 0x0000000000da73b6 in Envoy::Event::FileEventImpl::<lambda(int, short int, void*)>::operator()(int, short, void *) const (__closure=0x0, what=2, arg=0x71f7ad0) 
        at external/envoy/source/common/event/file_event_impl.cc:64
#44 0x0000000000da7434 in Envoy::Event::FileEventImpl::<lambda(int, short int, void*)>::_FUN(int, short, void *) () 
        at external/envoy/source/common/event/file_event_impl.cc:65
#45 0x00000000013c8019 in event_persist_closure (ev=<optimized out>, base=0x35d2840) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/330/execroot/__main__/external/com_github_libevent_libevent/event.c:1623
#46 event_process_active_single_queue (base=base@entry=0x35d2840, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0, activeq=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/330/execroot/__main__/external/com_github_libevent_libevent/event.c:1682
#47 0x00000000013c854f in event_process_active (base=base@entry=0x35d2840) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/330/execroot/__main__/external/com_github_libevent_libevent/event.c:1783
#48 0x00000000013ccb88 in event_base_loop (base=0x35d2840, flags=<optimized out>) 
        at /root/.cache/bazel/_bazel_root/982c6ecebbec89f8d3e74d5491958848/sandbox/linux-sandbox/330/execroot/__main__/external/com_github_libevent_libevent/event.c:2006
#49 0x0000000000dc6ebc in Envoy::Event::LibeventScheduler::run (this=0x35fe860, mode=Envoy::Event::Dispatcher::RunType::Block) 
        at external/envoy/source/common/event/libevent_scheduler.cc:39
#50 0x0000000000da27d0 in Envoy::Event::DispatcherImpl::run (this=0x35fe840, type=Envoy::Event::Dispatcher::RunType::Block) 
        at external/envoy/source/common/event/dispatcher_impl.cc:167
#51 0x0000000000cf01b9 in Envoy::Server::InstanceImpl::run (this=0x3618500) 
        at external/envoy/source/server/server.cc:482
#52 0x000000000066c306 in Envoy::MainCommonBase::run (this=0x35e8518) 
        at external/envoy/source/exe/main_common.cc:117
#53 0x000000000066a5a0 in Envoy::MainCommon::run (this=0x35e8000) 
        at bazel-out/k8-dbg/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:88
#54 0x000000000066a057 in main (argc=18, argv=0x7fffffff6e38) 
        at external/envoy/source/exe/main.cc:39

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionQuestions that are neither investigations, bugs, nor enhancements

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions