[stats]: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest" (will be renamed to upstream-stats) #23907
Conversation
…n non-config-update stats Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…-stats and "the rest"(will be renamed to upstream-stats) Signed-off-by: Xin Zhuang <stevenzzz@google.com>
/assign @jmarantz
jmarantz
left a comment
looks good basically; just a few nits.
Can you merge main?
envoy/upstream/upstream.h (Outdated)
```
 * All cluster stats. @see stats_macros.h
 * All cluster config update related stats.
 * See https://github.com/envoyproxy/envoy/issues/23575 for details. Stats from ClusterInfo::stats()
 * will be split into subgroups "config-update", "lb", "endpoint" and "upstream", roughly based on
```
This comment should describe the current state of affairs assuming this PR is checked in, not express a plan.
AFAICT stats_ no longer exists for upstream; it's now fully partitioned so we can just document that.
I decided to rename that field in the next PR.
I think the comment here is mostly accurate; I updated the "upstream" one to reflect the "current" status.
envoy/upstream/upstream.h (Outdated)
```
/**
 * All cluster stats. @see stats_macros.h
 */
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
```
I think maybe we should rename this macro and related call-sites because this does not represent all the stats anymore.
I did that in another CL. See https://github.com/envoyproxy/envoy/pull/23921/files
can we focus on just making this one coherent? and getting it in?
WDYT about renaming this CLUSTER_TRAFFIC_STATS or similar; these all look like they are specifically to handle actual traffic, so they are obviously a good candidate to lazy-load in the next PR.
oh I see, you were saying you renamed that set in the lazy-init PR. I think it would make more sense in this one.
Also rename clusterStats() to clusterTrafficStats() or something; same with its StatNames object.
envoy/upstream/upstream.h (Outdated)
```
virtual TransportSocketMatcher& transportSocketMatcher() const PURE;

/**
 * @return ClusterConfigUpdateStats& strongly named config update stats for this cluster.
```
I realize this phrase 'strongly named' pre-existed, but I'm not sure what it means.
It's possible I came up with the phrase and I still don't know what it means :)
Do you? If not, can we just remove this here & below?
For ClusterLbStats maybe use 'load-balancer-related stats for this cluster'.
```
@@ -326,6 +326,13 @@ class ClusterManagerImpl : public ClusterManager,
  initializeSecondaryClusters(const envoy::config::bootstrap::v3::Bootstrap& bootstrap) override;

  const ClusterStatNames& clusterStatNames() const override { return cluster_stat_names_; }
```
do we still use ClusterStatNames anywhere?
I am not sure.
I saw it only used to instantiate the ClusterStats in upstream_impl.cc.
can you comment it out and see what breaks?
I'd like this PR to just be a repartitioning without adding any overlap.
oh maybe a rename to clusterTrafficStatNames()
I am sufficiently myopic in my Envoy knowledge that I'm actually not sure if Clusters are used at L4, and thus clusterRequestStatNames may be inaccurate :)
This is required to instantiate ClusterStats in ClusterInfoImpl::generateStats.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Confused about how this stacks on the other PR -- this works on its own and appears coherent, right? Also that one is a draft and this one looks close to being ready.

There are 3 PRs: #23523, this one, and the lazy-init one #23921. I was thinking we merge them one after another, to make incremental/smaller PRs to the code base. But if we can merge this one directly, I am happy to remove the first one, #23523.
```
Stats::IsolatedStoreImpl stats_store_;
ClusterStatNames stat_names_;
ClusterStats stats_;
ClusterLbStatNames lb_stat_names_;
```
Hm. I guess you didn't need to change the variable name from stat_names_ to lb_stat_names_ here. You could have just left it as stat_names_ (still changing the type). Ditto for lb_stats_. That would make this PR a lot smaller, I think.
IIUC, there is a StatNames struct for every stats struct. So we now have 4 {Lb, Endpoint, ConfigUpdate, }StatNames definitions.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…pstream/*.cc Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…pstream/ Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Reverted the lb_stats** renaming under {source, test}/common/upstream/**. PTAL
jmarantz
left a comment
lgtm modulo minor rename request, plus the superfluous changes to set host weight.
thank you!
```
// Host weight is 1.
{
  EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
  stats_.max_host_weight_.set(1UL);
```
is this change related?
BTW it's fine to set stats in a test. It's only in prod that I have a concern. In any case this change probably should go in a different PR, no?
It sorta is. Now that max_host_weight is not part of stats_, we can't do this anymore.
But luckily this is also "dead" code, in that it's not necessary at all.
At some point the stat was used in lb-decision making, but the logic has changed, and per our discussion it's better that the flow never depends on some stats.
```
// Having 3 possible weights, 1, 2, and 3 to provide the state space at least some variation
// in regards to weights, which do affect the load balancing algorithm. Cap the amount of
// weights at 3 for simplicity's sake
stats_.max_host_weight_.set(3UL);
```
here too.
Actually, if the reason you removed this is that (a) these stats were no longer accessible from this point and so couldn't be written, and (b) that had no impact on the tests, then maybe send just the deletion of these max_host_weight_.set() calls in a separate PR? That should be easy to review (deleting a bunch of superfluous test lines).
One concern I might have is if setting these values wakes up some code we need to be testable.
removing the max_host_weight_ references in tests with PR: #23936
/wait
/assign @zuercher
jmarantz
left a comment
Thanks Xin.
Stephen, please let me know if you need more context, or reach out to the PR author directly.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
/assign @yanavlasov
I think we should wait for a non-Google maintainer to review. If Stephan can't do it, we can get a non-Google non-senior maintainer and then Yan can give final approval.
Also PTAL @ clang-tidy failure
```
ClusterTrafficStats ClusterInfoImpl::generateStats(Stats::Scope& scope,
                                                   const ClusterTrafficStatNames& stat_names) {
  return ClusterTrafficStats(stat_names, scope);
```

Suggested change:
```diff
-  return ClusterTrafficStats(stat_names, scope);
+  return {stat_names, scope};
```
```
@@ -107,8 +107,7 @@ TEST(DISABLED_LeastRequestLoadBalancerWeightTest, Weight) {
class DISABLED_SimulationTest : public testing::Test {
```

I know this is not your change, but we need to mollify preexisting clang-tidy:
```diff
-class DISABLED_SimulationTest : public testing::Test {
+class DISABLED_SimulationTest : public testing::Test { // NOLINT(readability-identifier-naming)
```
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
```
/**
 * All cluster loadbalancing related stats.
 */
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
```
I think these should be further sub-divided by functionality. lb_subset, zone_aware, and then all the rest. Both of those features are unused in many configs so they could all be omitted.
IIUC, the zone_aware stats are the "base" for subset_lb load balancers, due to the following:
`class EdfLoadBalancerBase : public ZoneAwareLoadBalancerBase`
so the lb-related stats (subset and zone-aware) will be created/required at the same time.
But I think the zone-awareness isn't used in many cases, and those stats will remain zero'd (at least I think that's how it works). So if they are separate and lazy-created, I think they won't get created in many configs.
Out of the 5 lb-impls used in subset_lb, only RingHashLoadBalancer and MagLevLoadBalancer are NOT inherited from ZoneAwareLoadBalancer. But I see your point there. It requires some refactoring work: we'd better instantiate those stats objects from within each lb-object, which triggers even more delta.
Do you mind if I do it incrementally?
Incrementally sounds good. Do you want to land this as-is and do it in a subsequent PR?
yeah. it'd be much cleaner.
/wait-any
comments were addressed in one of Xin's commits.
Thanks for working on my PR!
Commit Message: LazyInit ClusterInfo::trafficStats().
Additional Description: 3rd PR for #23575, stacked on the 2nd PR #23907.
With 100K clusters, we are seeing ~1.5 GB less RAM usage with this PR. Kudos to @jmarantz for the data:

| | Clean client | in-MB | Deferred | Diff | Diff % | Deferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|------------|------------|----------------|------------|-------------|
| allocated | 4561550208 | 4350.233276 | 2886860656 | 1674689552 | 36.71316714 | 4565167632 | 3.44984436 |
| heap_size | 5303697408 | 656 | 3443523584 | 1860173824 | 35.07315144 | 5146411008 | -150 |
| pageheap_unmapped | 687865856 | | 501219328 | 186646528 | 27.13414634 | 524288000 | -156 |
| pageheap_free | 22921216 | 21.859375 | 23109632 | -188416 | -0.8220157255 | 22257664 | -0.6328125 |
| total_thread_cache | 1718288 | 1.638687134 | 4197032 | -2478744 | -144.2566089 | 4833576 | 2.970970154 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430 | 1681949728 | 36.19281645 | 4653479518 | 5.99609375 |

To reproduce, use this script to add 100K clusters to the bootstrap tmpl:

```
$ cat large_bootstrap_maker.sh
#!/bin/bash
set -u
set -e
limit="$1"
for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done
```

base.tmpl:

```
$ cat base.yaml
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap:

```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
```

and modify test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead. Then you can run the test and observe the RAM usage.
Risk Level: Medium (changes how cluster_info::trafficStats() are created).
Testing: existing unit tests.
Docs Changes: done
Release Notes: included
Platform Specific Features: n/a
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Commit Message: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest"(will be renamed to upstream-stats)
Additional Description:
See more description and live example in #23575.
With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM being spent on stats, while most of the stats are never incremented due to the traffic pattern (a long tail). We are thinking that we can lazy-init cluster stats() so that the RAM is only allocated when it's required.
To achieve that we need finer-grained stats groups: e.g. configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only required when there is traffic for the cluster; for this subgroup we can save RAM by lazy-initing it.
These are the subgroups I plan to have for clusters.
```
/**
 * All cluster config update related stats.
 */
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(assignment_stale)                                                                \
  COUNTER(assignment_timeout_received)                                                     \
  COUNTER(update_attempt)                                                                  \
  COUNTER(update_empty)                                                                    \
  COUNTER(update_failure)                                                                  \
  COUNTER(update_no_rebuild)                                                               \
  COUNTER(update_success)                                                                  \
  GAUGE(version, NeverImport)

/**
 * All cluster endpoints related stats.
 */
#define ALL_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  GAUGE(max_host_weight, NeverImport)                                                 \
  COUNTER(membership_change)                                                          \
  GAUGE(membership_degraded, NeverImport)                                             \
  GAUGE(membership_excluded, NeverImport)                                             \
  GAUGE(membership_healthy, NeverImport)                                              \
  GAUGE(membership_total, NeverImport)

/**
 * All cluster loadbalancing related stats.
 */
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(lb_healthy_panic)                                                     \
  COUNTER(lb_local_cluster_not_ok)                                              \
  COUNTER(lb_recalculate_zone_structures)                                       \
  COUNTER(lb_subsets_created)                                                   \
  COUNTER(lb_subsets_fallback)                                                  \
  COUNTER(lb_subsets_fallback_panic)                                            \
  COUNTER(lb_subsets_removed)                                                   \
  COUNTER(lb_subsets_selected)                                                  \
  COUNTER(lb_zone_cluster_too_small)                                            \
  COUNTER(lb_zone_no_capacity_left)                                             \
  COUNTER(lb_zone_number_differs)                                               \
  COUNTER(lb_zone_routing_all_directly)                                         \
  COUNTER(lb_zone_routing_cross_zone)                                           \
  COUNTER(lb_zone_routing_sampled)                                              \
  GAUGE(lb_subsets_active, Accumulate)

/**
 * All cluster stats. @see stats_macros.h
 */
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(bind_errors)                                                       \
  COUNTER(original_dst_host_invalid)                                         \
  COUNTER(retry_or_shadow_abandoned)                                         \
  COUNTER(upstream_cx_close_notify)                                          \
  COUNTER(upstream_cx_connect_attempts_exceeded)                             \
  COUNTER(upstream_cx_connect_fail)                                          \
  COUNTER(upstream_cx_connect_timeout)                                       \
  COUNTER(upstream_cx_connect_with_0_rtt)                                    \
  COUNTER(upstream_cx_destroy)                                               \
  COUNTER(upstream_cx_destroy_local)                                         \
  COUNTER(upstream_cx_destroy_local_with_active_rq)                          \
  COUNTER(upstream_cx_destroy_remote)                                        \
  COUNTER(upstream_cx_destroy_remote_with_active_rq)                         \
  COUNTER(upstream_cx_destroy_with_active_rq)                                \
  COUNTER(upstream_cx_http1_total)                                           \
  COUNTER(upstream_cx_http2_total)                                           \
  COUNTER(upstream_cx_http3_total)                                           \
  COUNTER(upstream_cx_idle_timeout)                                          \
  COUNTER(upstream_cx_max_duration_reached)                                  \
  COUNTER(upstream_cx_max_requests)                                          \
  COUNTER(upstream_cx_none_healthy)                                          \
  COUNTER(upstream_cx_overflow)                                              \
  COUNTER(upstream_cx_pool_overflow)                                         \
  COUNTER(upstream_cx_protocol_error)                                        \
  COUNTER(upstream_cx_rx_bytes_total)                                        \
  COUNTER(upstream_cx_total)                                                 \
  COUNTER(upstream_cx_tx_bytes_total)                                        \
  COUNTER(upstream_flow_control_backed_up_total)                             \
  COUNTER(upstream_flow_control_drained_total)                               \
  COUNTER(upstream_flow_control_paused_reading_total)                        \
  COUNTER(upstream_flow_control_resumed_reading_total)                       \
  COUNTER(upstream_internal_redirect_failed_total)                           \
  COUNTER(upstream_internal_redirect_succeeded_total)                        \
  COUNTER(upstream_rq_cancelled)                                             \
  COUNTER(upstream_rq_completed)                                             \
  COUNTER(upstream_rq_maintenance_mode)                                      \
  COUNTER(upstream_rq_max_duration_reached)                                  \
  COUNTER(upstream_rq_pending_failure_eject)                                 \
  COUNTER(upstream_rq_pending_overflow)                                      \
  COUNTER(upstream_rq_pending_total)                                         \
  COUNTER(upstream_rq_0rtt)                                                  \
  COUNTER(upstream_rq_per_try_timeout)                                       \
  COUNTER(upstream_rq_per_try_idle_timeout)                                  \
  COUNTER(upstream_rq_retry)                                                 \
  COUNTER(upstream_rq_retry_backoff_exponential)                             \
  COUNTER(upstream_rq_retry_backoff_ratelimited)                             \
  COUNTER(upstream_rq_retry_limit_exceeded)                                  \
  COUNTER(upstream_rq_retry_overflow)                                        \
  COUNTER(upstream_rq_retry_success)                                         \
  COUNTER(upstream_rq_rx_reset)                                              \
  COUNTER(upstream_rq_timeout)                                               \
  COUNTER(upstream_rq_total)                                                 \
  COUNTER(upstream_rq_tx_reset)                                              \
  COUNTER(upstream_http3_broken)                                             \
  GAUGE(upstream_cx_active, Accumulate)                                      \
  GAUGE(upstream_cx_rx_bytes_buffered, Accumulate)                           \
  GAUGE(upstream_cx_tx_bytes_buffered, Accumulate)                           \
  GAUGE(upstream_rq_active, Accumulate)                                      \
  GAUGE(upstream_rq_pending_active, Accumulate)                              \
  HISTOGRAM(upstream_cx_connect_ms, Milliseconds)                            \
  HISTOGRAM(upstream_cx_length_ms, Milliseconds)
```
Risk Level: LOW, should be a no-behavior-change CL.
Testing: N/A; existing stats tests should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]