Skip to content

[stats]: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest"(will be renamed to upstream-stats)#23907

Merged
jmarantz merged 18 commits intoenvoyproxy:mainfrom
stevenzzzz:subgroup-cluster-lb-endpoint-stats
Nov 15, 2022
Merged

[stats]: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest"(will be renamed to upstream-stats)#23907
jmarantz merged 18 commits intoenvoyproxy:mainfrom
stevenzzzz:subgroup-cluster-lb-endpoint-stats

Conversation

@stevenzzzz
Copy link
Copy Markdown
Contributor

@stevenzzzz stevenzzzz commented Nov 8, 2022

Commit Message: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest"(will be renamed to upstream-stats)
Additional Description:
See more description and live example in #23575.

With lots of clusters and route-tables in a cloud proxy, we are seeing tons of RAM been spent on stats while most of the stats are never inc-ed due to traffic pattern(or long tail). We are thinking that we can lazy init cluster stats() so that the RAM is only allocated when it's required.

To achieve that we need to have finer grained stats group, e.g. configUpdateStats() are frequently updated by config management server, while upstream_xxx are only required when there is traffic for the cluster, for this sub-group we can save RAM by lazy init it.

This is the subgroups I plan to have for clusters.

/**

All cluster config update related stats.
/
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE, HISTOGRAM,
TEXT_READOUT, STATNAME)
COUNTER(assignment_stale)
COUNTER(assignment_timeout_received)
COUNTER(update_attempt)
COUNTER(update_empty)
COUNTER(update_failure)
COUNTER(update_no_rebuild)
COUNTER(update_success)
GAUGE(version, NeverImport)
/
*

All cluster endpoints related stats.
/
#define ALL_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT,
STATNAME)
GAUGE(max_host_weight, NeverImport)
COUNTER(membership_change)
GAUGE(membership_degraded, NeverImport)
GAUGE(membership_excluded, NeverImport)
GAUGE(membership_healthy, NeverImport)
GAUGE(membership_total, NeverImport)
/
*

All cluster loadbalancing related stats.
/
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT,
STATNAME)
COUNTER(lb_healthy_panic)
COUNTER(lb_local_cluster_not_ok)
COUNTER(lb_recalculate_zone_structures)
COUNTER(lb_subsets_created)
COUNTER(lb_subsets_fallback)
COUNTER(lb_subsets_fallback_panic)
COUNTER(lb_subsets_removed)
COUNTER(lb_subsets_selected)
COUNTER(lb_zone_cluster_too_small)
COUNTER(lb_zone_no_capacity_left)
COUNTER(lb_zone_number_differs)
COUNTER(lb_zone_routing_all_directly)
COUNTER(lb_zone_routing_cross_zone)
COUNTER(lb_zone_routing_sampled)
GAUGE(lb_subsets_active, Accumulate)
/
*

All cluster stats. https://github.com/see stats_macros.h
*/
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME)
COUNTER(bind_errors)
COUNTER(original_dst_host_invalid)
COUNTER(retry_or_shadow_abandoned)
COUNTER(upstream_cx_close_notify)
COUNTER(upstream_cx_connect_attempts_exceeded)
COUNTER(upstream_cx_connect_fail)
COUNTER(upstream_cx_connect_timeout)
COUNTER(upstream_cx_connect_with_0_rtt)
COUNTER(upstream_cx_destroy)
COUNTER(upstream_cx_destroy_local)
COUNTER(upstream_cx_destroy_local_with_active_rq)
COUNTER(upstream_cx_destroy_remote)
COUNTER(upstream_cx_destroy_remote_with_active_rq)
COUNTER(upstream_cx_destroy_with_active_rq)
COUNTER(upstream_cx_http1_total)
COUNTER(upstream_cx_http2_total)
COUNTER(upstream_cx_http3_total)
COUNTER(upstream_cx_idle_timeout)
COUNTER(upstream_cx_max_duration_reached)
COUNTER(upstream_cx_max_requests)
COUNTER(upstream_cx_none_healthy)
COUNTER(upstream_cx_overflow)
COUNTER(upstream_cx_pool_overflow)
COUNTER(upstream_cx_protocol_error)
COUNTER(upstream_cx_rx_bytes_total)
COUNTER(upstream_cx_total)
COUNTER(upstream_cx_tx_bytes_total)
COUNTER(upstream_flow_control_backed_up_total)
COUNTER(upstream_flow_control_drained_total)
COUNTER(upstream_flow_control_paused_reading_total)
COUNTER(upstream_flow_control_resumed_reading_total)
COUNTER(upstream_internal_redirect_failed_total)
COUNTER(upstream_internal_redirect_succeeded_total)
COUNTER(upstream_rq_cancelled)
COUNTER(upstream_rq_completed)
COUNTER(upstream_rq_maintenance_mode)
COUNTER(upstream_rq_max_duration_reached)
COUNTER(upstream_rq_pending_failure_eject)
COUNTER(upstream_rq_pending_overflow)
COUNTER(upstream_rq_pending_total)
COUNTER(upstream_rq_0rtt)
COUNTER(upstream_rq_per_try_timeout)
COUNTER(upstream_rq_per_try_idle_timeout)
COUNTER(upstream_rq_retry)
COUNTER(upstream_rq_retry_backoff_exponential)
COUNTER(upstream_rq_retry_backoff_ratelimited)
COUNTER(upstream_rq_retry_limit_exceeded)
COUNTER(upstream_rq_retry_overflow)
COUNTER(upstream_rq_retry_success)
COUNTER(upstream_rq_rx_reset)
COUNTER(upstream_rq_timeout)
COUNTER(upstream_rq_total)
COUNTER(upstream_rq_tx_reset)
COUNTER(upstream_http3_broken)
GAUGE(upstream_cx_active, Accumulate)
GAUGE(upstream_cx_rx_bytes_buffered, Accumulate)
GAUGE(upstream_cx_tx_bytes_buffered, Accumulate)
GAUGE(upstream_rq_active, Accumulate)
GAUGE(upstream_rq_pending_active, Accumulate)
HISTOGRAM(upstream_cx_connect_ms, Milliseconds)
HISTOGRAM(upstream_cx_length_ms, Milliseconds)

Risk Level: LOW, should be a no-behavior-change CL.

Testing: N/A existing stats test should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

…n non-config-update stats

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…-stats and "the rest"(will be renamed to upstream-stats)

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz stevenzzzz changed the title Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest"(will be renamed to upstream-stats) [stats]: Subgroup cluster stats() into lb-stats, endpoint-stats, config-update-stats and "the rest"(will be renamed to upstream-stats) Nov 8, 2022
@stevenzzzz
Copy link
Copy Markdown
Contributor Author

/assign @jmarantz

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good basically; just a few nits.

Can you merge main?

* All cluster stats. @see stats_macros.h
* All cluster config update related stats.
* See https://github.com/envoyproxy/envoy/issues/23575 for details. Stats from ClusterInfo::stats()
* will be split into subgroups "config-update", "lb", "endpoint" and "upstream", roughly based on
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should describe the current state of affairs assuming this PR is checked in, not express a plan.

AFAICT stats_ no longer exists for upstream; it's now fully partitioned so we can just document that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to rename that field in the next PR.
I think the comment here is kinda accurate, updated the "upstream" one to reflect the "current" status.

/**
* All cluster stats. @see stats_macros.h
*/
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we should rename this macro and related call-sites because this does not represent all the stats anymore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we focus on just making this one coherent? and getting it in?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about renaming this CLUSTER_TRAFFIC_STATS or similar; these all look like they are specifically to handle actual traffic, so they are obviously a good candidate to lazy-load in the next PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, you were saying you renamed that set in the lazy-init PR. I think it would make more sense in this one.

Also rename clusterStats() to clusterTrafficStats() or something; same with its StatNames object.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

virtual TransportSocketMatcher& transportSocketMatcher() const PURE;

/**
* @return ClusterConfigUpdateStats& strongly named config update stats for this cluster.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this phrase 'strongly named' pre-existed, but I'm not sure what it means.

It's possible I came up with phrase and i still don't know what it means :)

Do you? If not, can we just remove this here & below?

For ClusterLbStats maybe use 'load-balancer-related stats for this cluster'.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -326,6 +326,13 @@ class ClusterManagerImpl : public ClusterManager,
initializeSecondaryClusters(const envoy::config::bootstrap::v3::Bootstrap& bootstrap) override;

const ClusterStatNames& clusterStatNames() const override { return cluster_stat_names_; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still use ClusterStatNames anywhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure.
I saw it only used to instantiate the ClusterStats in upstream_impl.cc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment it out and see what breaks?

I'd like this PR to just be a repartitioning without adding any overlap.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh maybe a rename to clusterTrafficStatNames()

I am sufficiently myopic in my Envoy knowledge that I'm actually not sure if Clusters are used at L4, and thus clusterRequestStatNames may be inaccurate :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is required in instantiate ClusterStats in ClusterInfoImpl::generateStats

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Copy link
Copy Markdown
Contributor Author

@stevenzzzz stevenzzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! PTAL.

* All cluster stats. @see stats_macros.h
* All cluster config update related stats.
* See https://github.com/envoyproxy/envoy/issues/23575 for details. Stats from ClusterInfo::stats()
* will be split into subgroups "config-update", "lb", "endpoint" and "upstream", roughly based on
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to rename that field in the next PR.
I think the comment here is kinda accurate, updated the "upstream" one to reflect the "current" status.

virtual TransportSocketMatcher& transportSocketMatcher() const PURE;

/**
* @return ClusterConfigUpdateStats& strongly named config update stats for this cluster.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -326,6 +326,13 @@ class ClusterManagerImpl : public ClusterManager,
initializeSecondaryClusters(const envoy::config::bootstrap::v3::Bootstrap& bootstrap) override;

const ClusterStatNames& clusterStatNames() const override { return cluster_stat_names_; }
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure.
I saw it only used to instantiate the ClusterStats in upstream_impl.cc.

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Nov 9, 2022

confused about how this stacks on the other PR -- this works on its own and appears coherent, right? Also that one is a draft and this one looks close to being ready.

@stevenzzzz
Copy link
Copy Markdown
Contributor Author

@jmarantz let's commit this one and ignore the other one in the chain. (#23523)

@stevenzzzz
Copy link
Copy Markdown
Contributor Author

confused about how this stacks on the other PR -- this works on its own and appears coherent, right? Also that one is a draft and this one looks close to being ready.

there are 3 PRs: #23523, this and the lazy-init one #23921

I was thinking we merge them one after another, to make incremental/smaller PRs to the code base.

But if we can merge this one directly, I am happy to remove the first one. #23523

/**
* All cluster stats. @see stats_macros.h
*/
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about renaming this CLUSTER_TRAFFIC_STATS or similar; these all look like they are specifically to handle actual traffic, so they are obviously a good candidate to lazy-load in the next PR.

@@ -326,6 +326,13 @@ class ClusterManagerImpl : public ClusterManager,
initializeSecondaryClusters(const envoy::config::bootstrap::v3::Bootstrap& bootstrap) override;

const ClusterStatNames& clusterStatNames() const override { return cluster_stat_names_; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh maybe a rename to clusterTrafficStatNames()

I am sufficiently myopic in my Envoy knowledge that I'm actually not sure if Clusters are used at L4, and thus clusterRequestStatNames may be inaccurate :)

Stats::IsolatedStoreImpl stats_store_;
ClusterStatNames stat_names_;
ClusterStats stats_;
ClusterLbStatNames lb_stat_names_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm. I guess you didn't need to change the variable name from stat_names_ to lb_stat_names_ here. You could have just left it at stat_names_ (still changing the type. Ditto for lb_stats_. That would make this PR a lot smaller I think.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, there is a statsname for every StatStruct. So we have now 4 {Lb, Endpoint, ConfigUpdate, }StatNames definitions.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…pstream/*.cc

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…pstream/

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Copy link
Copy Markdown
Contributor Author

reverted the lb_stats** renaming under {source, test}/common/upstream/**.

PTAL

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm modulo minor rename request, plus the superflous changes to set host weight.

thank you!

/**
* All cluster stats. @see stats_macros.h
*/
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, you were saying you renamed that set in the lazy-init PR. I think it would make more sense in this one.

Also rename clusterStats() to clusterTrafficStats() or something; same with its StatNames object.

// Host weight is 1.
{
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
stats_.max_host_weight_.set(1UL);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change related?

BTW it's fine to set stats in a test. It's only in prod that I have a concern. In any case this change probably should go in a different PR, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sorta is. Now that max_host_weight is not part of stats_ we can't do this anymore.
but luckily this is also "dead" code, in that it's not necessary at all.

At some point the stat is used in lb-decision making, but the logic has changed, and per our discussion it's better that flow never depend on some stats.

// Having 3 possible weights, 1, 2, and 3 to provide the state space at least some variation
// in regards to weights, which do affect the load balancing algorithm. Cap the amount of
// weights at 3 for simplicity's sake
stats_.max_host_weight_.set(3UL);
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too.

Actually if the reason you removed this is that (a) these stats were no longer accessible from this point and so couldn't be written and (b) that had no impact on the tests, then maybe send just the deletion of these max_host_weight_set() calls into a separate PR? That should be easy to review (deleting a bunch of superfluous test lines).

One concern I might have is if setting these values wakes up some code we need to be testable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing the max_host_weight_ references in tests with PR: #23936

@jmarantz
Copy link
Copy Markdown
Contributor

/wait

Copy link
Copy Markdown
Contributor Author

@stevenzzzz stevenzzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@@ -326,6 +326,13 @@ class ClusterManagerImpl : public ClusterManager,
initializeSecondaryClusters(const envoy::config::bootstrap::v3::Bootstrap& bootstrap) override;

const ClusterStatNames& clusterStatNames() const override { return cluster_stat_names_; }
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is required in instantiate ClusterStats in ClusterInfoImpl::generateStats

// Host weight is 1.
{
EXPECT_CALL(random_, random()).WillOnce(Return(0)).WillOnce(Return(2)).WillOnce(Return(3));
stats_.max_host_weight_.set(1UL);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sorta is. Now that max_host_weight is not part of stats_ we can't do this anymore.
but luckily this is also "dead" code, in that it's not necessary at all.

At some point the stat is used in lb-decision making, but the logic has changed, and per our discussion it's better that flow never depend on some stats.

// Having 3 possible weights, 1, 2, and 3 to provide the state space at least some variation
// in regards to weights, which do affect the load balancing algorithm. Cap the amount of
// weights at 3 for simplicity's sake
stats_.max_host_weight_.set(3UL);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@stevenzzzz
Copy link
Copy Markdown
Contributor Author

/assign @zuercher
hi Zuercher, could you please take a look?

jmarantz
jmarantz previously approved these changes Nov 10, 2022
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Xin.

Stephen please let me know if you need more context or reach out to the PR author direclty.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Copy link
Copy Markdown
Contributor Author

/assign @yanavlasov

@jmarantz
Copy link
Copy Markdown
Contributor

I think we should wait for a non-google maintainer to review. If Stephan can't do it we can get a non-Google non-senior maintainer and then Yan can give final approval.

@jmarantz
Copy link
Copy Markdown
Contributor

Also PTAL @ clang-tidy failure

Copy link
Copy Markdown
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

return ClusterStats(stat_names, scope);
ClusterTrafficStats ClusterInfoImpl::generateStats(Stats::Scope& scope,
const ClusterTrafficStatNames& stat_names) {
return ClusterTrafficStats(stat_names, scope);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ClusterTrafficStats(stat_names, scope);
return {stat_names, scope};

@@ -107,8 +107,7 @@ TEST(DISABLED_LeastRequestLoadBalancerWeightTest, Weight) {
class DISABLED_SimulationTest : public testing::Test {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not your change, but we need to mollify preexisting clang-tidy

Suggested change
class DISABLED_SimulationTest : public testing::Test {
class DISABLED_SimulationTest : public testing::Test { // NOLINT(readability-identifier-naming)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Copy link
Copy Markdown
Contributor Author

@stevenzzzz stevenzzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@@ -107,8 +107,7 @@ TEST(DISABLED_LeastRequestLoadBalancerWeightTest, Weight) {
class DISABLED_SimulationTest : public testing::Test {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

/**
* All cluster loadbalancing related stats.
*/
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be further sub-divided by functionality. lb_subset, zone_aware, and then all the rest. Both of those features are unused in many configs so they could all be omitted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the zone_aware stats are the "base" for subset_lb loadbalancers, due to the following:

class EdfLoadBalancerBase : public ZoneAwareLoadBalancerBase

so the lb-related stats (subset, and zone-aware) will be created/required at the same time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think the zone-awareness isn't used in many cases, and those stats will remain zero'd (at least I think that's how it works). So if they are separate and lazy-created, I think they won't get created in many configs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our of the 5 lb-impls used in subset_lb, only RingHashLoadBalancer and MagLevLoadBalancer are NOT inherited from zoneawareLoadBalancer. But I see your point there tho. It requires some refactoring work: we'd better instantiate those stats objs from within each lb-object, which triggers even more delta.

do you mind if I do it incrementally?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incrementally sounds good. Do you want to land this as-is and do it in a subsequent PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. it'd be much cleaner.

@yanavlasov
Copy link
Copy Markdown
Contributor

/wait-any

@jmarantz jmarantz dismissed yanavlasov’s stale review November 15, 2022 14:10

comments were addressed in one of Xin's commits.

@jmarantz jmarantz assigned ggreenway and unassigned zuercher Nov 15, 2022
@jmarantz jmarantz merged commit 79165cd into envoyproxy:main Nov 15, 2022
@stevenzzzz
Copy link
Copy Markdown
Contributor Author

Thanks for working on my PR!

jmarantz pushed a commit that referenced this pull request Jun 27, 2023
Commit Message: LazyInit ClusterInfo::trafficStats(). 

Additional Description:
3rd PR for #23575, this is stack upon 2nd PR #23907

With 100K clusters, we are seeing ~1.5 GBs less RAM usage with this PR.  
Kudos to @jmarantz for the data: 


|                      | Clean client     | in-MB       | Defferred | Diff       | Diff %           | Defferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|-------------------|------------|-----------------|-----------------------------|-----------------------------|
| allocated            | 4561550208 | 4350.233276 | 2886860656        | 1674689552 | 36.71316714    | 4565167632                  | 3.44984436                  |
| heap_size            | 5303697408 | 656         | 3443523584        | 1860173824 | 35.07315144    | 5146411008                  | -150                        |
| pageheap_unmapped    | 687865856  |             | 501219328         | 186646528  | 27.13414634    | 524288000                   | -156                        |
| pageheap_free        | 22921216   | 21.859375   | 23109632          | -188416    | -00.8220157255 | 22257664                    | -0.6328125                  |
| total_thread_cache   | 1718288    | 1.638687134 | 4197032           | -2478744   | -144.2566089    | 4833576                     | 2.970970154                 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430        | 1681949728 | 36.19281645    | 4653479518                  | 5.99609375                  |


To reproduce, use this script to add 100K clusters to the bootstrap tmpl: 
```
$ cat large_bootstrap_maker.sh 
#!/bin/bash

set -u
set -e
limit="$1"

for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done


```
base.tmpl: 

```$ cat base.yaml 
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap.
```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
``` 
and modify the test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead. Then you could run the test and observe the RAM uasage.

Risk Level: Medium (changes how cluster_info::trafficStats() are created).

Testing: existing unit tests.
Docs Changes: done
Release Notes: incluided
Platform Specific Features: n/a

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
reskin89 pushed a commit to reskin89/envoy that referenced this pull request Jul 11, 2023
Commit Message: LazyInit ClusterInfo::trafficStats().

Additional Description:
3rd PR for envoyproxy#23575, this is stack upon 2nd PR envoyproxy#23907

With 100K clusters, we are seeing ~1.5 GBs less RAM usage with this PR.
Kudos to @jmarantz for the data:

|                      | Clean client     | in-MB       | Defferred | Diff       | Diff %           | Defferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|-------------------|------------|-----------------|-----------------------------|-----------------------------|
| allocated            | 4561550208 | 4350.233276 | 2886860656        | 1674689552 | 36.71316714    | 4565167632                  | 3.44984436                  |
| heap_size            | 5303697408 | 656         | 3443523584        | 1860173824 | 35.07315144    | 5146411008                  | -150                        |
| pageheap_unmapped    | 687865856  |             | 501219328         | 186646528  | 27.13414634    | 524288000                   | -156                        |
| pageheap_free        | 22921216   | 21.859375   | 23109632          | -188416    | -00.8220157255 | 22257664                    | -0.6328125                  |
| total_thread_cache   | 1718288    | 1.638687134 | 4197032           | -2478744   | -144.2566089    | 4833576                     | 2.970970154                 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430        | 1681949728 | 36.19281645    | 4653479518                  | 5.99609375                  |

To reproduce, use this script to add 100K clusters to the bootstrap tmpl:
```
$ cat large_bootstrap_maker.sh
#!/bin/bash

set -u
set -e
limit="$1"

for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done

```
base.tmpl:

```$ cat base.yaml
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap.
```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
```
and modify the test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead. Then you could run the test and observe the RAM uasage.

Risk Level: Medium (changes how cluster_info::trafficStats() are created).

Testing: existing unit tests.
Docs Changes: done
Release Notes: incluided
Platform Specific Features: n/a

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Ryan Eskin <ryan.eskin89@protonmail.com>
jbohanon added a commit to jbohanon/data-plane-api that referenced this pull request Aug 7, 2023
Commit Message: LazyInit ClusterInfo::trafficStats().

Additional Description:
3rd PR for envoyproxy/envoy#23575, this is stack upon 2nd PR envoyproxy/envoy#23907

With 100K clusters, we are seeing ~1.5 GBs less RAM usage with this PR.
Kudos to @jmarantz for the data:

|                      | Clean client     | in-MB       | Defferred | Diff       | Diff %           | Defferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|-------------------|------------|-----------------|-----------------------------|-----------------------------|
| allocated            | 4561550208 | 4350.233276 | 2886860656        | 1674689552 | 36.71316714    | 4565167632                  | 3.44984436                  |
| heap_size            | 5303697408 | 656         | 3443523584        | 1860173824 | 35.07315144    | 5146411008                  | -150                        |
| pageheap_unmapped    | 687865856  |             | 501219328         | 186646528  | 27.13414634    | 524288000                   | -156                        |
| pageheap_free        | 22921216   | 21.859375   | 23109632          | -188416    | -00.8220157255 | 22257664                    | -0.6328125                  |
| total_thread_cache   | 1718288    | 1.638687134 | 4197032           | -2478744   | -144.2566089    | 4833576                     | 2.970970154                 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430        | 1681949728 | 36.19281645    | 4653479518                  | 5.99609375                  |

To reproduce, use this script to add 100K clusters to the bootstrap tmpl:
```
$ cat large_bootstrap_maker.sh
#!/bin/bash

set -u
set -e
limit="$1"

for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done

```
base.tmpl:

```$ cat base.yaml
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap.
```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
```
and modify the test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead. Then you could run the test and observe the RAM uasage.

Risk Level: Medium (changes how cluster_info::trafficStats() are created).

Testing: existing unit tests.
Docs Changes: done
Release Notes: incluided
Platform Specific Features: n/a

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

Mirrored from https://github.com/envoyproxy/envoy @ 23d6164e4405c9389fb1f6403f274942e9dd2d9e
update-envoy bot added a commit to envoyproxy/data-plane-api that referenced this pull request Sep 1, 2023
Commit Message: LazyInit ClusterInfo::trafficStats().

Additional Description:
3rd PR for envoyproxy/envoy#23575, this is stack upon 2nd PR envoyproxy/envoy#23907

With 100K clusters, we are seeing ~1.5 GBs less RAM usage with this PR.
Kudos to @jmarantz for the data:

|                      | Clean client     | in-MB       | Defferred | Diff       | Diff %           | Defferred-Fulfilled | Diff-in-MB (fulfilled - Clean) |
|----------------------|------------|-------------|-------------------|------------|-----------------|-----------------------------|-----------------------------|
| allocated            | 4561550208 | 4350.233276 | 2886860656        | 1674689552 | 36.71316714    | 4565167632                  | 3.44984436                  |
| heap_size            | 5303697408 | 656         | 3443523584        | 1860173824 | 35.07315144    | 5146411008                  | -150                        |
| pageheap_unmapped    | 687865856  |             | 501219328         | 186646528  | 27.13414634    | 524288000                   | -156                        |
| pageheap_free        | 22921216   | 21.859375   | 23109632          | -188416    | -00.8220157255 | 22257664                    | -0.6328125                  |
| total_thread_cache   | 1718288    | 1.638687134 | 4197032           | -2478744   | -144.2566089    | 4833576                     | 2.970970154                 |
| total_physical_bytes | 4647192158 | 4431.907804 | 2965242430        | 1681949728 | 36.19281645    | 4653479518                  | 5.99609375                  |

To reproduce, use this script to add 100K clusters to the bootstrap tmpl:
```
$ cat large_bootstrap_maker.sh
#!/bin/bash

set -u
set -e
limit="$1"

for i in $(seq 1 $limit); do
  service="zzzzzzz-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_$i"
  echo "  - name: $service"
  echo "    connect_timeout: 0.25s"
  echo "    type: LOGICAL_DNS"
  echo "    dns_lookup_family: \"v6_only\""
  echo "    lb_policy: ROUND_ROBIN"
  echo "    load_assignment:"
  echo "      cluster_name: $service"
  echo "      endpoints:"
  echo "      - lb_endpoints:"
  echo "        - endpoint:"
  echo "            address: {socket_address: {address: google.com, port_value: 443}}"
done

```
base.tmpl:

```$ cat base.yaml
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: "/dev/null"
  address:
    socket_address:
      address: "::"
      port_value: 8080
layered_runtime:
  layers:
  - name: admin
    admin_layer: {}

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: "::"
        port_value: 0
    filter_chains:
    - filters:
      - name: http
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {host_rewrite_literal: 127.0.0.1, cluster: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy-wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww}
  clusters:
```

Run the following commands to generate the bootstrap.
```
bash large_bootstrap_maker.sh 100000 >> base.yaml
mv base.yaml test/config/integration/100k_clusters.yaml
```
and modify the test/config/main_common_test.cc to load from test/config/integration/100k_clusters.yaml instead. Then you could run the test and observe the RAM uasage.

Risk Level: Medium (changes how cluster_info::trafficStats() are created).

Testing: existing unit tests.
Docs Changes: done
Release Notes: incluided
Platform Specific Features: n/a

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

Mirrored from https://github.com/envoyproxy/envoy @ 23d6164e4405c9389fb1f6403f274942e9dd2d9e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants