Subgroup cluster stats so that later we can lazy init some of the subgroups to save RAM/GCU.#23523

Closed
stevenzzzz wants to merge 5 commits into envoyproxy:main from stevenzzzz:subgroup-cluster-stats

Conversation

@stevenzzzz
Contributor

@stevenzzzz stevenzzzz commented Oct 17, 2022

Commit Message: split the per-cluster stats into subgroups, so that later we can lazy-init them to save RAM.

Additional Description:
See more description and live example in #23575.

With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM being spent on stats, while most of the stats are never incremented due to traffic patterns (the long tail). We are thinking that we can lazy-init cluster stats so that the RAM is only allocated when it's required.

To achieve that, we need finer-grained stat groups. For example, the config-update stats are frequently updated by the config management server, while the upstream_xxx stats are only needed when there is traffic for the cluster; for that sub-group we can save RAM by lazily initializing it.
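A minimal sketch of that split, with illustrative stand-in types (these are not Envoy's actual Stats classes; the struct and member names here are hypothetical):

```cpp
#include <memory>

// Illustrative stand-in for an Envoy Stats counter.
struct Counter {
  long value = 0;
  void inc() { ++value; }
};

// Eager sub-group: touched on every config push, even with zero traffic.
struct ClusterConfigUpdateStats {
  Counter update_attempt;
  Counter update_success;
};

// Traffic-only sub-group: only needed once the cluster serves requests,
// so it is a candidate for lazy allocation.
struct ClusterTrafficStats {
  Counter upstream_rq_total;
};

struct ClusterInfoSketch {
  ClusterConfigUpdateStats config_update_stats;        // always allocated
  std::unique_ptr<ClusterTrafficStats> traffic_stats;  // allocated on demand

  // Get-or-create accessor: RAM for the traffic sub-group is only
  // spent when some code path actually asks for it.
  ClusterTrafficStats& trafficStats() {
    if (traffic_stats == nullptr) {
      traffic_stats = std::make_unique<ClusterTrafficStats>();
    }
    return *traffic_stats;
  }
};
```

A cluster that only ever receives config updates never allocates `ClusterTrafficStats` in this sketch; thread safety is deliberately ignored here and is discussed further down in the review.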

These are the subgroups I plan to have for clusters.

/**
 * All cluster config update related stats.
 */
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(assignment_stale) \
  COUNTER(assignment_timeout_received) \
  COUNTER(update_attempt) \
  COUNTER(update_empty) \
  COUNTER(update_failure) \
  COUNTER(update_no_rebuild) \
  COUNTER(update_success) \
  GAUGE(version, NeverImport)

/**
 * All cluster endpoints related stats.
 */
#define ALL_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  GAUGE(max_host_weight, NeverImport) \
  COUNTER(membership_change) \
  GAUGE(membership_degraded, NeverImport) \
  GAUGE(membership_excluded, NeverImport) \
  GAUGE(membership_healthy, NeverImport) \
  GAUGE(membership_total, NeverImport)

/**
 * All cluster loadbalancing related stats.
 */
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(lb_healthy_panic) \
  COUNTER(lb_local_cluster_not_ok) \
  COUNTER(lb_recalculate_zone_structures) \
  COUNTER(lb_subsets_created) \
  COUNTER(lb_subsets_fallback) \
  COUNTER(lb_subsets_fallback_panic) \
  COUNTER(lb_subsets_removed) \
  COUNTER(lb_subsets_selected) \
  COUNTER(lb_zone_cluster_too_small) \
  COUNTER(lb_zone_no_capacity_left) \
  COUNTER(lb_zone_number_differs) \
  COUNTER(lb_zone_routing_all_directly) \
  COUNTER(lb_zone_routing_cross_zone) \
  COUNTER(lb_zone_routing_sampled) \
  GAUGE(lb_subsets_active, Accumulate)

/**
 * All cluster stats. @see stats_macros.h
 */
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(bind_errors) \
  COUNTER(original_dst_host_invalid) \
  COUNTER(retry_or_shadow_abandoned) \
  COUNTER(upstream_cx_close_notify) \
  COUNTER(upstream_cx_connect_attempts_exceeded) \
  COUNTER(upstream_cx_connect_fail) \
  COUNTER(upstream_cx_connect_timeout) \
  COUNTER(upstream_cx_connect_with_0_rtt) \
  COUNTER(upstream_cx_destroy) \
  COUNTER(upstream_cx_destroy_local) \
  COUNTER(upstream_cx_destroy_local_with_active_rq) \
  COUNTER(upstream_cx_destroy_remote) \
  COUNTER(upstream_cx_destroy_remote_with_active_rq) \
  COUNTER(upstream_cx_destroy_with_active_rq) \
  COUNTER(upstream_cx_http1_total) \
  COUNTER(upstream_cx_http2_total) \
  COUNTER(upstream_cx_http3_total) \
  COUNTER(upstream_cx_idle_timeout) \
  COUNTER(upstream_cx_max_duration_reached) \
  COUNTER(upstream_cx_max_requests) \
  COUNTER(upstream_cx_none_healthy) \
  COUNTER(upstream_cx_overflow) \
  COUNTER(upstream_cx_pool_overflow) \
  COUNTER(upstream_cx_protocol_error) \
  COUNTER(upstream_cx_rx_bytes_total) \
  COUNTER(upstream_cx_total) \
  COUNTER(upstream_cx_tx_bytes_total) \
  COUNTER(upstream_flow_control_backed_up_total) \
  COUNTER(upstream_flow_control_drained_total) \
  COUNTER(upstream_flow_control_paused_reading_total) \
  COUNTER(upstream_flow_control_resumed_reading_total) \
  COUNTER(upstream_internal_redirect_failed_total) \
  COUNTER(upstream_internal_redirect_succeeded_total) \
  COUNTER(upstream_rq_cancelled) \
  COUNTER(upstream_rq_completed) \
  COUNTER(upstream_rq_maintenance_mode) \
  COUNTER(upstream_rq_max_duration_reached) \
  COUNTER(upstream_rq_pending_failure_eject) \
  COUNTER(upstream_rq_pending_overflow) \
  COUNTER(upstream_rq_pending_total) \
  COUNTER(upstream_rq_0rtt) \
  COUNTER(upstream_rq_per_try_timeout) \
  COUNTER(upstream_rq_per_try_idle_timeout) \
  COUNTER(upstream_rq_retry) \
  COUNTER(upstream_rq_retry_backoff_exponential) \
  COUNTER(upstream_rq_retry_backoff_ratelimited) \
  COUNTER(upstream_rq_retry_limit_exceeded) \
  COUNTER(upstream_rq_retry_overflow) \
  COUNTER(upstream_rq_retry_success) \
  COUNTER(upstream_rq_rx_reset) \
  COUNTER(upstream_rq_timeout) \
  COUNTER(upstream_rq_total) \
  COUNTER(upstream_rq_tx_reset) \
  COUNTER(upstream_http3_broken) \
  GAUGE(upstream_cx_active, Accumulate) \
  GAUGE(upstream_cx_rx_bytes_buffered, Accumulate) \
  GAUGE(upstream_cx_tx_bytes_buffered, Accumulate) \
  GAUGE(upstream_rq_active, Accumulate) \
  GAUGE(upstream_rq_pending_active, Accumulate) \
  HISTOGRAM(upstream_cx_connect_ms, Milliseconds) \
  HISTOGRAM(upstream_cx_length_ms, Milliseconds)
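These lists follow the X-macro pattern used by Envoy's stats macros: each list is expanded with per-type generator macros to declare one struct member per stat. A trimmed-down illustration, with hypothetical generator names and stand-in stat types (not Envoy's real `stats_macros.h` generators):

```cpp
// A trimmed-down stat list in the same X-macro style as the lists above.
#define EXAMPLE_CLUSTER_STATS(COUNTER, GAUGE) \
  COUNTER(update_attempt) \
  COUNTER(update_success) \
  GAUGE(version)

// Illustrative stand-ins for Envoy's stat types.
struct Counter {
  long value = 0;
  void inc() { ++value; }
};
struct Gauge {
  long value = 0;
  void set(long v) { value = v; }
};

// Generator macros turn each list entry into a struct member.
#define GENERATE_COUNTER_MEMBER(name) Counter name##_;
#define GENERATE_GAUGE_MEMBER(name) Gauge name##_;

// Expands to: Counter update_attempt_; Counter update_success_; Gauge version_;
struct ExampleClusterStats {
  EXAMPLE_CLUSTER_STATS(GENERATE_COUNTER_MEMBER, GENERATE_GAUGE_MEMBER)
};
```

Because each subgroup above is its own list, each can be expanded into its own struct, and each struct can then be instantiated eagerly or lazily on its own schedule.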

Risk Level: LOW; this should be a no-behavior-change CL.
Testing: N/A; the existing stats tests should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:

…n non-config-update stats

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz stevenzzzz marked this pull request as draft October 17, 2022 16:44
@stevenzzzz
Contributor Author

/assign @jmarantz

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Contributor

@jmarantz jmarantz left a comment


lgtm at high level modulo that you haven't made the traffic-only ones lazy yet and you need to add some comments showing why the partition exists.


/**
* All cluster stats. @see stats_macros.h
* All cluster config update related stats.
Contributor


I think some comments here about the rationale for the partitioning would help.

In particular, what I think you mean to have here is a block of stats you expect to be incremented for clusters that see zero traffic.

Whereas ALL_CLUSTER_STATS (which should be re-named as that's now a lie :) are stats that you expect to only be incremented when a cluster sees traffic. Is that right?

Contributor Author


Sounds good. I will add more details in comments.

@@ -850,6 +851,7 @@ class ClusterInfoImpl : public ClusterInfo,
TransportSocketMatcherPtr socket_matcher_;
Stats::ScopeSharedPtr stats_scope_;
mutable ClusterStats stats_;
Contributor


so IIRC the game-plan would be to convert this to Thread::AtomicPtr<ClusterStats> stats_ right? And then stats() above would be implemented as stats_.get() ?

Contributor Author


right. more PRs to come.
The idea is that we only instantiate a sub-group when an individual field in it is accessed (ideally, only when it's incremented).
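The get-or-create accessor discussed in this thread could look roughly like the sketch below, using `std::atomic` and a mutex in place of Envoy's `Thread::AtomicPtr` (all names here are illustrative, not the PR's actual code):

```cpp
#include <atomic>
#include <memory>
#include <mutex>

// Stand-in for the traffic-related stats block.
struct ClusterStatsSketch {
  long upstream_rq_total = 0;
};

class LazyStats {
public:
  // Double-checked get-or-create: the fast path is a single atomic load,
  // so clusters that never see traffic never allocate the block.
  ClusterStatsSketch& get() {
    ClusterStatsSketch* s = ptr_.load(std::memory_order_acquire);
    if (s == nullptr) {
      std::lock_guard<std::mutex> lock(init_mutex_);
      s = ptr_.load(std::memory_order_relaxed);
      if (s == nullptr) {
        owned_ = std::make_unique<ClusterStatsSketch>();
        s = owned_.get();
        ptr_.store(s, std::memory_order_release);
      }
    }
    return *s;
  }

  bool allocated() const { return ptr_.load(std::memory_order_acquire) != nullptr; }

private:
  std::atomic<ClusterStatsSketch*> ptr_{nullptr};
  std::mutex init_mutex_;
  std::unique_ptr<ClusterStatsSketch> owned_; // keeps the block alive
};
```

With this shape, a `stats()` method on `ClusterInfoImpl` would just forward to `get()`, matching the `stats_.get()` idea above.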

Contributor


I would put the AtomicPtr in this PR, otherwise this PR actually increases memory usage, even in the case where some clusters get no traffic. It will definitely increase memory usage (by a small amount) when all clusters get traffic. But those situations probably don't have a huge number of clusters.

Contributor Author


I plan to have another "refactoring" PR before actually introducing the "lazy-init" struct, to hopefully make the PRs easier to read.
Let me know if you would prefer a bigger PR. :P

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Contributor Author

@stevenzzzz stevenzzzz left a comment


Sorry for the late response, PTAL!



Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Contributor Author

Closing this one as #23907 is a superset of it.

@stevenzzzz stevenzzzz closed this Nov 14, 2022