Subgroup cluster stats so that later we can lazy init some of the subgroups to save RAM/GCU. #23523
stevenzzzz wants to merge 5 commits into envoyproxy:main
Conversation
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
/assign @jmarantz
jmarantz
left a comment
lgtm at high level modulo that you haven't made the traffic-only ones lazy yet and you need to add some comments showing why the partition exists.
/**
 * All cluster stats. @see stats_macros.h
 * All cluster config update related stats.
I think some comments here about the rationale for the partitioning would help.
In particular, what I think you mean to have here is a block of stats you expect to be incremented for clusters that see zero traffic.
Whereas ALL_CLUSTER_STATS (which should be re-named as that's now a lie :) are stats that you expect to only be incremented when a cluster sees traffic. Is that right?
Sounds good. I will add more details in the comment.
@@ -850,6 +851,7 @@ class ClusterInfoImpl : public ClusterInfo,
  TransportSocketMatcherPtr socket_matcher_;
  Stats::ScopeSharedPtr stats_scope_;
  mutable ClusterStats stats_;
so IIRC the game-plan would be to convert this to Thread::AtomicPtr<ClusterStats> stats_ right? And then stats() above would be implemented as stats_.get() ?
Right, more PRs to come.
The idea is that we only instantiate a sub-group when an individual field in it is accessed (ideally, only when it's incremented).
I would put the AtomicPtr in this PR, otherwise this PR actually increases memory usage, even in the case where some clusters get no traffic. It will definitely increase memory usage (by a small amount) when all clusters get traffic. But those situations probably don't have a huge number of clusters.
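The `Thread::AtomicPtr<ClusterStats>` idea discussed here can be sketched roughly as below. This is a minimal stand-alone illustration using `std::atomic` with double-checked initialization; the `ClusterTrafficStats` and `LazyTrafficStats` names are hypothetical stand-ins, not the actual types from this PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the traffic-only stats subgroup; the real
// struct would be generated from the stat macros.
struct ClusterTrafficStats {
  std::uint64_t upstream_rq_total_{0};
};

// Minimal sketch of a lazily-initialized stats holder: the struct is only
// allocated on first access, so a cluster that never sees traffic never
// pays the memory cost of its traffic stats.
class LazyTrafficStats {
public:
  ClusterTrafficStats& get() {
    ClusterTrafficStats* s = stats_.load(std::memory_order_acquire);
    if (s == nullptr) {
      // Double-checked locking: the mutex serializes initialization while
      // the fast path above stays lock-free after the first access.
      std::lock_guard<std::mutex> lock(init_mutex_);
      s = stats_.load(std::memory_order_relaxed);
      if (s == nullptr) {
        s = new ClusterTrafficStats();
        stats_.store(s, std::memory_order_release);
      }
    }
    return *s;
  }

  bool initialized() const {
    return stats_.load(std::memory_order_acquire) != nullptr;
  }

  ~LazyTrafficStats() { delete stats_.load(); }

private:
  std::atomic<ClusterTrafficStats*> stats_{nullptr};
  std::mutex init_mutex_;
};
```

With this shape, `stats()` on the cluster would forward to `get()`, matching the `stats_.get()` suggestion above.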
I plan to have another "refactoring" PR before actually introducing the "lazy-init" struct, to hopefully make the PRs easier to read.
Let me know if you would prefer a bigger PR. :P
stevenzzzz
left a comment
Sorry for the late response, PTAL!
Closing this one as #23907 is a superset of it.
Commit Message: split the per-cluster stats into subgroups, so that later we can lazy-init them to save RAM.
Additional Description:
See more description and live example in #23575.
With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM being spent on stats, while most of the stats are never incremented due to the traffic pattern (or long tail). We are thinking that we can lazy-init cluster stats() so that the RAM is only allocated when it's required.
To achieve that we need finer-grained stats groups: e.g. configUpdateStats() are frequently updated by the config management server, while the upstream_xxx stats are only required when there is traffic for the cluster, so for that sub-group we can save RAM by lazy-initializing it.
These are the subgroups I plan to have for clusters:
/**
 * All cluster config update related stats. @see stats_macros.h
 */
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE, HISTOGRAM,             \
                                        TEXT_READOUT, STATNAME)                \
  COUNTER(assignment_stale)                                                    \
  COUNTER(assignment_timeout_received)                                         \
  COUNTER(update_attempt)                                                      \
  COUNTER(update_empty)                                                        \
  COUNTER(update_failure)                                                      \
  COUNTER(update_no_rebuild)                                                   \
  COUNTER(update_success)                                                      \
  GAUGE(version, NeverImport)
/**
 * All cluster endpoint membership related stats. @see stats_macros.h
 */
#define ALL_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT,    \
                                   STATNAME)                                   \
  GAUGE(max_host_weight, NeverImport)                                          \
  COUNTER(membership_change)                                                   \
  GAUGE(membership_degraded, NeverImport)                                      \
  GAUGE(membership_excluded, NeverImport)                                      \
  GAUGE(membership_healthy, NeverImport)                                       \
  GAUGE(membership_total, NeverImport)
/**
 * All cluster load balancing related stats. @see stats_macros.h
 */
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT,          \
                             STATNAME)                                         \
  COUNTER(lb_healthy_panic)                                                    \
  COUNTER(lb_local_cluster_not_ok)                                             \
  COUNTER(lb_recalculate_zone_structures)                                      \
  COUNTER(lb_subsets_created)                                                  \
  COUNTER(lb_subsets_fallback)                                                 \
  COUNTER(lb_subsets_fallback_panic)                                           \
  COUNTER(lb_subsets_removed)                                                  \
  COUNTER(lb_subsets_selected)                                                 \
  COUNTER(lb_zone_cluster_too_small)                                           \
  COUNTER(lb_zone_no_capacity_left)                                            \
  COUNTER(lb_zone_number_differs)                                              \
  COUNTER(lb_zone_routing_all_directly)                                        \
  COUNTER(lb_zone_routing_cross_zone)                                          \
  COUNTER(lb_zone_routing_sampled)                                             \
  GAUGE(lb_subsets_active, Accumulate)
/**
 * All cluster traffic related stats. These are only expected to be
 * incremented when the cluster sees traffic, so they are candidates for
 * lazy initialization. @see stats_macros.h
 */
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME)   \
  COUNTER(bind_errors)                                                         \
  COUNTER(original_dst_host_invalid)                                           \
  COUNTER(retry_or_shadow_abandoned)                                           \
  COUNTER(upstream_cx_close_notify)                                            \
  COUNTER(upstream_cx_connect_attempts_exceeded)                               \
  COUNTER(upstream_cx_connect_fail)                                            \
  COUNTER(upstream_cx_connect_timeout)                                         \
  COUNTER(upstream_cx_connect_with_0_rtt)                                      \
  COUNTER(upstream_cx_destroy)                                                 \
  COUNTER(upstream_cx_destroy_local)                                           \
  COUNTER(upstream_cx_destroy_local_with_active_rq)                            \
  COUNTER(upstream_cx_destroy_remote)                                          \
  COUNTER(upstream_cx_destroy_remote_with_active_rq)                           \
  COUNTER(upstream_cx_destroy_with_active_rq)                                  \
  COUNTER(upstream_cx_http1_total)                                             \
  COUNTER(upstream_cx_http2_total)                                             \
  COUNTER(upstream_cx_http3_total)                                             \
  COUNTER(upstream_cx_idle_timeout)                                            \
  COUNTER(upstream_cx_max_duration_reached)                                    \
  COUNTER(upstream_cx_max_requests)                                            \
  COUNTER(upstream_cx_none_healthy)                                            \
  COUNTER(upstream_cx_overflow)                                                \
  COUNTER(upstream_cx_pool_overflow)                                           \
  COUNTER(upstream_cx_protocol_error)                                          \
  COUNTER(upstream_cx_rx_bytes_total)                                          \
  COUNTER(upstream_cx_total)                                                   \
  COUNTER(upstream_cx_tx_bytes_total)                                          \
  COUNTER(upstream_flow_control_backed_up_total)                               \
  COUNTER(upstream_flow_control_drained_total)                                 \
  COUNTER(upstream_flow_control_paused_reading_total)                          \
  COUNTER(upstream_flow_control_resumed_reading_total)                         \
  COUNTER(upstream_internal_redirect_failed_total)                             \
  COUNTER(upstream_internal_redirect_succeeded_total)                          \
  COUNTER(upstream_rq_cancelled)                                               \
  COUNTER(upstream_rq_completed)                                               \
  COUNTER(upstream_rq_maintenance_mode)                                        \
  COUNTER(upstream_rq_max_duration_reached)                                    \
  COUNTER(upstream_rq_pending_failure_eject)                                   \
  COUNTER(upstream_rq_pending_overflow)                                        \
  COUNTER(upstream_rq_pending_total)                                           \
  COUNTER(upstream_rq_0rtt)                                                    \
  COUNTER(upstream_rq_per_try_timeout)                                         \
  COUNTER(upstream_rq_per_try_idle_timeout)                                    \
  COUNTER(upstream_rq_retry)                                                   \
  COUNTER(upstream_rq_retry_backoff_exponential)                               \
  COUNTER(upstream_rq_retry_backoff_ratelimited)                               \
  COUNTER(upstream_rq_retry_limit_exceeded)                                    \
  COUNTER(upstream_rq_retry_overflow)                                          \
  COUNTER(upstream_rq_retry_success)                                           \
  COUNTER(upstream_rq_rx_reset)                                                \
  COUNTER(upstream_rq_timeout)                                                 \
  COUNTER(upstream_rq_total)                                                   \
  COUNTER(upstream_rq_tx_reset)                                                \
  COUNTER(upstream_http3_broken)                                               \
  GAUGE(upstream_cx_active, Accumulate)                                        \
  GAUGE(upstream_cx_rx_bytes_buffered, Accumulate)                             \
  GAUGE(upstream_cx_tx_bytes_buffered, Accumulate)                             \
  GAUGE(upstream_rq_active, Accumulate)                                        \
  GAUGE(upstream_rq_pending_active, Accumulate)                                \
  HISTOGRAM(upstream_cx_connect_ms, Milliseconds)                              \
  HISTOGRAM(upstream_cx_length_ms, Milliseconds)
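For readers unfamiliar with the pattern: such stat lists expand into structs of stat members via X-macros. Below is a minimal stand-alone sketch of the idea with a trimmed-down version of the config-update subgroup; the `Counter`/`Gauge` types and `GENERATE_*_MEMBER` macros here are simplified stand-ins for illustration, not Envoy's real `stats_macros.h` machinery.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the real stat types, which live in the Stats
// namespace and are created through a Stats::Scope.
struct Counter {
  std::uint64_t value_{0};
  void inc() { ++value_; }
};
struct Gauge {
  std::uint64_t value_{0};
};

// A trimmed-down stat list in the same X-macro style as the subgroups
// above: the same list can expand into struct members here, and into
// stat-name registration elsewhere.
#define DEMO_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE)                       \
  COUNTER(update_attempt)                                                      \
  COUNTER(update_success)                                                      \
  GAUGE(version)

// Expansion callbacks: each list entry becomes one struct member.
#define GENERATE_COUNTER_MEMBER(name) Counter name##_;
#define GENERATE_GAUGE_MEMBER(name) Gauge name##_;

struct ClusterConfigUpdateStats {
  DEMO_CLUSTER_CONFIG_UPDATE_STATS(GENERATE_COUNTER_MEMBER,
                                   GENERATE_GAUGE_MEMBER)
};
```

Because each subgroup is its own struct, a holder like the proposed `Thread::AtomicPtr<ClusterStats>` can defer allocating the traffic-only struct until first use, while the config-update struct stays eagerly initialized.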
Risk Level: LOW, should be a no-behavior-change CL.
Testing: N/A; existing stats tests should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]