Subgroup cluster stats so that later we can lazy init some of the subgroups to save RAM/GCU.#23523

Closed
stevenzzzz wants to merge 5 commits into envoyproxy:main from stevenzzzz:subgroup-cluster-stats

Conversation

@stevenzzzz
Contributor

@stevenzzzz stevenzzzz commented Oct 17, 2022

Commit Message: split the per-cluster stats into subgroups, so that later we can lazy-init them to save RAM.

Additional Description:
See more description and live example in #23575.

With lots of clusters and route tables in a cloud proxy, we are seeing tons of RAM being spent on stats, while most of the stats are never incremented due to traffic patterns (the long tail). We are thinking that we can lazy-init cluster stats so that the RAM is only allocated when it's required.

To achieve that, we need finer-grained stat groups. For example, the config-update stats are frequently updated by the config management server, while the upstream_xxx stats are only needed when there is traffic for the cluster; for that sub-group we can save RAM by lazily initializing it.
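A minimal sketch of that split, with illustrative stand-in types (these are not Envoy's actual Stats classes; the struct and member names here are hypothetical):

```cpp
#include <memory>

// Illustrative stand-in for an Envoy Stats counter.
struct Counter {
  long value = 0;
  void inc() { ++value; }
};

// Eager sub-group: touched on every config push, even with zero traffic.
struct ClusterConfigUpdateStats {
  Counter update_attempt;
  Counter update_success;
};

// Traffic-only sub-group: only needed once the cluster serves requests,
// so it is a candidate for lazy allocation.
struct ClusterTrafficStats {
  Counter upstream_rq_total;
};

struct ClusterInfoSketch {
  ClusterConfigUpdateStats config_update_stats;        // always allocated
  std::unique_ptr<ClusterTrafficStats> traffic_stats;  // allocated on demand

  // Get-or-create accessor: RAM for the traffic sub-group is only
  // spent when some code path actually asks for it.
  ClusterTrafficStats& trafficStats() {
    if (traffic_stats == nullptr) {
      traffic_stats = std::make_unique<ClusterTrafficStats>();
    }
    return *traffic_stats;
  }
};
```

A cluster that only ever receives config updates never allocates `ClusterTrafficStats` in this sketch; thread safety is deliberately ignored here and is discussed further down in the review.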

These are the subgroups I plan to have for clusters.

/**
 * All cluster config update related stats.
 */
#define ALL_CLUSTER_CONFIG_UPDATE_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(assignment_stale) \
  COUNTER(assignment_timeout_received) \
  COUNTER(update_attempt) \
  COUNTER(update_empty) \
  COUNTER(update_failure) \
  COUNTER(update_no_rebuild) \
  COUNTER(update_success) \
  GAUGE(version, NeverImport)

/**
 * All cluster endpoints related stats.
 */
#define ALL_CLUSTER_ENDPOINT_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  GAUGE(max_host_weight, NeverImport) \
  COUNTER(membership_change) \
  GAUGE(membership_degraded, NeverImport) \
  GAUGE(membership_excluded, NeverImport) \
  GAUGE(membership_healthy, NeverImport) \
  GAUGE(membership_total, NeverImport)

/**
 * All cluster loadbalancing related stats.
 */
#define ALL_CLUSTER_LB_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(lb_healthy_panic) \
  COUNTER(lb_local_cluster_not_ok) \
  COUNTER(lb_recalculate_zone_structures) \
  COUNTER(lb_subsets_created) \
  COUNTER(lb_subsets_fallback) \
  COUNTER(lb_subsets_fallback_panic) \
  COUNTER(lb_subsets_removed) \
  COUNTER(lb_subsets_selected) \
  COUNTER(lb_zone_cluster_too_small) \
  COUNTER(lb_zone_no_capacity_left) \
  COUNTER(lb_zone_number_differs) \
  COUNTER(lb_zone_routing_all_directly) \
  COUNTER(lb_zone_routing_cross_zone) \
  COUNTER(lb_zone_routing_sampled) \
  GAUGE(lb_subsets_active, Accumulate)

/**
 * All cluster stats. @see stats_macros.h
 */
#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM, TEXT_READOUT, STATNAME) \
  COUNTER(bind_errors) \
  COUNTER(original_dst_host_invalid) \
  COUNTER(retry_or_shadow_abandoned) \
  COUNTER(upstream_cx_close_notify) \
  COUNTER(upstream_cx_connect_attempts_exceeded) \
  COUNTER(upstream_cx_connect_fail) \
  COUNTER(upstream_cx_connect_timeout) \
  COUNTER(upstream_cx_connect_with_0_rtt) \
  COUNTER(upstream_cx_destroy) \
  COUNTER(upstream_cx_destroy_local) \
  COUNTER(upstream_cx_destroy_local_with_active_rq) \
  COUNTER(upstream_cx_destroy_remote) \
  COUNTER(upstream_cx_destroy_remote_with_active_rq) \
  COUNTER(upstream_cx_destroy_with_active_rq) \
  COUNTER(upstream_cx_http1_total) \
  COUNTER(upstream_cx_http2_total) \
  COUNTER(upstream_cx_http3_total) \
  COUNTER(upstream_cx_idle_timeout) \
  COUNTER(upstream_cx_max_duration_reached) \
  COUNTER(upstream_cx_max_requests) \
  COUNTER(upstream_cx_none_healthy) \
  COUNTER(upstream_cx_overflow) \
  COUNTER(upstream_cx_pool_overflow) \
  COUNTER(upstream_cx_protocol_error) \
  COUNTER(upstream_cx_rx_bytes_total) \
  COUNTER(upstream_cx_total) \
  COUNTER(upstream_cx_tx_bytes_total) \
  COUNTER(upstream_flow_control_backed_up_total) \
  COUNTER(upstream_flow_control_drained_total) \
  COUNTER(upstream_flow_control_paused_reading_total) \
  COUNTER(upstream_flow_control_resumed_reading_total) \
  COUNTER(upstream_internal_redirect_failed_total) \
  COUNTER(upstream_internal_redirect_succeeded_total) \
  COUNTER(upstream_rq_cancelled) \
  COUNTER(upstream_rq_completed) \
  COUNTER(upstream_rq_maintenance_mode) \
  COUNTER(upstream_rq_max_duration_reached) \
  COUNTER(upstream_rq_pending_failure_eject) \
  COUNTER(upstream_rq_pending_overflow) \
  COUNTER(upstream_rq_pending_total) \
  COUNTER(upstream_rq_0rtt) \
  COUNTER(upstream_rq_per_try_timeout) \
  COUNTER(upstream_rq_per_try_idle_timeout) \
  COUNTER(upstream_rq_retry) \
  COUNTER(upstream_rq_retry_backoff_exponential) \
  COUNTER(upstream_rq_retry_backoff_ratelimited) \
  COUNTER(upstream_rq_retry_limit_exceeded) \
  COUNTER(upstream_rq_retry_overflow) \
  COUNTER(upstream_rq_retry_success) \
  COUNTER(upstream_rq_rx_reset) \
  COUNTER(upstream_rq_timeout) \
  COUNTER(upstream_rq_total) \
  COUNTER(upstream_rq_tx_reset) \
  COUNTER(upstream_http3_broken) \
  GAUGE(upstream_cx_active, Accumulate) \
  GAUGE(upstream_cx_rx_bytes_buffered, Accumulate) \
  GAUGE(upstream_cx_tx_bytes_buffered, Accumulate) \
  GAUGE(upstream_rq_active, Accumulate) \
  GAUGE(upstream_rq_pending_active, Accumulate) \
  HISTOGRAM(upstream_cx_connect_ms, Milliseconds) \
  HISTOGRAM(upstream_cx_length_ms, Milliseconds)
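These lists follow the X-macro pattern used by Envoy's stats macros: each list is expanded with per-type generator macros to declare one struct member per stat. A trimmed-down illustration, with hypothetical generator names and stand-in stat types (not Envoy's real `stats_macros.h` generators):

```cpp
// A trimmed-down stat list in the same X-macro style as the lists above.
#define EXAMPLE_CLUSTER_STATS(COUNTER, GAUGE) \
  COUNTER(update_attempt) \
  COUNTER(update_success) \
  GAUGE(version)

// Illustrative stand-ins for Envoy's stat types.
struct Counter {
  long value = 0;
  void inc() { ++value; }
};
struct Gauge {
  long value = 0;
  void set(long v) { value = v; }
};

// Generator macros turn each list entry into a struct member.
#define GENERATE_COUNTER_MEMBER(name) Counter name##_;
#define GENERATE_GAUGE_MEMBER(name) Gauge name##_;

// Expands to: Counter update_attempt_; Counter update_success_; Gauge version_;
struct ExampleClusterStats {
  EXAMPLE_CLUSTER_STATS(GENERATE_COUNTER_MEMBER, GENERATE_GAUGE_MEMBER)
};
```

Because each subgroup above is its own list, each can be expanded into its own struct, and each struct can then be instantiated eagerly or lazily on its own schedule.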

Risk Level: LOW; this should be a no-behavior-change CL.
Testing: N/A; the existing stats tests should prove that there is no behavior change.
Docs Changes:
Release Notes:
Platform Specific Features:

…n non-config-update stats

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz stevenzzzz marked this pull request as draft October 17, 2022 16:44
@stevenzzzz
Contributor Author

/assign @jmarantz

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Contributor

@jmarantz jmarantz left a comment


lgtm at high level modulo that you haven't made the traffic-only ones lazy yet and you need to add some comments showing why the partition exists.


/**
* All cluster stats. @see stats_macros.h
* All cluster config update related stats.
Contributor


I think some comments here about the rationale for the partitioning would help.

In particular, what I think you mean to have here is a block of stats you expect to be incremented for clusters that see zero traffic.

Whereas ALL_CLUSTER_STATS (which should be re-named as that's now a lie :) are stats that you expect to only be incremented when a cluster sees traffic. Is that right?

Contributor Author


Sounds good. I will add more details in comments.

@@ -850,6 +851,7 @@ class ClusterInfoImpl : public ClusterInfo,
TransportSocketMatcherPtr socket_matcher_;
Stats::ScopeSharedPtr stats_scope_;
mutable ClusterStats stats_;
Contributor


so IIRC the game-plan would be to convert this to Thread::AtomicPtr<ClusterStats> stats_ right? And then stats() above would be implemented as stats_.get() ?

Contributor Author


right. more PRs to come.
The idea is that we only instantiate a sub-group when an individual field in it is accessed (ideally, only when it's incremented).
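The get-or-create accessor discussed in this thread could look roughly like the sketch below, using `std::atomic` and a mutex in place of Envoy's `Thread::AtomicPtr` (all names here are illustrative, not the PR's actual code):

```cpp
#include <atomic>
#include <memory>
#include <mutex>

// Stand-in for the traffic-related stats block.
struct ClusterStatsSketch {
  long upstream_rq_total = 0;
};

class LazyStats {
public:
  // Double-checked get-or-create: the fast path is a single atomic load,
  // so clusters that never see traffic never allocate the block.
  ClusterStatsSketch& get() {
    ClusterStatsSketch* s = ptr_.load(std::memory_order_acquire);
    if (s == nullptr) {
      std::lock_guard<std::mutex> lock(init_mutex_);
      s = ptr_.load(std::memory_order_relaxed);
      if (s == nullptr) {
        owned_ = std::make_unique<ClusterStatsSketch>();
        s = owned_.get();
        ptr_.store(s, std::memory_order_release);
      }
    }
    return *s;
  }

  bool allocated() const { return ptr_.load(std::memory_order_acquire) != nullptr; }

private:
  std::atomic<ClusterStatsSketch*> ptr_{nullptr};
  std::mutex init_mutex_;
  std::unique_ptr<ClusterStatsSketch> owned_; // keeps the block alive
};
```

With this shape, a `stats()` method on `ClusterInfoImpl` would just forward to `get()`, matching the `stats_.get()` idea above.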

Contributor


I would put the AtomicPtr in this PR, otherwise this PR actually increases memory usage, even in the case where some clusters get no traffic. It will definitely increase memory usage (by a small amount) when all clusters get traffic. But those situations probably don't have a huge number of clusters.

Contributor Author


I plan to have another "refactoring" PR before actually introducing the "lazy-init" struct, to hopefully make the PRs easier to read.
Let me know if you would prefer a bigger PR. :P

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Contributor Author

@stevenzzzz stevenzzzz left a comment


Sorry for the late response, PTAL!



Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Contributor Author

Closing this one as #23907 is a superset of it.

@stevenzzzz stevenzzzz closed this Nov 14, 2022