add deferred_creation util into stats#27899
Conversation
Cribbed from PR envoyproxy#23921 Please see that PR for how it is used. Signed-off-by: Xin Zhuang <stevenzzz@google.com>
|
/assign @pradeepcrao @jmarantz |
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
jmarantz
left a comment
There was a problem hiding this comment.
looks great; mostly comment nits but still a few questions on what you are capturing in your lambdas.
envoy/stats/stats.h
Outdated
| * Interface for stats lazy initialization. | ||
| * To save memory and CPU consumption from unused stats, Envoy can enable the bootstrap config | ||
| * :ref:`enable_deferred_creation_stats | ||
| * <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.enable_deferred_creation_stats>`. |
There was a problem hiding this comment.
put the API update in this PR
There was a problem hiding this comment.
bootstrap.proto is in this pR
| */ | ||
| #define MAKE_STATS_STRUCT(StatsStruct, StatNamesStruct, ALL_STATS) \ | ||
| struct StatsStruct { \ | ||
| using StatNameType = StatNamesStruct; \ |
There was a problem hiding this comment.
I think I understand why you need to create this alias. But can you add a comment explaining that?
There was a problem hiding this comment.
done, added a line comment.
envoy/stats/stats.h
Outdated
|
|
||
| /** | ||
| * Interface for stats lazy initialization. | ||
| * To save memory and CPU consumption from unused stats, Envoy can enable the bootstrap config |
There was a problem hiding this comment.
suggest slightly more detail:
To save memory and CPU consumption on blocks of stats that are never referenced throughout the process lifetime, they can be encapsulated in a DeferredCreationCompatibleInterface. Then the Envoy bootstrap configuration can be set to defer the instantiation of those block. Note that when the blocks of stats are created, they carry an extra ~160 byte overhead (depending on worker thread count) due to internal bookkeeping data structures. The overhead when deferred stats are disabled is just 8 bytes.
WDYT? You should confirm the 160 bytes.
There was a problem hiding this comment.
updated.
did you see this? #23921 (comment)
"The last column shows the overhead is around 5MB for 100K clusters."
| *scope, {pool.add(StatsStructType::typeName()), pool.add("initialized")}, | ||
| Stats::Gauge::ImportMode::HiddenAccumulate); | ||
| }()), | ||
| ctor_([&, stats_scope = std::move(scope)]() -> StatsStructType* { |
There was a problem hiding this comment.
s/&/this/ ? does that work?
| ctor_([&, stats_scope = std::move(scope)]() -> StatsStructType* { | ||
| initialized_.inc(); | ||
| // Reset ctor_ to save some RAM. | ||
| Cleanup reset_ctor([&] { ctor_ = nullptr; }); |
source/server/configuration_impl.cc
Outdated
| // stats_config_ should be set before populating the sinks so that it is available | ||
| // from the ServerFactoryContext when creating the stats sinks. | ||
| stats_config_ = std::make_unique<StatsConfigImpl>(bootstrap); | ||
| ENVOY_LOG(info, "loading stats sinks configuration"); |
There was a problem hiding this comment.
s/sinks // -- this seems line-change seems superfluous to this PR; I assume it's a merge glitch.
There was a problem hiding this comment.
no, it's necessary move of code.
we have discussed this before.
see the comment at line 91 for why.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
| // To do that we keep an "initialized" gauge in the cluster's scope, which will be associated by | ||
| // name to the previous generation's cluster's lazy-init block. We use the value in this shared | ||
| // gauge to determine whether to instantiate the lazy block on construction. | ||
| // TODO(#26106): See #14610. The initialized_ gauge could be disabled in a |
There was a problem hiding this comment.
This is not possible thanks to @DiazAlan's PR: https://github.com/envoyproxy/envoy/blob/main/source/common/stats/thread_local_store.cc#L602
There was a problem hiding this comment.
ahh, right, thanks!
There was a problem hiding this comment.
You cannot disable hidden stats per @DiazAlan so you can remove the TODO starting line 72 also.
| */ | ||
| #define MAKE_STATS_STRUCT(StatsStruct, StatNamesStruct, ALL_STATS) \ | ||
| struct StatsStruct { \ | ||
| using StatNameType = StatNamesStruct; \ |
There was a problem hiding this comment.
done, added a line comment.
| *scope, {pool.add(StatsStructType::typeName()), pool.add("initialized")}, | ||
| Stats::Gauge::ImportMode::HiddenAccumulate); | ||
| }()), | ||
| ctor_([&, stats_scope = std::move(scope)]() -> StatsStructType* { |
| ctor_([&, stats_scope = std::move(scope)]() -> StatsStructType* { | ||
| initialized_.inc(); | ||
| // Reset ctor_ to save some RAM. | ||
| Cleanup reset_ctor([&] { ctor_ = nullptr; }); |
| // Optional set of stats sinks. | ||
| repeated metrics.v3.StatsSink stats_sinks = 6; | ||
|
|
||
| // When true, enable deferred creation feature for compatible stats types. |
There was a problem hiding this comment.
This is where the doc is generated from, so more detail is needed here. Copying from my other comment:
// To save memory and CPU consumption on blocks of stats that are never referenced throughout
// the process lifetime, they can be encapsulated in a DeferredCreationCompatibleInterface. Then
// the Envoy bootstrap configuration can be set to defer the instantiation of those blocks. Note that
// when the blocks of stats are created, they carry an extra overhead (around 60-100 bytes depending
// on worker thread count) due to internal bookkeeping data structures.
//
// When set to true, Envoy will defer initializing selected blocks of stats until the first time they are updated.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
envoy/stats/stats.h
Outdated
| * process lifetime, they can be encapsulated in a DeferredCreationCompatibleInterface. Then the | ||
| Envoy | ||
| * bootstrap configuration can be set to defer the instantiation of those block. Note that when the | ||
| * blocks of stats are created, they carry an extra ~160 byte overhead (depending on worker thread |
There was a problem hiding this comment.
sorry, change this now to 60-100 bytes to be consistent with the bootstrap comment.
| // Caller should make sure scope and stat_names outlive this object. | ||
| DeferredCreation(const typename StatsStructType::StatNameType& stat_names, | ||
| Stats::ScopeSharedPtr scope) | ||
| : initialized_([&scope]() -> Gauge& { |
There was a problem hiding this comment.
do we need to capture scope by value here?
That this works makes me suspect we don't have a test that removes the scope first and then instantiates the stats.
There was a problem hiding this comment.
The constructor accepts the scope by value, so this should be fine, right?
There was a problem hiding this comment.
Actually come to think of it, line 41 will cause the scope originally passed in to the constructor to be deleted. Shouldn't we keep the scope alive for the lifetime of DeferredCreation as a member within it and just reference that in ctor_ instead of moving it into it?
There was a problem hiding this comment.
the constructor will be out of scope before the lambda gets run, so no help there :(
I think maybe this is all moot because StatsStructType does not hold a reference-count for the scope, but holds it by Scope&. This is because I only (relatively) recently changed scopes to be represented as shared_ptr but didn't change all the Scope& everywhere in Envoy to a ScopeSharedPtr.
So what will happen in the test scenario I just thought of:
ScopeSharedPtr scope = createScope(...);
DeferredCreation deferred_stats(stat_names, scope);
scope.reset();
deferred_stats.instantiate();
I think the code as is will reference freed memory, but with my suggestion it will work, but not be useful :)
It will successfully instantiate the stats in the scope, and then the Cleanup thing will remove the scope. So you'll be left with struct full of invalid stats. So if you then do:
deferred_stats->my_counter_.inc():
that will then crash.
This is beyond the scope of this PR to resolve (sic) but I'd still vote for capturing scope by value on line 32.
There was a problem hiding this comment.
The scope is used first in initialize the gauge, and then moved into the ctor_, could you help me to understand the life cycle issue there?
the scope is always supposed to outlive the stats, otherwise the stats will be deleted IIUC.
There was a problem hiding this comment.
I think I articulated the lifecycle issue above.
In practice it doesn't matter but my recommendation stands -- just change &scope to scope in your capture.
There was a problem hiding this comment.
I am not following where this "reference freed memory" is from.
the line 32 capture is used and thrown away after init of the initialized_ Gauge.
Scope is held in ctor_ until it's called.
the case raised by @jmarantz should have no problem to call instantiate(), because the scope was captured.
but further references to its member I am not sure, since the last copy of "scope" is deleted.
I have the understanding that scope always outlive the individual StatStruct, if that not true, then the previous suggestion of do not capture scope in DeferredCreation need to be revisited.
There was a problem hiding this comment.
The lambda in line 32 is not stored anywhere, so capturing scope by value there does not change anything. It's a temporary that is invoked and immediately destroyed.
The lambda that is stored is the one on line 38, and the scope there is (effectively) captured by value (moved from the passed in by value object for the constructor of DeferredCreation.
My point though is that too is destroyed when instantiate is called as ctor_ is set to nullptr when invoked.
However storing the scope in DeferredCreation would also require storing it in DirectStats or else we could see unexpected behaviour in the future based on the value of enableDeferredCreationStats
There was a problem hiding this comment.
With the code as is, the call to deferred_stats.instantiate() will work just fine, but deferred_stats->my_counter_.inc() will crash (but this is the same behaviour for direct_stats).
There was a problem hiding this comment.
added comment as asked.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
@ggreenway can you do a pass?
This is reduced from the previous large lazy-init PR and just has the new lazy-init support class, its doc, tests, and bootstrap API.
The application of this new deferred-compatible class to the cluster traffic stats will be in a follow-up PR, and is also much smaller than it was before.
Thanks @stevenzzzz for getting it to this highly reviewable state!
|
format still needs fixing. |
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
|
done
…On Fri, Jun 9, 2023 at 5:09 PM Joshua Marantz ***@***.***> wrote:
format still needs fixing.
—
Reply to this email directly, view it on GitHub
<#27899 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA2I6TDENLXCZ4LJK3LF4TLXKOGJRANCNFSM6AAAAAAZAT5M7M>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Xin Zhuang
|
|
🙀 Error while processing event: |
|
/retest |
|
/assign @adisuissa |
adisuissa
left a comment
There was a problem hiding this comment.
Thanks!
Left a couple of API comments.
| // Optional set of stats sinks. | ||
| repeated metrics.v3.StatsSink stats_sinks = 6; | ||
|
|
||
| DeferredStatOptions deferred_stat_options = 39; |
There was a problem hiding this comment.
Please add a comment to the field.
|
|
||
| message DeferredStatOptions { | ||
| // To save memory and CPU consumption on blocks of stats that are never referenced throughout | ||
| // the process lifetime, they can be encapsulated in a DeferredCreationCompatibleInterface. Then |
There was a problem hiding this comment.
Stating the interface name seems to be very low-level.
Is the intent to have an attribute for a stat that declares whether it is deferred or not? If so, I suggest having a doc somewhere with a list of stats that are known to be deferred, and add a link here to the doc.
There was a problem hiding this comment.
Good point about the C++ class name -- that doesn't need to be in the API doc.
In this PR (per my request) there are no stats that are impacted by this setting; that will come in a follow-up PR with its own set of complexities. Adi you are saying that a doc of some form should be the place where we list what stats are affected? What kind of doc?
There was a problem hiding this comment.
Maybe an additional list isn't needed. Taking a step back, what is the expected behavior change from a user's perspective?
My assumption is that using a field and not a runtime flag implies that this is a knob that is intended to be kept.
Should a user care if some stat is part of the deferred list or not?
There was a problem hiding this comment.
According to an offline discussion with @stevenzzzz, there will be multiple PRs each adding a subset of stats that will be deferred.
It is unclear to me whether having a single flag to enable/disable all of the subsets is the right approach mainly because enabling the flag in two different versions of Envoy may present different behavior with the same traffic patterns.
Have you considered using different fields/list of enums for different subsets?
Assuming this was considered and was intentionally avoided, I suggest the following:
- Setting this field as
[#not-implemented-hide:]because setting it shouldn't change anything (correct me if I'm wrong). - clarifying the comment as follows:
When the flag is enabled, Envoy will lazily initialize a subset of the stats (see below). This will save memory and CPU cycles when creating the objects that own these stats, if those stats are never referenced throughout the lifetime of the process. However, it will incur additional memory overhead for these objects, and a small increase of CPU usage when a at least one of the stats is updated for the first time.
Groups of stats that will be lazily initialized:
* Cluster traffic stats: a subgroup of the [cluster statistics](add link) that are used when requests are routed to the cluster.
* ...
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
| // memory overhead for these objects, and a small increase of CPU usage when a at least one of the stats | ||
| // is updated for the first time. | ||
| // Groups of stats that will be lazily initialized: | ||
| // * Cluster traffic stats: a subgroup of the :ref:`cluster statistics <config_cluster_manager_cluster_stats>` |
There was a problem hiding this comment.
Just a sanity check: this PR does change the cluster-traffic stats to be deferred, right?
There was a problem hiding this comment.
Not this PR; that will be in the next PR. Maybe we should leave this as a "pending PR #xxxx" comment and adjust that line in that PR?
There was a problem hiding this comment.
Add not-implemented-hide tag, as noted in the comment above. This can be removed later
There was a problem hiding this comment.
done.
it's kinda "implemented". that that no stats has adopted it. Tag added.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
|
/retest |
|
/retest |
1 similar comment
|
/retest |
Commit Message: With lots of clusters and route-tables in a cloud proxy, we are seeing tons of RAM been spent on stats while most of the stats are never inc-ed due to traffic pattern(or long tail). We are thinking that we can lazy init cluster stats() so that the RAM is only allocated when it's required. To achieve that we need to have finer grained stats group, e.g. configUpdateStats() are frequently updated by config management server, while upstream_xxx are only required when there is traffic for the cluster, for this sub-group we can save RAM by lazy init it. Introduce a new stats utility in this PR such that the nested StatsStruct is only instantiated when any of "->" or "*xx." operator is used. Cribbed from PR envoyproxy#23921 Please see that PR for how it is used. Additional Description: Risk Level: LOW,utility lib not used yet. Testing: unit test and speed test. Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Xin Zhuang <stevenzzz@google.com> Signed-off-by: asheryer <asheryer@amazon.com>
Commit Message: With lots of clusters and route-tables in a cloud proxy, we are seeing tons of RAM been spent on stats while most of the stats are never inc-ed due to traffic pattern(or long tail). We are thinking that we can lazy init cluster stats() so that the RAM is only allocated when it's required. To achieve that we need to have finer grained stats group, e.g. configUpdateStats() are frequently updated by config management server, while upstream_xxx are only required when there is traffic for the cluster, for this sub-group we can save RAM by lazy init it. Introduce a new stats utility in this PR such that the nested StatsStruct is only instantiated when any of "->" or "*xx." operator is used. Cribbed from PR envoyproxy#23921 Please see that PR for how it is used. Additional Description: Risk Level: LOW,utility lib not used yet. Testing: unit test and speed test. Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Xin Zhuang <stevenzzz@google.com> Signed-off-by: Ryan Eskin <ryan.eskin89@protonmail.com>
Commit Message:
With lots of clusters and route-tables in a cloud proxy, we are seeing tons of RAM been spent on stats while most of the stats are never inc-ed due to traffic pattern(or long tail). We are thinking that we can lazy init cluster stats() so that the RAM is only allocated when it's required.
To achieve that we need to have finer grained stats group, e.g. configUpdateStats() are frequently updated by config management server, while upstream_xxx are only required when there is traffic for the cluster, for this sub-group we can save RAM by lazy init it.
Introduce a new stats utility in this PR such that the nested StatsStruct is only instantiated when any of "->" or "*xx." operator is used.
Cribbed from PR #23921
Please see that PR for how it is used.
Additional Description:
Risk Level: LOW,utility lib not used yet.
Testing: unit test and speed test.
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]