Skip to content

dynamic modules: support metrics with labels by defining metric vector counterparts#41003

Merged
mathetake merged 22 commits intoenvoyproxy:mainfrom
wtzhang23:dynamic-modules-metric-vectors
Sep 15, 2025
Merged

dynamic modules: support metrics with labels by defining metric vector counterparts#41003
mathetake merged 22 commits intoenvoyproxy:mainfrom
wtzhang23:dynamic-modules-metric-vectors

Conversation

@wtzhang23
Copy link
Copy Markdown
Contributor

Commit Message: dynamic modules: support metrics with labels by defining metric vector counterparts
Additional Description: Here, we define CounterVec, GaugeVec, and HistogramVec, which behave similar to how prometheus metrics are defined in golang and other languages. Label names are fixed and set on filter config initialization whereas label values are dynamically provisioned for each new Filter instance.
Risk Level: Low (still in active development, feature itself would be high)
Testing: Unit and Integration tests
Docs Changes: N/A
Release Notes: Added in changelogs
Platform Specific Features: N/A
Closes #41002

Signed-off-by: William Zhang <wtzhang23@gmail.com>
…c-vectors

Signed-off-by: William Zhang <wtzhang23@gmail.com>
@kyessenov
Copy link
Copy Markdown
Contributor

kyessenov commented Sep 5, 2025

IMO Otel client library is probably the better UX than prometheus at this point https://opentelemetry.io/docs/languages/go/instrumentation/

@wtzhang23
Copy link
Copy Markdown
Contributor Author

wtzhang23 commented Sep 6, 2025

I think an api similar to otel would require both label names and values to be inserted into the per-filter instance stat name pool. In addition, if we want a simple api where the existing callbacks for defining unlabelled metrics is used to define potentially-labeled metrics, then those counters must be lazily fetched during the lifecycle of the filter. I don't think this is that big of a deal given that the metric values will be in thread-local storage after first invocation, but not 100% sure.

Not sure how expensive it is to not string intern here so not sure how big of a problem this is; at least the filters I've seen internally almost never always take the time to string intern. In this PR, I was somewhat hesitant to expose interning directly to the user, but it could be considered. Pseudocode like:

enum stat_type {
  ID,
  RAW,
}

struct str {
  char* ptr;
  size_t len;
}

struct stat_name {
  stat_type type_;
  union {
    str raw;
    size_t id;
  }
}

struct tag {
  stat_name name;
  stat_name value;
}

// returns id to represent a stat name
size_t reserve_stat_name(name);
...

void increment_counter(..., tag* tags, size_t num_tags);

@kyessenov
Copy link
Copy Markdown
Contributor

kyessenov commented Sep 9, 2025

AFAIK, interning requires a lock on the symbol table, so it cannot be trusted to be run on the data path. If we aspire to support SDKs, none of them really have interning capability AFAIK, and the tags are ultimately represented as vectors of strings somewhere. This is the same representation as Envoy's "inlined string", so it should perform comparatively the same.

If you really care about performance, then I think putting a bit of thought into histograms might pay the best value. They are very expensive in Envoy today, so you might need an alternate storage backend to support more rudimentary use cases (e.g. for plugins willing to take a big hit in accuracy - which is not an option today, all histograms are ~5% accurate today).

@wtzhang23 wtzhang23 marked this pull request as ready for review September 11, 2025 14:31
Copy link
Copy Markdown
Member

@mathetake mathetake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!

/wait

Copy link
Copy Markdown
Member

@mathetake mathetake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!

/wait

Co-authored-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: William T Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
Copy link
Copy Markdown
Member

@mathetake mathetake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits but looking good

Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
Copy link
Copy Markdown
Member

@mathetake mathetake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i like the latest comment on removing virtual -- i like the less code

envoy_dynamic_module_type_module_buffer* label_values,
size_t label_values_length) {

ASSERT(label_values_length == label_names.size());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this will be hard for users to notice what they are doing wrong. ASSERT is only in debug build vs users would likely just use the release build, so this will simply result in the memory corruption without users noticing it, or segfault. In other words i feel this assertion is more likely being hit than not.

I would suggest that at the ABI level maybe allow return -1 to indicate that the label input it wrong as I assume that that is the only failure mode. Maybe we could use -2 for ASSERT(!filter_config->stat_creation_frozen_); as well for the same reason to provide better feedback to module devs. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think this is a good point. To take this further we could borrow from C and support getting the last errno for future callbacks that return an error that is hard to express in the return value (e.g. a pointer). But I think we could keep this simple and just return -error code for now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, so currently the error feedback is not consistent right now. Some callback returns only false or true while logging and the other has an enum for all the failure mode. I think we should stick to the enum everywhere while results should be stored in a pointer passed as an argument. We can do the refactoring in a follow up if we want

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it might be worth having this in a separate PR focusing on standardizing the error type?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see I can follow the callout approach for now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah

Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
@wtzhang23
Copy link
Copy Markdown
Contributor Author

wtzhang23 commented Sep 12, 2025

Took the opposite approach where return a result error code but pass in a pointer for the resultant metric id. This way I didn't have to think about casting size_t to int32_t (though in practice Envoy will probably crash before > than INT32_MAX metrics of the same type are created) and casting the result code + negating it into int32_t.

}

impl<EHF: EnvoyHttpFilter> HttpFilterConfig<EHF> for StatsCallbacksFilterConfig {
fn new_http_filter(&mut self, envoy_filter: &mut EHF) -> Box<dyn HttpFilter<EHF>> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add more dynamic label tests like adding a header as a label value in request_headers callback instead of only just static label value?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And let's verify that in integration tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these tests + more in an integration test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added unit tests as well testing getting a value from a header and from a local variable.

@mathetake
Copy link
Copy Markdown
Member

@wtzhang23 sorry where did you see the flake? i couldn't find here

@wtzhang23
Copy link
Copy Markdown
Contributor Author

@mathetake
Copy link
Copy Markdown
Member

oh in the prechecks... has this been happening on main branch as well?

@mathetake
Copy link
Copy Markdown
Member

ok in this integration test

@mathetake
Copy link
Copy Markdown
Member

so FakeExternalCache this test uses the scheduler API and the new integration tests runs after it; maybe existing bug

@wtzhang23
Copy link
Copy Markdown
Contributor Author

wtzhang23 commented Sep 13, 2025

Ya though from my staring at the code it seems your implementation should be very sound. this can't be null since the free callback can't be called while commit goes on, filter_ won't be null when upgrading to shared and calling onScheduled as checked by the if statement, dispatcher_ should live for the lifetime of the worker thread, and all the logic should be executed by the dispatcher on the thread that owns the filter should the filter be upgraded to shared... Not sure how the integration test framework works but do you think it is possible a nullptr was accidentally passed in as the argument to dispatcher here?

Edit: Oh does dispatcher really last that long? Don't know enough to see if the comment // The callbacks for the filter. They are only valid until onDestroy() is called. applies here to the lifetime of the dispatcher. Though I don't see how this would lead to a failure in the test case, but I haven't looked to carefully.

@mathetake
Copy link
Copy Markdown
Member

I believe dispatcher should live that long;

I have a slight hunch that this is something to do with the FakeExternalCache sending local reply from the on_scheduled callback. Could you try pushing the following patch and see if this happens again here? I think this is in anyways weird

diff --git a/source/extensions/filters/http/dynamic_modules/filter.cc b/source/extensions/filters/http/dynamic_modules/filter.cc
index 7ad53c5dca..82e99f119e 100644
--- a/source/extensions/filters/http/dynamic_modules/filter.cc
+++ b/source/extensions/filters/http/dynamic_modules/filter.cc
@@ -126,8 +126,8 @@ void DynamicModuleHttpFilter::sendLocalReply(
     Code code, absl::string_view body,
     std::function<void(ResponseHeaderMap& headers)> modify_headers,
     const absl::optional<Grpc::Status::GrpcStatus> grpc_status, absl::string_view details) {
-  decoder_callbacks_->sendLocalReply(code, body, modify_headers, grpc_status, details);
   sent_local_reply_ = true;
+  decoder_callbacks_->sendLocalReply(code, body, modify_headers, grpc_status, details);
 }

@wtzhang23
Copy link
Copy Markdown
Contributor Author

Will add tests for errors to resolve coverage failures

Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
Signed-off-by: William Zhang <wtzhang23@gmail.com>
@mathetake
Copy link
Copy Markdown
Member

looks like the flake is gone? I think my hunch was right?

@wtzhang23
Copy link
Copy Markdown
Contributor Author

Ya you are a genius lol

@mathetake
Copy link
Copy Markdown
Member

cool lol... can you merge the main branch again so that we can be extra sure that it's gone

Copy link
Copy Markdown
Member

@mathetake mathetake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the iterations!

@mathetake mathetake merged commit 8b55883 into envoyproxy:main Sep 15, 2025
23 of 24 checks passed
@wtzhang23 wtzhang23 deleted the dynamic-modules-metric-vectors branch September 15, 2025 20:15
@mathetake mathetake mentioned this pull request Sep 15, 2025
11 tasks
@antoniovleonti
Copy link
Copy Markdown
Contributor

//test/extensions/dynamic_modules/http:integration_test is failing ASAN, PTAL? link to raw test failure logs.

@mathetake
Copy link
Copy Markdown
Member

oh ok thanks for the heads up. I will revert this one for now

@wtzhang23
Copy link
Copy Markdown
Contributor Author

wtzhang23 commented Sep 16, 2025

Is this failure related to this PR? Program counter being zero is somewhat strange and the test case is not a stats test case

IpVersions/DynamicModulesIntegrationTest.PerRouteConfig/IPv4AddressSanitizer:DEADLYSIGNAL=====================================================================
ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7f324ed9d650 sp 0x7f324ed9d468 T97)==287==Hint: pc points to the zero page.==287==The signal is caused by a READ memory access.==287==Hint: address points to the zero page./bin/addr2line: DWARF error: invalid or unhandled FORM value: 0x25

@mathetake
Copy link
Copy Markdown
Member

oh

@wtzhang23
Copy link
Copy Markdown
Contributor Author

Is it possible the address sanitizer just can't read the symbol table/other metadata in the module binary and it checked randomly while the program was currently in the module? Not exactly sure how address sanitizers themselves are implemented

@wtzhang23
Copy link
Copy Markdown
Contributor Author

/bin/addr2line: DWARF error: invalid or unhandled FORM value: 0x25 was also emitted in the output FWIW

@mathetake
Copy link
Copy Markdown
Member

#41094 opened an issue

mbadov pushed a commit to mbadov/envoy that referenced this pull request Sep 22, 2025
…r counterparts (envoyproxy#41003)

Commit Message: dynamic modules: support metrics with labels by defining
metric vector counterparts
Additional Description: Here, we define CounterVec, GaugeVec, and
HistogramVec, which behave similar to how prometheus metrics are defined
in
[golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewCounterVec)
and other languages. Label names are fixed and set on filter config
initialization whereas label values are dynamically provisioned for each
new Filter instance.
Risk Level: Low (still in active development, feature itself would be
high)
Testing: Unit and Integration tests
Docs Changes: N/A
Release Notes: Added in changelogs
Platform Specific Features: N/A
Closes envoyproxy#41002

---------

Signed-off-by: William Zhang <wtzhang23@gmail.com>
Co-authored-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Misha Badov <mbadov@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dynamic Modules: Support Metrics with Tags

4 participants