
initial metrics service #2323

Merged
htuch merged 18 commits into envoyproxy:master from ramaraochavali:feature/metrics_service_impl
Jan 15, 2018

Conversation

@ramaraochavali
Contributor

@ramaraochavali ramaraochavali commented Jan 6, 2018

Signed-off-by: Rama ramaraochavali@gmail.com

Metrics Service Implementation

Description:
Implements the Metrics Service in Envoy, which continuously streams metrics to the configured gRPC endpoint.

Risk Level: Small-medium
Testing: unit test, integration test

Docs Changes:

Data Plane PR

Release Notes:

Release notes updated

API Changes:

Data Plane API

Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 @htuch This is by no means complete, but it sets up the basic structure along the lines of the access log implementation. There are a bunch of TODOs to address, which I will do in the coming days, but I want your input at this high level on whether the direction makes sense before I go too far.
Can you please take a quick look from that perspective?

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@dnoe
Contributor

dnoe commented Jan 8, 2018

@mrice32 might like to take a look too

Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 @htuch did some more work. A few more TODOs still need to be addressed, and tests need to be added.

Member

@mrice32 mrice32 left a comment


This is really cool! Thanks for working on it. I have one small nit above. I also have a design question: there are a few comments about the sink and streams operating with thread-local data. Why does this need to work that way? Disclaimer: I may be a little behind on changes to the stats processing in Envoy, so my assertions may be incorrect here. Aren't counters and gauges flushed to the sink periodically on the main thread? IIUC, histogram values may be flushed from many threads, however. So is this TLS work centered around just giving each thread a stream to export its histogram samples, specifically? I only did a brief pass over the code, so it's totally possible that I'm missing something.


private:
GrpcMetricsStreamerSharedPtr grpc_metrics_streamer_;
envoy::api::v2::StreamMetricsMessage message;
Member

Nit: message_;

Member

@htuch htuch left a comment

@ramaraochavali Thanks for the contribution. This needs coverage and tests; I can do a deeper review tomorrow. I would second @mrice32's point on TLS: I'm trying to understand why this needs to be per-worker-thread.

void send(envoy::api::v2::StreamMetricsMessage& message);

GrpcMetricsServiceClientPtr client_;
// TODO(ramachavali): Map is not required as there is only one entry.
Member

Yeah, I was wondering what the purpose of this was.

@ramaraochavali
Contributor Author

@mrice32 @htuch regarding thread-local storage: I looked at the statsd sink implementation, and it also uses TLS, so I went ahead and used the approach that @mattklein123 used for the gRPC-streaming access log implementation. I may have misunderstood or be missing something here; I thought the stats sink should work similarly.

@mattklein123
Member

Aren't counters and gauges flushed to the sink periodically on the main thread? IIUC histogram values may be flushed from many threads, however. So is this TLS work centered around just giving each thread a stream to export its histogram samples, specifically?

That's exactly right. Counters/gauges are flushed on the main thread, however histograms are emitted directly to the sink from each worker.

In the future, when we optionally support built-in histograms, this may not always be the case. A sink could also decide to add the histograms to a queue for later flushing by the main thread if desired, but that would add lock contention unless the queue is per-thread, with a periodic flush to the main thread.

I haven't looked at the review yet, but my recommendation would be to follow the pattern of the gRPC access logger, as I think was done here. We can consider additional designs at a later point. (Note that this will likely be the exact same design as what the gRPC tracing code will use.)
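The threading model described above can be illustrated with a minimal standalone sketch (hypothetical names, not Envoy's actual classes): each worker owns a thread-local stream, so histogram samples recorded on different threads never contend on shared state.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Simplified sketch: one "stream" per thread, no locking required.
struct Stream {
  std::vector<std::string> sent; // metric names written to this stream
};

// Each thread gets its own stream; workers never contend with each other.
Stream& threadLocalStream() {
  thread_local Stream stream;
  return stream;
}

// Called from any thread; writes go to the calling thread's own stream.
void recordHistogramSample(const std::string& name) {
  threadLocalStream().sent.push_back(name);
}

// Helper for illustration: run a worker that records `n` samples and
// report how many landed on that worker's own thread-local stream.
size_t workerSampleCount(size_t n) {
  size_t count = 0;
  std::thread worker([&] {
    for (size_t i = 0; i < n; ++i) {
      recordHistogramSample("latency");
    }
    count = threadLocalStream().sent.size();
  });
  worker.join();
  return count;
}
```

Because the main thread's stream is a different `thread_local` instance, counters/gauges flushed there stay isolated from worker-side histogram writes.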

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Member

@htuch htuch left a comment

Two followups:

  1. Are we planning on doing histograms here? I see a TODO, not sure what the plan is.

  2. Is the implication of this architecture (and the gRPC access logging) that we have one stream per worker? I'm wondering if we'll potentially run into the same issues that inspired ADS in some deployments, namely that streams might map to distinct connections, and each connection might point at a different management server, creating additional work at the server. As it is, there's a need to reconcile across the different streams even with only a single management server.

@ramaraochavali
Contributor Author

@htuch on histograms, I am actually trying to figure out how to map the Envoy implementation to the Prometheus proto exactly. I could not find a direct mapping; I will update by tomorrow.

@mattklein123
Member

Are we planning on doing histograms here? I see a TODO, not sure what the plan is.

If there are no histogram samples emitted in this PR, we don't need TLS. (I have not yet looked at PR).

Is the implication of this architecture (and the gRPC access logging) that we have one stream per worker? I'm wondering if we'll potentially run into the same issues that inspired ADS in some deployments, namely that streams might map to distinct connections and each connection might point at a different management server, creating additional work at the server. As it is, there's a need to reconcile across the different streams even if only a single management server.

Yup that is the implication. I can't say whether we will run into the same issues, but I know that this design is fine for Lyft. I think if we need aggregation we can do it as a follow up since it will be more complicated (per-thread buffering with flush to central thread, appropriate locking, etc.).

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

ramaraochavali commented Jan 10, 2018

I have added an integration test as well, so it should be good to review.
A few comments:

  1. On histograms: currently it is not possible to map Envoy's histogram implementation to the Metrics Proto Histogram, as we do not have buckets defined here. However, if we want to send the raw histogram values, we can use the Untyped metric type and send them. Later, when Envoy fully supports a native histogram implementation, this can be changed to use the Histogram proto type. Otherwise, we can leave histograms out for now. Let me know what the best option is here.
  2. Regarding the implication that @htuch brought up, I agree with @mattklein123 that for use cases like metrics, access logs, etc., aggregation may not be required. Even if these streams go to different management servers, the management server is ultimately going to push these metrics to backends, so it may not matter which management server does it.
  3. Even if we leave histograms out of this PR, I prefer to keep the TLS implementation as is, because when Envoy supports histograms natively, we can just send them without much effort. @mattklein123, let me know what you think about this.

The only thing pending from this PR is to update to the latest data-plane-api and use the new GrpcService, which needs to be coordinated.
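The histogram fallback in point 1 could look roughly like the following sketch (simplified stand-in types, not the actual io::prometheus::client protos): each raw histogram sample is exported as an untyped metric until native buckets exist.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins for the Prometheus metric family types.
enum class MetricType { Counter, Gauge, Untyped };

struct Metric {
  std::string name;
  MetricType type;
  double value;
};

// Each raw sample becomes one untyped metric. Once Envoy supports native
// histograms with buckets, this would emit a Histogram proto instead.
std::vector<Metric> samplesToUntyped(const std::string& name,
                                     const std::vector<uint64_t>& samples) {
  std::vector<Metric> out;
  out.reserve(samples.size());
  for (uint64_t sample : samples) {
    out.push_back({name, MetricType::Untyped, static_cast<double>(sample)});
  }
  return out;
}
```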

Member

@mattklein123 mattklein123 left a comment

In general LGTM. Thank you for working on this. Some comments to get started with. I will take another pass after.

@@ -0,0 +1,167 @@
#pragma once
Member

Side note: Given how much of this code was mostly copied from my access log implementation, I feel like there could be a base class. All of this code is basically going to get copied again for the trace service implementation also. I would just add a TODO somewhere to look into converging this code and the access log code into a common base class.

Member

+1

private:
/**
* Shared state that is owned by the per-thread streamers. This allows the
* main streamer/TLS
Member

nit: reflow comment

class MetricsServiceSink : public Sink {
public:
// MetricsService::Sink
MetricsServiceSink(GrpcMetricsStreamerSharedPtr grpc_metrics_streamer);
Member

nit: const GrpcMetricsStreamerSharedPtr&

gauage_metric->set_value(value);
}

void endFlush() override { grpc_metrics_streamer_->send(message_); }
Member

After the first flush, you should clear identifier info for perf reasons. Since you are reusing the same message as a member that won't happen by default (which is different from how the access log code works).
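The suggestion can be sketched as follows (hypothetical simplified types, not the actual envoy::api::v2::StreamMetricsMessage): the streamer clears the identifier after the first send, so the reused member message omits it on subsequent flushes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the stream message that is reused across flushes.
struct StreamMetricsMessage {
  std::string identifier;           // node identity; only needed once per stream
  std::vector<std::string> metrics; // payload, refilled on every flush
  void clearIdentifier() { identifier.clear(); }
};

class Streamer {
public:
  // Returns a rough "size" of what this flush sends (for illustration:
  // identifier bytes plus the number of metrics).
  size_t send(StreamMetricsMessage& msg) {
    size_t size = msg.identifier.size() + msg.metrics.size();
    // After the first flush, drop the identifier so subsequent flushes
    // of the reused message do not resend it.
    msg.clearIdentifier();
    msg.metrics.clear();
    return size;
  }
};
```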

metrics_service_request_->headers().Path()->value().c_str());
EXPECT_STREQ("application/grpc",
metrics_service_request_->headers().ContentType()->value().c_str());
std::cout << "waitForMetricsRequest"
Member

del

};

// Singleton registration via macro defined in envoy/singleton/manager.h
SINGLETON_MANAGER_REGISTRATION(grpc_metrics_streamer);
Member

In the case of the stat sink, you can kill all the singleton stuff. Only a single sink is instantiated in the server.

metrics_service_request_->startGrpcStream();
envoy::api::v2::StreamMetricsResponse response_msg;
metrics_service_request_->sendGrpcMessage(response_msg);
metrics_service_request_->finishGrpcStream(Grpc::Status::Ok);
Member

please see #2333. You will need a similar guard here.

metrics_service_request_ = fake_metrics_service_connection_->waitForNewStream(*dispatcher_);
}

void waitForMetricsRequest() {
Member

Can we potentially verify some actual metrics here?

Contributor Author

I cannot verify actual metric names, but I can assert that there are metrics. I think that should be fine; LMK if you think otherwise.

Member

@mattklein123 mattklein123 Jan 11, 2018

I think you should be able to verify at least a single gauge and counter exists, and maybe even that a value is greater than zero. Can you dump them and find some good ones?

Contributor Author

I think we can actually validate the metrics_service cluster-related metrics, as we are adding the cluster in the test itself, so they will be there for sure. I have added validation for a counter and a gauge, including values. LMK if that makes sense; sorry, I could have done this earlier.

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 addressed all your review comments. Can you PTAL?

Signed-off-by: Rama <ramaraochavali@gmail.com>
Member

@mattklein123 mattklein123 left a comment

Generally, LGTM. Thanks for this. A few small remaining comments.

@htuch @mrice32 can one of you take another pass?

cluster_name),
server.threadLocal(), server.localInfo());

return Stats::SinkPtr(new Stats::Metrics::MetricsServiceSink(grpc_metrics_streamer));
Member

nit: std::make_unique
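As a generic illustration of the nit (hypothetical simplified types, requires C++14): `std::make_unique` replaces the explicit `new` when returning an owning pointer.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for the sink type being constructed.
struct Sink {
  std::string name;
  explicit Sink(std::string n) : name(std::move(n)) {}
};

using SinkPtr = std::unique_ptr<Sink>;

SinkPtr createSink() {
  // Instead of: return SinkPtr(new Sink("metrics_service"));
  return std::make_unique<Sink>("metrics_service");
}
```

`make_unique` avoids spelling the type twice and is exception-safe when used inside larger expressions.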

}

ProtobufTypes::MessagePtr MetricsServiceSinkFactory::createEmptyConfigProto() {
return std::unique_ptr<envoy::api::v2::MetricsServiceConfig>(
Member

nit: std::make_unique

metrics_service_cluster->mutable_http2_protocol_options();

auto* metrics_sink = bootstrap.add_stats_sinks();
// metrics_sink->MergeFrom(bootstrap.stat_sinks()[0]);
Member

?

Contributor Author

Not required; removed.


Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

ramaraochavali commented Jan 12, 2018

@mattklein123 addressed all the comments. PTAL.

Member

@mattklein123 mattklein123 left a comment

LGTM after nits. Nice! @mrice32 @htuch LMK if either of you wants to take a pass through this.

std::string known_counter("cluster.metrics_service.membership_change");
std::string known_gauge("cluster.metrics_service.membership_total");
int metrics_size = envoy_metrics.size();
for (int i = 0; i < metrics_size; i++) {
Member

nit: You should be able to use a C++11 for loop here and just iterate through the envoy_metrics() metrics.

int metrics_size = envoy_metrics.size();
for (int i = 0; i < metrics_size; i++) {
::io::prometheus::client::MetricFamily metrics_family = envoy_metrics.Get(i);
if (known_counter.compare(metrics_family.name()) == 0 &&
Member

nit: just do metrics_family.name() == "cluster.metrics_service.membership_change". same below
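The two nits combined might look roughly like this (a simplified stand-in for the io::prometheus::client::MetricFamily proto): a range-for with direct `operator==` comparison instead of index-based iteration and `compare()`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the Prometheus MetricFamily proto.
struct MetricFamily {
  std::string name;
  double value;
};

// Returns true (and the value) if the named family is present.
bool findMetric(const std::vector<MetricFamily>& envoy_metrics,
                const std::string& wanted, double& value_out) {
  // Range-for instead of indexed loop; == instead of .compare() == 0.
  for (const MetricFamily& family : envoy_metrics) {
    if (family.name == wanted) {
      value_out = family.value;
      return true;
    }
  }
  return false;
}
```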

htuch pushed a commit to envoyproxy/data-plane-api that referenced this pull request Jan 12, 2018
Doc PR for Envoy implementation PR envoyproxy/envoy#2323
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 addressed the nits. PTAL.

Signed-off-by: Rama <ramaraochavali@gmail.com>
Member

@htuch htuch left a comment

LGTM, this looks rad.

ThreadLocal::SlotAllocator& tls,
const LocalInfo::LocalInfo& local_info)
: tls_slot_(tls.allocateSlot()) {

Member

Tiny nit: prefer less whitespace than more here.

Contributor Author

fixed

@@ -0,0 +1,167 @@
#pragma once
Member

+1

if (metrics_family.name().compare("cluster.metrics_service.membership_change") == 0 &&
metrics_family.metric(0).has_counter()) {
known_counter_exists = true;
known_counter_value = metrics_family.metric(0).counter().value();
Member

I would move the EXPECT_EQ to here.

Contributor Author

moved

bool known_gauge_exists = false;
int known_counter_value = -1;
int known_gauge_value = -1;
for (::io::prometheus::client::MetricFamily metrics_family : envoy_metrics) {
Member

It might be slightly cleaner to convert this into a std::unordered_map or the like and then write the match logic.

Contributor Author

I thought about this earlier but felt that building another container may not be required here, so I did not do it. Let me know if you feel strongly about it; I can make that change.

@mattklein123 mattklein123 self-assigned this Jan 12, 2018
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@htuch addressed all your comments except using the unordered map transform in the test. Let me know if you feel strongly about it; otherwise it is good to go.

@ramaraochavali
Contributor Author

@htuch can you PTAL when you get time and let me know if anything else needs to be done on this?

Member

@htuch htuch left a comment

Awesome sauce.

@htuch htuch merged commit e103e8f into envoyproxy:master Jan 15, 2018
@ramaraochavali
Contributor Author

@htuch awesome, thanks much!

@ramaraochavali ramaraochavali deleted the feature/metrics_service_impl branch January 16, 2018 05:09
Shikugawa pushed a commit to Shikugawa/envoy that referenced this pull request Mar 28, 2020
jpsim added a commit that referenced this pull request Nov 28, 2022
`android_dist_ci` was a superset of `android_dist` and `android_dist`
stopped working in #2184.

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request Nov 29, 2022
`android_dist_ci` was a superset of `android_dist` and `android_dist`
stopped working in #2184.

Signed-off-by: JP Simard <jp@jpsim.com>
Elite1015 pushed a commit to Elite1015/data-plane-api that referenced this pull request Feb 23, 2025
Doc PR for Envoy implementation PR envoyproxy/envoy#2323
Signed-off-by: Rama <ramaraochavali@gmail.com>