Conversation
Signed-off-by: Rama <ramaraochavali@gmail.com>
@mattklein123 @htuch This is by no means complete, but it sets up the basic structure along the lines of the access log implementation. There are a bunch of TODOs which I will address in the coming days, but I want your input at this high level on whether the direction makes sense before I go too far.
@mrice32 might like to take a look too
@mattklein123 @htuch Did some more work. Still a few more TODOs to be addressed, and tests need to be added.
mrice32
left a comment
This is really cool! Thanks for working on it. I have one small nit above. I also have a design question: there are a few comments about the sink and streams operating with thread-local data. Why does this need to work that way? Disclaimer: I may be a little behind on changes to the stats processing in Envoy, so my assertions here may be incorrect. Aren't counters and gauges flushed to the sink periodically on the main thread? IIUC, histogram values may be flushed from many threads, however. So is this TLS work centered around just giving each thread a stream to export its histogram samples, specifically? I only did a brief pass over the code, so it's totally possible that I'm missing something.
private:
  GrpcMetricsStreamerSharedPtr grpc_metrics_streamer_;
  envoy::api::v2::StreamMetricsMessage message;
htuch
left a comment
@ramaraochavali Thanks for the contribution. This needs coverage and tests; I can do a deeper review tomorrow. I would second @mrice32's point on TLS; I'm trying to understand why this needs to be per worker thread.
void send(envoy::api::v2::StreamMetricsMessage& message);

GrpcMetricsServiceClientPtr client_;
// TODO(ramachavali): Map is not required as there is only one entry.
Yeah, I was wondering what the purpose of this was.
@mrice32 @htuch Regarding thread local: I looked at the statsd sink implementation and it also uses TLS, so I went ahead and used the approach that @mattklein123 used in the gRPC access log implementation. I may have misunderstood or be missing something here; I thought the stats sink should work similarly.
That's exactly right. Counters/gauges are flushed on the main thread; however, histograms are emitted directly to the sink from each worker. In the future, when we optionally support built-in histograms, this may not always be the case. A sink could also decide to add the histograms to a queue for later flush by the main thread if desired, but that would add lock contention unless the queue is per-thread, with periodic flush to the main thread. I haven't looked at the review yet, but my recommendation would be to follow the pattern of the gRPC access logger, as was done here I think. We can consider additional designs at a later point. (Note also that this will likely be the exact same design used by the gRPC tracing code.)
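The threading split described above can be sketched with a minimal, standalone example. All names here (`SketchSink`, etc.) are hypothetical stand-ins, not Envoy's actual classes: histogram samples land in a per-thread buffer on a lock-free hot path, while counters/gauges go through a locked path used only by the main thread's periodic flush.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical sketch of the threading model discussed above: counters/gauges
// are flushed periodically from the main thread, while histogram samples are
// recorded from whichever worker observed them, into that worker's
// thread-local buffer.
class SketchSink {
public:
  // Called only from the main thread during the periodic flush; a lock is
  // acceptable here because it is off the hot path.
  void flushCounter(const std::string& name, uint64_t delta) {
    std::lock_guard<std::mutex> lock(mutex_);
    flushed_.push_back(name + "=" + std::to_string(delta));
  }

  // Called from any worker thread. Writes go to the calling thread's own
  // buffer, so the hot path needs no lock.
  void onHistogramSample(uint64_t value) { tlsSamples().push_back(value); }

  // Number of samples buffered by the *calling* thread only.
  size_t localSampleCount() { return tlsSamples().size(); }

  size_t flushedCount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return flushed_.size();
  }

private:
  // thread_local: one buffer per thread, shared across sink instances in
  // this sketch (fine here since the server has a single metrics sink).
  static std::vector<uint64_t>& tlsSamples() {
    thread_local std::vector<uint64_t> samples;
    return samples;
  }

  std::mutex mutex_;
  std::vector<std::string> flushed_;
};
```

In the real implementation each worker would also own its own gRPC stream (the per-thread streamer), which is what the TLS slot provides; this sketch only shows why the histogram path wants per-thread state.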
htuch
left a comment
Two follow-ups:
- Are we planning on doing histograms here? I see a TODO, but I'm not sure what the plan is.
- Is the implication of this architecture (and the gRPC access logging) that we have one stream per worker? I'm wondering if we'll run into the same issues that inspired ADS in some deployments, namely that streams might map to distinct connections, and each connection might point at a different management server, creating additional work at the server. As it is, there's a need to reconcile across the different streams even with only a single management server.
@htuch On histograms: I am actually trying to figure out how to map exactly from the Envoy implementation to the Prometheus proto. I could not find a direct mapping; will update by tomorrow.
If there are no histogram samples emitted in this PR, we don't need TLS. (I have not yet looked at the PR.)
Yup, that is the implication. I can't say whether we will run into the same issues, but I know that this design is fine for Lyft. I think if we need aggregation we can do it as a follow-up, since it will be more complicated (per-thread buffering with flush to a central thread, appropriate locking, etc.).
I have added an integration test as well, so it should be good to review.
The only thing pending from this PR is to update to the latest data-plane-api and use the new
mattklein123
left a comment
In general LGTM. Thank you for working on this. Some comments to get started with. I will take another pass after.
@@ -0,0 +1,167 @@
#pragma once
Side note: Given how much of this code was mostly copied from my access log implementation, I feel like there could be a base class. All of this code is basically going to get copied again for the trace service implementation also. I would just add a TODO somewhere to look into converging this code and the access log code into a common base class.
private:
  /**
   * Shared state that is owned by the per-thread streamers. This allows the
   * main streamer/TLS
class MetricsServiceSink : public Sink {
public:
  // MetricsService::Sink
  MetricsServiceSink(GrpcMetricsStreamerSharedPtr grpc_metrics_streamer);
nit: const GrpcMetricsStreamerSharedPtr&
gauage_metric->set_value(value);
}

void endFlush() override { grpc_metrics_streamer_->send(message_); }
After the first flush, you should clear identifier info for perf reasons. Since you are reusing the same message as a member that won't happen by default (which is different from how the access log code works).
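The concern can be illustrated with simplified stand-ins for the proto message and sink (all names here are hypothetical, not the real `StreamMetricsMessage`/sink API): because the sink reuses one message object as a member across flushes, the identifier would be re-sent on every flush unless it is cleared after the first send.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins to show why reusing one message member across
// flushes needs the identifier cleared after the first send.
struct StreamMetricsMessageSketch {
  std::string identifier;            // node identity, only needed on the first message
  std::vector<double> envoy_metrics; // metric payload for this flush
};

class MetricsSinkSketch {
public:
  explicit MetricsSinkSketch(std::string identifier) {
    message_.identifier = std::move(identifier);
  }

  void beginFlush() { message_.envoy_metrics.clear(); }
  void addMetric(double value) { message_.envoy_metrics.push_back(value); }

  void endFlush() {
    // Stand-in for grpc_metrics_streamer_->send(message_).
    sent_.push_back(message_);
    // Clear the identifier after the first flush: since the same message
    // object is reused, it would otherwise be serialized on every flush.
    message_.identifier.clear();
  }

  const std::vector<StreamMetricsMessageSketch>& sent() const { return sent_; }

private:
  StreamMetricsMessageSketch message_;
  std::vector<StreamMetricsMessageSketch> sent_;
};
```

This differs from the access log code, which builds a fresh message per entry and so gets this behavior by default.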
metrics_service_request_->headers().Path()->value().c_str());
EXPECT_STREQ("application/grpc",
             metrics_service_request_->headers().ContentType()->value().c_str());
std::cout << "waitForMetricsRequest"
};

// Singleton registration via macro defined in envoy/singleton/manager.h
SINGLETON_MANAGER_REGISTRATION(grpc_metrics_streamer);
In the case of the stat sink, you can kill all the singleton stuff. There is only a single sink instantiated in the server.
metrics_service_request_->startGrpcStream();
envoy::api::v2::StreamMetricsResponse response_msg;
metrics_service_request_->sendGrpcMessage(response_msg);
metrics_service_request_->finishGrpcStream(Grpc::Status::Ok);
Please see #2333. You will need a similar guard here.
metrics_service_request_ = fake_metrics_service_connection_->waitForNewStream(*dispatcher_);
}

void waitForMetricsRequest() {
Can we potentially verify some actual metrics here?
I cannot verify actual metric names, but I can assert that there are metrics. I think that should be fine. LMK if you think otherwise.
I think you should be able to verify at least a single gauge and counter exists, and maybe even that a value is greater than zero. Can you dump them and find some good ones?
I think we can actually validate the metrics_service cluster's own metrics, as we are adding that cluster in the test itself, so they will be there for sure. I have added validation for a counter and a gauge, including values. LMK if that makes sense. Sorry, I could have done this earlier.
@mattklein123 Addressed all your review comments. Can you PTAL?
                     cluster_name),
    server.threadLocal(), server.localInfo());

return Stats::SinkPtr(new Stats::Metrics::MetricsServiceSink(grpc_metrics_streamer));
}

ProtobufTypes::MessagePtr MetricsServiceSinkFactory::createEmptyConfigProto() {
  return std::unique_ptr<envoy::api::v2::MetricsServiceConfig>(
metrics_service_cluster->mutable_http2_protocol_options();

auto* metrics_sink = bootstrap.add_stats_sinks();
// metrics_sink->MergeFrom(bootstrap.stat_sinks()[0]);
Not required; removed.
@mattklein123 Addressed all the comments. PTAL.
std::string known_counter("cluster.metrics_service.membership_change");
std::string known_gauge("cluster.metrics_service.membership_total");
int metrics_size = envoy_metrics.size();
for (int i = 0; i < metrics_size; i++) {
nit: You should be able to use a c++11 for loop here and just iterate through the envoy_metrics() metrics.
int metrics_size = envoy_metrics.size();
for (int i = 0; i < metrics_size; i++) {
  ::io::prometheus::client::MetricFamily metrics_family = envoy_metrics.Get(i);
  if (known_counter.compare(metrics_family.name()) == 0 &&
nit: just do metrics_family.name() == "cluster.metrics_service.membership_change". Same below.
Doc PR for Envoy implementation PR envoyproxy/envoy#2323 Signed-off-by: Rama <ramaraochavali@gmail.com>
@mattklein123 Addressed the nits. PTAL.
                     ThreadLocal::SlotAllocator& tls,
                     const LocalInfo::LocalInfo& local_info)
    : tls_slot_(tls.allocateSlot()) {
Tiny nit: prefer less whitespace than more here.
if (metrics_family.name().compare("cluster.metrics_service.membership_change") == 0 &&
    metrics_family.metric(0).has_counter()) {
  known_counter_exists = true;
  known_counter_value = metrics_family.metric(0).counter().value();
I would move the EXPECT_EQ to here.
bool known_gauge_exists = false;
int known_counter_value = -1;
int known_gauge_value = -1;
for (::io::prometheus::client::MetricFamily metrics_family : envoy_metrics) {
It might be slightly cleaner to convert this into a std::unordered_map or the like and then write the match logic.
I thought about this earlier but felt building another container may not be required here, so I did not do it. Let me know if you feel strongly about it; I can make that change.
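As a concrete version of the two loop nits (a C++11 range-for and plain `==` instead of `std::string::compare`), the matching logic could look like the sketch below. The `MetricSketch`/`MetricFamilySketch` types are simplified stand-ins for the `::io::prometheus::client` protos, not the real generated classes.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the prometheus proto types, just enough to show
// the loop shape suggested in review (hypothetical fields).
struct MetricSketch {
  double counter_value = -1.0; // < 0 means no counter set
  double gauge_value = -1.0;   // < 0 means no gauge set
};
struct MetricFamilySketch {
  std::string name;
  std::vector<MetricSketch> metric;
};

// Range-for with plain == comparison instead of an index loop with compare().
bool validateKnownMetrics(const std::vector<MetricFamilySketch>& envoy_metrics) {
  bool known_counter_exists = false;
  bool known_gauge_exists = false;
  for (const MetricFamilySketch& family : envoy_metrics) {
    if (family.metric.empty()) {
      continue;
    }
    if (family.name == "cluster.metrics_service.membership_change" &&
        family.metric[0].counter_value >= 0) {
      known_counter_exists = true;
    }
    if (family.name == "cluster.metrics_service.membership_total" &&
        family.metric[0].gauge_value > 0) {
      known_gauge_exists = true;
    }
  }
  return known_counter_exists && known_gauge_exists;
}
```

Taking the family by `const&` in the range-for also avoids the per-iteration copy that a by-value loop variable would make.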
@htuch Addressed all your comments except using the unordered map transform in the test. Let me know if you feel strongly about it; otherwise it is good to go.
@htuch Can you PTAL when you get time and let me know if anything else needs to be done on this?
@htuch Awesome, thanks much.
`android_dist_ci` was a superset of `android_dist` and `android_dist` stopped working in #2184. Signed-off-by: JP Simard <jp@jpsim.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Metrics Service Implementation
Description:
Implements a Metrics Service in Envoy that continuously streams metrics to the configured gRPC endpoint.
Risk Level: Small-medium
Testing: unit test, integration test
Docs Changes:
Release Notes:
API Changes: