
initial metrics service #2323

Merged
htuch merged 18 commits into envoyproxy:master from ramaraochavali:feature/metrics_service_impl
Jan 15, 2018

Conversation

@ramaraochavali
Contributor

@ramaraochavali ramaraochavali commented Jan 6, 2018

Signed-off-by: Rama ramaraochavali@gmail.com

Metrics Service Implementation

Description:
Implements the Metrics Service in Envoy, which continuously streams metrics to the configured gRPC endpoint.

Risk Level: Small-medium
Testing: unit test, integration test

Docs Changes:

Data Plane PR

Release Notes:

Release notes updated

API Changes:

Data Plane API

Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 @htuch This is by no means complete, but it sets up the basic structure along the lines of the access log implementation. There are a bunch of TODOs to address, which I will do in the coming days, but I want your input at this high level on whether the direction makes sense before I go too far.
Can you please take a quick look from that perspective?

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@dnoe
Contributor

dnoe commented Jan 8, 2018

@mrice32 might like to take a look too

Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 @htuch did some more work. A few more TODOs still need to be addressed, and tests need to be added.

Member

@mrice32 mrice32 left a comment


This is really cool! Thanks for working on it. I have one small nit above. I also have a design question: there are a few comments about the sink and streams operating with thread-local data. Why does this need to work that way? Disclaimer: I may be a little behind on changes to the stats processing in Envoy, so my assertions may be incorrect here. Aren't counters and gauges flushed to the sink periodically on the main thread? IIUC, histogram values may be flushed from many threads, however. So is this TLS work centered around just giving each thread a stream to export its histogram samples, specifically? I only did a brief pass over the code, so it's totally possible that I'm missing something.


private:
GrpcMetricsStreamerSharedPtr grpc_metrics_streamer_;
envoy::api::v2::StreamMetricsMessage message;
Member

Nit: message_;

Member

@htuch htuch left a comment

@ramaraochavali Thanks for the contribution. This needs coverage and tests; I can do a deeper review tomorrow. I would second @mrice32's point on TLS: I'm trying to understand why this needs to be per-worker-thread.

void send(envoy::api::v2::StreamMetricsMessage& message);

GrpcMetricsServiceClientPtr client_;
// TODO(ramachavali): Map is not required as there is only one entry.
Member

Yeah, I was wondering what the purpose of this was.

@ramaraochavali
Contributor Author

@mrice32 @htuch regarding thread-local storage: I looked at the statsd sink implementation, and it also uses TLS, so I went ahead and used the approach that @mattklein123 used for the gRPC-streaming access log implementation. I may have misunderstood or be missing something here; I thought the stats sink should work similarly.

@mattklein123
Member

Aren't counters and gauges flushed to the sink periodically on the main thread? IIUC histogram values may be flushed from many threads, however. So is this TLS work centered around just giving each thread a stream to export its histogram samples, specifically?

That's exactly right. Counters/gauges are flushed on the main thread, however histograms are emitted directly to the sink from each worker.

In the future, when we optionally support built-in histograms, this may not always be the case. A sink could also decide to add the histograms to a queue for later flushing by the main thread if desired, but that would add lock contention unless the queue is per-thread, with a periodic flush to the main thread.

I haven't looked at the review yet, but my recommendation would be to follow the pattern of the gRPC access logger, as I think was done here. We can consider additional designs at a later point. (Note that this will likely be the exact same design as what the gRPC tracing code will use.)
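The threading model described above can be illustrated with a minimal standalone sketch (hypothetical names, not Envoy's actual classes): each worker owns a thread-local stream, so histogram samples recorded on different threads never contend on shared state.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Simplified sketch: one "stream" per thread, no locking required.
struct Stream {
  std::vector<std::string> sent; // metric names written to this stream
};

// Each thread gets its own stream; workers never contend with each other.
Stream& threadLocalStream() {
  thread_local Stream stream;
  return stream;
}

// Called from any thread; writes go to the calling thread's own stream.
void recordHistogramSample(const std::string& name) {
  threadLocalStream().sent.push_back(name);
}

// Helper for illustration: run a worker that records `n` samples and
// report how many landed on that worker's own thread-local stream.
size_t workerSampleCount(size_t n) {
  size_t count = 0;
  std::thread worker([&] {
    for (size_t i = 0; i < n; ++i) {
      recordHistogramSample("latency");
    }
    count = threadLocalStream().sent.size();
  });
  worker.join();
  return count;
}
```

Because the main thread's stream is a different `thread_local` instance, counters/gauges flushed there stay isolated from worker-side histogram writes.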

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
Member

@htuch htuch left a comment

Two followups:

  1. Are we planning on doing histograms here? I see a TODO, not sure what the plan is.

  2. Is the implication of this architecture (and the gRPC access logging) that we have one stream per worker? I'm wondering if we'll potentially run into the same issues that inspired ADS in some deployments, namely that streams might map to distinct connections, and each connection might point at a different management server, creating additional work at the server. As it is, there's a need to reconcile across the different streams even with only a single management server.

@ramaraochavali
Contributor Author

@htuch on histograms, I am actually trying to figure out how to map the Envoy implementation to the Prometheus proto exactly. I could not find a direct mapping; I will update by tomorrow.

@mattklein123
Member

Are we planning on doing histograms here? I see a TODO, not sure what the plan is.

If there are no histogram samples emitted in this PR, we don't need TLS. (I have not yet looked at PR).

Is the implication of this architecture (and the gRPC access logging) that we have one stream per worker? I'm wondering if we'll potentially run into the same issues that inspired ADS in some deployments, namely that streams might map to distinct connections and each connection might point at a different management server, creating additional work at the server. As it is, there's a need to reconcile across the different streams even if only a single management server.

Yup that is the implication. I can't say whether we will run into the same issues, but I know that this design is fine for Lyft. I think if we need aggregation we can do it as a follow up since it will be more complicated (per-thread buffering with flush to central thread, appropriate locking, etc.).

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

ramaraochavali commented Jan 10, 2018

I have added an integration test as well, so it should be good to review.
A few comments:

  1. On histograms: currently it is not possible to map Envoy's histogram implementation to the Metrics Proto Histogram, as we do not have buckets defined here. However, if we want to send the raw histogram values, we can use the Untyped metric type and send them. Later, when Envoy fully supports a native histogram implementation, this can be changed to use the Histogram proto type. Otherwise, we can leave histograms out for now. Let me know what the best option is here.
  2. Regarding the implication that @htuch brought up, I agree with @mattklein123 that for use cases like metrics, access logs, etc., aggregation may not be required. Even if these streams go to different management servers, the management server is ultimately going to push these metrics to backends, so it may not matter which management server does it.
  3. Even if we leave histograms out of this PR, I prefer to keep the TLS implementation as is, because when Envoy supports histograms natively, we can just send them without much effort. @mattklein123, let me know what you think about this.

The only thing pending from this PR is to update to the latest data-plane-api and use the new GrpcService, which needs to be coordinated.
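The histogram fallback in point 1 could look roughly like the following sketch (simplified stand-in types, not the actual io::prometheus::client protos): each raw histogram sample is exported as an untyped metric until native buckets exist.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins for the Prometheus metric family types.
enum class MetricType { Counter, Gauge, Untyped };

struct Metric {
  std::string name;
  MetricType type;
  double value;
};

// Each raw sample becomes one untyped metric. Once Envoy supports native
// histograms with buckets, this would emit a Histogram proto instead.
std::vector<Metric> samplesToUntyped(const std::string& name,
                                     const std::vector<uint64_t>& samples) {
  std::vector<Metric> out;
  out.reserve(samples.size());
  for (uint64_t sample : samples) {
    out.push_back({name, MetricType::Untyped, static_cast<double>(sample)});
  }
  return out;
}
```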

Member

@mattklein123 mattklein123 left a comment

In general LGTM. Thank you for working on this. Some comments to get started with. I will take another pass after.

@@ -0,0 +1,167 @@
#pragma once
Member

Side note: Given how much of this code was mostly copied from my access log implementation, I feel like there could be a base class. All of this code is basically going to get copied again for the trace service implementation also. I would just add a TODO somewhere to look into converging this code and the access log code into a common base class.

Member

+1

private:
/**
* Shared state that is owned by the per-thread streamers. This allows the
* main streamer/TLS
Member

nit: reflow comment

class MetricsServiceSink : public Sink {
public:
// MetricsService::Sink
MetricsServiceSink(GrpcMetricsStreamerSharedPtr grpc_metrics_streamer);
Member

nit: const GrpcMetricsStreamerSharedPtr&

gauage_metric->set_value(value);
}

void endFlush() override { grpc_metrics_streamer_->send(message_); }
Member

After the first flush, you should clear identifier info for perf reasons. Since you are reusing the same message as a member that won't happen by default (which is different from how the access log code works).
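The suggestion can be sketched as follows (hypothetical simplified types, not the actual envoy::api::v2::StreamMetricsMessage): the streamer clears the identifier after the first send, so the reused member message omits it on subsequent flushes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the stream message that is reused across flushes.
struct StreamMetricsMessage {
  std::string identifier;           // node identity; only needed once per stream
  std::vector<std::string> metrics; // payload, refilled on every flush
  void clearIdentifier() { identifier.clear(); }
};

class Streamer {
public:
  // Returns a rough "size" of what this flush sends (for illustration:
  // identifier bytes plus the number of metrics).
  size_t send(StreamMetricsMessage& msg) {
    size_t size = msg.identifier.size() + msg.metrics.size();
    // After the first flush, drop the identifier so subsequent flushes
    // of the reused message do not resend it.
    msg.clearIdentifier();
    msg.metrics.clear();
    return size;
  }
};
```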

metrics_service_request_->headers().Path()->value().c_str());
EXPECT_STREQ("application/grpc",
metrics_service_request_->headers().ContentType()->value().c_str());
std::cout << "waitForMetricsRequest"
Member

del

};

// Singleton registration via macro defined in envoy/singleton/manager.h
SINGLETON_MANAGER_REGISTRATION(grpc_metrics_streamer);
Member

In the case of the stat sink, you can kill all the singleton stuff. Only a single sink is instantiated in the server.

metrics_service_request_->startGrpcStream();
envoy::api::v2::StreamMetricsResponse response_msg;
metrics_service_request_->sendGrpcMessage(response_msg);
metrics_service_request_->finishGrpcStream(Grpc::Status::Ok);
Member

please see #2333. You will need a similar guard here.

metrics_service_request_ = fake_metrics_service_connection_->waitForNewStream(*dispatcher_);
}

void waitForMetricsRequest() {
Member

Can we potentially verify some actual metrics here?

Contributor Author

I cannot verify actual metric names, but I can assert that there are metrics. I think that should be fine; LMK if you think otherwise.

Member

@mattklein123 mattklein123 Jan 11, 2018

I think you should be able to verify at least a single gauge and counter exists, and maybe even that a value is greater than zero. Can you dump them and find some good ones?

Contributor Author

I think we can actually validate the metrics_service cluster-related metrics, as we are adding the cluster in the test itself, so they will be there for sure. I have added validation for a counter and a gauge, including values. LMK if that makes sense; sorry, I could have done this earlier.

Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 addressed all your review comments. Can you PTAL?

Signed-off-by: Rama <ramaraochavali@gmail.com>
Member

@mattklein123 mattklein123 left a comment

Generally, LGTM. Thanks for this. A few small remaining comments.

@htuch @mrice32 can one of you take another pass?

cluster_name),
server.threadLocal(), server.localInfo());

return Stats::SinkPtr(new Stats::Metrics::MetricsServiceSink(grpc_metrics_streamer));
Member

nit: std::make_unique
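As a generic illustration of the nit (hypothetical simplified types, requires C++14): `std::make_unique` replaces the explicit `new` when returning an owning pointer.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for the sink type being constructed.
struct Sink {
  std::string name;
  explicit Sink(std::string n) : name(std::move(n)) {}
};

using SinkPtr = std::unique_ptr<Sink>;

SinkPtr createSink() {
  // Instead of: return SinkPtr(new Sink("metrics_service"));
  return std::make_unique<Sink>("metrics_service");
}
```

`make_unique` avoids spelling the type twice and is exception-safe when used inside larger expressions.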

}

ProtobufTypes::MessagePtr MetricsServiceSinkFactory::createEmptyConfigProto() {
return std::unique_ptr<envoy::api::v2::MetricsServiceConfig>(
Member

nit: std::make_unique

metrics_service_cluster->mutable_http2_protocol_options();

auto* metrics_sink = bootstrap.add_stats_sinks();
// metrics_sink->MergeFrom(bootstrap.stat_sinks()[0]);
Member

?

Contributor Author

Not required; removed.


Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

ramaraochavali commented Jan 12, 2018

@mattklein123 addressed all the comments. PTAL.

Member

@mattklein123 mattklein123 left a comment

LGTM after nits. Nice! @mrice32 @htuch LMK if either of you wants to take a pass through this.

std::string known_counter("cluster.metrics_service.membership_change");
std::string known_gauge("cluster.metrics_service.membership_total");
int metrics_size = envoy_metrics.size();
for (int i = 0; i < metrics_size; i++) {
Member

nit: You should be able to use a C++11 for loop here and just iterate through the envoy_metrics() metrics.

int metrics_size = envoy_metrics.size();
for (int i = 0; i < metrics_size; i++) {
::io::prometheus::client::MetricFamily metrics_family = envoy_metrics.Get(i);
if (known_counter.compare(metrics_family.name()) == 0 &&
Member

nit: just do metrics_family.name() == "cluster.metrics_service.membership_change". same below
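The two nits combined might look roughly like this (a simplified stand-in for the io::prometheus::client::MetricFamily proto): a range-for with direct `operator==` comparison instead of index-based iteration and `compare()`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the Prometheus MetricFamily proto.
struct MetricFamily {
  std::string name;
  double value;
};

// Returns true (and the value) if the named family is present.
bool findMetric(const std::vector<MetricFamily>& envoy_metrics,
                const std::string& wanted, double& value_out) {
  // Range-for instead of indexed loop; == instead of .compare() == 0.
  for (const MetricFamily& family : envoy_metrics) {
    if (family.name == wanted) {
      value_out = family.value;
      return true;
    }
  }
  return false;
}
```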

htuch pushed a commit to envoyproxy/data-plane-api that referenced this pull request Jan 12, 2018
Doc PR for Envoy implementation PR envoyproxy/envoy#2323
Signed-off-by: Rama <ramaraochavali@gmail.com>
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@mattklein123 addressed the nits. PTAL.

Signed-off-by: Rama <ramaraochavali@gmail.com>
Member

@htuch htuch left a comment

LGTM, this looks rad.

ThreadLocal::SlotAllocator& tls,
const LocalInfo::LocalInfo& local_info)
: tls_slot_(tls.allocateSlot()) {

Member

Tiny nit: prefer less whitespace than more here.

Contributor Author

fixed

@@ -0,0 +1,167 @@
#pragma once
Member

+1

if (metrics_family.name().compare("cluster.metrics_service.membership_change") == 0 &&
metrics_family.metric(0).has_counter()) {
known_counter_exists = true;
known_counter_value = metrics_family.metric(0).counter().value();
Member

I would move the EXPECT_EQ to here.

Contributor Author

moved

bool known_gauge_exists = false;
int known_counter_value = -1;
int known_gauge_value = -1;
for (::io::prometheus::client::MetricFamily metrics_family : envoy_metrics) {
Member

It might be slightly cleaner to convert this into a std::unordered_map or the like and then write the match logic.

Contributor Author

I thought about this earlier but felt that building another container may not be required here, so I did not do it. Let me know if you feel strongly about it; I can make that change.

@mattklein123 mattklein123 self-assigned this Jan 12, 2018
Signed-off-by: Rama <ramaraochavali@gmail.com>
@ramaraochavali
Contributor Author

@htuch addressed all your comments except using the unordered map transform in the test. Let me know if you feel strongly about it; otherwise it is good to go.

@ramaraochavali
Contributor Author

@htuch can you PTAL when you get time and let me know if anything else needs to be done on this?

Member

@htuch htuch left a comment

Awesome sauce.

@htuch htuch merged commit e103e8f into envoyproxy:master Jan 15, 2018
@ramaraochavali
Contributor Author

@htuch awesome, thanks much!

@ramaraochavali ramaraochavali deleted the feature/metrics_service_impl branch January 16, 2018 05:09
Shikugawa pushed a commit to Shikugawa/envoy that referenced this pull request Mar 28, 2020
jpsim added a commit that referenced this pull request Nov 28, 2022
`android_dist_ci` was a superset of `android_dist` and `android_dist`
stopped working in #2184.

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request Nov 29, 2022
`android_dist_ci` was a superset of `android_dist` and `android_dist`
stopped working in #2184.

Signed-off-by: JP Simard <jp@jpsim.com>
Elite1015 pushed a commit to Elite1015/data-plane-api that referenced this pull request Feb 23, 2025
Doc PR for Envoy implementation PR envoyproxy/envoy#2323
Signed-off-by: Rama <ramaraochavali@gmail.com>