Echo back and track origin request-receipt timing deltas #477
dubious90 merged 28 commits into envoyproxy:master
Conversation
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Will still fail the test because of mismatched expectations, but this will help merging master periodically in here to verify (almost) all is well.
source/client/stream_decoder.cc
response_header_sizes_statistic_.addValue(response_headers_->byteSize());
const uint64_t response_code = Envoy::Http::Utility::getResponseStatus(*response_headers_);
stream_info_.response_code_ = static_cast<uint32_t>(response_code);
const auto timing_header_name = Envoy::Http::LowerCaseString("x-nh-do-not-use-origin-timings");
Note to reviewers: the odd name here is to scare people away from using this. My plan is to make a new option for nighthawk_client to allow configuration of this header name.
my concern here is i think we are introducing two different header namespaces now. See https://github.com/envoyproxy/nighthawk/blob/master/source/server/well_known_headers.h, where we have x-nighthawk-test-server-config. It would be nice if we only had the one namespace, to avoid confusion.
If this is the first introduction of x-nh-, can we change it to x-nighthawk-?
If not, can we create an issue to decide on one of those two, and at least make sure that one contains all of the headers, even if we support the other one for backwards compatibility.
separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this
re: separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this
Well, in my opinion, there's certainly something to say for dictating the names: less code, less configuration. Here are the thoughts that made me lean towards allowing the end user control of the header names involved. I think this mostly applies to both the client and test-server aspects of it:
- Being able to configure the header names both on the server and client side of Nighthawk decouples them. This flexibility may come in handy when mixing Nighthawk components with other OSS or homegrown clients/proxies/test servers with similar capabilities (especially when these dictate the naming involved).
- If both the proxy and test server emit timing data using different header names, one can switch which one the clients should track
- If multiple proxies or test-servers are set up to emit latencies using different names in horizontal scaling, it could be used to track a single instance instead of all of them.
- If timing-filter gets extended to emit more timings, and we end up with multiple headers, clients can easily switch which one they track.
- In examples the header name could be a little bit longer for clarity, but in final testing, one might want to opt to make it as short as possible to minimise overhead.
Considering the above, let me know what you think.
This makes sense. Thanks for the explanation.
@eric846 please review and assign back to me once done.
source/client/stream_decoder.cc
if (absl::SimpleAtoi(timing_value, &origin_delta) && origin_delta >= 0) {
  origin_latency_statistic_.addValue(origin_delta);
} else {
  // TODO(XXX): Can we make sure we avoid high frequency logging for this somehow?
Punted to #484, which has a candidate PoC solution. I remember there are other spots where this would be good to have in the existing code base as well.
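For illustration, a dependency-free sketch of this parse-and-record step. Here std::from_chars stands in for absl::SimpleAtoi and a hypothetical Statistic type stands in for Nighthawk's streaming statistic; both substitutions are assumptions, not the PR's actual code.

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for Nighthawk's streaming statistic: just collects values.
struct Statistic {
  std::vector<uint64_t> values;
  void addValue(uint64_t v) { values.push_back(v); }
};

// Record the origin-timing header value only when the whole string is a
// well-formed non-negative integer, mirroring the SimpleAtoi + >= 0 guard.
bool recordOriginTiming(const std::string& header_value, Statistic& stat) {
  uint64_t origin_delta = 0;
  const char* end = header_value.data() + header_value.size();
  auto [ptr, ec] = std::from_chars(header_value.data(), end, origin_delta);
  if (ec != std::errc() || ptr != end) {
    return false;  // Malformed value; real code would log this (rate-limited, see #484).
  }
  stat.addValue(origin_delta);
  return true;
}
```

Because the target type is unsigned, values like "-1" fail to parse, which matches the intent of the `origin_delta >= 0` guard.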
// monotonically.
const Envoy::MonotonicTime new_time = time_source.monotonicTime();
const uint64_t elapsed = start_ == Envoy::MonotonicTime::min() ? 0 : (new_time - start_).count();
start_ = new_time;
Warning: Attempting to reason about concurrency...
Do you see any weird effects running with concurrency > 1 on the test server?
I was imagining a situation with 2 test server threads operating almost in lockstep, but 1ns apart. Suppose the requests come in evenly every 1ms. If the test server got stalled for a few milliseconds and the OS ended up buffering several packets, and then the test server threads started consuming them round robin, wouldn't the timing reports end up alternating 0, 1ns, 999999ns, 1ns, 999999ns, 1ns, ...?
This would actually represent the true response timing coming out of the test server, so the code in this PR would be working as intended.
Extreme statistics like that could give the impression that there is extreme distortion from the intermediary proxy, but it's entirely an artifact of the Nighthawk test server having two threads and stalling for a few milliseconds.
(Also the intermediary might also introduce its own equally extreme irregularities, since it could also be a multithreaded Envoy.)
The obvious way to avoid this issue is to use concurrency=1 on the test server. Then we would only be measuring the distortions from intermediaries and the OS.
Do you think there's any future reporting that could diagnose this issue? For example, if we also had exactly this PR but with thread-local stopwatches, we would see perfect evenness rather than 1-999999 fluctuation. We would know that the 1-999999 in the cross-thread stats was the fault of our own multithreading, and any timing deviations would be our OS or the intermediary.
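The alternation described in this scenario follows from simple arithmetic. The sketch below models two threads stamping one shared stopwatch, using the assumed numbers from the comment (1ms request period, 1ns thread skew), not measured data:

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Model: requests arrive in pairs every 1ms; thread A stamps first, thread B
// stamps 1ns later, and both stamps go through one shared last-timestamp, as
// a cross-thread stopwatch would.
std::vector<uint64_t> sharedStopwatchDeltas(int request_pairs) {
  const uint64_t kPeriodNs = 1000000;  // 1ms between pairs of arrivals
  const uint64_t kThreadSkewNs = 1;    // thread B trails thread A by 1ns
  std::vector<uint64_t> deltas;
  uint64_t last = 0;
  bool first = true;
  for (int i = 0; i < request_pairs; ++i) {
    const uint64_t a = static_cast<uint64_t>(i) * kPeriodNs;  // thread A's stamp
    const uint64_t b = a + kThreadSkewNs;                     // thread B's stamp
    for (uint64_t t : {a, b}) {
      deltas.push_back(first ? 0 : t - last);
      first = false;
      last = t;
    }
  }
  return deltas;
}
```

Running this for a few pairs yields 0, 1, 999999, 1, 999999, 1, ... nanoseconds: exactly the alternating pattern, even though each individual thread is perfectly even.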
Actually, maybe varying between 1ns and 1ms would be considered negligible distortion, especially if the intermediary added multiple milliseconds.
Your reasoning here makes sense to me. The intent of this is to be able to sanity check part of our timings, so if it would catch those extremes where we don't expect them, I think it would be providing value.
The way I see it is that this is going to be particularly useful in A/B tests, for example:
- across different clients
- with and without an intermediary
- with nighthawk client features that might perform relatively heavy lifting on cpu cores shared with the workers (e.g. stats flushing).
Some existing OSS clients can probably easily be tweaked to take these as inputs to their latency reports to figure out their request-release timing precisions.
Also, something like the delta between the expected and observed curves could perhaps serve as a means to quantify distortion. Indeed, I think the extremes in the scenario you described would be signal in this case, and not noise (looking at you, adaptive load controller :-)).
As for tracking thread-local values: I've been thinking about that, and I think it's pretty doable on the server side by adding a new Envoy histogram in the extension. On the client side, however, I think it will take more consideration and work, as there would be some feature gaps:
- being able to set multiple response headers for tracking, or a means to read multiple values from the header.
- if we want to be able to correlate thread IDs, we need to be able to tag values too, but this may not be necessary.
- a means to dynamically create histograms, or a histogram capable of grouping on a tag, would be needed for the Nighthawk client to handle it.
Afterthought: maybe, if we'd have the per worker histograms on the test server side, plus a means to read/sample these histograms on the NH client side and propagate these to the output, that might be a relatively low effort means to achieve this.
Last thought: per-worker tracking may also yield data that's interesting in light of exposing unbalancedness.
I filed #487 to get this on the radar and track this / serve as a point for further discussion.
LGTM modulo adding the comment
/retest
🔨 rebuilding
mum4k left a comment
LGTM
Thank you for moving this into an extension. I have a few documentation nits and one question: now that it is an extension, should we also add user documentation on how to enable and use it?
using HttpTimeTrackingFilterConfigSharedPtr = std::shared_ptr<HttpTimeTrackingFilterConfig>;
class HttpTimeTrackingFilter : public Envoy::Http::PassThroughFilter {
Can we document this class and its constructor also?
Handing over to @dubious90 due to my vacation.
Addressed @mum4k's comments in 70502e1. Marking this as ready for another round! With respect to user-facing documentation, I'm planning the following as one or more follow-ups to this, to keep PR size reasonable:

Opened #500 as a WIP follow-up to this.
dubious90 left a comment
Hi Otto, most of these changes look good. Have a few comments. Happy to discuss anything you disagree with or are confused by.
In the future, it would be easier for reviews if you split PRs like this into 2-3 smaller ones.
"thread_safe_monotonic_time_stopwatch.h",
],
repository = "@envoy",
visibility = ["//visibility:public"],
no changes required for this (not even sure if it's possible), but it would be nice if we could set this at the package level instead of on every single target in common/BUILD. But that doesn't seem to be supported by envoy_package.
I think this is a good point, I filed an issue tagged as tech-debt to track this: #502
/**
 * @param time_source used to obtain a sample of the current monotonic time.
 * @return uint64_t 0 on the first invocation, and the number of elapsed nanoseconds since the
In google-convention, we'd most likely try to use absl::duration here, rather than an int representing a unit of time. Would that be reasonable here, or no?
Well, I didn't consider absl::Duration (I didn't know it existed), but I did consider std::chrono::duration, and ended up leaning towards uint64_t because it:
- can't turn negative, which doesn't make sense here
- is less prone to some of the trickiness with conversions vs the underlying types I ran into over at Add TimerImpl::enableHRTimer - take two envoy#9229 (comment).
Having said that, I feel this is a subjective matter, so I'd be happy to change this if you lean towards absl::Duration after reading my considerations above.
That makes sense. I don't know if it's worth bringing in absl::Duration just for this. If we wanted to, we could try to bring in absl::Duration on a more global change later.
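The conversion concern raised in this exchange is easy to demonstrate; the helper below is purely illustrative and not from the codebase:

```cpp
#include <chrono>
#include <cstdint>

// The naive conversion from a chrono delta to an unsigned nanosecond count:
// chrono durations carry a signed representation, so a negative delta silently
// wraps to an enormous value. An API that only ever hands out uint64_t deltas
// from a monotonic clock cannot get into this state in the first place.
uint64_t naiveToUnsignedNs(std::chrono::nanoseconds d) {
  return static_cast<uint64_t>(d.count());
}
```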
@@ -0,0 +1,68 @@
#include "server/http_time_tracking_filter.h"
This directory is starting to grow. Would it make sense to create a separate PR that moves this and other filters all into a filters subdirectory?
Yes I think that would make sense. Alternatively, we could keep this directory just for extensions and their configuration, and re-home everything else that's in there so far: configuration.h/cc (helpers), well_known_headers.h (http header name definitions). Should I file an issue to track this?
An issue to track this would be great. I'm happy to defer to you on what makes the most sense. The filters subdirectory was just what made sense to me on a quick lookthrough.
@dubious90 Feedback partially addressed in 7774e2d
dubious90 left a comment
Thank you for the changes. Looks good
Changes were addressed. Mumak is no longer available or involved with this PR.
Adds a client option and wires it through in TCLAP. Amends tests and code to work with that instead of the hard-coded value. Adds an end-to-end test, and enables the test-server extension build. Follow-up to #477. Fixes #360: with this, the feature is ready to use.
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
TODO:
- [x] Land #477 and merge it in here
- [x] Wire up TCLAP
- [x] Tests for the new option parsing/handling
- [x] Enable build of the new time-tracking extension into nighthawk_test_server
- [x] End-to-end tests to prove the new client-side feature can be used to track latencies delivered by a response header emitted by the test server's new feature
- [x] CLI & proto description / comments, regen docs
- [x] Replace Stopwatch shared_ptr with unique_ptr
This adds a rough cut of the capability to the test server to allow tracking of the deltas between
request arrival times, and to echo these back under a configurable response header name.
Subsequently, the nighthawk_client gets a feature to track numbers received via an arbitrary
header in a histogram. Right now it's hard-coded to track x-nh-do-not-use-origin-timings,
but the plan is to allow at least one configurable header name to track.
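As a rough model of the server-side behavior described above (the class name, method name, and the plain std::map of headers are illustrative assumptions, not the extension's real API):

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Each incoming request gets the nanoseconds elapsed since the previous request
// echoed back in a configurable response header (0 for the very first request).
class OriginTimingEchoSketch {
public:
  explicit OriginTimingEchoSketch(std::string header_name)
      : header_name_(std::move(header_name)) {}

  void onRequest(std::chrono::nanoseconds arrival,
                 std::map<std::string, std::string>& response_headers) {
    const uint64_t delta =
        seen_any_ ? static_cast<uint64_t>((arrival - last_arrival_).count()) : 0;
    seen_any_ = true;
    last_arrival_ = arrival;
    response_headers[header_name_] = std::to_string(delta);
  }

private:
  const std::string header_name_;
  bool seen_any_{false};
  std::chrono::nanoseconds last_arrival_{0};
};
```

The client side then only needs to parse that header value back into an integer and feed it to a histogram.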
Combined, these test-server and client features enable obtaining insight into the origin-side point of
view regarding arrival timings [1]. The direct use case is manual inspection / sanity checking, but
divergence from what would be expected based on client-side configuration could possibly be used
as a way to quantify "distortion" introduced by an intermediary proxy.
In a follow-up, documentation, configuration of the client to specify the response header name that
should be used to track latency, and end-to-end tests will be added.
When setting up a test between the Nighthawk client and test server using this feature, a new histogram
will be sent to the output. On my machine, a simple 1000qps test yields the following perspective from
the test server when it comes to deltas between inbound requests: