
Echo back and track origin request-receipt timing deltas#477

Merged
dubious90 merged 28 commits into envoyproxy:master from oschaaf:origin-timings
Sep 1, 2020

Conversation

@oschaaf (Member) commented Aug 21, 2020

This adds a rough cut of a test server capability to track the deltas between
request arrival times and echo them back under a configurable response header name.

Subsequently, the nighthawk_client gains a feature to track numbers received via an arbitrary
header in a histogram. Right now it is hard-coded to track x-nh-do-not-use-origin-timings,
but the plan is to allow at least one configurable header name to track.

Combined, these test server and client features give insight into the origin-side point of
view on arrival timings [1]. The direct use case is manual inspection / sanity checking, but
divergence from what would be expected based on client-side configuration could also be used
to quantify "distortion" introduced by an intermediary proxy.

In a follow-up, documentation, client configuration for specifying the response header name used
to track latency, and end-to-end tests will be added.

When a test between the Nighthawk client and test server uses this feature, a new histogram
is added to the output. On my machine, a simple 1000 qps test yields the following perspective from
the test server on the deltas between inbound requests:

```
0.5         2519        0s 000ms 999us
0.75        3767        0s 001ms 001us
0.8         4002        0s 001ms 002us
0.9         4508        0s 001ms 003us
0.95        4756        0s 001ms 004us
0.990625    4953        0s 001ms 005us
0.99902344  4995        0s 001ms 015us
```

[1] Concretely, if a straightforward 1000 qps is configured, the histogram obtained from tracking the server side should reasonably reflect that and show a histogram diverging very little from 1ms. 

oschaaf added 9 commits June 15, 2020 17:31
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Will still fail the test because of mismatched expectations,
but this will help merging master periodically in here to
verify (almost) all is well.

@oschaaf oschaaf changed the title PoC: echo back origin request-receipt timing deltas Echo back and track origin request-receipt timing deltas Aug 24, 2020
@oschaaf oschaaf marked this pull request as ready for review August 24, 2020 22:21
@oschaaf oschaaf added the waiting-for-review A PR waiting for a review. label Aug 24, 2020
```cpp
response_header_sizes_statistic_.addValue(response_headers_->byteSize());
const uint64_t response_code = Envoy::Http::Utility::getResponseStatus(*response_headers_);
stream_info_.response_code_ = static_cast<uint32_t>(response_code);
const auto timing_header_name = Envoy::Http::LowerCaseString("x-nh-do-not-use-origin-timings");
```
oschaaf (Member Author):
Note to reviewers: the odd name here is to scare people away from using this. My plan is to make a new option for nighthawk_client to allow configuration of this header name.

Contributor:
My concern here is that I think we are introducing two different header namespaces now. See https://github.com/envoyproxy/nighthawk/blob/master/source/server/well_known_headers.h, where we have x-nighthawk-test-server-config. It would be nice if we only had one namespace, to avoid confusion.

If this is the first introduction of x-nh-, can we change it to x-nighthawk-?

If not, can we create an issue to decide on one of the two, and at least make sure that one contains all of the headers, even if we support the other for backwards compatibility?

Contributor:
Separately, can you explain why having a configurable header name here is beneficial? I'm just not sure why we wouldn't dictate this.

oschaaf (Member Author):
re: separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this

Well, in my opinion, there's certainly something to say for dictating the names: less code, less configuration. Here are the thoughts that made me lean towards allowing the end user control of the header names involved. I think this mostly applies to both the client and test-server aspects of it:

  1. Being able to configure the header names both on the server and client side of Nighthawk decouples them. This flexibility may come in handy when mixing Nighthawk components with other OSS or homegrown clients/proxies/test servers with similar capabilities (especially when these dictate the naming involved).
  2. If both the proxy and the test server emit timing data using different header names, one can switch which one the clients should track.
  3. If multiple proxies or test servers are set up to emit latencies under different names when scaling horizontally, the client could track a single instance instead of all of them.
  4. If timing-filter gets extended to emit more timings, and we end up with multiple headers, clients can easily switch which one they track.
  5. In examples the header name could be a little bit longer for clarity, but in final testing, one might want to opt to make it as short as possible to minimise overhead.

Considering the above, let me know what you think.

Contributor:
This makes sense. Thanks for the explanation.

@mum4k mum4k requested a review from eric846 August 24, 2020 22:34
@mum4k (Collaborator) commented Aug 24, 2020

@eric846 please review and assign back to me once done.

```cpp
if (absl::SimpleAtoi(timing_value, &origin_delta) && origin_delta >= 0) {
  origin_latency_statistic_.addValue(origin_delta);
} else {
  // TODO(XXX): Can we make sure we avoid high frequency logging for this somehow?
```
oschaaf (Member Author):
Punted to #484, which has a candidate PoC solution. I remember there are other spots where this would be good to have in the existing code base as well.

```cpp
// monotonically.
const Envoy::MonotonicTime new_time = time_source.monotonicTime();
const uint64_t elapsed = start_ == Envoy::MonotonicTime::min() ? 0 : (new_time - start_).count();
start_ = new_time;
```
eric846 (Contributor) commented Aug 25, 2020:

Warning: Attempting to reason about concurrency...

Do you see any weird effects running with concurrency > 1 on the test server?

I was imagining a situation with 2 test server threads operating almost in lockstep, but 1ns apart. Suppose the requests come in evenly every 1ms. If the test server got stalled for a few milliseconds and the OS ended up buffering several packets, and then the test server threads started consuming them round robin, wouldn't the timing reports end up alternating 0, 1ns, 999999ns, 1ns, 999999ns, 1ns, ...?

This would actually represent the true response timing coming out of the test server, so the code in this PR would be working as intended.

Extreme statistics like that could give the impression that there is extreme distortion from the intermediary proxy, but it's entirely an artifact of the Nighthawk test server having two threads and stalling for a few milliseconds.

(The intermediary might also introduce its own equally extreme irregularities, since it could also be a multithreaded Envoy.)

The obvious way to avoid this issue is to use concurrency=1 on the test server. Then we would only be measuring the distortions from intermediaries and the OS.

Do you think there's any future reporting that could diagnose this issue? For example, if we also had exactly this PR but with thread-local stopwatches, we would see perfect evenness rather than 1-999999 fluctuation. We would know that the 1-999999 in the cross-thread stats was the fault of our own multithreading, and any timing deviations would be our OS or the intermediary.

Contributor:

Actually, maybe varying between 1ns and 1ms would be considered negligible distortion, especially if the intermediary added multiple milliseconds.

oschaaf (Member Author):

Your reasoning here makes sense to me. The intent of this is to be able to sanity check part of our timings, so if it would catch those extremes where we don't expect them, I think it would be providing value.

The way I see it is that this is going to be particularly useful in A/B tests, for example:

  • across different clients
  • with and without an intermediary
  • with nighthawk client features that might perform relatively heavy lifting on cpu cores shared with the workers (e.g. stats flushing).

Some existing OSS clients can probably easily be tweaked to take these as inputs to their latency reports to figure out their request-release timing precisions.

Also, something like the delta between the expected and observed curves could perhaps serve as a means to quantify distortion; indeed, I think the extremes in the scenario you described would be signal in this case, and not noise (looking at you, adaptive load controller :-)).

As for tracking thread-local values: I've been thinking about that, and I think it's quite doable on the server side by adding a new Envoy histogram in the extension. On the client side, however, it will take more consideration and work, as there would be some feature gaps:

  • be able to set multiple response headers for tracking, or create a means to read multiple values from one header.
  • if we want to correlate thread IDs we need to be able to tag values too, but this may not be necessary.
  • a means to dynamically create histograms, or a histogram capable of grouping on a tag, would be needed for the Nighthawk client to handle it.

oschaaf (Member Author):

Afterthought: if we had per-worker histograms on the test server side, plus a means to read/sample those histograms on the NH client side and propagate them to the output, that might be a relatively low-effort way to achieve this.

oschaaf (Member Author):

Last thought: per-worker tracking may also yield data that's interesting for exposing imbalance across workers.
I filed #487 to get this on the radar and serve as a point for further discussion.

@eric846 (Contributor) commented Aug 26, 2020

LGTM modulo adding the comment

@oschaaf oschaaf added waiting-for-review A PR waiting for a review. and removed waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. labels Aug 28, 2020
@oschaaf (Member Author) commented Aug 28, 2020

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #477 (comment) was created by @oschaaf.


@mum4k (Collaborator) left a comment:

LGTM

Thank you for moving this into an extension. I have a few documentation nits and one question: now that it is an extension, should we also add user documentation on how to enable and use it?


```cpp
using HttpTimeTrackingFilterConfigSharedPtr = std::shared_ptr<HttpTimeTrackingFilterConfig>;

class HttpTimeTrackingFilter : public Envoy::Http::PassThroughFilter {
```
Collaborator:

Can we document this class and its constructor also?

@mum4k mum4k added waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. and removed waiting-for-review A PR waiting for a review. labels Aug 28, 2020
@mum4k
Copy link
Copy Markdown
Collaborator

mum4k commented Aug 28, 2020

Handing over to @dubious90 due to my vacation.

@mum4k mum4k assigned dubious90 and unassigned mum4k Aug 28, 2020
@oschaaf (Member Author) commented Aug 29, 2020

Addressed @mum4k's comments in 70502e1. Marking this as ready for another round!

With respect to user-facing documentation, I'm planning the following as one or more follow-ups, to keep PR size reasonable:

  • Add configuration (CLI arg/proto arg) to the client for the response header name that should be treated as a latency input and tracked in a histogram.
  • End to end tests
  • User facing documentation

@oschaaf oschaaf added waiting-for-review A PR waiting for a review. and removed waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. labels Aug 29, 2020
@oschaaf (Member Author) commented Aug 31, 2020

Opened #500 as a WIP follow-up to this.

@dubious90 (Contributor) left a comment:

Hi Otto, most of these changes look good. I have a few comments. Happy to discuss anything you disagree with or are confused by.

In the future, it would be easier for reviews if you split PRs like this into 2-3 smaller ones.



```
    "thread_safe_monotonic_time_stopwatch.h",
],
repository = "@envoy",
visibility = ["//visibility:public"],
```
Contributor:
No changes required for this (not even sure if it's possible), but it would be nice if we could set this at the package level instead of on every single target in common/BUILD. That doesn't seem to be supported by envoy_package, though.

oschaaf (Member Author):

I think this is a good point; I filed an issue tagged as tech-debt to track this: #502


```cpp
/**
 * @param time_source used to obtain a sample of the current monotonic time.
 * @return uint64_t 0 on the first invocation, and the number of elapsed nanoseconds since the
```
Contributor:

In Google convention, we'd most likely use absl::Duration here rather than an int representing a unit of time. Would that be reasonable here, or no?

oschaaf (Member Author):

Well, I didn't consider absl::Duration (I didn't know it existed), but I did consider std::chrono::duration, and ended up leaning towards uint64_t.

Having said that, I feel this is a subjective matter, so I'd be happy to change this if you lean towards absl::Duration after reading my considerations above.

Contributor:

That makes sense. I don't know if it's worth bringing in absl::Duration just for this. If we wanted to, we could try to bring in absl::Duration on a more global change later.

```cpp
// @@ -0,0 +1,68 @@
#include "server/http_time_tracking_filter.h"
```
Contributor:

This directory is starting to grow. Would it make sense to create a separate PR that moves this and other filters all into a filters subdirectory?

oschaaf (Member Author):

Yes, I think that would make sense. Alternatively, we could keep this directory just for extensions and their configuration, and re-home everything else that's in there so far: configuration.h/cc (helpers) and well_known_headers.h (HTTP header name definitions). Should I file an issue to track this?

Contributor:

An issue to track this would be great. I'm happy to defer to you on what makes the most sense. The filters subdirectory was just what made sense to me on a quick look-through.

@dubious90 dubious90 added waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. and removed waiting-for-review A PR waiting for a review. labels Sep 1, 2020
@oschaaf (Member Author) commented Sep 1, 2020

@dubious90 Feedback partially addressed in 7774e2d

@oschaaf oschaaf added waiting-for-review A PR waiting for a review. and removed waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. labels Sep 1, 2020
@dubious90 (Contributor) left a comment:

Thank you for the changes. Looks good

@dubious90 dubious90 dismissed mum4k’s stale review September 1, 2020 17:17

Changes were addressed. @mum4k is no longer available or involved with this PR.

@dubious90 dubious90 merged commit 8aa8fff into envoyproxy:master Sep 1, 2020
dubious90 pushed a commit that referenced this pull request Sep 9, 2020
Adds a client option and wires it through in TCLAP. Amends tests and code to work with that
instead of the hard coded value. Adds end to end test, and enables test-server extension build.

Follow up to #477

Fixes #360: with this the feature is ready to use.

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>

-------

TODO: 
- [x] Land #477 and merge it in here 
- [x] Wire up TCLAP
- [x] Tests for the new option parsing/handling
- [x] Enable build of the new time-tracking extension into nighthawk_test_server
- [x] End-to-end tests to prove the new client-side feature can be used to track latencies delivered by response header emitted by the test-server's new feature. 
- [x] CLI & proto description / comments, regen docs. 
- [x] Replace Stopwatch shared_ptr with unique_ptr
