Echo back and track origin request-receipt timing deltas #477
dubious90 merged 28 commits into envoyproxy:master
Conversation
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
Will still fail the test because of mismatched expectations, but this will help merging master periodically in here to verify (almost) all is well.
source/client/stream_decoder.cc
response_header_sizes_statistic_.addValue(response_headers_->byteSize());
const uint64_t response_code = Envoy::Http::Utility::getResponseStatus(*response_headers_);
stream_info_.response_code_ = static_cast<uint32_t>(response_code);
const auto timing_header_name = Envoy::Http::LowerCaseString("x-nh-do-not-use-origin-timings");
Note to reviewers: the odd name here is to scare people away from using this. My plan is to make a new option for nighthawk_client to allow configuration of this header name.
my concern here is i think we are introducing two different header namespaces now. See https://github.com/envoyproxy/nighthawk/blob/master/source/server/well_known_headers.h, where we have x-nighthawk-test-server-config. It would be nice if we only had the one namespace, to avoid confusion.
If this is the first introduction of x-nh-, can we change it to x-nighthawk-?
If not, can we create an issue to decide on one of those two, and at least make sure that one contains all of the headers, even if we support the other one for backwards compatibility.
separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this
re: separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this
Well, in my opinion, there's certainly something to say for dictating the names: less code, less configuration. Here are the thoughts that made me lean towards allowing the end user control of the header names involved. I think this mostly applies to both the client and test-server aspects of it:
- Being able to configure the header names both on the server and client side of Nighthawk decouples them. This flexibility may come in handy when mixing Nighthawk components with other OSS or homegrown clients/proxies/test servers with similar capabilities (especially when these dictate the naming involved).
- If both the proxy and test server emit timing data using different header names, one can switch which one the clients should track
- If multiple proxies or test-servers are set up to emit latencies using different names in horizontal scaling, it could be used to track a single instance instead of all of them.
- If timing-filter gets extended to emit more timings, and we end up with multiple headers, clients can easily switch which one they track.
- In examples the header name could be a little bit longer for clarity, but in final testing, one might want to opt to make it as short as possible to minimise overhead.
Considering the above, let me know what you think.
This makes sense. Thanks for the explanation.
@eric846 please review and assign back to me once done.
source/client/stream_decoder.cc
if (absl::SimpleAtoi(timing_value, &origin_delta) && origin_delta >= 0) {
  origin_latency_statistic_.addValue(origin_delta);
} else {
  // TODO(XXX): Can we make sure we avoid high frequency logging for this somehow?
Punted to #484, which has a candidate PoC solution. I remember there are other spots where this would be good to have in the existing code base as well.
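For illustration, a dependency-free sketch of this parse-and-record step. Here std::from_chars stands in for absl::SimpleAtoi and a hypothetical Statistic type stands in for Nighthawk's streaming statistic; both substitutions are assumptions, not the PR's actual code.

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for Nighthawk's streaming statistic: just collects values.
struct Statistic {
  std::vector<uint64_t> values;
  void addValue(uint64_t v) { values.push_back(v); }
};

// Record the origin-timing header value only when the whole string is a
// well-formed non-negative integer, mirroring the SimpleAtoi + >= 0 guard.
bool recordOriginTiming(const std::string& header_value, Statistic& stat) {
  uint64_t origin_delta = 0;
  const char* end = header_value.data() + header_value.size();
  auto [ptr, ec] = std::from_chars(header_value.data(), end, origin_delta);
  if (ec != std::errc() || ptr != end) {
    return false;  // Malformed value; real code would log this (rate-limited, see #484).
  }
  stat.addValue(origin_delta);
  return true;
}
```

Because the target type is unsigned, values like "-1" fail to parse, which matches the intent of the `origin_delta >= 0` guard.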
// monotonically.
const Envoy::MonotonicTime new_time = time_source.monotonicTime();
const uint64_t elapsed = start_ == Envoy::MonotonicTime::min() ? 0 : (new_time - start_).count();
start_ = new_time;
Warning: Attempting to reason about concurrency...
Do you see any weird effects running with concurrency > 1 on the test server?
I was imagining a situation with 2 test server threads operating almost in lockstep, but 1ns apart. Suppose the requests come in evenly every 1ms. If the test server got stalled for a few milliseconds and the OS ended up buffering several packets, and then the test server threads started consuming them round robin, wouldn't the timing reports end up alternating 0, 1ns, 999999ns, 1ns, 999999ns, 1ns, ...?
This would actually represent the true response timing coming out of the test server, so the code in this PR would be working as intended.
Extreme statistics like that could give the impression that there is extreme distortion from the intermediary proxy, but it's entirely an artifact of the Nighthawk test server having two threads and stalling for a few milliseconds.
(Also the intermediary might also introduce its own equally extreme irregularities, since it could also be a multithreaded Envoy.)
The obvious way to avoid this issue is to use concurrency=1 on the test server. Then we would only be measuring the distortions from intermediaries and the OS.
Do you think there's any future reporting that could diagnose this issue? For example, if we also had exactly this PR but with thread-local stopwatches, we would see perfect evenness rather than 1-999999 fluctuation. We would know that the 1-999999 in the cross-thread stats was the fault of our own multithreading, and any timing deviations would be our OS or the intermediary.
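The alternation described in this scenario follows from simple arithmetic. The sketch below models two threads stamping one shared stopwatch, using the assumed numbers from the comment (1ms request period, 1ns thread skew), not measured data:

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

// Model: requests arrive in pairs every 1ms; thread A stamps first, thread B
// stamps 1ns later, and both stamps go through one shared last-timestamp, as
// a cross-thread stopwatch would.
std::vector<uint64_t> sharedStopwatchDeltas(int request_pairs) {
  const uint64_t kPeriodNs = 1000000;  // 1ms between pairs of arrivals
  const uint64_t kThreadSkewNs = 1;    // thread B trails thread A by 1ns
  std::vector<uint64_t> deltas;
  uint64_t last = 0;
  bool first = true;
  for (int i = 0; i < request_pairs; ++i) {
    const uint64_t a = static_cast<uint64_t>(i) * kPeriodNs;  // thread A's stamp
    const uint64_t b = a + kThreadSkewNs;                     // thread B's stamp
    for (uint64_t t : {a, b}) {
      deltas.push_back(first ? 0 : t - last);
      first = false;
      last = t;
    }
  }
  return deltas;
}
```

Running this for a few pairs yields 0, 1, 999999, 1, 999999, 1, ... nanoseconds: exactly the alternating pattern, even though each individual thread is perfectly even.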
Actually, maybe varying between 1ns and 1ms would be considered negligible distortion, especially if the intermediary added multiple milliseconds.
Your reasoning here makes sense to me. The intent of this is to be able to sanity check part of our timings, so if it would catch those extremes where we don't expect them, I think it would be providing value.
The way I see it is that this is going to be particularly useful in A/B tests, for example:
- across different clients
- with and without an intermediary
- with nighthawk client features that might perform relatively heavy lifting on cpu cores shared with the workers (e.g. stats flushing).
Some existing OSS clients can probably easily be tweaked to take these as inputs to their latency reports to figure out their request-release timing precisions.
Also, something like the delta between the expected and observed curves could perhaps serve as a means to quantify distortion. Indeed, I think the extremes in the scenario you described would be signal in this case, and not noise (looking at you, adaptive load controller :-)).
As for tracking thread-local values: I've been thinking about that, and I think it's pretty doable on the server side by adding a new Envoy histogram in the extension. On the client side, however, I think it will take more consideration and work, as there would be some feature gaps:
- being able to set multiple response headers for tracking, or a means to read multiple values from the header.
- if we want to be able to correlate thread IDs, we need to be able to tag values too, but this may not be necessary.
- a means to dynamically create histograms, or a histogram capable of grouping on a tag, would be needed for the Nighthawk client to handle it.
Afterthought: maybe, if we'd have the per worker histograms on the test server side, plus a means to read/sample these histograms on the NH client side and propagate these to the output, that might be a relatively low effort means to achieve this.
Last thought: per-worker tracking may also yield data that's interesting in light of exposing unbalancedness.
I filed #487 to get this on the radar and track this / serve as a point for further discussion.
LGTM modulo adding the comment
/retest
🔨 rebuilding
mum4k left a comment
LGTM
Thank you for moving this into an extension. I have a few documentation nits and one question: now that it is an extension, should we also add user documentation on how to enable and use it?
using HttpTimeTrackingFilterConfigSharedPtr = std::shared_ptr<HttpTimeTrackingFilterConfig>;
class HttpTimeTrackingFilter : public Envoy::Http::PassThroughFilter {
Can we document this class and its constructor also?
Handing over to @dubious90 due to my vacation.
Addressed @mum4k's comments in 70502e1. Marking this as ready for another round! With respect to user-facing documentation, I'm planning the following as one or more follow-ups to this, to keep PR size reasonable:

Opened #500 as a WIP follow-up to this.
dubious90 left a comment
Hi Otto, most of these changes look good. Have a few comments. Happy to discuss anything you disagree with or are confused by.
In the future, it would be easier for reviews if you split PRs like this into 2-3 smaller ones.
"thread_safe_monotonic_time_stopwatch.h",
],
repository = "@envoy",
visibility = ["//visibility:public"],
no changes required for this (not even sure if it's possible), but it would be nice if we could set this at the package level instead of on every single target in common/BUILD. But that doesn't seem to be supported by envoy_package.
I think this is a good point, I filed an issue tagged as tech-debt to track this: #502
/**
 * @param time_source used to obtain a sample of the current monotonic time.
 * @return uint64_t 0 on the first invocation, and the number of elapsed nanoseconds since the
In google-convention, we'd most likely try to use absl::duration here, rather than an int representing a unit of time. Would that be reasonable here, or no?
Well, I didn't consider absl::Duration (I didn't know it existed), but I did consider std::chrono::duration, and ended up leaning towards uint64_t because it:
- can't turn negative, which doesn't make sense here
- is less prone to some of the trickiness with conversions vs the underlying types I ran into over at Add TimerImpl::enableHRTimer - take two envoy#9229 (comment).
Having said that, I feel this is a subjective matter, so I'd be happy to change this if you lean towards absl::Duration after reading my considerations above.
That makes sense. I don't know if it's worth bringing in absl::Duration just for this. If we wanted to, we could try to bring in absl::Duration on a more global change later.
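The conversion concern raised in this exchange is easy to demonstrate; the helper below is purely illustrative and not from the codebase:

```cpp
#include <chrono>
#include <cstdint>

// The naive conversion from a chrono delta to an unsigned nanosecond count:
// chrono durations carry a signed representation, so a negative delta silently
// wraps to an enormous value. An API that only ever hands out uint64_t deltas
// from a monotonic clock cannot get into this state in the first place.
uint64_t naiveToUnsignedNs(std::chrono::nanoseconds d) {
  return static_cast<uint64_t>(d.count());
}
```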
@@ -0,0 +1,68 @@
#include "server/http_time_tracking_filter.h"
This directory is starting to grow. Would it make sense to create a separate PR that moves this and other filters all into a filters subdirectory?
Yes I think that would make sense. Alternatively, we could keep this directory just for extensions and their configuration, and re-home everything else that's in there so far: configuration.h/cc (helpers), well_known_headers.h (http header name definitions). Should I file an issue to track this?
An issue to track this would be great. I'm happy to defer to you on what makes the most sense. The filters subdirectory was just what made sense to me on a quick lookthrough.
@dubious90 Feedback partially addressed in 7774e2d
dubious90 left a comment
Thank you for the changes. Looks good
Changes were addressed. Mumak is no longer available or involved with this PR.
Adds a client option and wires it through in TCLAP. Amends tests and code to work with that instead of the hard-coded value. Adds an end-to-end test, and enables the test-server extension build. Follow-up to #477. Fixes #360: with this, the feature is ready to use.
Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>
TODO:
- [x] Land #477 and merge it in here
- [x] Wire up TCLAP
- [x] Tests for the new option parsing/handling
- [x] Enable build of the new time-tracking extension into nighthawk_test_server
- [x] End-to-end tests to prove the new client-side feature can be used to track latencies delivered by a response header emitted by the test server's new feature
- [x] CLI & proto description / comments, regen docs
- [x] Replace Stopwatch shared_ptr with unique_ptr
This adds a rough cut of the capability to the test server to allow tracking of the deltas between
request arrival times, and to echo these back under a configurable response header name.
Subsequently, the nighthawk_client gets a feature to track numbers received via an arbitrary
header in a histogram. Right now it's hard-coded to track x-nh-do-not-use-origin-timings,
but the plan is to allow at least one configurable header name to track.
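As a rough model of the server-side behavior described above (the class name, method name, and the plain std::map of headers are illustrative assumptions, not the extension's real API):

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Each incoming request gets the nanoseconds elapsed since the previous request
// echoed back in a configurable response header (0 for the very first request).
class OriginTimingEchoSketch {
public:
  explicit OriginTimingEchoSketch(std::string header_name)
      : header_name_(std::move(header_name)) {}

  void onRequest(std::chrono::nanoseconds arrival,
                 std::map<std::string, std::string>& response_headers) {
    const uint64_t delta =
        seen_any_ ? static_cast<uint64_t>((arrival - last_arrival_).count()) : 0;
    seen_any_ = true;
    last_arrival_ = arrival;
    response_headers[header_name_] = std::to_string(delta);
  }

private:
  const std::string header_name_;
  bool seen_any_{false};
  std::chrono::nanoseconds last_arrival_{0};
};
```

The client side then only needs to parse that header value back into an integer and feed it to a histogram.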
Combined, these test-server and client features enable obtaining insight into the origin-side point of
view regarding arrival timings [1]. The direct use case is manual inspection / sanity checking, but
divergence from what would be expected based on client-side configuration could possibly be used
as a way to quantify "distortion" introduced by an intermediary proxy.
In a follow-up, documentation, configuration of the client to specify the response header name that
should be used to track latency, and end-to-end tests will be added.
When setting up a test between the Nighthawk client and test server using this feature, a new histogram
will be sent to the output. On my machine, a simple 1000qps test yields the following perspective from
the test server when it comes to deltas between inbound requests: