metrics: add request_handle_us histogram#294

Merged
hawkw merged 13 commits into master from eliza/handle-time
Aug 8, 2019
Conversation

@hawkw
Contributor

@hawkw hawkw commented Jul 30, 2019

As initially described in linkerd/linkerd2#730, there's a general need
to understand the proxy's latency overhead. We can begin by recording
the time a request spends in the proxy.

linkerd/linkerd2#3098 proposes adding a new request_handle_us
histogram to the proxy with only a direction label. This histogram
should store the elapsed time (in microseconds) from the moment a
request reaches the source stack until the request is dropped.

This branch adds the request_handle_us histogram to the proxy's
metrics. Handle time is recorded using a Tracker type which is
inserted into each request's extensions map by a new layer at the
top of the stack. Each tracker has a reference count, and when
requests are cloned for retries, the tracker is cloned into the new
request as well, incrementing the reference count. The reference
count is decremented when trackers are dropped, and when it reaches
zero, the elapsed time is recorded.
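The reference-counting scheme described above can be sketched roughly as follows. This is illustrative only; the type and field names are hypothetical rather than the proxy's actual implementation, and a `Vec` behind a `Mutex` stands in for the histogram:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Instant;

// Hypothetical sketch of the shared tracker state.
struct Shared {
    t0: Instant,
    refs: AtomicUsize,
    // Stand-in for the handle-time histogram: recorded durations in us.
    recorded: Arc<Mutex<Vec<u64>>>,
}

struct Tracker(Arc<Shared>);

impl Tracker {
    fn new(recorded: Arc<Mutex<Vec<u64>>>) -> Self {
        Tracker(Arc::new(Shared {
            t0: Instant::now(),
            refs: AtomicUsize::new(1),
            recorded,
        }))
    }
}

impl Clone for Tracker {
    fn clone(&self) -> Self {
        // Cloning the request for a retry bumps the reference count.
        self.0.refs.fetch_add(1, Ordering::Relaxed);
        Tracker(self.0.clone())
    }
}

impl Drop for Tracker {
    fn drop(&mut self) {
        // Only the drop of the final clone records the elapsed time.
        if self.0.refs.fetch_sub(1, Ordering::Release) == 1 {
            fence(Ordering::Acquire);
            let us = self.0.t0.elapsed().as_micros() as u64;
            self.0.recorded.lock().unwrap().push(us);
        }
    }
}

fn main() {
    let recorded = Arc::new(Mutex::new(Vec::new()));
    let t = Tracker::new(recorded.clone());
    let retry = t.clone();
    drop(t);
    // A clone is still live, so nothing has been recorded yet.
    assert!(recorded.lock().unwrap().is_empty());
    drop(retry);
    // The last drop records exactly one measurement.
    assert_eq!(recorded.lock().unwrap().len(), 1);
}
```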

Since this depends on a time measurement that's internal to the proxy,
it's difficult to write reliable tests for it. Unlike the latency
histograms, we cannot use a test service that waits for a period of time
to ensure that the recorded latencies have a lower bound. Instead, I did
some manual verification that the new metric is present in the proxy's
metrics endpoint, that its total count is equal to the number of
requests I've sent through the proxy, and that the handle time
measurements appear to be less than the total request latency
measurements.

Closes linkerd/linkerd2#3098

Signed-off-by: Eliza Weisman eliza@buoyant.io

hawkw added 8 commits July 29, 2019 17:11
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
unlike an Arc, we don't actually need to synchronize on any work besides the decrement itself, since everything guarded by the ref-count is already done atomically anyway.
@hawkw hawkw requested review from adleong, kleimkuhler and olix0r July 30, 2019 22:24
@hawkw hawkw self-assigned this Jul 30, 2019
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
use std::convert::TryInto;
self.0.as_micros().try_into().unwrap_or_else(|_| {
// These measurements should never be long enough to overflow
// warn!("Duration::as_micros would overflow u64");
Contributor

Looks like this should be uncommented

struct Shared {
// NOTE: this is inside a `Mutex` since recording a latency requires a mutable
// reference to the histogram. In the future, we could consider making the
// histogram counters `AtomicU64, so that the histogram could be updated
Contributor

Missing the closing ` on AtomicU64

}

#[cold]
#[inline(never)]
Contributor

I have an idea of what this attribute does, but what about this function made you add it?

Member

I'm also curious.

Contributor Author

#[cold] is a hint to the compiler that it should use a calling convention that improves performance at the callsite under the assumption that a function is called infrequently: https://llvm.org/docs/LangRef.html#calling-conventions

in retrospect, I realise that #[cold] implies that a function should not be inlined, so I think the #[inline(never)] should be removed.
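For context, the effect of #[cold] can be illustrated with a small standalone sketch (hypothetical function names, not code from this branch): the attribute marks the grow path as one the optimizer should assume is rarely taken, keeping the fast path tight, and it already discourages inlining on its own.

```rust
// Hypothetical example: a rarely-taken slow path marked #[cold].
#[cold]
fn grow_slow(v: &mut Vec<u64>) {
    // Reserve additional capacity; only hit when the Vec is full.
    v.reserve(v.capacity().max(4));
}

fn push_fast(v: &mut Vec<u64>, x: u64) {
    if v.len() == v.capacity() {
        // The compiler lays this callsite out as the unlikely branch.
        grow_slow(v);
    }
    v.push(x);
}

fn main() {
    let mut v = Vec::new();
    for i in 0..100u64 {
        push_fast(&mut v, i);
    }
    assert_eq!(v.len(), 100);
}
```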

hawkw added 2 commits July 31, 2019 08:08
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw
Contributor Author

hawkw commented Jul 31, 2019

@kleimkuhler 30966f4 makes some naming changes and adds more comments; let me know if that clears up the things you had questions about?

@kleimkuhler
Contributor

@hawkw Those comments and change in naming are helpful. Thanks for doing that!

// ===== impl Scope =====

impl Scope {
pub fn new() -> Self {
Member

@olix0r olix0r Aug 6, 2019

nit: i tend to prefer the module fn pattern of, e.g. pub fn handle_time() -> (Scope, InsertLayer) -- using the scope to produce layers is fine, but also doesn't really seem necessary (since the layer should be clone)...


pub fn inbound(&self) -> handle_time::Scope {
self.inbound.clone()
}
Member

Similarly, I'd be inclined to make a constructor that returns a 3-tuple rather than to hide the scopes in this struct.

Contributor Author

I'd be happy to change all this code to return tuples, but I'm not sure if it's the right thing in this specific case. I think using tuples as return types is fine when the types in the tuple are heterogeneous (the compiler will notice if you try to call a Sender method on a Receiver, for example), but I'm a little leery of relying on tuple ordering to differentiate two different instances of the same type, like inbound/outbound.
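The tuple-ordering concern can be made concrete with a small sketch (hypothetical names, not the actual proxy API): with two values of the same type, a tuple return is distinguished only by position, while named fields keep the direction explicit at the use site.

```rust
// Two scopes of the same type: nothing but position tells them apart.
#[derive(Clone, Debug, PartialEq)]
struct Scope(&'static str);

// Tuple style: a caller who swaps the bindings gets no compiler error.
fn handle_time_tuple() -> (Scope, Scope) {
    (Scope("inbound"), Scope("outbound"))
}

// Named-field style: the direction is explicit wherever it's used.
struct HandleTime {
    inbound: Scope,
    outbound: Scope,
}

fn handle_time_named() -> HandleTime {
    HandleTime {
        inbound: Scope("inbound"),
        outbound: Scope("outbound"),
    }
}

fn main() {
    let (inbound, outbound) = handle_time_tuple();
    let named = handle_time_named();
    assert_eq!(inbound, named.inbound);
    assert_eq!(outbound, named.outbound);
}
```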

}

#[cold]
#[inline(never)]
Member

I'm also curious.

// Slow path: if there are no free recorders in the
// slab, extend it (acquiring a write lock temporarily).
self.grow();
self.recorders.read().unwrap()
Member

if we unwrap here, why not unwrap above (where you use ok())?

Contributor Author

The ok() above could also be an unwrap; it's ok() because we are optionally returning a recorder from the filter, and the ok() lets us chain the option nicely; without it the code is a bit more complex. I can make this consistent if that's preferred.

Member

In general, I find the filter harder to read, and it's confusing because it at first seems like we're graceful to the lock being poisoned but, in fact, we just panic in the next access in that case.

let sz = self.recorders.read().unwrap().len();
if sz < idx {
  self.grow();
}
self.recorders.read().unwrap()

Is clearer about what it's doing, imo

Contributor Author

The intention was to avoid reacquiring the read lock in the fast path. The code you posted is definitely more readable but will always acquire the read lock twice, even when we don't need to acquire the write lock. It is entirely possible that the performance hit from reacquiring it is not significant enough to justify making the code less clear, though.
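The fast-path/slow-path tradeoff being discussed can be sketched like this (a simplified stand-in: a `Vec<u64>` in place of the recorder slab, with hypothetical names). The fast path takes the read lock once; only a missing slot pays for the write lock and a second read.

```rust
use std::sync::RwLock;

// Simplified stand-in for the recorder slab.
struct Slab {
    recorders: RwLock<Vec<u64>>,
}

impl Slab {
    fn get_or_grow(&self, idx: usize) -> u64 {
        // Fast path: a single read-lock acquisition when the slot exists.
        if let Some(&v) = self.recorders.read().unwrap().get(idx) {
            return v;
        }
        // Slow path: take the write lock briefly to extend the slab.
        {
            let mut w = self.recorders.write().unwrap();
            while w.len() <= idx {
                w.push(0);
            }
        }
        // Re-acquire the read lock only on this (rare) path.
        self.recorders.read().unwrap()[idx]
    }
}

fn main() {
    let slab = Slab {
        recorders: RwLock::new(vec![7]),
    };
    assert_eq!(slab.get_or_grow(0), 7); // fast path
    assert_eq!(slab.get_or_grow(5), 0); // slow path grows the slab
    assert_eq!(slab.recorders.read().unwrap().len(), 6);
}
```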

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Contributor

@kleimkuhler kleimkuhler left a comment

@hawkw There is a comment above about #[inline] that I'm still curious about, and a few nits to clean up comments if you feel like grouping those into any additional commits, but otherwise this looks good to me!

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Member

@olix0r olix0r left a comment

awesome!

:; linkerd metrics -n emojivoto po/emoji-84c4946fc5-hf6gk | grep handle
# HELP request_handle_us A histogram of the time in microseconds between when a request is received and when it is sent upstream.
# TYPE request_handle_us histogram
request_handle_us_bucket{direction="inbound",le="1"} 0
request_handle_us_bucket{direction="inbound",le="2"} 0
request_handle_us_bucket{direction="inbound",le="3"} 0
request_handle_us_bucket{direction="inbound",le="4"} 0
request_handle_us_bucket{direction="inbound",le="5"} 0
request_handle_us_bucket{direction="inbound",le="10"} 0
request_handle_us_bucket{direction="inbound",le="20"} 0
request_handle_us_bucket{direction="inbound",le="30"} 0
request_handle_us_bucket{direction="inbound",le="40"} 0
request_handle_us_bucket{direction="inbound",le="50"} 0
request_handle_us_bucket{direction="inbound",le="100"} 46
request_handle_us_bucket{direction="inbound",le="200"} 144
request_handle_us_bucket{direction="inbound",le="300"} 148
request_handle_us_bucket{direction="inbound",le="400"} 148
request_handle_us_bucket{direction="inbound",le="500"} 148
request_handle_us_bucket{direction="inbound",le="1000"} 149
request_handle_us_bucket{direction="inbound",le="2000"} 149
request_handle_us_bucket{direction="inbound",le="3000"} 150
request_handle_us_bucket{direction="inbound",le="4000"} 150
request_handle_us_bucket{direction="inbound",le="5000"} 150
request_handle_us_bucket{direction="inbound",le="10000"} 150
request_handle_us_bucket{direction="inbound",le="20000"} 150
request_handle_us_bucket{direction="inbound",le="30000"} 150
request_handle_us_bucket{direction="inbound",le="40000"} 150
request_handle_us_bucket{direction="inbound",le="50000"} 150
request_handle_us_bucket{direction="inbound",le="+Inf"} 150
request_handle_us_count{direction="inbound"} 150
request_handle_us_sum{direction="inbound"} 21520
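As an aside, the sum and count in the scrape above allow a quick mean-handle-time sanity check (an illustrative calculation, not part of the PR):

```rust
fn main() {
    // From the scrape above: request_handle_us_sum and _count (inbound).
    let sum_us: f64 = 21520.0;
    let count: f64 = 150.0;
    let mean_us = sum_us / count;
    // ~143.5us mean, consistent with most samples landing between the
    // le="100" and le="200" buckets.
    assert!((mean_us - 143.47).abs() < 0.05);
    println!("mean handle time: {:.2} us", mean_us);
}
```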

@hawkw
Contributor Author

hawkw commented Aug 8, 2019

will merge this pending CI

@hawkw hawkw merged commit faa7be6 into master Aug 8, 2019
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Aug 9, 2019
* Update h2 to v0.1.26
* Properly fall back in the dst_router (linkerd/linkerd2-proxy#291)
* Tap server authorizes clients when identity is expected (linkerd/linkerd2-proxy#290)
* update-rust-version: Check usage (linkerd/linkerd2-proxy#298)
* tap: fix tap response streams never ending (linkerd/linkerd2-proxy#299)
* Require identity on tap requests (linkerd/linkerd2-proxy#295)
* Authority label should reflect logical dst (linkerd/linkerd2-proxy#300)
* Replace futures_watch with tokio::sync::watch (linkerd/linkerd2-proxy#301)
* metrics: add `request_handle_us` histogram (linkerd/linkerd2-proxy#294)
* linkerd2-proxy: Adopt Rust 2018 (linkerd/linkerd2-proxy#302)
* Remove futures-mpsc-lossy (linkerd/linkerd2-proxy#305)
* Adopt std::convert::TryFrom (linkerd/linkerd2-proxy#304)
* lib: Rename directories to match crate names (linkerd/linkerd2-proxy#303)
@olix0r olix0r deleted the eliza/handle-time branch August 17, 2019 01:33

Successfully merging this pull request may close these issues.

proxy: Add request_handle_us histogram