metrics: add request_handle_us histogram#294
Conversation
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Unlike an `Arc`, we don't actually need to synchronize on any work besides the actual decrement, since everything guarded by the ref-count is only touched atomically anyway.
lib/metrics/src/latency.rs
```rust
use std::convert::TryInto;

self.0.as_micros().try_into().unwrap_or_else(|_| {
    // These measurements should never be long enough to overflow
    // warn!("Duration::as_micros would overflow u64");
```
Looks like this should be uncommented
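As a minimal, self-contained sketch of the conversion pattern under review (the function name and the `u64::MAX` fallback value are assumptions for illustration; the PR does this inline on a newtype, with the proxy's `warn!` macro rather than `eprintln!`):

```rust
use std::convert::TryInto;
use std::time::Duration;

/// Convert a duration to whole microseconds, saturating at u64::MAX.
/// (Hypothetical free function standing in for the method in the diff.)
fn as_micros_u64(d: Duration) -> u64 {
    d.as_micros().try_into().unwrap_or_else(|_| {
        // These measurements should never be long enough to overflow,
        // but warn rather than panic if one somehow does.
        eprintln!("Duration::as_micros would overflow u64");
        u64::MAX
    })
}
```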
```rust
struct Shared {
    // NOTE: this is inside a `Mutex` since recording a latency requires a mutable
    // reference to the histogram. In the future, we could consider making the
    // histogram counters `AtomicU64, so that the histogram could be updated
```
Missing the closing ` on AtomicU64
```rust
}

#[cold]
#[inline(never)]
```
I have an idea of what this attribute does, but what about this function made you add it?
#[cold] is a hint to the compiler that it should use a calling convention that improves performance at the callsite under the assumption that a function is called infrequently: https://llvm.org/docs/LangRef.html#calling-conventions
in retrospect, I realise that #[cold] implies that a function should not be inlined, so I think the #[inline(never)] should be removed.
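To illustrate the pattern being discussed, here is a small hypothetical example (the `Cache` type and its methods are invented for illustration): the rarely-taken slow path is split into its own `#[cold]` function, which, per the conclusion above, makes an extra `#[inline(never)]` redundant.

```rust
/// A hypothetical container with a cold slow path.
struct Cache {
    entries: Vec<u64>,
}

impl Cache {
    /// Hot path: in the common case the index is already in range.
    fn get_or_grow(&mut self, idx: usize) -> u64 {
        if idx >= self.entries.len() {
            // Expected to run rarely; kept out of line via #[cold].
            self.grow(idx + 1);
        }
        self.entries[idx]
    }

    // #[cold] hints that this is called infrequently, so the compiler
    // can optimize the callsite for the in-range case.
    #[cold]
    fn grow(&mut self, len: usize) {
        self.entries.resize(len, 0);
    }
}
```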
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@kleimkuhler 30966f4 makes some naming changes & adds more comments, let me know if that clears up some of the stuff you had questions about?

@hawkw Those comments and change in naming are helpful. Thanks for doing that!
```rust
// ===== impl Scope =====

impl Scope {
    pub fn new() -> Self {
```
nit: i tend to prefer the module fn pattern of, e.g. pub fn handle_time() -> (Scope, InsertLayer) -- using the scope to produce layers is fine, but also doesn't really seem necessary (since the layer should be clone)...
```rust
pub fn inbound(&self) -> handle_time::Scope {
    self.inbound.clone()
}
```
Similarly, I'd be inclined to make a constructor that returns a 3-tuple rather than to hide the scopes in this struct.
I'd be happy to change all this code to return tuples, but I'm not sure if it's the right thing in this specific case. I think using tuples as return types is fine when the types in the tuple are heterogeneous (the compiler will notice if you try to call a Sender method on a Receiver, for example), but I'm a little leery of relying on tuple ordering to differentiate two different instances of the same type, like inbound/outbound.
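A hypothetical sketch of the two constructor shapes being debated (`Scope` here is a stand-in type, not the real one from the PR): with a tuple return, inbound and outbound are distinguished only by position, whereas named accessors carry the direction in the name.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Scope(&'static str);

// Tuple-returning constructor: fine for heterogeneous types, but two
// values of the same type are told apart only by their position.
fn scopes_tuple() -> (Scope, Scope) {
    (Scope("inbound"), Scope("outbound"))
}

// Struct with named accessors: the direction is part of the API.
struct Scopes {
    inbound: Scope,
    outbound: Scope,
}

impl Scopes {
    fn new() -> Self {
        Scopes {
            inbound: Scope("inbound"),
            outbound: Scope("outbound"),
        }
    }
    fn inbound(&self) -> Scope {
        self.inbound.clone()
    }
    fn outbound(&self) -> Scope {
        self.outbound.clone()
    }
}
```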
```rust
}

#[cold]
#[inline(never)]
// Slow path: if there are no free recorders in the
// slab, extend it (acquiring a write lock temporarily).
self.grow();
self.recorders.read().unwrap()
```
if we unwrap here, why not unwrap above (where you use ok())?
The ok above could also be an unwrap; the reason it's ok() is because we are optionally returning a recorder from the filter, and the ok() lets us nicely chain the option, without the ok() the code is a bit more complex. I can make this consistent if that's preferred.
In general, I find the filter harder to read, and it's confusing because it at first seems like we're graceful to the lock being poisoned but, in fact, we just panic in the next access in that case.

```rust
let sz = self.recorders.read().unwrap().len();
if sz < idx {
    self.grow();
}
self.recorders.read().unwrap()
```

is clearer about what it's doing, imo.
The intention was to avoid reacquiring the read lock in the fast path. The code you posted is definitely more readable but will always acquire the read lock twice, even when we don't need to acquire the write lock. It is entirely possible that the performance hit from reacquiring it is not significant enough to justify making the code less clear, though.
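A self-contained sketch of the trade-off described here, using invented names (`Slab`, `slots`, `get`): the common case takes the read lock exactly once; only when the index is out of range does it drop the read guard, grow under the write lock, and reacquire.

```rust
use std::sync::RwLock;

/// Hypothetical slab guarded by an RwLock.
struct Slab {
    slots: RwLock<Vec<u64>>,
}

impl Slab {
    fn get(&self, idx: usize) -> u64 {
        {
            let slots = self.slots.read().unwrap();
            if let Some(&v) = slots.get(idx) {
                return v; // fast path: a single read-lock acquisition
            }
        } // read guard dropped here, before the write lock is taken
        self.grow(idx + 1);
        self.slots.read().unwrap()[idx]
    }

    #[cold]
    fn grow(&self, len: usize) {
        // Slow path: extend the slab, holding the write lock temporarily.
        self.slots.write().unwrap().resize(len, 0);
    }
}
```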
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
kleimkuhler left a comment
@hawkw There is a comment above about #[inline] I'm still curious about, and a few nits to clean up comments if you feel like grouping those into any additional commits, but otherwise this looks good to me!
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
olix0r left a comment
awesome!
```
:; linkerd metrics -n emojivoto po/emoji-84c4946fc5-hf6gk | grep handle
# HELP request_handle_us A histogram of the time in microseconds between when a request is received and when it is sent upstream.
# TYPE request_handle_us histogram
request_handle_us_bucket{direction="inbound",le="1"} 0
request_handle_us_bucket{direction="inbound",le="2"} 0
request_handle_us_bucket{direction="inbound",le="3"} 0
request_handle_us_bucket{direction="inbound",le="4"} 0
request_handle_us_bucket{direction="inbound",le="5"} 0
request_handle_us_bucket{direction="inbound",le="10"} 0
request_handle_us_bucket{direction="inbound",le="20"} 0
request_handle_us_bucket{direction="inbound",le="30"} 0
request_handle_us_bucket{direction="inbound",le="40"} 0
request_handle_us_bucket{direction="inbound",le="50"} 0
request_handle_us_bucket{direction="inbound",le="100"} 46
request_handle_us_bucket{direction="inbound",le="200"} 144
request_handle_us_bucket{direction="inbound",le="300"} 148
request_handle_us_bucket{direction="inbound",le="400"} 148
request_handle_us_bucket{direction="inbound",le="500"} 148
request_handle_us_bucket{direction="inbound",le="1000"} 149
request_handle_us_bucket{direction="inbound",le="2000"} 149
request_handle_us_bucket{direction="inbound",le="3000"} 150
request_handle_us_bucket{direction="inbound",le="4000"} 150
request_handle_us_bucket{direction="inbound",le="5000"} 150
request_handle_us_bucket{direction="inbound",le="10000"} 150
request_handle_us_bucket{direction="inbound",le="20000"} 150
request_handle_us_bucket{direction="inbound",le="30000"} 150
request_handle_us_bucket{direction="inbound",le="40000"} 150
request_handle_us_bucket{direction="inbound",le="50000"} 150
request_handle_us_bucket{direction="inbound",le="+Inf"} 150
request_handle_us_count{direction="inbound"} 150
request_handle_us_sum{direction="inbound"} 21520
```
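For reference, the `_sum` and `_count` series in that scrape imply a mean handle time of 21520 / 150 ≈ 143µs:

```rust
// Mean of a Prometheus histogram sample, from its _sum and _count series.
fn mean_us(sum_us: u64, count: u64) -> f64 {
    sum_us as f64 / count as f64
}
```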
will merge this pending CI
* Update h2 to v0.1.26
* Properly fall back in the dst_router (linkerd/linkerd2-proxy#291)
* Tap server authorizes clients when identity is expected (linkerd/linkerd2-proxy#290)
* update-rust-version: Check usage (linkerd/linkerd2-proxy#298)
* tap: fix tap response streams never ending (linkerd/linkerd2-proxy#299)
* Require identity on tap requests (linkerd/linkerd2-proxy#295)
* Authority label should reflect logical dst (linkerd/linkerd2-proxy#300)
* Replace futures_watch with tokio::sync::watch (linkerd/linkerd2-proxy#301)
* metrics: add `request_handle_us` histogram (linkerd/linkerd2-proxy#294)
* linkerd2-proxy: Adopt Rust 2018 (linkerd/linkerd2-proxy#302)
* Remove futures-mpsc-lossy (linkerd/linkerd2-proxy#305)
* Adopt std::convert::TryFrom (linkerd/linkerd2-proxy#304)
* lib: Rename directories to match crate names (linkerd/linkerd2-proxy#303)
As initially described in linkerd/linkerd2#730, there's a general need
to understand the proxy's latency overhead. We can begin by recording
the time a request spends in the proxy.
linkerd/linkerd2#3098 proposes adding a new `request_handle_us` histogram to the proxy with only a `direction` label. This histogram should store the elapsed time (in microseconds) from the moment a request reaches the source stack until the request is dropped.
This branch adds the `request_handle_us` histogram to the proxy's metrics. Handle time is recorded using a `Tracker` type which is inserted into each request's extensions map by a new layer at the top of the stack. Each tracker has a reference count, and when requests are cloned for retries, the tracker is cloned into the new request as well, incrementing the reference count. The reference count is decremented when trackers are dropped, and when it reaches zero, the elapsed time is recorded.
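The mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: it substitutes `Arc` drop semantics for the hand-rolled reference count, and a `Vec` behind a `Mutex` for the histogram; all names are invented.

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Clones share one inner handle; when the last clone is dropped,
/// the elapsed time since creation is recorded.
#[derive(Clone)]
struct Tracker {
    inner: Arc<Inner>,
}

struct Inner {
    start: Instant,
    recorded_us: Arc<Mutex<Vec<u64>>>, // stand-in for the histogram
}

impl Tracker {
    fn new(sink: Arc<Mutex<Vec<u64>>>) -> Self {
        Tracker {
            inner: Arc::new(Inner {
                start: Instant::now(),
                recorded_us: sink,
            }),
        }
    }
}

impl Drop for Inner {
    // Runs exactly once, when the reference count reaches zero.
    fn drop(&mut self) {
        let micros = self.start.elapsed().as_micros() as u64;
        self.recorded_us.lock().unwrap().push(micros);
    }
}
```

Cloning a `Tracker` (e.g. for a retried request) bumps the shared count, so the measurement covers the lifetime of the last surviving clone.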
Since this depends on a time measurement that's internal to the proxy,
it's difficult to write reliable tests for it. Unlike the latency
histograms, we cannot use a test service that waits for a period of time
to ensure that the recorded latencies have a lower bound. Instead, I did
some manual verification that the new metric is present in the proxy's
metrics endpoint, that its total count is equal to the number of
requests I've sent through the proxy, and that the handle time
measurements appear to be less than the total request latency
measurements.
Closes linkerd/linkerd2#3098
Signed-off-by: Eliza Weisman <eliza@buoyant.io>