metrics: add request_handle_us histogram#294

Merged
hawkw merged 13 commits into master from eliza/handle-time
Aug 8, 2019
Conversation

@hawkw
Contributor

@hawkw hawkw commented Jul 30, 2019

As initially described in linkerd/linkerd2#730, there's a general need
to understand the proxy's latency overhead. We can begin by recording
the time a request spends in the proxy.

linkerd/linkerd2#3098 proposes adding a new request_handle_us
histogram to the proxy with only a direction label. This histogram
should store the elapsed time (in microseconds) from the moment a
request reaches the source stack until the request is dropped.

This branch adds the request_handle_us histogram to the proxy's
metrics. Handle time is recorded using a Tracker type which is
inserted into each request's extensions map by a new layer at the
top of the stack. Each tracker has a reference count, and when
requests are cloned for retries, the tracker is cloned into the new
request as well, incrementing the reference count. The reference
count is decremented when trackers are dropped, and when it reaches
zero, the elapsed time is recorded.
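The reference-counting scheme described above can be sketched roughly as follows. This is illustrative only; the type and field names are hypothetical rather than the proxy's actual implementation, and a `Vec` behind a `Mutex` stands in for the histogram:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Instant;

// Hypothetical sketch of the shared tracker state.
struct Shared {
    t0: Instant,
    refs: AtomicUsize,
    // Stand-in for the handle-time histogram: recorded durations in us.
    recorded: Arc<Mutex<Vec<u64>>>,
}

struct Tracker(Arc<Shared>);

impl Tracker {
    fn new(recorded: Arc<Mutex<Vec<u64>>>) -> Self {
        Tracker(Arc::new(Shared {
            t0: Instant::now(),
            refs: AtomicUsize::new(1),
            recorded,
        }))
    }
}

impl Clone for Tracker {
    fn clone(&self) -> Self {
        // Cloning the request for a retry bumps the reference count.
        self.0.refs.fetch_add(1, Ordering::Relaxed);
        Tracker(self.0.clone())
    }
}

impl Drop for Tracker {
    fn drop(&mut self) {
        // Only the drop of the final clone records the elapsed time.
        if self.0.refs.fetch_sub(1, Ordering::Release) == 1 {
            fence(Ordering::Acquire);
            let us = self.0.t0.elapsed().as_micros() as u64;
            self.0.recorded.lock().unwrap().push(us);
        }
    }
}

fn main() {
    let recorded = Arc::new(Mutex::new(Vec::new()));
    let t = Tracker::new(recorded.clone());
    let retry = t.clone();
    drop(t);
    // A clone is still live, so nothing has been recorded yet.
    assert!(recorded.lock().unwrap().is_empty());
    drop(retry);
    // The last drop records exactly one measurement.
    assert_eq!(recorded.lock().unwrap().len(), 1);
}
```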

Since this depends on a time measurement that's internal to the proxy,
it's difficult to write reliable tests for it. Unlike the latency
histograms, we cannot use a test service that waits for a period of time
to ensure that the recorded latencies have a lower bound. Instead, I did
some manual verification that the new metric is present in the proxy's
metrics endpoint, that its total count is equal to the number of
requests I've sent through the proxy, and that the handle time
measurements appear to be less than the total request latency
measurements.

Closes linkerd/linkerd2#3098

Signed-off-by: Eliza Weisman eliza@buoyant.io

hawkw added 8 commits July 29, 2019 17:11
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
unlike an Arc, we don't actually need to synchronize on any work besides the decrement itself, since everything guarded by the ref-count is already done atomically anyway.
@hawkw hawkw requested review from adleong, kleimkuhler and olix0r July 30, 2019 22:24
@hawkw hawkw self-assigned this Jul 30, 2019
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
use std::convert::TryInto;
self.0.as_micros().try_into().unwrap_or_else(|_| {
// These measurements should never be long enough to overflow
// warn!("Duration::as_micros would overflow u64");
Contributor

Looks like this should be uncommented

struct Shared {
// NOTE: this is inside a `Mutex` since recording a latency requires a mutable
// reference to the histogram. In the future, we could consider making the
// histogram counters `AtomicU64, so that the histogram could be updated
Contributor

Missing the closing ` on AtomicU64

}

#[cold]
#[inline(never)]
Contributor

I have an idea of what this attribute does, but what about this function made you add it?

Member

I'm also curious.

Contributor Author

#[cold] is a hint to the compiler that it should use a calling convention that improves performance at the callsite under the assumption that a function is called infrequently: https://llvm.org/docs/LangRef.html#calling-conventions

in retrospect, I realise that #[cold] implies that a function should not be inlined, so I think the #[inline(never)] should be removed.
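For context, the effect of #[cold] can be illustrated with a small standalone sketch (hypothetical function names, not code from this branch): the attribute marks the grow path as one the optimizer should assume is rarely taken, keeping the fast path tight, and it already discourages inlining on its own.

```rust
// Hypothetical example: a rarely-taken slow path marked #[cold].
#[cold]
fn grow_slow(v: &mut Vec<u64>) {
    // Reserve additional capacity; only hit when the Vec is full.
    v.reserve(v.capacity().max(4));
}

fn push_fast(v: &mut Vec<u64>, x: u64) {
    if v.len() == v.capacity() {
        // The compiler lays this callsite out as the unlikely branch.
        grow_slow(v);
    }
    v.push(x);
}

fn main() {
    let mut v = Vec::new();
    for i in 0..100u64 {
        push_fast(&mut v, i);
    }
    assert_eq!(v.len(), 100);
}
```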

hawkw added 2 commits July 31, 2019 08:08
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw
Contributor Author

hawkw commented Jul 31, 2019

@kleimkuhler 30966f4 makes some naming changes and adds more comments; let me know if that clears up the things you had questions about?

@kleimkuhler
Contributor

@hawkw Those comments and change in naming are helpful. Thanks for doing that!

// ===== impl Scope =====

impl Scope {
pub fn new() -> Self {
Member

@olix0r olix0r Aug 6, 2019

nit: i tend to prefer the module fn pattern of, e.g. pub fn handle_time() -> (Scope, InsertLayer) -- using the scope to produce layers is fine, but also doesn't really seem necessary (since the layer should be clone)...


pub fn inbound(&self) -> handle_time::Scope {
self.inbound.clone()
}
Member

Similarly, I'd be inclined to make a constructor that returns a 3-tuple rather than to hide the scopes in this struct.

Contributor Author

I'd be happy to change all this code to return tuples, but I'm not sure if it's the right thing in this specific case. I think using tuples as return types is fine when the types in the tuple are heterogeneous (the compiler will notice if you try to call a Sender method on a Receiver, for example), but I'm a little leery of relying on tuple ordering to differentiate two different instances of the same type, like inbound/outbound.
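The tuple-ordering concern can be made concrete with a small sketch (hypothetical names, not the actual proxy API): with two values of the same type, a tuple return is distinguished only by position, while named fields keep the direction explicit at the use site.

```rust
// Two scopes of the same type: nothing but position tells them apart.
#[derive(Clone, Debug, PartialEq)]
struct Scope(&'static str);

// Tuple style: a caller who swaps the bindings gets no compiler error.
fn handle_time_tuple() -> (Scope, Scope) {
    (Scope("inbound"), Scope("outbound"))
}

// Named-field style: the direction is explicit wherever it's used.
struct HandleTime {
    inbound: Scope,
    outbound: Scope,
}

fn handle_time_named() -> HandleTime {
    HandleTime {
        inbound: Scope("inbound"),
        outbound: Scope("outbound"),
    }
}

fn main() {
    let (inbound, outbound) = handle_time_tuple();
    let named = handle_time_named();
    assert_eq!(inbound, named.inbound);
    assert_eq!(outbound, named.outbound);
}
```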

}

#[cold]
#[inline(never)]
Member

I'm also curious.

// Slow path: if there are no free recorders in the
// slab, extend it (acquiring a write lock temporarily).
self.grow();
self.recorders.read().unwrap()
Member

if we unwrap here, why not unwrap above (where you use ok())?

Contributor Author

The ok() above could also be an unwrap; it's ok() because we are optionally returning a recorder from the filter, and the ok() lets us chain the option nicely; without it the code is a bit more complex. I can make this consistent if that's preferred.

Member

In general, I find the filter harder to read, and it's confusing because it at first seems like we're graceful to the lock being poisoned but, in fact, we just panic in the next access in that case.

let sz = self.recorders.read().unwrap().len();
if sz < idx {
  self.grow();
}
self.recorders.read().unwrap()

Is clearer about what it's doing, imo

Contributor Author

The intention was to avoid reacquiring the read lock in the fast path. The code you posted is definitely more readable but will always acquire the read lock twice, even when we don't need to acquire the write lock. It is entirely possible that the performance hit from reacquiring it is not significant enough to justify making the code less clear, though.
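The fast-path/slow-path tradeoff being discussed can be sketched like this (a simplified stand-in: a `Vec<u64>` in place of the recorder slab, with hypothetical names). The fast path takes the read lock once; only a missing slot pays for the write lock and a second read.

```rust
use std::sync::RwLock;

// Simplified stand-in for the recorder slab.
struct Slab {
    recorders: RwLock<Vec<u64>>,
}

impl Slab {
    fn get_or_grow(&self, idx: usize) -> u64 {
        // Fast path: a single read-lock acquisition when the slot exists.
        if let Some(&v) = self.recorders.read().unwrap().get(idx) {
            return v;
        }
        // Slow path: take the write lock briefly to extend the slab.
        {
            let mut w = self.recorders.write().unwrap();
            while w.len() <= idx {
                w.push(0);
            }
        }
        // Re-acquire the read lock only on this (rare) path.
        self.recorders.read().unwrap()[idx]
    }
}

fn main() {
    let slab = Slab {
        recorders: RwLock::new(vec![7]),
    };
    assert_eq!(slab.get_or_grow(0), 7); // fast path
    assert_eq!(slab.get_or_grow(5), 0); // slow path grows the slab
    assert_eq!(slab.recorders.read().unwrap().len(), 6);
}
```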

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Contributor

@kleimkuhler kleimkuhler left a comment

@hawkw There is a comment above about #[inline] that I'm still curious about, and a few nits to clean up comments if you feel like grouping those into any additional commits, but otherwise this looks good to me!

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Member

@olix0r olix0r left a comment

awesome!

:; linkerd metrics -n emojivoto po/emoji-84c4946fc5-hf6gk | grep handle
# HELP request_handle_us A histogram of the time in microseconds between when a request is received and when it is sent upstream.
# TYPE request_handle_us histogram
request_handle_us_bucket{direction="inbound",le="1"} 0
request_handle_us_bucket{direction="inbound",le="2"} 0
request_handle_us_bucket{direction="inbound",le="3"} 0
request_handle_us_bucket{direction="inbound",le="4"} 0
request_handle_us_bucket{direction="inbound",le="5"} 0
request_handle_us_bucket{direction="inbound",le="10"} 0
request_handle_us_bucket{direction="inbound",le="20"} 0
request_handle_us_bucket{direction="inbound",le="30"} 0
request_handle_us_bucket{direction="inbound",le="40"} 0
request_handle_us_bucket{direction="inbound",le="50"} 0
request_handle_us_bucket{direction="inbound",le="100"} 46
request_handle_us_bucket{direction="inbound",le="200"} 144
request_handle_us_bucket{direction="inbound",le="300"} 148
request_handle_us_bucket{direction="inbound",le="400"} 148
request_handle_us_bucket{direction="inbound",le="500"} 148
request_handle_us_bucket{direction="inbound",le="1000"} 149
request_handle_us_bucket{direction="inbound",le="2000"} 149
request_handle_us_bucket{direction="inbound",le="3000"} 150
request_handle_us_bucket{direction="inbound",le="4000"} 150
request_handle_us_bucket{direction="inbound",le="5000"} 150
request_handle_us_bucket{direction="inbound",le="10000"} 150
request_handle_us_bucket{direction="inbound",le="20000"} 150
request_handle_us_bucket{direction="inbound",le="30000"} 150
request_handle_us_bucket{direction="inbound",le="40000"} 150
request_handle_us_bucket{direction="inbound",le="50000"} 150
request_handle_us_bucket{direction="inbound",le="+Inf"} 150
request_handle_us_count{direction="inbound"} 150
request_handle_us_sum{direction="inbound"} 21520
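As an aside, the sum and count in the scrape above allow a quick mean-handle-time sanity check (an illustrative calculation, not part of the PR):

```rust
fn main() {
    // From the scrape above: request_handle_us_sum and _count (inbound).
    let sum_us: f64 = 21520.0;
    let count: f64 = 150.0;
    let mean_us = sum_us / count;
    // ~143.5us mean, consistent with most samples landing between the
    // le="100" and le="200" buckets.
    assert!((mean_us - 143.47).abs() < 0.05);
    println!("mean handle time: {:.2} us", mean_us);
}
```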

@hawkw
Contributor Author

hawkw commented Aug 8, 2019

will merge this pending CI

@hawkw hawkw merged commit faa7be6 into master Aug 8, 2019
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Aug 9, 2019
* Update h2 to v0.1.26
* Properly fall back in the dst_router (linkerd/linkerd2-proxy#291)
* Tap server authorizes clients when identity is expected (linkerd/linkerd2-proxy#290)
* update-rust-version: Check usage (linkerd/linkerd2-proxy#298)
* tap: fix tap response streams never ending (linkerd/linkerd2-proxy#299)
* Require identity on tap requests (linkerd/linkerd2-proxy#295)
* Authority label should reflect logical dst (linkerd/linkerd2-proxy#300)
* Replace futures_watch with tokio::sync::watch (linkerd/linkerd2-proxy#301)
* metrics: add `request_handle_us` histogram (linkerd/linkerd2-proxy#294)
* linkerd2-proxy: Adopt Rust 2018 (linkerd/linkerd2-proxy#302)
* Remove futures-mpsc-lossy (linkerd/linkerd2-proxy#305)
* Adopt std::convert::TryFrom (linkerd/linkerd2-proxy#304)
* lib: Rename directories to match crate names (linkerd/linkerd2-proxy#303)
@olix0r olix0r deleted the eliza/handle-time branch August 17, 2019 01:33

Successfully merging this pull request may close these issues.

proxy: Add request_handle_us histogram