rpc: latency jump detection logging is too noisy #96262
Describe the problem
In most investigations the first thing I have to do to look at the logs is to remove these messages:
cockroach/pkg/rpc/clock_offset.go
Lines 268 to 269 in f8f83dd

    log.Health.Warningf(ctx, "latency jump (prev avg %.2fms, current %.2fms)",
        prevAvg/1e6, newLatencyf/1e6)
They are spammy. I see them spam on my local machine and on most production deployments. I understand that detecting blips is helpful but somehow the latencies we measure here aren't quite right.
For example, just on my Mac M1 which is mostly idle:
Details
$ roachprod stop local; roachprod put local ./cockroach; roachprod start local; tail -F ~/local/*/logs/cockroach.log | grep 'latency jump'
local: stopping and waiting 3/3
10:18:31 cluster_synced.go:1584: WARNING: Source file is a symlink.
10:18:31 cluster_synced.go:1585: WARNING: Remote file will inherit the target permissions '-rwxr-xr-x'.
10:18:31 cluster_synced.go:1600: local: putting ./cockroach cockroach on nodes [1 2 3]
1: done
2: done
3: done
10:18:32 cockroach.go:174: local: starting nodes
10:18:33 cockroach.go:630: local: initializing cluster
10:18:33 cockroach.go:651: local: setting cluster settings
W230131 10:18:42.780350 442 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=3,raddr=‹127.0.0.1:26261›,class=default,rpc] 91 latency jump (prev avg 0.41ms, current 1.18ms)
W230131 10:18:43.622659 488 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=2,raddr=‹127.0.0.1:26259›,class=default,rpc] 102 latency jump (prev avg 0.33ms, current 1.08ms)
W230131 10:18:43.720008 508 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=3,raddr=‹127.0.0.1:26261›,class=system,rpc] 103 latency jump (prev avg 0.47ms, current 1.30ms)
W230131 10:18:43.639829 379 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 93 latency jump (prev avg 0.44ms, current 1.23ms)
W230131 10:18:45.945567 436 2@rpc/clock_offset.go:268 ⋮ [T1,n3,rnode=2,raddr=‹127.0.0.1:26259›,class=default,rpc] 84 latency jump (prev avg 0.40ms, current 9.54ms)
W230131 10:18:45.945936 200 2@rpc/clock_offset.go:268 ⋮ [T1,n3,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 85 latency jump (prev avg 0.45ms, current 9.78ms)
W230131 10:19:00.063778 548 2@rpc/clock_offset.go:268 ⋮ [T1,n3,rnode=3,raddr=‹127.0.0.1:26261›,class=default,rpc] 95 latency jump (prev avg 0.33ms, current 1.13ms)
W230131 10:19:00.689888 504 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 109 latency jump (prev avg 0.31ms, current 1.25ms)
W230131 10:19:04.801464 342 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=2,raddr=‹127.0.0.1:26259›,class=system,rpc] 111 latency jump (prev avg 0.59ms, current 8.87ms)
W230131 10:19:13.877610 530 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=2,raddr=‹127.0.0.1:26259›,class=default,rpc] 111 latency jump (prev avg 0.39ms, current 1.18ms)
W230131 10:19:15.672922 379 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 113 latency jump (prev avg 0.57ms, current 1.18ms)
W230131 10:19:15.810652 342 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=2,raddr=‹127.0.0.1:26259›,class=system,rpc] 122 latency jump (prev avg 0.61ms, current 1.25ms)
I'm looking at a debug.zip from a 30-node cluster now. I didn't pick this one for being particularly bad; it's just what I had at hand:
$ cockroach debug merge-logs --filter 'latency jump' --format crdb-v1 nodes/*/logs/crdb.ip*.log | wc -l
757129
$ cockroach debug merge-logs --format crdb-v1 nodes/*/logs/crdb.ip*.log | wc -l
4309547
In other words, ~17% of the lines in the dev log are this single message.
There actually are some latency blips on this cluster according to the round-trip-latency metric, really big ones.
But the latency jump detection logs all the time; like a stopped clock, it's bound to be right sometimes, but nobody will notice when it is.
I think there are two things to look into here.
First, why does (seemingly) every deployment have latency blips that seem unreasonably large even when the cluster is idle? I certainly don't expect 8ms self-ping times. Local ICMP pings are typically in the <0.2ms range. Of course a Go program pinging a copy of itself through gRPC is bound to have higher overhead, but the occasional 8+ms still seems unexpected; maybe there's something we can fix.
Second, a lot of the logging just seems spammy: the current code suppresses the logging if the new latency is <1ms; perhaps we ought to bump that threshold to something like 10ms.
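For concreteness, here is a minimal sketch of the kind of threshold change proposed, assuming a simple "new sample vs. previous average" comparison. The function name, the 2x jump factor, and the constants are illustrative assumptions, not the actual clock_offset.go logic:

```go
package main

import "fmt"

// Illustrative sketch only: names and constants below are assumptions,
// not the actual code in pkg/rpc/clock_offset.go.
const (
	currentThresholdNs  = 1e6  // today's suppression floor: <1ms jumps are not logged
	proposedThresholdNs = 10e6 // suggested floor: ignore jumps under ~10ms
)

// isLatencyJump reports whether a new latency sample (in nanoseconds)
// should trigger the "latency jump" warning, given the previous moving
// average and a minimum absolute threshold below which warnings are
// suppressed.
func isLatencyJump(prevAvgNs, newLatencyNs, thresholdNs float64) bool {
	return newLatencyNs >= thresholdNs && newLatencyNs > 2*prevAvgNs
}

func main() {
	// A 1.2ms blip over a 0.4ms average: logged today, suppressed at 10ms.
	fmt.Println(isLatencyJump(0.4e6, 1.2e6, currentThresholdNs))  // true
	fmt.Println(isLatencyJump(0.4e6, 1.2e6, proposedThresholdNs)) // false
	// A 12ms blip is still logged under either threshold.
	fmt.Println(isLatencyJump(0.6e6, 12e6, proposedThresholdNs)) // true
}
```

Under a 10ms floor, every sub-10ms blip in the transcript above (0.4ms → 1.2ms and similar) would be silenced, while the genuinely large jumps (8-10ms and up) would still be reported.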
Jira issue: CRDB-24032
Epic: CRDB-54646