
rpc: latency jump detection logging is too noisy #96262

@tbg

Description

Describe the problem

In most investigations, the first thing I have to do when looking at the logs is to filter out these messages:

log.Health.Warningf(ctx, "latency jump (prev avg %.2fms, current %.2fms)",
    prevAvg/1e6, newLatencyf/1e6)

They are spammy: I see them on my local machine and on most production deployments. I understand that detecting blips is helpful, but somehow the latencies we measure here aren't quite right.

For example, just on my Mac M1 which is mostly idle:

$ roachprod stop local; roachprod put local ./cockroach;  roachprod start local; tail -F ~/local/*/logs/cockroach.log | grep 'latency jump'
local: stopping and waiting 3/3
10:18:31 cluster_synced.go:1584: WARNING: Source file is a symlink.
10:18:31 cluster_synced.go:1585: WARNING: Remote file will inherit the target permissions '-rwxr-xr-x'.
10:18:31 cluster_synced.go:1600: local: putting ./cockroach cockroach on nodes [1 2 3]
   1: done
   2: done
   3: done
10:18:32 cockroach.go:174: local: starting nodes
10:18:33 cockroach.go:630: local: initializing cluster
10:18:33 cockroach.go:651: local: setting cluster settings
W230131 10:18:42.780350 442 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=3,raddr=‹127.0.0.1:26261›,class=default,rpc] 91  latency jump (prev avg 0.41ms, current 1.18ms)
W230131 10:18:43.622659 488 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=2,raddr=‹127.0.0.1:26259›,class=default,rpc] 102  latency jump (prev avg 0.33ms, current 1.08ms)
W230131 10:18:43.720008 508 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=3,raddr=‹127.0.0.1:26261›,class=system,rpc] 103  latency jump (prev avg 0.47ms, current 1.30ms)
W230131 10:18:43.639829 379 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 93  latency jump (prev avg 0.44ms, current 1.23ms)
W230131 10:18:45.945567 436 2@rpc/clock_offset.go:268 ⋮ [T1,n3,rnode=2,raddr=‹127.0.0.1:26259›,class=default,rpc] 84  latency jump (prev avg 0.40ms, current 9.54ms)
W230131 10:18:45.945936 200 2@rpc/clock_offset.go:268 ⋮ [T1,n3,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 85  latency jump (prev avg 0.45ms, current 9.78ms)
W230131 10:19:00.063778 548 2@rpc/clock_offset.go:268 ⋮ [T1,n3,rnode=3,raddr=‹127.0.0.1:26261›,class=default,rpc] 95  latency jump (prev avg 0.33ms, current 1.13ms)
W230131 10:19:00.689888 504 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 109  latency jump (prev avg 0.31ms, current 1.25ms)
W230131 10:19:04.801464 342 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=2,raddr=‹127.0.0.1:26259›,class=system,rpc] 111  latency jump (prev avg 0.59ms, current 8.87ms)
W230131 10:19:13.877610 530 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=2,raddr=‹127.0.0.1:26259›,class=default,rpc] 111  latency jump (prev avg 0.39ms, current 1.18ms)
W230131 10:19:15.672922 379 2@rpc/clock_offset.go:268 ⋮ [T1,n2,rnode=1,raddr=‹127.0.0.1:26257›,class=default,rpc] 113  latency jump (prev avg 0.57ms, current 1.18ms)
W230131 10:19:15.810652 342 2@rpc/clock_offset.go:268 ⋮ [T1,n1,rnode=2,raddr=‹127.0.0.1:26259›,class=system,rpc] 122  latency jump (prev avg 0.61ms, current 1.25ms)

I'm looking at a debug.zip from a 30 node cluster now. I didn't pick this one for being particularly bad; it's just what I had at hand:

$ cockroach debug merge-logs --filter 'latency jump' --format crdb-v1 nodes/*/logs/crdb.ip*.log | wc -l
  757129
$ cockroach debug merge-logs --format crdb-v1 nodes/*/logs/crdb.ip*.log | wc -l
 4309547

In other words, ~17% of the lines in the dev log are this one message.

There actually are some latency blips on this cluster according to the round-trip-latency metric, really big ones.

But the latency blip detection logs all the time, so like a stopped clock it's bound to be right sometimes, and nobody will notice.

I think there are two things to look into here.

First, why does (seemingly) every deployment have latency blips that seem unreasonably large even when the cluster is idle? I certainly don't expect 8ms self-ping times. Local ICMP pings are typically in the <0.2ms range. Of course, a Go program pinging a copy of itself through gRPC is bound to have higher overhead, but the occasional 8+ms still seems unexpected; maybe there's something we can fix.

Second, a lot of the logging just seems spammy: the current code suppresses1 the logging if the new latency is <1ms; perhaps we ought to bump that threshold to something like 10ms.

Jira issue: CRDB-24032

Epic CRDB-54646

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/f8f83dd8d41e697bfa2d278b2a8c287afeef871d/pkg/rpc/clock_offset.go#L262-L266

Metadata

Labels

    A-server-networking: Pertains to network addressing, routing, initialization
    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
    P-3: Issues/test failures with no fix SLA
    T-kv: KV Team
