kvserver: log if lease applies with a delay#96257
kvserver: log if lease applies with a delay#96257craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
erikgrinaker
left a comment
There was a problem hiding this comment.
Now that we transfer expiration-based leases, can we also warn if the previous unclaimed lease expired because the new leaseholder failed to apply it? Otherwise, this log message won't fire if the delay is greater than 6 seconds.
a96eb2d to
2705d7b
Compare
|
That's a good point, doesn't the current code handle this though? If, say, a recipient of a lease transfer applies the lease via the log only after, say, 120 minutes, it will still see the lease and log the warning. I don't think there's any condition here that checks that the lease is still valid at apply time. The case where we do miss the logging is if the recipient eventually gets caught up via a snapshot and the snapshot has another leaseholder already. In that case we can't rely on the recipient of the lease logging anything since it never even learns about the lease transfer. |
That's true, but that could be so late that we no longer care. Let's say we transfer a lease from A to C, C fails to apply the lease and it expires, then after 10 seconds B acquires a new lease (without logging anything), and only once C recovers 2 hours later do we log a warning. Meanwhile, the range was unavailable for 10 seconds and people are upset. Can we easily add a condition where we log a lapsed expiration lease that was never applied (i.e. converted to an epoch lease)?
Right, I think we can ignore this case for our purposes here. |
271d7e0 to
2c029f0
Compare
Good idea, I think we can, see new commit. |
erikgrinaker
left a comment
There was a problem hiding this comment.
I think that should do the trick, thanks!
When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.
This warning makes it fairly obvious when it happens.
> W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165 lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]
Also log if we're acquiring an epoch-based lease following a
non-cooperatively expired expiration-based lease, which would suggest
that a lease transfer went to a node that couldn't service the lease.
This would likely have caused an outage, and the log message will
provide a way to pinpoint its end timestamp
Addresses cockroachdb#95991.
Epic: none
Release note: None
2c029f0 to
548a4be
Compare
|
bors r=erikgrinaker |
|
Build succeeded: |
|
blathers backport 22.1 |
When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.
This warning makes it fairly obvious when it happens.
Addresses #95991.
Epic: none
Release note: None