server: decommissioning starts draining and then cancels it - I think we're not really draining when decommissioning #25586
Description
Here's a nice one. I have a change that makes TestHeartbeatCallbackForDecommissioning flaky. I have no idea exactly how the change affects it, other than perhaps perturbing some timings.
But I've dug into what happens and the code just seems broken to me:
A server's liveness heartbeat loop detects the decommissioning bit on the liveness record and triggers draining. Except that it runs the draining as an async worker and then races to cancel the context used by that worker (lolz). Canceling that context can and does stop the draining.
Here's some code:
This is the worker doing the draining:
cockroach/pkg/server/server.go
Line 1500 in 82b4293
`s.stopper.RunWorker(ctx, func(context.Context) {`
Notice that it's called from a closure passed to nodeLiveness.StartHeartbeat(). The callback is invoked very indirectly from the heartbeat loop, and here's the context that's used for it:
cockroach/pkg/storage/node_liveness.go
Line 451 in 82b4293
`defer cancel()`
As I was saying, this context cancelation races with the callback. The test becomes flaky because sometimes we cancel so early that we don't even get to mark the liveness record as "draining". Even when the test isn't flaky, I assume we don't actually do the "leases" part of server draining - i.e. the lease transfers. I postulate that decommissioning is thus quite broken - but I don't have proof.
If true, some sort of acceptance test might be in order.
An immediate fix is probably to remove the parent's cancelation from the ctx that's passed to the draining worker. But I don't know if that's the right fix.