server: decommissioning starts draining and then cancels it - I think we're not really draining when decommissioning #25586
Description
Here's a nice one. I have a change that makes TestHeartbeatCallbackForDecommissioning flaky. I have no idea exactly how the change affects it, other than perhaps perturbing some timings.
But I've dug into what happens and the code just seems broken to me:
A server's liveness heartbeat loop detects the decommissioning bit on the liveness record and triggers draining. Except that it runs the draining as an async worker and then races to cancel the context used by that worker (lolz). Canceling that context can and does stop the draining.
Here's some code:
This is the worker doing the draining:
cockroach/pkg/server/server.go
Line 1500 in 82b4293
`s.stopper.RunWorker(ctx, func(context.Context) {`
Notice that it's called from a closure passed to nodeLiveness.StartHeartbeat(). The callback is invoked very indirectly from the heartbeat loop, and here's the context that's used for it:
cockroach/pkg/storage/node_liveness.go
Line 451 in 82b4293
`defer cancel()`
As I was saying, this context cancelation races with the callback. The test becomes flaky because sometimes we cancel so early that we don't even get to mark the liveness record as "draining". Even when the test isn't flaky, I assume we don't actually do the "leases" part of server draining - i.e. the lease transfers. I postulate that decommissioning is thus quite broken - but I don't have proof.
If true, some sort of acceptance test might be in order.
An immediate fix is probably to remove the parent's cancelation from the ctx that's passed to the draining worker. But I don't know if that's the right fix.