kvserver: merge queue racing against replicate queue #57129
Description
On 20.1.1, we observed a range not moving off a node during decommissioning.
Manually running through the replicate queue suggested that "someone" was
rolling back our learner before we could promote it. (This state lasted for
many hours, and presumably wouldn't ever have resolved).
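The suspected interaction can be modeled as a tiny state machine. This is a hedged sketch of the race as described above, not CockroachDB's actual code: `replicateQueueStep`, `mergeQueueStep`, and the state names are all illustrative. The point is that if the merge queue's learner rollback interleaves between the replicate queue's "add learner" and "promote learner" steps, the promotion never happens and the state livelocks.

```go
package main

import "fmt"

// replicaState is a simplified stand-in for a replica's membership status.
type replicaState int

const (
	none replicaState = iota
	learner
	voter
)

type rangeDesc struct{ state replicaState }

// replicateQueueStep performs one step of up-replication:
// first add a learner, then promote it to a voter.
func replicateQueueStep(r *rangeDesc) {
	switch r.state {
	case none:
		r.state = learner // add learner, send snapshot
	case learner:
		r.state = voter // promote learner to voter
	}
}

// mergeQueueStep rolls back any learner it encounters, treating it
// as abandoned (the behavior this issue suspects).
func mergeQueueStep(r *rangeDesc) {
	if r.state == learner {
		r.state = none // roll back the "abandoned" learner
	}
}

func main() {
	r := &rangeDesc{}
	// If the merge queue always runs between the replicate queue's two
	// steps, the learner is rolled back before promotion, forever.
	for i := 0; i < 3; i++ {
		replicateQueueStep(r) // none -> learner
		mergeQueueStep(r)     // learner -> none
	}
	fmt.Println(r.state == voter) // false: the promotion never happens
}
```

In the real system the interleaving is timing-dependent rather than guaranteed, which is consistent with the state persisting for hours rather than resolving.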
Logs indicated that the merge queue was repeatedly retrying with

```
cannot up-replicate to s11; missing gossiped StoreDescriptor
```

(s11 is the decommissioning node's store.)
We momentarily disabled the merge queue and decommissioning promptly finished.
I actually don't know why the store descriptor wasn't gossiped; n11 was running, and by my reading of the code the draining status shouldn't have affected gossip at all:
Lines 689 to 709 in b9ed5df:

```go
statusTicker := time.NewTicker(gossipStatusInterval)
storesTicker := time.NewTicker(gossip.StoresInterval)
nodeTicker := time.NewTicker(gossip.NodeDescriptorInterval)
defer storesTicker.Stop()
defer nodeTicker.Stop()

n.gossipStores(ctx) // one-off run before going to sleep
for {
	select {
	case <-statusTicker.C:
		n.storeCfg.Gossip.LogStatus()
	case <-storesTicker.C:
		n.gossipStores(ctx)
	case <-nodeTicker.C:
		if err := n.storeCfg.Gossip.SetNodeDescriptor(&n.Descriptor); err != nil {
			log.Warningf(ctx, "couldn't gossip descriptor for node %d: %s", n.Descriptor.NodeID, err)
		}
	case <-stopper.ShouldStop():
		return
	}
}
```
Also, the node shouldn't even have been draining in the first place; it should only have been decommissioning. But we did find evidence that it was draining too, which I don't quite understand.
https://cockroachlabs.slack.com/archives/C01351NFLE9/p1606320505056100 has some internal info here.
My assumption is that an explicit `./cockroach node drain` was issued against the node at some point.
So for this issue:
- make sure the merge queue doesn't try to relocate replicas to decommissioning or draining nodes
- figure out why a draining node would stop gossiping its store descriptor
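The first action item amounts to a target filter in the allocator path. The sketch below is a hedged illustration of that idea only; `storeInfo` and `viableTargets` are hypothetical names, not CockroachDB's actual API, and the real check would consult node liveness rather than plain booleans.

```go
package main

import "fmt"

// storeInfo is a simplified view of a candidate store.
type storeInfo struct {
	storeID         int
	decommissioning bool
	draining        bool
	gossiped        bool // whether a StoreDescriptor is present in gossip
}

// viableTargets drops stores that should not receive new replicas:
// decommissioning or draining nodes, and stores whose descriptor is
// missing from gossip (the "missing gossiped StoreDescriptor" case).
func viableTargets(stores []storeInfo) []storeInfo {
	var out []storeInfo
	for _, s := range stores {
		if s.decommissioning || s.draining || !s.gossiped {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	stores := []storeInfo{
		{storeID: 1, gossiped: true},
		{storeID: 11, decommissioning: true, gossiped: false}, // the stuck store
		{storeID: 12, draining: true, gossiped: true},
	}
	for _, s := range viableTargets(stores) {
		fmt.Println("candidate store:", s.storeID) // prints only store 1
	}
}
```

With a filter like this in front of the merge queue's relocation logic, the retry loop against s11 would never start, independent of the gossip question in the second item.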
Jira issue: CRDB-2858