
kvserver: merge queue racing against replicate queue #57129

@tbg


On 20.1.1, we observed a range not moving off a node during decommissioning.
Manually running through the replicate queue suggested that "someone" was
rolling back our learner before we could promote it. (This state lasted for
many hours, and presumably wouldn't ever have resolved).

Logs indicated that the merge queue was repeatedly retrying and failing with:

cannot up-replicate to s11; missing gossiped StoreDescriptor

(s11 is the decommissioning node's store).

We momentarily disabled the merge queue and decommissioning promptly finished.

I actually don't know why the store descriptor wasn't gossiped. n11 was running, and according to my reading of the code the draining status shouldn't have affected that at all:

	statusTicker := time.NewTicker(gossipStatusInterval)
	storesTicker := time.NewTicker(gossip.StoresInterval)
	nodeTicker := time.NewTicker(gossip.NodeDescriptorInterval)
	defer storesTicker.Stop()
	defer nodeTicker.Stop()
	n.gossipStores(ctx) // one-off run before going to sleep
	for {
		select {
		case <-statusTicker.C:
			n.storeCfg.Gossip.LogStatus()
		case <-storesTicker.C:
			n.gossipStores(ctx)
		case <-nodeTicker.C:
			if err := n.storeCfg.Gossip.SetNodeDescriptor(&n.Descriptor); err != nil {
				log.Warningf(ctx, "couldn't gossip descriptor for node %d: %s", n.Descriptor.NodeID, err)
			}
		case <-stopper.ShouldStop():
			return
		}
	}

Also, the node shouldn't even be draining in the first place; it should be decommissioning, but we did find evidence that it was draining, too, which I don't quite understand.

https://cockroachlabs.slack.com/archives/C01351NFLE9/p1606320505056100 has some internal info here.

My assumption is that an explicit ./cockroach node drain was issued against the node at some point.

So for this issue:

  • make sure the merge queue doesn't try to relocate replicas to decommissioning or draining nodes
  • figure out why a draining node would stop gossiping its store descriptor
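
To make the first item concrete, here is a minimal sketch of the kind of check I have in mind. Everything in it is hypothetical and simplified for illustration: `StoreInfo`, `MembershipStatus`, `validRelocationTarget` and `filterTargets` are made-up names, not the actual kvserver or allocator types, and the real fix would presumably consult node liveness and the store pool instead.

```go
package main

import "fmt"

// MembershipStatus is a hypothetical stand-in for a node's membership
// state; it is not the real liveness type.
type MembershipStatus int

const (
	Active MembershipStatus = iota
	Decommissioning
	Decommissioned
)

// StoreInfo is a hypothetical, simplified view of what a queue might
// know about a candidate store.
type StoreInfo struct {
	StoreID    int
	Membership MembershipStatus
	Draining   bool
	Gossiped   bool // a StoreDescriptor is currently available via gossip
}

// validRelocationTarget returns nil if the store may receive a replica,
// or an error explaining why it was rejected.
func validRelocationTarget(s StoreInfo) error {
	switch {
	case s.Membership != Active:
		return fmt.Errorf("s%d is decommissioning or decommissioned", s.StoreID)
	case s.Draining:
		return fmt.Errorf("s%d is draining", s.StoreID)
	case !s.Gossiped:
		return fmt.Errorf("s%d: missing gossiped StoreDescriptor", s.StoreID)
	}
	return nil
}

// filterTargets keeps only the store IDs that pass validRelocationTarget,
// so ineligible stores are dropped up front instead of failing later.
func filterTargets(stores []StoreInfo) []int {
	var out []int
	for _, s := range stores {
		if validRelocationTarget(s) == nil {
			out = append(out, s.StoreID)
		}
	}
	return out
}
```

With a check like this in front of the relocation, a store such as s11 would be skipped before any learner is added, rather than the merge queue retrying indefinitely and interfering with the replicate queue.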

Jira issue: CRDB-2858


Labels

  • A-kv-replication: Relating to Raft, consensus, and coordination.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • T-kv: KV Team
