kvserver: merge queue racing against replicate queue #57129
Description
On 20.1.1, we observed a range not moving off a node during decommissioning.
Manually running through the replicate queue suggested that "someone" was
rolling back our learner before we could promote it. (This state lasted for
many hours, and presumably wouldn't ever have resolved).
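The suspected interaction can be modeled as a tiny state machine. This is a hedged sketch of the race as described above, not CockroachDB's actual code: `replicateQueueStep`, `mergeQueueStep`, and the state names are all illustrative. The point is that if the merge queue's learner rollback interleaves between the replicate queue's "add learner" and "promote learner" steps, the promotion never happens and the state livelocks.

```go
package main

import "fmt"

// replicaState is a simplified stand-in for a replica's membership status.
type replicaState int

const (
	none replicaState = iota
	learner
	voter
)

type rangeDesc struct{ state replicaState }

// replicateQueueStep performs one step of up-replication:
// first add a learner, then promote it to a voter.
func replicateQueueStep(r *rangeDesc) {
	switch r.state {
	case none:
		r.state = learner // add learner, send snapshot
	case learner:
		r.state = voter // promote learner to voter
	}
}

// mergeQueueStep rolls back any learner it encounters, treating it
// as abandoned (the behavior this issue suspects).
func mergeQueueStep(r *rangeDesc) {
	if r.state == learner {
		r.state = none // roll back the "abandoned" learner
	}
}

func main() {
	r := &rangeDesc{}
	// If the merge queue always runs between the replicate queue's two
	// steps, the learner is rolled back before promotion, forever.
	for i := 0; i < 3; i++ {
		replicateQueueStep(r) // none -> learner
		mergeQueueStep(r)     // learner -> none
	}
	fmt.Println(r.state == voter) // false: the promotion never happens
}
```

In the real system the interleaving is timing-dependent rather than guaranteed, which is consistent with the state persisting for hours rather than resolving.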
Logs indicated that the merge queue was repeatedly retrying with

```
cannot up-replicate to s11; missing gossiped StoreDescriptor
```

(s11 is the decommissioning node's store.)
We momentarily disabled the merge queue and decommissioning promptly finished.
I actually don't know why the store descriptor wasn't gossiped; n11 was running, and by my reading of the code the draining status shouldn't have affected gossip at all:
Lines 689 to 709 in b9ed5df:

```go
statusTicker := time.NewTicker(gossipStatusInterval)
storesTicker := time.NewTicker(gossip.StoresInterval)
nodeTicker := time.NewTicker(gossip.NodeDescriptorInterval)
defer storesTicker.Stop()
defer nodeTicker.Stop()

n.gossipStores(ctx) // one-off run before going to sleep
for {
	select {
	case <-statusTicker.C:
		n.storeCfg.Gossip.LogStatus()
	case <-storesTicker.C:
		n.gossipStores(ctx)
	case <-nodeTicker.C:
		if err := n.storeCfg.Gossip.SetNodeDescriptor(&n.Descriptor); err != nil {
			log.Warningf(ctx, "couldn't gossip descriptor for node %d: %s", n.Descriptor.NodeID, err)
		}
	case <-stopper.ShouldStop():
		return
	}
}
```
Also, the node shouldn't even have been draining in the first place; it should only have been decommissioning. But we did find evidence that it was draining too, which I don't quite understand.
https://cockroachlabs.slack.com/archives/C01351NFLE9/p1606320505056100 has some internal info here.
My assumption is that an explicit `./cockroach node drain` was issued against the node at some point.
So for this issue:
- make sure the merge queue doesn't try to relocate replicas to decommissioning or draining nodes
- figure out why a draining node would stop gossiping its store descriptor
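The first action item amounts to a target filter in the allocator path. The sketch below is a hedged illustration of that idea only; `storeInfo` and `viableTargets` are hypothetical names, not CockroachDB's actual API, and the real check would consult node liveness rather than plain booleans.

```go
package main

import "fmt"

// storeInfo is a simplified view of a candidate store.
type storeInfo struct {
	storeID         int
	decommissioning bool
	draining        bool
	gossiped        bool // whether a StoreDescriptor is present in gossip
}

// viableTargets drops stores that should not receive new replicas:
// decommissioning or draining nodes, and stores whose descriptor is
// missing from gossip (the "missing gossiped StoreDescriptor" case).
func viableTargets(stores []storeInfo) []storeInfo {
	var out []storeInfo
	for _, s := range stores {
		if s.decommissioning || s.draining || !s.gossiped {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	stores := []storeInfo{
		{storeID: 1, gossiped: true},
		{storeID: 11, decommissioning: true, gossiped: false}, // the stuck store
		{storeID: 12, draining: true, gossiped: true},
	}
	for _, s := range viableTargets(stores) {
		fmt.Println("candidate store:", s.storeID) // prints only store 1
	}
}
```

With a filter like this in front of the merge queue's relocation logic, the retry loop against s11 would never start, independent of the gossip question in the second item.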
Jira issue: CRDB-2858