kvserver: ignore draining nodes in proposal quota #55806

@tbg

Describe the problem

It doesn't seem like we take the Draining status of a node into account in the quota pool. This means that when the node terminates, from the POV of the quota pool it has just disappeared.

I think we mostly get this right, though perhaps accidentally:

    if !r.mu.lastUpdateTimes.isFollowerActiveSince(
        ctx, rep.ReplicaID, now, r.store.cfg.RangeLeaseActiveDuration(),
    ) {
        return
    }
    // Only consider followers that have "healthy" RPC connections.
    if err := r.store.cfg.NodeDialer.ConnHealth(rep.NodeID, r.connectionClass.get()); err != nil {
        return
    }

Note the ConnHealth check here, which presumably would go red fairly quickly, on the order of an RPC heartbeat interval,

    // defaultRPCHeartbeatInterval is the default value of RPCHeartbeatInterval
    // used by the rpc context.
    defaultRPCHeartbeatInterval = 3 * time.Second

while the isFollowerActiveSince check will be a bit slower to fire (maybe a few seconds more? Didn't check). Either way, if in that time period we run out of quota, the range will stall until one of the checks clears.

Even if the current checks might be mostly good enough most of the time, it seems desirable to exclude a node from quota pool considerations the moment it becomes draining, to avoid write stalls that could last several seconds.

cc @aayushshah15 and @knz since you're both on related topics.

To Reproduce

I don't have a reproduction. One would involve running a write-heavy workload at full speed against a single range and gracefully draining one of its members, while asserting that the write latency remains constant.

Expected behavior
Ignore the node for purposes of the quota pool when it has a Draining liveness record.
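A minimal sketch of the expected behavior, assuming the quota pool releases quota up to the lowest index acknowledged by the followers it counts. The `follower` struct and the helpers `countsTowardQuota` and `minAckedIndex` are hypothetical stand-ins, not the kvserver types; the three boolean fields model the two existing checks plus the proposed Draining check.

```go
package main

import "fmt"

// follower is a hypothetical, simplified view of what the leader knows
// about one replica; it is not the kvserver type.
type follower struct {
	replicaID      int
	ackedIndex     uint64 // highest log index the follower has acknowledged
	recentlyActive bool   // stands in for isFollowerActiveSince
	connHealthy    bool   // stands in for the ConnHealth check
	draining       bool   // stands in for a Draining liveness record
}

// countsTowardQuota reports whether a follower should hold back quota.
// The draining check is the addition this issue asks for.
func countsTowardQuota(f follower) bool {
	return f.recentlyActive && f.connHealthy && !f.draining
}

// minAckedIndex returns the lowest acknowledged index among the followers
// that count. Quota for entries at or below it can be released. If no
// follower counts, the leader's own index is returned.
func minAckedIndex(leaderIndex uint64, followers []follower) uint64 {
	lowest := leaderIndex
	for _, f := range followers {
		if !countsTowardQuota(f) {
			continue
		}
		if f.ackedIndex < lowest {
			lowest = f.ackedIndex
		}
	}
	return lowest
}

func main() {
	followers := []follower{
		{replicaID: 2, ackedIndex: 100, recentlyActive: true, connHealthy: true},
		// Still looks active and healthy, but is draining and has fallen behind.
		{replicaID: 3, ackedIndex: 40, recentlyActive: true, connHealthy: true, draining: true},
	}
	// The draining follower is skipped, so quota is released up to index 100.
	fmt.Println(minAckedIndex(100, followers))
}
```

Without the `draining` condition, the same input would pin the released index at 40 until one of the other two checks turned red.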

Additional data / screenshots

Environment:

Additional context

Jira issue: CRDB-3627

    Labels

    A-kv-replication: Relating to Raft, consensus, and coordination.
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    T-kv: KV Team
