Raft NodeID reuse #756
Description
Raft assumes that the state of a node can never move backwards. This means that if we delete data from ranges that have been removed from a store, we cannot simply re-add the same store with the same NodeID. There are two problems with reusing an old NodeID. First, the new incarnation of a node may cast a vote that conflicts with a vote cast by the previous incarnation. Second, the node may be counted towards the quorum for log entries that it used to contain but no longer does. (I don't think the second is an issue here: the node was by definition not part of the quorum for the configuration change entry that removed it, so once that entry is committed, it is irrelevant that the removed node may have been part of the quorum for some prior entry.)
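To make the first problem concrete, here is a minimal sketch of the usual Raft voting rule. The `HardState` type is a local stand-in for the `Term` and `Vote` fields of `raftpb.HardState`, and `grantVote` is a hypothetical helper, not code from multiraft. It shows that a node which has forgotten its `HardState` can vote twice in the same term.

```go
package main

import "fmt"

// HardState is a local stand-in for the Term and Vote fields of
// raftpb.HardState; it is not the real type.
type HardState struct {
	Term uint64
	Vote uint64 // 0 means "no vote cast in this term"
}

// grantVote is a hypothetical helper that applies the usual Raft rule:
// at most one vote per term, and a higher term resets the vote.
func grantVote(hs *HardState, term, candidate uint64) bool {
	if term < hs.Term {
		return false // stale request
	}
	if term > hs.Term {
		hs.Term, hs.Vote = term, 0
	}
	if hs.Vote == 0 || hs.Vote == candidate {
		hs.Vote = candidate
		return true
	}
	return false
}

func main() {
	// First incarnation: votes for candidate 2 in term 5.
	old := &HardState{}
	fmt.Println(grantVote(old, 5, 2)) // true
	fmt.Println(grantVote(old, 5, 3)) // false: already voted in term 5

	// The range data, including the HardState, is deleted and the
	// same NodeID is re-added with empty state.
	fresh := &HardState{}
	fmt.Println(grantVote(fresh, 5, 3)) // true: a second, conflicting vote in term 5
}
```

With the old `HardState` kept, the second request in term 5 is refused. With it wiped, the same NodeID ends up having voted for both candidates.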
Options:
- Don't completely delete the old range. Specifically, keep the raftpb.HardState to record how we voted for this range. A Store would accumulate HardStates for every range it had held in the past (perhaps with eventual GC, although doing this safely requires knowledge of term changes as described below)
- Trigger a term change and a new election after a replica has been removed. Once the term has changed, the old range's HardState is no longer relevant and can be safely removed. This could be an availability problem since a range is leaderless during an election, and it would be tricky/ugly to inform the former member that the election has completed.
- Generate unique "NodeIDs" that do not relate to the real node ID. These are local to the range, so using the raft log index of the entry that committed the config change could work. This requires machinery to map these fake node IDs back to the real nodes in two places: in the Transport for routing messages, and in multiraft itself for deciding which heartbeat messages can be coalesced. I think we'd store this mapping in the range descriptor. (This would also be an improvement to coalesced heartbeats: right now each store is treated as a separate node by multiraft, but this interface would let us decode the multiraft node IDs and coalesce heartbeats that are bound for different stores on the same node.)
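A rough sketch of what the mapping in the last option might look like. Everything here is an assumption for illustration: `ReplicaID`, `Location`, `RangeMembers`, and `CoalesceByNode` are hypothetical names, and the real range descriptor would carry more than this.

```go
package main

import "fmt"

// ReplicaID is a hypothetical range-local ID handed to raft in place
// of the real node ID.
type ReplicaID uint64

// Location identifies the real node and store behind a ReplicaID.
type Location struct {
	NodeID  uint64
	StoreID uint64
}

// RangeMembers is the mapping that would live in the range descriptor.
type RangeMembers map[ReplicaID]Location

// Add registers a member, using the raft log index of the config
// change entry as the ID. Log indexes are never reused within a
// range, so a store that is re-added always gets a fresh ID.
func (m RangeMembers) Add(configChangeIndex uint64, loc Location) ReplicaID {
	id := ReplicaID(configChangeIndex)
	m[id] = loc
	return id
}

// CoalesceByNode groups heartbeat targets by real node, so that
// heartbeats bound for different stores on one node can share a
// message. Unknown IDs are skipped.
func CoalesceByNode(m RangeMembers, targets []ReplicaID) map[uint64][]ReplicaID {
	byNode := make(map[uint64][]ReplicaID)
	for _, id := range targets {
		if loc, ok := m[id]; ok {
			byNode[loc.NodeID] = append(byNode[loc.NodeID], id)
		}
	}
	return byNode
}

func main() {
	members := RangeMembers{}

	// Store 1 on node 1 joins at log index 10, is removed, then rejoins at 42.
	first := members.Add(10, Location{NodeID: 1, StoreID: 1})
	delete(members, first)
	second := members.Add(42, Location{NodeID: 1, StoreID: 1})
	fmt.Println(first != second) // the new incarnation has a new ID

	other := members.Add(43, Location{NodeID: 1, StoreID: 2})
	remote := members.Add(44, Location{NodeID: 2, StoreID: 3})
	fmt.Println(CoalesceByNode(members, []ReplicaID{second, other, remote}))
}
```

Because raft only ever sees the range-local ID, votes cast by the first incarnation can never be attributed to the second, and the Transport resolves the ID to a `Location` only at send time.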
I think the third option (range-local IDs) is the right long-term solution, but I'm going to implement the first one (keeping the HardState) to begin with, just to get something working.
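For the first option, a minimal sketch of the bookkeeping could look like the following. The `Store` type, its methods, and the `HardState` struct are simplified stand-ins invented for this example (the real `raftpb.HardState` has `Term`, `Vote`, and `Commit` fields), and the GC rule simply encodes the observation above that the old HardState stops mattering once the term has changed.

```go
package main

import "fmt"

// RangeID and HardState are simplified stand-ins for illustration.
type RangeID uint64

type HardState struct {
	Term, Vote, Commit uint64
}

// Store is a hypothetical store that keeps HardStates for ranges it
// no longer holds.
type Store struct {
	data       map[RangeID][]string  // range contents; key present means range is held
	hardStates map[RangeID]HardState // kept even after the range is removed
}

func NewStore() *Store {
	return &Store{
		data:       make(map[RangeID][]string),
		hardStates: make(map[RangeID]HardState),
	}
}

// RemoveRange deletes the range data but keeps the HardState.
func (s *Store) RemoveRange(id RangeID) {
	delete(s.data, id)
}

// AddRange (re-)creates the range and returns any retained HardState,
// so a new incarnation starts from the old term and vote.
func (s *Store) AddRange(id RangeID) HardState {
	s.data[id] = nil
	return s.hardStates[id]
}

// GC drops the retained HardState of a removed range once the range's
// current term is known to be past the recorded one.
func (s *Store) GC(id RangeID, currentTerm uint64) bool {
	if _, held := s.data[id]; held {
		return false
	}
	hs, ok := s.hardStates[id]
	if !ok || currentTerm <= hs.Term {
		return false
	}
	delete(s.hardStates, id)
	return true
}

func main() {
	s := NewStore()
	s.AddRange(1)
	s.hardStates[1] = HardState{Term: 5, Vote: 2}

	s.RemoveRange(1)
	fmt.Println(s.GC(1, 5))    // false: term 5 may still be in progress
	fmt.Println(s.AddRange(1)) // the retained HardState is restored
}
```

The open question from the options list remains: the store needs some way to learn `currentTerm` for a range it is no longer a member of before `GC` can ever return true.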