raft: improve the availability related to member change

Hi,

The current member change implementation requires at least two working nodes in a cluster. If one node fails in a three-node cluster, there is a short window, after the failed node is removed and before the new node is added, during which a second node failure makes the cluster unavailable.

Another availability issue arises when balancing nodes among racks/data centers. The usual way to rebalance is to add a new node in one rack/data center and then remove an old node from another. After a node is added to a three-node cluster spread over three racks/data centers, there will be two nodes in the same rack/data center. If that rack/data center fails, only two of the four nodes remain, which is below the quorum of three, so the cluster is unavailable. This issue is described in more detail in https://github.com/pingcap/tikv/issues/1468
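The quorum arithmetic behind the rack/data center case can be sketched as follows. This is only an illustration: `quorum`, `availableAfterRackFailure`, and the rack names are made up for this example and are not part of the etcd raft package.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

// availableAfterRackFailure reports whether the cluster still has a
// quorum of live members after every node in the failed rack is lost.
// nodesPerRack maps a (hypothetical) rack name to its member count.
func availableAfterRackFailure(nodesPerRack map[string]int, failed string) bool {
	total, alive := 0, 0
	for rack, n := range nodesPerRack {
		total += n
		if rack != failed {
			alive += n
		}
	}
	return alive >= quorum(total)
}

func main() {
	// Balanced three-node cluster: one member per rack.
	before := map[string]int{"rack-a": 1, "rack-b": 1, "rack-c": 1}
	// After the "add" step of a rebalance: rack-a temporarily holds two members.
	during := map[string]int{"rack-a": 2, "rack-b": 1, "rack-c": 1}

	fmt.Println(availableAfterRackFailure(before, "rack-a")) // true: 2 alive, quorum 2
	fmt.Println(availableAfterRackFailure(during, "rack-a")) // false: 2 alive, quorum 3
}
```

The same arithmetic covers the first case: once the failed node is removed, the cluster has two members and a quorum of two, so any further failure loses availability.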

Both availability issues stem from the member change implementation. To fix them, I suggest adding a "ReplaceNode" primitive to member change. It would write and then commit a single log entry that both removes one existing node and adds a new node.
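A minimal sketch of the membership bookkeeping such a primitive implies is shown below. The `ConfChange` type and its `Kind` strings here are hypothetical and simplified; they are not the existing etcd raft types, and the sketch deliberately leaves out how a replace entry would be replicated and made safe by the consensus protocol.

```go
package main

import (
	"fmt"
	"sort"
)

// ConfChange is a hypothetical, simplified configuration-change entry.
// "replace" stands in for the proposed ReplaceNode primitive.
type ConfChange struct {
	Kind   string // "add", "remove", or "replace"
	Add    uint64 // node ID to add (used by "add" and "replace")
	Remove uint64 // node ID to remove (used by "remove" and "replace")
}

// apply returns the membership after one committed entry.
// The input map is left unmodified.
func apply(members map[uint64]bool, cc ConfChange) map[uint64]bool {
	next := make(map[uint64]bool, len(members))
	for id := range members {
		next[id] = true
	}
	switch cc.Kind {
	case "add":
		next[cc.Add] = true
	case "remove":
		delete(next, cc.Remove)
	case "replace":
		// One entry: the member count never changes.
		delete(next, cc.Remove)
		next[cc.Add] = true
	}
	return next
}

// ids returns the member IDs in ascending order for stable printing.
func ids(members map[uint64]bool) []uint64 {
	out := make([]uint64, 0, len(members))
	for id := range members {
		out = append(out, id)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	members := map[uint64]bool{1: true, 2: true, 3: true}

	// Today: remove, then add. The cluster passes through a two-member state.
	gap := apply(members, ConfChange{Kind: "remove", Remove: 3})
	fmt.Println(ids(gap)) // [1 2]

	// Proposed: one entry swaps node 3 for node 4 with no intermediate state.
	swapped := apply(members, ConfChange{Kind: "replace", Remove: 3, Add: 4})
	fmt.Println(ids(swapped)) // [1 2 4]
}
```

With a single entry the cluster goes directly from {1, 2, 3} to {1, 2, 4}, so neither the two-member window nor the four-member window appears.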

raft: improve the availability related to member change #7625
