gossip: Can allocate same node ID to two different nodes in very unlikely startup race #15898

@a-robinson

While attempting to reproduce #15856, I decided to try spinning up 3 nodes at 3 different localhost addresses and repeatedly replace them, where "replace" means stop the node and restart it at the same address with a new data directory.

I replaced the third node many times (until it was running with node ID 12), then decided to replace the second node. However, I didn't notice that it hadn't come up properly (because node 1 held the only live replica of the range needed for node ID allocation), so I moved on to taking down node 1 as well. At the time I brought node 1 down, node 2 had received the cluster ID from node 1 but had not yet been allocated a node ID.

I made a mistake when bringing node 1 back up: it started with a new data directory and without joining either of the other two addresses, so it created a new cluster. It initialized the new cluster, then received a KV request from node 2 asking for a node ID and allocated it node ID 2. Node 2 quickly crashed with a raft "tocommit(n) is out of range [lastIndex(0)]" error, and I brought node 1 back up with its original data dir, but the damage was done. The new node 2 had the original cluster's ClusterID but a NodeID that was allocated by the second cluster.

That allows it to connect to the old cluster, but it makes for some strange behavior due to its incorrect node ID. I decided to switch the third process to using the original node 2's data directory, so now there are two nodes that think they have ID 2. The cluster is still working, although that seems like a fluke.

This isn't a particularly scary issue given how much work it took to create, but it felt worth writing up the race: a node can get its node ID from a different cluster than the one it got its cluster ID from. This is pretty closely related to #15801.
