Skip to content

Forbid peer to join cluster with URI that is already used#7375

Merged
timvisee merged 4 commits intodevfrom
consensus-forbid-join-duplicate-uri
Oct 13, 2025
Merged

Forbid peer to join cluster with URI that is already used#7375
timvisee merged 4 commits intodevfrom
consensus-forbid-join-duplicate-uri

Conversation

@timvisee
Copy link
Member

@timvisee timvisee commented Oct 9, 2025

Multiple peers joining with the same cluster URL result in a whole myriad of problems. We should therefore forbid that as a safety measure.

This changes our consensus logic to reject adding a peer with an URI that is already used by some other peer. The peer fails to join and crashes.

It is easy to trigger this in practice. You can wipe the storage directory. On restart the peer will try to join a second time with the same URL.

If something like this ever happens in our cloud, it means that the peer will start a crash loop. This is desired, so the issue gets attention from our support team. Their resolution options are to recover the broken node, or to remove the broken node from consensus first after which the peer can rejoin just fine.

Rejoining on an existing URI with the same peer ID is allowed.

A test is included to assert the behavior.


When this occurs, the leader reports the following:

2025-10-10T13:22:27.547922Z  WARN qdrant::consensus: Rejected peer 7566390713471351 to join consensus, URI is already registered by peer 5936314104998682 (http://127.0.0.1:10200/)
2025-10-10T13:22:27.547954Z  WARN qdrant::consensus: peer URI http://127.0.0.1:10200/ already used by peer 5936314104998682, remove it first or use a different URI

The peer that tries (and fails) to rejoin crashes and reports the following. It is not very nice, but I couldn't find a good way to report a proper message here. I don't think it's worth a lot of effort.

2025-10-10T13:22:27.545366Z DEBUG qdrant::consensus: Bootstrapping from peer with address: http://127.0.0.1:10000/
2025-10-10T13:22:37.885731Z ERROR qdrant::startup: Panic backtrace: 
   0: {closure#0}
             at /home/timvisee/git/qdrant/src/startup.rs:21:25
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/alloc/src/boxed.rs:1985:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/std/src/panicking.rs:841:13
   ...
  23: main

2025-10-10T13:22:37.885850Z ERROR qdrant::startup: Panic occurred in file src/main.rs at line 445: Can't initialize consensus: Failed to initialize Consensus for new Raft state: Failed to add peer to known: status: Internal, message: "Failed to add peer: Service internal error: Waiting for consensus operation commit failed. Timeout set at: 10 seconds", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 10 Oct 2025 13:22:36 GMT", "content-length": "0"} }
2025-10-10T13:22:37.886045Z TRACE reqwest::blocking::wait: (ThreadId(1)) park without timeout
2025-10-10T13:22:37.886640Z TRACE reqwest::blocking::client: (ThreadId(55)) start runtime::block_on
2025-10-10T13:22:37.911066Z TRACE reqwest::blocking::wait: wait at most 1s
2025-10-10T13:22:37.911090Z TRACE reqwest::blocking::wait: (ThreadId(1)) park timeout 999.99768ms
2025-10-10T13:22:37.911239Z DEBUG reqwest::connect: starting new connection: https://staging-telemetry.qdrant.io/
2025-10-10T13:22:38.038589Z TRACE reqwest::retry: shouldn't retry!
2025-10-10T13:22:38.038793Z TRACE reqwest::blocking::client: closing runtime thread (ThreadId(55))
2025-10-10T13:22:38.038822Z TRACE reqwest::blocking::client: signaled close for runtime thread (ThreadId(55))
2025-10-10T13:22:38.039138Z TRACE reqwest::blocking::client: (ThreadId(55)) Receiver is shutdown
2025-10-10T13:22:38.039258Z TRACE reqwest::blocking::client: (ThreadId(55)) end runtime::block_on
2025-10-10T13:22:38.039861Z TRACE reqwest::blocking::client: (ThreadId(55)) finished
2025-10-10T13:22:38.040125Z TRACE reqwest::blocking::client: closed runtime thread (ThreadId(55))
# exit code 101

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

@timvisee timvisee force-pushed the consensus-forbid-join-duplicate-uri branch from f2b2bf7 to 12e42af Compare October 10, 2025 11:40
@timvisee timvisee force-pushed the consensus-forbid-join-duplicate-uri branch from cf2616f to c18ed5e Compare October 10, 2025 13:18
@timvisee timvisee changed the title Draft: forbid peer to join cluster with URI that is already used Forbid peer to join cluster with URI that is already used Oct 10, 2025
@timvisee timvisee marked this pull request as ready for review October 10, 2025 13:39
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Oct 10, 2025
Copy link
Member

@agourlay agourlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test 👍

Not completely happy about the failure through panic but I guess it is acceptable in this rare case.

@timvisee timvisee merged commit b9243c5 into dev Oct 13, 2025
15 checks passed
@timvisee timvisee deleted the consensus-forbid-join-duplicate-uri branch October 13, 2025 10:32
timvisee added a commit that referenced this pull request Nov 14, 2025
* Disallow peer to join with URI that is already used

* Add test for rejecting peer join with duplicate URI

* Improve peer rejection logic

* Try to rejoin twice, we expect a consistent result
@timvisee timvisee mentioned this pull request Nov 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants