Skip to content

fix(#4274): getReplicaAddresses excludes localPeer instead of leader#4280

Merged
robfrank merged 3 commits into
mainfrom
fix/4274-cluster-topology-consistency
May 22, 2026
Merged

fix(#4274): getReplicaAddresses excludes localPeer instead of leader#4280
robfrank merged 3 commits into
mainfrom
fix/4274-cluster-topology-consistency

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4274

Summary

RaftHAServer.getReplicaAddresses() was filtering out localPeerId (the node that happens to serve the HTTP request) instead of the Raft leader. On a follower node this caused two anomalies:

  • The leader appeared inside replicaAddresses (it was not the local peer, so it was included).
  • The serving follower was missing from replicaAddresses (it was the local peer, so it was excluded).

The leader node produced the correct view only by coincidence, because localPeerId == leaderId there.

Fix: exclude the leader ID when known; fall back to localPeerId during elections (no leader elected yet) to preserve prior behavior.

Test plan

  • New regression test clusterTopologyIsConsistentAcrossNodes in RaftHTTP2ServersIT:
    • Queries every node's ?mode=cluster endpoint
    • Asserts leaderAddress is identical across all nodes
    • Asserts replicaAddresses does not contain the leaderAddress
    • Asserts replicaAddresses is identical across all nodes
  • All 6 tests in RaftHTTP2ServersIT pass locally

On follower nodes, getReplicaAddresses() was filtering out localPeerId
instead of the Raft leader. This caused the leader to appear inside
replicaAddresses and the serving follower to be missing from the list,
giving every node a different view of the cluster topology.

Fix: exclude the leader ID when known, fall back to localPeerId during
elections (preserves prior behavior when no leader is elected yet).
@codacy-production

codacy-production Bot commented May 21, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

🟢 Coverage 0.00% diff coverage · -7.39% coverage variation

Metric Results
Coverage variation -7.39% coverage variation
Diff coverage 0.00% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (6e4de2c) 127476 93362 73.24%
Head commit (86fb190) 159158 (+31682) 104810 (+11448) 65.85% (-7.39%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4280) 6 0 0.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a cluster topology inconsistency where follower nodes incorrectly included the leader in their replica list while excluding themselves. The fix updates getReplicaAddresses() in RaftHAServer to exclude the leader ID (falling back to the local peer ID if no leader is elected) instead of always excluding the local peer. A new regression test, clusterTopologyIsConsistentAcrossNodes, was added to verify that all nodes report a consistent view of the cluster. Feedback suggests that the getStats() method likely contains the same logic error and should be updated to maintain consistency across monitoring APIs.

@@ -681,9 +681,11 @@ public Map<String, Object> getStats() {
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The getStats() method (lines 655-681) appears to have the same logic issue as getReplicaAddresses(). It filters the replicas list using localPeerId instead of the leader ID. This means that on a follower node, the leader will be included in the replicas list, while the follower itself will be missing. To ensure consistency across all monitoring and status APIs, consider applying a similar fix to getStats() so that it also excludes the leader (or falls back to localPeerId if no leader is known).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Applied the same leaderId != null ? leaderId : localPeerId exclusion pattern to getStats() in the follow-up commit.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review - fix(#4274): getReplicaAddresses excludes localPeer instead of leader

Overview

The PR correctly identifies and fixes a real bug: getReplicaAddresses() was filtering out localPeerId (the request-serving node) instead of the Raft leader, causing follower nodes to return an inconsistent cluster topology view. The fix is minimal and well-reasoned.


Fix (RaftHAServer.java) - Correct

final RaftPeerId leaderId = getLeaderId();
final RaftPeerId excludeId = leaderId != null ? leaderId : localPeerId;

The logic is sound. The fallback to localPeerId when no leader is elected (during an election) is pragmatic - it preserves the prior behavior for a transient state, avoiding new failure modes. One minor suggestion: a short inline comment would make the fallback intent explicit for future readers:

// Fall back to localPeerId during elections (no leader yet) to preserve prior behaviour
final RaftPeerId excludeId = leaderId != null ? leaderId : localPeerId;

Test (RaftHTTP2ServersIT.java) - Mostly Good, One Fragility

The regression test covers the three key invariants well. Two concerns:

1. String-based replica set comparison is order-sensitive

assertThat(replicaAddresses).isEqualTo(firstReplicaAddresses);

getReplicaAddresses() builds the comma-separated list by iterating raftGroup.getPeers(). If the peer order happens to differ between nodes, this assertion would produce a false failure even when the replica set is identical. The fix should use set comparison:

assertThat(Set.of(replicaAddresses.split(",")))
    .as("server %d: replicaAddresses must be consistent across all nodes", i)
    .isEqualTo(Set.of(firstReplicaAddresses.split(",")));

2. Missing non-empty assertion

In a 2-node cluster there should always be exactly one replica address. A bug that makes replicaAddresses consistently empty on every node would pass all current assertions (empty never contains leaderAddress, and "".equals("")). Adding this would close that gap:

assertThat(replicaAddresses).as("server %d: replicaAddresses must not be empty", i).isNotEmpty();

Documentation File (docs/4274-cluster-topology-consistency.md) - Should Be Removed

The committed markdown file duplicates what is already in the PR description. Once this PR is merged the file will be disconnected from the issue context and will just drift out of date alongside the code. The root-cause analysis belongs in the PR body and commit message, not in version control. Please remove this file from the PR.


Port Pattern Inconsistency (Minor)

Existing tests in this file use a hardcoded port pattern ("http://127.0.0.1:248" + serverIndex), while the new test uses the dynamic getServer(i).getHttpServer().getPort(). The dynamic approach is strictly better (no reliance on fixed port numbering), so this is actually a good change - worth noting that the existing tests could be improved the same way in a follow-up.


Summary

Area Status
Root cause analysis Correct
Fix logic Correct
Fallback during elections Acceptable (comment recommended)
Regression test coverage Good, with two small gaps
Docs file in repo Should be removed

The core change is correct and the test is a good addition. Addressing the replicaAddresses string comparison fragility and removing the docs file are the two items worth fixing before merge.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review

The fix is correct. The root cause is clearly identified: getReplicaAddresses() and getStats() were excluding localPeerId (the serving node) instead of the Raft leader, producing an inconsistent topology view from follower nodes. The change is minimal and well-scoped.


RaftHAServer.java - Correct fix, comment recommended

final RaftPeerId excludeId = leaderId != null ? leaderId : localPeerId;

The fallback to localPeerId during elections is pragmatic. A short inline comment would help future readers understand why this is intentional:

// During elections (no leader yet), fall back to localPeerId to preserve prior behaviour
final RaftPeerId excludeId = leaderId != null ? leaderId : localPeerId;

Since the semantics of this method were apparently non-obvious (the bug proves it), a brief Javadoc on getReplicaAddresses() stating "Returns the HTTP addresses of all non-leader peers" would also help prevent a recurrence.


RaftHTTP2ServersIT.java - Two test robustness issues

1. doesNotContain uses substring matching

assertThat(replicaAddresses).doesNotContain(leaderAddress);

AssertJ's doesNotContain on a String checks for a substring, not a list element. If leaderAddress is "127.0.0.1:2480" and a replica is "127.0.0.1:24800", the substring "127.0.0.1:2480" appears inside "127.0.0.1:24800", producing a false failure. Use a split-and-check instead:

assertThat(List.of(replicaAddresses.split(",")))
    .as("server %d: leaderAddress must not appear in replicaAddresses", i)
    .doesNotContain(leaderAddress);

2. Ordering-sensitive equality check

assertThat(replicaAddresses).isEqualTo(firstReplicaAddresses);

getReplicaAddresses() iterates raftGroup.getPeers() - if peer ordering ever differs between nodes, this would give a false failure even when the set of replicas is correct. Use set comparison:

assertThat(Set.of(replicaAddresses.split(",")))
    .as("server %d: replicaAddresses must be consistent across all nodes", i)
    .isEqualTo(Set.of(firstReplicaAddresses.split(",")));

3. Missing non-empty assertion

In a 2-node cluster there is always exactly one replica. A bug that returns an empty replicaAddresses consistently from all nodes would pass all current assertions (empty string never contains anything, "".equals("") is true). Add:

assertThat(replicaAddresses)
    .as("server %d: replicaAddresses must not be empty", i)
    .isNotEmpty();

docs/4274-cluster-topology-consistency.md - Remove this file

This file is a process log of the development session (including notes about AI review cycles and "loop exited with timeout state"). It duplicates the PR description, will become disconnected from context after merge, and is not useful as product documentation. The root-cause analysis belongs in the PR body and commit message. Please remove it before merging.


Summary

Area Status
Root cause Correctly identified
Fix logic Correct
getStats() also fixed Good catch (same pattern)
Regression test - coverage Good
Regression test - doesNotContain fragility Fix needed
Regression test - ordering sensitivity Fix needed
Regression test - non-empty guard Add recommended
Docs file in repo Remove

The core change is solid. The three test robustness items and the docs file are the things worth addressing before merge.

@claude

claude Bot commented May 21, 2026

Copy link
Copy Markdown

Code Review for PR 4280. Overview: This PR fixes a real bug in RaftHAServer - both getReplicaAddresses() and getStats() were excluding localPeerId from the replica list, when they should exclude the leader. On follower nodes this produced an inverted view. The fix is correct. Issue 1 - docs/4274-cluster-topology-consistency.md should not be committed, it is an AI process artifact; remove before merging. Issue 2 - replicaAddresses string comparison is order-dependent, fine for 2 nodes but a set-based comparison is more robust for 3+. Issue 3 - election-period topology inconsistency is worth a comment. Positives: fix is minimal and consistent across both methods, stepDown() at line 784 correctly keeps using localPeerId, regression test has proper try/finally and AssertJ assertions. Summary: correct and ready to merge after removing the docs artifact file.

@codecov

codecov Bot commented May 21, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.15%. Comparing base (6e4de2c) to head (86fb190).

Files with missing lines Patch % Lines
...java/com/arcadedb/server/ha/raft/RaftHAServer.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4280      +/-   ##
============================================
- Coverage     64.16%   64.15%   -0.02%     
+ Complexity      445      444       -1     
============================================
  Files          1645     1645              
  Lines        127476   127480       +4     
  Branches      27324    27326       +2     
============================================
- Hits          81796    81781      -15     
- Misses        34145    34170      +25     
+ Partials      11535    11529       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit 8d16962 into main May 22, 2026
24 of 30 checks passed
@robfrank robfrank deleted the fix/4274-cluster-topology-consistency branch May 22, 2026 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/api/v1/server?mode=cluster returns inconsistent topology depending on which node serves the request

1 participant