Skip to content

fix(#4372): snapshot leaderServer to prevent null connectToServer in httpCommand retry#4378

Merged
robfrank merged 3 commits into
mainfrom
fix/4372-remotehttp-leader-snapshot
May 28, 2026
Merged

fix(#4372): snapshot leaderServer to prevent null connectToServer in httpCommand retry#4378
robfrank merged 3 commits into
mainfrom
fix/4372-remotehttp-leader-snapshot

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4372

RemoteHttpComponent.httpCommand read leaderServer twice in its retry path. The first read (pre-loop, line ~237) was already null-safe. The second read (retry catch block, line ~349) was not: after reloadClusterConfiguration() returns true, a concurrent call to the same method can set leaderServer = null before the second read executes. The result was connectToServer = null, which made the for-loop exit prematurely and abandon the request without attempting any available replica.

The fix snapshots leaderServer into a local variable once after reloadClusterConfiguration() completes, and adds an explicit null guard before assigning to connectToServer. When the snapshot is null, the code falls through to getNextReplicaAddress() so any available replica is tried instead of silently giving up. leaderServer, currentServer, and currentPort are also marked volatile so writes from one thread are immediately visible to concurrent readers.

Test plan

  • RemoteHttpComponentTest.httpCommandRetryWithReplicaWhenLeaderNulledConcurrently: overrides reloadClusterConfiguration() to return true with leaderServer = null and a working replica in the list. Configures NETWORK_SAME_SERVER_ERROR_RETRIES = 2 so maxRetry = 2. Connects to a closed port (primary) - triggers IOException - then verifies httpCommand retries with the replica and returns successfully instead of throwing.
  • Full network module test suite: 374 tests, 0 failures.

🤖 Generated with Claude Code

@codacy-production

codacy-production Bot commented May 27, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

🟢 Coverage 75.00% diff coverage · -7.46% coverage variation

Metric Results
Coverage variation -7.46% coverage variation
Diff coverage 75.00% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (3ad0c09) 127798 93872 73.45%
Head commit (f162bc9) 159477 (+31679) 105247 (+11375) 66.00% (-7.46%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4378) 4 3 75.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #4372 by resolving a null-read race condition on leaderServer in the retry path of RemoteHttpComponent.httpCommand. It introduces volatile fields and snapshots leaderServer to prevent premature loop exits when the leader is concurrently set to null. The feedback highlights critical concurrency issues that remain: updating currentServer and currentPort independently is not atomic and can lead to mismatched host/port states, and other fields like currentReplicaServerIndex and replicaServerList are still not thread-safe, which could cause runtime exceptions under concurrent access.

Comment on lines +82 to +83
protected volatile String currentServer;
protected volatile int currentPort;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While making currentServer and currentPort volatile ensures that updates to each individual field are visible to other threads, it does not guarantee atomicity across updates to both fields. Because they are updated independently (e.g., in reloadClusterConfiguration()), a concurrent reader thread calling getUrl() or getDatabasePath() could read a new currentServer with an old currentPort (or vice versa), leading to a mismatched host/port connection.\n\nTo prevent this split-state race condition, consider encapsulating the current server address in a single volatile reference, such as a Pair<String, Integer> or a dedicated HostAddress object, and updating/reading that reference atomically.

protected String currentServer;
protected int currentPort;
private volatile Pair<String, Integer> leaderServer;
private int currentReplicaServerIndex = -1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since this PR introduces volatile fields to support multi-threaded usage of RemoteHttpComponent, please note that currentReplicaServerIndex and replicaServerList are still not thread-safe.\n\nSpecifically:\n1. currentReplicaServerIndex is incremented and checked non-atomically in getNextReplicaAddress(). Under concurrent access, this can lead to race conditions where the index exceeds the bounds of replicaServerList, throwing an IndexOutOfBoundsException.\n2. replicaServerList is a standard ArrayList which is modified (cleared and populated) in requestClusterConfiguration() and read in getNextReplicaAddress(). Concurrent modification and traversal of an ArrayList without synchronization will lead to ConcurrentModificationException or undefined behavior.\n\nTo fully support concurrent usage, consider using thread-safe alternatives such as AtomicInteger for the index and a thread-safe list (e.g., CopyOnWriteArrayList or proper synchronization) for the replica list.

@claude

claude Bot commented May 27, 2026

Copy link
Copy Markdown

Code Review

The core fix is correct and clearly described. Snapshotting leaderServer into a local variable before the null-check prevents the TOCTOU race where a concurrent reloadClusterConfiguration() call nulls the field between the check and the use. The volatile additions ensure the write from one thread is immediately visible to others. Overall this is a well-targeted fix.

A few things worth addressing:


1. getNextReplicaAddress() can still return null when leader is null

getNextReplicaAddress() falls back to returning leaderServer when replicaServerList is empty (line ~511). If leaderServer is still null at that point, the else branch returns null just as the old bug did. The PR description says the code "falls through to getNextReplicaAddress() so any available replica is tried instead of abandoning the request" - this is only true when there IS at least one replica. When replicaServerList is empty AND leader is null, the loop still exits silently. This is a pre-existing issue, but it is worth noting to avoid confusion about what the fix guarantees.


2. Incomplete volatile coverage leaves other fields exposed

leaderServer, currentServer, and currentPort are now volatile, but other fields written and read on multiple threads are not:

  • currentReplicaServerIndex (line 80) - incremented non-atomically in getNextReplicaAddress() with no synchronization
  • replicaServerList (line 70) - ArrayList is not thread-safe; cleared and refilled in requestClusterConfiguration() while httpCommand iterates it
  • stickyTransactionServer (line 84) - written from setStickySessionServer() and read from getStickyPin()

These are pre-existing gaps, but the PR selectively marks only three fields volatile while the same threading concerns apply to the others. Either document why the others are safe (e.g. always accessed under the RWLockContext) or apply volatile consistently.


3. currentServer + currentPort atomicity window

Even with both fields volatile, reloadClusterConfiguration() writes them in two separate statements (lines 527-528):

currentServer = connectToServer.getFirst();
currentPort = connectToServer.getSecond();

A concurrent reader could observe the new host with the old port, or vice versa. Since leaderServer is already a single Pair<String, Integer> field and is updated atomically, consider whether currentServer/currentPort should be collapsed into a similar pair to eliminate this window.


4. Test: variable name so_far uses underscores

Java naming convention uses camelCase. The variable at line ~558 should be soFar.


5. Test: reflection to access replicaServerList is fragile

Using getDeclaredField("replicaServerList") breaks silently if the field is renamed. Since TestableRemoteHttpComponent already subclasses RemoteHttpComponent, consider exposing a protected method or adding a test-only constructor that accepts an initial replica list. This would also remove the need for @SuppressWarnings("unchecked").


6. Minor: redundant comment before the Javadoc

// Regression test for issue #4372: leaderServer null-read in httpCommand retry path

/**
 * When reloadClusterConfiguration() returns true but leaderServer is concurrently null ...
 */
@Test

The // Regression test... comment is redundant - the Javadoc directly below it covers the same ground. Remove one.


Summary

The root fix (snapshot + null guard) is correct and sufficient for the described race. The volatile changes improve visibility. The main follow-up to consider is the broader thread-safety consistency of the other mutable fields - particularly replicaServerList and currentReplicaServerIndex - and the currentServer/currentPort pair atomicity gap. The test covers the specific scenario well; the reflection-based setup is the biggest fragility concern there.

@claude

claude Bot commented May 27, 2026

Copy link
Copy Markdown

Code Review: fix(#4372) - snapshot leaderServer to prevent null connectToServer in httpCommand retry

Overview

This PR fixes a real concurrency bug in RemoteHttpComponent.httpCommand: during the retry path, leaderServer was read twice without snapshotting, allowing a concurrent call to reloadClusterConfiguration() (which temporarily sets leaderServer = null) to race in between the two reads. The result was connectToServer = null, which caused the for-loop to silently exit without trying any available replica.

The fix is correct and minimal. A few observations below.


What's Good

  • Root cause accurately diagnosed. The race window in reloadClusterConfiguration() is real: the field is intentionally set to null before requestClusterConfiguration() is called, and a concurrent reader in httpCommand can observe this transient null.
  • Snapshot + null-guard is the right pattern. Snapshotting leaderServer into a local after reloadClusterConfiguration() returns, then guarding on snapshotLeader != null before assigning, is safe and idiomatic.
  • Regression test is well-structured. The overridden reloadClusterConfiguration() accurately simulates the race (returns true but leaves leaderServer = null) while adding a working replica so the fixed code can fall through to getNextReplicaAddress().

Issues and Suggestions

1. Two docs files should not be committed (medium)

The PR adds docs/4372-remotehttp-leader-null-snapshot.md and docs/review-deferred-518a5ce3.md. These are working notes - the first duplicates the PR description and commit message; the second is an internal review-cycle artifact referencing a specific git SHA that will be meaningless to future readers. Both should be dropped from the commit. The PR body is the right place for this analysis.

2. getLeaderAddress() can still NPE (minor)

With leaderServer now volatile, a concurrent null write is immediately visible to callers of getLeaderAddress(), making an NPE slightly more likely to surface than before. This is pre-existing, but consider adding a null check or documenting that callers must ensure leaderServer is not null.

3. volatile alone does not cover the check-then-act in reloadClusterConfiguration() (informational)

The PR description correctly notes the class is "not thread safe". Just capturing it here: volatile gives visibility, not atomicity. The fix is correct within the documented single-caller assumption, but making only leaderServer volatile while leaving the rest of the state unguarded creates a partial-visibility situation worth tracking in a follow-up issue (as the deferred doc itself notes).

4. Test server thread has no socket timeout (minor)

If the client never completes the HTTP header (e.g., test is aborted), the server thread hangs indefinitely. A client.setSoTimeout(5000) after accept() would make cleanup more reliable in CI.

5. Test uses reflection to access replicaServerList (minor)

TestableRemoteHttpComponent already uses subclassing to override methods. A protected void addReplica(Pair<String, Integer> replica) test seam on the base class would be less brittle and more consistent with the existing pattern.

6. @SuppressWarnings("unchecked") scoped to entire test method (nit)

The unchecked cast is isolated to a single line. Narrowing the annotation scope would be cleaner.


Summary

The core fix is correct and well-targeted. The main action before merge is removing the two docs-directory markdown files (item 1). Everything else is minor. The regression test covers the scenario effectively.

@claude

claude Bot commented May 27, 2026

Copy link
Copy Markdown

Code Review - PR #4378

Fix: snapshot leaderServer to prevent null connectToServer in httpCommand retry

Overview

This PR correctly identifies and fixes a real TOCTOU race in RemoteHttpComponent.httpCommand. The reloadClusterConfiguration() method sets leaderServer = null temporarily (line ~543) before restoring it via requestClusterConfiguration(). A concurrent caller reading leaderServer in the retry path during that window would get null, assign it to connectToServer, and cause the for-loop to exit silently due to the connectToServer != null guard.


What's Good

  • Root cause analysis is accurate. The volatile + snapshot approach is the correct fix for this class of visibility/TOCTOU bug.
  • volatile scope is well-reasoned. The PR correctly avoids making currentServer/currentPort volatile without pair-atomicity (which would create false confidence of correctness), and limits the change to just leaderServer where it's actually needed.
  • Graceful fallback. When snapshotLeader is null, falling through to getNextReplicaAddress() is the right behavior - it tries available replicas instead of abandoning the request silently.
  • Test covers the regression path. The test constructs the exact race scenario (null leader, available replica) and verifies the fixed code returns successfully.

Issues and Suggestions

1. Documentation files do not belong in docs/ (high priority)

docs/4372-remotehttp-leader-null-snapshot.md and docs/review-deferred-518a5ce3.md appear to be AI-assisted review artifacts - they document PR review cycles, gemini-code-assist comments, and interim commit SHAs. These are not production documentation. They will become stale, mislead future contributors, and clutter the repository. Please remove both files; the relevant context belongs in the PR description (which already covers it well) and in a follow-up GitHub issue for the deferred thread-safety items.

2. Test uses reflection to access replicaServerList (medium priority)

final Field f = RemoteHttpComponent.class.getDeclaredField("replicaServerList");
f.setAccessible(true);
((List<Pair<String, Integer>>) f.get(this)).add(new Pair<>("127.0.0.1", replicaPort));

This is fragile - a field rename breaks the test silently at runtime rather than compile time. A package-private or protected setter/accessor on RemoteHttpComponent for test use (e.g., void addReplicaForTest(Pair<String,Integer> replica)) would be more robust and self-documenting. Alternatively, since TestableRemoteHttpComponent already subclasses RemoteHttpComponent, the subclass could expose replicaServerList directly if it's promoted to protected.

3. Test contains em dashes in comments (low - style)

Per CLAUDE.md: "Never use the em dash character () in responses." The test has:

// Primary server: closed port — triggers ConnectException (IOException) on iteration 0
// With the fix: httpCommand snapshots leaderServer, detects null, falls through

Replace the with a comma or -.

4. Raw HTTP mock in test is fragile (low)

The test stands up a raw TCP socket that manually reads HTTP headers and returns HTTP/1.1 200. The HttpClient is constructed with HttpClient.Version.HTTP_2. While the JDK client will fall back to HTTP/1.1 when the server doesn't advertise HTTP/2, this is an undocumented behavior that could change. The test also handles only one connection. Consider noting HttpClient.Version.HTTP_1_1 explicitly in the test's mock or in a comment explaining the fallback assumption.

5. Missing @Tag("slow") consideration

The test binds real OS sockets. It's fast, but worth confirming it runs cleanly in the module's CI suite on all platforms (the ServerSocket(0) port allocation is fine; just noting the socket teardown path is guarded by daemon thread + try-with-resources, which looks correct).


Correctness Verification

The fix is logically correct:

final Pair<String, Integer> snapshotLeader = leaderServer; // volatile read, single snapshot

if (leaderIsPreferable && snapshotLeader != null && !currentConnectToServer.equals(snapshotLeader)) {
    connectToServer = snapshotLeader;
} else
    connectToServer = getNextReplicaAddress();
  • If snapshotLeader != null and differs from current: switch to leader. Correct.
  • If snapshotLeader == null (the race): fall to getNextReplicaAddress(). Correct.
  • If getNextReplicaAddress() also returns null: connectToServer = null, loop exits, RemoteException is thrown. Correct - no server is available.

Summary

The core fix is correct and well-scoped. The two blocking items before merge are:

  1. Remove docs/4372-remotehttp-leader-null-snapshot.md and docs/review-deferred-518a5ce3.md - these are review artifacts, not repository documentation.
  2. Consider replacing reflection in the test with a protected accessor to avoid fragility.

The deferred thread-safety items (currentReplicaServerIndex as AtomicInteger, replicaServerList as CopyOnWriteArrayList) are correctly identified and appropriately scoped out - please do file a follow-up issue for those.

@codecov

codecov Bot commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.32%. Comparing base (3ad0c09) to head (f162bc9).

Files with missing lines Patch % Lines
.../java/com/arcadedb/remote/RemoteHttpComponent.java 50.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4378      +/-   ##
============================================
- Coverage     64.37%   64.32%   -0.06%     
+ Complexity      447      428      -19     
============================================
  Files          1647     1647              
  Lines        127798   127799       +1     
  Branches      27395    27395              
============================================
- Hits          82276    82209      -67     
- Misses        33957    34052      +95     
+ Partials      11565    11538      -27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit a9081c8 into main May 28, 2026
24 of 30 checks passed
@robfrank robfrank deleted the fix/4372-remotehttp-leader-snapshot branch May 28, 2026 07:28
@robfrank robfrank added this to the 26.6.1 milestone May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RemoteHttpComponent mutates leaderServer / currentServer non-atomically

1 participant